DocuMine Documentation

Component mapping

Component mappings allow you to map extracted entities to specific values or descriptions, ensuring data consistency and accuracy in reports and exports. They are particularly useful when your use case requires extensive transformations, such as replacing internal codes with more descriptive names across large datasets. By defining component mappings, you can automate these transformations and avoid having to hardcode every possible value in your component rules.

You will need to follow a series of steps to make your mappings work. This chapter will walk you through a relevant example that demonstrates the entire process—from setting up the environment to uploading the CSV file and configuring the respective component rule.

In the example below, we map project numbers and years to descriptions that are used in exports and reports. This illustrates how component mappings replace technical data with more meaningful information, so that the extracted components and reports are easier to understand.

Set up environment

The imports and global variables detailed below are essential for enabling the use of component mappings within your component rules.

import com.iqser.red.service.redaction.v1.server.service.components.ComponentMappingService;
global ComponentMappingService componentMappingService;

Insert the lines detailed above into the imports section of the component rule file, which is located at the beginning of the file, above the queries section. You can add them anywhere within the imports section. Once added, ensure you press “Save changes” to confirm the update.

[Figure: Imports and global variables]

Mapping example

In this example, project numbers and years are mapped to specific descriptions for inclusion in exports and reports. The goal is to create a mapping that applies only when the text contains a particular project number and year.

Scenario:

The scenario involves documents that reference Test Guidelines for Environmental Impact Assessment. These guidelines are cited in various formats throughout the document, such as:

  • Test Guideline 425

  • 425 Test Guideline (2017)

  • 2017 - Test Guideline 425

The goal is to standardize the output of these references, ensuring a consistent citation format in the exported reports. This is done by mapping the guideline number and publication year to a normalized reference format: Guideline number, title, date.

Entities for mapping

Two entities representing the relevant guideline references are extracted from the documents:

  • Project Guideline Number

  • Project Guideline Year

Upload CSV file

Prepare a CSV file containing your mappings. The file should map each unique combination of guideline number and year to the appropriate standardized description.

File requirements:

  • Supported encoding: UTF-8, UTF-16, UTF_16BE, UTF_16LE, ISO-8859-1, US-ASCII; Default: UTF-8

  • Delimiter: can be chosen freely; Default is “,” (comma)

The file content might look as follows:

number,year,description
402,1987,Nº 402: Environmental Impact Assessment (24/02/1987)
402,2017,Nº 402: Environmental Impact Assessment (09/10/2017)
403,1981,Nº 403: Renewable Energy Report (12/05/1981)

The number and year columns together determine the description. For example, project number 402 and year 1987 are mapped to the description "Nº 402: Environmental Impact Assessment (24/02/1987)".

This mapping ensures that each unique combination of project number and year is associated with the correct description. The extracted component will then correspond to this description.

If columns in the CSV file contain commas that should appear in the final output, we recommend using semicolons as the delimiter and configuring the mapping accordingly during the upload. This ensures that commas within the column values are displayed correctly and are not misinterpreted as field separators, which preserves the integrity of the CSV structure:

number;year;description
403;1981;Nr 403: Renewable Energy Report, (12/05/1981)
425;2008;Nr 403: Renewable Energy Report, (12/05/1981)
207;1984;Nr 403: Renewable Energy Report, (12/05/1981)
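As a rough illustration of why the delimiter matters, the following Java sketch splits a mapping line on a configurable delimiter. The class and method names are hypothetical and not part of DocuMine, and the actual upload parser may behave differently (for example, it may support quoted fields).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DocuMine: naive split without quoted-field support.
class DelimiterSketch {

    static List<String> splitLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        String line = "403;1981;Nr 403: Renewable Energy Report, (12/05/1981)";
        // Semicolon delimiter: the comma stays inside the description.
        System.out.println(splitLine(line, ";").size()); // 3
        // Same content with comma as delimiter: the description is cut in two.
        System.out.println(splitLine(line.replace(';', ','), ",").size()); // 4
    }
}
```

With the semicolon delimiter the line yields the three expected fields, while the comma variant produces a fourth, unintended field.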

To upload the CSV file, go to User menu > Settings > Dossier Template > Component Mapping. Create a new mapping as the basis for your upload. For detailed information on how to proceed, please see Component mapping (admin manual).

For this example, we have named the mapping GuidelineMapping and will refer to it by this name in the rule.

[Figure: Component mapping in the "Component mappings" tab of the dossier template]

Enter component rule

Go to the component rule editor and enter the respective component rule referencing the component mapping.

Component mapping rule:

rule "TestGuideline.0.0: match project number and year with guideline mappings"
    salience 1
    when
        $guidelineNumber: Entity(type == "project_guideline_number", $number: value)
        $guidelineYear: Entity(type == "project_guideline_year", $year: value)
    then
        componentMappingService.from("GuidelineMapping")
            .where("number = " + $number)
            .where("year = " + $year)
            .select("description")
            .findAny()
            .ifPresent(description ->
                componentCreationService.create(
                    "TestGuideline.0.0",
                    "Test_Guidelines",
                    description,
                    "Project Number and guideline year mapped!",
                    List.of($guidelineNumber, $guidelineYear)
                )
            );
    end

Rule explanation:

The component rule identifies entities of type project_guideline_number and project_guideline_year.

Upon locating these entities, it references the CSV Mapping file (GuidelineMapping) to find matching entries (see where clauses) and selects the corresponding value from the “description” field. The retrieved data is then used to update the value for the TEST_GUIDELINES component.
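To make the lookup in the rule easier to follow, here is a minimal Java sketch of the same idea on in-memory data. It is not the ComponentMappingService implementation: the class GuidelineLookupSketch and the method findDescription are hypothetical, and the rows are simply the example CSV from above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the lookup; not the ComponentMappingService implementation.
class GuidelineLookupSketch {

    // Rows of the example CSV, keyed by column label.
    // The unicode escape stands for the ordinal sign used in the example CSV.
    static final List<Map<String, String>> ROWS = List.of(
        Map.of("number", "402", "year", "1987",
               "description", "N\u00ba 402: Environmental Impact Assessment (24/02/1987)"),
        Map.of("number", "402", "year", "2017",
               "description", "N\u00ba 402: Environmental Impact Assessment (09/10/2017)"),
        Map.of("number", "403", "year", "1981",
               "description", "N\u00ba 403: Renewable Energy Report (12/05/1981)"));

    // Similar in spirit to two where clauses, a select on "description", and findAny.
    static Optional<String> findDescription(String number, String year) {
        return ROWS.stream()
            .filter(row -> row.get("number").equals(number))
            .filter(row -> row.get("year").equals(year))
            .map(row -> row.get("description"))
            .findAny();
    }

    public static void main(String[] args) {
        // A matching row exists, so a description is printed.
        findDescription("402", "2017").ifPresent(System.out::println);
        // No row for 425/2017: the Optional is empty and nothing would be created.
        System.out.println(findDescription("425", "2017").isPresent()); // false
    }
}
```

Because the result is an Optional, a combination without a mapping entry leaves the component uncreated, which is also how the ifPresent call in the rule behaves.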

Querying data:

A simple query string could look like this:

number = 10

In this case:

  • number: must match one of the column labels

  • =: represents the operator

  • 10: is the value being queried.

Please note:

  • Strings containing whitespace must be enclosed in single quotes. For example:

    name = 'John Doe'

  • If the query contains an apostrophe ('), it must be escaped using a backslash (\). For example:

    name = Peter\\'s

  • The escape character \ needs to be escaped as well, which is why the example above shows two backslashes.
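The quoting and escaping notes above can be sketched as a small Java helper that prepares a runtime value before it is concatenated into a query string. The class QueryValueSketch and the method toQueryValue are hypothetical and not part of DocuMine. The sketch assumes that the query parser treats \ as its escape character, which is one reading of the notes above.

```java
// Hypothetical helper, not part of DocuMine: prepares a runtime value for a query string.
class QueryValueSketch {

    static String toQueryValue(String raw) {
        // Escape the escape character first, then apostrophes.
        String escaped = raw.replace("\\", "\\\\").replace("'", "\\'");
        // Values containing whitespace are wrapped in single quotes.
        boolean hasWhitespace = escaped.chars().anyMatch(Character::isWhitespace);
        return hasWhitespace ? "'" + escaped + "'" : escaped;
    }

    public static void main(String[] args) {
        System.out.println("name = " + toQueryValue("John Doe")); // name = 'John Doe'
        System.out.println("name = " + toQueryValue("Peter's"));  // name = Peter\'s
    }
}
```

Note that the second output contains a single backslash at runtime. When the same query is written as a literal in the rule source, the backslash has to be doubled.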

Upload test file(s)

Once the rule is in place, upload one or more test files to check if the mapping works correctly.

[Figure: Mapping result]

The annotations in the document and the annotations list show that the Project Guideline Number and Project Guideline Year entities have been extracted. In the components table on the left, these values are replaced by the final output value defined in the mapping CSV.

Stop when first match is found

Documents may contain multiple references to project numbers and years. If you want the rule to stop processing after finding the first match, you can modify the rule to include an early exit condition.

Add the not Component(name == "Test_Guidelines") condition to the rule. It ensures that the rule no longer fires once the component has been created, so processing stops after the first match.

Adjusted component rule:

rule "TestGuideline.0.0: match project number and year with guideline mappings"
    salience 1
    when
        // Skip if the component already exists, so that only the first match is mapped
        not Component(name == "Test_Guidelines")
        $guidelineNumber: Entity(type == "project_guideline_number", $number: value)
        $guidelineYear: Entity(type == "project_guideline_year", $year: value)
    then
        componentMappingService.from("GuidelineMapping")
            .where("number = " + $number)
            .where("year = " + $year)
            .select("description")
            .findAny()
            .ifPresent(description ->
                componentCreationService.create(
                    "TestGuideline.0.0",
                    "Test_Guidelines",
                    description,
                    "Project Number and guideline year mapped!",
                    List.of($guidelineNumber, $guidelineYear)
                )
            );
    end
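
The effect of the early exit condition can be illustrated with a small Java simulation. It is only a sketch: the class FirstMatchSketch is hypothetical, it assumes that every number and year combination has a mapping entry, and it does not model the rule engine's actual evaluation order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; the rule engine's actual evaluation order is not modeled.
class FirstMatchSketch {

    // Runs the "then" part once per (number, year) combination, optionally guarded.
    static List<String> createComponents(List<String> numbers, List<String> years,
                                         boolean stopAfterFirst) {
        List<String> components = new ArrayList<>();
        for (String number : numbers) {
            for (String year : years) {
                // Mirrors the guard: not Component(name == "Test_Guidelines")
                if (stopAfterFirst && !components.isEmpty()) {
                    continue;
                }
                components.add(number + "/" + year);
            }
        }
        return components;
    }

    public static void main(String[] args) {
        List<String> numbers = List.of("402", "403");
        List<String> years = List.of("1987", "2017");
        System.out.println(createComponents(numbers, years, false).size()); // 4
        System.out.println(createComponents(numbers, years, true).size());  // 1
    }
}
```

Without the guard, each combination produces its own component. With the guard, only the first one does.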

Performance best practices

To optimize performance when querying data from a CSV file, always aim to query the first column of the CSV first. This significantly speeds up the lookups.

The rule engine typically tries every possible combination of conditions, which can result in unnecessary processing. To avoid this, you can use an early existence check in the "when" block. This reduces the number of combinations the engine needs to process.

For instance, the existsByFirstColumn($number) method ensures that the first column (the "number" column) is queried first. This check is performed before evaluating the rest of the rule, ensuring the engine only processes relevant matches.

rule "TestGuideline.0.0: match project number and year with guideline mappings"
    salience 1
    when
        $guidelineNumber: Entity(type == "project_guideline_number", $number: value)
        eval(componentMappingService.from("GuidelineMapping").existsByFirstColumn($number))
        $guidelineYear: Entity(type == "project_guideline_year", $year: value)
    then
        componentMappingService.from("GuidelineMapping")
            .where("number = " + $number)
            .where("year = " + $year)
            .select("description")
            .findAny()
            .ifPresent(description ->
                componentCreationService.create(
                    "TestGuideline.0.0",
                    "Test_Guidelines",
                    description,
                    "Project Number and guideline year mapped!",
                    List.of($guidelineNumber, $guidelineYear)
                )
            );
    end
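
One plausible reason why a first-column check is cheap is an index keyed by that column. The following Java sketch shows the idea. It is an assumption for illustration only and not DocuMine's implementation: the class FirstColumnIndexSketch and the method candidateCount are hypothetical, and only the name existsByFirstColumn is taken from the rule above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a first-column index; not DocuMine's implementation.
class FirstColumnIndexSketch {

    private final Map<String, List<String[]>> rowsByFirstColumn = new HashMap<>();

    FirstColumnIndexSketch(List<String[]> rows) {
        // Group all rows by the value in their first column.
        for (String[] row : rows) {
            rowsByFirstColumn.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row);
        }
    }

    // A single hash lookup instead of a scan over all rows.
    boolean existsByFirstColumn(String value) {
        return rowsByFirstColumn.containsKey(value);
    }

    // Number of rows that still need to be compared for a given first-column value.
    int candidateCount(String value) {
        return rowsByFirstColumn.getOrDefault(value, List.of()).size();
    }
}
```

With such an index, a number that does not occur in the mapping is rejected immediately, before any year entity is combined with it.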

Further query operators

In addition to the standard equality operator (=), several other operators are available for querying and mapping data in your component rules. These operators allow you to handle more complex conditions, perform pattern matching, and even apply fuzzy logic where necessary.

Available operators:

  • matches

    Use this operator to perform a pattern match using regular expressions (RegEx). It is useful for cases where you want to handle variations in data formatting or matching multiple patterns in a string. You can separate different values using the pipe symbol (|) to perform an "OR" match across multiple values.

    Example with RegEx:

    rule "TestGuideline.0.0: match project number and year with guideline mappings"
        salience 1
        when
            $guidelineNumber: Entity(type == "project_guideline_number", $number: value)
        then
            componentMappingService.from("GuidelineMapping")
                .where("number = " + $number)
                .where("year matches 20\\\\d{2}")
                .select("description")
                .findAny()
                .ifPresent(description ->
                    componentCreationService.create(
                        "TestGuideline.0.0",
                        "Test_Guidelines",
                        description,
                        "Project Numbers with years after 2000 mapped.",
                        List.of($guidelineNumber)
                    )
                );
        end
    

    Due to Java's string handling, the escape character \ needs to be escaped itself, leading to four escape characters in this case.

    Example for OR operation:

    rule "TestGuideline.0.0: match project number and year with guideline mappings"
        salience 1
        when
            $guidelineNumber: Entity(type == "project_guideline_number", $number: value)
            $guidelineYear: Entity(type == "project_guideline_year", $year: value)
            // A second binding needs its own name; reusing $guidelineYear would clash
            $guidelineYear2: Entity(type == "project_guideline_year2", $year2: value)
        then
            componentMappingService.from("GuidelineMapping")
                .where("number = " + $number)
                .where("year matches " + $year + "|" + $year2)
                .select("description")
                .findAny()
                .ifPresent(description ->
                    componentCreationService.create(
                        "TestGuideline.0.0",
                        "Test_Guidelines",
                        description,
                        "Project Number and guideline year mapped!",
                        List.of($guidelineNumber, $guidelineYear)
                    )
                );
        end
    
  • soundslike

    Use this operator to verify whether a word sounds similar to a given value, based on English pronunciation. It uses the Soundex algorithm, which maps each letter of a word to a sound code for fuzzy matching. This is especially useful for handling misspellings or alternate spellings that still sound alike.

    Example:

    // Match firstName "Jon" or "John":
    /persons[ firstName soundslike "John" ]

  • contains

    Use this operator to check whether a field that is a collection or array contains the specified value.

    name contains 'John Doe'
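
The backslash count and the OR behaviour described for the matches operator can be checked with plain Java regular expressions. The sketch below is only an illustration: the class MatchesSketch is hypothetical, the unescape step is an assumption about how the query parser treats doubled backslashes, and it assumes that the whole cell value must match the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the unescape step is an assumption about the query parser.
class MatchesSketch {

    // Assumed query-level unescaping: a doubled backslash becomes a single one.
    static String unescapeQueryValue(String queryValue) {
        return queryValue.replace("\\\\", "\\");
    }

    // Assumes the whole cell value has to match the pattern.
    static boolean matches(String queryValue, String cell) {
        return Pattern.matches(unescapeQueryValue(queryValue), cell);
    }

    public static void main(String[] args) {
        // Four backslashes in the source code are two backslashes in the string itself.
        String inRuleSource = "20\\\\d{2}";
        System.out.println(inRuleSource.length());           // 8
        System.out.println(matches(inRuleSource, "2017"));   // true
        System.out.println(matches(inRuleSource, "1987"));   // false
        // The pipe symbol acts as OR between two values.
        System.out.println(matches("1987|2017", "1987"));    // true
    }
}
```

The first level of escaping is consumed by the Java compiler and the second by the assumed query parser, which leaves the regular expression 20\d{2}.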