Extract with a RegEx
The following rule aims to identify and extract numbers in a predefined format from paragraphs containing a particular term.
It targets paragraphs that contain a particular string (here: "Rev") and in which a "glp_reference" entity occurs; using a regular expression, it identifies revision numbers that adhere to a predefined format and creates "glp_revision" entities for them.
You need a representation of the entity within DocuMine that your rule can refer to. For further information, please see Create entity. In the given example, the entity is called "GLP Revision"; you need the entity’s technical name to write the respective extraction rule (here: glp_revision).
Code example:
    rule "T.6.0"
        when
            $paragraph: Paragraph(containsString("Rev"))
            $entity: TextEntity(
                type=="glp_reference" && occursInNode($paragraph)
            )
        then
            entityCreationService
                .byRegex(
                    "Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}",
                    "glp_revision",
                    EntityType.ENTITY,
                    $paragraph
                )
                .forEach(entity ->
                    entity.apply("T.6.0", "GLP Rev. found.")
                );
    end
The following provides a detailed breakdown of the rule syntax:
Syntax | Explanation |
---|---|
rule "T.6.0" | Name of the rule. Each rule must have a unique name. For further information, please see Rule naming. |
$paragraph: Paragraph(containsString("Rev")) | Filters for paragraphs that contain the string "Rev". |
$entity: TextEntity(type=="glp_reference" && occursInNode($paragraph)) | Filters for entities of the type "glp_reference" that occur in the matched paragraph. |
entityCreationService | Invokes the class responsible for creating entities. |
.byRegex("Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}", "glp_revision", EntityType.ENTITY, $paragraph) | Invokes a method that searches the paragraph for text matching the regular expression and creates a "glp_revision" entity for each match. |
.forEach(entity -> entity.apply("T.6.0", "GLP Rev. found.")) | Applies the "T.6.0" identifier and the message "GLP Rev. found." to each entity created. |
Notice
For further information about the methods listed in the table, please refer to the Javadoc.
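To check which strings the regular expression from the rule accepts before deploying it, the pattern can be tried out in plain Java. The following is a minimal sketch that uses only java.util.regex from the standard library, not the DocuMine API; the class name GlpRevisionRegexDemo, the helper findRevisions, and the sample sentences are hypothetical and serve illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; it is not part of DocuMine.
class GlpRevisionRegexDemo {

    // Same pattern string as in rule "T.6.0":
    //   Rev\.            the literal text "Rev."
    //   \s               one whitespace character
    //   [\d]{1,3}        a revision number with 1 to 3 digits
    //   ,\s              a comma followed by one whitespace character
    //   [a-zA-Z]{3,20}   a month name with 3 to 20 letters
    //   \/\d{4}          a slash followed by a four-digit year
    static final Pattern GLP_REVISION =
            Pattern.compile("Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}");

    // Returns every substring of the given text that matches the pattern.
    static List<String> findRevisions(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = GLP_REVISION.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentences.
        // prints [Rev. 12, March/2021]
        System.out.println(findRevisions("GLP statement, Rev. 12, March/2021"));
        // prints [] because the dot after "Rev" and the slash are missing
        System.out.println(findRevisions("GLP statement, Rev 12, March 2021"));
    }
}
```

Note that the pattern is strict about punctuation and spacing: a missing dot after "Rev", more than one space, or a revision number with more than three digits prevents a match, so no entity would be created for such text.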