Extract paragraph with specific word and headline
The following rule identifies paragraphs whose headline contains a particular word (here: "references") and whose body contains a particular string (here: "GLP"). For each matching paragraph, it creates a "glp_reference" entity from the paragraph's semantic node.
The rule needs an entity in DocuMine that it can refer to. For further information, please see Create entity. In this example, the entity is called "GLP References"; the extraction rule refers to it by its technical name (here: glp_reference).
Code example:
    rule "T.2.0"
        when
            $paragraph: Paragraph(
                getHeadline().containsStringIgnoreCase("references")
                && containsString("GLP")
            )
        then
            entityCreationService
                .bySemanticNode($paragraph, "glp_reference", EntityType.ENTITY)
                .ifPresent(entity -> entity.apply("T.2.0", "Reference paragraph found."));
    end
The following provides a detailed breakdown of the rule syntax:
Syntax | Explanation |
---|---|
rule "T.2.0" | Name of the rule. Each rule must have a unique name. For further information, please see Rule naming. |
$paragraph: Paragraph(getHeadline().containsStringIgnoreCase("references") | Filters for paragraph elements whose headline contains the word "references", regardless of capitalization (case-insensitive). |
&& containsString("GLP") | Adds a second condition: the paragraph must also contain the string "GLP". |
entityCreationService | Invokes the class responsible for creating entities. |
.bySemanticNode($paragraph,"glp_reference", EntityType.ENTITY) | Invokes the “bySemanticNode” method to create an entity named "glp_reference" containing the provided paragraph. |
.ifPresent(entity -> entity.apply("T.2.0","Reference paragraph found.")) | If an entity was created, applies the "T.2.0" identifier and the message "Reference paragraph found." to it. |
Notice
For further information about the methods listed in the table, please refer to the Javadoc.
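
To make the two filter conditions easier to follow, here is a minimal, standalone Java sketch of the same matching logic. It does not use the DocuMine API: the class ParagraphFilterSketch and its matches method are hypothetical stand-ins. It also assumes that containsString is case-sensitive, in contrast to containsStringIgnoreCase; please check the Javadoc to confirm this.

```java
import java.util.Locale;

public class ParagraphFilterSketch {

    // Hypothetical stand-in for the rule's "when" condition; not part of DocuMine.
    static boolean matches(String headline, String body) {
        // Mirrors getHeadline().containsStringIgnoreCase("references"):
        // a substring check that ignores capitalization.
        boolean headlineMatches =
                headline.toLowerCase(Locale.ROOT).contains("references");

        // Mirrors containsString("GLP"); assumed here to be case-sensitive.
        boolean bodyMatches = body.contains("GLP");

        return headlineMatches && bodyMatches;
    }

    public static void main(String[] args) {
        // Each entry is {headline, body}.
        String[][] paragraphs = {
            {"5. REFERENCES", "OECD Principles of GLP, 1998."},
            {"References", "Internal report, no compliance statement."},
            {"Methods", "The study followed GLP."},
            {"references", "Conducted under glp conditions."}
        };

        for (String[] p : paragraphs) {
            if (matches(p[0], p[1])) {
                // In the actual rule, this is where the entity would be created.
                System.out.println("Match: " + p[0]);
            }
        }
    }
}
```

Under these assumptions, only the first paragraph matches: the second lacks "GLP", the third has no "references" headline, and the fourth contains "glp" only in lowercase.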