DocuMine Documentation

Extract with a RegEx

The following rule aims to identify and extract numbers in a predefined format from paragraphs containing a particular term.

It targets paragraphs that contain a particular string (here: “Rev”) and in which a “glp_reference” entity occurs; using a regular expression, it creates “glp_revision” entities for all numbers that adhere to the predefined format.

You need a representation of the entity within DocuMine that your rule can refer to. For further information, please see Create entity. In the given example, the entity is called "GLP Revision"; you need the entity’s technical name to write the respective extraction rule (here: glp_revision).

Code example:

rule "T.6.0"
    when
        $paragraph: Paragraph(containsString("Rev"))
        $entity: TextEntity(
            type=="glp_reference"
            && occursInNode($paragraph)
        )
    then
        entityCreationService
            .byRegex(
                "Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}",
                "glp_revision",
                EntityType.ENTITY,
                $paragraph
            )
            .forEach(entity -> entity.apply("T.6.0", "GLP Rev. found."));
    end
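
To check which strings the regular expression picks up, the pattern can be tried out in plain Java outside DocuMine. The following is a minimal sketch, not part of the rule: the class name RevisionRegexDemo, the helper findRevisions, and the sample text are hypothetical, and it assumes that byRegex interprets the pattern with standard Java regular expression syntax. Only the pattern itself is taken from the rule above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RevisionRegexDemo {

    // Same pattern as in the rule: "Rev." + whitespace + 1 to 3 digits + "," +
    // whitespace + 3 to 20 letters + "/" + 4 digits.
    static final Pattern GLP_REVISION =
            Pattern.compile("Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}");

    // Hypothetical helper: returns every substring of the text that matches the pattern.
    static List<String> findRevisions(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = GLP_REVISION.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample paragraph; "Rev 3" lacks the dot and is not matched.
        String paragraph = "OECD Principles of GLP, Rev. 12, March/2021, and Rev 3 of the SOP.";
        System.out.println(findRevisions(paragraph)); // prints [Rev. 12, March/2021]
    }
}
```

Note that the pattern is strict about punctuation and spacing: a revision written without the dot after "Rev", with more than three digits, or with a numeric month would not be extracted.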

The following provides a detailed breakdown of the rule syntax:

Syntax

Explanation

rule "T.6.0"

Name of the rule

Each rule must have a unique name. For further information, please see Rule naming.

$paragraph: Paragraph(containsString("Rev"))

Filters for paragraphs that contain the string "Rev".

$entity: TextEntity(type=="glp_reference" && occursInNode($paragraph))

Requires that an entity of the type "glp_reference" occurs in the matched paragraph.

entityCreationService

Invokes the class responsible for creating entities.

.byRegex("Rev\\.\\s[\\d]{1,3},\\s[a-zA-Z]{3,20}\\/\\d{4}", "glp_revision", EntityType.ENTITY, $paragraph)

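Invokes a method that creates a “glp_revision” entity for each piece of text in the paragraph that matches the regular expression. The pattern matches "Rev." followed by a whitespace character, one to three digits, a comma, a whitespace character, 3 to 20 letters, a slash, and four digits.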

.forEach(entity -> entity.apply("T.6.0", "GLP Rev. found."))

Applies the "T.6.0" identifier and the message "GLP Rev. found." to each entity created.

Notice

For further information about the methods listed in the table, please refer to the Javadoc.