Gene Regulation Network (GRN)
Goals in IE
Assess the performance of information extraction systems at extracting a genic regulation network.
Motivation in biology
The gene regulation network task evaluates the quality of gene interaction extraction by IE systems with respect to the goals in biology. The automatic construction of gene regulation networks is one of the main challenges in biology, because it is a crucial step towards understanding the cellular regulation system.
The goal is to retrieve all the genic interactions of the reference network (at least one occurrence per interaction), independently of where they are mentioned in the literature.
Compared to the BI task of BioNLP-ST 2011, the evaluation will measure the capability of the IE systems to reconstruct the reference regulatory network. The corpus is an extension of the BioNLP-ST 2011 BI corpus, which was derived from the LLL challenge corpus. For GRN, it has been expanded to cover more extensively the regulation network of a specific cellular function in Bacillus subtilis: sporulation. This phenomenon is an adaptation of the bacteria to scarce resource conditions (e.g. low nutrients). It has been thoroughly studied in the past, and the regulation network is stable and not subject to controversy.
The annotation was revised and enriched by a joint effort of the Bibliome team of the MIG Laboratory at the Institut National de la Recherche Agronomique (INRA) and the Laboratoire d'Informatique de Paris Nord at the Université Paris 13. The annotation has been carried out and validated by a senior bioinformatics/Bacillus subtilis specialist and by a bioinformatics/NLP engineer, using AlvisAE, the Annotation Editor. The annotation guidelines will be available in English.
The following picture shows the regulation network corresponding to the training data.
Representation and Task Setting
The GRN task is a relation extraction task that follows the BioNLP-ST 2013 frame of representation. The participants are provided with a manually curated annotation of the training corpus, which includes entities, events and relations, among them the genic interactions. For training, the participants are also provided with the genic regulation network that can be reconstructed from the interactions mentioned in the sentences of the training corpus.
The network is a directed graph where vertices represent genes and arcs represent interactions between genes extracted from the text. The arcs are labeled with an interaction type along two distinct axes, effect (Inhibition, Activation, Requirement) and mechanism (Binding, Transcription):
Inhibition: the Agent reduces the effect of the Target
Activation: the Agent increases the effect of the Target
Requirement: the Agent is necessary to the Target
Binding: the Agent binds to the Target; this includes Protein-DNA binding and excludes Protein-Protein binding mechanisms
Transcription: the Agent interacts with the Target by affecting its transcription by the RNA polymerase
When no mechanism or effect can be inferred, then the arc is labelled Regulation.
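As an illustration, the labeled directed graph described above could be held in memory as a set of typed arcs. This is only a minimal sketch: the names Arc, INTERACTION_TYPES and make_arc are hypothetical, and the node names A, B and C are placeholders, not genes from the corpus.

```python
from typing import NamedTuple

# The six arc labels named in the task description.
INTERACTION_TYPES = {
    "Regulation", "Inhibition", "Activation",
    "Requirement", "Binding", "Transcription",
}


class Arc(NamedTuple):
    """One directed, labeled arc of the network (hypothetical structure)."""
    agent: str   # Gene Identifier of the source node
    itype: str   # interaction type labeling the arc
    target: str  # Gene Identifier of the destination node


def make_arc(agent: str, itype: str, target: str) -> Arc:
    """Build an arc, rejecting labels outside the six known types."""
    if itype not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction type: {itype}")
    return Arc(agent, itype, target)


# Placeholder nodes; a set makes identical arcs collapse automatically.
network = {
    make_arc("A", "Activation", "B"),
    make_arc("A", "Activation", "B"),
    make_arc("B", "Regulation", "C"),
}
```

Using a set of tuples keeps the direction (agent to target) explicit and allows several arcs with different types between the same pair of nodes.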
Text-bound entity types
All text-bound entities are given as input in train and test phases, except for event triggers (Action) that are only given in the train phase. For genic entities, only those belonging to Bacillus subtilis are annotated; genes and proteins of other organisms are not annotated.
Action: trigger words for bio-molecular events (e.g. "transcription").
Gene: names of genes.
GeneFamily: names of ortholog gene families.
mRNA: transcribed messenger RNAs.
Operon: name of operons.
PolymeraseComplex: mention of RNA polymerase complexes, either vegetative or bound to a sigma factor.
Promoter: mention of gene promoters.
Protein: name of a protein.
ProteinComplex: mention of protein complexes formed by several proteins that bind together.
ProteinFamily: name of ortholog protein families.
Regulon: name or mention of regulons.
Site: mention of a site or position on the bacterial chromosome.
Event and relation types
All events and relations are given in the train phase. In the test phase they are not given; however, participants are only evaluated on the prediction of Interaction.* relations. The other events and relations are provided as guidance for training the systems.
The following types of event and arguments are given, along with the valid types for each argument.
Action_Target (event): generic bio-molecular events.
Target: Gene, Operon, GeneFamily, mRNA, Protein, ProteinComplex, ProteinFamily, PolymeraseComplex, Promoter, Regulon
Bind_to (relation): binding between a proteic entity and a site on the chromosome. This relation excludes Protein-to-Protein binding.
DNA: Gene, Site
Protein: Protein, ProteinComplex, PolymeraseComplex
Master_of_Promoter (relation): the control of the transcription from a specific promoter by a proteic entity.
Protein: Protein, ProteinComplex, PolymeraseComplex, Gene (in case of metonymy)
Master_of_Regulon (relation): the control of the activity of an entire regulon by a protein.
Master: Protein, Gene (in case of metonymy)
Member_of_Regulon (relation): membership of a genic entity to a regulon.
Member: Gene, Operon, Protein, ProteinComplex
Promoter_of (relation): relation between a gene and its promoter.
Gene: Gene, Operon
Site_of (relation): position of a genic entity on the chromosome.
Site: Site, Promoter
Entity: Gene, Operon, Promoter
Transcription_by (event): transcription by a specific RNA polymerase.
Agent: PolymeraseComplex, Protein, ProteinComplex, ProteinFamily, Gene, GeneFamily
Transcription_from (event): transcription from a specific promoter.
Site: Promoter, Site
Interaction.* (relation): interaction between two genic entities, events or relations. The '*' is replaced by the type of the interaction (Regulation, Inhibition, Activation, Requirement, Binding or Transcription).
Agent: Gene, Operon, GeneFamily, mRNA, Protein, ProteinComplex, PolymeraseComplex, ProteinFamily, Action_Target
Target: Gene, Operon, GeneFamily, mRNA, Protein, ProteinComplex, PolymeraseComplex, ProteinFamily, Action_Target, Transcription_by, Transcription_from, Interaction.Transcription, Interaction.Activation
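The argument constraints listed for Interaction.* lend themselves to a mechanical check before submission. The sketch below simply transcribes the two lists above into sets; the function name is_valid_interaction is hypothetical, and the check covers argument types only, not the rest of the annotation format.

```python
# Entity types shared by the Agent and Target lists of Interaction.*.
GENIC_TYPES = {
    "Gene", "Operon", "GeneFamily", "mRNA", "Protein",
    "ProteinComplex", "PolymeraseComplex", "ProteinFamily",
}

# Transcribed from the Agent and Target lists above.
AGENT_TYPES = GENIC_TYPES | {"Action_Target"}
TARGET_TYPES = GENIC_TYPES | {
    "Action_Target", "Transcription_by", "Transcription_from",
    "Interaction.Transcription", "Interaction.Activation",
}


def is_valid_interaction(agent_type: str, target_type: str) -> bool:
    """Check both argument types of a candidate Interaction.* relation."""
    return agent_type in AGENT_TYPES and target_type in TARGET_TYPES
```

For example, is_valid_interaction("Protein", "Transcription_by") holds, while an Interaction whose Agent is a Transcription_by event is rejected because that type only appears in the Target list.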
In the annotated corpus, genic entities that can potentially interact are assigned a Gene Identifier. This identifier is the name of the gene, operon or family denoted by the entity. For instance, the Gene Identifier of a Protein entity is the name of the gene that encodes the annotated protein. The provided Gene Identifier saves the participants the inconvenience of searching through nomenclatures of B. subtilis genes.
Inference of the regulation network
The genic regulation network corresponding to a corpus is inferred from the set of Interaction relations (manually annotated or predicted). The inference is done in two steps: resolution of Interaction relations, and removal of redundant arcs. The training data is distributed with a script that automatically performs these two steps.
Step 1: Resolution of Interaction relations
The Agent and the Target of an Interaction relation are not necessarily an entity with a Gene Identifier. They can be secondary events or relations (Action_Target, Transcription_by, or even another Interaction), or auxiliary entities (Promoter). The resolution of an Interaction aims to look for the entity with a Gene Identifier in order to infer the node concerned by the Interaction relation. The resolution of Interaction arguments is performed with the following rules:
If the Agent (or Target) is an entity, then the agent (or target) node is the Gene Identifier of the entity. If the entity does not have a Gene Identifier, there is no node (and thus no arc).
If the Agent (or Target) is an event, then the agent (or target) node is the annotation referenced by the event.
If the Agent (or Target) is a relation, then the agent (or target) nodes are the two arguments of the relation.
If the Target is a Promoter and this promoter is the argument of a Promoter_of relation, then the target node is the other argument of the Promoter_of relation. I.e. if A interacts with P, and P is a promoter of B, then A interacts with B.
If the Agent is a Promoter and this promoter is the argument of a Master_of_Promoter relation, then the agent node is the other argument of the Master_of_Promoter relation. I.e. if A is the master of promoter P, and P interacts with B, then A interacts with B.
These rules are applied iteratively. In other words, the resolution of Interaction arguments is a traversal of the graph of annotations: event and relation arguments are walked through, and Promoter entities are walked through according to the two Promoter rules above.
If the resolution of the Agent or the Target yields more than one node, then the Interaction resolves to as many arcs as there are pairs in the Cartesian product of the resolved nodes. For instance, if both the Agent and the Target resolve to two nodes, then the Interaction relation resolves into four arcs.
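The resolution rules above can be sketched as a recursive traversal of the annotation graph. This is not the distributed script: the dictionary layout (kind, type, gene_id, args), the function names, and the role name "Promoter" in the example are all assumptions made for illustration, and events are assumed to resolve through all of their arguments.

```python
from itertools import product


def resolve(ann_id, role, anns, seen=None):
    """Return the Gene Identifiers reached from one Interaction argument."""
    seen = set() if seen is None else seen
    if ann_id in seen:          # guard against cyclic annotations
        return set()
    seen.add(ann_id)
    ann = anns[ann_id]
    if ann["kind"] == "entity":
        if ann.get("gene_id"):  # entity rule: use its Gene Identifier
            return {ann["gene_id"]}
        if ann["type"] == "Promoter":
            # Promoter rules: a Target follows Promoter_of,
            # an Agent follows Master_of_Promoter.
            link = "Promoter_of" if role == "Target" else "Master_of_Promoter"
            nodes = set()
            for other in anns.values():
                args = other.get("args", {})
                if other["type"] == link and ann_id in args.values():
                    for arg in args.values():
                        if arg != ann_id:
                            nodes |= resolve(arg, role, anns, seen)
            return nodes
        return set()            # no Gene Identifier: no node, no arc
    # Events and relations: walk through every argument.
    nodes = set()
    for arg in ann["args"].values():
        nodes |= resolve(arg, role, anns, seen)
    return nodes


def interaction_arcs(inter_id, anns):
    """Resolve one Interaction.* relation into (agent, type, target) arcs."""
    inter = anns[inter_id]
    itype = inter["type"].split(".", 1)[1]
    agents = resolve(inter["args"]["Agent"], "Agent", anns)
    targets = resolve(inter["args"]["Target"], "Target", anns)
    return {(a, itype, t) for a, t in product(agents, targets)}


# Placeholder annotations: protein A binds promoter P, and P is a promoter of B.
anns = {
    "T1": {"kind": "entity", "type": "Protein", "gene_id": "A"},
    "T2": {"kind": "entity", "type": "Promoter", "gene_id": None},
    "T3": {"kind": "entity", "type": "Gene", "gene_id": "B"},
    "R1": {"kind": "relation", "type": "Promoter_of",
           "args": {"Promoter": "T2", "Gene": "T3"}},
    "R2": {"kind": "relation", "type": "Interaction.Binding",
           "args": {"Agent": "T1", "Target": "T2"}},
}
arcs = interaction_arcs("R2", anns)
```

In this example the Target is a Promoter without a Gene Identifier, so the traversal follows Promoter_of and the relation resolves to the single arc (A, Binding, B).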
Step 2: Removal of redundant arcs
In this step, arcs with the same Agent, Target and type are simplified into a single arc. This means that if the same interaction is annotated several times in the corpus, then it will resolve into a single arc. In terms of prediction, this also means that predicting only one of the interactions in the corpus is enough to reconstruct the arc.
Moreover, Interaction types are ordered in a hierarchy from more general to more specialized types [figure: hierarchy of Interaction types].
For a given arc, if there is another arc for the same node pair with a more specialized type, then the given (less specialized) arc is removed. For instance, the arcs (A, Regulation, B) and (A, Transcription, B) are simplified into (A, Transcription, B). Indeed, the former arc conveys no additional information in comparison with the latter.
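The redundancy removal can be sketched as follows. The full type hierarchy is not reproduced in this text, so the sketch assumes only what the example above implies, extended to the other types as a hypothesis: Regulation is more general than every other type, and no other ordering holds. MORE_GENERAL and simplify are hypothetical names, and the table should be replaced with the actual hierarchy.

```python
# ASSUMPTION: Regulation is the only "more general" type; the real
# hierarchy may order other types as well.
MORE_GENERAL = {
    "Regulation": set(),
    "Inhibition": {"Regulation"},
    "Activation": {"Regulation"},
    "Requirement": {"Regulation"},
    "Binding": {"Regulation"},
    "Transcription": {"Regulation"},
}


def simplify(arcs):
    """Merge duplicate arcs, then drop arcs superseded by a more specialized one."""
    unique = set(arcs)  # same Agent, Target and type collapse into one arc
    kept = set()
    for agent, itype, target in unique:
        superseded = any(
            a == agent and t == target and itype in MORE_GENERAL[other]
            for a, other, t in unique
        )
        if not superseded:
            kept.add((agent, itype, target))
    return kept


raw = [
    ("A", "Regulation", "B"),
    ("A", "Transcription", "B"),
    ("A", "Transcription", "B"),
    ("C", "Regulation", "D"),
]
network = simplify(raw)
```

Here the duplicate Transcription arc collapses, (A, Regulation, B) is dropped because a more specialized arc exists for the same pair, and (C, Regulation, D) is kept because nothing more specific was found for that pair.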
Submission and Evaluation
Participants can submit predictions in two ways:
a set of .a2 files, in the same format as for the other tasks, containing predictions of events and relations. These submissions will be evaluated by comparing the network inferred from their predictions against the reference network; or
a .sif file containing a predicted network. These submissions will be evaluated by comparing the provided network against the reference network. For such submissions, we have no means to assess the contribution of the extraction from the text in the prediction, so we kindly ask participants that submit .sif files to inform us about external resources used to make the prediction in order to satisfy our scientific curiosity.
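For the second submission route, a predicted network can be serialized with a few lines of code. The sketch below assumes the common simple interaction format layout of one arc per line, written as source node, interaction type, target node, separated by tabs. The exact column order and separator expected by the organizers are an assumption here and should be checked against the sample data; to_sif is a hypothetical helper name.

```python
def to_sif(arcs):
    """Serialize (agent, type, target) arcs, one tab-separated arc per line."""
    lines = [
        f"{agent}\t{itype}\t{target}"
        for agent, itype, target in sorted(set(arcs))
    ]
    return "\n".join(lines) + "\n" if lines else ""


# Placeholder arcs; the duplicate is written only once.
sif_text = to_sif([
    ("C", "Binding", "D"),
    ("A", "Activation", "B"),
    ("A", "Activation", "B"),
])
```

Sorting the arcs makes the output deterministic, which helps when comparing two versions of a prediction.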
The predicted network is compared to the reference network using the Slot Error Rate [Makhoul et al, 1999], SER = (S + I + D) / N, where:
S: number of substitutions.
I: number of insertions.
D: number of deletions.
N: number of arcs in the reference.
Since this measure is an error rate, the lower the better: a SER of zero means a perfect prediction. The SER has no upper bound, but a value below 1 is expected for decent predictions.
The participants are provided with a script that performs Interaction resolution and evaluation against a reference.
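A rough way to estimate the score locally is sketched below, using the usual definition SER = (S + I + D) / N. How the official script pairs predicted and reference arcs is not specified here, so the pairing is an assumption: for a given node pair, a wrong predicted type is paired with a missed reference type and counted as one substitution, leftover reference arcs count as deletions, and leftover predicted arcs count as insertions. The function name slot_error_rate is hypothetical.

```python
from collections import defaultdict


def slot_error_rate(reference, predicted):
    """Compute (S + I + D) / N over sets of (agent, type, target) arcs."""
    reference, predicted = set(reference), set(predicted)
    if not reference:
        raise ValueError("the reference network has no arcs")
    ref_types = defaultdict(set)
    pred_types = defaultdict(set)
    for agent, itype, target in reference:
        ref_types[(agent, target)].add(itype)
    for agent, itype, target in predicted:
        pred_types[(agent, target)].add(itype)
    subs = ins = dels = 0
    for pair in set(ref_types) | set(pred_types):
        ref = ref_types.get(pair, set())
        pred = pred_types.get(pair, set())
        missed = len(ref - pred)     # reference arcs not predicted
        spurious = len(pred - ref)   # predicted arcs not in the reference
        paired = min(missed, spurious)
        subs += paired               # ASSUMPTION: wrong type = substitution
        dels += missed - paired
        ins += spurious - paired
    return (subs + ins + dels) / len(reference)


# Placeholder networks with one substitution, one deletion, one insertion.
reference = {
    ("A", "Activation", "B"), ("A", "Regulation", "C"),
    ("B", "Inhibition", "C"), ("C", "Binding", "D"),
}
predicted = {
    ("A", "Activation", "B"), ("A", "Inhibition", "C"),
    ("C", "Binding", "D"), ("D", "Regulation", "E"),
}
ser = slot_error_rate(reference, predicted)
```

With N = 4 reference arcs and S = I = D = 1, this example gives a SER of 3 / 4 = 0.75. A prediction containing many spurious arcs can push the value above 1, which matches the remark that the measure has no upper bound.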
Illustrative examples in BioNLP format
Illustrative examples can be downloaded here: Sample Data
e-mail: robert (dot) bossy (at) jouy (dot) inra (dot) fr