
Bacteria Biotopes (BB)

Downloads

Training data

Development data

Test data

Validation/Evaluation tool

Test evaluation service

Test evaluation results

BioNLP-ST 2013 is now complete. Thanks to all participating teams.
The final results for each sub-task are given in the tables below. Detailed results are available on the BB Test Results page.

Task 1

Participant     Rank  SER
LIPN            1     0.661
Boun            2     0.676
LIMSI           3     0.678
IRISA-TexMex    4     0.932

Task 2

Participant     Rank  Recall  Precision  F1
TEES-2.1        1     0.28    0.82       0.42
IRISA-TexMex    2     0.36    0.46       0.40
Boun            3     0.21    0.38       0.27
LIMSI           4     0.04    0.19       0.06

Task 3

Participant     Rank  Recall  Precision  F1
TEES-2.1        1     0.12    0.18       0.14
LIMSI           2     0.04    0.12       0.06




Goals in IE

    1. Promote Information Extraction applied to microorganism ecology.

    2. Assess the performance of automatic categorization systems.

    3. Assess the performance of different relation extraction methods on this subject.


    Motivation in biology

    Surprisingly, there is no comprehensive database of the natural environments in which bacteria live, although this is critical information for studying the interaction mechanisms of bacteria with their environment at the molecular level. The literature on bacteria ecology is abundant. As in 2011, the extraction of habitat mentions from textual data, and their attachment to bacteria, would fill this need. Moreover, once the habitats are identified, they must be normalized so that they can be compared. Ontologies of habitats such as EnvO and OntoBiotope are available, so there is a good opportunity to develop and evaluate methods for normalizing habitat descriptions against these ontologies. This task is more generally related to the semantic annotation of entities with ontologies.

    The knowledge tackled by this task is the habitats where bacteria live and the environmental properties of bacteria. This information is particularly interesting in the fields of food processing and safety, health sciences and waste processing. There is also fundamental research that requires this knowledge, such as metagenomics or phylogeography/phyloecology. There is currently no database that supplies the habitats of bacteria in a comprehensive way. Moreover, efforts to normalize habitats are only beginning: the diversity of habitats is such that several ongoing projects aim at building habitat ontologies (EnvO, OntoBiotope).


    Representation and Task Setting

    The BB task consists in:

    • Entity recognition of bacteria taxa and bacteria habitats.

    • Bacteria habitat categorization through the OntoBiotope-Habitat ontology.

    • Extraction of localization relations between bacteria and habitats.

    We propose three sub-tasks:
    1. entity detection and categorization
    2. localization relation extraction
    3. localization extraction without gold entities
    The three sub-tasks are independent, and the evaluation metrics are distinct for each one; participants may therefore opt to submit to one, two or three sub-tasks.
    Sub-tasks 1 and 2 are elementary, since each has a single prediction target (categorizations and relations, respectively). Sub-task 3 requires the prediction of relations as well as the boundaries of their arguments, so it may require more complex prediction systems.

    Sub-task 1: entity detection and categorization


    In sub-task 1, participants must detect the boundaries of bacteria habitat entities and, for each entity, assign one or several concepts of the OntoBiotope ontology. The ontology contains 1,700 concepts organized in a hierarchy of is-a relations, and is available for download in OBO format.
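    OBO is a plain-text format of [Term] stanzas. As a rough illustration only (not the official tooling, and with made-up identifiers), a minimal reader for the fields this task needs — id, name and is-a parents — could look like this:

```python
def parse_obo(text):
    """Minimal OBO reader: collect name and is_a parents per [Term] stanza."""
    terms, cur = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            cur = {"is_a": []}            # start a new stanza
        elif cur is not None and ": " in line:
            key, _, val = line.partition(": ")
            if key == "id":
                terms[val] = cur          # index the stanza by its id
            elif key == "name":
                cur["name"] = val
            elif key == "is_a":
                cur["is_a"].append(val.split(" ! ")[0])  # drop the trailing comment
    return terms

# Toy stanza (identifiers are invented for the example):
sample = "[Term]\nid: MBTO:001\nname: soil\nis_a: MBTO:000 ! environment\n"
terms = parse_obo(sample)
```

    A real reader would also handle obsolete terms, synonyms and cross-references, which this sketch ignores.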

    In the training phase, participants are provided with document texts along with manually annotated habitat entities and their concept attributions. In the test phase, participants are provided with document texts alone; their systems must predict the habitat entities and their concept attributions.

    Entity types. There is a single entity type, Habitat, which denotes mentions of potential bacteria habitats. These entities may be named entities, but most are noun phrases, adjectives or subordinate clauses.

    Normalization types. There is a single normalization type, OntoBiotope, which links Habitat entities to one or several concepts in the OntoBiotope ontology. All Habitat entities are associated with at least one concept.

    Evaluation.
    The evaluation is based on a 1-to-1 pairing between reference entities and predicted entities. This pairing maximizes a score S, defined as the similarity between a reference and a predicted entity:

    S = J · W

    • J is the Jaccard index between the reference and predicted entity spans, as defined in [Bossy et al, 2012]. J measures the accuracy of the predicted entity's boundaries.
    • W is the semantic similarity between the ontology concepts attributed to the reference entity and those attributed to the predicted entity. We use the semantic similarity described in [Wang et al, 2006]. This similarity is based exclusively on the is-a relationships between concepts; we set the w_is-a parameter to 0.65 in order to favor ancestor/descendant predictions over sibling predictions.
    Submissions will be evaluated using the Slot Error Rate (SER):

    SER = (S + I + D) / N

    • S: substitution score — each paired reference/prediction entity contributes one minus the similarity score defined above, so a perfect match costs nothing.
    • I: insertions — the number of predicted entities that could not be paired.
    • D: deletions — the number of reference entities that could not be paired.
    • N: the number of entities in the reference.
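    As a sketch of the metric (assuming each paired entity contributes 1 − S to the substitution term, so that a perfect prediction yields SER = 0; the score-maximizing 1-to-1 pairing step itself is omitted for brevity):

```python
def ser(pair_scores, n_predicted, n_reference):
    """Slot Error Rate over a fixed 1-to-1 pairing.
    pair_scores: similarity S of each paired reference/prediction entity.
    Unpaired predictions count as insertions, unpaired references as deletions."""
    substitutions = sum(1.0 - s for s in pair_scores)   # perfect matches cost 0
    insertions = n_predicted - len(pair_scores)
    deletions = n_reference - len(pair_scores)
    return (substitutions + insertions + deletions) / n_reference
```

    For example, two perfectly matched entities out of two give SER = 0, while one half-similar pair plus one spurious prediction and one missed reference give SER = (0.5 + 1 + 1) / 2 = 1.25. Lower is better, and SER can exceed 1.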

    Sub-task 2: localization relation extraction


    In sub-task 2, participants must detect localization and part-of relations between bacteria entities and habitat or geographical-place entities.

    In the training phase, participants are provided with document texts along with manually annotated bacteria, habitat and geographical entities, and the localization and part-of relations between them. In the test phase, participants are provided with document texts and entities, and must predict which entities are related.

    Entity types
    • Bacteria entities denote names of bacteria taxa or strains.
    • Habitat entities denote mentions of potential bacteria habitats. These entities may be named entities, but most are noun phrases, adjectives or subordinate clauses.
    • Geographical entities denote names of geographical places or organizations.
    Relation types
    • Localization relations link a Bacteria entity with a Habitat or Geographical entity; they denote a mention of a bacterium living in a specific place.
    • PartOf relations link two Habitat entities, one representing a living organism and the other a part of that organism (e.g. an organ).
    Coreferences
    Coreference chains of entities have also been annotated; they are represented by Equiv annotations. They will not be provided in the test phase, and participants are not expected to predict them. They are used during the evaluation of submissions.

    Evaluation
    Submissions will be evaluated with the recall and precision of predicted relations against gold relations. Coreferent entities are equivalent with regard to relation arguments: a predicted relation matches a reference relation if each of its arguments is in the same coreference chain as the corresponding reference argument. Redundant relations will not be penalized.
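    A minimal sketch of this coreference-aware matching, assuming relations are (type, arg1, arg2) triples and Equiv chains are sets of entity ids (all names here are illustrative, not the official scorer):

```python
def canonical(entity, equiv_chains):
    """Map an entity to a fixed representative of its coreference chain."""
    for chain in equiv_chains:
        if entity in chain:
            return min(chain)    # any deterministic representative works
    return entity                # unchained entities represent themselves

def relation_prf(gold, pred, equiv_chains):
    """Recall/precision/F1 on relations, comparing arguments up to coreference.
    Normalizing into sets collapses duplicates, so redundancy is not penalized."""
    norm = lambda rels: {(t, canonical(a, equiv_chains), canonical(b, equiv_chains))
                         for t, a, b in rels}
    g, p = norm(gold), norm(pred)
    tp = len(g & p)
    recall = tp / len(g) if g else 0.0
    precision = tp / len(p) if p else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

    For instance, if T1 and T3 corefer, a predicted relation on T3 matches a gold relation on T1, and predicting both variants counts only once.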

    Sub-task 3: relation extraction without gold entities


    Sub-task 3 is similar to sub-task 2, but participants must also predict the boundaries of the entities (Bacteria, Habitat and Geographical).

    In the training phase, participants are provided with document texts along with manually annotated bacteria, habitat and geographical entities, and localization and part-of relations. In the test phase, participants are provided with document texts only.

    The entity and relation types are exactly the same as for sub-task 2.

    Evaluation
    Submissions will be evaluated with the recall and precision of predicted relations against gold relations. Coreferences will be used in exactly the same way as for sub-task 2. The accuracy of entity boundaries will be factored into the scores in a relaxed way.
    This sub-task is the same as the BioNLP-ST 2011 BB task [Bossy et al, 2012], so the evaluation results will be comparable and we can assess the progress made over the past two years.


    Illustrative examples in BioNLP format

    Illustrative examples can be downloaded here: Sample Data
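    For orientation, BioNLP-ST annotations are conventionally tab-separated standoff lines: text-bound entities (T ids) carry a type, character offsets and the surface string, while relations (R ids) carry a type and role:id argument pairs. The sketch below parses such lines; the role names are invented for the example, so consult the sample data for the actual types.

```python
def parse_standoff(text):
    """Parse BioNLP-ST-style standoff lines into entities and relations.
    Entity line:   T1<TAB>Type start end<TAB>surface text  (contiguous spans only)
    Relation line: R1<TAB>Type Role1:T1 Role2:T2"""
    entities, relations = {}, []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if fields[0].startswith("T"):                 # text-bound entity
            etype, start, end = fields[1].split(" ")
            entities[fields[0]] = (etype, int(start), int(end), fields[2])
        elif fields[0].startswith("R"):               # binary relation
            rtype, *args = fields[1].split(" ")
            relations.append((rtype, dict(a.split(":") for a in args)))
    return entities, relations
```

    Note this sketch ignores discontinuous spans (offsets joined with ";") and other annotation kinds such as Equiv lines, which the real data may contain.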


    Task corpus

    The corpus is an extension of the BioNLP-ST 2011 BB corpus. Each document is centered on a species, a genus or a family of bacteria; it contains general information about their classification, ecology and relevance to human activities. The corpus is a set of web page documents intended for a general audience that give general information about bacteria species in common language. These documents were taken from relevant public web sites. There are more than 20 source sites, but the most important sources are:

    2,040 documents were extracted, among which 85 were randomly selected for BioNLP-ST 2013. The documents were annotated in double-blind mode by bioinformaticians of the Bibliome team of the MIG Laboratory at the Institut National de Recherche Agronomique (INRA), using the AlvisAE Annotation Editor. The guidelines will be available soon.

    The three sub-tasks share the training and development set. Sub-tasks 1 and 3 will share the test set.


    Contact

    • e-mail: robert (dot) bossy (at) jouy (dot) inra (dot) fr