Bacteria Biotopes (BB)

Downloads

Training data

Development data

Test data

Validation/Evaluation tool

Test evaluation service

Test evaluation results

The BioNLP-ST 2013 is completed. Thanks to all participating teams.

The final results are given in the tables below for each sub-task. Detailed results are available in the BB Test Results page.

Task 1

Task 2

Task 3

Good luck.

Goals in IE

    1. Promote Information Extraction on the subject of microorganism ecology.

    2. Assess the performance of automatic categorization systems.

    3. Assess the performance of relation extraction on this subject by different methods.

Motivation in biology

Surprisingly, there is no comprehensive database of the natural environments in which bacteria live, although this information is critical for studying, at the molecular level, how bacteria interact with their environment. The literature on bacterial ecology is abundant. As in 2011, extracting habitat mentions from text and attaching them to bacteria would fill this need. Moreover, once habitats are identified, they must be normalized so that they can be compared. Habitat ontologies such as EnvO and OntoBiotope are available, which offers an opportunity to develop and evaluate methods for normalizing habitat descriptions against these ontologies. More generally, this task relates to the semantic annotation of entities with ontologies.

The knowledge targeted by this task is the habitats where bacteria live and the environmental properties of bacteria. This information is of particular interest in food processing and safety, health sciences and waste processing. Fundamental research areas such as metagenomics and phylogeography/phyloecology also require this knowledge. There is currently no database that supplies the habitats of bacteria in a comprehensive way. Moreover, efforts to normalize habitats are only beginning. The diversity of habitats is such that several ongoing projects aim at building habitat ontologies (EnvO, OntoBiotope).

Representation and Task Setting

The BB task consists of:

    • Entity recognition of bacteria taxa and bacteria habitats.

    • Bacteria habitat categorization through the OntoBiotope-Habitat ontology.

    • Extraction of localization relations between bacteria and habitats.

We propose three sub-tasks:

    1. entity detection and categorization

    2. localization relation extraction

    3. localization extraction without gold entities

The three sub-tasks are independent, and the evaluation metrics will be distinct for each one. Participants may therefore submit to one, two or all three sub-tasks.

Sub-tasks 1 and 2 are elementary since each has a single prediction target (categorizations and relations, respectively). Sub-task 3 requires the prediction of relations as well as the boundaries of their arguments, so it may require more complex prediction systems.

Sub-task 1: entity detection and categorization

In sub-task 1, participants must detect the boundaries of bacteria habitat entities and, for each entity, assign one or several concepts of the OntoBiotope ontology. The ontology contains 1,700 concepts organized in a hierarchy of is-a relations. It is available for download in OBO format.
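As a rough illustration of how the is-a hierarchy could be read from an OBO file, here is a minimal sketch. It relies only on the generic OBO stanza layout (`[Term]`, `id:`, `is_a:`); the function name `parse_obo_is_a` and the sample identifiers `X:0001` and `X:0002` are hypothetical and are not taken from OntoBiotope.

```python
def parse_obo_is_a(text):
    """Return {term_id: [parent_id, ...]} from OBO-formatted text."""
    parents = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line[len("id:"):].strip()
            parents.setdefault(current, [])
        elif in_term and current and line.startswith("is_a:"):
            # "is_a: ID ! optional name" -> keep only the ID.
            target = line[len("is_a:"):].split("!")[0].strip()
            parents[current].append(target)
    return parents


# Hypothetical two-term sample, not real OntoBiotope content.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: X:0001
name: habitat

[Term]
id: X:0002
name: soil
is_a: X:0001 ! habitat
"""

obo_parents = parse_obo_is_a(SAMPLE_OBO)
print(obo_parents)
```

The child-to-parents mapping is a convenient shape for walking from a concept up to its ancestors, which is what an is-a based similarity needs.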

In the training phase, participants are provided with document texts along with manually annotated habitat entities and their concept attributions. In the test phase, participants are provided with document texts alone; their systems must predict habitat entities and their concept attributions.

Entity types. There is a single entity type: Habitat, which denotes mentions of potential bacteria habitats. These entities may be named entities, but they are mostly noun phrases, adjectives or subordinate clauses.

Normalization types. There is a single normalization type: OntoBiotope, which links Habitat entities to one or several concepts in the OntoBiotope ontology. All Habitat entities are associated with at least one concept.

Evaluation.

The evaluation is based on a 1-to-1 pairing between reference entities and predicted entities. This pairing maximizes a score S, defined as the similarity between a reference entity and a predicted entity:

S = J · W

    • J is the Jaccard index between the reference and predicted entity as defined in [Bossy et al, 2012]. J measures the boundary accuracy of the predicted entity.

    • W is the semantic similarity between the ontology concepts attributed to the reference entity and to the predicted entity. We use the semantic similarity described in [Wang et al, 2006]. This similarity is based exclusively on the is-a relationships between concepts. We set the w_is-a parameter to 0.65 in order to favor ancestor/descendant predictions over sibling predictions.
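The score of one reference/predicted pair can be sketched as follows. This is an illustration under stated assumptions, not the official evaluation tool: entities are taken to be single contiguous character spans `(start, end)`, each entity carries exactly one concept, and `TOY_PARENTS` is a hypothetical hierarchy. The similarity follows the S-value formulation of Wang et al. with the is-a weight set to 0.65.

```python
def jaccard(ref_span, pred_span):
    """Jaccard index of two (start, end) character spans, end exclusive."""
    overlap = max(0, min(ref_span[1], pred_span[1]) - max(ref_span[0], pred_span[0]))
    union = (ref_span[1] - ref_span[0]) + (pred_span[1] - pred_span[0]) - overlap
    return overlap / union if union else 0.0


def s_values(concept, parents, w=0.65):
    """S-value of `concept` and of each of its ancestors (best path wins)."""
    values = {concept: 1.0}
    frontier = [concept]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            candidate = w * values[node]
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(a, b, parents, w=0.65):
    """Semantic similarity of two concepts based on is-a edges only."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    shared = set(sa) & set(sb)
    total = sum(sa.values()) + sum(sb.values())
    return sum(sa[t] + sb[t] for t in shared) / total


def pair_score(ref_span, ref_concept, pred_span, pred_concept, parents):
    """S = J * W for one reference/predicted entity pair."""
    return jaccard(ref_span, pred_span) * wang_similarity(ref_concept, pred_concept, parents)


# Hypothetical hierarchy: soil and water are siblings under env.
TOY_PARENTS = {"env": ["root"], "soil": ["env"], "water": ["env"]}

print(wang_similarity("soil", "env", TOY_PARENTS))    # child vs. parent
print(wang_similarity("soil", "water", TOY_PARENTS))  # sibling vs. sibling
```

On this toy hierarchy the child/parent pair scores higher than the sibling pair, which matches the stated intent of the 0.65 setting.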

Submissions will be evaluated using the Slot Error Rate (SER), conventionally defined as SER = (S + I + D) / N, where:

    • S: number of substitutions, set to the sum of the S similarity scores defined above.

    • I: number of insertions: the number of predicted entities that could not be paired.

    • D: number of deletions: the number of reference entities that could not be paired.

    • N: number of entities in the reference.
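The arithmetic can be sketched as below, assuming the usual Slot Error Rate definition, (S + I + D) / N. The function name and the example counts are hypothetical; the substitution term is passed in as-is, since its exact computation is left to the official evaluation tool.

```python
def slot_error_rate(substitutions, n_reference, n_predicted, n_pairs):
    """SER from a 1-to-1 pairing, assuming SER = (S + I + D) / N."""
    if n_reference <= 0:
        raise ValueError("the reference must contain at least one entity")
    insertions = n_predicted - n_pairs  # predicted entities left unpaired
    deletions = n_reference - n_pairs   # reference entities left unpaired
    return (substitutions + insertions + deletions) / n_reference


# Hypothetical counts: 10 reference entities, 9 predictions, 8 pairs.
example_ser = slot_error_rate(substitutions=1.5, n_reference=10, n_predicted=9, n_pairs=8)
print(example_ser)
```

Unlike precision and recall, SER is an error measure: lower is better, and it can exceed 1 when a system produces many spurious entities.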

Sub-task 2: localization relation extraction

In sub-task 2, participants must detect localization and part-of relations between bacteria entities and habitat or geographical entities.

In the training phase, participants are provided with document texts along with manually annotated bacteria, habitat and geographical entities, and localization and part-of relations. In the test phase, participants are provided with document texts and entities. Participants must then predict which pairs of entities are linked by a relation.

Entity types

    • Bacteria entities denote names of bacteria taxa or strains.

    • Habitat entities denote mentions of potential bacteria habitats. These entities may be named entities, but they are mostly noun phrases, adjectives or subordinate clauses.

    • Geographical entities denote names of geographical or organization places.

Relation types

    • Localization relations link Bacteria entities with Habitat or Geographical entities. They denote a mention of a bacterium living in a specific place.

    • PartOf relations link two Habitat entities: one represents a living organism and the other a part of that organism (e.g. an organ).

Coreferences

Coreference chains of entities have also been annotated. They are represented by Equiv annotations. They will not be provided in the test phase, and participants are not expected to predict them. They are used only during the evaluation of submissions.

Evaluation

Submissions will be evaluated with Recall/Precision of predicted relations against gold relations. Coreferences represent equivalent entities with regard to relation arguments. This means that a predicted relation is a match if its arguments are in the same coreference chains as the arguments of a reference relation. Redundant relations will not be penalized.
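One way to picture this matching is to map every entity to a representative of its coreference chain before comparing relations. The sketch below assumes disjoint chains and relations stored as `(type, arg1, arg2)` tuples; the function names and entity identifiers are hypothetical, and the official tool may differ in detail.

```python
def canonical_map(equiv_chains):
    """Map each entity id to one representative of its coreference chain."""
    canon = {}
    for chain in equiv_chains:
        representative = min(chain)
        for entity_id in chain:
            canon[entity_id] = representative
    return canon


def precision_recall(gold, predicted, equiv_chains):
    """Precision and recall of relations, treating coreferent arguments as equal."""
    canon = canonical_map(equiv_chains)

    def normalize(rel):
        rel_type, arg1, arg2 = rel
        return (rel_type, canon.get(arg1, arg1), canon.get(arg2, arg2))

    gold_set = {normalize(r) for r in gold}
    # Using a set collapses redundant predictions, so they are not penalized.
    pred_set = {normalize(r) for r in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical document: T3 and T4 corefer.
gold = [("Localization", "T1", "T3")]
predicted = [
    ("Localization", "T1", "T4"),  # matches through the T3/T4 chain
    ("Localization", "T1", "T3"),  # redundant with the line above
    ("Localization", "T2", "T3"),  # false positive
]
example_pr = precision_recall(gold, predicted, [{"T3", "T4"}])
print(example_pr)
```

Here the first two predictions collapse into a single match, so precision is 1 out of 2 and recall is 1 out of 1.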

Sub-task 3: relation extraction without gold entities

Sub-task 3 is similar to sub-task 2, but participants must also predict the boundaries of entities (Bacteria, Habitat and Geographical).

In the training phase, participants are provided with document texts along with manually annotated bacteria, habitat and geographical entities, and localization and part-of relations. In the test phase, participants are provided with document texts only.

The entity and relation types are exactly the same as for sub-task 2.

Evaluation

Submissions will be evaluated with Recall/Precision of predicted relations against gold relations. Coreferences will be used in exactly the same way as for sub-task 2. The accuracy of entity boundaries will be factored into the scores in a relaxed way.

This sub-task is the same as the BioNLP-ST 2011 BB task [Bossy et al, 2012]. The evaluation results will be comparable, so we may assess the progress made over the past two years.

Illustrative examples in BioNLP format

Illustrative examples can be downloaded here: Sample Data
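For readers unfamiliar with standoff annotation, the following sketch reads entity lines of the general form used in BioNLP-ST files: an identifier, a tab, a type with character offsets, a tab, and the surface text. The exact layout of the BB files should be checked against the sample data; the function name and the two sample lines here are hypothetical.

```python
def parse_entities(standoff_text):
    """Parse 'T' lines into {id: (type, [(start, end), ...], surface_text)}."""
    entities = {}
    for line in standoff_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, equivalences and other annotations
        entity_id, info, surface = line.split("\t", 2)
        entity_type, offsets = info.split(" ", 1)
        # Discontinuous entities are assumed to list fragments separated by ';'.
        spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities[entity_id] = (entity_type, spans, surface)
    return entities


# Hypothetical annotation lines, not taken from the corpus.
SAMPLE_STANDOFF = "T1\tBacteria 0 16\tBorrelia garinii\nT2\tHabitat 29 34\tticks\n"

entities = parse_entities(SAMPLE_STANDOFF)
print(entities["T1"])
```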

Task corpus

The corpus is an extension of the BioNLP-ST 2011 BB corpus. Each document is centered on a species, a genus or a family of bacteria; it contains general information about their classification, ecology and interest in human activities. The corpus is a set of web page documents intended for a general audience that give general information about bacteria species in common language. These documents were taken from relevant public web sites. There are more than 20 source sites but the most important sources are:

In total, 2,040 documents were extracted, of which 85 were randomly selected for BioNLP-ST 2013. The documents have been annotated in double-blind mode by bioinformaticians of the Bibliome team of the MIG Laboratory at the Institut National de Recherche Agronomique (INRA) using the AlvisAE Annotation Editor. The guidelines will be available soon.

The three sub-tasks share the training and development sets. Sub-tasks 1 and 3 will share the test set.

Contact

    • e-mail: robert (dot) bossy (at) jouy (dot) inra (dot) fr