File Formats

The BioNLP Shared Task (ST) 2013 data uses standoff formats similar to those of the BioNLP Shared Task 2009 and BioNLP Shared Task 2011. In the standoff representation, the texts of the documents are kept separate from the annotations, which are connected to specific spans of text through character offsets. Annotations are associated with their texts by a file naming convention: the base name (file name without suffix) of an annotation file is the same as that of its text file. For example, the file PMID-1000.a1 contains annotations for the file PMID-1000.txt.

The BioNLP Shared Task 2013 file formats are identified by file name suffixes (".txt", ".a1", etc.) and are described in detail below.

General annotation structure

All annotation file formats follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type. Examples of annotation for an entity (T1), an event trigger (T2), an event (E1), an event modification (M1) and a relation (R1) are shown in the following.

T1 Protein 0 7 RFLAT-1

T2 Positive_regulation 53 62 activates

E1 Positive_regulation:T2 Theme:E2 Cause:T1

M1 Speculation E1

R1 Subunit-Complex Arg1:T1 Arg2:T3

Detailed descriptions of these annotations are given below.

Text-bound annotations

Text-bound annotations are an important category of annotation used in many of the file formats. A text-bound annotation identifies a specific span of text as an entity mention or event trigger and assigns it a type.

T1 Protein 0 7 RFLAT-1

T2 Positive_regulation 53 62 activates

All text-bound annotations follow the same structure. As in all annotations, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.
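The offset convention described above (start inclusive, end exclusive) matches Python slice semantics, so a text-bound line can be checked directly against its text. The following is a minimal sketch, not an official parser: the names TextBound and parse_text_bound are made up for illustration, and the example lines are written with explicit TAB characters.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    ann_type: str
    start: int  # index of the first character of the span
    end: int    # index of the first character AFTER the span
    text: str   # reference text from the annotation line


def parse_text_bound(line: str) -> TextBound:
    """Parse a continuous text-bound annotation: ID TAB 'type start end' TAB text."""
    ann_id, primary, text = line.rstrip("\n").split("\t")
    ann_type, start, end = primary.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


# First line of the example text file shown in this document.
DOC = ("RFLAT-1: a new zinc finger transcription factor that activates "
       "RANTES gene expression in T lymphocytes.")

t1 = parse_text_bound("T1\tProtein 0 7\tRFLAT-1")
t2 = parse_text_bound("T2\tPositive_regulation 53 62\tactivates")

# Because the end offset is exclusive, a plain slice recovers the span.
for ann in (t1, t2):
    assert DOC[ann.start:ann.end] == ann.text
```

Comparing the sliced text with the reference text is a cheap sanity check that offsets and text file are in sync.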

A number of tasks in BioNLP ST 2013 involve discontinuous text-bound annotations, that is, annotations that mark more than one continuous span of characters. The format for these annotations is a straightforward extension of that for continuous ones, with the offsets of the continuous character spans of the annotation separated from each other by semicolon characters (";") and the texts spanned by these catenated by single space characters (" ") to form the reference text. For example, the text "alpha and beta actin" can be marked as follows:

T1 Protein 0 5;15 20 alpha actin

T2 Protein 10 20 beta actin

(here, the offsets [0:5] span the text "alpha" and [15:20] span "actin".)
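As a rough illustration of the discontinuous format, the sketch below splits the offset field on semicolons and rebuilds the reference text by joining the spanned pieces with single spaces. The function names (parse_spans, reference_text, check_line) are hypothetical helpers, not part of any task software.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def parse_spans(offsets: str) -> List[Span]:
    """Parse an offset field such as '0 5;15 20' into [(0, 5), (15, 20)]."""
    spans = []
    for part in offsets.split(";"):
        start, end = part.split(" ")
        spans.append((int(start), int(end)))
    return spans


def reference_text(doc: str, spans: List[Span]) -> str:
    """Join the text of each continuous span with a single space."""
    return " ".join(doc[start:end] for start, end in spans)


def check_line(doc: str, line: str) -> bool:
    """Return True if the line's reference text matches the spans in doc."""
    ann_id, primary, ref = line.rstrip("\n").split("\t")
    ann_type, offsets = primary.split(" ", 1)
    return reference_text(doc, parse_spans(offsets)) == ref


doc = "alpha and beta actin"
assert check_line(doc, "T1\tProtein 0 5;15 20\talpha actin")
assert check_line(doc, "T2\tProtein 10 20\tbeta actin")
```

A continuous annotation is simply the one-span case, so the same code handles both forms.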

Annotation ID conventions

All annotation IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:

    • T : text-bound annotation (entity / event trigger)

    • E : event

    • M : event modification

    • R : relation

    • N : normalization (external reference)

Additionally, an asterisk ("*") can be used as a placeholder for an ID in special cases in the gold data, but should not be used in system output.
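One possible way to use these conventions when reading a file is to dispatch on the first character of the ID. The sketch below assumes IDs are exactly one upper-case letter followed by ASCII digits; the names ID_KINDS and annotation_kind are illustrative only.

```python
import re

# Mapping taken from the list of ID prefixes above.
ID_KINDS = {
    "T": "text-bound annotation",
    "E": "event",
    "M": "event modification",
    "R": "relation",
    "N": "normalization",
}

# Assumption: one upper-case letter followed by one or more ASCII digits.
ID_PATTERN = re.compile(r"([A-Z])([0-9]+)")


def annotation_kind(ann_id: str) -> str:
    """Return the annotation category for an ID, or 'placeholder' for '*'."""
    if ann_id == "*":
        return "placeholder"
    match = ID_PATTERN.fullmatch(ann_id)
    if match is None or match.group(1) not in ID_KINDS:
        raise ValueError(f"unrecognized annotation ID: {ann_id!r}")
    return ID_KINDS[match.group(1)]


assert annotation_kind("T13") == "text-bound annotation"
assert annotation_kind("E1") == "event"
assert annotation_kind("*") == "placeholder"
```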

Main task file formats

These file formats are relevant to participation in any of the main tasks.

Text files (.txt)

These files contain text from the original documents.

RFLAT-1: a new zinc finger transcription factor that activates RANTES gene expression in T lymphocytes.

RANTES (Regulated upon Activation, Normal T cell Expressed and Secreted) is a chemoattractant cytokine (chemokine) important in the generation of inflammatory infiltrate and human immunodeficiency virus entry into immune cells. RANTES is expressed late (3-5 days) after activation in T lymphocytes.

The texts are given as plain text files with ASCII characters and UNIX-style newline convention. The titles of documents and sections are separated from body text by a newline. However, in most tasks sentence segmentation is not provided, and abstract and section content text is given as a single long line without newlines.

Input annotation files (.a1)

These files contain annotations given as input in each task. In most tasks, the primary category of annotation found in these files is annotation of entity mentions. All entity annotations are given a unique ID and are defined by type (e.g. Protein or Chemical) and the span(s) of characters containing the entity mention.

T1 Protein 0 7 RFLAT-1

T2 Protein 63 69 RANTES

T3 Protein 105 111 RANTES

T4 Protein 113 176 Regulated upon Activation, Normal T cell Expressed and Secreted

[...]

Note that the .a1 annotation files with human-created "gold standard" annotations will be provided to participants for both training and test data. The extraction of information identified in these files is thus not necessary for participation.

Target annotation files (.a2)

These files contain annotation for events, relations, and other related information that is the target for extraction in each task.

Event annotations

Event annotations are given a unique ID and are defined by type (e.g. Binding or Localization), event trigger (the text stating the event) and arguments.

T13 Positive_regulation 53 62 activates

T14 Gene_expression 75 85 expression

T15 Gene_expression 343 352 expressed

T16 Phosphorylation 600 614 phosphorylated

[...]

E1 Positive_regulation:T13 Theme:E2 Cause:T1

E2 Gene_expression:T14 Theme:T2

E3 Gene_expression:T15 Theme:T5

E4 Phosphorylation:T16 Theme:T8

The event triggers, annotations marking the word or words stating each event, are text-bound annotations and their format is identical to that for entities. The IDs of triggers must not overlap with those of entities.

As for all annotations, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g. Theme, Cause, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
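The event line structure described above can be read with a few string splits. This is a sketch under the stated format only; parse_event and refers_to_event are hypothetical names. Arguments are kept as a list of (ROLE, ID) pairs rather than a dictionary, so that their order and any repeated roles are preserved.

```python
from typing import List, Tuple


def parse_event(line: str) -> Tuple[str, str, str, List[Tuple[str, str]]]:
    """Parse 'ID TAB TYPE:TRIGGER ROLE:ID ...' into its parts."""
    event_id, body = line.rstrip("\n").split("\t")
    fields = body.split(" ")
    # The first field is TYPE:ID, naming the event type and its trigger.
    event_type, trigger_id = fields[0].split(":")
    # Remaining fields are ROLE:ID pairs, in any order.
    args = []
    for field in fields[1:]:
        role, arg_id = field.split(":")
        args.append((role, arg_id))
    return event_id, event_type, trigger_id, args


def refers_to_event(arg_id: str) -> bool:
    """An argument filled by another event has an ID starting with 'E'."""
    return arg_id.startswith("E")


event_id, event_type, trigger_id, args = parse_event(
    "E1\tPositive_regulation:T13 Theme:E2 Cause:T1"
)
assert (event_type, trigger_id) == ("Positive_regulation", "T13")
assert [role for role, arg_id in args if refers_to_event(arg_id)] == ["Theme"]
```

In the example, the Theme of E1 is itself an event (E2), while the Cause is an entity (T1).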

Event annotations are a primary extraction target in the main tasks. Participants will be provided with human-created gold standard event annotations for the training and development data, but will need to create both event trigger and event annotations for the test data.

Relation annotations

Similarly to event annotations, relation annotations are given a unique ID and are defined by type and arguments.

R1 Subunit-Complex Arg1:T11 Arg2:T32

R2 Subunit-Complex Arg1:T10 Arg2:T32

R3 Protein-Component Arg1:T22 Arg2:T34

R4 Protein-Component Arg1:T22 Arg2:T36

The format is otherwise identical to that applied for Events, with the exception that the annotation does not identify a specific piece of text expressing the relation ("trigger" or "text binding").
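Since a relation line has no trigger, its first field is the bare relation type. A minimal sketch follows; parse_relation is an illustrative name, and storing the arguments in a dictionary assumes each role (Arg1, Arg2) occurs once per relation, as in the examples above.

```python
from typing import Dict, Tuple


def parse_relation(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Parse 'ID TAB TYPE ROLE:ID ROLE:ID' into (id, type, {role: id})."""
    rel_id, body = line.rstrip("\n").split("\t")
    rel_type, *arg_fields = body.split(" ")
    # Assumption: roles are unique within a relation, so a dict is safe.
    args = dict(field.split(":") for field in arg_fields)
    return rel_id, rel_type, args


rel_id, rel_type, args = parse_relation("R1\tSubunit-Complex Arg1:T11 Arg2:T32")
assert rel_type == "Subunit-Complex"
assert args == {"Arg1": "T11", "Arg2": "T32"}
```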

Additional entity annotations

The target (.a2) annotation files for some main tasks contain annotation identifying additional entities that are relevant to events but not among the given core entities found in the .a1 files. These annotations identify, for example, the cellular component to which a protein is moved in a Localization event or the domain that is bound in a Binding event. The annotations are specified as text-bound annotations, that is, their format is identical to that for the entities in the .a1 files (see above).

These annotations are only provided for training and development data, not test data, and they are a target of extraction. Systems participating in (sub)tasks involving these entities will thus need to extract them and include them in the output.

Event modification annotations

The target (.a2) annotation files for some main tasks contain an additional class of annotation identifying additional aspects of other annotations, such as events that are stated speculatively or in a negative context.

M1 Speculation E1

M2 Negation E2

Event modification annotations begin with an ID, separated by TAB from the modification type (e.g. Speculation or Negation), which is in turn separated by SPACE from the ID of the annotation that the modification applies to.
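A reader of these lines will often want to know which modifications apply to a given event. The sketch below parses modification lines and groups the modification types by target ID; parse_modification and modifications_by_target are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def parse_modification(line: str) -> Tuple[str, str, str]:
    """Parse 'ID TAB TYPE TARGET' into (mod_id, mod_type, target_id)."""
    mod_id, body = line.rstrip("\n").split("\t")
    mod_type, target_id = body.split(" ")
    return mod_id, mod_type, target_id


def modifications_by_target(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each target ID to the set of modification types applied to it."""
    by_target: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        mod_id, mod_type, target_id = parse_modification(line)
        by_target[target_id].add(mod_type)
    return dict(by_target)


mods = modifications_by_target(["M1\tSpeculation E1", "M2\tNegation E2"])
assert mods == {"E1": {"Speculation"}, "E2": {"Negation"}}
```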

Entity equivalence annotations

The target (.a2) annotation files contain an additional class of annotation identifying equivalence stated through simple local abbreviations and other aliasing between given entities, such as between interleukin-2 and IL-2 in the text "interleukin-2 (IL-2)".

* Equiv T3 T4

Equiv annotations are given a placeholder "*" in place of an ID, separated by TAB. The primary annotation consists of the relation type ("Equiv") and a set of two or more annotation IDs separated by SPACE. These annotations specify that the listed IDs are mutually interchangeable, so that any other annotation (e.g. an event) referencing such an ID would be interpreted identically if this ID were replaced with any other in the set.
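For evaluation or analysis it can be handy to look up, for any entity ID, the full set of IDs it is interchangeable with. The following sketch builds such a lookup. The names parse_equiv and equivalence_map are illustrative, and merging sets from different Equiv lines that share an ID is an assumption of this sketch, not something the format description states.

```python
from typing import Dict, Iterable, Set


def parse_equiv(line: str) -> Set[str]:
    """Parse '* TAB Equiv ID ID ...' into the set of equivalent IDs."""
    placeholder, body = line.rstrip("\n").split("\t")
    rel_type, *ids = body.split(" ")
    if placeholder != "*" or rel_type != "Equiv" or len(ids) < 2:
        raise ValueError(f"not an Equiv annotation: {line!r}")
    return set(ids)


def equivalence_map(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each ID to the set of IDs it is interchangeable with (itself included)."""
    groups: Dict[str, Set[str]] = {}
    for line in lines:
        merged = parse_equiv(line)
        # Assumption: Equiv sets that share an ID are merged into one set.
        for ann_id in list(merged):
            merged |= groups.get(ann_id, set())
        for ann_id in merged:
            groups[ann_id] = merged
    return groups


equiv = equivalence_map(["*\tEquiv T3 T4"])
assert equiv["T3"] == {"T3", "T4"}
assert equiv["T4"] == {"T3", "T4"}
```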

Note that although Equiv annotations will not be provided for test data, they are not extraction targets in the task, and participating systems should not output Equiv annotations.

File naming conventions

All files in the shared task follow the same naming convention, with the suffixes identifying the file format (see above) and the base name the text source the file relates to, as follows:

ID_SYSTEM - ID_NUMBER - SECTION_SPECIFICATION - SUBSECTION_NUMBER

Where

    • ID_SYSTEM identifies the system from which IDs are drawn, e.g. "PMID" for files for which the original source is PubMed or "PMC" for files for which the source is PubMed Central.

    • ID_NUMBER is the ID number within the ID system, e.g. "1234567" for a file with PubMed/PMC ID 1234567.

    • SECTION_SPECIFICATION identifies the top-level section of the document that the file relates to, consisting of

      • SECTION_NUMBER a running two-digit section number, "01" for the first top-level section etc. By convention, files relating to the title and abstract are given the number "00".

      • SECTION_TITLE the title of the section, as in the original document except with space replaced by underscore, e.g. "Materials_and_Methods". By convention, files relating to the title and abstract are given the title "TIAB".

    • SUBSECTION_NUMBER a running two-digit subsection number ("01" for the first subsection etc.) in the top-level section that the file relates to. The number is incremented by one for each subsection, sub-subsection or similar, thus "flattening" sub-subsection or further structure. For top-level sections with no subsections this string is empty. For text before the first subsection in top-level sections with subsections, the number is "00".

If the document has no sections, both SECTION_SPECIFICATION and SUBSECTION_NUMBER are empty.

Thus, for example,

    • PMID-123456: entire content (i.e. title and abstract) of the PubMed document with PMID 123456

    • PMC-1234567-00-TIAB: title and abstract of PubMed Central document with PMC ID 1234567

    • PMC-1234567-01-Introduction: the 1st top-level section, "Introduction", of PubMed Central document with PMC ID 1234567. The section has no subsections.

    • PMC-1234567-04-Results-07: 7th sequential subsection (or sub-subsection etc.) of the 4th top-level section, "Results", of PubMed Central document with PMC ID 1234567.

    • PMC-1234567-04-Results-00: text before first subsection of the 4th top-level section, "Results", of PubMed Central document with PMC ID 1234567.

Note that in cases where a top-level section has no text before the first subsection, files with SUBSECTION_NUMBER "00" would have no text content to refer to and are thus not included.
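The naming convention can be approximated with a regular expression over the base name. This is only a sketch: the pattern BASE_NAME and the function parse_base_name are illustrative, the ID system is assumed to consist of letters only, and a section title that itself ends in a hyphen followed by two digits would be misread as carrying a subsection number. Components that the convention describes as empty are returned as None.

```python
import re
from typing import Dict, Optional

# ID_SYSTEM-ID_NUMBER, optionally followed by -SECTION_NUMBER-SECTION_TITLE,
# optionally followed by -SUBSECTION_NUMBER.
BASE_NAME = re.compile(
    r"(?P<id_system>[A-Za-z]+)-(?P<id_number>[0-9]+)"
    r"(?:-(?P<section_number>[0-9]{2})-(?P<section_title>.+?)"
    r"(?:-(?P<subsection_number>[0-9]{2}))?)?"
)


def parse_base_name(base_name: str) -> Dict[str, Optional[str]]:
    """Split a base name (file name without suffix) into its components."""
    match = BASE_NAME.fullmatch(base_name)
    if match is None:
        raise ValueError(f"unexpected base name: {base_name!r}")
    return match.groupdict()


parts = parse_base_name("PMC-1234567-04-Results-07")
assert parts["section_title"] == "Results"
assert parts["subsection_number"] == "07"

# A document without sections has no section or subsection components.
assert parse_base_name("PMID-123456")["section_number"] is None
```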