Supporting Resources

The supporting resources for the BioNLP Shared Task 2013 provide task participants with annotations from state-of-the-art automated tools. The aim is to minimise the time investment needed to participate in the shared task and to let participants experiment with how to leverage automated analyses from existing Natural Language Processing systems. To facilitate this, the shared task organisers issued an open call to the Natural Language Processing community for supporting resources.

Resources and Formats

This section describes the supporting resources that are available for the shared task. The files to be downloaded can be found at the bottom of the page.

BioC

BioC attempts to address the BioCreative IV interoperability track and provides BioNLP ST 2013 participants with tokenisation, lemmatisation, sentence splitting, chunking and PoS-tagging in a unified BioC XML format to jump-start event extraction efforts.

  • MedPost, an open-source shallow parser that has been incorporated into the LingPipe NLP package and has been developed to suit the needs of biomedical texts

  • BioLemmatizer, an open-source lemmatiser specifically tailored for biomedical texts

In addition to the above resources BioC includes starter code in Java and C++ that demonstrates how to read and utilise the BioC XML data.

  • bioc.tar.gz (BioC data for the training and development sets for all tasks)

  • bioc_test.tar.gz (BioC data for the test set for all tasks)
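For participants who prefer Python to the Java and C++ starter code, the following is a minimal sketch of how annotations might be read from a BioC XML file with the standard library. It assumes the general BioC layout (collection, document, passage, annotation with infon, location and text children). The sample document, its identifier and the infon key "type" are invented for illustration and may not match the infon keys used in the distributed files.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general BioC layout; the real files may use
# different infon keys and will contain many more annotations.
SAMPLE = """<collection>
  <source>example</source>
  <document>
    <id>DOC-0</id>
    <passage>
      <offset>0</offset>
      <text>IL-2 activates STAT5.</text>
      <annotation id="T1">
        <infon key="type">token</infon>
        <location offset="0" length="4"/>
        <text>IL-2</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def iter_annotations(root):
    """Yield (document id, infons, offset, length, text) for each annotation."""
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            location = annotation.find("location")
            if location is None:
                continue  # skip annotations without a text span
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            yield (
                doc_id,
                infons,
                int(location.get("offset")),
                int(location.get("length")),
                annotation.findtext("text"),
            )


root = ET.fromstring(SAMPLE)
annotations = list(iter_annotations(root))
```

To read a file from one of the archives instead of the inline sample, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.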

If you have questions regarding BioC, please see the BioC homepage and/or contact: Don Comeau (comeau <at> ncbi.nlm.nih.gov), Rezarta Islamaj (rezarta.islamaj <at> nih.gov), Haibin Liu (haibin.liu <at> nih.gov) and John Wilbur (wilbur <at> ncbi.nlm.nih.gov).

If you make use of the BioC supporting resources, please cite any/all relevant publications below:

Larry Smith, Thomas Rindflesch, and W. John Wilbur. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics 20, 14 (2004), 2320-2321.

Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics, 2012, 3:3.

BioYaTeA

BioYaTeA is a term extractor developed by INRA. It is an extended version of YaTeA (Aubin and Hamon, 2006) adapted to the biomedical domain. The relevant extracted terms are noun-phrases. Participles and prepositional attachments are handled. BioYaTeA provides the term lemma, PoS-tags, internal syntactic structures (constituent analysis) and positions in the document. The output formats are the tabular format and XML-BioYaTeA format. The data for all tasks can be downloaded here.

BioYaTeA can be downloaded from the CPAN BioYaTeA page.

YaTeA can be downloaded from the CPAN YaTeA page.

    • bionlp_st_2013_BB_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_BB_test_bioyatea.tar.gz (test set)

    • bionlp_st_2013_CG_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_CG_test_bioyatea.tar.gz (test set)

    • bionlp_st_2013_GE_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_GE_test_bioyatea.tar.gz (test set)

    • bionlp_st_2013_GRN_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_GRN_test_bioyatea.tar.gz (test set)

    • bionlp_st_2013_GRO_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_GRO_test_bioyatea.tar.gz (test set)

    • bionlp_st_2013_PC_bioyatea.tar.gz (training and development sets)

    • bionlp_st_2013_PC_test_bioyatea.tar.gz (test set)

If you have any questions regarding BioYaTeA, please contact: Wiktoria Golik from INRA (wiktoria.golik <at> jouy.inra.fr) and/or Thierry Hamon from Lim&Bio (thierry.hamon <at> univ-paris13.fr)

If you make use of the BioYaTeA supporting resource, please cite:

Golik Wiktoria, Bossy Robert, Ratkovic Zorana and Nédellec Claire. (To appear in 2013). Improving Term Extraction with Linguistic Analysis in the Biomedical Domain. Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’13), Special Issue of the journal Research in Computing Science, ISSN 1870-4069, www.micai.org/rcs, 24-30 March, Samos, Greece, 2013.

If you make use of the YaTeA tool, please cite:

Aubin S. and Hamon T. (2006). Improving Term Extraction with Terminological Resources, in T. Salakoski, F. Ginter, S. Pyysalo, T. Pahikkala (eds.), Proc. of the Advances in Natural Language Processing, FinTAL’06, LNAI 4139, Springer, p. 380-387, 2006.

Cocoa

Cocoa is a dense annotator for biological text. In addition to the automated annotations provided below, it can be used through a web API described at npjoint.com. The annotations cover over 20 different semantic categories, among them macromolecules, chemicals, protein/DNA parts, complexes and organisms. The annotations are available in the brat stand-off format (.ann files), which for entity annotations is compatible with that of the BioNLP Shared Task 2013.
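As an illustration of the brat stand-off format, the sketch below parses entity (text-bound) lines from the contents of an .ann file. It assumes the usual brat convention of a tab-separated ID, a type followed by character offsets, and the covered text, with discontinuous spans separated by semicolons. The sample lines and the type names "Protein" and "Organism" are invented here and are not taken from the Cocoa output.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    type: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets
    text: str


def parse_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound annotations (IDs starting with 'T') from .ann content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # other annotation kinds are ignored in this sketch
        ann_id, type_and_spans, text = line.split("\t", 2)
        ann_type, span_str = type_and_spans.split(" ", 1)
        # Discontinuous annotations list several "start end" pairs joined by ';'
        spans = [tuple(int(x) for x in part.split()) for part in span_str.split(";")]
        entities.append(Entity(ann_id, ann_type, spans, text))
    return entities


# Invented sample lines, for illustration only.
SAMPLE_ANN = "T1\tProtein 0 4\tIL-2\nT2\tOrganism 20 25\tmouse\n"
entities = parse_entities(SAMPLE_ANN)
```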

The output from the Cocoa annotator can be downloaded below, with one archive for each task. Cocoa does not yet have a relevant publication to cite, so please cite the URL npjoint.com if you make use of the Cocoa system output for the shared task.

If you have any questions regarding Cocoa, please contact: S. V. Ramanan (ramanan <at> npjoint.com)

Syntactic Analyses

The application of syntactic analyses is commonplace for Information Extraction (IE) systems. For this purpose, the organisers of the BioNLP Shared Task 2013 are providing some fundamental supporting resources in the form of sentence splitting, tokenisation and syntactic parses. The procedure used to generate these resources is identical to the one used for the BioNLP Shared Task 2011 supporting resources, except that software was upgraded wherever a newer and improved version had been made available.

Please see the files below for checksums for all syntactic analysis files.
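A downloaded archive can be checked against a published checksum along the following lines. This is only a sketch: the page does not state which hash algorithm the checksum files use, so the algorithm is left as a parameter (the "sha256" default is an assumption), and the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with an expected value, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Demonstration on a throwaway file rather than a real archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive contents")
demo_digest = file_digest(tmp.name)
demo_ok = matches_checksum(tmp.name, demo_digest.upper())
os.remove(tmp.name)
```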

If you have any questions regarding the syntactic analyses, please contact: Pontus Stenetorp (pontus <at> stenetorp.se)

The full processing pipeline for the resources provided by the organisers is available here.

Sentence Splitting

Sentence splitting was performed using the machine learning-based Genia Sentence Splitter (GeniaSS), and its output was post-processed using a set of heuristics to correct common errors. The output format is a simple text format where sentences are separated by newlines.
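Because stand-off annotations refer to character offsets in the original text, it can be useful to map each sentence back to its position in the source document. The sketch below does this by searching for each sentence in order. It assumes that every sentence appears verbatim in the original text, which may not hold if the splitter or the post-processing alters characters within a sentence. The function name and sample text are invented for illustration.

```python
def sentence_offsets(original_text, split_text):
    """Return (start, end) offsets in original_text for each sentence line."""
    offsets = []
    cursor = 0
    for sentence in split_text.splitlines():
        sentence = sentence.strip()
        if not sentence:
            continue  # ignore blank lines
        # Search from the end of the previous sentence to keep the order.
        start = original_text.find(sentence, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in original text: {sentence!r}")
        end = start + len(sentence)
        offsets.append((start, end))
        cursor = end
    return offsets


# Invented example text and its newline-separated sentence split.
original = "IL-2 activates STAT5. It also binds IL-2R."
split = "IL-2 activates STAT5.\nIt also binds IL-2R.\n"
offsets = sentence_offsets(original, split)
```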

Tokenisation

Tokenisation was carried out using the GTB-tokenize.pl script, which attempts to mimic the tokenisation used by the Genia Treebank. The output format is a simple text format where the syntactic tokens are separated by spaces.
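Reading the tokenised output then amounts to splitting on whitespace. The sketch below assumes that the tokenised files keep one sentence per line, as in the sentence splitting output; this is not stated explicitly above. The sample line is invented and does not claim to reproduce the tokeniser's actual behaviour.

```python
def read_tokenised(tokenised_text):
    """Return a list of sentences, each a list of token strings."""
    return [line.split() for line in tokenised_text.splitlines() if line.strip()]


# Invented sample: two tokenised sentences, one per line.
sample = "The protein binds DNA .\nIt is expressed in T cells .\n"
sentences = read_tokenised(sample)
token_count = sum(len(tokens) for tokens in sentences)
```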

McCCJ

McCCJ denotes the BLLIP Parser using the self-trained biomedical model by David McClosky. If you make use of the McCCJ parses, please cite: McClosky D. (2010). Any Domain Parsing: Automatic Domain Adaptation for Parsing. Ph.D. Thesis, Brown.

Stanford Parser

The Stanford Parser is a widely adopted parser. If you make use of the parses from the Stanford Parser, please cite: Klein, D. and Manning, C. (2002). Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems.

Enju

The Enju Parser is a robust deep parser. If you make use of the Enju parses, please cite: Miyao, Y. and Tsujii, J. (2008). Feature forest models for probabilistic HPSG parsing. Computational Linguistics.

Supporting Resources Providers