The CHEMDNER corpus of chemicals and drugs and its annotation principles

Krallinger, Martin and Rabal, Obdulia and Leitner, Florian and Vázquez, Miguel and Salgado, David and Lu, Zhiyong and Leaman, Robert and Lu, Yanan and Ji, Donghong and Batista-Navarro, Riza Theresa and Sayle, Roger A. and Lowe, Daniel M. and Rak, Rafal and Huber, Torsten and Rocktäschel, Tim and Matos, Sérgio and Campos, David and Tang, Buzhou and Xu, Hua and Munkhdalai, Tsendsuren and Ryu, Keun Ho and Ramanan, S.V. and Nathan, Senthil and Žitnik, Slavko and Bajec, Marko and Weber, Lutz and Irmer, Matthias and Akhondi, Saber A. and Kors, Jan A. and Xu, Shuo and An, Xin and Sikdar, Utpal Kumar and Ekbal, Asif and Yoshioka, Masaharu and Dieb, Thaer M. and Choi, Miji and Verspoor, Karin and Khabsa, Madian and Giles, C. Lee and Liu, Hongfang and Ravikumar, Komandur Elayavilli and Lamurias, Andre and Couto, Francisco M. and Dai, Hong-Jie and Tsai, Richard Tzong-Han and Ata, Caglar and Can, Tolga and Usié, Anabel and Alves, Rui and Segura-Bedmar, Isabel and Martínez, Paloma and Oyarzábal, Julen and Valencia, Alfonso (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. "Journal of Cheminformatics", v. 7 (n. 1); pp.. ISSN 1758-2946.


Title: The CHEMDNER corpus of chemicals and drugs and its annotation principles
  • Krallinger, Martin
  • Rabal, Obdulia
  • Leitner, Florian
  • Vázquez, Miguel
  • Salgado, David
  • Lu, Zhiyong
  • Leaman, Robert
  • Lu, Yanan
  • Ji, Donghong
  • Batista-Navarro, Riza Theresa
  • Sayle, Roger A.
  • Lowe, Daniel M.
  • Rak, Rafal
  • Huber, Torsten
  • Rocktäschel, Tim
  • Matos, Sérgio
  • Campos, David
  • Tang, Buzhou
  • Xu, Hua
  • Munkhdalai, Tsendsuren
  • Ryu, Keun Ho
  • Ramanan, S.V.
  • Nathan, Senthil
  • Žitnik, Slavko
  • Bajec, Marko
  • Weber, Lutz
  • Irmer, Matthias
  • Akhondi, Saber A.
  • Kors, Jan A.
  • Xu, Shuo
  • An, Xin
  • Sikdar, Utpal Kumar
  • Ekbal, Asif
  • Yoshioka, Masaharu
  • Dieb, Thaer M.
  • Choi, Miji
  • Verspoor, Karin
  • Khabsa, Madian
  • Giles, C. Lee
  • Liu, Hongfang
  • Ravikumar, Komandur Elayavilli
  • Lamurias, Andre
  • Couto, Francisco M.
  • Dai, Hong-Jie
  • Tsai, Richard Tzong-Han
  • Ata, Caglar
  • Can, Tolga
  • Usié, Anabel
  • Alves, Rui
  • Segura-Bedmar, Isabel
  • Martínez, Paloma
  • Oyarzábal, Julen
  • Valencia, Alfonso
Item Type: Article
Journal/Publication Title: Journal of Cheminformatics
Date: 2015
ISSN: 1758-2946
Volume: 7
Freetext Keywords: Named entity recognition; BioCreative; Text mining; Chemical entity recognition; Machine learning; Chemical indexing; ChemNLP
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
Creative Commons Licenses: Attribution - No derivative works - Non commercial


The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at:
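
As a rough illustration of the annotation scheme and the agreement figure mentioned in the abstract, the sketch below models a mention with one of the seven SACEM classes and computes a percentage agreement between two annotators. The `Mention` record layout, the `percentage_agreement` function and the exact-span matching rule are assumptions made for this example; the abstract does not state how agreement was computed, and this is not the corpus file format.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class Sacem(Enum):
    """The seven SACEM classes listed in the abstract."""
    ABBREVIATION = "abbreviation"
    FAMILY = "family"
    FORMULA = "formula"
    IDENTIFIER = "identifier"
    MULTIPLE = "multiple"
    SYSTEMATIC = "systematic"
    TRIVIAL = "trivial"


class Mention(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    doc_id: str   # abstract identifier
    start: int    # character offset where the mention begins
    end: int      # character offset where the mention ends
    sacem: Sacem  # SACEM class assigned by the annotator


def percentage_agreement(a: Iterable[Mention], b: Iterable[Mention],
                         with_class: bool = False) -> float:
    """Percent of distinct mentions that both annotators marked.

    Assumption: two mentions match only if document and offsets are
    identical (and, when with_class is True, the SACEM class as well).
    """
    def key(m: Mention):
        span = (m.doc_id, m.start, m.end)
        return span + (m.sacem,) if with_class else span

    set_a = {key(m) for m in a}
    set_b = {key(m) for m in b}
    union = set_a | set_b
    if not union:
        return 100.0  # nothing annotated by either side counts as full agreement
    return 100.0 * len(set_a & set_b) / len(union)


# Made-up annotations from two annotators on one abstract.
annotator_1 = [
    Mention("doc1", 0, 7, Sacem.TRIVIAL),
    Mention("doc1", 10, 14, Sacem.FORMULA),
    Mention("doc1", 20, 25, Sacem.FAMILY),
]
annotator_2 = [
    Mention("doc1", 0, 7, Sacem.TRIVIAL),
    Mention("doc1", 10, 14, Sacem.ABBREVIATION),
    Mention("doc1", 30, 35, Sacem.SYSTEMATIC),
]

span_agreement = percentage_agreement(annotator_1, annotator_2)
class_agreement = percentage_agreement(annotator_1, annotator_2, with_class=True)
print(span_agreement, class_agreement)
```

Requiring the SACEM class to match as well gives a stricter figure than matching on spans alone, so the two numbers can differ noticeably for the same pair of annotators.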

More information

Item ID: 41177
DC Identifier:
OAI Identifier:
DOI: 10.1186/1758-2946-7-S1-S2
Official URL:
Deposited by: Memoria Investigacion
Deposited on: 07 Nov 2016 13:27
Last Modified: 07 Nov 2016 13:27