J Am Med Inform Assoc 18:459-465 doi:10.1136/amiajnl-2011-000108
  • Research and applications

Anaphoric relations in the clinical narrative: corpus creation

  1. Rebecca S Crowley3
  1. Children's Hospital Boston Informatics Program and Harvard Medical School, Boston, Massachusetts, USA
  2. Division of Biomedical Informatics, University of California San Diego, California, USA
  3. Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
  1. Correspondence to Dr Guergana Savova, Children's Hospital Boston Informatics Program, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02114, USA; guergana.savova{at}
  • Received 13 January 2010
  • Accepted 20 February 2011
  • Published Online First 1 April 2011


Objective: The long-term goal of this work is the automated discovery of anaphoric relations from the clinical narrative. This paper describes the creation of a gold standard set from a cross-institutional corpus of clinical notes, along with the high-level characteristics of that gold standard.

Methods: A standard methodology for annotation guideline development, gold standard annotations, and inter-annotator agreement (IAA) was used.
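The abstract does not state which agreement measure underlies the IAA figures. Purely as an illustration, the sketch below assumes agreement is scored as the proportion of matching anaphor-antecedent pairs between two annotators, 2 x matches / (|A| + |B|). The function name, the pair representation, and the sample data are hypothetical and are not taken from the paper.

```python
def pairwise_agreement(pairs_a, pairs_b):
    """Agreement between two annotators' sets of (anaphor, antecedent) pairs.

    ASSUMPTION: scored as 2 * |A & B| / (|A| + |B|). The metric actually
    used in the study is not given in the abstract.
    """
    a, b = set(pairs_a), set(pairs_b)
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical markable ids; each tuple is (anaphor, antecedent).
annotator_1 = [(2, 1), (4, 3), (6, 5)]
annotator_2 = [(2, 1), (4, 3), (8, 7)]
score = pairwise_agreement(annotator_1, annotator_2)
```

Here the annotators share two of their three pairs each, so the score is 2 x 2 / 6, roughly 0.67.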

Results: The gold standard annotations yielded 7214 markables, 5992 pairs, and 1304 chains. Each report averaged 40 anaphoric markables, 33 pairs, and seven chains. The overall IAA is high on the Mayo dataset (0.6607) and moderate on the University of Pittsburgh Medical Center (UPMC) dataset (0.4072). The IAA between each annotator and the gold standard is high (Mayo: 0.7669, 0.7697, and 0.9021; UPMC: 0.6753 and 0.7138). These results imply a quality corpus suitable for system development. They also suggest that the annotations performed by the experts are complementary, and that an annotator team with diverse knowledge backgrounds is important.
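To make the relationship between markables, pairs, and chains concrete, here is a minimal sketch that groups anaphor-antecedent pairs into chains by treating each chain as a connected set of markables (a union-find pass). This is a generic illustration rather than the authors' tooling; the function name and the example markables are hypothetical.

```python
from collections import defaultdict


def build_chains(pairs):
    """Group (anaphor, antecedent) pairs into chains of linked markables."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two markables of every pair into the same set.
    for anaphor, antecedent in pairs:
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect markables by their set representative.
    groups = defaultdict(set)
    for markable in list(parent):
        groups[find(markable)].add(markable)
    return sorted((sorted(g) for g in groups.values()), key=lambda c: c[0])


# Hypothetical markables from a single note.
pairs = [("the mass", "a 2 cm mass"), ("it", "the mass"), ("the patient", "she")]
chains = build_chains(pairs)
```

In this toy input, three pairs over five markables collapse into two chains: one for the mass and one for the patient.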

Limitations: Only one of the annotators had the linguistic background necessary for annotation of the linguistic attributes. Annotating data from additional sites will further strengthen the overall generalizability of the guidelines and will increase both the overall corpus size and the representation of each relation type.

Conclusion: This work describes the first step toward developing an anaphoric relation resolver as part of a comprehensive natural language processing system geared specifically to the clinical narrative in the electronic medical record. The deidentified annotated corpus will be available to researchers.


  • Funding: The work was funded by grant R01 CA127979.

  • Competing interests: None.

  • Provenance and peer review: Not commissioned; externally peer reviewed.
