J Am Med Inform Assoc doi:10.1136/amiajnl-2012-001576
  • Research and applications

Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions

  1. Dean F Sittig4
  1. Department of General Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
  2. Partners HealthCare, Boston, Massachusetts, USA
  3. Harvard Medical School, Boston, Massachusetts, USA
  4. The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
  5. Boston University School of Medicine, Boston, Massachusetts, USA
  1. Correspondence to Dr Adam Wright, Department of General Medicine, Brigham and Women's Hospital, 1620 Tremont St, Boston, MA 02115, USA; awright5{at}
  • Received 15 December 2012
  • Revised 31 January 2013
  • Accepted 2 March 2013
  • Published Online First 30 March 2013


Background Electronic health record (EHR) users must regularly review large amounts of data in order to make informed clinical decisions, and such review is time-consuming and often overwhelming. Technologies like automated summarization tools, EHR search engines and natural language processing have been shown to help clinicians manage this information.

Objective To develop a support vector machine (SVM)-based system for identifying EHR progress notes pertaining to diabetes, and to validate it at two institutions.

Materials and methods We retrieved 2000 EHR progress notes from patients with diabetes at the Brigham and Women's Hospital (1000 for training and 1000 for testing) and another 1000 notes from the University of Texas Physicians (for validation). We manually annotated all notes and trained a SVM using a bag-of-words approach. We then applied the SVM to the testing and validation sets and evaluated its performance with the area under the curve (AUC) and the F-measure.
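
The bag-of-words and SVM steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the example notes, the tokenizer, the hyperparameters (`epochs`, `learning_rate`, `reg`) and the simple subgradient-descent trainer are all assumptions made for the sketch, and a real system would more likely rely on an established SVM library.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the note and count alphanumeric tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def decision_value(weights, bias, features):
    """Signed score of a note under a linear model; positive means 'diabetes-related'."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items()) + bias


def train_linear_svm(docs, labels, epochs=20, learning_rate=0.1, reg=0.001):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    labels must be +1 (related) or -1 (not related). All defaults are
    illustrative assumptions, not values taken from the study.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in zip(docs, labels):
            margin = y * decision_value(weights, bias, features)
            # The regularization term shrinks every existing weight a little.
            for tok in weights:
                weights[tok] *= 1.0 - learning_rate * reg
            # The hinge loss contributes only when the note is inside the margin.
            if margin < 1.0:
                for tok, cnt in features.items():
                    weights[tok] = weights.get(tok, 0.0) + learning_rate * y * cnt
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, text):
    """Return +1 or -1 for a raw note."""
    return 1 if decision_value(weights, bias, bag_of_words(text)) >= 0 else -1


# Hypothetical, hand-written notes standing in for manually annotated data.
train_notes = [
    ("Patient with type 2 diabetes, A1c 8.2, continue metformin", 1),
    ("Insulin dose adjusted, blood glucose logs reviewed", 1),
    ("Diabetes follow-up: foot exam normal, glucose improved on metformin", 1),
    ("Ankle sprain after fall, ice and elevation advised", -1),
    ("Seasonal allergies, started loratadine", -1),
    ("Knee pain follow-up, physical therapy referral", -1),
]
docs = [bag_of_words(text) for text, _ in train_notes]
labels = [label for _, label in train_notes]
weights, bias = train_linear_svm(docs, labels)

print(predict(weights, bias, "A1c elevated, increase metformin, recheck glucose"))
print(predict(weights, bias, "Allergies worse, ankle swelling after fall"))
```

Because each note is reduced to token counts, the classifier depends only on vocabulary rather than on note structure, which is one plausible reason such a model could carry over to notes written in a different EHR.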

Results The model accurately identified diabetes-related notes in both the Brigham and Women's Hospital testing set (AUC=0.956, F=0.934) and the external University of Texas Faculty Physicians validation set (AUC=0.947, F=0.935).
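
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them from classifier output. It assumes that F denotes the balanced F-measure (the harmonic mean of precision and recall) and that AUC is the area under the receiver operating characteristic curve; the abstract does not spell out either definition, and the labels, scores and 0.5 threshold are made-up values for illustration only.

```python
def f_measure(y_true, y_pred):
    """Balanced F-measure: harmonic mean of precision and recall (labels are 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive note scores higher than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = diabetes-related) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

print(f"AUC = {auc(y_true, scores):.3f}")
print(f"F   = {f_measure(y_true, y_pred):.3f}")
```

AUC summarizes how well the scores rank notes across all thresholds, whereas F depends on the single threshold used to turn scores into labels, so the two values need not move together.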

Discussion Overall, the model we developed was accurate, with an AUC of at least 0.947 and an F of at least 0.934 on both sets. Furthermore, it generalized, with essentially no loss of accuracy, to another institution with a different EHR and a distinct patient and provider population.

Conclusions It is possible to use a SVM-based classifier to identify EHR progress notes pertaining to diabetes, and the model generalizes well.
