Collection of Cancer Stage Data by Classifying Free-text Medical Reports
- Iain A McCowan,
- Darren C Moore,
- Anthony N Nguyen,
- Rayleen V Bowman,
- Belinda E Clarke,
- Edwina E Duhig,
- Mary-Jane Fry
- Affiliations of the authors: CSIRO e-Health Research Centre (IAM, DCM, ANN), Brisbane, Australia; Department of Medicine, University of Queensland (RVB), Brisbane, Australia; Department of Anatomical Pathology, The Prince Charles Hospital (BEC, EED), Brisbane, Australia; Queensland Cancer Control Analysis Team, Queensland Health (MJF), Brisbane, Australia
- Correspondence: Iain McCowan, PhD, CSIRO e-Health Research Centre, PO Box 10842 Adelaide Street, Brisbane QLD 4000, Australia; e-mail: <iain.mccowan{at}csiro.au>
- Received 19 April 2006
- Accepted 2 August 2007
Abstract
Cancer staging provides a basis for planning clinical management and also allows meaningful analysis of cancer outcomes and evaluation of cancer care services. Despite this, stage data in cancer registries are often incomplete, inaccurate, or simply not collected. This article describes a prototype software system (Cancer Stage Interpretation System, CSIS) that automatically extracts cancer staging information from medical reports. The system uses text classification techniques, training support vector machines (SVMs) to extract the elements of stage listed in cancer staging guidelines. When processing new reports, CSIS identifies sentences relevant to the staging decision and then assigns the most likely stage. The system was developed using a database of staging data and pathology reports for 710 lung cancer patients, then validated on an independent set of 179 patients against the pathologic stage assigned by two independent pathologists. CSIS achieved an overall accuracy of 74% for tumor (T) staging and 87% for node (N) staging, and its errors were observed to mirror disagreements between human experts.
Footnotes
- The research in this article was done in partnership with the Queensland Cancer Control Analysis Team (QCCAT) within Queensland Health.
- 1. All 95% confidence intervals reported in this article are calculated using the Wilson procedure.
- 3. It was not feasible to resolve disagreements on detailed staging factors in the post-trial consensus meeting.
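
For readers unfamiliar with the Wilson procedure mentioned in footnote 1, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is not the authors' code. The function name `wilson_interval` and the counts in the usage lines are hypothetical: the abstract reports only rounded percentages, so 132 correct out of 179 is an illustrative figure, not a reported result.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for successes / n.

    z is the standard normal quantile; the default of about 1.96
    corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p = successes / n                      # observed proportion
    denom = 1.0 + z * z / n                # shared normalising term
    center = (p + z * z / (2 * n)) / denom # interval midpoint, pulled toward 0.5
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts chosen only to illustrate the call.
low, high = wilson_interval(132, 179)
print(f"accuracy = {132 / 179:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for proportions near 0 or 1, which is one common reason for preferring it with validation sets of this size.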









