J Am Med Inform Assoc 2006;13:488-496 doi:10.1197/jamia.M2082
  • Focus on Automated Categorization Technique
  • Model Formulation

Advancing Biomedical Image Retrieval: Development and Analysis of a Test Collection

  1. William R Hersh,
  2. Henning Müller,
  3. Jeffery R Jensen,
  4. Jianji Yang,
  5. Paul N Gorman,
  6. Patrick Ruch
Affiliations of the authors: Department of Medical Informatics & Clinical Epidemiology (WRH, JRJ, JY, PNG), Oregon Health & Science University, Portland, OR; Medical Informatics Service (HM, PR), University & Hospitals of Geneva, Geneva, Switzerland
Correspondence and reprints: William R. Hersh, MD, Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, 3181 SW Sam Jackson Park Rd., BICC, Portland, OR 97239; e-mail: <hersh{at}ohsu.edu>
  • Received 12 February 2006
  • Accepted 12 June 2006

Abstract

Objective Develop and analyze results from an image retrieval test collection.

Methods After participating research groups obtained and assessed results from their systems in the image retrieval task of Cross-Language Evaluation Forum, we assessed the results for common themes and trends. In addition to overall performance, results were analyzed on the basis of topic categories (those most amenable to visual, textual, or mixed approaches) and run categories (those employing queries entered by automated or manual means as well as those using visual, textual, or mixed indexing and retrieval methods). We also assessed results on the different topics and compared the impact of duplicate relevance judgments.

Results A total of 13 research groups participated. Analysis was limited to the best run submitted by each group in each run category. The best results were obtained by systems that combined visual and textual methods. There was substantial variation in performance across topics. Systems employing textual methods were more resilient to visually oriented topics than those using visual methods were to textually oriented topics. The primary performance measure of mean average precision (MAP) was not necessarily associated with other measures, including those possibly more pertinent to real users, such as precision at 10 or 30 images.

Conclusions We developed a test collection amenable to assessing visual and textual methods for image retrieval. Future work must focus on how varying topic and run types affect retrieval performance. Users’ studies also are necessary to determine the best measures for evaluating the efficacy of image retrieval systems.

Introduction

Image retrieval is a poor stepchild to other forms of information retrieval (IR). Whereas a broad spectrum of Internet users, from lay persons to biomedical professionals, routinely perform text searching,1 fewer (though a growing number) search for images on a regular basis. Image retrieval systems generally take two approaches to indexing and retrieval of data. One is to perform indexing and retrieval of the textual annotations associated with images.2 A number of commercial systems employ this approach, such as Google Images (images.google.com) and Flickr (www.flickr.com). A second approach, called visual or content-based, is to apply image processing techniques to features in the images, such as color, texture, shape, and segmentation.3

Each approach to indexing and retrieval of images has its limitations. Little research has assessed the optimal approaches or limitations to text-based indexing of images. Greenes has noted one problem particular to biomedicine, which is the “findings-diagnosis continuum” that leads images to be described differently based on the amount of diagnostic inference the interpreter of the images is applying.4 Joergensen5 and Le Bozec and colleagues6 have also described other limitations of purely textual indexing of images for retrieval, such as the inability to capture synonymy, conceptual relationships, or larger themes underlying their content. One effort to improve the discipline of image indexing has been the Health Education Assets Library (HEAL) project, which aims to standardize the metadata associated with all medical digital objects, but its adoption remains modest at this time.7

Visual indexing and retrieval also have their limitations. In a recent review article of content-based image retrieval applied in biomedicine, Müller and colleagues noted that image processing algorithms to automatically identify the conceptual content of images have not been able to achieve the performance of IR and extraction systems applied to text.3 Visual image indexing systems have only been able to discern primitive elements of images, such as color (intensity and sets of color or levels of grey), texture (coarseness, contrast, directionality, linelikeness, regularity, and roughness), shape (types present), and segmentation (ability to recognize boundaries).

Another problem plaguing all image retrieval research has been the lack of robust test collections and realistic query tasks that allow comparison of system performance.3 8 A few initiatives exist for certain types of visual information retrieval (e.g., TRECVID for retrieval of video news broadcasts),9 but none focus on the biomedical domain.

The lack of useful test collections is one of the motivations for the ImageCLEF initiative, which aims to build test collections for image retrieval research. ImageCLEF has a lineage from several of the “challenge evaluations” that have been developed over the years to assess performance of IR systems. The foci within these initiatives are usually driven by the interests of the participating research groups. ImageCLEF arose from the Cross-Language Evaluation Forum (CLEF, www.clef-campaign.org), a challenge evaluation for IR from diverse languages,10 when a group of researchers developed an interest in evaluating retrieval of images annotated in a variety of different languages. Some participants in ImageCLEF expressed an interest in retrieval of biomedical images, which led to the image retrieval task described in this paper. CLEF itself is an outgrowth of the Text Retrieval Conference (TREC, trec.nist.gov), the original forum for evaluation of text retrieval systems. TREC and CLEF, along with their outgrowths, operate on an annual cycle of test collection development and distribution, followed by a conference where results are presented and analyzed.

The goals of TREC and CLEF are to build realistic test collections that simulate real-world retrieval tasks and enable researchers to assess and compare system performance.11 The goal of test collection construction is to assemble a large collection of content (documents, images, etc.) that resembles collections used in the real world. Builders of test collections also seek a sample of realistic tasks to serve as topics that can be submitted to systems as queries to retrieve content. The final component of test collections is relevance judgments that determine which content is relevant to each topic. A major challenge for test collections is to develop a set of realistic topics that can be judged for relevance to the retrieved items. Such benchmarks are needed by any researcher or developer in order to evaluate the effectiveness of new tools.

Test collections usually measure how well systems or algorithms retrieve relevant items. The most commonly used evaluation measures are recall and precision. Recall is the proportion of all relevant documents in the database that are retrieved, whereas precision is the proportion of retrieved documents that are relevant. Often there is a desire to combine recall and precision into a single aggregate measure. Although many approaches have been used for aggregate measures, the most frequently used one in TREC and CLEF has been the mean average precision (MAP).12 In this measure, which can only be used with ranked output from a search engine, precision is calculated at every point at which a relevant document is obtained. The average precision for a topic is then calculated by averaging the precision at each of these points. MAP is then calculated by taking the mean of the average precision values across all topics in the run. MAP has been found to be a stable measure for combining recall and precision, but it is a statistical aggregate whose value has no direct real-world meaning.13
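The calculation of MAP described above can be sketched in a few lines of Python. This is a minimal illustration rather than the trec_eval implementation. The function names and data layout are assumptions made for the example, and the sketch follows the common convention of dividing by the total number of relevant items, so that relevant items never retrieved contribute a precision of zero.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision observed at each rank where a relevant item appears.

    Assumption: the sum is divided by the total number of relevant items,
    so relevant items that are never retrieved count as zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    )
    return total / len(qrels)
```

For a topic with relevant items at ranks 1 and 3 and no other relevant items, the average precision is (1/1 + 2/3) / 2, or about 0.83.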

Test collections have been used extensively to evaluate IR systems in biomedicine. A number of test collections have been developed for document retrieval in the clinical domain.14 15 More recently, focus has shifted to the biomedical research domain in the TREC Genomics Track.16 Test collections are also being used increasingly for image retrieval outside of medicine.17 This paper provides an extended analysis of the results reported in the ImageCLEF 2005 overview paper.17

Methods

As noted above, test collections consist of three components: content items that actual users are interested in retrieving, topics that represent examples of their real information needs, and relevance judgments that denote which content is relevant (i.e., should be retrieved) to which topic. For the content of our collection, we set out to develop one of realistic size and scope. We aimed to use collections that already existed and did not intend to modify them (e.g., improve them with better metadata) other than organizing them into a common structure for the experiments. As such, we used the original annotations, which were not necessarily created for image retrieval. We obtained four collections of images that varied in both subject matter and existing annotation. Consistent with the nature of CLEF, they were annotated in different languages.

Tables 1 and 2 describe the collections used in the 2005 task. The Casimage collection consists of clinical case descriptions with multiple associated images of a variety of types, including radiographs, gross images, and microscopic images.18 While most of the case descriptions are in French, some are in English and a small number contain both languages. The Mallinckrodt Institute of Radiology (MIR) collection consists of nuclear medicine images, annotated around cases in English.19 The Pathology Education Instructional Resource (PEIR) is a large collection of pathology images (gross and microscopic) that are tagged using the HEAL format in English.20 PathoPIC is another pathology collection that has all images annotated in longer German and shorter English versions.21

Table 1

Collection Origin and Types for ImageCLEFmed 2005 Library

Table 2

Items and Sizes of Collections in ImageCLEFmed 2005 Library

Images and annotations were organized into a single library, which was structured as shown in Figure 1. The entire library consists of multiple collections. Each collection is organized into cases that represent one or more related images and annotations. Each case consists of a group of images and an optional annotation. Each image is part of a case and has optional associated annotations, which consist of metadata and/or a textual annotation.

Figure 1

Structure of test collection library.
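The library structure can be illustrated as a set of nested records. The following Python sketch is only one possible representation. All class and field names are hypothetical and do not reflect the actual file layout of the ImageCLEFmed library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    # Metadata and/or free text; both parts are optional.
    metadata: Dict[str, str] = field(default_factory=dict)
    text: Optional[str] = None


@dataclass
class Image:
    image_id: str
    annotation: Optional[Annotation] = None  # image-level annotation is optional


@dataclass
class Case:
    case_id: str
    images: List[Image] = field(default_factory=list)
    annotation: Optional[Annotation] = None  # case-level annotation is optional


@dataclass
class Collection:
    name: str
    cases: List[Case] = field(default_factory=list)


@dataclass
class Library:
    collections: List[Collection] = field(default_factory=list)

    def image_count(self) -> int:
        """Total number of images across all collections and cases."""
        return sum(
            len(case.images)
            for collection in self.collections
            for case in collection.cases
        )
```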

We developed 25 topics for the test collection, each consisting of a textual information needs statement and an index image. The topics were classified based on topic categories reflecting whether they were more amenable to retrieval by visual, textual, or mixed algorithms. Eleven topics were visually oriented (topics 1–11), 11 topics were mixed (topics 12–22), and three topics were semantically oriented (topics 23–25). Because the images were variously annotated in English, German, or French, the topics were translated into all three languages. (See Figure 2 for an example of one topic and the Appendix, available as a JAMIA on-line supplement at www.jamia.org, for all the topics.)

Figure 2

Example of visually (left) and semantically (right) oriented topics from the test collection.

The experimental process was conducted by providing each group with the collection and topics. They then carried out runs, consisting of the same retrieval approach applied to all 25 topics. Groups were allowed to submit as many runs as they wished, but were required to classify them based on whether the run used manual modification of topics (automatic vs. manual) and whether the system used visual retrieval, text retrieval, or both (visual vs. textual vs. mixed). The two categories of topic modification and three categories of retrieval system type led to six possible run categories to which a run could belong (automatic-visual, automatic-textual, automatic-mixed, manual-visual, manual-textual, and manual-mixed).
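The six run categories are simply the cross product of the two classification axes. As a small illustration in Python (the constant and function names are hypothetical):

```python
from itertools import product

TOPIC_MODIFICATION = ("automatic", "manual")
RETRIEVAL_TYPE = ("visual", "textual", "mixed")

# Every combination of topic modification and retrieval type.
RUN_CATEGORIES = [
    f"{modification}-{retrieval}"
    for modification, retrieval in product(TOPIC_MODIFICATION, RETRIEVAL_TYPE)
]


def run_category(manual: bool, retrieval: str) -> str:
    """Return the category label for a run; reject unknown retrieval types."""
    if retrieval not in RETRIEVAL_TYPE:
        raise ValueError(f"unknown retrieval type: {retrieval}")
    return f"{'manual' if manual else 'automatic'}-{retrieval}"
```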

For systems using textual techniques, runs were designated as using manual modification if the topics were processed in any way by humans before being entered as queries into systems. Otherwise the processing of topics was deemed to be automatic, and could consist of such techniques, for example, as (automatically) mapping text into controlled terminologies, expanding words with synonyms, or translating words into different languages. Systems could use either the translations provided in the topic statements or translate across languages using their own approaches. Any manual translation of topics would require the run to be categorized as manual.

The final component of the test collection was the relevance judgments. As with most challenge evaluations, the collection was too large to judge every image for each topic. So, as is commonly done in IR research, we developed “pools” of images for each topic consisting of the top-ranking images in the runs submitted by participants.12 There were 13 research groups who took part in the task and submitted a total of 134 official runs. To create the pool for each topic, the top 40 images from each submitted run were combined, with duplicates omitted. This resulted in pools with an average size of 892 images (range 470–1167). For the 25 topics, a total of 21,795 images were in the pools for relevance judgments.
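The pooling step can be sketched as follows. This is an illustrative Python outline, not the scripts used for ImageCLEF. The data layout (each run as a mapping from topic identifier to a ranked list of image identifiers) and the function name are assumptions.

```python
from typing import Dict, Iterable, List, Set


def build_pools(
    runs: Iterable[Dict[str, List[str]]], depth: int = 40
) -> Dict[str, Set[str]]:
    """Union the top `depth` images of every run for each topic.

    Using a set removes images that appear in more than one run.
    """
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for topic, ranked in run.items():
            pools.setdefault(topic, set()).update(ranked[:depth])
    return pools
```

Only images in a topic's pool are judged, so an image that no submitted run ranked within the pool depth is treated as not relevant in the evaluation.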

The relevance assessments were performed by physicians who were also graduate students in the OHSU biomedical informatics program. A simple interface from previous ImageCLEF relevance assessments was used. Nine judges, all medical doctors except for one image processing specialist with medical knowledge, performed the relevance judgments. All of the images for a given topic were assessed by a single judge. The number of topics assessed by each judge depended on how much time he or she had available and ranged from four to eight. Some judges also performed duplicate assessment of other topics. Half of the images for 20 of the 25 topics were judged in duplicate, 9,279 in all.

Once the relevance judgments were done, we could then calculate the results of the experimental runs submitted by ImageCLEF participants. We used the trec_eval evaluation package (available from trec.nist.gov), which takes the output from runs (a ranked list of retrieved items for each topic) and a list of relevance judgments for each topic (called qrels) to calculate a variety of relevance-based measures on a per-topic basis that are then averaged over all the topics in a run. The trec_eval package includes MAP (our primary evaluation measure), binary preference (B-Pref),22 precision at the number of relevant images (R-Prec), and precision at various levels of output from 5 to 1,000 images (e.g., precision at 5 images, 10 images, etc., up to 1,000 images). We also released the judgments so participants could perform additional runs and determine their results.
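Two of the simpler measures, precision at a fixed number of images and R-Prec, can be illustrated with a short Python sketch. It is not the trec_eval code. The function names are hypothetical, and the sketch assumes that precision at k divides by k even when fewer than k images were retrieved.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k retrieved images that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k  # assumption: divide by k even if fewer than k were retrieved


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant images for the topic."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return precision_at_k(ranked, relevant, r)
```

Both values are computed per topic and then averaged over all topics in a run, in the same way as average precision.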

Although 134 runs were submitted for official scoring, many of these runs consisted of minor variations on the same technique, e.g., substitution of one term-weighting algorithm with another. We therefore limited our analysis of results to the best-performing run in a given run category from each group, for a total of 27 runs. Although this reduced our overall statistical power, it prevented groups that submitted multiple runs representing minor changes to algorithms from being over-represented in the statistical analysis.
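The reduction from all submitted runs to the best run per group and run category can be sketched as below. This Python outline assumes that "best-performing" means highest MAP. The record type and function name are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class RunResult(NamedTuple):
    group: str
    category: str  # e.g. "automatic-mixed"
    run_id: str
    map_score: float


def best_runs(results: List[RunResult]) -> List[RunResult]:
    """Keep the highest-MAP run for each (group, run category) pair."""
    best: Dict[Tuple[str, str], RunResult] = {}
    for result in results:
        key = (result.group, result.category)
        if key not in best or result.map_score > best[key].map_score:
            best[key] = result
    # Sort from highest to lowest MAP for reporting.
    return sorted(best.values(), key=lambda r: r.map_score, reverse=True)
```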

Because our analysis was not hypothesis-driven, we limited our statistical analysis to an overall repeated measures analysis of variance (ANOVA) of MAP for the 27 runs as well as calculation of inter-rater relevance judgment agreement using the kappa statistic. Statistical analyses were performed using SPSS, version 12.0. Posthoc pairwise comparisons for the repeated measures ANOVA were done using the Sidak adjustment. For inter-rater agreement, the kappa statistic was calculated in two ways: with three categories (relevant, partially relevant, and not relevant) and with two categories (using the official category of relevance based on images judged as fully relevant).
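For readers unfamiliar with the kappa statistic, the following Python sketch shows an unweighted Cohen's kappa for two judges, together with the collapse from three relevance categories to two. The analyses in this paper were run in SPSS. The label strings and function names here are assumptions made for the illustration.

```python
from collections import Counter
from typing import List, Sequence


def cohen_kappa(first: Sequence[str], second: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two judges rating the same items."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Agreement expected by chance from each judge's label frequencies.
    expected = sum(
        counts_first[label] * counts_second[label] for label in counts_first
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


def to_strict(labels: Sequence[str]) -> List[str]:
    """Collapse three categories to two: only 'relevant' stays relevant."""
    return ["relevant" if label == "relevant" else "not_relevant" for label in labels]
```

Calling `cohen_kappa` on the raw labels gives the three-category value, and calling it on `to_strict` of each judge's labels gives the two-category value.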

Results

Run Analysis

A total of 13 research groups submitted 134 runs. Table 3 lists the research groups, the number of runs submitted, and their general approaches. It also contains citations to each group’s individual paper for more details.23 24 25 26 27 28 29 30 31 32 33 34 Table 4 shows the 27 best runs in each run category submitted by each group. Figure 3 shows the MAP for all 27 analyzed runs with 95% confidence intervals. The ANOVA of MAP on the reduced set of 27 runs indicated that at least some runs were significantly different from others (p<0.001). Posthoc pair-wise comparison of MAP showed that runs differed significantly from the top run, IPALI2R_Tian, starting at I2Rfus.txt, about one-third of the way down the ranking. Figure 4 shows the rest of the performance measures for each run.

Table 3

Research Groups, Runs Submitted, General Approaches, Citation

Table 4

Best Runs from Each Group in Each Run Category Sorted by Mean Average Precision (MAP)

Figure 3

MAP for each run, sorted from highest to lowest, with 95% confidence intervals.

Figure 4

All results from Table 4, sorted by MAP.

Also shown are results from other evaluation measures, including R-Prec, binary preference (B-Pref), and precision at 10, 30, and 100 images (P10, P30, and P100 respectively).

It can be seen that the best results came from the automatic-mixed run category. However, it can also be seen that some performance statistics do not follow the same trend as MAP. For example, the OHSUmanvis run outperforms all but the top few runs in precision at 10 and 30 images. Conversely, the SinaiEn_okapi_nofb_Topics run took a dip with those measures relative to others with comparable MAP.

Topic Analysis

Our next analysis looked at differences by topic. Table 5 shows the results for each topic as well as averages for all topics and by topic categories. We again only used the best runs from each group for each run category to calculate these values in order to keep those completing larger numbers of runs within a run category from biasing the average. As seen in Table 5, a large diversity of results was obtained from the different topics. Because the selection of runs for this analysis could affect the results, the analysis should be used mainly to compare differences among the topics rather than the performance of systems on any particular one.

Table 5

Retrieval Results for Each Topic (Averaged Across All Runs) as Well as Topic Categories (Visual, Mixed, and Textual)

Figure 5 plots the number of relevant images and MAP per topic on the same graph, showing a modest association between these measures. Figure 6 shows the best run in each run category plotted versus the various topic categories of visual, mixed, and semantic. It can be seen that visual retrieval techniques performed poorly on semantically oriented topics, bringing down their overall performance.

Figure 5

Number of relevant images vs. MAP for the 25 topics based on results from each group’s best run in each run category.

Figure 6

MAP for the best performing run in each run category (denoted to the right of the graph) for each topic category. These results demonstrate that textual systems were more resilient for visual topics than visual systems were for textual topics.

Impact of Variable Relevance Judgments

We also assessed the impact of variation in relevance judgments. Table 6 shows the overlap of judgments between the original and duplicate judges. Judges were more often in agreement at the ends (not relevant, relevant) than the middle (partially relevant) of the scale. For the 9,279 duplicate judgments using three categories, the kappa score was 0.679 (p<0.001). The kappa statistic for strict relevance was 0.74, indicating “good” agreement.

Table 6

Overlap of Relevance Judgments

We also looked at how different relevance judgments impacted MAP. In addition to the official “strict” relevance, we also assessed “lenient” relevance, where partially relevant images were also considered relevant. We also combined the 9,279 duplicate judgments with the official ones using AND (both judgments had to be relevant for the image to be considered relevant) and OR (only one judgment had to be relevant for the image to be considered relevant) with both strict and lenient relevance. As shown in Figure 7, different judgments led to modest absolute changes in MAP, but performance relative to other runs was largely unchanged.

Figure 7

The impact of varying relevance judgments. The values of MAP are shown for each run with different sets of relevance judgments, from the official Strict method to those using more lenient judgments and/or incorporating duplicate judgments into the analysis, as described in the text.
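The alternative judgment sets described in the text can be derived mechanically from the two sets of labels. The Python sketch below is illustrative only. The label strings and function names are hypothetical, and it assumes that an image without a duplicate judgment keeps its official judgment, a detail the text does not specify.

```python
from typing import Dict, Set


def is_relevant(label: str, lenient: bool) -> bool:
    """Strict: only 'relevant' counts. Lenient: 'partial' counts as well."""
    return label == "relevant" or (lenient and label == "partial")


def combined_relevant(
    official: Dict[str, str],
    duplicate: Dict[str, str],
    mode: str,
    lenient: bool = False,
) -> Set[str]:
    """Return the images treated as relevant for one topic under AND or OR."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    relevant: Set[str] = set()
    for image_id, label in official.items():
        first = is_relevant(label, lenient)
        if image_id in duplicate:
            second = is_relevant(duplicate[image_id], lenient)
            keep = (first and second) if mode == "AND" else (first or second)
        else:
            keep = first  # assumption: fall back to the official judgment
        if keep:
            relevant.add(image_id)
    return relevant
```

Each resulting set can then serve as the relevance judgments for recomputing MAP, giving one value per run for each combination of strict or lenient relevance with AND or OR.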

Discussion

The ImageCLEF 2005 biomedical task developed a large test collection and attracted research groups who brought a diverse set of approaches to a common goal of efficacious image retrieval. Not only did these groups learn from their own experiments, but other researchers will subsequently be able to improve image retrieval by using the test collection that will now be available.

A variety of conclusions can be drawn from the experiments performed in ImageCLEF 2005. First, it was clear for most research groups that systems mixing visual and textual approaches performed better than those using either approach alone. In addition, our experiments also showed that systems employing textual approaches are more resilient to difficult visually oriented topics than visual systems are to difficult textually oriented topics. In other words, based on these results, image retrieval systems that use visual techniques should also incorporate text retrieval capabilities for maximum performance.

A final conclusion was that MAP may not be the best measure for the image retrieval task. MAP measures the full range of retrieval results for a topic from low to high recall. In the image retrieval task, however, users may be more precision-oriented than recall-oriented. In other words, users may only want a small to moderate number of relevant images, and not every last relevant one. This contrasts with, say, someone carrying out a systematic review who needs to retrieve every last relevant document in a text retrieval system. The problem with MAP versus other measures is exemplified in the OHSUmanvis run. This run achieves very high precision at 10 and 30 images but much lower MAP than other runs with comparable precision at these levels. As such, this run may be desirable from the user’s standpoint, even though the MAP is lower. Clearly, further research is necessary to identify which measures are most important to the image retrieval tasks of real users.

This work has a number of limitations. First, like all test collections, the topics were artificial and may not be realistic or representative of how real users would employ an image retrieval system. Likewise, the annotation of the images may not be representative of how image annotation is done generally or represent best practice. And as with all test collections, the pools generated for relevance assessment only represent images retrieved by the techniques of the participating research groups. As such, there could have been other retrieval techniques that would retrieve other images that may be relevant.

We have a number of future plans, starting with ImageCLEF 2006. Because of the diversity of images and annotations, we plan to keep the same image collection and library structure for ImageCLEF 2006. We will, however, develop new topics. We plan to develop equal numbers of textual, visual, and mixed topics so we can better explore the differences among topic categories. Later on, we will enlarge the collection itself.

Additional future plans include carrying out user experiments on two fronts: one to see how users interact and perform with real systems using this collection, and another to better elicit user information needs to develop even more realistic topics. With these experiments, we will also aim to assess performance measures to determine which are more representative for real tasks. This will be done by assessing which measures are best associated with the information needs of real users in specific searching situations.

We have created a large image retrieval test collection that will enable future research in this area of growing importance to biomedicine. We have also identified some observations that warrant further study to optimize the performance of such systems. The growing prevalence of images used for a variety of biomedical tasks makes imperative the development of better image retrieval systems and an analysis of how they are used by real users. The ImageCLEF test collections, with both system-oriented and user-oriented research around them, will contribute to further advances in this active research area.

Footnotes

  • Instructions for obtaining the data described in this paper can be obtained from the ImageCLEFmed Web site (http://ir.ohsu.edu/image/).

References
