Seeking Health Information Online: Does Wikipedia Matter?
- aFaculty of Medicine, Katholieke Universiteit Leuven, Belgium
- bDepartment of Molecular Microbiology, Washington University, School of Medicine, St Louis, MO, United States
- Correspondence: Michaël R. Laurent, Rooistraat 8, B-3012 Wilsele, Belgium
- Received 4 November 2008
- Accepted 1 April 2009
Objective To determine the significance of the English Wikipedia as a source of online health information.
Design The authors measured Wikipedia's ranking on general Internet search engines by entering keywords from MedlinePlus, NHS Direct Online, and the National Organization of Rare Diseases as queries into search engine optimization software. They assessed whether article quality influenced this ranking. They also tested whether traffic to Wikipedia coincided with epidemiological trends and news of emerging health concerns, and how it compared to traffic to MedlinePlus.
Measurements Cumulative incidence and average position of Wikipedia® compared to other Web sites among the first 20 results on general Internet search engines (Google®, Google UK®, Yahoo®, and MSN®), and page view statistics for selected Wikipedia articles and MedlinePlus pages.
Results Wikipedia ranked among the first ten results in 71–85% of search engines and keywords tested. Wikipedia surpassed MedlinePlus and NHS Direct Online (except for queries from the latter on Google UK), and ranked higher with quality articles. Wikipedia ranked highest for rare diseases, although its incidence in several categories decreased. Page views increased parallel to the occurrence of 20 seasonal disorders and news of three emerging health concerns. Wikipedia articles were viewed more often than MedlinePlus Topic (p = 0.001) but for MedlinePlus Encyclopedia pages, the trend was not significant (p = 0.07–0.10).
Conclusions Based on its search engine ranking and page view statistics, the English Wikipedia is a prominent source of online health information compared to the other online health information providers studied.
This paper evaluates the rate of occurrence of the English edition of Wikipedia, a large online collaborative encyclopedia, among the top results from leading general Internet search engines for health queries derived from three large online health information resources. We compared Wikipedia's occurrence and mean position to other Web sites, and examined which factors influence the Web site's position. We also investigated whether traffic to Wikipedia articles correlated with epidemiological factors, and how page view statistics compared to MedlinePlus, a major governmental online health encyclopedia.
Although the Internet is a copious source of health information, how and how often Internet users look for health information online, and what impact this behavior has on the patient–physician relationship, remain unclear.1 2 3 4 5 6 7 Online health information lookup is more frequent in certain populations, such as men and those with higher levels of education and health literacy, and less frequent in demographic groups such as elderly people.1 8 9 Despite the existence of search engines designed to retrieve information from Web sites that have been assigned quality labels (such as the one assigned by the Health on the Net Foundation), general search engines (of which Google is the market leader in many Western countries) appear to be the most popular starting points for online health information searches.4 5 8 10 11 Importantly, the first page of general search engine results is significantly more likely to be accessed by (inexperienced) health information seekers, with an exponential decline thereafter.10 11 The rank of a Web site among search engine results depends on factors such as the specific search engine algorithm, the number of times the Web site is accessed from the results page (by the demographic group that uses the search engine), and search engine optimization strategies that aim to influence ranking.12
Wikipedia (http://www.wikipedia.org) is an open-access multilingual online encyclopedia that invites contributions from its users. It is operated as a charity by the non-profit Wikimedia Foundation. Wikipedia ranks as the eighth most accessed Web site on the Internet, according to Internet traffic information from Alexa, Inc.13 With now more than 2.5 million articles, the English Wikipedia is the most prominent example of a wiki website. Wikis use a relatively simple editing syntax and a public record of all edits to facilitate collaboration between multiple contributors. While not a specific medical Internet encyclopedia like MedlinePlus14 or NHS Direct Online,15 Wikipedia contains articles on many medical topics.16 In 2006, a small study using eight commercially obtained popular health-related search terms on Google and Yahoo (total of 16 searches) found that user-generated content appeared on the first page of results in 12 cases, ten of which included results from Wikipedia. For the search terms “diabetes” and “bipolar disorder”, search engine statistics revealed that consumers visited Wikipedia in 2.79 and 7.17% of cases, respectively.17
The frequency with which Wikipedia and many other health information resources feature on the first pages of search engine results has yet to be accurately determined. We believe such rankings are important because they provide insights into which Web sites are likely to be visited by online health information seekers. The aim of our study was to determine how often the English Wikipedia appears among the top search engine results for health-related queries (with the average rank as a secondary outcome), and how this compares to the position of governmental, non-governmental and commercial health information Web sites. Furthermore, we evaluated factors involved in Wikipedia's position by determining whether the quality of Wikipedia articles, as rated by its contributors, influenced this position (to test whether articles rated as more developed were ranked higher), and compared how Wikipedia ranked among search engine results when rare diseases were used as keywords, versus keywords containing more common health terms.
In addition, we investigated traffic trends for articles on medical conditions or pathogens that show seasonal variation. Our hypothesis was that if consumers use general search engines to seek health information online, they would commonly be exposed to Wikipedia content. If they would also access these search engine results, then traffic to articles with a season-specific topic would be predicted to increase parallel to endemic occurrence of the condition or its pathogen. In addition, news of a sudden infectious disease outbreak or other emerging health concern should cause a sudden increase in traffic to relevant articles. Finally, we compared page view statistics of MedlinePlus Topic and Encyclopedia pages to page views of the corresponding Wikipedia articles.
Software and Search Engines
We used a search engine optimization tool (Advanced Web Ranking, version 6.2, by Caphyon, Ltd, Craiova, Romania, available from http://www.advancedwebranking.com/index.html) to check the position of the English Wikipedia versus other domains among the first 20 results retrieved from Google (http://www.google.com), Google UK (http://www.google.co.uk), Yahoo (http://www.yahoo.com) and MSN (http://www.msn.com). The software determines the number of times the entered domains were found as the first result or among the first five, ten or twenty results. We entered these absolute cumulative incidences into a spreadsheet application (OpenOffice.org Calc version 2.4.0 by the OpenOffice.org community and Sun Microsystems, Inc, Santa Clara, CA, United States) to calculate relative cumulative incidences. We used the search engine optimization software to determine the position of a given domain among search engine results; we used these data to test whether the mean position was higher for community-rated quality articles on the English Wikipedia (see below, section “Influence of community-rated article quality”).
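The conversion from absolute to relative cumulative incidences described above can be illustrated with a short Python sketch. This is not the spreadsheet the authors used; the function name `cumulative_incidence`, the input format (one best rank per keyword, or `None` when the domain was absent from the first 20 results) and the example ranks are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def cumulative_incidence(
    positions: Sequence[Optional[int]],
    cutoffs: Sequence[int] = (1, 5, 10, 20),
) -> Dict[int, Tuple[int, float]]:
    """Return {cutoff: (absolute count, relative incidence)}.

    `positions` holds one entry per keyword: the best 1-based rank of the
    domain for that keyword, or None if it was not among the results.
    (Hypothetical input format, assumed for this sketch.)
    """
    total = len(positions)
    result: Dict[int, Tuple[int, float]] = {}
    for cutoff in cutoffs:
        # Count keywords for which the domain ranked at or above the cutoff.
        hits = sum(1 for p in positions if p is not None and p <= cutoff)
        result[cutoff] = (hits, hits / total if total else 0.0)
    return result


# Hypothetical ranks for five keywords (not data from the study).
example = cumulative_incidence([1, 3, None, 12, 7])
```

With these made-up ranks, the domain is the first result for one of five keywords and appears in the top ten for three of five, which is the kind of proportion reported in the Results.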
First, we extracted 1726 keywords from the health topic index of MedlinePlus, a health information service from the United States National Library of Medicine and the National Institutes of Health (keywords listed at http://www.nlm.nih.gov/medlineplus/all_healthtopics.html; we used the Aug 12, 2008 update). The keywords included the titles of all MedlinePlus Health Encyclopedia entries, as well as common synonyms, related search terms and abbreviations (for example, “Attention Deficit Hyperactivity Disorder”, “ADHD”, “ADD” and “Hyperactivity” were all included). Some of the keywords (such as “Alaska Native Health”) are specific to the United States in context, and the spelling of keyword terms is in American English. When search terms consisted of exactly the same words (e.g., “Allergy, Food” and “Food Allergy”), we removed one version from the list. A few obviously irrelevant keywords like “Teens page” were removed as well, but no other terms were removed (for example, “Terrorist Attacks” and “Tornadoes” were not rejected as keywords).
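The removal of keywords that consist of exactly the same words can be sketched as follows. The paper does not say how this step was carried out or which version of a duplicate pair was kept, so the helper names `word_key` and `dedupe_keywords`, the tokenization rule, and the choice to keep the first occurrence are assumptions.

```python
import re
from typing import Iterable, List, Tuple


def word_key(keyword: str) -> Tuple[str, ...]:
    """Order-insensitive key: the keyword's lowercase words, sorted."""
    return tuple(sorted(re.findall(r"[a-z0-9&]+", keyword.lower())))


def dedupe_keywords(keywords: Iterable[str]) -> List[str]:
    """Keep the first keyword for each distinct set of words (assumed rule)."""
    seen = set()
    kept: List[str] = []
    for keyword in keywords:
        key = word_key(keyword)
        if key and key not in seen:
            seen.add(key)
            kept.append(keyword)
    return kept


deduped = dedupe_keywords(["Allergy, Food", "Food Allergy", "ADHD"])
```

Abbreviations such as “ADHD” are left untouched by this rule, in line with the authors' decision to retain synonyms and abbreviations as separate keywords.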
We then derived a second set of all 966 keywords in the online alphabetical topic index of NHS Direct Online, a health information service from the British National Health Service (keywords listed at http://www.nhsdirect.nhs.uk/encyclopaedia/a-z/, retrieved on Aug 22, 2008). These keywords were in British English and included terms specifically used in the UK (e.g., “A&E” and “NHS”). Both sets of keywords represent a large number of common medical conditions, symptoms, diagnostic tests, anatomical and physiological terms, treatments, procedures, prevention topics and other medical terms.
Finally, we derived 1,173 keywords from the online alphabetical index of the U.S. National Organization of Rare Diseases (NORD; keywords listed at http://www.rarediseases.org/search/rdblist.html, retrieved on Sept 12, 2008). This list unexpectedly contained keywords referring to conditions that are not rare, such as cataracts, carpal tunnel syndrome, colon cancer, and prostate cancer. We used the software tool to check the position of selected Web sites for the first two sets of keywords between Aug 19 and 23, 2008, and for the third set of keywords on Sept 12 and 13, 2008 (the full list of keywords is listed in Appendix 1, available as an online data supplement at http://www.jamia.org).
Web Sites and Domains
We selected 27 Web sites or groups of Web sites for comparison to Wikipedia based on their ranking on manual searches during the preparation of this paper. This involved using keywords from MedlinePlus as queries on Google and including those content providers that frequently appeared among the first 20 results, as based on manual ranking calculations (data not shown). Some Web sites were included because they belong to notable organizations (e.g., the World Health Organization or the Health on the Net Foundation), even though they frequently ranked lower than some other commercial Web sites that were not included because of a narrow scope (e.g., http://www.drugs.com or http://www.babycenter.com).
The software tool we used did not recognize subdomains (e.g., the results for http://nlm.nih.gov were not shown under http://nih.gov). Accordingly, after selecting the online health information providers, we grouped Internet domains belonging to a single organization so that they would be processed together. We manually searched for additional subdomains to ensure maximal clustering. We created a “U.S. government” cluster which consisted of 123 “.gov” Internet domains (ranging from http://www.4woman.gov to http://www.womenshealth.gov). These domains were identified among Google's first 200 results for “site:*.gov health”, and from an additional listing of all domains of United States National Institutes of Health centers and institutes. We created an “http://about.com” cluster with 96 subdomains (ranging from http://acne.about.com to http://yoga.about.com) based on the list of available subdomains on the provider's Web site. As well as analyzing these Web sites separately, we grouped together commercial Web sites belonging to WebMD (http://www.webmd.com, http://www.emedicine.com, http://www.emedicinehealth.com and http://www.medscape.com). Furthermore, we paired http://www.familydoctor.org with http://www.aafp.org, as both are maintained by the American Academy of Family Physicians. Finally, we clustered three domains related to the British Broadcasting Corporation (BBC), four domains and subdomains related to the Cleveland Clinic and two related to the Mayo Clinic (all domain clusters are listed in Appendix 2, available as an online data supplement at http://www.jamia.org). The software tool automatically allocates the ranking from the highest listed domain to the cluster to which it belongs.
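The clustering rule, in which a cluster inherits the rank of its highest listed member domain, can be sketched in Python. This is an illustration rather than the behavior of Advanced Web Ranking itself: the function names, the exact-hostname matching (which mirrors the need to list every subdomain explicitly) and the example result URLs are assumptions; only the WebMD member domains come from the text.

```python
from typing import Dict, List, Optional, Set
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Hostname of a URL with a leading 'www.' stripped."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def best_cluster_rank(
    result_urls: List[str], clusters: Dict[str, Set[str]], cluster: str
) -> Optional[int]:
    """1-based rank of the first result whose host belongs to the cluster."""
    members = clusters[cluster]
    for rank, url in enumerate(result_urls, start=1):
        # Exact host match: subdomains count only if listed as members.
        if host_of(url) in members:
            return rank
    return None


clusters = {
    "WebMD": {"webmd.com", "emedicine.com", "emedicinehealth.com", "medscape.com"},
    "AAFP": {"familydoctor.org", "aafp.org"},
}
# Hypothetical result list for a single query (paths are made up).
results = [
    "http://en.wikipedia.org/wiki/Asthma",
    "http://www.emedicine.com/asthma",
    "http://www.webmd.com/asthma",
]
webmd_rank = best_cluster_rank(results, clusters, "WebMD")
aafp_rank = best_cluster_rank(results, clusters, "AAFP")
```

Here the WebMD cluster takes rank 2 from its best member, even though another member appears at rank 3, while the AAFP cluster is absent from the list.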
Influence of Community-rated Article Quality
The English Wikipedia allows groups of editors collaborating in a certain area of knowledge (called a WikiProject18 19) to assess the quality of articles in their field. Of the possible quality ratings (which, from lowest to highest rating, are termed Stub, Start, C-class, B-class, Good Article, A-class and Featured Article), only two are applied after a formal review process: Featured Articles (which are meant to exemplify Wikipedia's best work) and Good Articles (which are judged by similar but less stringent criteria). We identified all health-related Featured and Good Articles (hereafter referred to as “quality articles”) via their respective categories and index pages (see http://en.wikipedia.org/wiki/Category:Medicine_articles_by_quality for an overview). For these quality articles, we looked for equivalent MedlinePlus keywords; 49 out of 1726 keywords (2.8%) had corresponding quality articles on the English Wikipedia. Using the search engine optimization tool described above, we tested whether these quality articles were listed more frequently among the first 20 search engine results for MedlinePlus keywords, and whether they had higher mean positions compared to non-quality Wikipedia articles.
Page Views of Season-related Articles and Emerging Health Concerns
We selected ten Wikipedia articles on conditions more common in winter, or on pathogens causing an illness that is more common during winter: frostbite, hypothermia, carbon monoxide poisoning, common cold, pneumonia, bronchiolitis, norovirus, influenza, rhinovirus and seasonal affective disorder. In the same manner, we selected ten articles related to the summer: hyperthermia, sunburn, hay fever, insect bites and stings, bee sting, Lyme disease, Rocky Mountain spotted fever, hemolytic-uremic syndrome, harvest mite and West Nile virus. We retrieved available information on daily page views for these articles from http://stats.grok.se, and compared daily page views between Jan and Jun 2008. To complement this study of seasonal epidemiological influences, we also studied whether emerging health concerns influenced the number of page views of the relevant Wikipedia article. Here, we studied three examples for which the United States Centers for Disease Control and Prevention and/or the Food and Drug Administration issued alerts. The first concerns melamine-contaminated infant formula from China, which first received broad media attention on Sept 12, 2008. The second example involves the Salmonella Saintpaul outbreak, which led several groceries and restaurants to stop offering tomatoes on Jun 9, 2008. The third example involves an intoxication with the protein toxin ricin which was announced on Feb 29, 2008. We therefore retrieved page view statistics for the Wikipedia articles “Melamine”, “Salmonella”, and “Ricin”.
Traffic to MedlinePlus Compared to Wikipedia
We obtained page view statistics for the 20 most visited MedlinePlus Topic and Encyclopedia pages, both for Jan and Jun 2008 (personal communication, Kitendaugh P, Head of Reference and Web Services, Public Services Division of the U.S. National Library of Medicine). We then compared the MedlinePlus page view statistics to those obtained from http://stats.grok.se for the corresponding Wikipedia article (if one existed). To determine the corresponding Wikipedia article, we entered the MedlinePlus term into Wikipedia's search box and used the first applicable result.
Statistical analyses were performed using the GraphPad Prism software (version 5.01, by GraphPad Software, Inc, La Jolla, CA, United States). We defined p < 0.05 as statistically significant. Differences between proportions were determined using contingency tables and the χ² test (or Fisher's exact test for the smaller sample of quality articles). Differences between means were determined using two-sided Student's t tests. To assess whether news of emerging health concerns influenced traffic to Wikipedia articles, we used a one-sample t test to determine whether the mean number of daily page views was different from the highest number during that month. Page views of MedlinePlus and corresponding Wikipedia articles were compared using a paired t test.
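For readers unfamiliar with Fisher's exact test on a 2 × 2 contingency table, the following Python sketch computes a two-sided p value from the hypergeometric distribution. It is not the GraphPad Prism implementation; the function name is hypothetical, and the two-sided rule used here (summing all tables that are no more probable than the observed one) is one common convention and an assumption about what the software does. The example counts are arbitrary and are not data from this study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum every table at least as extreme (no more probable) as the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Arbitrary illustrative table (not study data).
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

For this small table the p value is 400/11440, or about 0.035, which would fall below the 0.05 threshold defined above.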
Incidence and Mean Position of Studied Domains Among Search Engine Results
Table 1 shows the domains with the highest cumulative incidences for queries on Google. Among Google's results for queries from MedlinePlus, the English Wikipedia ranked highest among the first results and the first five results. The United States government cluster tied with the English Wikipedia among the first ten results, and surpassed it among the first 20 results. Google more commonly listed results from the English Wikipedia than from MedlinePlus or NHS Direct Online, even when they were the source of the search terms. When using only rare diseases as keywords, the English Wikipedia outranked all other sites, but its incidence showed only small and inconsistent differences compared to its results for MedlinePlus queries. On queries from NHS Direct Online and NORD, the Medscape cluster had higher incidences than the United States government cluster. Table 2 allows comparison of results for NHS Direct Online and MedlinePlus queries on Google U.K. Except among first results, NHS Direct Online was listed more frequently when its keywords were used. The BBC ranked third on Google U.K. Overall, the English Wikipedia ranked among the first ten results in between 70.8 and 84.7% of cases across search engines and keywords. For more detailed incidence statistics on all Web sites and search engines, see Tables 3–6, available as online data supplements at http://www.jamia.org. For MedlinePlus keywords, the English Wikipedia's mean ranking across search engines was the third position (2.77–3.58 across search engines), which was higher than other domains (see Table 7, available as an online data supplement at http://www.jamia.org).
Influence of Community-rated Article Quality
When the MedlinePlus keywords related to the 49 quality Wikipedia articles were used as queries, the Wikipedia articles were found among the first ten results in all cases on Google and Google UK, and in 47 cases (96%) on Yahoo and MSN. This was significantly more frequent than the top ten and top 20 incidences of Wikipedia for all MedlinePlus keywords (compared to the top 20 incidences of 86% on Yahoo and 85% on MSN, p = 0.04 and p = 0.02, respectively).
Epidemiological Influences on Wikipedia Article Page Views
Figure 1 shows the relative number of page views for Jun compared to the mean number of daily page views in Jan for ten conditions or pathogens that occur more commonly during either winter or summer months. All these articles had significant differences in daily traffic between these two months (t test for all p < 0.0001 except for hypothermia, p = 0.0002). Figure 2 shows the daily page views of three Wikipedia articles related to emerging health threats. These increases could not be attributed to normal variance (one-sample t test of highest value compared to mean over days before incident: all three p < 0.0001).
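The one-sample t test used for the emerging health concerns compares the mean of the baseline daily page views with a fixed reference value, here the peak day. A minimal Python sketch of the test statistic follows; the function name and the baseline numbers are hypothetical, and converting the statistic to a p value would additionally require the t distribution (available in statistical packages), which is omitted here.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def one_sample_t(sample: Sequence[float], reference: float) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for H0: mean(sample) == reference."""
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Standard error of the mean, using the sample standard deviation (n - 1).
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - reference) / standard_error, n - 1


# Hypothetical daily page views before an alert, and a hypothetical peak value.
baseline_views = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, dof = one_sample_t(baseline_views, reference=9)
```

A strongly negative statistic indicates that the baseline mean lies well below the peak, which is the pattern the authors report for all three articles.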
Page Views of MedlinePlus Versus Wikipedia
Wikipedia's articles were viewed more frequently than the corresponding MedlinePlus Topic pages (p ≤ 0.001); there was a non-significant trend towards higher page views for Wikipedia compared to MedlinePlus Encyclopedia pages (p = 0.068 and p = 0.097 for Jan and Jun 2008, respectively). Complete page view statistics are given in Table 8, available as an online data supplement at http://www.jamia.org.
The aim of this paper was to determine the relative position of the English Wikipedia and other Web sites containing health information in a search engine-based approach. The results show that if the first page of results of a general search engine lists ten Web sites, Wikipedia can be found among those results in more than 70% of cases. This confirms preliminary findings by others.17 Wikipedia had a higher average position than any other reference in this study. Our findings on resources other than Wikipedia confirm previous findings using Internet audience measurement services, which did not include Wikipedia's medical content.20
Wikipedia ranked higher with quality articles, although this is not necessarily a causal relationship since these quality articles covered more common health topics, and we have observed that Wikipedia was more prominent among search results for common health terms in some categories. Wikipedia's good results for rare diseases compared to other online health resources also suggest that it has articles on a wide range of conditions. The results pertaining to short- and long-term epidemiological influences on article traffic create a link between search engine results and page viewing. Others have previously observed the relationship between search engine activity and news coverage.21 A study on Google Flu showed that search engine queries related to influenza-like illness correlated with the epidemiological data from the United States Centers for Disease Control and Prevention.22 These findings were replicated for queries submitted to a Swedish medical Web site.23 We believe that these studies support our assumptions that firstly, Internet activity can be used as a surrogate marker for consumer behavior, and secondly, that online health information seekers often use search engines to find individual health Web sites, which underscores the importance of Wikipedia as a prominent source of information in such searches.
Our study has several strengths. The software tool we used allowed us to check a large set of keywords on multiple search engines while avoiding observer bias. The use of a broad set of keywords from governmental online health information initiatives was important to avoid selection bias, and additionally it might make these data useful from a policy-making point of view; we provide data that might be relevant to the search engine position of the government-sponsored health information Web sites MedlinePlus and NHS Direct Online.
However, our study design does have limitations. We did not perform a weighted analysis based on often-used health-related keywords (such as “Diabetes”);3 17 so in our study, each keyword was given equal importance. Some keywords were listed together with their abbreviations, which results in multiple counting. Nevertheless, this could mimic how people use search engines, with some people using an abbreviation while others might know the full term. Indeed, consumers appear to differ widely in the queries they use to find specific information,24 which also limits the generalizability of our search terms. Because we wanted to avoid selection biases, we also retained keywords that returned several non-medical Web sites as search engine results (for words like “Walkers” or abbreviations like “CFS” for chronic fatigue syndrome), which favors Wikipedia because it contains more than just medical information. Unexpectedly, the conditions listed by the NORD contained some fairly common disorders (see Methods); we did not remove these, again to avoid selection bias. We made a personal selection of commercial, non-profit and governmental Web sites based on manual searches on Google for comparison to Wikipedia, but our list is by no means exhaustive and has the serious drawback of possible selection bias. However, the software allows storage of the data set and post hoc analysis for additional domains. We also created several clusters to pool the impact of a single content provider who might use multiple domains; we cannot completely exclude that some less prominent United States government Web sites were not listed and might not have been counted, although we believe that any such Web sites were unlikely to have a major impact on the results. It should also be noted that we did not study sponsored search engine results, which might influence consumers. 
We have tried to make a simple dichotomy (quality vs. non-quality articles) based on Wikipedia's community article rating system, but we emphasize that this is a system that has not yet been externally validated as a true measure of quality (compared to expert review). The examples we have provided of real-life epidemiological changes correlating with page views of the relevant Wikipedia article are illustrative, although this remains indirect evidence. There may be other seasonal disorders or pathogens that do not follow this pattern, and disease outbreaks that do not result in increased article traffic. The latter examples may be confounded by Wikipedia's role as a source of news, as disease outbreaks straddle the border between health information and news.
With regard to the generalizability of these results, it should again be stressed that not all online health information seekers are patients, and that not all patients seek health information online. Obviously, this study says little about consumers with a native language different from English25 or using search engines popular in other countries (like http://Baidu.com in the People's Republic of China and http://Guruji.com in India). In this respect, the results from British versus American English keywords are not mutually exchangeable. Furthermore, this Internet study provides no evidence on the level of trust that patients assign to the health information they read on Wikipedia. However, a recent survey indicated a shift of priorities for both non-professional and professional Internet users from trustworthiness and accuracy of information to availability and ease of finding information.5 Finally, differences could exist between medical specialties with regard to the importance of both general health information Web sites and Web sites devoted to a specific topic.
Although several medical scientists and policy makers have highlighted the potential use of wikis to foster collaboration on easily accessible health information for the community,26 27 28 29 30 and Wikipedia is the most prominent example of a wiki, we could identify no previous research specifically focusing on Wikipedia as a source of health information for consumers. Thus, it appears that Wikipedia provides an important area for future research on sources of online health information. However, we also found misconceptions about Wikipedia in the scientific literature: for example, a recent study examining search engine results for obstetric queries misclassified it as a commercial instead of a non-profit Web site.31 Importantly, the open editing policy offers no way of assessing the expertise of contributors, resulting in fears of inaccuracies. This may be one reason why doctors are creating wikis where only they can contribute (such as http://Ganfyd.org, http://RadiologyWiki.org or http://WikiSurgery.com).32 33 34 Instead of creating new wikis, Wikipedia itself could be used by doctors, as well as patient groups and associations, to collaboratively edit articles on the topics they value.17 35 As we have shown here, these articles are among the top results on general search engines, thus providing a free platform to disseminate information globally. Until now, doctors have lagged behind biomedical scientists in realizing the potential of wikis.18 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 However, in Feb 2009, Medpedia, Inc, in collaboration with several prominent medical faculties, the NHS, the American College of Physicians and other partners, launched its open access medical encyclopedia running on the same software and under the same license as Wikipedia. It will have content for the public as well as for experts, and allow for discussion of the subject. 
Contrary to Wikipedia where anyone can contribute regardless of qualifications, only experts are allowed to contribute to Medpedia (although others may suggest changes), which might alleviate quality concerns. We believe this wiki may address some of the concerns that discourage the medical community from contributing to Wikipedia.
Although this Internet study showed that Internet consumers are likely to be exposed to Wikipedia through search engine results for health-related keywords, examining the quality of health information present in Wikipedia was beyond the scope of this article. Thus, further studies are urgently needed to determine whether Wikipedia articles are of sufficient quality to support patient-provider communication. Assessment of the quality of Wikipedia's freely editable content is difficult, since its articles are inherently in a constant state of flux. Examples of flagrant mistakes have been reported in the media, as well as an analysis that found that Wikipedia contains a similar number of mistakes compared to Encyclopaedia Britannica.52 Clauson et al (2008) compared drug information from Wikipedia to the Medscape Drug Reference, and concluded that although Wikipedia was less complete (especially regarding dosing information, which is explicitly discouraged by Wikipedia guidelines), no factual errors were found, and it “may be a useful point of engagement for consumers” for supplemental drug information.53 Although articles in the English Wikipedia are increasingly being referenced with articles from leading scientific journals,16 54 and articles have been shown to improve over time,53 Wikipedia itself makes no claim to correctness, and the medical disclaimer aptly describes the situation: “Wikipedia contains articles on many medical topics; however, no warranty whatsoever is made that any of the articles are accurate.” However, while consumers appear to rarely check the source and quality of the information they find online,4 9 at least with a well-known brand like Wikipedia, they know that they should remain skeptical. 
Indeed, while Wikipedia contributors have classified over 14,000 of their articles as dealing with medical topics, only around 50 of them have been confirmed as top quality (“Featured articles”).55 This implies that Wikipedia is still a long way from achieving the idea of its founder Jimmy Wales, who imagined a world where every human being had free access to the sum of all human knowledge in his or her language.56 Maybe doctors, like researchers, “should read Wikipedia cautiously and amend it enthusiastically”,57 thus fulfilling the proposed new professional obligation of making their knowledge and expertise freely available on the Internet.58
Our study shows that Wikipedia is a prominent health information Web site based on its position among search engine results for health-related queries. Despite several calls to adopt the principles that underlie its success, there is virtually no research on Wikipedia's role as a source of health information. Observational studies in different settings are needed to document how often consumers seek health information online, and studies in a clinical setting are needed to estimate the impact of this behavior on the patient–physician relationship.
The authors thank S.C. Grover, MD, FRCPC and J.F. De Wolff, MD, MRCP for critically reviewing our manuscript.
This manuscript has not been submitted to any institutional review boards and has not received any funding or support.
The authors contribute as volunteers to the English Wikipedia (where both are administrators) and other Web sites of the Wikimedia Foundation. M. Laurent is a member of the WikiProject Medicine, and T. Vickers is Director of the Molecular and Cellular Biology WikiProject.