Metrics associated with NIH funding: a high-level view
- 1SciTech Strategies, Albuquerque, New Mexico, USA
- 2National Institutes of Health, Office of Research Information Systems (OD OER ORIS), Research Triangle Park, North Carolina, USA
- Correspondence to Dr Kevin W Boyack, SciTech Strategies, 8421 Manuel Cia Pl. NE, Albuquerque, NM 87122, USA;
- Received 24 February 2011
- Accepted 2 April 2011
- Published Online First 27 April 2011
Objective To introduce the availability of grant-to-article linkage data associated with National Institutes of Health (NIH) grants and to perform a high-level analysis of the publication outputs and impacts associated with those grants.
Design Articles were linked to the grants they acknowledge by parsing and matching the grant acknowledgment strings in PubMed, using the process embodied in the NIH Scientific Publication Information Retrieval & Evaluation System (SPIRES). Additional data from PubMed and citation counts from Scopus were added to the linkage data. The data comprise 2 572 576 records from 1980 to 2009.
Results The data show that synergies between NIH institutes are increasing over time; 29% of current articles acknowledge grants from multiple institutes. The median time lag to publication for a new grant is 3 years. Each grant contributes to approximately 1.7 articles per year, averaged over all grant types. Articles acknowledging US Public Health Service (PHS, which includes NIH) funding are cited twice as much as US-authored articles acknowledging no funding source. Articles acknowledging both PHS funding and a non-US government funding source receive on average 40% more citations than those acknowledging PHS funding sources alone.
Conclusion The US PHS is effective at funding research with a higher-than-average impact. The data are amenable to further and much more detailed analysis.
- Bibliometric analysis
- research evaluation
- citation impact
- citation analysis
- input–output studies
- text mining
- science mapping
- machine learning
Although science-policy studies have been conducted for decades, interest in such studies is currently on the rise in the USA, as well as in other countries. This is evidenced by the number of recent workshops highlighting ‘science of science policy’ as well as the establishment and funding of a Science of Science and Innovation Policy program at the National Science Foundation (NSF). Despite the long historical interest in science policy, quantitative input–output studies establishing the impact of programs at different agencies and institutes have been very difficult because data explicitly linking articles with the grants that funded them have been lacking. One place where such data do exist is the PubMed database, which has been indexing grant numbers for US Public Health Service (PHS) grants since at least 1981.
The fact that PubMed records contain grant acknowledgements does not, however, mean that they have been systematically used for research evaluation. In fact, the opposite is true. Although these data exist, they were not systematically mined, standardized, and placed in a publicly available database for bulk download until recently. The National Institutes of Health (NIH) Research Portfolio Online Reporting Tool Expenditures and Results (RePORTER) website was made available in mid-2009. This site allows one to query a wide variety of NIH funding and publication data. The ExPORTER extension (http://projectreporter.nih.gov/exporter/ExPORTER_Catalog.aspx), which makes selected data available for bulk download, was introduced several months later. Data have been added to the ExPORTER site at intervals over the past year; data back to 1985 are currently available, including tables for projects, links (articles to projects), articles, and patents.
Prior to the introduction of RePORTER and ExPORTER, it was possible to obtain funding and publication data for input–output studies. Grant data were available from CRISP and RaDiUS (Computer Retrieval of Information on Scientific Projects and Research and Development in the United States databases, now both defunct), and publication and grant acknowledgment data were available from PubMed. However, significant effort was required to obtain, clean, and link such data, which limited the number and scope of input–output studies.1
This paper reports work on two separate tasks. First, we report on work carried out at NIH to generate and provide a clean, standardized source of article-to-grant linkage data from grant acknowledgments in PubMed. Second, we have linked those data to records from Scopus to add citation counts, and have used the combined data for a first set of high-level analyses to show distributions and metrics related to NIH grants. The balance of this paper proceeds as follows. First, relevant literature on linking of grant and article data will be briefly reviewed. The NIH process to link grant and article data will then be described. A high-level characterization of the linked data and various metrics associated with those data are then presented. The paper concludes with a discussion of benefits, limitations, and suggestions for future work.
Perhaps the most comprehensive input–output studies were carried out in the 1980s by Computer Horizons Inc. For example, McAllister and colleagues studied the relationship between R&D expenditures and publication outputs for US colleges and universities2 and US medical schools3 on a large scale using aggregated funding amounts, and publication and citation counts. Bourke and Butler4 reported on the efficacy of different modes of funding research in biological sciences in Australia. Their work aggregated funding to the sector level and concluded that impact was correlated with researcher status. Butler5 followed this work up with a study of funding acknowledgment, finding that, although acknowledgment data on the whole accurately reflected the total research output of a funding body, there was no ability to track research back to the grant level. This inability to track research back to an individual grant precludes analyses of research vitality at the finest levels. Additional studies using aggregated results are also available in the literature.6–8
Far fewer studies are available in which actual linking of grant data to individual articles has been reported. Computer Horizons Inc. mined and maintained funding data from the acknowledgments in journal articles, and used them for a variety of studies for the US NIH in the 1980s.9 However, neither their grant–article linkage data nor their reports to NIH are readily available. Lewison and colleagues10–12 used citation data from the Science Citation Indexes and acknowledgment data from the UK Research Outputs Database to study national-level impacts in various biomedical fields. Although they mention looking up articles and extracting funding information, no grant–article level analysis is reported. Boyack and colleagues linked grants to individual articles through common author/PI and institution using data supplied by the National Institute on Aging,13 and showed that citation impact increased with grant size.14 They also showed funding profiles for NIH and NSF on a map of science,15 using grant data from 1999 and article data from 2002 linked through author/PI and institution. Recently, Zhao16 examined a set of 266 articles in library and information science journals, and found that articles acknowledging grant funding were cited over 40% more on average than those without a grant acknowledgment. Lyubarova et al17 investigated the impact of NIH funding on cardiovascular research and found that the mean journal impact factor of NIH-funded research was significantly higher than that of non-NIH-funded research for most article types.
NIH has recently replaced its CRISP database with the Research Portfolio Online Reporting Tools Expenditures and Results (RePORTER) tool. Much of the data in RePORTER come from the Scientific Publication Information Retrieval & Evaluation System (SPIRES), which was begun by one of us (Jordan) while at the National Institute of Environmental Health Sciences (NIEHS) in 2001. The initial version of the SPIRES database, completed in spring 2002, was limited to data for three NIH institutes—NIEHS, National Institute of Child Health and Human Development, and National Institute on Aging—and linked articles acknowledging grants from these three institutes back to 1995. The initial system was based on parsing grant numbers from PubMed records, and then matching those grant numbers to a table of known grant numbers. The initial version of SPIRES was updated monthly, with grant numbers being parsed from PubMed records and matched to tables of grant numbers using SQL.
Prior to the creation of SPIRES, multiple attempts had been made to create accurate linkages between publications and NIH grant numbers. These had all been based on matching of author names. These efforts failed primarily due to the large number of similar names and the fact that in earlier years, PubMed stored author names in the format of last name plus first initial.
In 2005, SPIRES became part of the NIH Electronic Research Administration project, with initial funding provided by NIEHS to expand the database to include all NIH institutes. Over time, the process was upgraded to cover all NIH publications in PubMed from 1980 to the present, to update the database daily by adding new publications and updating revised publication records, to automate the database maintenance process, to include a new, improved grant number parser, and to provide a scoring system for rating the quality of the grant number-to-publication matches. In 2008, publication data from the NIH Manuscript System and the NIH Intramural database were added. Throughout its history, the SPIRES system has been available only within NIH.
SPIRES parsing and matching process
SPIRES uses the following steps to explicitly link articles to grants through matching of grant acknowledgment strings to known grant numbers:
All publication records in PubMed since 1980 that contain either the full NIH institute abbreviation (eg, NCI for the National Cancer Institute, NHLBI for the National Heart, Lung and Blood Institute, etc) or a two-letter PHS organization code (eg, CA for NCI, HL for NHLBI, etc) are downloaded in bulk from PubMed. In PubMed XML, these values are found in the <Grant><Agency> and <Grant><Acronym> fields, respectively.
Various data elements, including PubMed IDs (PMID), grant agency, acronym, and number (field <Grant><GrantID>), are extracted and loaded into a relational database.
The values extracted from <GrantID> are processed through a parsing engine that attempts to decode the string by finding three valid NIH project number components—a two-letter organization code (eg, CA, HL, etc), the research activity code (eg, R01, P30), and the five- or six-digit serial number. This is not trivial because the <GrantID> values appear in a variety of formats (see table 1). Grant-type prefixes (eg, 1, 3, 5) and suffixes (support years) are ignored, since they are seldom included in grant acknowledgement strings.
SPIRES then matches what is referred to as the core project number against a list of known project numbers dating from 1972 to the present. The core project number consists of the three components mentioned above: the activity code, two-letter PHS organization code, and grant serial number. Although this list of core project numbers is maintained in the SPIRES system, a similar list can be constructed from grant data in the RePORTER system.
As matches are made, the match quality is assessed using a match case scoring system. This scoring system is based on two conditions: the particular components of the NIH project number that could be unambiguously matched, and the number of core project numbers that are either fully or partially matched. Full unambiguous matches can be obtained for grant strings that contain all three components; these are given a score of 5, as shown in table 2. For cases where the activity code is not available (which is a large fraction of cases, as shown in table 1) only the organization code and serial number can be matched. In these cases, the activity code is inferred from the list of known project numbers; a score of 4 is assigned if only one possible project number can be found.
Matches (PMID and core project numbers) are placed in a table along with the match case scores. Matches with scores of 4 and 5 have been made publicly available in RePORTER linkage tables; matches with scores of 3 or less have not been included in the public files.
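The parsing, matching, and scoring steps above can be sketched as follows. This is a minimal illustration, not the SPIRES implementation: the regular expression, the sample table of known core project numbers, and the function names are all assumptions.

```python
import re

# Hypothetical table of known core project numbers (activity code, org code,
# serial number); SPIRES maintains its own list, and a similar one can be
# built from RePORTER grant data. These three entries are illustrative.
KNOWN_PROJECTS = {
    ("R01", "CA", "123456"),
    ("T32", "CA", "123456"),   # a serial number recycled across activity codes
    ("R01", "HL", "054321"),
}

# Optional grant-type prefix (1, 3, 5), optional activity code, two-letter
# organization code, and a five- or six-digit serial number; support-year
# suffixes are simply not captured.
GRANT_RE = re.compile(r"(?:\d\s*)?([A-Z]\d{2})?\s*-?\s*([A-Z]{2})\s*-?\s*(\d{5,6})")

def parse_grant_string(raw):
    """Decode a <GrantID> string into (activity, org, serial), or None."""
    m = GRANT_RE.search(raw.upper())
    if m is None:
        return None
    activity, org, serial = m.groups()
    return activity, org, serial.zfill(6)

def match_and_score(raw):
    """Return (candidate core project numbers, match case score)."""
    parsed = parse_grant_string(raw)
    if parsed is None:
        return [], 1                      # not enough data for any match
    activity, org, serial = parsed
    if activity:                          # all three components present
        if (activity, org, serial) in KNOWN_PROJECTS:
            return [(activity, org, serial)], 5
        return [], 2
    # Activity code missing: infer it from the known project numbers.
    candidates = [p for p in KNOWN_PROJECTS if p[1:] == (org, serial)]
    if len(candidates) == 1:
        return candidates, 4              # unambiguous inference
    if len(candidates) > 1:
        return candidates, 3              # ambiguous: recycled serial number
    return [], 2
```

A fully specified string such as '5 R01 CA 123456' would score 5, while 'CA123456' alone would score 3 here because the serial number was recycled between an R01 and a T32.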
Match case scores of 4 or 5 represent extremely accurate matches that almost invariably are matched to the correct core grant number. Match scores of 1 or 2 indicate cases where not enough data were provided to result in any match. Combined, records with scores of 1 or 2 represent less than 1% of the grant acknowledgment strings, while records with scores of 4 or 5 represent approximately 74% of the data. Records with a match score of 3 are more problematic. A match score of 3 indicates that based on the decoded NIH organization code and serial number, more than one grant number match was found. This is possible because for many years, NIH would recycle a series of serial numbers across multiple activity codes. Therefore, cases occur where there is a match against a T32 training grant and an R01 research project grant. In most of these cases, only one of these can be correct, and it is usually not possible from these limited data (without PI names, project titles, etc) to know which of the two is correct. However, there are other situations where match case 3 records are subsequently upgraded to a score of 4 based on known pairings of activity codes. For example, a pair of case 3 matches involving the activity codes R01 and R37 invariably represents a single research project because the activity code was changed at some point in the history of the project. The SPIRES matching engine now contains a number of these ‘score upgrade rules’ based on known pairings of activity codes, which has resulted in an overall reduction in the number of case 3 matches. Currently, case 3 matches represent 26% of the total SPIRES data.
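The score upgrade rules can be sketched as a post-processing pass over ambiguous case 3 matches. Only the R01/R37 pairing is taken from the text; the rule table and function name here are illustrative assumptions, and the real SPIRES table contains more pairings.

```python
# Hypothetical rule table: pairs of activity codes known to denote a single
# project whose activity code changed over its history (eg, R01 to R37).
UPGRADE_PAIRS = {frozenset({"R01", "R37"})}

def apply_upgrade_rules(candidates, score):
    """Upgrade an ambiguous case 3 match to case 4 when its candidate
    activity codes form a known pairing denoting a single project."""
    if score != 3:
        return candidates, score
    activities = frozenset(c[0] for c in candidates)
    if activities in UPGRADE_PAIRS:
        return candidates, 4
    return candidates, score
```

A case 3 match between an R01 and a T32 would be left at score 3 by this pass, since those genuinely represent two different projects.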
Linking citation data
The final data source used was Scopus, a citation database produced by Elsevier. Scopus is available at the many institutions worldwide that license use of these data. Scopus purports to index all Medline records (and presumably most PubMed records). We matched Scopus records to PubMed records at the article level using a variety of information (eg, journal, volume, page, publication year, author, title), thus creating a map of PMIDs to Scopus article IDs. These were used to link citation counts to PubMed articles where available.
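This record-level matching amounts to a composite-key join. The sketch below is an assumption for illustration: the field names are hypothetical, and the actual matching used several combinations of journal, volume, page, year, author, and title rather than one fixed key.

```python
def match_key(rec):
    """Composite bibliographic key; the fields and normalization chosen
    here are illustrative assumptions."""
    return (rec["journal"].strip().lower(), rec["volume"],
            rec["first_page"], rec["year"])

def link_citations(pubmed_recs, scopus_recs):
    """Map PMID -> citation count via unambiguous one-to-one key matches."""
    scopus_by_key = {}
    for s in scopus_recs:
        scopus_by_key.setdefault(match_key(s), []).append(s)
    linked = {}
    for p in pubmed_recs:
        hits = scopus_by_key.get(match_key(p), [])
        if len(hits) == 1:                # keep only unambiguous matches
            linked[p["pmid"]] = hits[0]["citations"]
    return linked
```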
Results and analysis
As mentioned above, the ultimate output from the SPIRES linking process is a table of PMID-to-project number matches (with ancillary data such as grant start and end dates) that can be the seed for a variety of additional analyses. Full Medline records for each linked article were also available, and data from those records were used for some of the analysis.
We did not limit our analysis to the match case 4 and 5 records that are publicly available, but included the match case 3 records as well for completeness. We did, however, limit the data in other ways. First, records were limited to those matching NIH grants (as opposed to grants from other PHS agencies such as CDC). Second, records with activity codes starting with N or Z were removed to focus on extramural research grants. Third, although the SPIRES matching process uses a list of project numbers that were assigned both to funded projects and to applications that were never funded, we limited this analysis to those matching project numbers where funding was actually awarded. Fourth, records were limited to projects and articles between the years of 1980 and 2009.
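The four restrictions above can be expressed as a single record filter; a sketch, with hypothetical field names, assuming the set of awarded core project numbers is available:

```python
def keep_record(rec, awarded_projects):
    """Apply the four data restrictions described above (field names assumed)."""
    if rec["agency"] != "NIH":                       # 1: NIH grants only, not other PHS
        return False
    if rec["activity_code"][:1] in ("N", "Z"):       # 2: drop N*/Z* activity codes
        return False
    if rec["core_project"] not in awarded_projects:  # 3: funded projects only
        return False
    return 1980 <= rec["article_year"] <= 2009       # 4: study time window
```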
One additional processing step was made to further merge records with a match case score of 3. Although, as mentioned above, NIH at times recycled serial numbers across multiple activity codes, this recycling did not occur within Major Activity Code (indicated by the first letter of the activity code). For example, serial numbers might have been recycled between R type and T type grants, but not within R type grants. We thus deduplicated the set of match records such that each record had a unique combination of ‘PMID+activity code first letter+organization code+serial number.’
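This deduplication amounts to collapsing match records on the composite key just described; a sketch, assuming each match record is a (PMID, (activity, org, serial)) tuple:

```python
def dedup_key(match):
    """'PMID + activity code first letter + organization code + serial number'."""
    pmid, (activity, org, serial) = match
    return (pmid, activity[0], org, serial)

def deduplicate(matches):
    """Keep the first match record for each composite key."""
    seen, out = set(), []
    for m in matches:
        k = dedup_key(m)
        if k not in seen:
            seen.add(k)
            out.append(m)
    return out
```

Under this key, R01 and R37 matches for the same PMID, organization code, and serial number collapse to one record, while an R-type and a T-type match on a recycled serial number remain separate.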
Overall results are shown in table 3, along with numbers of unique matched grants and unique matched articles associated with each NIH institute. Note, however, that many articles acknowledge multiple grants, both within and across institutes, and many grants are acknowledged by multiple articles. Thus, we report the number of matches per grant, and say that each grant on average ‘contributes to’ a certain number of articles, rather than that it ‘produces’ a certain number of articles. Note also that the sum of the number of unique articles for the individual institutes (1 779 893) is higher than the overall number of unique articles (1 386 410) because some articles are associated with multiple institutes. The same is true of the numbers of multi-institute articles.
As shown in table 3, each grant acknowledged in this dataset has contributed on average to 12.7 articles. Values vary widely by NIH institute, from a low of 4.2 for National Institute of Nursing Research to a high of 33.4 for National Center for Research Resources (NCRR). Values should not be strictly compared at the institute level because the different institutes each have very different grant portfolios consisting of different distributions of grant types (ie, research, equipment, training, etc), different topic mixes, different dollar amounts, and different goals (ie, research, clinical, public health, etc).
The final column of table 3 shows the percentage of articles that reference grants from multiple institutes. For articles citing NCRR grants, 67% also cite grants from other NIH institutes. This is not surprising in that NCRR grants include equipment and infrastructure grants along with regional research facilities, and thus are logically overlapped by research grants that use those facilities. By way of comparison, among the institutes with at least 4000 grants, 25% of the articles citing National Eye Institute (NEI) grants also cite grants from other institutes. In general there is a high degree of interlinkage between the NIH institutes; the average fraction of multi-institute articles across the NIH institutes is 41%. However, if one considers the NIH as a whole, and deduplicates across institutes, the fraction of total articles that reference grants from multiple institutes is only 24% (see the row labeled ‘Unique’ in table 3).
Table 4 shows statistics from the same data broken down by article publication year. The fraction of multi-institute articles has risen over time; the number was relatively steady at around 22% from 1980 through 2000, and has increased steadily since then to a rate of 28.5% in 2009. This increase may simply reflect the dramatic increase in NIH funding over that time period spilling over into increased overlap or synergy between grants. The number of matches per article (or unique grants acknowledged per article) has also risen slightly since 2000, from around 1.7 to 2.1. The changes in matches per article and fraction of multi-institute articles have mirrored each other closely over the past 30 years.
Table 5 gives statistics by initial grant year, and gives rise to a number of interesting observations. Articles referencing multiple grants are counted for each grant they reference, and are counted for the first year the grant was active. For example, an article published in 2000 that references a grant that was active from 1991 to 1998 appears in the 1991 numbers. Table 5 shows that the average grant durations in years and the average numbers of papers published per grant have been decreasing slightly over time. These decreases are undoubtedly due in part to the fact that many grants that started years ago are still active; a future look at these same quantities that includes the next several years' data will show larger values. Nevertheless, the decreases cannot be explained solely by active grants. There appears to be a slight trend toward shorter grant durations. In addition, the data from 1980 to 1989 show that the average number of articles contributed to by each grant (for those grants that produced articles) was between 14 and 15. These values are larger than the 12.7 that was shown in table 3. However, the 12.7 value was based on a combination of older and newer grants, and is thus a low estimate.
Several quantities from table 5 have been plotted in figure 1. Grant durations and articles per grant have already been mentioned above. The fraction of grants with articles (figure 1 upper right) acknowledging them was at or above 70% from 1980 to 1985, dipped to around 60% for the years 1986–1995, with a low of 58% in 1990, and was then once again relatively constant at over 70% from 1996 to 2003. This suggests that roughly 30% of awarded grants never produce an article. We suspect that there are many valid reasons for this (possible factors may include small exploratory grants, training grants, lack of coverage in PubMed, etc) but have not looked into this question any further. The rapid decrease in the fraction of grants with publications from 2003 to the present is an artefact of the time lag between the time a grant starts and the first date at which results from that grant are published. The recent tail of this curve suggests that roughly two-thirds of those grants that will end up with publications will have published something within the first 3 years, and that around 95% of such grants will have produced a publication within 5 years. This is consistent with a recent analysis by Börner et al8 showing that around 65% of grants produce an article within 3 years.
Börner et al8 also state that ‘it is generally accepted that publications from earlier years were less likely to cite their grant support.’ This could explain the dip in the grants-with-articles curve around 1990. If that were the explanation, however, figure 1 suggests the opposite of a long-term historical trend: articles written before 1990 appear more, not less, likely to cite grant support, and the dip around 1990 would instead reflect a temporary laxness in acknowledgment during that period.
The number of articles published per grant per year (figure 1 lower left) has been decreasing slowly but steadily since the late 1980s. The value decreased monotonically from 1988 to 1997 and was followed by a flat period of about 4 years during which each grant produced roughly 1.7 articles per year. The value decreases again with each initial grant year after 2001, suggesting that long-duration grants reach a publication steady state at around 8 years. Although we have not broken these data down by publication year and grant type, the data would support such an analysis, which would tell us much more about publication time lags. We leave this work to a future study.
The final quantity shown in figure 1 is the average number of citations per article as a function of initial grant year. Citation counts were added to each record by finding the article in Scopus, as mentioned earlier. Citation counts were based on the 1996–2008 Scopus annual files, and are thus current as of the end of 2008. This quantity is relatively constant through the 1980s at around 35 citations per article, rising to a peak of 39 citations per article in 1993, and then dropping off almost linearly for grant years from 1997 to 2008. This drop does not indicate that the articles from more recent grants will be less cited than those citing earlier grants, but rather reflects the fact that newer grants are accompanied by newer articles that have had less time to be cited.
Principal investigator–author order
We are also interested in principal investigators (PIs) and where they appear as authors of articles. The author orders of the PIs for each of the grant–article matches for papers published since 1996 were found in the PubMed data, and the percentages are shown in figure 2. Over the entire time period, it can be clearly seen that if the PI was listed as an author on the article, it was more often as the last author (38.6%) than as the first (10.7%) or a middle (18.0%) author. This correlates well with the historical convention in biomedical publication for the leading author to be listed last. The largest shift in authorship over the time period was a decrease in first authorship (from 12% to 9%) and a corresponding increase in middle authorship (from 15% to 20%).
Figure 2 also shows that the PI was not listed as an author on 27% of the matched articles. This is not surprising in that many grants are certainly large enough that not all work is directly overseen or published by the PI. In this analysis, only overall PIs were considered. If component PIs had been considered, the fraction of articles not authored by a PI would undoubtedly have been lower. This fact also shows why any attempted matching of grants to articles based on matching of authors to PIs cannot be fully successful. Note also that combinations of author orders (eg, first+last) are shown in figure 2. These reflect cases where multiple people were PIs on the same grant at different times, and more than one of those PIs co-authored the article. Since we do not know the exact time lag between publication and when the particular work reported in the article was funded, we have not limited each grant to a single PI, but report all possible combinations.
Research support type and impact
In addition to the grant string information, PubMed contains more general information about research support. The article-type field, in addition to listing article types, lists the following funding-related tags:
Research Support, NIH, Extramural
Research Support, NIH, Intramural
Research Support, US Gov't, PHS
Research Support, US Gov't, Non-PHS
Research Support, Non-US Gov't
Most other Research Support tag variations are typographical errors that can be mapped back to these five tags. Since NIH is the largest grant-awarding arm within the US Department of Health and Human Services (and thus the US PHS), we have combined the first three of these tags (‘NIH, Extramural,’ ‘NIH, Intramural,’ and ‘US Gov't, PHS’) into a single group denoting funding from the US PHS, listed as (P) in the following discussion. US Gov't, Non-PHS (O) includes other US funding agencies such as the NSF, DOE, NASA, etc. Non-US Gov't (N) includes government funding from other countries as well as industry, non-profits, foundations, and university endowments from around the world. Some articles acknowledge multiple funding types and thus have multiple tags.
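Collapsing an article's tags into the P/O/N groups can be sketched as follows. The tag strings are written as they appear in the text above (PubMed's canonical forms differ slightly in punctuation), and the function name is illustrative.

```python
# Map the five normalized Research Support tags to the three groups used
# in the analysis.
TAG_GROUP = {
    "Research Support, NIH, Extramural": "P",
    "Research Support, NIH, Intramural": "P",
    "Research Support, US Gov't, PHS": "P",
    "Research Support, US Gov't, Non-PHS": "O",
    "Research Support, Non-US Gov't": "N",
}

def funding_category(tags):
    """Collapse an article's tags into a label such as 'P', 'P+N', or 'none'."""
    groups = sorted({TAG_GROUP[t] for t in tags if t in TAG_GROUP},
                    key="PON".index)
    return "+".join(groups) if groups else "none"
```

An article tagged with both NIH extramural support and a non-US government source would fall into the P+N category.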
Figure 3 shows the fractional distribution of 2 167 393 PubMed articles by funding type and publication year. This analysis has been limited to those articles that can be identified as having a US address. This includes articles with US first-author addresses in PubMed and those additional articles to which NIH grant strings have been matched. The latter criterion assumes that NIH funding goes primarily to US institutions and addresses. In addition to the numbers listed in table 5, there are 5 229 721 PubMed articles over the same time period that reference neither a US first-author address nor a matched NIH grant string. We assume that these articles represent contributions from non-US countries. Of these, 1 740 170 articles (33%) have a Research Support, Non-US Gov't funding tag, while the remaining 3 489 551 articles have no acknowledgment of a funding source.
Figure 3 shows that the fraction of US articles with no funding acknowledgment has decreased from 47% to about 41% over the past decade. The fraction of articles funded exclusively by the US PHS is currently around 18%, while another 17% are joint funded by the US PHS and non-US government sources (P+N). The total contribution of the US PHS to the US biomedical publication output, summed over the four types (P, P+O, P+N, P+O+N), is currently just over 41%. The fraction of articles funded exclusively by non-US government sources is 10% of the total. Other funding sources and combinations of funding sources each account for less than 4% of the total.
It is difficult to know if the grants indexed in PubMed comprise the majority of the actual grant-to-article relationships or not. Figure 3 suggests that around 45% of the US articles indexed in PubMed since 1996 have no acknowledgment of funding. Lewison10 reported that 46% of nearly 13 000 UK gastroenterology papers had no acknowledged funding source, but that 85% of those were from UK National Health Service hospitals, and thus had an implied funding source by association. Further, Lewison, Dawson, and Anderson18 found that while 39% of papers in the UK Research Outputs Database did not contain acknowledgments of funding, 7/8 of those could not be expected to have them. Lewison's studies thus suggest that a fraction of 40–45% of articles without acknowledgments is reasonable and does not suggest that authors are ignoring the proper acknowledgment of grant funding from public sources.
By contrast, Cronin and Franks19 examined over 1000 articles from the journal Cell and found that over 90% of them had financial acknowledgments. We note that of the 840 942 articles associated with NIH or PHS funding types in figure 3, over 85% of them are in the grant–article match list. This leaves around 15% of the articles noted to have received NIH or PHS funding, but for which the actual grant information was not indexed; these could be considered as false negatives. Taken in total, these studies suggest that biomedical researchers do, for the most part, acknowledge government funding in a consistent and representative, but not totally complete, manner.
Combining the funding information tags and grant information strings from PubMed records with citation counts from Scopus allows us to examine the impact of different funding types in the biomedical research area. Matching of Scopus to PubMed records produced positive one-to-one matches for nearly 96% of the records acknowledging funding from figure 3, and for 88% of the records in the ‘none’ category from figure 3. The no acknowledgment category likely has a lower matching rate to Scopus records because this category will include a higher fraction of article types that do not represent technical advances (eg, editorials, book reviews, etc) and thus are far less likely to acknowledge funding sources.
Average citation counts per article were calculated using Scopus data for the articles in each of the funding categories and publication years from figure 3. Figure 4 shows that articles acknowledging PHS funding (P) receive twice as many citations on average as those that have no funding acknowledgment (none). This can be taken as evidence that the PHS system is working in terms of funding work that is of higher-than-average impact. However, the data also show that impact increases with the number of different types of funding sources acknowledged. Citation counts rise by about 10% if funding from another US government agency is acknowledged in addition to that from PHS (P+O). However, they rise even further, by 40% or more, if funding from a non-US government source (eg, a foundation, non-profit, or society) is acknowledged in addition to PHS funding (P+N). This is consistent with previous results by Boyack14 showing that articles jointly funded by the US PHS and non-US government sources have a higher impact than those funded by the US PHS alone. The highest impact articles on average are those that acknowledge funding from three different types of funding sources (P+O+N). These findings are robust in that each data point shown in figures 3, 4 is based on a minimum of 3000 articles per category-year.
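The per-category averages behind figure 4 amount to a group-by mean over (funding category, publication year); a minimal sketch, with assumed field names:

```python
from collections import defaultdict

def mean_citations_by_group(articles):
    """Average Scopus citation count per (funding category, publication year)."""
    totals = defaultdict(lambda: [0, 0])          # key -> [citation sum, count]
    for a in articles:
        key = (a["category"], a["year"])
        totals[key][0] += a["citations"]
        totals[key][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}
```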
The analysis of figure 4 includes all PubMed articles to which we could attach Scopus citation counts. Figure 5 shows the corresponding results when the analysis is limited to the 690 325 unique PubMed articles in the grant–article match list. Although the citation numbers for this more limited set are about 5% lower on average than those in figure 4, the ordering of the curves is maintained, further suggesting that the findings above regarding impact and funding types are robust. The 5% decrease in average citations relative to the larger set suggests that the 10% of articles marked as receiving PHS funding in the article-type tags but missing grant information strings may come preferentially from higher-impact journals.
Limitations and suggestions
There are, of course, limitations to the data described here that constrain the depth of the analyses that can be undertaken. Although it has not been mentioned until now, there are certainly false-positive matches between grants and articles in these data. The fraction of false positives is unknown but is expected to be small, given that we have excluded all matches to unawarded grants. One other type of potentially false-positive match is one in which the article appears to have been published before the grant was awarded. Such cases amount to roughly 1% of the data and can arise from a variety of errors in either the grant data or the parsing steps. In some cases, they may also reflect a researcher incorrectly acknowledging a grant that is due to start but has not yet started. We have not excluded these cases, since we expect that the majority of them could be explained with sufficient investigation.
Another key limitation is that despite PubMed's broad and increasing coverage of the life sciences and biomedical literature, it does not index all articles that acknowledge NIH grants. Thus, article counts and publication rates derived from this analysis should be considered lower bounds, and the undercounting may be greater in boundary areas with many journals and conference proceedings that are not indexed by PubMed (eg, informatics) than in core areas (eg, oncology). A further limitation is that the grant acknowledgments in PubMed do not, for the most part, include suffix information, and thus cannot be linked to individual grant years. Time lags must therefore be either assumed or ignored.
In the analyses here, we have not made use of any funding data. With the addition of such data, a variety of detailed input–output studies could be done. For example, time histories showing funding, publication counts, and citation counts could be constructed for individual grants, or for groups of grants by agency, program, funded institution, PI, etc.
The data that have recently been made available through NIH RePORTER and ExPORTER, based on the matching of grants to articles using the SPIRES system, are a great resource from which to pursue input–output studies of biomedical fields in the USA. Similar data exist for the UK in the Research Outputs Database. However, we note that no similarly accessible data exist outside the biomedical area. For example, such data linking grants and articles are lacking for the US NSF and other agencies. We hope that such data will be made more widely available for other agencies in the future.
This work reports on NIH activities that have produced grant-to-article linkage data now available at the NIH RePORTER and ExPORTER websites. In addition, we have performed high-level analyses of NIH input, output, and impact data at a large scale and have shown features of those data that have not been reported before. Although key findings have been given in context in each section of the paper, we summarize them here for easy access.
The fraction of articles reflecting synergies between multiple NIH institutes is increasing over time; 29% of current articles acknowledge grants from multiple institutes. Synergy between multiple grants from the same institute is also increasing over time.
Grant durations are decreasing slightly over time.
The median time lag to publication from a new grant is 3 years. Most grants that will produce an article will have done so within 5 years.
Each grant contributes to roughly 1.7 articles per year, averaged over all grant types.
PIs are not listed as authors on roughly 27% of articles that acknowledge their grants.
Articles acknowledging PHS funding receive on average twice as many citations as articles that acknowledge no funding source whatsoever.
Articles acknowledging both PHS funding and a non-US government funding source (eg, foundation, non-profit, society) receive on average 40% more citations than those acknowledging PHS funding sources alone.
Although we have stayed with a high-level view in the analyses reported here, the data support analysis at much finer-grained levels—by institute, grant type, etc—and even to the level of individual grants. We plan to pursue additional studies using these data, and encourage others to do the same.
We acknowledge B van Houten, former chief of the Program Analysis Branch at NIEHS, for his work and support at the genesis of the SPIRES project, and Y Gao and FO Finch, for their work on the SPIRES parsing and matching engines.
Funding This work was partially supported by NSF award SBE-0738111.
Competing interests Although KWB is employed by SciTech Strategies, no patent or product based on this work is under development.
Provenance and peer review Not commissioned; externally peer reviewed.