Application of change point analysis to daily influenza-like illness emergency department visits
- Taha A Kass-Hout1,
- Zhiheng Xu2,
- Paul McMurray1,
- Soyoun Park3,
- David L Buckeridge4,5,6,
- John S Brownstein4,7,
- Lyn Finelli8,
- Samuel L Groseclose1
- 1Public Health Surveillance and Informatics Program Office, Office of Surveillance, Epidemiology, & Laboratory Services, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- 2Food and Drug Administration, Silver Spring, Maryland, USA
- 3McKing Consulting Corporation, Atlanta, Georgia, USA
- 4International Society for Disease Surveillance, Boston, Massachusetts, USA
- 5Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada
- 6Agence de la santé et des services sociaux de Montréal, Direction de santé publique, Montreal, Quebec, Canada
- 7Children's Hospital Boston, Harvard Medical School, Boston, Massachusetts, USA
- 8National Center for Immunization and Respiratory Diseases, Office of Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Correspondence to Dr Taha A Kass-Hout, Public Health Surveillance and Informatics Program Office, Office of Surveillance, Epidemiology, & Laboratory Services, Centers for Disease Control and Prevention, Atlanta, Georgia, USA;
- Received 21 December 2011
- Accepted 15 May 2012
- Published Online First 3 July 2012
Background The utility of healthcare utilization data from US emergency departments (EDs) for rapid monitoring of changes in influenza-like illness (ILI) activity was highlighted during the recent influenza A (H1N1) pandemic. Monitoring has tended to rely on detection algorithms, such as the Early Aberration Reporting System (EARS), which are limited in their ability to detect subtle changes and identify disease trends.
Objective To evaluate a complementary approach, change point analysis (CPA), for detecting changes in the incidence of ED visits due to ILI.
Methodology and principal findings We used data collected through the Distribute project (isdsdistribute.org), which aggregates data on ED visits for ILI from over 50 syndromic surveillance systems operated by state or local public health departments. We evaluated the performance of the cumulative sum (CUSUM) CPA method in combination with EARS, and compared the performance of three CPA methods (CUSUM, structural change model, and Bayesian) in detecting change points in daily time-series data from four contiguous US states participating in the Distribute network. Simulation data were generated to assess the impact of autocorrelation inherent in these time-series data on CPA performance. The CUSUM CPA method was robust in detecting change points with respect to autocorrelation in time-series data (coverage rates at 90% when −0.2≤ρ≤0.2 and 80% when −0.5≤ρ≤0.5). During the 2008–9 season, 21 change points were detected; ILI trends increased significantly after 12 of these change points and decreased after nine. In the 2009–10 flu season, we detected 11 change points; ILI trends increased significantly after two of these change points and decreased after nine. Using CPA combined with EARS to analyze daily ED-based ILI data automatically, we detected a significant increase of 3% in ILI on April 27, 2009, followed by multiple anomalies in the ensuing days, suggesting the onset of the H1N1 pandemic in the four contiguous states.
Conclusions and significance As a complementary approach to EARS and other aberration detection methods, the CPA method can be used as a tool to detect subtle changes in time-series data more effectively and determine the moving direction (ie, up, down, or stable) in ILI trends between change points. The combined use of EARS and CPA might greatly improve the accuracy of outbreak detection in syndromic surveillance systems.
- public health
- change point analysis
- Simulation of complex systems (at all levels: molecules to work groups to organizations)
- monitoring the health of populations
- detecting disease outbreaks and biological threats
Public health agencies are increasingly using syndromic surveillance to monitor population health.1–3 Most systems draw data from emergency department (ED) visits and use influenza-like illness (ILI) syndromes to supplement information from other systems for monitoring the impact of seasonal influenza.1 These ED-based syndromic systems typically provide health departments and Centers for Disease Control and Prevention (CDC) with more timely data on ILI than existing networks of sentinel healthcare providers.4
The Distribute system accepts aggregate data submitted daily from over 50 state or local health departments. The data submitted include total counts of ED visits and counts of ED visits for ILI, both of which are stratified by age group. Submitted data are processed automatically and transmitted to a centralized repository maintained by CDC and ISDS.5
Automated surveillance data are typically analyzed using aberration detection algorithms.6 Different algorithms have been proposed and evaluated in syndromic surveillance systems.7–9 Techniques include regression methods, time-series analysis, statistical process control, spatial-temporal clustering analysis, and multivariate outbreak detection.9 The sensitivity, specificity, and timeliness of outbreak detection for different algorithms vary significantly.7 The algorithms encoded in the Early Aberration Reporting System (EARS) have been used widely to analyze automated syndromic surveillance data in public health owing to their simple format, implicit correction for seasonal trends, and ease of implementation.10 Modifications have been made to the EARS C2 algorithm such as using a longer baseline period, restricting a minimum SD of one, and accounting for total visits, which have resulted in improvements in the performance of the EARS algorithm for aberration detection.11 Consequently, the EARS algorithm is effective for detecting sudden major changes in automated surveillance data. However, these algorithms, especially the modified C2 algorithm used by BioSense, have a limited ability to identify subtle and potentially important changes in surveillance time series. Subtle changes in disease trends are expected to occur before the onset of major increases or decreases in communicable disease incidence. Detecting these subtle changes could be critical for public health decision-making in the early emergency response. The cost of failure to detect such changes could have significant impact on the control and prevention of emerging diseases.10 In addition, the EARS algorithm detects only changes in static disease activity at a given time when the outbreak threshold is met and only signals the direction (ie, upward, downward, or stable) of changes in disease trends at a single time point.
The limitations of aberration detection algorithms such as those in the EARS system can be addressed by the use of other analytical methods, such as methods for change point analysis (CPA), which are designed expressly to detect subtle changes in incidence and characterize changing trends in time series. Over the past 50 years, CPA methods have been applied to problems in statistics, economics, medicine, agriculture, intelligence (Al Qaeda network), and, more recently, to microarray data.12 ,13 For example, Finney14 used this tool to detect significant changes in the insect population within fields, Hansen15 dated structural changes in US labor productivity, and Erdman and Emerson16 developed a fast Bayesian CPA for the segmentation of microarray data. Even though CPA has been widely used in many different fields, to the best of our knowledge, its application to detecting disease outbreaks from public health surveillance systems, especially in syndromic surveillance, has not been reported. CPA methods can be used to investigate (1) whether one or more changes occur in a series of data points, and (2) the direction of the change in the time series between change points. During the 2009–10 influenza A (H1N1) pandemic, public health officials were especially interested in identifying in ‘real time’ whether the proportion of ED visits due to ILI was rising, decreasing, or stable, and the likely trend for the near future.
In this paper, we consider the use of CPA methods as a complementary approach to aberration detection to address the limitation of EARS algorithm in detecting subtle changes and determining the direction of changes in disease trends. We compared three different CPA methods in detecting change points in the daily proportion of ED visits due to ILI reported to the Distribute system. We present the results of simulation studies designed to assess the impact of autocorrelation inherent in time-series data on the performance of CPA. We also discuss the application of CPA methods in real-time epidemic forecasting, which might enhance public health decision-making.
In this paper, we evaluate the benefits of combining CPA with the EARS algorithm in detecting disease outbreaks in syndromic surveillance. We compare Taylor's cumulative sum (CUSUM) CPA method with two other CPA methods, the structural change model (SCM) and Bayesian CPA (programs available in open-source R packages), in detecting change points in real surveillance data. We used the daily syndromic surveillance data reported to the Distribute system and CDC as our real data example. The outcome measure is the daily proportion of ED visits due to ILI. Simulated time-series data were also generated to test the robustness of CPA methods with respect to autocorrelation in time-series data.
Daily proportions of ED visits due to ILI reported to the Distribute system from four contiguous US states during the period October 4, 2008 through October 9, 2010 (2008–9 and 2009–10 flu seasons) were analyzed. Proportions by age group (all ages, <5 years, 5–17 years, 18–44 years, 45–64 years, and >65 years) were calculated. Figure 1 illustrates daily proportions of ED visits due to ILI by age group during the study period. The daily data on ED visits due to ILI were analyzed by CPA methods separately for each age group.
Taylor developed a CPA method through the iterative application of CUSUM charts and bootstrapping methods to detect changes in time series and their inferences.17 This approach is based on the mean-shift model and assumes that residuals are independent and identically distributed (iid) with a mean of zero. Inferences such as CIs and p values on the change points were obtained through bootstrap analysis. For each of the 1000 random bootstrap samples generated, we obtained information on change points and the difference between maximum and minimum CUSUM of residuals as $S_{\mathrm{diff}} = S_{\max} - S_{\min}$, where $S_{\max} = \max_i S_i$ and $S_{\min} = \min_i S_i$, with $S_i$ the CUSUM of residuals at time point $i$. Several bootstrap techniques (centile, bias-corrected and accelerated, and jackknife) were used to compute CIs for the change points.18 ,19 The distribution of the 1000 bootstrap values of $S_{\mathrm{diff}}$ was used to determine the p value for the change point as the percentage of values which are less than the $S_{\mathrm{diff}}$ from the original time-series data.17
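The CUSUM-and-bootstrap procedure described above can be sketched as follows. This is a minimal illustration for a single change point, not the authors' released R, SAS, or Stata programs. The function names `cusum_range` and `cusum_change_point` are hypothetical, and the sketch assumes that bootstrap samples are formed by randomly reordering the observations and that the candidate change point is placed where the absolute CUSUM is largest.

```python
import numpy as np

def cusum_range(x):
    """Return (S_diff, k) for the CUSUM of deviations from the series mean.

    k is the 0-based index of the first observation after the candidate
    change point, so x[:k] is the segment before and x[k:] the segment after.
    """
    x = np.asarray(x, dtype=float)
    # S_0 = 0 and S_i = S_{i-1} + (x_i - mean(x))
    s = np.concatenate(([0.0], np.cumsum(x - x.mean())))
    s_diff = s.max() - s.min()
    k = int(np.argmax(np.abs(s)))  # assumed estimator: largest |S_i|
    return s_diff, k

def cusum_change_point(x, n_boot=1000, seed=0):
    """Estimate one change point and the share of bootstrap S_diff values
    that fall below the S_diff of the original ordering."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    s_diff, k = cusum_range(x)
    # Assumption: each bootstrap sample is a random reordering of x.
    boot = np.array([cusum_range(rng.permutation(x))[0] for _ in range(n_boot)])
    share_below = float(np.mean(boot < s_diff))
    return k, s_diff, share_below
```

Applying the function iteratively to the segments on either side of each detected change point would give a multiple change point analysis; the stopping rule for that recursion is left out of this sketch.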
Two other popular CPA methods, SCM and Bayesian CPA, were compared with CUSUM. An intercept-only regression model is used in SCM CPA and the time point that minimizes the sum of squared residuals is defined as the change point.20 ,21 The SCM can be used with autoregressive data and can incorporate independent covariates; however, it assumes a stationary process, whereas surveillance data often have temporal trends or seasonal effects. Before using the SCM, surveillance data must be transformed from non-stationary to stationary data through differencing or other approaches. Similar to CUSUM, the SCM is based on a mean-shift model. The significance level we used in CUSUM and SCM was 0.001. An alternative to CPA methods based on a mean-shift model is the Bayesian CPA.22 With the Bayesian CPA, the posterior distribution of the change points is obtained by combining prior distributions with the likelihood derived from the time-series data. The default prior distribution in the Bayesian CPA is chosen as normal. Other non-informative prior distributions, such as uniform, can be defined in the model as well. The posterior probability of a change point at each position can be ordered from largest to smallest and plotted against the time scale. The open-source R packages, strucchange and bcp, were used for the SCM and Bayesian CPA, respectively. Our CUSUM programs have been developed in R, SAS 9.2, and Stata 11 and can be downloaded from our open-access collaboration website for CPA at: https://sites.google.com/site/changepointanalysis. Technical summaries of SCM, Bayesian CPA and EARS are included in the online supplementary appendix.
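To make the intercept-only SCM criterion concrete, the sketch below searches for the single split that minimizes the pooled sum of squared residuals around the two segment means. It is an illustration of the criterion only and does not reproduce the strucchange package, which handles multiple breaks and model selection. The function name `sse_change_point` and the `min_seg` parameter are hypothetical.

```python
import numpy as np

def sse_change_point(x, min_seg=2):
    """Return (k, sse): the split index k that minimizes the sum of squared
    residuals when x[:k] and x[k:] are each fitted by their own mean."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best_k, best_sse = None, np.inf
    # Require at least min_seg observations on each side of the split.
    for k in range(min_seg, n - min_seg + 1):
        left, right = x[:k], x[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, float(best_sse)
```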
Autocorrelation is often present in time-series data and it can affect the robustness and accuracy of CPA. We tested our CPA methods on simulated data based on a first-order autoregressive model as follows:
$x_t = \rho x_{t-1} + \varepsilon_t$
where $\varepsilon_t \sim N(\mu, \sigma^2)$ and μ and σ are the mean and SD we estimated from the Distribute ED ILI surveillance data (μ=0.02, σ=0.03), and ρ is the autocorrelation coefficient which ranges from −1 to 1. ρ was chosen at −1, −0.8, −0.5, −0.2, 0, 0.2, 0.5, 0.8, and 1 in this study. At each ρ level, we generated one time-series dataset with 100 observations and computed the single change point. The simulated dataset with ρ=0 was taken as reference data and its change point was taken as a reference point. Then, we detected change points in other simulated datasets with ρ≠0 and determined whether they fell into acceptance regions. The acceptance regions are defined as either zero or three time points (ie, 0 day or 3 days) away from the reference change points. We repeated the simulation 1000 times at each ρ level and then computed the percentage of change points falling in the acceptance region. A high percentage indicates strong robustness of CPA methods in detecting change points in autocorrelated data.
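The simulation design can be sketched as below. Several details are assumptions made for illustration rather than statements of the authors' procedure: the AR(1) recursion is taken as x_t = ρ·x_{t−1} + ε_t with normal innovations of mean μ and SD σ, the reference (ρ=0) and autocorrelated series in each replicate share the same innovations, the single change point is located at the largest absolute CUSUM, and the acceptance region is read as "within `tol` time points". All function names are hypothetical.

```python
import numpy as np

def ar1_series(eps, rho):
    """Build x_t = rho * x_{t-1} + eps_t with x_0 = eps_0 (assumed form)."""
    x = np.empty_like(eps)
    x[0] = eps[0]
    for t in range(1, len(eps)):
        x[t] = rho * x[t - 1] + eps[t]
    return x

def single_change_point(x):
    """Index of the largest absolute CUSUM of deviations from the mean."""
    s = np.cumsum(x - x.mean())
    return int(np.argmax(np.abs(s)))

def coverage(rho, tol, n_rep=1000, n_obs=100, mu=0.02, sigma=0.03, seed=0):
    """Share of replicates whose change point lies within tol time points
    of the change point found in the rho = 0 reference series."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        eps = rng.normal(mu, sigma, n_obs)
        ref = single_change_point(ar1_series(eps, 0.0))
        cp = single_change_point(ar1_series(eps, rho))
        hits += int(abs(cp - ref) <= tol)
    return hits / n_rep

if __name__ == "__main__":
    for rho in (-0.5, -0.2, 0.2, 0.5):
        print(rho, coverage(rho, tol=0), coverage(rho, tol=3))
```

Because of these assumptions, the printed values are not expected to reproduce the coverage rates reported in table 3; the sketch only shows how a coverage statistic of this kind is computed.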
ILI trend determination
The ILI trend (upward, downward, or stable) was calculated as the difference in the mean of the %ILI between the interval after the change point and the interval before the change point. For example, the difference (Δ) of 3.0% at change point April 27, 2009 indicates that the percentage of ED visits due to ILI increased significantly (3.0%) after April 27, 2009 (table 1). In addition, change points were used to divide an entire time series into four types of segments based on disease trend: moderately up (Δ>1%), slightly up (0<Δ≤1%), slightly down (−1%<Δ≤0), and moderately down (Δ≤−1%).
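The trend calculation and the four segment types defined above can be written as a short sketch. The function names `mean_shift` and `classify_trend` are hypothetical; the thresholds are those given in the text, with Δ expressed in percentage points of %ILI and the change point date treated as the first day of the interval after the change.

```python
def mean_shift(pct_ili, start, change, end):
    """Difference (after minus before) in mean %ILI around a change point.

    pct_ili[start:change] is the interval before the change point and
    pct_ili[change:end] the interval after it, starting on the change date.
    """
    before = pct_ili[start:change]
    after = pct_ili[change:end]
    return sum(after) / len(after) - sum(before) / len(before)

def classify_trend(delta):
    """Map a %ILI difference (percentage points) to one of four segment types."""
    if delta > 1.0:
        return "moderately up"
    if delta > 0.0:
        return "slightly up"
    if delta > -1.0:
        return "slightly down"
    return "moderately down"
```

For instance, the differences quoted in this paper of 3.0, 0.26, and −0.75 percentage points would be labeled moderately up, slightly up, and slightly down, respectively.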
Change points detected using CUSUM for influenza seasons 2008–9 and 2009–10 and statistical inferences (ie, 95% CI) are provided for each change point (table 1). Additionally, the %ILI differences before and after the change point are also provided in table 1 to show the flu trend. During the 2008–9 season (October 4, 2008 to October 3, 2009), 21 change points were detected and flu trends increased significantly after 12 of these change points and decreased nine times. Eleven change points were detected during the 2009–10 flu season; flu trends increased significantly after two of these change points and decreased nine times. Figure 2 shows the pattern of change point intensity on daily ED visits due to ILI across the 2008–10 flu seasons for all age groups in one health and human services region in the USA.
To illustrate the association between signals generated by CUSUM CPA and EARS methods, we insert the aberration points detected by EARS in table 1 rows closest to the nearest change point dates. Owing to the unusual temporal distribution of the H1N1 pandemic in 2009, most EARS anomalies were detected among change point intervals 4/27/2009–5/2/2009 and 8/16/2009–9/7/2009 in the four states. Subtle changes detected by CPA at 5/26/2009 (down 0.75%) and 7/25/2009 (up 0.26%) might give public health authorities more lead time to prepare for emergency and response activities as compared with responding to the multiple anomalies detected by EARS in the late August and early September of 2009. Since the modified C2 EARS algorithm used in BioSense does not have a function to capture decreasing trend, we can use CPA as a complementary approach to determine the direction of the ILI trend as upward, downward, or stable. For example, change points detected by CPA after mid-September, 2009 consecutively illustrate the decreasing trend of H1N1 influenza activity in the four states.
Figure 3 displays anomalies detected by EARS (red crosses) and CUSUM change points (vertical lines). The largest spike in %ILI during the H1N1 event was in April/May 2009 and was detected by both methods (figure 3). Table 1 indicates a significant change point on April 27, 2009 where ILI activity increased by 3.0% in the period after (April 27, 2009–May 6, 2009) compared with period before (April 6, 2009–April 26, 2009). The detection of the 3.0% increase in %ILI by CUSUM CPA and the multiple EARS anomalies detected during this period indicate the beginning of the H1N1 event in the four states. CUSUM CPA also detected four consecutive change points with significant increasing trends (August 14, 22, 30, 2009 and September 4, 2009) that illustrate the arrival of the fall 2009 H1N1 season in the four states. Simultaneously, a cluster of anomalies was detected by EARS in the August/September of 2009 (figure 3). In summary, figure 3 demonstrates the complementary use of CUSUM CPA and EARS in enhancing the precision of aberration detection and determination of disease trend in the ILI syndromic surveillance system.
In addition to CUSUM CPA, we also assessed the performance of the SCM and Bayesian CPA using the same Distribute ILI data. CUSUM and SCM are both based on a mean-shift model, while Bayesian CPA uses a prior probability assumption and data likelihood function to calculate the posterior probability of the actual change point occurring at each location. A threshold value (ie, 0.5) was chosen to filter out non-significant change points. Table 2 lists change points detected by the three CPA methods during the 2009 H1N1 pandemic in the four states (March, 2009 to July, 2010). Although each of the three methods uses a different algorithm to find change points, we observed a high degree of consistency in the location of change points across the three CPA methods. Approximately 90% of change points detected by SCM and Bayesian CPA exactly agreed with those found by CUSUM. The performance of the three CPA methods was similar, but Taylor's CUSUM approach has the advantages of a simpler mathematical format and of being more conservative in detecting change points.
Autocorrelation is often seen in time-series data, such as public health surveillance data. To demonstrate the sensitivity of CPA performance to the degree of autocorrelation in the data, we generated random autoregressive time-series data at different correlation levels and tested the robustness of CUSUM and SCM methods in detecting change points. Table 3 lists the coverage statistics from the two CPA methods (CUSUM and SCM) at each ρ level. The coverage statistics are computed as the probability of change points falling in the acceptance regions. The acceptance regions are defined as either zero (ie, the same time point) or within three time points (ie, 0 day or 3 days) away from the reference change points. The change point at ρ=0 is taken as a reference point. For CUSUM, more than 80% of change points detected at ρ level (−0.2≤ρ≤0.2) match the change point for iid time-series data where ρ=0. Moreover, the coverage rate at ρ level (−0.2≤ρ≤0.2) increased to >90% when we defined the acceptance region as three time points (ie, 3 days) away from the change point for iid data, and 80% coverage was achieved at a larger ρ level (−0.5≤ρ≤0.5). For the SCM, the coverage was not as good as CUSUM. Figure 4 shows the scatter plot of coverage probability between these two methods at day 0 and day 3. In the Distribute ILI time-series data, we observed moderate autocorrelation (−0.5≤ρ≤0.5). Therefore, results from the autocorrelation simulation conducted in this study support the robustness of CUSUM CPA in detecting change points for timely syndromic surveillance data.
In this paper, we assessed the utility of CPA as a complementary analytic method to the EARS algorithm when analyzing automated disease surveillance data—for example, ED visits due to ILI. Even though the EARS Shewhart variant method is very effective for detecting sudden major changes in time-series data, it has limited ability to identify subtle changes in the time series. Therefore, we assessed the performance of CPA as a complementary method to EARS since CPA can detect subtle changes in time-series data more effectively. Furthermore, CPA results show the direction of change in the ILI ED visit time series while EARS only detects isolated ILI visit trend anomalies. In most situations, anomalies are detected when ILI visits are increasing and identification of increasing disease incidence is of most interest for public health response. However, understanding when the disease trend goes up, down, or is stable could help public health agencies to allocate limited resources to the much-needed places in a timely manner. In public health surveillance, the intent of developing outbreak detection algorithms is to identify incidents that matter. In analyzing Distribute data during a flu season, we might examine over 500 time-series charts of ILI activity daily to determine whether the trend is stable, increasing, or decreasing and whether the detected change merits a public health response. With the use of CPA in addition to EARS methods, we were able to focus our time-series review on a more limited number of signals in the ILI time series to investigate further. This makes it a lot easier to prioritize the detected time-series changes that require investigation. 
The more limited set of time-series anomalies and related change points identified by complementary use of CUSUM CPA and EARS methods can help analysts focus their surveillance data review and decide whether further investigation is needed, such as subanalyses of those time series or even examinations of other data sources. During the 2008–9 and 2009–10 flu seasons, we continually shared the CPA results and the associated time series with influenza experts at CDC.
We also assessed the performance of three CPA methods applied to timely ED-based ILI surveillance data: CUSUM, SCM, and Bayesian CPA. Significant change points (p value <0.001) in ILI activity were detected by each method and ILI trends were further characterized (increasing, decreasing, or stable). SCM is ideal for detecting change points in multiple linear regression settings where vectors of covariates are added to the regression model to estimate the dependent variable in multiple time-series segments. However, our study focuses only on the outcome, the proportion of ED visits due to ILI, and no covariates are considered. Therefore, SCM is simplified to Taylor's mean squared error (MSE) method.17 CUSUM is more sensitive than SCM (Taylor's MSE) in detecting change points, as shown in table 2, where CUSUM detects more change points than either SCM or Bayesian CPA. Unlike CUSUM and SCM, Bayesian CPA is not based on the mean-shift model assumption. Bayesian CPA makes a normal distribution assumption for the time-series data and adopts noninformative priors on the model parameters. Given the nature of uncertainty in public health surveillance data, the normality assumption may interfere with the estimation of the posterior distribution obtained by Bayesian CPA. Therefore, the nonparametric methods (Taylor's CUSUM and MSE) are more favorable than Bayesian CPA in our study. Among the three methods, we selected CUSUM as the preferred method for use in the routine analysis of timely ED ILI surveillance data in Distribute owing to its simple mathematical formula and model robustness (it performed well with autocorrelated time-series data). Results of CUSUM analysis in combination with EARS methods may be a valuable resource for policy makers who, for example, must direct emergency preparedness and response resources based upon their understanding of emerging disease trends.
In addition to detecting subtle time-series data changes more effectively, CPA can be used to determine the trend (ie, upward, downward, or stable) in ILI activity between change points, while the modified C2 EARS algorithm used by CDC BioSense only flags time points when disease activity is significantly high. Common feedback from surveillance epidemiologists suggests that it is difficult to review and adequately investigate large numbers of surveillance data anomalies, especially during a public health event. Since CPA can transform time-series data into multiple segments between change points, it aids interpretation of the EARS-generated anomalies in each CPA-generated segment. To aid interpretation, the change points can divide the whole time series into, for example, four types of segments based on the direction and magnitude of the disease trend: moderately up, slightly up, slightly down, and moderately down. When the disease trend goes moderately down, anomalies are unlikely to be detected in those segments. However, possible anomalies could be detected in the part of the segment where the disease trend is slightly down. Knowing that the trend of disease activity is downward may help epidemiologists focus on exploring unexpected factors which could contribute to the occurrence of this individual anomaly point instead of being overwhelmed with a multitude of false alerts. When the disease trend goes slightly upwards, the anomalies in those areas could help epidemiologists closely monitor the situation for emergency preparedness and response to a potential event. Many more anomalies are detected using EARS methods when the trend of the disease is moderately upwards (figure 3), and these are commonly interpreted as all part of the same finding.
As a complementary approach to EARS and other aberration detection methods, the CUSUM CPA method can be used as a tool to detect subtle changes in time-series data more effectively and determine the moving direction (ie, up, down, or stable) in ILI trends between change points more appropriately. The combined use of EARS and CPA can greatly improve the accuracy of outbreak detection in syndromic surveillance systems.
The authors acknowledge Dr James W Buehler, MD, Director of the Public Health Surveillance and Informatics Program Office at US CDC.
All authors revised the manuscript for important intellectual content and approved the version submitted for review. This article is original work, not currently under review elsewhere and no part of this article has previously been published.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.