J Am Med Inform Assoc 2006;13:432-437 doi:10.1197/jamia.M2013
  • The Practice of Informatics
  • Application of Information Technology

ResourceLog: An Embeddable Tool for Dynamically Monitoring the Usage of Web-Based Bioscience Resources

  1. Nian Liu,
  2. Luis Marenco,
  3. Perry L Miller
  1. Center for Medical Informatics (NL, LM, PLM), Department of Anesthesiology (NL, LM, PLM), Department of MCD Biology (PLM), Yale University School of Medicine, New Haven, CT
  1. Correspondence and reprints: Nian Liu, PhD, Center for Medical Informatics, Yale University School of Medicine, 300 George Street, Suite 501, New Haven, CT 06511; e-mail: <nian.liu{at}yale.edu>
  • Received 9 November 2005
  • Accepted 26 March 2006

Abstract

This paper describes an open source application, ResourceLog, that allows website administrators to record and analyze the usage of online resources. The application includes four components: logging, data mining, an administrative interface, and a back-end database. The logging component is embedded in the host website. It extracts and streamlines information about the Web visitors, the scripts, and dynamic parameters from each page request. The data mining component runs as a set of scheduled tasks that identify visitors of interest, such as those who have heavily used the resources. The identified visitors are automatically invited to take a voluntary user survey. The usage of the website content can be monitored through the administrative interface and subjected to statistical analyses. As a pilot project, ResourceLog has been implemented in SenseLab, a Web-based neuroscience database system. ResourceLog provides a robust and useful tool to aid system evaluation of a resource-driven Web application, with a focus on determining the effectiveness of data sharing in the field and with the general public.

Introduction

Many academic libraries, hospitals, research laboratories, and other entities in the biomedical domains have websites that may provide tremendous resources to the public. Unfortunately, website administrators and project managers have difficulty assessing how useful their websites are, for example, how visitors use the resources. One simple mechanism to evaluate websites is to count the total number of page requests. Since the Web server records the website's activity, including incoming resource requests, in log files, statistical analyses of these files may provide some measures of Web resource usage. Using log files to analyze resource usage, however, requires a certain level of computer expertise and is tedious, inflexible, and often incomplete. It also does not allow dynamic monitoring of resource requests or just-in-time interactions with visitors of interest.

To facilitate system evaluation, this paper presents a database approach implemented as an embeddable (“plug-in”) tool, called ResourceLog. Invoked by the host Web application, ResourceLog extracts information from the client request and stores it in a relational database. ResourceLog includes a reporting website that provides usage statistics on different resources, including the entire website, the folders/subfolders, each script file, and any particular content page as defined by a unique combination of the script file and the query string (also called the dynamic parameters). ResourceLog performs data mining and identifies website visitors that show certain patterns, such as exceeding a threshold for the total number of page requests within a given time period. The identified visitors may be asked to participate in a user survey upon their next page request. The survey allows initiation of administrator-user interactions, which not only help to solicit useful comments but also serve as the first step in building a community of users in the website's domain. The goal of ResourceLog is to help administrators evaluate the website, analyze usage of the resources, justify the priorities of different resources, and improve the website design.

Background

A common approach to usage analysis of Web-based resources relies on the log files generated by the Web server. This approach has been widely used by librarians to analyze the usage of resources posted on library websites.1 Traditionally, the usage of library resources has been measured by counting the number of questions asked at a reference desk in the library.2 With the Internet available, librarians have begun to use the log files to analyze the use of library resources.2 3 4 Since Web servers record their activity, e.g., handling resource requests from clients, in the log files, statistical analyses of these files may provide measures of total page requests, referrals, unique visitors, on/off-campus demography of visitors, and requested resource types.5 6 The tools used by the librarians, however, are either inflexible or tied to a particular application.7

Log file analyses have been used in a few healthcare and bioscience websites to determine the usefulness of online resources. System evaluation helps enhance the online health sciences library in order to better meet the needs of visitors.8 Analyses of the usage data may help identify patient-specific information needs of the system visitors.9 10 Analyses of an online information retrieval system help prove that clinicians' use of online evidence is related to direct patient care.11 Scientific LogAnalyzer, a log file-based Web service, aids behavioral and social scientists in psychological research.12 Medical educators and librarians have also used the log files to determine the effectiveness of online materials for students taking related courses13 or those in community and ambulatory settings.14 Compared with the library websites, however, system evaluation by usage analysis has largely been ignored in the biomedical websites, in part due to lack of efficient analytic tools.

Using log files for usage analysis has several limitations. (1) Existing software applications need to import the log files.12 15 As a result, locating and importing the proper log files becomes tedious work for most non-technical administrators. In addition, the tools run only on local machines and may not be readily accessible to administrators through the Internet. (2) Usage analysis based on log files provides only a retrospective view of the statistics on resource requests. It allows neither dynamic monitoring of resource usage nor dynamic pattern mining of the usage data, for instance, to identify visitors with certain patterns in order to initiate just-in-time user interactions. A plug-in application, called eXTReMe Tracker, provides dynamic reporting on referrals and visitors.16 This application does not keep track of multiple pages on a website, however, which limits its use for resource tracking. (3) Log files are not designed to “log” everything that occurs in the Web server. By default, Web servers do not record the query strings, i.e., the dynamic parameters or search arguments.17 In database-driven websites, the same Web script file may display hundreds or thousands of dynamically created “Web pages,” with the contents depending on the query strings. Without recording the query string, the specificity of the requested resource will be lost. Because of these limitations, using the log files to analyze the usage of Web-based resources tends to be inconvenient, inflexible, and incomplete.

Design Objectives

The objectives of the new Web logging system are the following:

  1.  To provide an application framework for logging website usages with a database approach;

  2.  To identify unique visitors and record related information, such as IP (Internet Protocol) addresses, host/domain names, Web browser types, etc.;

  3.  To keep track of the “detailed” content page that each visitor has requested, including script name, query string, and request date/time;

  4.  To provide data mining features to automatically identify certain visitors, such as those of heavy usage, and initiate interactions;

  5.  To provide a Web interface for administrators to view and analyze the usage data;

  6.  Not to interfere with the normal flow of requests and responses or with data uploads to the website.

Based on these objectives, we designed ResourceLog, an application that can be embedded into the existing host websites. As a pilot project, ResourceLog has been implemented in the SenseLab website.

System Description

ResourceLog comprises four components: logging, data mining, an administrative interface, and a back-end database (Figure 1). Together, these components provide a suite of tools for extraction, storage, analysis, and presentation of the usage data. The source code and database schema can be obtained from http://senselab.med.yale.edu/senselab/softwares/resourcelog/download/.

Figure 1

Flowchart of the log data in ResourceLog. The logging component is embedded in the host Web server. It extracts information from the requests and identifies the visitor, the script file and the query string against the log database. All visitors with valid usage triggers will be asked to fill out a survey form. The data mining component currently includes two scheduled tasks, “Usage Check” and “DNSLookUp.” The administrative website provides interfaces for administrators to view and analyze the data.

Logging

The logging component serves as the “plug-in” to the host Web application. When the host application receives a page request, it invokes the logging system. The system uses HTTP cookies to identify visitors. The cookie value is the Web-log visitor ID, a unique number that is sequentially generated by the logging system. When no cookie is retrieved, the IP address is used to identify the visitor. This may occur when the cookie has been deactivated or deleted, or when the visitor accesses the website for the first time. A single visitor ID will be generated as the default identifier for machines with no cookies but sharing the same IP.
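The identification rule above can be illustrated with a minimal Python sketch. It is not the ResourceLog code (which is written in ASP); the `VisitorRegistry` class and its in-memory dictionaries are hypothetical stand-ins for the visitor table, and the sketch assumes that an unknown or malformed cookie value is treated the same as a missing cookie.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Optional


@dataclass
class VisitorRegistry:
    """Hypothetical in-memory stand-in for the visitor table."""
    by_id: Dict[int, str] = field(default_factory=dict)  # visitor ID -> last seen IP
    by_ip: Dict[str, int] = field(default_factory=dict)  # IP -> default visitor ID
    _next_id: count = field(default_factory=lambda: count(1))  # sequential ID source

    def identify(self, cookie_value: Optional[str], ip: str) -> int:
        # 1. A recognized cookie carries the visitor ID directly.
        if cookie_value is not None and cookie_value.isdigit():
            visitor_id = int(cookie_value)
            if visitor_id in self.by_id:
                self.by_id[visitor_id] = ip
                return visitor_id
        # 2. No usable cookie: machines sharing an IP share one default ID.
        if ip in self.by_ip:
            return self.by_ip[ip]
        # 3. First request from this IP: generate the next sequential ID.
        visitor_id = next(self._next_id)
        self.by_id[visitor_id] = ip
        self.by_ip[ip] = visitor_id
        return visitor_id
```

In this sketch a visitor who keeps the cookie is recognized even after moving to a different IP address, whereas cookie-less requests from one IP collapse into a single visitor ID.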

Returning visitors are subjected to a “usage check”, a mechanism to solicit feedback from visitors who use the resources heavily. A scheduled task runs on the server, performing data analysis and issuing “usage triggers” to visitors that match the triggering criteria (described below). When the logging system detects a valid trigger, it redirects the current page request to a survey form. The trigger check is skipped for all requests using the “Post” method, which may imply that the visitor is carrying out insert or update operations on the database.
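The redirect decision can be sketched in a few lines of Python. The function name and parameters are hypothetical, and the sketch assumes that a trigger is "valid" when its flag is set and it is no older than a configurable lifetime; the 12-hour value is taken from the example given under Data Mining and is not a fixed setting.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed lifetime; the paper mentions 12 hours only as an example.
TRIGGER_LIFETIME = timedelta(hours=12)


def should_redirect_to_survey(method: str,
                              has_trigger: bool,
                              trigger_issued: Optional[datetime],
                              now: datetime) -> bool:
    """Return True when the current request should be sent to the survey form."""
    # "Post" requests are never interrupted: the visitor may be
    # inserting or updating data.
    if method.upper() == "POST":
        return False
    # No trigger flag means a normal page response.
    if not has_trigger or trigger_issued is None:
        return False
    # Only triggers that have not yet expired count as valid.
    return now - trigger_issued <= TRIGGER_LIFETIME
```

Checking the request method first keeps the survey from ever interrupting a data submission, regardless of the trigger state.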

The logging system records the absolute path (directory and file name) of the script for the requested Web page, together with the query string. If the path or the query string is already in the database, the corresponding script ID or query ID is fetched from the database. If not, a new entry is added to the database. The final step in logging is to create a “join” of visitor ID, script ID, and query ID.
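A minimal sketch of this fetch-or-insert step, using Python's built-in `sqlite3` module in place of the Oracle back end. The table names follow the schema described in this paper, but the column names (`script_path`, `query_string`, `dt_request`) and the helper functions are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime


def open_log_db() -> sqlite3.Connection:
    """Create a toy in-memory log database (column names are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE log_script (id INTEGER PRIMARY KEY, script_path TEXT UNIQUE NOT NULL);
        CREATE TABLE log_query  (id INTEGER PRIMARY KEY, query_string TEXT UNIQUE NOT NULL);
        CREATE TABLE log_visitor_script_query_join (
            visitor_id INTEGER NOT NULL,
            script_id  INTEGER NOT NULL,
            query_id   INTEGER NOT NULL,
            dt_request TEXT NOT NULL
        );
    """)
    return conn


def get_or_create(conn: sqlite3.Connection, table: str, column: str, value: str) -> int:
    """Fetch the ID for value, inserting a new row if it is not yet known."""
    # table and column are fixed literals in this sketch, never user input.
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid


def log_request(conn: sqlite3.Connection, visitor_id: int,
                script_path: str, query_string: str, when: datetime) -> None:
    """Record one page request as a visitor/script/query join row."""
    script_id = get_or_create(conn, "log_script", "script_path", script_path)
    query_id = get_or_create(conn, "log_query", "query_string", query_string)
    conn.execute(
        "INSERT INTO log_visitor_script_query_join VALUES (?, ?, ?, ?)",
        (visitor_id, script_id, query_id, when.isoformat()),
    )
    conn.commit()
```

Because scripts and query strings are stored once and referenced by ID, each page request adds only one narrow row to the join table.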

Data Mining

ResourceLog includes scheduled tasks that implement administrator-designed algorithms to identify visitors with particular browsing patterns (e.g., request frequency or preferences). For instance, the system may identify heavy users, e.g., visitors with 100 or more page requests in a day. “Usage Check” is the task that identifies these visitors and labels them with a “trigger”. It also removes obsolete triggers, i.e., those older than a specified period of time, say, 12 hours. “DNSLookUp” is a scheduled task that looks up the domain names, i.e., the host names for the IPs.
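The “Usage Check” task might look roughly like the following Python sketch, again with `sqlite3` standing in for the real database. The fields `speed_ticket` and `dt_last_ticket` are named in the schema description; every other column name, the `make_db` helper, and the default threshold, window, and lifetime are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import List


def make_db() -> sqlite3.Connection:
    """Create a toy in-memory log database (column names partly assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE log_visitor (
            id             INTEGER PRIMARY KEY,
            speed_ticket   INTEGER NOT NULL DEFAULT 0,  -- 1 while a trigger is active
            dt_last_ticket TEXT                         -- ISO timestamp of last trigger
        );
        CREATE TABLE log_visitor_script_query_join (
            visitor_id INTEGER NOT NULL,
            dt_request TEXT NOT NULL                    -- script/query IDs omitted here
        );
    """)
    return conn


def usage_check(conn: sqlite3.Connection, now: datetime,
                threshold: int = 100,
                window: timedelta = timedelta(hours=24),
                trigger_lifetime: timedelta = timedelta(hours=12)) -> List[int]:
    """Expire obsolete triggers, then flag heavy users; return newly flagged IDs."""
    # Step 1: remove triggers older than the allowed lifetime.
    conn.execute(
        "UPDATE log_visitor SET speed_ticket = 0 "
        "WHERE speed_ticket = 1 AND dt_last_ticket < ?",
        ((now - trigger_lifetime).isoformat(),),
    )
    # Step 2: find unflagged visitors at or above the request threshold.
    heavy = [row[0] for row in conn.execute(
        """SELECT j.visitor_id
           FROM log_visitor_script_query_join AS j
           JOIN log_visitor AS v ON v.id = j.visitor_id
           WHERE j.dt_request >= ? AND v.speed_ticket = 0
           GROUP BY j.visitor_id
           HAVING COUNT(*) >= ?
           ORDER BY j.visitor_id""",
        ((now - window).isoformat(), threshold),
    )]
    # Step 3: label them with a trigger and record the issue time.
    conn.executemany(
        "UPDATE log_visitor SET speed_ticket = 1, dt_last_ticket = ? WHERE id = ?",
        [(now.isoformat(), vid) for vid in heavy],
    )
    conn.commit()
    return heavy
```

Note that in this sketch a visitor whose trigger has just expired can be flagged again in the same run if the request count is still above the threshold; whether that is desirable is a policy choice for the administrator.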

Administrative Interface

ResourceLog includes a Web interface that allows administrators to monitor the usage of resources and carry out statistical analyses of the usage data. This interface is secured with login and password credentials. It shows the number of page requests for the whole website, each folder/subfolder, each Web script, and each unique resource page defined by a combination of the script and the query string. It lists all the visitors for any given resource at any given period of time. The administrators can retrieve/edit the profiles and view the browsing history of the visitors.

Database Schema

ResourceLog has a clearly defined database schema (Figure 2). Table “log_visitor” stores information about the visitors. The field “name” stores domain names for the IPs. The field “speed_ticket” records the flag showing that the visitor is being labeled with a trigger and subjected to usage survey. The field “dt_last_ticket” shows the date/time when the last trigger was issued.

Figure 2

Schema of the relational database used for ResourceLog.

Table “log_script” stores the absolute path of the script files. A new entry will be added when the related Web page is first accessed from the Internet. Therefore, non-existing or never-requested pages will not be added to the table. Table “log_query” stores all the request query strings. Table “log_visitor_script_query_join” records a “join” of the visitor ID, script ID and query ID, and the date/time when the “join” occurred. Each row in this table represents a page request to the website.

Table “log_sites” has been added to create counters of page requests for the entire host website and for resources in each folder or subfolder. Each time a page request is made, the counter for the corresponding folder/subfolder will be incremented by one. Table “log_excluded_ipaddr” stores the IP addresses, such as those for website developers, from which all the page requests need to be excluded from usage data analyses.
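The folder counters can be pictured with a short Python sketch. It assumes that a request increments the counter of every folder on the path from the site root down to the script's own folder, so that a parent folder's total includes its subfolders; the paper does not spell out this roll-up, and the function names and the second script path in the usage example are hypothetical.

```python
from collections import Counter
from typing import List


def ancestor_folders(script_path: str) -> List[str]:
    """List every folder from the site root down to the script's own folder."""
    parts = [p for p in script_path.split("/") if p]
    folders = ["/"]
    # The last element is the script file itself, so it is excluded.
    for depth in range(1, len(parts)):
        folders.append("/" + "/".join(parts[:depth]) + "/")
    return folders


def count_request(counters: Counter, script_path: str) -> None:
    """Increment the counter of each folder that contains the requested script."""
    for folder in ancestor_folders(script_path):
        counters[folder] += 1
```

For example, after one request to "/senselab/neurondb/ndbeavsum.asp" and one to a hypothetical "/senselab/ordb/default.asp", the counter for "/senselab/" would be 2 and the counter for "/senselab/neurondb/" would be 1.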

A Pilot Implementation—SenseLab

System Implementation

ResourceLog has been implemented in conjunction with a Web-based neuroscience database system, the SenseLab.18 The application uses Oracle as the back-end log database and the Web pages have been written in Microsoft ASP (Active Server Pages). The major steps in the logging component have been written in one script file. A single line of code to “include” this file is added in a global page shared by all the ASP pages in SenseLab. On average, the time for the logging lasts 55 milliseconds. The “DNSLookUp” task uses a commercial server object, AspDNS, to implement the reverse domain name lookup for the IP addresses (see http://www.serverobjects.com).

SenseLab includes six major databases (NeuronDB, CellPropDB, OdorDB, ORDB, ModelDB, and OdorMapDB). It also provides other resources including system architectures, management and complex search. ResourceLog embedded in the system has been in operation since its launch on February 13, 2002. Currently, the SenseLab website is displaying the counters for the entire application and each of the major databases showing the number of page requests since January 1, 2005.19

Usage of Specific Resources

Figure 3 shows example Web pages in the administrative interface. On the top is a form to set the search criteria. The page displays a hierarchical tree-like structure showing the parent folders, subfolders, and all the scripts for a selected folder (e.g. “senselab” in Figure 3A). The tree elements are hyperlinks allowing administrators to browse the tree. The numbers following the elements are total page requests for the corresponding folders/subfolders and scripts. When the button “List Visitors” is pressed, all the visitors who meet the search criteria and have made requests related to the selected element will be listed.

Figure 3

Example Web pages from the administrative Website. The figures show the usage of SenseLab resources in the first two months of Year 2004, excluding requests from the website developers. (A) When a folder (a hyperlink) is selected, all subfolders and script files are listed. (B) When a script file is selected, all the related query strings are automatically listed in the dropdown box. At the bottom of each panel are lists of SenseLab visitors. Note that both tables have been sorted by hits and truncated (a total of 6892 visitors retrieved in A and 7 visitors in B).

Figure 3B shows the usage of resources provided by “/senselab/neurondb/ndbeavsum.asp”. The administrators may choose to list all the visitors, or only those using the method “Get” or “Post”. For the “Get”, a specific query string may be selected from the dropdown list. Thus, the administrators are able to keep track of the usage of specific resources, such as a particular neuron in the NeuronDB or an olfactory receptor in the ORDB.

Resource Usage by Individual Visitors

Figure 4 is an example Web page showing the profile and browsing history of an individual visitor. The top of the page shows the visitor profile, including IP, domain/host name, VIP (Very Important Person) status, and the content of the survey feedback (i.e., the Note), if any. The table shows the entire history of resource usage by the visitor. The visited URLs (scripts plus query strings) are hyperlinks, which administrators can click to see the content of the requested resource.

Figure 4

An example Web page showing the profile and browsing history of a SenseLab visitor. The usage data indicated that the visitor had made 10 page requests during a 1-hour period.

Data Mining and Visitor Survey

In SenseLab, an algorithm has been designed to identify heavy users, defined as visitors with 100 or more page requests within a 24-hour period. The scheduled task “Usage Check” runs every 5 minutes. It executes queries against the database and labels the identified visitors with a flag, the usage trigger. Upon the next page request, a visitor with a valid trigger will be directed to a survey page and asked to fill out a survey form. It is the SenseLab policy that filling out the form is entirely voluntary. In the survey form, visitors are asked to provide their contact email addresses and to answer three questions: who they are; what they are using SenseLab for; how we might improve SenseLab. The content of the form is stored in the table “log_visitor.” An automatic email message with the content is sent to the corresponding website administrators. Since the launch of the logging system, the visitors who participated in the survey have included researchers, educators, students, curators, and others around the world. We have learned that the SenseLab resources have been used for a variety of activities, including basic research, course teaching, thesis work, and database design and curation.

Usage Statistics

Table 1 shows some statistics of the usage of SenseLab resources from February 12, 2002 to March 23, 2005. (A more thorough analysis is being prepared in another manuscript.) Over the three years, there were 81,414 unique visitors to the website. By a more conservative standard, these visitors came from 61,232 IP addresses. Over a million page requests were made by human visitors. With a threshold of at least 1000 requests, 128 of the visitors were arbitrarily classified as robots/crawlers; together they account for 77% of the total requests.
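The threshold-based robot classification amounts to a simple split of the per-visitor request counts, sketched below in Python. The function name is hypothetical and the counts in any example are toy values, not SenseLab data; only the 1000-request threshold comes from the text.

```python
from typing import Dict, Set, Tuple


def split_robots(requests_per_visitor: Dict[int, int],
                 threshold: int = 1000) -> Tuple[Set[int], float]:
    """Classify visitors at or above the threshold as robots/crawlers.

    Returns the robot visitor IDs and their share of all page requests.
    """
    robots = {vid for vid, n in requests_per_visitor.items() if n >= threshold}
    total = sum(requests_per_visitor.values())
    robot_requests = sum(requests_per_visitor[vid] for vid in robots)
    share = robot_requests / total if total else 0.0
    return robots, share
```

With toy counts of 3000, 500, and 500 requests for three visitors, the first visitor would be classified as a robot and would account for 75% of the requests. A pure count threshold is a coarse rule, which is why the classification above is described as arbitrary.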

Table 1

Usage Statistics of SenseLab Website

There are more than 300 script files deployed in the SenseLab website. The unique page requests (>591K) are defined by distinct combinations of the script file and the query string recorded in the usage database, and hence more accurately reflect the amount of resources. Over the three years, 1385 usage triggers were issued to 497 visitors, 70 of whom participated in the survey.

Discussion

This paper describes an open source, embeddable software tool for logging the usage of Web-based resources in the biomedical domain. The application provides a mechanism for usage data mining and user interactions and allows administrators to dynamically monitor the usage of resources in the websites.

A Comprehensive Approach to Log the Biomedical Websites

With the Internet, many biomedical websites have been built as portals for sharing resources. Keeping track of how the resources are being requested is a daunting task and has been largely ignored. For most website administrators, knowing exactly how many requests over time have been made to the website and what types of resources have been requested would be very helpful. Some websites provide particular “datasets” as resources.20 The request or download of these datasets may be easily counted. For most websites that use the Web pages to display the content of resources, keeping track of the usage of each page becomes difficult. The present study provides a comprehensive approach to solve this problem. By using cookies, ResourceLog recognizes the same individual machine no matter where it might be located, in the office, at home, behind a firewall or in a subnet. It provides a dynamic system to monitor the usage of the resources and allows just-in-time communication with visitors identified by data mining.

Definition of Unique Visitors

Using cookies and IPs, ResourceLog provides two robust estimations of the number of website visitors. While using cookies unambiguously identifies the visitors, it may cause “termination” of existing visitors and “over-estimation” of the number of total visitors.3 Since it is critical to keep the usage history separate for different visitors, e.g., not to over-estimate the number of requests in data mining, the benefit of identifying unique visitors outweighs the drawback.

The use of IP address and cookies inevitably raises the privacy issue. IP is intrinsic to HTTP requests and is automatically recorded in the server log file. Whereas other analysis software programs extract IPs from the log files, ResourceLog records them directly from the requests. The cookie feature in the browsers is fully controlled by the visitors. That is, a visitor may choose to accept, block, or delete cookies. Even if the cookie feature is deactivated, ResourceLog is fully operational and solely uses the IP instead. In the current pilot implementation, the administrative interface is secure and requires authenticated login to access the usage data. It is our policy that the program is used exclusively to analyze the SenseLab usage and relationships between different resources, and not used to investigate the behaviors of particular visitors, per se, and that the recorded IP addresses are not shared with any third party.

Definition of a Single Page Request

ResourceLog defines a single page request, or a page hit, as a single “Get” or “Post” request. A unique page request is determined by a combination of a script file and a set of dynamic parameters. The outcome from the request may be conceived as a single Web page. Displaying the content in a page may involve many Internet “hits” to the host website or even to other websites. The statistics of these “hits” may be useful for website managers to increase the system performance but are not useful from the resource administrator's point of view. Some statistical analyses focus on “sessions”, each of which is called a single visit and defined as a collection of actions taken by the visitor while visiting the website without leaving it.21 22 The behavior of visitors to public databases is most likely unpredictable. Sessions may become unreliable because a session expires after a period of inactivity.
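The distinction between page requests and unique page requests can be made concrete with a short Python sketch. The function name and the query strings in the example are hypothetical; each request is represented simply as a (script, query string) pair.

```python
from typing import Iterable, Tuple


def summarize_requests(requests: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (total page requests, unique page requests).

    A unique page request is a distinct (script, query string) combination.
    """
    logged = list(requests)
    return len(logged), len(set(logged))
```

For instance, three logged requests to the same script, two of which carry the same query string, yield 3 page requests but only 2 unique page requests.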

Extensible Usage Data Mining and Visitor Interactions

Since ResourceLog is open source, the data mining component is extensible. Depending on the needs of the administrators, new data mining algorithms can be developed and implemented. Also, multiple mining algorithms can be run simultaneously and may raise different triggering flags. For example, since there are six databases in SenseLab, the preference of visitors to different databases may reflect the special interests of the visitors. It would be appropriate to present different survey forms with more specific questionnaires to those visitors.

The usage data stored in the database can be used for thorough statistical analyses, including usage frequency over different time periods, visitor demography, and patterns of browsing flow during a visit. The frequency of requests, the time spent on each resource, and the flows among different resources may constitute a complex network of resource usage in the website, providing a means to investigate human interest and the relevance of knowledge in the biomedical domain.

Lessons Learned

In the current implementation, we were interested only in visitors with heavy usage by setting a high threshold on the belief that these visitors would benefit the most from the resources and be likely to provide thoughtful feedback. As a result, only a small percentage of tens of thousands of visitors received usage triggers. The voluntary policy further reduced the percentage of visitors who have participated in the survey. One would assume that a lower threshold would result in more usage triggers and likely more visitors who would participate in the survey.

The survey, nonetheless, provides a useful tool for SenseLab developers to solicit visitors' feedback and suggestions. For example, at the requests from visitors, we have added hippocampal interneurons to the NeuronDB and added new pages in the ORDB site to include all the aliases of OR gene sequences based on genomic analyses for the human and mouse genome. More importantly, some misidentification of terms has been pointed out by the visitors, and we have, accordingly, made the necessary corrections. We have also replied to many of the survey-generated e-mails, addressing some specific issues raised by the visitors. The survey has proved to be valuable in determining the usefulness of the resources, making website improvements, and establishing a community of interested SenseLab visitors.

Future Directions

In order for ResourceLog to be a more useful and adaptive tool that can be embedded in other resource-driven biomedical websites, future work may be explored in two directions. The first is to develop more complicated algorithms to identify visitors with particular interests and to study the relevance of different resources in the website. The second direction is to make the logging system more portable and platform-independent. A Java version of the component has been developed and embedded in a JSP-based Web database system run on Apache Tomcat Server. Implementing this component in a module or package with database connection utilities, however, may allow a broader use of the program in different Web servers and back-end database systems.

Footnotes

  • This research was supported by NIH grants K22LM008422, T15LM07056, P20LM07253 and P01DC04732.

  • The authors thank Dr. Gordon M. Shepherd for constructive discussion on the project and critical reading of the manuscript.
