Missing Web References — A Case Study of Five Scholarly Journals

The present study attempts to ascertain the proportion of missing web references in 5–10-year-old research papers from five leading open access (OA) journals in library and information science. The results suggest that the share of web citations increased from 41.60% of all citations in 1998 to 53.32% in 2002. However, a substantial proportion of web citations (32.09%) was found to be missing. The percentage of missing web citations increases with the age of the publication: ten-year-old publications have the highest proportion of missing citations (39.96%) and five-year-old publications the lowest (25.89%). Of all web citations, 0.92% had moved to a new URL address, and 74.14% of missing citations resulted in an HTTP 404 (page not found) error.


Introduction
Much scholarship, if not all, is based on previous work, and when new scholarly work is produced it is important that detailed and accurate information on sources consulted is provided. To facilitate referencing, scholarly works have been routinely collected and preserved in print by libraries and database producers (Veronin, 2002). With the advent of the internet, large numbers of scholarly journals and other sources of information have become available online. This has resulted in an increasing usage of web references in research articles. The proportion of citations of electronic resources has increased from less than 5% of all citations in 1995 to nearly 30% in 2001 (Rumsey, 2002).
However, compared to their paper counterparts, web references have their own problems. It is now well documented that web pages and web sites come and go, and that occasionally they may resurface (Koehler, 2004). This has serious implications for researchers: whereas we are still able to read thousand-year-old written documents, information put on the web a mere few years ago is in danger of being lost. The present study endeavours to ascertain the proportion of missing web references in research articles from 1998 to 2002 (5-10 years old) across five leading open access (OA) journals in the field of library and information science.

Objectives
The objective of the present study is to ascertain the proportion of missing web citations in 5-10-year-old research papers.

Scope
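The scope of the present study is limited to web references in research articles (excluding editorials, news and reviews) from 1998 to 2002 in the following five leading OA journals in library and information science: Ariadne, Library Philosophy and Practice, Information Research, Issues in Science and Technology Librarianship, and D-Lib Magazine.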

Methodology
In August 2008, the references in each research article published from 1998 to 2002 in the five OA journals were analysed to locate the web references. The URL of each web reference was copied and pasted into the search box of Internet Explorer to find out which references were missing. All the details were noted. To ensure that inaccessibility was not due to temporary server problems, a second attempt was made to access the sites in October 2008.
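The manual procedure described above could be automated along the following lines. This is a minimal sketch under stated assumptions, not the method used in the study, which checked each URL by hand in a browser. The function names `fetch_with_urllib`, `classify_status` and `check_reference`, the category labels, and the injectable `fetch` callable are all hypothetical and introduced only for illustration. Treating a followed HTTP redirect as "moved" is a crude stand-in for the study's criterion of a link from the original URL to the new one.

```python
from typing import Callable, Optional, Tuple
import urllib.error
import urllib.request

# Hypothetical category labels chosen for this sketch.
ACCESSIBLE = "accessible"
MOVED = "moved"
NOT_FOUND = "missing (HTTP 404)"
OTHER_MISSING = "missing (other)"

FetchResult = Tuple[Optional[int], Optional[str]]


def fetch_with_urllib(url: str, timeout: float = 30.0) -> FetchResult:
    """Return (HTTP status, final URL); status is None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # urlopen follows redirects, so geturl() is the final address.
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, url
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return None, None


def classify_status(status: Optional[int], original_url: str,
                    final_url: Optional[str]) -> str:
    """Map one fetch outcome onto the categories used in this sketch."""
    if status is None:
        return OTHER_MISSING
    if status == 404:
        return NOT_FOUND
    if 200 <= status < 300:
        # Assumption: a different final URL means the resource has moved.
        if final_url and final_url != original_url:
            return MOVED
        return ACCESSIBLE
    return OTHER_MISSING


def check_reference(url: str,
                    fetch: Callable[[str], FetchResult] = fetch_with_urllib,
                    attempts: int = 2) -> str:
    """Try a URL up to `attempts` times and stop at the first success."""
    outcome = OTHER_MISSING
    for _ in range(attempts):
        status, final_url = fetch(url)
        outcome = classify_status(status, url, final_url)
        if outcome in (ACCESSIBLE, MOVED):
            break
    return outcome
```

Calling `check_reference(url)` on each cited URL would yield one label per reference. The retry in this sketch happens immediately, whereas the study repeated its check about two months later, so a faithful automation would schedule the second pass separately.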

Related Literature
The web is not a particularly stable environment for the publication of long-term information and the maintenance of individual objects or items (Koehler, 2004). Taylor and Hudson (2000) found variation among domain types and subject collections of printed bibliographies of URLs and web lists. Kitchens & Mosley (2000), while discussing the ephemeral nature of web references, question the utility of printed internet guides. Germain (2000) also questions the usefulness of web resources as citations for scholarly literature due to their ephemeral nature. According to Spinellis (2003), approximately 28% of the URLs referred to in Computer and Communications of the ACM articles between 1995 and 1999 were no longer accessible in 2000, and the figure rose to 41% in 2002. A study by Benow (1998) found attrition rates of 20% and 50% for websites over two- and three-year periods respectively. Nelson & Allen (2002) found a 3% attrition rate for digital library objects. Lawrence, Coetzee, Glover, Pennock, Flake, Nielsen, et al. (2001) also found many web references invalid after analysing 270,977 web references in computer science publications. The percentage of invalid URLs varied from 23% in 1999 to a peak of 53% in 1994. Harter & Kim (1996), after examining scholarly e-journal articles from 1993 to 1995, found that one third of the URLs were no longer accessible. Analysis of 1068 web citations by Sellitto (2005) demonstrated that 46% of all citations could not be accessed, with the HTTP 404 (page not found) message being the most common error message. A study of health-related web sites by Veronin (2002) found that 59% of the sites could not be found, 17% had moved to a new URL address and only 24% could be accessed at the original URLs. Rumsey (2002) tested citations over a span of five years in 2001 and found that 39% of 2001 URLs, 37% of 2000 URLs, 58% of 1999 URLs, 66% of 1998 URLs and 70% of 1997 URLs were not accessible. Markwell & Brooks (2002) estimated the URL half-life for online literature in the field of biochemistry and molecular biology at 4.6 years.

Findings and Discussion
A total of 8755 references appear in 630 articles published from 1998 to 2002 in the five journals, 4001 (45.69%) of which are web references. Of these web references, 1284 (32.09%) are no longer accessible. The proportion of missing references increases with the age of the publication: it is 25.89% for publications of 2002, rising to 28.15% for 2001, 37.23% for 2000, 37.79% for 1999 and 39.96% for 1998 (Table 1). The table also shows that the share of web references increased with each passing year (from 41.60% in 1998 to 53.32% in 2002). Of the missing references, 74.14% (952 of 1284) resulted in an HTTP 404 error, while 0.92% (37 of 4001) of web references had moved to a new URL address with a link from the original URL.
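The headline percentages can be re-derived from the counts quoted in this section. The sketch below is illustrative only: the helper name `truncated_percentage` and the constant names are hypothetical. It also assumes that the published figures are truncated rather than rounded to two decimals. That is an inference from the numbers (4001 of 8755 is 45.6996...%, reported as 45.69%), not something the text states.

```python
def truncated_percentage(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, truncated (not rounded) to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (part * 10_000 // whole) / 100


# Counts quoted in the text for all five journals, 1998-2002.
TOTAL_REFERENCES = 8755
WEB_REFERENCES = 4001
MISSING_WEB_REFERENCES = 1284
HTTP_404_REFERENCES = 952
MOVED_REFERENCES = 37

shares = {
    "web share of all references":
        truncated_percentage(WEB_REFERENCES, TOTAL_REFERENCES),
    "missing share of web references":
        truncated_percentage(MISSING_WEB_REFERENCES, WEB_REFERENCES),
    "HTTP 404 share of missing references":
        truncated_percentage(HTTP_404_REFERENCES, MISSING_WEB_REFERENCES),
    "moved share of web references":
        truncated_percentage(MOVED_REFERENCES, WEB_REFERENCES),
}

for label, value in shares.items():
    print(f"{label}: {value:.2f}%")
# web share of all references: 45.69%
# missing share of web references: 32.09%
# HTTP 404 share of missing references: 74.14%
# moved share of web references: 0.92%
```

Under this truncation assumption the moved share comes out at 0.92%, which matches the figure given in the abstract.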

Conclusion
The present study reveals that with each passing year the share of web citations increased, and so did the proportion of missing citations as publications aged. For five-year-old publications the percentage of missing citations is 25.89%, and this increases to 39.96% for ten-year-old publications. The most common error is an HTTP 404 (page not found) message.
There are many causes for failed web references, one being the wholesale restructuring of any given domain (Koehler, 2004). Although in certain cases the invalid web references could be located by means of alternative searches (Lawrence, Coetzee, Glover, Pennock, Flake, Nielsen, et al., 2001), this is not advisable from a scholarly point of view as it is a time-consuming process. It has been observed that canonical URLs, which take the form www.orgname.org and www.orgname.org.cc, are more likely to persist than non-canonical forms (Koehler, 2004).
There is an immediate need to address the problem by devising and adopting uniform standards for long-term preservation of web resources and their persistence.

Future Research
No attempt has been made to locate citations other than by means of the designated URLs. It is therefore possible that some documents are available from a different URL. This could be explored in future research.

Table 1.
Reference statistics of five journals.
* Figures in parentheses indicate a percentage.

Ariadne
Out of 1461 references appearing in 157 articles of Ariadne, 1112 (76.11%) are web references. The share of web citations increased from 50.0% in 1998 to 81.12% in 2002. Of the web references, 357 (32.10%) are not accessible. The highest proportion of missing references (43.40%) is found in 2000 and the lowest (25.41%) in 2001. The percentage of missing references for 1998, 1999 and 2002 is 36.61%, 37.33% and 30.90% respectively (Table 2). Out of 1112 web references, 11 (0.98%) had moved to a new URL address with a link from the original URL. Of the missing references, 257 (71.98%) returned an HTTP 404 error.

Library Philosophy and Practice
Library Philosophy and Practice published 32 articles during the period 1998-2002, which included 311 references. Out of 311 references only 70 (22.50%) are web references, 36 (51.42%) of which are not accessible. The highest and lowest numbers of missing references are found in publications of 2000 and 2001 respectively. Of the 34 web references that remain accessible, 1 (2.94%) had moved to a new destination with a link from the original URL. Of the missing references, 26 (72.22%) returned an HTTP 404 error. The share of web citations increased from 0% in 1998 to 31.25% in 2002 (Table 3).

Table 2.
Reference statistics of Ariadne.
* Figures in parentheses indicate a percentage.

Table 3.
Reference statistics of Library Philosophy and Practice.

Table 4.
Reference statistics of Information Research.

Table 5.
Reference statistics of Issues in Science and Technology Librarianship.

Table 6.
Reference statistics of D-Lib Magazine.