On the Net, Raising Dead Links

Greg R. Notess
Reference Librarian
Montana State University

EContent, December 1999
Copyright © Online Inc.

With the advertisement-driven economy of the Web, popularity becomes the significant factor on many sites. The extremely popular sites served up by the likes of Yahoo!, Microsoft, Lycos, Amazon, and AOL are heavily visited, and boast in their press releases of their rising popularity rank or increased visitation.

But what is the single most frequently seen page on the Web? Is it the home page for one of the major sites? Some days it seems to be the omnipresent "File not Found" page--the 404 error-message page. Many Web sites have their own customized version of this dead-link error message, which means that at least we get to see different "404 File not Found" pages. Note, however, that they are all generated because the Web server is returning the same error: the Web page you are looking for is just not there.

At times, this error is simply a result of a typo in the URL. While www.name.org/index.html may work just fine, www.name.org/indec.html will not. Likewise, www.neddevine.com is not the same as nedevine.com. It is certainly easy enough to mistype a URL. However, this does not equate to a dead link. In contrast, the 404 error messages that turn up from a link on another Web page, from a bookmark list, or from search-engine output, come from the death of a link. Dead links are littered throughout the Net. Sometimes the dead can be resurrected, found in a past life, or at least partially brought back to life. Understanding how links die and how 404 error messages can be managed on the Web server is also helpful for the information professional trying to track down the new home of a dead link.

THE DEATH OF LINKS

The interconnected, hypertext character of the Web contributes significantly to its importance. It allows for rapid changes to be made to files, easy connections between one site and another, and detailed intrasite navigational features. The Web encourages rapid, frequent, and direct publication of information. At the same time, it means that just as rapidly, pages can disappear, move, and change their entire content. The links on Web pages connect to other Web pages. Some of the pages could be on the same Web server while others can be anywhere else on the Web. This great strength of the Web is also a weakness, in that a company can never know every Internet site, intranet page, or bookmark list that links to its pages. And thus, it is impossible to update every page linking to its pages.

A link dies whenever a file on a Web server has its name changed, its location in the subdirectory structure moved, or its host name modified. If a file named FIRSTtry.htm becomes redesign.html, anyone linking to the old URL will get the "404 File not Found" message. When www.name.org/directory/subdirectory/file.html becomes www.name.org/directory/file.html, the old URL is again left orphaned.

Links also die when a server name changes. When www.host.net/~company/ becomes www.company.com and no redirect files are left behind, all the links to the old host die. Then again, there are plenty of pages, directories, and sites that have just been removed. Period. End of story. An organization may cease to exist, stop paying its Web bills, or get cut off by its hosting company. Once again, all the links pointing to those pages then die.

SEARCH-ENGINE DEATH

Links from the Net portals and search engines are another huge source of dead links. Anyone trying to maintain a database of millions of constantly changing records is bound to end up with a large number of dead links, especially when none of the search engines indexes every page every day. Indeed, with the recent emphasis on search-engine size, one tempting approach for the search engines is to keep their spiders busy crawling new Web pages and spend less time on recrawling the pages already in the index. However, as the number of dead links in a search engine or directory grows, so grows the users' discontent.

Consequently, the search engines must focus both on indexing new pages and reindexing old pages. With some pages changing shortly after they have been indexed, the search engines all actually provide a database that is a revolving snapshot of the past, of what a Web page looked like at the exact moment it was indexed. And these pictures of the past are bound to have some links which have since died, moved, or been transformed.

ERROR RESPONSES TO DEATH

The dead links continue to point to the old addresses until someone changes them or the search engines reindex the page. Links on bookmark lists and other pages may not be changed for years. When a person clicks on a dead link, Internet computers along the way discover the problem and respond with an appropriate, if unintelligible, error message.

If the server itself is gone or offline, the error message will report a DNS failure or provide a "host unreachable" message. Sometimes, the host is down temporarily, and just checking back sometime between a minute and a month later will find the page.

The 404 Error

More frequently, it is just the file that went off the deep end. The proper error code for a missing file is the "404 File not Found" error. The 404 part of the error message is just a numeric status code that the Web server sends back to the browser to identify which error occurred. In the early days of the Web, the only message displayed was the simple "404 File Not Found." Recent browsers and servers have recognized that the 404 is meaningless to most people, and now provide only a "File Not Found" error without the numeric identifier.
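For those who check links with a script rather than by hand, the two kinds of failure described above can be told apart in a few lines of Python. This is only a minimal sketch, not a finished link checker: the function name classify_link and its result labels are invented for illustration, and the opener parameter is an assumption added so that the function can be tried without a live connection.

```python
import socket
import urllib.error
import urllib.request


def classify_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return a rough, human-readable status for a link (illustrative labels)."""
    try:
        # urlopen follows redirects, so "alive" may describe a page at a new URL.
        with opener(url, timeout=timeout):
            return "alive"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status code.
        if err.code == 404:
            return "file not found (404)"
        return f"HTTP error {err.code}"
    except urllib.error.URLError as err:
        # No usable answer from a server at all.
        if isinstance(err.reason, socket.gaierror):
            return "DNS failure"
        return "host unreachable"
```

A "DNS failure" or "host unreachable" result is worth rechecking later, since the host may only be down temporarily; a 404 means the server is there but the file is not.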

Modern servers can take this even further by allowing customized error messages. Instead of simply providing the default error message, Web site managers can create a specially tailored file for their site. It may still say "File not Found," but it can also provide suggestions for where to find the missing file.

For example, take a site that has reorganized all of its files into a new directory structure. The press releases have been moved from /officeofpublicaffairs to the /news directory. The annual reports have been pulled from /ars and placed under /investors, with the old ones in /investors/old. Redirect pages can be left in the old directories, but with many sites going through complete redesigns every nine months or so, it becomes increasingly difficult to keep up with all the old locations.

A well-designed site will introduce a new 404 error message page at the same time as any redesign. That page can direct the user to the new locations for the most frequently-requested files. It can provide a site search capability to let users search the site directly. Ideally, it will include as many navigational features as possible: a site map, a site index, and an email contact for further assistance.
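To make the idea concrete, here is a rough Python sketch of how such a page might be assembled for the reorganized site in the example above. Everything in it is an assumption made for illustration: the function names, the MOVED_DIRECTORIES table, and the /sitemap.html and /search.html addresses are hypothetical, and a real server would generate its error page in whatever way its own software allows.

```python
from html import escape

# Hypothetical table of retired directories and their replacements,
# based on the example reorganization described in the text.
MOVED_DIRECTORIES = {
    "/officeofpublicaffairs/": "/news/",
    "/ars/": "/investors/",
}


def suggest_new_directory(requested_path, moved=MOVED_DIRECTORIES):
    """Return the new directory for a request under a retired one, else None."""
    for old_dir, new_dir in moved.items():
        if requested_path.startswith(old_dir):
            return new_dir
    return None


def build_404_page(requested_path, moved=MOVED_DIRECTORIES):
    """Build the body of a 'File not Found' page with navigational help."""
    lines = [
        "<h1>File not Found</h1>",
        f"<p>{escape(requested_path)} is not on this server.</p>",
    ]
    new_dir = suggest_new_directory(requested_path, moved)
    if new_dir is not None:
        link = escape(new_dir)
        lines.append(f'<p>This section has moved to <a href="{link}">{link}</a>.</p>')
    # Hypothetical site-wide aids: a site map and a site search page.
    lines.append(
        '<p>Try the <a href="/sitemap.html">site map</a> '
        'or the <a href="/search.html">site search</a>.</p>'
    )
    return "\n".join(lines)
```

The sketch points the visitor to the new directory rather than guessing at a new file name, since files are often renamed during a redesign as well.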

Customized Directory-Specific 404s

There are many advantages to having a customized 404 error message. Basically, it helps get people from the old or mistyped location to the relevant section of a Web site. For the searcher, it is useful to know that these 404 pages can get even more sophisticated. Rather than just using one customized 404 page, a separate customized page can be created for as many directories as necessary. Each directory could have a separate "404 File not Found" error message. Using a directory-specific 404 page allows the Webmaster to provide even more targeted advice for finding the lost file. In addition, the customized error page can be set to automatically load another specific page after a period of time. Some sites use this technique to automatically display the site's home page instead of a 404 error message. Others use it to redirect to a new directory.
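One plausible way to choose among directory-specific error pages is to use the page registered for the deepest directory that contains the requested path. The Python sketch below illustrates that rule only; the ERROR_PAGES table, its file names, and the function pick_error_page are hypothetical, and actual Web servers have their own configuration mechanisms for this.

```python
# Hypothetical table of directory-specific error pages.
# The "/" entry is the site-wide default and is assumed to be present.
ERROR_PAGES = {
    "/": "/errors/404.html",
    "/investors/": "/errors/404-investors.html",
    "/investors/old/": "/errors/404-old-reports.html",
}


def pick_error_page(requested_path, pages=ERROR_PAGES):
    """Return the error page for the deepest directory containing the path."""
    matches = [directory for directory in pages if requested_path.startswith(directory)]
    # The longest matching prefix is the most specific directory;
    # fall back to the site-wide page when nothing matches.
    best = max(matches, key=len, default="/")
    return pages[best]
```

With this table, a request for a missing file under /investors/old/ gets the page written for old annual reports, while a missing file anywhere else on the site gets the general page.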

On a fast Net connection, the redirection can occur in an eye blink. With a redirected 404 page, the user may never see any kind of error message. But the alert Web surfer will notice that the URL has changed--which then implies that the content of the page has also changed since the time of the connecting link. While that provides a technique for the Webmaster to bring a dead link back to life, searchers must employ other methods.
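An automatic load of this kind is commonly done with a "refresh" meta tag in the error page, and a searcher who saves or views the source of such a page can read the destination directly. The following Python sketch pulls the delay and target out of a page's HTML. It assumes the redirect is written as a meta refresh tag (other redirect methods leave no such trace in the page), and the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Record the first <meta http-equiv="refresh"> that names a target URL."""

    def __init__(self):
        super().__init__()
        self.target = None  # becomes (delay, url) once a refresh tag is found

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; URL=http://www.name.org/news/".
        delay, _, rest = (attributes.get("content") or "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            url = rest[4:].strip().strip("'\"")
            self.target = (delay.strip(), url)


def find_refresh_target(html_text):
    """Return (delay, url) for a meta refresh in the page, or None."""
    finder = RefreshFinder()
    finder.feed(html_text)
    finder.close()
    return finder.target
```

The delay is returned as text exactly as the page gives it, so a value of "0" signals the eye-blink redirect that a user may never notice.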

RESURRECTING DEAD LINKS

For all the power of customized 404 files, there are still plenty of sites which fail to offer anything other than the default, black and white "File not Found" response. At that point, the user can try several techniques for finding that file or its successor.

The URL Slice

First of all, try the URL chopping approach. Simply cut off parts of the URL starting on the right-hand end and stopping at every forward slash. For example, if a URL such as http://www.name.org/directory/subdirectory/file.html has died, trim it down to http://www.name.org/directory/subdirectory/ and try again. If the subdirectory still has one or more files in it, they may include the missing file under a new name or at least some hints about where it may have moved. If that URL still gives a 404 error message, try http://www.name.org/directory/ and if that fails, just http://www.name.org/. At each step, look for pages or links with keywords from the dead page or a possible section of the site where the vanished content may now take up residence. Also look for any site-search capabilities and other site-navigational features at each of these levels.
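The chopping itself is mechanical enough to script. The short Python sketch below lists the successively shorter URLs to try, in the order described above; the function name url_slices is made up for illustration, and the sketch simply drops any query string or fragment on the assumption that they belong to the dead page.

```python
from urllib.parse import urlsplit, urlunsplit


def url_slices(url):
    """List the shorter URLs to try, cutting back one slash at a time."""
    parts = urlsplit(url)
    path = parts.path
    slices = []
    while path not in ("", "/"):
        # Remove the last file or directory name, keeping the trailing slash.
        path = path.rstrip("/")
        path = path[: path.rfind("/") + 1]
        slices.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return slices


for candidate in url_slices("http://www.name.org/directory/subdirectory/file.html"):
    print(candidate)
```

Each candidate still has to be visited and read by a person, since the point of the slice is to look for hints about where the content went, not just to find a URL that loads.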

The Google! Cache

If slicing and nosing around the site in general fails, try searching for a copy of the page in the past. Google! provides one of the easiest methods for past-life searching. Go to Google! (http://www.google.com), then try to find the page. Use a phrase search on the title, if known, or use several distinctive keywords. If you can find the page in Google!, look at the end of the extract and URL for the link to a "cached" copy. Clicking on the "cached" link should bring up a copy of the page as it appeared at the time that Google! indexed that page.

This technique will not always work. Google! may not have found the page. Or maybe it is one of the ones that Google! does not cache. Or the page uses too many calls for files on the remote server, thus making Google!'s cached version unusable. However, many times the old page can be found here with a bit of creative searching.

The Alexa Archive

For Alexa users, and those willing to download and install the free Alexa software (http://www.alexa.com/), there is another way to find a copy of a page now dead or just to find an older version of a page. Start up the Alexa program. Browse to where the dead page used to be. If a copy of the page is available in the Alexa archive, the archive button (on the right side of the bar, looks like a building) will be lit up. Click on the archive icon to retrieve the old page from the Alexa archive. Like the Google! cached pages, the Alexa archive has uses beyond just finding previously-live Web pages. Even for the living Web, these Web-page archives offer a great opportunity for seeing how pages have changed over time. Go to the current page in your browser. Then click the Alexa archive icon. Compare the dates of the two pages, if any dates are available, to see how the page has changed. Unfortunately, neither archive has more than one archived copy of old pages, and there is no way to control the date of the archived page.

The Final Search Engine

And what can you do when both Alexa and Google! fail to provide an older copy of the dead page? Other search engines may be able to provide at least a summary or extract of the page. Try a field search for the URL of the dead page on AltaVista, Northern Light, Infoseek, and Lycos. If the page is found on any of them, check the summary for additional information about the page and clues for tracking down a new home. Also use the link field searching capabilities of AltaVista, Infoseek, and Google! to see how other pages refer to the dead link. Note especially the anchor text that points to the dead link. The words within the hypertext link and those nearby may be able to provide additional information.
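Once a link search has turned up a page that still points to the dead URL, the anchor text can be collected by hand or with a small script. The Python sketch below gathers the text of every link on a page whose address exactly matches the dead URL. The class and function names are invented for illustration, and the exact-match comparison is a simplifying assumption: relative links and minor variations of the address are not recognized.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text of links whose href exactly matches one URL."""

    def __init__(self, dead_url):
        super().__init__()
        self.dead_url = dead_url
        self.texts = []
        self._buffer = None  # text pieces gathered while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.dead_url:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse line breaks and repeated spaces inside the link text.
            self.texts.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def anchor_texts(html_text, dead_url):
    """Return the anchor text of every link on the page that points to dead_url."""
    collector = AnchorTextCollector(dead_url)
    collector.feed(html_text)
    collector.close()
    return collector.texts
```

The words collected this way often make good search terms for finding the page under its new address.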

In the end, some dead links lead to completely dead pages. No copy remains available, and the information content of the page has disappeared. Understanding the nature of dead links and using some of the techniques described above can help the searcher find some additional information, but fugitive Web pages can be even more difficult to find than fugitive print documents.

Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.