On The Nets--Searching The World-Wide Web: Lycos, WebCrawler and More
by Greg R. Notess
ONLINE, July 1995
As in the print publishing world, the development of finding-aids
and indexes must wait for the development of the resources. When
anonymous FTP resources multiplied, archie appeared. With the growth
of gophers, veronica was born. The explosive growth of World-Wide
Web resources in the past year has inspired several contenders for the
title of "best Web search engine." The different keyword indexes of
Web resources feature a wide variety of search interfaces and
capabilities. No clear winner has emerged yet, and the diversity of
search engines and databases provides the information professional
with multiple choices.
There are many Web keyword indexes, but the best-known are:
* Lycos
* WebCrawler
* World-Wide Web Worm
* Harvest Broker
* CUI
Just as World-Wide Web clients can speak other protocols and
connect to gopher, telnet and FTP resources, some Web indexes include
more than just Web documents. Some of these search engines permit
Boolean searches and other sophisticated search options, but all suffer
from the problem of overload.
SYSTEM OVERLOAD
A major problem inherent with successful Internet keyword indexes
is that as soon as a particular search tool becomes useful and well-
known, it is flooded with users. This in turn makes it less dependable,
since the original server is unable to handle the increased load. This
happened with the first archie server at McGill University and then
with the first veronica server. For both archie and veronica, a partial
solution has been to divide the load by multiplying the servers. Many
archie servers on different continents now handle the thousands of
daily archie searches. The dispersion of veronica servers has occurred
along similar lines. This has been only a partially successful way of
dividing the load. As more servers are being set up
by generous hosts on the Net, Internet use is multiplying. The result is
that even with a dozen or more veronica servers, the load (determined
by the number of simultaneous search requests) is still too high. It is
not uncommon to try an archie or veronica search and get a failed
search response due to high system load.
The same situation occurs with Web finding-aids. When a particular
index establishes a reputation for successful searches, it attracts a
huge increase in traffic. Then users can no longer depend on that
resource and must look for an alternative. Most search options for the
Web have not yet resulted in a multiplication of servers, but that time
may soon arrive. Meanwhile, the different indexes provide alternatives
when a particular favorite is unavailable or unbearably slow.
LYCOS
Lycos, a project hosted by the computer science department at
Carnegie-Mellon University, is one of the best-known and most popular
indexing tools for the World-Wide Web. When Netscape Navigator was
first widely released in late 1994, the people at Netscape
Communications Corporation wisely set up a page that listed various
Internet search tools
(http://www.netscape.com/home/internet-search.html). In one quick and
dirty comparison, they ranked them based on the results from a simple
search on surf. Lycos retrieved the
most documents and therefore was the first of the listed Internet
search tools. Due to its prominence on the Netscape Internet Search
page, Lycos' load has increased so greatly that it can be difficult to
get any response at all.
Although the Lycos database is one of the largest finding-tools,
there are other reasons that Lycos searches result in a high number of
hits. A single-word search on Lycos defaults to automatic truncation,
so the search on surf also retrieves documents with surface. On
multiword searches, Lycos defaults to an OR operation. Although the
search results are ranked and give preference to records that have all
the search terms, this results in many irrelevant records.
In the Lycos technical documentation, the developers say, "We plan
to upgrade the search engine's language at some future point to
implement more standard Boolean operators. We will definitely
add...spelling correction and phonetic and semantic match capabilities."
Until that time, the efficiency of Lycos is severely limited. For single
keyword searches it works well, but multiple-word searches are not
as successful.
The current search engine has a few advanced features. While
truncation is the default, an exact search can be specified by adding a
period at the end of the search term. Also, preceding a search term
with a dash designates that term as a negative indicator. "Documents
containing that word have their match score reduced, but they may
still be retrieved if the other terms in your query are present." You can
use these two tools to obtain a more precise search. For example, the
search surf. -silicon would result primarily in records with the term
surf but not terms such as surface, and it would also mostly exclude
pages from Silicon Graphics about its "Silicon Surf" service.
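
The two modifiers described above can be illustrated with a minimal
Python sketch. It is not the Lycos implementation: the function names
and the one-point-per-term weights are assumptions made for the
example, and the actual Lycos ranking is not described here. The
sketch only shows how a trailing period turns off truncation and how a
leading dash lowers a document's score.

```python
import re

def parse_query(query):
    """Turn a query string into (term, exact, negative) tuples."""
    parsed = []
    for token in query.lower().split():
        negative = token.startswith("-")   # leading dash: negative term
        exact = token.endswith(".")        # trailing period: no truncation
        term = token.strip("-.")
        if term:
            parsed.append((term, exact, negative))
    return parsed

def score(document, query):
    """Score a document; the +1/-1 weights are illustrative assumptions."""
    words = re.findall(r"[a-z0-9]+", document.lower())
    total = 0
    for term, exact, negative in parse_query(query):
        if exact:
            hit = term in words                              # whole word only
        else:
            hit = any(word.startswith(term) for word in words)  # truncation
        if hit:
            total += -1 if negative else 1
    return total
```

With the query surf. -silicon, a page about surface tension scores
nothing because of the period, while a page mentioning both Silicon
and Surf has its score reduced rather than being excluded outright.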
The search options and database development for Lycos continue to
change. From the main Lycos home page (http://lycos.cs.cmu.edu/)
there may be several options (Figure 1). The databases may be
numbered Lycos1, Lycos2, Lycos2a or Lycos10. The actual designation
has changed over time. The Lycos page also offers a small Lycos
database and a big Lycos database. The smaller database is less likely
to be overwhelmed.
The output of a Lycos search can appear cryptic. At the top of the
search report is the number of documents found matching at least one
search term and a list of matching words. Each record includes the requisite
hypertext link to the found URL, but also includes a hypertext link to a
document with the record's ID number and weighted score. In addition,
the date of the document's last update in the Lycos database, the size
of the page in bytes, the number of links within the document, the
title, an outline and the search keys found in the document are listed.
With the default "verbose" display, the record also includes a
sometimes lengthy excerpt from the actual document.
WEBCRAWLER
WebCrawler, developed by Brian Pinkerton at the University of
Washington (http://webcrawler.cs.washington.edu), offers a much
simpler interface and provides results in an easy-to-browse,
single-line report. The database WebCrawler searches is not as large as the
Lycos database, but it is substantial nonetheless.
WebCrawler has a single line for entering the search statement
(Figure 2). For a multiword search, it defaults to a Boolean AND search.
Just uncheck the button below the search line to run an OR search.
There are no nesting or adjacency features. While there is no
truncation symbol, WebCrawler does automatically strip "endings" and
convert search terms to all lowercase. The example given in the
documentation is that "NeXT Computers becomes next computer." Based
on the samples I tried, "endings" appears to only refer to plurals,
either a final "s" or an "es," and not to other suffixes.
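
The normalization described above might look roughly like the
following Python sketch. The suffix rules are an assumption based only
on the behavior observed in the sample searches (lowercasing plus
removal of a final "s" or "es"); WebCrawler's real rules are not
documented here, and the function names are hypothetical.

```python
def strip_plural(word):
    """Remove a plural ending; the rules are a guess, not WebCrawler's."""
    if len(word) > 3 and word.endswith(("ses", "xes", "zes", "ches", "shes")):
        return word[:-2]                  # boxes -> box
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # computers -> computer
    return word

def normalize(query):
    """Lowercase the query and strip plural endings from each term."""
    return " ".join(strip_plural(word) for word in query.lower().split())
```

Applied to the documentation's own example, normalize("NeXT
Computers") gives next computer.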
While the options are limited, WebCrawler's Boolean capabilities
make it the first choice for a search needing an AND, at least until
Lycos develops its Boolean capabilities. Unfortunately, WebCrawler
can sometimes be as difficult to reach as Lycos. Once again, it is a
victim of its own success.
WWWW AND HARVEST
The World-Wide Web Worm (WWWW) indexes Web document titles
and embedded references to other Web resources. Thus it is a smaller
database than Lycos or WebCrawler, which also index parts of the full
text of the documents themselves. The Worm works well for those
familiar with UNIX and the egrep "regular expression." For example, OR
is designated with a pipe | symbol, and .* represents "any amount of
intervening text." WWWW has been widely used and can be even more
difficult to reach than Lycos or WebCrawler. However, it shows a
message saying that it will be moving to a larger machine soon, which
will allow more than the current maximum of 25 connections.
The Harvest Broker, using the Glimpse search engine, provides a
much fuller range of Boolean capabilities. This search option goes
under a variety of lengthy names: "Query Interface to the WWW Home
Pages Broker" or "The Harvest Information Discovery and Access
System." Both WAIS and Glimpse are used with Harvest, and the
Glimpse search at
http://harvest.cs.colorado.edu/Harvest/brokers/www-home-pages/
features full Boolean operators
with parenthetical nesting. Searches can be either case-sensitive or
case-insensitive, and truncation is only available as all or nothing.
Either the "Keywords match on word boundaries" box is checked, which
designates an exact search on the search terms, or it is not checked
and the engine truncates all search terms. In addition, the Glimpse
version of the Harvest Broker supports field searching of title, URL and
keywords.
While it is comforting to have some more standard Boolean
operations available, Harvest Broker has two major limitations. First,
it is confusing, starting with the lack of a distinct name. The
documentation describes Harvest as "an integrated set of tools to
gather, extract, organize, search, cache and replicate relevant
information across the Internet." Unfortunately, the tools are not
integrated well enough to make sense to most users. A bit more work
on the human-machine interface could improve Harvest Broker greatly.
Second, the database needs to be expanded greatly before Harvest
approaches the depth of coverage available from Lycos or the
WebCrawler.
CUI
Another smaller, more refined index option comes from the Centre
Universitaire d'Informatique (CUI). Its Web catalog derives from
several well-known listings of Web pages--NCSA's What's New pages,
CERN's Virtual Library Subject Catalog, Scott Yanoff's _Internet
Services List_, John December's _Computer-Mediated Communication
Information Sources_ and _Internet Tools Summary_, and a few others.
Searches can be based on PERL regular expressions. Like the WWW
Worm, | works for OR, and .* for AND (but terms must be in the
specified order).
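
A short Python sketch shows why the .* form of AND depends on term
order. Python's re module is used here as a stand-in; it is not
identical to egrep or PERL, but | and .* behave the same way in all
three. The helper names are hypothetical, and ignoring case is an
assumption rather than documented behavior of either index.

```python
import re

def matches(pattern, text):
    """True if the regular expression is found anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

def either_order(first, second):
    """Build a pattern that finds two plain words in either order."""
    return f"{first}.*{second}|{second}.*{first}"

titles = ["Library of Congress catalog", "Catalog of the library"]

# Only the first title matches, because the terms must appear in order.
ordered = [t for t in titles if matches("library.*catalog", t)]

# Both titles match once the reversed order is spelled out as well.
unordered = [t for t in titles if matches(either_order("library", "catalog"), t)]
```

Spelling out both orders with | is the usual workaround when the
order of the terms in the record cannot be predicted.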
CUI works well for finding major resources and for broad keyword
searches. The nature of the component databases can result in some
redundancy. The descriptions of the resources in the databases may be
brief or lengthy, so the success of a search is determined by how well
the source is described. Even so, it can help find the better known Web
resources.
DON'T FORGET VERONICA
The Web is rapidly replacing gopher as the standard Internet
publishing medium, yet even so, gophers offer many information
resources. Veronica should certainly remain in the arsenal of search
tools for a comprehensive search. As noted above, veronica servers are
often overburdened and too busy to respond to a new request. For this
reason, some hint about which servers are least busy can save a
considerable amount of time. At Washington & Lee University
(gopher://liberty.uc.wlu.edu:70/11/gophers/Veronica) the gopher
server does just that. Periodically, it automatically checks each of its
known veronica servers. Then it ranks the least busy servers first and
lists the servers which did not respond at all.
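
The ranking step can be sketched in a few lines of Python. How the
Washington & Lee gopher actually measures "busy" is not stated, so
this sketch assumes a response time in seconds for each server, with
None for a server that did not answer. The host names are
hypothetical.

```python
def rank_servers(timings):
    """Split servers into (fastest-first responders, non-responders).

    timings maps a host name to its response time in seconds,
    or to None when the server did not respond at all.
    """
    responding = {host: t for host, t in timings.items() if t is not None}
    ranked = sorted(responding, key=responding.get)
    silent = sorted(host for host, t in timings.items() if t is None)
    return ranked, silent

# Hypothetical measurements from one periodic check.
sample = {
    "veronica-a.example.org": 4.2,
    "veronica-b.example.org": None,
    "veronica-c.example.org": 0.9,
}
ranked, silent = rank_servers(sample)
```

Here ranked lists veronica-c before veronica-a, and silent holds the
one server that never answered.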
Veronica's strength is that the search statement can include
standard Boolean operators (AND, OR, NOT) and nested arguments
(designated by parentheses). The default operator on a multiword
search is AND. Veronica recognizes the asterisk as an end truncation
symbol. Veronica even supports limits. Searches can be limited by
gopher type--directory, text file, image, etc.
The major limitation with veronica is the database itself. While the
best Web-based finding aids index entire HTML documents and can
include gopher and FTP resources, veronica is limited to menu listings.
In addition, the menu listings may not make much sense out of the
context of the upper-level menu titles. With the capabilities of the
gopher+ protocol to invoke an external Web browser, some Web
documents are now included on gopher menus. However, only a few Web
documents are retrieved in a veronica search.
USE YAHOO FOR A SUBJECT APPROACH
While the keyword search of the search engines described
previously is a primary method for tracking down Internet resources,
using a classified or subject listing of resources can also be effective.
Just as there are numerous keyword search options, there are many
subject listings as well.
One of the best subject lists is Yahoo, available at Stanford and at a
mirror site from Netscape. Yahoo has a keyword search option for the
entries included in the subject listing. Although the database is small
compared with the other keyword search options, it presents very
clear options, including case-sensitive matching, either Boolean AND
or OR, and substring or complete word searches. Yahoo can be a good
source for finding the best-known resources. Also, Yahoo lists over 40
other Web indexes under http://akebono.stanford.edu/yahoo/Reference/Searching_the_Web/
for those trying for a comprehensive Internet search.
SEARCH MORE THAN ONE WITH CUSI
With so many keyword indexes to Internet resources, the next step
is to find a resource that searches all of them. CUSI (Configurable
Unified Search Engine) provides one form that can then search various
Web search engines. The advantage to the CUSI front end is that the
keywords only need to be entered once; then, one at a time, the search
can be sent to different indexes (Figure 3).
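
The front-end idea can be sketched in Python as a table of URL
templates filled in from a single set of keywords. This is not how
CUSI itself is written; the index names, hosts and parameter names
below are all hypothetical placeholders, not the addresses of any real
search engine.

```python
from urllib.parse import quote_plus

# Hypothetical query templates, one per index.
TEMPLATES = {
    "index-a": "http://index-a.example.org/search?query={terms}",
    "index-b": "http://index-b.example.org/cgi-bin/find?q={terms}&bool=and",
}

def build_requests(keywords, templates=TEMPLATES):
    """Encode the keywords once and fill them into every template."""
    encoded = quote_plus(keywords)
    return {name: url.format(terms=encoded) for name, url in templates.items()}

requests = build_requests("surf report")
```

Any option specific to one index, such as the bool=and parameter
assumed for index-b, has to be fixed in its template in this design.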
CUSI is one of the few Web index services that has multiple
servers. Start at the URL listed for CUSI in the sidebar. Then choose
the server closest to you. For Lynx users, most of the CUSI sites do not
work, so try the CUSI Radio Button version at http://www.scs.unr.edu/~cbmr/net/search/cusi-r.html.
CUSI includes search options for many different search engines in the following categories:
* World-Wide Web (WWW) Indices
* Other Internet Indices
* People and Organizations
* Bibliographic
* Computer and Network-Related
* Reference Works
CUSI also includes a link to a multithreaded query page from
http://www.sun.fi/mtq/mtquery.html. This runs simultaneous searches
in each of the selected indexes. While this option and the CUSI
approach seem like the answer to the often time-consuming process of
Internet searching, they can take just as long. One problem is that the
links to the other indexes may no longer be accurate. In addition, the
special features and check boxes that some keyword indexes have may
not be available from within the CUSI form. The output is determined
by the actual index, and therefore varies greatly in format.
COMMERCIAL FUTURE?
One possible solution to the overload problem is to limit the
number of users by charging for the service. The question becomes
whether a commercial entity can make enough profit from an index to
develop an easy-to-use, yet powerful interface to a well-maintained
database. At least one company is giving it a try.
InfoSeek Corporation may offer a glimpse of how online services of
the future may be configured. InfoSeek has a variety of indexes,
including one to WWW pages, the past four weeks of Usenet news,
Computer Select and wire services. It also offers a demonstration
database and a one-month free subscription. After the free trial,
customers can pay either $9.95/month for up to 100 transactions, and
$0.10 a transaction thereafter, or choose one of the other subscription
plans. InfoSeek offers some useful resources, but due to the way in
which many users search the Net, the fees could add up quickly. (_See
Greg's August DATABASE column for an in-depth review of InfoSeek
and its content.--NG_)
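
To see how quickly the fees could add up, the $9.95 plan quoted above
can be worked out with a small Python function. Only that one plan is
modeled; the function name is hypothetical, and amounts are kept in
whole cents to avoid rounding problems.

```python
def monthly_cost_cents(transactions):
    """Cost in cents under the $9.95 plan: 100 included, $0.10 each after."""
    if transactions < 0:
        raise ValueError("transactions cannot be negative")
    base_fee = 995          # $9.95 per month
    included = 100          # transactions covered by the base fee
    per_extra = 10          # $0.10 for each additional transaction
    return base_fee + max(0, transactions - included) * per_extra

# A searcher running 250 transactions in a month.
cost = monthly_cost_cents(250)
```

At 250 transactions the month costs 2495 cents, or $24.95, which is
two and a half times the base fee.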
The developers of these Web searching tools should be commended
for their hard work and creativity. However, what is needed in the
literature is a detailed comparison of the efficacy of the different
search options. Until there is a consensus on the best keyword indexing
of the Net, information professionals must choose their first try
carefully. For single keyword searches of a large database, use Lycos.
For multiword searches with an AND, try WebCrawler. For gopher
resources, try veronica. And for a time-consuming comprehensive
search, use CUSI.
Communications to the author should be addressed to Greg R. Notess,
Montana State University Libraries, Bozeman, MT 59717-0332,
406/994-6563, Internet--align@gemini.oscs.montana.edu;
http://notess.com.