
Greg R. Notess
Reference Librarian
Montana State University

on the net

AltaVista's International Mirrors

EContent, August 1999
Copyright © Online Inc.





Popular Internet databases, such as AltaVista, generate a great deal of traffic. One way to dissipate the traffic and provide backup copies of the database is to establish mirror sites. Placing these mirrors around the globe is intended to divide the traffic regionally. Theoretically, avoiding trans-oceanic connections should result in faster response time.

The principle of multiple mirror sites is to provide exact duplicate copies of the database accessible at a variety of access points around the world. In the pre-Web days of the Internet, popular FTP archives often had several mirror sites. One difficulty with mirror sites is maintaining the same database at multiple locations. Ideally, every time the main database is updated, all the mirrors are as well. With dynamically and frequently changing databases that are also extremely large, the problem requires careful planning and coordination. Alternatively, mirror sites might provide additional regional content.

AltaVista currently has five international mirror sites. For the searcher, the questions to be answered include: Are the databases all the same? Are the search features identical? And are there any advantages to using one of the mirror sites?

THE ALTAVISTA SEARCH NETWORK

AltaVista's main site (http://www.av.com, http://www.altavista.com, or the older http://altavista.digital.com) is in Palo Alto, California. To see the current mirror sites, click on the Our Network link at the very bottom of the page. It links to the AltaVista Search Network page (http://www.altavista.com/av/content/av_network.html).

Five mirror sites are identified on a map: AltaVista Canada, AltaVista Australia, AltaVista Asiawide, AltaVista Southern Europe, and AltaVista Latin America. The last two both point to the same URL and are served by the same AltaVista mirror--AltaVista Magallanes. So the map actually points to four mirrors. A fifth, AltaVista Deutschland (http://www.altavista.de), was announced in mid-March and finally appeared on the map in May. The Swedish AltaVista mirror site known as AltaVista Northern Europe, formerly run by Telia, is no more. It shut down in Autumn 1998. The AltaVista Search Network page also lists a section of "Sites Powered by AltaVista." However, those are not mirror sites, but other sites which draw on one of the AltaVista mirrors.

DATABASES COMPARED

AltaVista is more than a single database. The main U.S. version has directory categories on the opening page from its partnership with LookSmart. It also has other partner database connections that show up on simple search results: entries from RealNames, questions and answers from the Ask Jeeves database, ads, and some paid placement links.

The main AltaVista database of millions of indexed Web pages is probably the primary database in which searchers are interested. Is it exactly the same database at all the mirrors? Does AltaVista Australia have more Australian pages and does AltaVista Magallanes provide more comprehensive Spanish language coverage?

Unfortunately, this is one of those answers that may vary over time. After comparing the mirror sites several times over the past several weeks, I found that all except AltaVista Magallanes appear to be using identical databases, at least at this point in time. That may change by the time you read this, so be sure to do your own comparisons as well.

Some of the mirror sites default to a different database or a more limited version of the database. For example, AltaVista Deutschland defaults to a German language limit. Changing the default on the mirrors to search as broadly as possible helps in the comparisons. In April, all the mirrors found the same hits in the same order when given identical search requests. The only exception was AltaVista Magallanes, which found fewer hits. AltaVista Magallanes seemed to be especially weak in finding recent pages.
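A comparison of this kind can be sketched in a few lines of Python. The sketch is illustrative only: it assumes the hit lists have already been copied by hand from each mirror's results page, and the function name `compare_hits` and the example.com URLs are hypothetical placeholders, not actual search results.

```python
def compare_hits(main_hits, mirror_hits):
    """Compare two ranked hit lists gathered from the same search request.

    Returns (same_order, missing), where `missing` lists the hits the
    main site returned but the mirror did not, in the main site's order.
    """
    mirror_set = set(mirror_hits)
    missing = [url for url in main_hits if url not in mirror_set]
    return main_hits == mirror_hits, missing


# Hypothetical hit lists, typed in by hand for illustration only.
main = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
identical_mirror = list(main)
smaller_mirror = ["http://example.com/a", "http://example.com/c"]

print(compare_hits(main, identical_mirror))
print(compare_hits(main, smaller_mirror))
```

A mirror that returns the same hits in the same order is probably serving the same database; a non-empty `missing` list suggests an older or smaller copy.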

Were all the others as current as the U.S. version of AltaVista? It is difficult to directly compare the currentness of each of the AltaVista mirrors, since the records do not generally record the date of indexing. Looking at the date stamp on the file for each result (which is based on the reported date that the page was last changed at the time it was indexed by the AltaVista spider) shows some interesting results.

Since the date stamp on files is determined by the clock setting on the Web server, it is possible to have files stamped with future dates and times. On the main U.S. site, these future files are displayed with no date. However, the mirrors do display these fake dates, some of which are several decades in the future.
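Spotting such stamps is easy to automate. The following Python sketch assumes the date stamp is available in the format of an HTTP Last-Modified header, which is an assumption for illustration; the function name `is_future_stamp` and the sample dates are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_future_stamp(last_modified, now):
    """Return True when a Last-Modified style date lies after `now`."""
    stamp = parsedate_to_datetime(last_modified)
    if stamp.tzinfo is None:
        # Treat a stamp without zone information as UTC (an assumption).
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp > now


# A fixed "today" keeps the example reproducible.
today = datetime(1999, 5, 1, tzinfo=timezone.utc)

print(is_future_stamp("Sat, 01 Jan 2039 00:00:00 GMT", today))
print(is_future_stamp("Mon, 12 Apr 1999 08:30:00 GMT", today))
```

A result flagged this way says more about the clock on the originating Web server than about when the page actually changed.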

For several pages carrying the current day's date stamp, all the mirrors except AltaVista Magallanes found the same pages. This Southern Europe/Latin American AltaVista has proportionally fewer results for recently changed pages than do the other mirrors. In general, AltaVista Magallanes found between 70% and 95% of the number of hits found by the main U.S. site or the other mirrors. Even with a Spanish search term like publicant, the main AltaVista found 200 hits to only 140 from AltaVista Magallanes.
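The arithmetic behind that comparison is simple enough to check by hand, but a small helper makes repeated comparisons less error-prone. In this sketch the function name `coverage_percent` is hypothetical; the two hit counts are the ones reported above.

```python
def coverage_percent(mirror_hits, main_hits):
    """Express a mirror's hit count as a percentage of the main site's."""
    return 100.0 * mirror_hits / main_hits


# Hit counts for the Spanish search term reported in the text.
print(f"{coverage_percent(140, 200):.0f}%")
```

By this measure the Spanish-language example sits at the low end of the 70% to 95% range.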

Evidently, AltaVista Magallanes users have had an older and less complete database during this time, even though its main page proclaimed a "New Index: 180 million indexed pages" when the U.S. AltaVista appeared to be covering only about 130 million.

Specialized Databases

Of the other databases, RealNames and Ask Jeeves are accessible only from the main AltaVista site, and then only on its simple search. The paid links, identified in a special "AV Relevant Paid Links" box, only show up on the main AltaVista site at this time. The LookSmart subject directory is used by the U.S. AltaVista and AltaVista Australia, but not by any of the others. AltaVista Deutschland and AltaVista Canada have no subject directory. AltaVista Asiawide uses the Asia-oriented Skali directory, operated by the same company hosting the AltaVista mirror.

AltaVista also has a Usenet news database. This is available on all of the mirrors, except AltaVista Magallanes, and produces the same Usenet results on all those mirrors.

Beyond the subject directories and Usenet, some of these AltaVista mirrors offer other valuable databases. Currently, AltaVista Canada goes the furthest in this direction. It offers several database choices: Canada, News, Government, World, and Usenet. The last two are the standard AltaVista databases. The default choice, Canada, is a completely separate database. Built with a separate crawl of an AltaVista spider, AltaVista Canada also uses some proprietary software called CyberFence. It works with the AltaVista spider and creates a boundary for the spider around a chosen geographic area, in this case, Canada. Beyond the simplistic top-level domain limit of .ca, CyberFence helps identify other publicly accessible Web sites hosted within Canada. It includes .ca sites which may be hosted outside Canada and Canadian sites that may have a .com or .net top-level domain.

The AltaVista Canada News database is also produced with a separate crawl, using the AltaVista spider, but building another separate database. Covering over 300 Canadian news Web sites, the spider refreshes its index daily, providing far more current content than the general AltaVista database. The Government database is built in a similar fashion, focusing on Canadian federal and provincial government sites.

Some of the other mirrors may well follow AltaVista Canada's lead and provide important regional databases of their own. AltaVista Magallanes takes a slightly different approach. On first connecting, it prompts for a country and language. The language choice determines the interface language, not a default language search limit. The country apparently determines the ads and additional services available. Once chosen, a cookie is set for subsequent visits. To change the country choice or interface language, go to http://www.altavista.magallanes.net/jump.html. Eventually, depending on the choices, specialized databases for those countries in the language of choice could be provided, but I see no evidence of any as of Spring 1999.

SEARCH FEATURES

Not only do the available databases vary, the way each of the mirrors searches can differ as well. All have both the simple search form and the advanced search option. In general, both the simple and advanced searches work the same way. Use the + or - symbols in the simple search and full Boolean operators in the advanced. Phrase and field searching works in both. The mirror sites all use this same search syntax. Note that even on the non-English language sites, Boolean operators and field names must be in English.

But what happens with multiple words entered without any +, -, or phrase markings? On the main AltaVista when using a simple search on multiple words, AltaVista first tries to match a known phrase and only if that fails will it process an implied Boolean OR. Most of the mirror sites do the same. However, neither AltaVista Canada nor AltaVista Magallanes tries the phrase match at all. Instead, both default to an automatic OR. Fortunately, you can always force a phrase search with the "double quotes" to identify the phrase. Only the main U.S. AltaVista site also provides other suggested phrase searches.
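To keep a simple search unambiguous across mirrors, it helps to mark every term explicitly. The Python sketch below assembles a simple-search statement from the +, -, and double-quote conventions described above; the function name `simple_query` and its parameter names are hypothetical, and it only builds the text to type into the search box.

```python
def simple_query(required=(), excluded=(), phrases=()):
    """Build a simple-search statement with explicit markings.

    required -> words prefixed with +
    excluded -> words prefixed with -
    phrases  -> multi-word phrases wrapped in double quotes
    """
    parts = [f"+{word}" for word in required]
    parts += [f"-{word}" for word in excluded]
    parts += [f'"{phrase}"' for phrase in phrases]
    return " ".join(parts)


print(simple_query(required=["mirror"], excluded=["ftp"],
                   phrases=["search engine"]))
```

Because the phrase is quoted explicitly, the statement does not depend on whether a given mirror attempts its own phrase match before falling back to OR.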

Diacritics

Given the worldwide scope of the mirror network, the way they handle non-English language searching is an important feature. In general, AltaVista does an excellent job of handling diacritics in searching. Enter a letter with an umlaut or tilde or other diacritic mark into an AltaVista search statement, and it will find matches that contain the specified international character. It uses the ISO Latin-1 character set. If no diacritics are used, words with diacritics may be found as well. The example given at http://www.altavistacanada.com/en/help_general_characters.htm is éléphant, which will only find the French form of the word, while elephant will find both. See the full help page for all the special character mappings.

Yet, even though most AltaVista mirrors support these ISO Latin-1 multinational characters, AltaVista Australia does not. Trying a search on ordnungsgemäßen or éléphant on the Australian mirror results in zero hits. Check a term both ways and on a couple of mirrors if one appears not to be working accurately.

If the mapping works correctly, searching without diacritics should find all pages containing the word with or without the diacritics, based on AltaVista's explanation of its mapping of the multinational characters. Unfortunately, the mapping does not always work accurately. Searching for either +éléphant -elephant or +ordnungsgemäßen -ordnungsgemassen should find zero hits if the mapping were always accurate. Yet both found hundreds of records. Flipping the search statement also retrieved several hundred. So a comprehensive search for a term with diacritics should OR the two versions of the term.
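Building the ORed statement by hand is error-prone for long terms, so here is a minimal Python sketch of the idea. It is not AltaVista's own mapping: it strips accents with Unicode decomposition and handles ß through a small hand-written table, and the names `strip_diacritics`, `or_query`, and `SPECIAL` are hypothetical.

```python
import unicodedata

# Letters with no accent to strip; this one-entry table is an assumption
# for illustration, not AltaVista's published character mapping.
SPECIAL = {"ß": "ss"}


def strip_diacritics(term):
    """Return the term with accents removed and special letters expanded."""
    plain = []
    for ch in term:
        if ch in SPECIAL:
            plain.append(SPECIAL[ch])
            continue
        # NFD splits a letter such as "é" into "e" plus a combining accent.
        decomposed = unicodedata.normalize("NFD", ch)
        plain.append("".join(c for c in decomposed
                             if not unicodedata.combining(c)))
    return "".join(plain)


def or_query(term):
    """OR the accented and unaccented forms for an advanced search."""
    plain = strip_diacritics(term)
    return term if plain == term else f"{term} OR {plain}"


print(or_query("éléphant"))
print(or_query("ordnungsgemäßen"))
```

Terms without diacritics pass through unchanged, so the helper can be applied to every word in a search without special-casing.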

Character Sets

And what about non-Latin-1 characters? Need to search for Cyrillic, Chinese, or Hebrew? For these and other character sets, look for the Set your Preferences option at the bottom of the AltaVista screens. Use of these character sets will require a computer capable of entering characters in the specified encoding. The main AltaVista, AltaVista Australia, and AltaVista Magallanes all offer the following choices of 20 character sets:

Western European (ISO-8859-1)
Western European (Windows-1252)
Chinese (Big5)
Chinese (GB)
Japanese (Shift-JIS)
Japanese (EUC)
Japanese (Auto-detect)
Korean (KSC)
Central European (ISO-8859-2)
Central European (Windows-1250)
Cyrillic (Windows-1251)
Cyrillic (KOI8-R)
Cyrillic (ISO-8859-5)
Greek (ISO-8859-7)
Greek (Windows-1253)
Hebrew (ISO-8859-8)
Hebrew (Windows-1255)
Baltic Languages (ISO-8859-4)
Baltic Languages (Windows-1257)

For all except AltaVista Canada, add the av/oneweb/ subdirectory to the root URL for use of Chinese, Japanese, or Korean. Up until May 1999, AltaVista Asiawide offered the complete character sets plus Arabic and Vietnamese, so look for these to possibly reappear sometime.
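Why the choice of character set matters becomes clear when the same word is encoded three ways. The Python sketch below uses the standard codecs for three of the Cyrillic sets listed above; the sample word and the helper name `encode_term` are illustrative assumptions, and how AltaVista itself processes the bytes is not shown here.

```python
# Python codec names for three of the Cyrillic character sets listed above.
CYRILLIC_CODECS = {
    "Windows-1251": "cp1251",
    "KOI8-R": "koi8_r",
    "ISO-8859-5": "iso8859_5",
}


def encode_term(term, charset):
    """Encode a search term as bytes in the named character set."""
    return term.encode(CYRILLIC_CODECS[charset])


term = "поиск"  # sample word meaning "search", chosen for illustration
for charset in CYRILLIC_CODECS:
    print(charset, encode_term(term, charset).hex(" "))
```

Each character set yields a different byte sequence for the same five letters, which is why the preference setting has to match the encoding the computer actually uses to enter the search.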

WHEN TO USE A MIRROR

So are any of these mirrors worth the effort when anyone on the Internet can get to the main AltaVista site? Yes, for certain searches and groups of users. Unfortunately, it is not as easy as saying that you should use the closest (geographically) mirror. There is no apparent difference in the relevance sorting. One might expect that AltaVista Canada would present Canadian sites as more relevant or that AltaVista Asiawide would prefer Asian sites. However, at this point, it seems that the same relevance ranking algorithms are used on all of them.

For Canadian topics, AltaVista Canada's Canada, News, and Government databases are quite useful and can find additional pages not available in the main AltaVista database. The Canada and News databases should probably be the first places to look in the AltaVista network for any Canadian topic, with a supplemental check in the primary AltaVista database (identified at AltaVista Canada as "World").

The main U.S. AltaVista is a must for anyone wishing to see the RealNames, Ask Jeeves, or paid placement links. It is also the only one in the network that recommends more specific phrase searches (in the simple search only). It should also be used (at least occasionally) to verify that a mirror site continues to have an up-to-date version of the database. Any new features will likely be seen here first as well.

Use an AltaVista with a Set your Preferences option to search in Greek, Cyrillic, Korean, or any of the other available character sets. When helping a user who might prefer the search interface in Spanish, Portuguese, Brazilian Portuguese, Italian, or French, the AltaVista Magallanes mirror can prove its worth. Just warn the user that the database may be an older version than the other AltaVistas. If the choices for country and language are not presented at first, go to http://www.altavista.magallanes.net/jump.html. German-speaking users should find the same advantage with AltaVista Deutschland without the limitations of a different database.

Another advantage in using a regional AltaVista is that the advertisements may prove more relevant and targeted. Although AltaVista Australia has no other obvious advantages, when searching an Australian topic, some of the ads may prove themselves more relevant to the search than the search results themselves.

Do not discount the value of local and regional advertisements. While it can become easy, even convenient, to ignore the banner ads, there are times when the ads can provide very relevant links. This can be especially true when the ads displayed are connected to the search terms.

While this column explores the variations between the AltaVista Network mirrors, bear in mind that the Web search engines (especially AltaVista) can and do change quickly and frequently. So while at the time of this writing only AltaVista Magallanes has a different database, it might be updated, and another mirror might fall behind, by the time you use it. The savvy searcher must constantly analyze search output to try to determine how a particular Web search engine is behaving on that particular day.


AltaVista Mirror Comparison Summary

AltaVista Main (http://www.av.com)
Same database? Original
Other databases: LookSmart, RealNames, Ask Jeeves, Usenet
Defaults: Phrase, OR
Diacritics & character sets: Diacritics & 20 character sets
Use for: RealNames, Ask Jeeves, phrase suggestions, and U.S.-oriented ads

AltaVista Canada (http://www.altavistacanada.com)
Same database? Yes
Other databases: Canada, News, Government, Usenet
Defaults: OR, Canada database
Diacritics & character sets: Diacritics
Use for: specialized databases, better coverage of Canadian content & local ads & news

AltaVista Australia (http://www.yellowpages.com.au)
Same database? Yes
Other databases: LookSmart, Usenet
Defaults: Phrase, OR
Diacritics & character sets: No diacritics but 20 character sets
Use for: local ads

AltaVista Asiawide (http://www.av.com)
Same database? Yes
Other databases: Skali, Usenet
Defaults: Phrase, OR
Diacritics & character sets: Diacritics & CJK character sets
Use for: regional directory, and other character sets and Asian ads & news

AltaVista Magallanes (http://www.altavista.magallanes.net)
Same database? No
Other databases: None yet
Defaults: OR
Diacritics & character sets: Diacritics & 20 character sets
Use for: Spanish, etc. language interface and local ads

AltaVista Deutschland (http://www.altavista.de)
Same database? Yes
Other databases: Usenet
Defaults: Phrase, OR, German language limit
Diacritics & character sets: Diacritics & CJK character sets
Use for: German language interface and German ads & news


Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com ; http://www.notess.com.