alog, an intranet without search capabilities is just a pile of information, no good for the people it's intended to serve.
That goes double if part of your intranet is visible to the public. If potential customers can't find what they want on your site -- and find it fast -- they will quickly click away to some site that's more hospitable, like your competition's, for example.
The brain behind making your intranet searchable is its search engine, which has two main jobs. First, the search engine must read all the parts of your intranet that you want to be searchable. These might include within-company Web sites, outside-company Web sites, networks, servers, and even desktop hard drives. Second, the search engine must handle queries from users, interpreting what the user is looking for, comparing that to the index, and presenting the most likely hits.
From speaking to the human brains behind many of the most widely used search engines, it is clear that there are many strategies for creating such engines for intranets. Engine designers may emphasize any of a host of features including speed, compactness of index, and ability to distribute index or query functions, handle multiple languages, understand linguistic nuances, and manage gargantuan amounts of data. To choose the right search engine for your intranet, it is crucial to understand your searching needs and then match them to the engine features. And note that many search engines tout the number of formats they can handle. Take this number with a grain of salt because you can count formats in a number of ways, and you can be sure that engine vendors are counting them in the way that makes their product sound better.
Verity Knowledge Base Networks
Verity licenses from Xerox PARC the linguistic prowess of LinguistX as part of its Search'97 engine (see the sidebar "LinguistX and Computational Morphology"). But this is not the only search tool Verity uses, notes product manager Nick Arnette. Many algorithms, including Boolean search, proximity search, fuzzy logic, Soundex (to handle homophones or misspelled words), thesauri, and stemming tools, all work on input text. A hit that matches on several of these engines rises higher in the list of likely results. A bonus: Verity can plug in new algorithms as they come along.
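To make one of those algorithms concrete, here is a minimal sketch of classic American Soundex in Python. It is a textbook version, not Verity's code, and the function name is our own. Names that sound alike collapse to the same four-character key, which is how a misspelled query can still find its target.

```python
def soundex(word: str) -> str:
    """Return the classic four-character Soundex key for a word."""
    # Consonant groups share one digit; vowels, H, W, and Y have no code.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to four characters


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft"):
        print(name, soundex(name))
```

Here "Robert" and "Rupert" share a key, so a search on one spelling can surface the other.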
Verity differs from other search engines with its Knowledge Base Network. It captures evidence in the engine and gets smarter as it goes along. After a while, "Bill Gates" and "Windows 95" become associated with "Microsoft," so searches on one bring up the other as a possible hit. Verity also uses what it calls an open gateway approach to input: Its engine will look on the Web, in databases, and on networks to feed the search.
Text does not live by HTML alone. To handle non-Web documents that live in databases and on networks, Verity uses a variety of filters to handle the multiplicity of popular formats. These filters are licensed from Inso, the big Gorgonzola of filters (especially after its acquisitions of Systems Compatibility Corporation, ImageMark Software Labs, and Electronic Book Technologies), and they also make use of technology licensed from MasterSoft. You can write your own custom filters for unsupported formats.
One common-sense feature is that location of a word influences its weight. A word in a title is a bigger hit than a word in a paragraph (meaning, for example, that this article would receive a pretty low hit score if you did a search for the term "intranet searching").
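Location weighting of this kind can be sketched in a few lines of Python. The field names and the weights (3 for a title, 1 for body text) are illustrative assumptions, not Verity's actual values.

```python
# Hypothetical weights: a title match counts three times a body match.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def score(doc: dict, term: str) -> float:
    """Sum the occurrences of term in each field, scaled by field weight."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


if __name__ == "__main__":
    in_title = {"title": "Intranet searching", "body": "a survey of tools"}
    in_body = {"title": "A survey of tools", "body": "intranet tools for intranet users"}
    print(score(in_title, "intranet"), score(in_body, "intranet"))
```

With these weights, one title occurrence outscores two occurrences buried in the body.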
Besides the indexing itself, a search engine needs a spider, a program that traverses Web links efficiently to feed text to the search engine. Verity's spider is centralized and offers many controls. For example, you can control its speed, whether or not to respect the robots.txt file at each site (which allows, disallows, or controls spider access), ignore certain formats, search only certain domains, and search by fields. It also offers agents: subprograms that will watch info as it gets indexed (even from live news feeds) and route it to satisfy a query.
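For readers who have not met robots.txt, the check a well-behaved spider makes before fetching a page looks roughly like this. The sketch uses Python's standard urllib.robotparser module; the sample robots.txt rules, the agent name, and the intranet.example URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt: every spider is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/public/index.html", "/private/plan.html"):
        url = "http://intranet.example" + path
        print(url, allowed(ROBOTS_TXT, "example-spider", url))
```

A spider that ignores the file simply skips this check, which is why the option to respect it matters.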
Verity's Search'97 is a very scalable engine, handling 20 GB of text with excellent performance, according to Arnette. That scalability extends to the hardware platforms it runs on, from desktops (costing less than $100) to servers ($20,000-$30,000 range). Verity works on 18 operating systems, including the Windows family, Macintosh, and Unix flavors.
With its low per-seat cost, Search'97 offers a lot, especially for the smaller intranet or those just getting their feet wet. With its ability to search documents in many formats, availability on many platforms, and talented spider, Verity's engine could match your intranet search needs.
Livelink Intranet: Corporate Choice
Open Text's Livelink Intranet squarely targets intranets, especially on the large corporate level, according to product manager Kevin Weatherston. A customer list that includes Ford and Siemens-Nixdorf backs that up.
Livelink Intranet is a Web-based product with several parts (see the figure "How Livelink Searches"). Livelink Spider crawls the Web from a central server for text to process. You can control which sites to crawl. It's a gentle spider that conforms to robots.txt protocols and doesn't beat relentlessly on servers but waits between accesses. It also pays attention to when Web pages change and rescans a page only when it changes: That saves time and processing.
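How Livelink Spider detects that a page has changed is not something Open Text spelled out for us. One common approach, assumed here purely for illustration, is to remember a digest of each page's content and reprocess the page only when the digest differs. The class and method names below are our own.

```python
import hashlib


class ChangeTracker:
    """Remember a content digest per URL and report when it changes."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def has_changed(self, url: str, content: bytes) -> bool:
        """Return True on first sight of url or when its content differs."""
        digest = hashlib.sha256(content).hexdigest()
        if self._digests.get(url) == digest:
            return False          # same bytes as last time: skip reindexing
        self._digests[url] = digest
        return True


if __name__ == "__main__":
    tracker = ChangeTracker()
    page = "http://intranet.example/news.html"
    print(tracker.has_changed(page, b"<h1>Monday</h1>"))   # first visit
    print(tracker.has_changed(page, b"<h1>Monday</h1>"))   # unchanged
    print(tracker.has_changed(page, b"<h1>Tuesday</h1>"))  # edited
```

The saving comes from the middle case: an unchanged page costs one comparison instead of a full pass through the indexer.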
Livelink Index is a phrase-based engine that actually recognizes and returns full phrases -- don't worry, individual words count as phrases, too. It is also aware of Web-file structure, so you can search in a headline only rather than in all the text. Location within the structure influences relevance ranking, as do number of occurrences and other factors. Filters handle documents in other typical formats like Word, WordPerfect, PDF, and Excel, although not yet with awareness of internal structure. With a simple change of its "language definition," you can perform searches in many different languages, including English, German, French, Korean, and some versions of Japanese. Naturally, this simplifies matters for multinational organizations.
Livelink Search handles queries from users. Ordinary Web browsers access Search, which queries the index and formats and returns hits.
Livelink Intranet is built to take advantage of the speed of 64-bit UltraSparc chips. That's especially useful when wrangling multigigabyte databases. Supported platforms include Sun with Solaris, Hewlett-Packard with HP/UX, and Intel x86 with Windows NT. A stand-alone search system will cost in the $12,000 range, but prices generally depend on special company needs.
Open Text's Livelink Intranet is suited for large corporations that don't mind heavy hardware demands. Its multilanguage and distributed capabilities make it especially useful for multinationals.
CyberSearch and Secure Searching
Frontier Technologies' CyberSearch searches the Web, newsgroups, local hard drives, LANs, or the Internet for Web pages, documents, spreadsheets, and ordinary files that may be in different locations. It also uses the latest security technology based on public-key certificates, according to Ray Langford, engineering manager for advanced products.
CyberSearch can keep track of when things change and update its index. It also does metasearching: setting up indexes of indexes so that instead of searching for the item itself, it checks the index where that item should be indexed. It can even metasearch one or more of the major on-line search engines (AltaVista, Excite, Infoseek, Lycos, WebCrawler, Yahoo).
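Once a metasearcher has hit lists back from several engines, it has to fold them into one. Frontier did not describe its merging rule, so the Python sketch below assumes a simple one: each engine contributes 1/rank for every hit it returns, and the totals decide the final order. The engine names and URLs are placeholders.

```python
from collections import defaultdict


def merge_results(result_lists: dict[str, list[str]]) -> list[str]:
    """Merge ranked hit lists; hits found by more engines score higher."""
    scores: dict[str, float] = defaultdict(float)
    for hits in result_lists.values():
        for rank, url in enumerate(hits, start=1):
            scores[url] += 1.0 / rank      # top hit adds 1, second adds 1/2
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    merged = merge_results({
        "engine_a": ["x.html", "y.html"],
        "engine_b": ["y.html", "z.html"],
    })
    print(merged)
```

In this example y.html wins because both engines returned it, even though only one ranked it first.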
CyberSearch uses multiple indexing algorithms, such as proximity, dictionary and thesaurus, position, stemming, field, and concept. The usual filters let CyberSearch handle HTML, Word, Excel, Access, 1-2-3, WordPerfect, Quattro Pro, and even news feeds (if using Frontier's News Server).
Plus, it can use the internal structures of documents to score hits. It orders what it finds by relevancy and can abstract documents by relevant keywords.
Assistants automate searching with user-defined sets of query terms and locations to search. CyberSearch works in the background -- periodically or on command -- to return hits, whether Internet or intranet. CyberSearch also provides document monitoring to notify the user when a link or an item changes.
CyberSearch's client/server architecture is scalable. You can use CyberSearch at an individual PC or install the NT server component on your Web server to tap its power for maximum efficacy: Clients can share their indexed collections. A step-by-step wizard walks you through the setup process.
Although strictly for Windows platforms (Windows 3.x, 95, or NT client, Windows NT server), CyberSearch can get its information from non-Windows servers. Its low price of $99 per client makes it especially attractive to small and midsize operations.
Frontier Technologies' CyberSearch has a low price and Windows orientation going for it. Its security features and automation capabilities may also satisfy your intranet search requirements.
Ultraseek Server and Keeping Track of Content
Infoseek's Ultraseek Server is a by-product of Infoseek, the Web search service. As might be expected, it shares much of its big sibling's prowess, scaled down to intranet size.
Its spider is particularly shrewd. It can run in either a centralized or decentralized environment. Also, it keeps track of when items change and rescans them only then, as do some other spiders. But it goes one step further by keeping track of how often each item changes and anticipating when it should check the item next. Since many Web pages are in fact updated by their owners on a scheduled basis, Ultraseek Server catches on to the pattern and does not waste time updating when it's unnecessary. That's smart, and it saves processing time.
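Infoseek did not tell us the rule Ultraseek Server uses to predict the next check, so the following Python sketch is only one plausible scheme: halve the revisit interval whenever a page turns out to have changed, double it whenever it has not, and keep the result within bounds. The function name and the bounds (one hour to 30 days) are assumptions.

```python
def next_interval(current_hours: float, changed: bool,
                  shortest: float = 1.0, longest: float = 24.0 * 30) -> float:
    """Return the hours to wait before the next visit to a page."""
    # Changed pages get checked sooner next time; stable pages less often.
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(shortest, min(longest, proposed))


if __name__ == "__main__":
    interval = 24.0
    # A page that changed on two visits and then stayed the same once.
    for changed in (True, True, False):
        interval = next_interval(interval, changed)
        print(interval)
```

Over repeated visits the interval settles near the page's real update rhythm, which is the behavior the spider is credited with.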
The Ultraseek spidering process is very polite, obeys all protocols, and does not swamp servers with inquiries. Patient, too: If the server is busy, the spider will wait an hour before trying again, notes Andy Feit, director of intranet product marketing. The spider can be configured on a host basis or by URL, can specify a user agent, and can even handle password-protected sites.
Steven Kirsch, CEO, points out that Ultraseek Server was designed to handle large collections fast -- no surprise when you're used to handling the entire Web. The search engine is statistically based and employs a variety of tools including natural language, Boolean, proximity, document structure, and fields. Like some other search engines, it should incorporate Inso technology for handling items in non-Web formats by the time you read this.
The index itself is centralized, but the spider can be either centralized or decentralized. With automatic operation as a primary goal, including the automatic updating mentioned earlier, Ultraseek Server requires no full-time administrator (see the screen). It runs on Solaris systems, but the NT version should be out by now.
Again, considering its heritage, it's no surprise that Ultraseek Server is highly scalable. Infoseek claims to be the only vendor capable of handling Microsoft's collection of a million documents. Pricing is also scalable, essentially by the number of documents, ranging from $995 for 1000 documents to $35,000 for 100,000 documents and so on.
Infoseek's Ultraseek Server is a complete package; it is suitable for big sites, and it has low overhead.
HotBot and the NOW Generation
Inktomi's HotBot powers the HotWired Internet search site and its own site search service. Its architecture is unique in that it uses "network of workstations" (NOW) parallel computing technology that Inktomi developed. NOW uses clusters of commodity workstations (like Intel Pentium Pro-based servers) and high-speed LANs to achieve supercomputer-class performance. There are several advantages to this type of architecture. First, it is highly scalable since you choose how many workstations or disk drives, or how much memory, you want to use. Second, it is economical since the workstations don't cost as much as supercomputers. Further, you don't have to constantly upgrade your servers to keep up with growth. Finally, it is fast, thanks to some proprietary Inktomi software.
This software implements advanced multithreading across the NOW cluster. This is a parallel-processing technology that allows each processor to optimize and manage over a thousand network operations simultaneously. In a network-centric environment, this provides extreme performance. It also protects the system from delays or outages caused by individual portions of the network by balancing the load across the cluster. This cluster of parallel workstations thus has advantages over the symmetric multiprocessing architectures that other search products use. The cost-effectiveness is especially attractive to the bean counters, and the scalability should keep the administrator's back user-free.
The architecture has a downside, namely complexity. Besides routinely operating a horde of workstations, network switches, and disks, it must also monitor -- and bypass -- any failed components. Luckily, the complexity is nothing the user has to face. What HotBot needs, it has: It's in there.
The HotBot spider (the company name derives from a mythological spider of the Plains Indians) uses Inktomi's SmartCrawl technology to intelligently refresh its index. As usual, it crawls by visiting pages and following links from each page, but it is very efficient. Its networked architecture allows it to maintain enough simultaneous links to the Web to crawl up to 10 million Web documents per day (see the figure "How SmartCrawl Searches").
The spider is courteous to sites that don't want to be crawled. Nor does it index pages that require passwords to access them. But as long as the site has not requested that robots not crawl it, HotBot will index the site's pages on all servers. HotBot also does not burden any one site with its attention. It may index up to a few hundred pages of any one site within the first 24 hours. The next day it will revisit the site for any remaining pages, and so forth.
The HotBot technology is currently a centralized function, which is no surprise given the clustered network approach inherent in the architecture. There are plans to decentralize the technology: a "divide and conquer" approach that other engines have used to advantage.
Currently, HotBot is not capable of searching for non-Web documents, so access to ordinary documents on networks and individual machines is out. But if the information is on a Web page, HotBot can find it because it gulps the entire Web document. HotBot deals with words and phrases handily. And just because it is limited to Web documents, don't conclude that it speaks only HTML. Besides text, HotBot can also search for various media types, including images, Java applets, any file extension (e.g., GIF, JPG), Shockwave, Virtual Reality Modeling Language (VRML), audio, and video.
If your needs require searching non-Web documents, or items on networks or individual machines, you'll have to look elsewhere. But for fast and scalable searching with conventional equipment (used in conventional ways), HotBot is a good choice at a good price.
Lycos and Mega Queries
Originally developed at Carnegie Mellon University, Lycos is one of the top search services on the Web. Now you can use it as the search engine for your intranet.
The Lycos spider knows some clever tricks. First, it can crawl either Web sites or file systems. You have many choices of crawl modes, and it respects the robots.txt protocol. As Sangam Pant, Lycos vice president of engineering, notes, the spider also performs link analysis as it crawls. This helps it crawl the most popular pages first: If many pages point to one page, that page seems more important, and the spider will crawl that one sooner. The spider also downloads the context of the page, which aids in building the index.
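The idea behind crawling popular pages first is easy to sketch in Python: count how many distinct pages link to each page, then visit the most linked-to pages first. This is a bare illustration of the principle, not Lycos's algorithm; the function name and the toy link graph are ours.

```python
from collections import Counter


def crawl_order(links: dict[str, list[str]]) -> list[str]:
    """Order pages by inbound-link count, most linked-to first."""
    inbound: Counter = Counter()
    pages = set(links)
    for source, targets in links.items():
        for target in set(targets):        # count each linking page once
            if target != source:           # ignore links a page makes to itself
                inbound[target] += 1
                pages.add(target)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(pages, key=lambda page: (-inbound[page], page))


if __name__ == "__main__":
    graph = {
        "a.html": ["c.html"],
        "b.html": ["c.html", "a.html"],
        "c.html": [],
    }
    print(crawl_order(graph))
```

Here c.html goes to the front of the queue because two pages point to it.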
The index can span multiple servers and files. It can also replicate indices to improve safety (in case one copy gets clobbered) and speed (to support searches on multiple copies). The indexer looks for keywords within a page, which improves the relevance score. It also keeps track of statistical properties like frequency of occurrence: If "HTML" appears in every document, it might not be so important. The indexer keeps track of phrases, as well as the location of a word on a page; words in titles, beginnings, and endings are more important.
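The "HTML in every document" observation is the intuition behind inverse document frequency, a standard statistic in text retrieval. The Python sketch below uses a smoothed textbook form of it; we are not claiming this is the formula Lycos uses.

```python
import math


def inverse_document_frequency(term: str, documents: list[set[str]]) -> float:
    """Score a term lower the more documents it appears in."""
    containing = sum(1 for words in documents if term in words)
    # Smoothed form: a term found in every document scores exactly 0.
    return math.log((1 + len(documents)) / (1 + containing))


if __name__ == "__main__":
    docs = [
        {"html", "search"},
        {"html", "spider"},
        {"html"},
    ]
    for term in ("html", "spider"):
        print(term, inverse_document_frequency(term, docs))
```

A term like "html" that appears everywhere contributes nothing to relevance, while a rarer term such as "spider" carries real weight.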
Surprisingly, the indexer can also index sound and graphics. How? No, not by listening to or viewing the items but by analyzing the text (captions, titles, lyrics) surrounding the items. This text, analyzed statistically, gives important clues about the sound and graphics content and a hook to hang a multimedia search on.
The indexer can handle 20 to 30 GB of information without problems. Then the indexer turns it over to the search engine, which essentially turns things inside out for searching. As you'd expect from its big sibling, the search engine is tuned for speed, managing vast amounts of data and handling huge numbers of queries without breaking a sweat.
Lycos licenses parts of its technology on an OEM basis. For example, the Lycos spider is part of several Inmagic (Woburn, MA) products. With Lycos components, a vendor can create products to address special needs and niches.
Some people might think having their own Lycos is overkill. But Pant points out that "most intranets start small -- you could just grep 100 documents -- but then grow rapidly" in the number of documents, the complexity of documents, and the number of queries. Lycos can handle the speed, relevancy of returns, and complexity.
Lycos runs on high-end Sun workstations and Alpha Unix systems, at prices that depend on the components included. But Lycos, for a flat price, also runs on small to midlevel systems running NT and some flavors of Unix.
Lycos's prowess is based on its ability to handle large collections of data and to do high-speed searching. Add multimedia capabilities and you have an impressive package.
AltaVista: Your Search, Your Way
As with Lycos, the AltaVista Web search service is now available for searching your intranet. AltaVista Search Intranet Private Extension (no one ever accused Digital marketing of cutesy names) offers a set of C libraries that extends AltaVista's speed and capacity to intranets. There are also optional developer's kits for integrating standard databases like Ingres, Oracle, and Sybase.
Given its speed and power, the AltaVista crawler is designed primarily to not bring down the server it's crawling. You can give it multiple pages to start crawling from. It follows rules you establish for its dealings with sites (including the all-important robots.txt protocol), and it can run in automatic mode.
Its indexing engine can plow through data at about a gigabyte an hour, according to Bob Lehmenkuler, AltaVista Search marketing manager. It indexes every word and number from the crawler, creating a statistical model of the content. It throws nothing away, yet its "inverted text index" occupies only 10 to 30 percent of the original content. Its indexing handles fields, like host name or URL, and it is language-independent.
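An inverted text index, in its simplest form, maps each word to the set of documents that contain it, so a query becomes a set intersection rather than a scan of every file. The Python sketch below shows only that bare idea; it makes no attempt at AltaVista's compression or statistics, and the function names and sample documents are our own.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")   # words and numbers, as the index keeps both


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every word or number to the IDs of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents that contain every query term."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    index = build_inverted_index({
        "d1": "Orange County budget 1997",
        "d2": "Fresh orange juice",
    })
    print(sorted(search(index, "orange")))
    print(sorted(search(index, "orange 1997")))
```

Because nothing is discarded at indexing time, even a number like 1997 is searchable.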
The indexer eschews thesauri to broaden searches on the grounds that a thesaurus necessarily depends on some other person's associations with a word, not yours. Instead, AltaVista is readying Visual LiveTopics, a tool that lets you perform user-directed querying: on-line analytical processing on your intranet (see the screen). Suppose, for example, you perform a query on "orange." AltaVista has noticed "orange" in various contexts and presents each one as a graphical tree or text list for further searching. Were you thinking of Orange County? The fruit? The color? The flavor? By selecting which context you meant, you direct the path of the search and can drill down to deeper levels of detail. Currently in beta testing, Visual LiveTopics should be available soon.
Also available should be support for searching common office document formats. HTML is not yet the basis for all business communication.
Given its pedigree, it's no surprise that AltaVista handles tremendous amounts of data rapidly. It can be configured for centralized search services or distributed. It runs on Windows NT on Intel platforms and Digital Unix on Alpha machines, with Solaris support expected real soon now.
Cost depends on server platform and number of users. List price (if anyone still pays that) for the ground-floor NT model is $16,000, which covers 250 users. License for unlimited users would range from $34,000 to $100,000. On the largest Alpha box (whose 2 GB of RAM hints that this is not a desktop machine), the cost is $66,000.
Who buys AltaVista? Digital's Lehmenkuler claims the best customers are those who have tried other intranet search solutions first and not been satisfied. AltaVista is pricey yet powerful. You can't ask for more than the ability to handle practically anything, and do it fast.
Many of the search engines we've discussed have been tested on the toughest possible network: the Internet. They've proven capable of indexing millions of documents and serving millions of customers a day. Considering the features and capabilities of these engines, maybe adding search tools to your intranet won't be such a chore after all.
Where to Find
Digital Equipment Corp.
Maynard, MA
Phone: 508-493-5111
Internet: http://www.dec.com