Linguist X and Computational Morphology


June 1997 / Features / The Searchable Kingdom / Linguist X and Computational Morphology

Sure, Xerox PARC pioneered the mouse-driven, windowed graphical user interface, but what has it done for you lately? Taught the computer how natural languages work, for one thing. This research has culminated in LinguistX, a search tool (not a commercial product itself) that search engines can incorporate for understanding languages.

Ramana Rao, chief technology officer and director of engineering for InXight (a Xerox subsidiary), explains that LinguistX employs computational morphology to plow through text, applying a linguistic understanding of how words and phrases work to enable highly sophisticated queries. Its tokenizer picks words out of text, dealing with commas and hyphens correctly. This tokenizer is also language-independent, since all human languages share universal linguistic properties. (Thank you, Noam Chomsky.) All you have to do is select the specific dictionary you want LinguistX to use.
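
To make the tokenizing step concrete, here is a minimal sketch in Python. It is not LinguistX's tokenizer or API (those are C libraries whose interfaces the article does not describe); the `tokenize` function and its pattern are hypothetical, and the pattern is an ASCII-only simplification that keeps hyphenated words whole and splits commas and other punctuation into tokens of their own.

```python
import re

# Hypothetical pattern, for illustration only (not LinguistX's rules):
#   - a word is a run of ASCII letters/digits, optionally joined by
#     internal hyphens or apostrophes ("well-known", "don't")
#   - any other non-space character (comma, period, ...) is its own token
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens, in order of appearance."""
    return TOKEN_RE.findall(text)


print(tokenize("Well-known tools, like saws, survive."))
# ['Well-known', 'tools', ',', 'like', 'saws', ',', 'survive', '.']
```

A real tokenizer would also need rules for numbers, abbreviations, and non-ASCII scripts, which this sketch deliberately leaves out.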

Its stemmer can tell that "survived" is related to "survive," is a verb, and is past tense. When indexing, it can therefore group related word forms together, making a smaller index that is faster to search; astonishingly, LinguistX can represent up to eight English words with a single bit. This also helps you make more precise queries and find what you're looking for through a word's deep meaning, not just its surface form. On a higher level, the LinguistX thesauri find words with similar meaning: That can broaden a search, but it can also lead to serendipitous connections.
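
The effect of stemming on index size can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the toy `LEXICON`, the `analyze` and `build_index` helpers, and the sample documents are hypothetical, and a plain dictionary lookup stands in for LinguistX's actual morphological analysis.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> (lemma, part of speech, feature).
# A real morphological analyzer derives these; here they are listed by hand.
LEXICON = {
    "survive": ("survive", "verb", "base"),
    "survives": ("survive", "verb", "present"),
    "survived": ("survive", "verb", "past"),
    "surviving": ("survive", "verb", "participle"),
    "record": ("record", "noun", "singular"),
    "records": ("record", "noun", "plural"),
}


def analyze(word):
    """Return (lemma, part of speech, feature); unknown words map to themselves."""
    key = word.lower()
    return LEXICON.get(key, (key, "unknown", "unknown"))


def build_index(docs):
    """Build an inverted index keyed by lemma instead of by surface form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            lemma, _pos, _feature = analyze(word)
            index[lemma].add(doc_id)
    return index


docs = {1: "records survived", 2: "record survives", 3: "surviving records"}
index = build_index(docs)

print(analyze("survived"))    # ('survive', 'verb', 'past')
print(sorted(index))          # ['record', 'survive']
print(sorted(index["survive"]))  # [1, 2, 3]
```

Five distinct surface forms collapse into two index entries, and a query for "survive" reaches all three documents.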

Taggers decide parts of speech. If you're looking for a saw, LinguistX knows a hand tool from the past tense of the verb "see." This is in stark contrast to search engines that treat words as mere strings of ASCII and would make no distinction between the tool and the verb. Tagging a word properly means examining a little of the surrounding text for context. The context window must be as small as possible to preserve speed, yet wide enough to do the job.
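
As a rough illustration of context-based tagging, the Python sketch below disambiguates "saw" by looking at just one word to its left. This is a hand-written toy rule, not the method LinguistX uses; the word lists and the `tag_saw` function are hypothetical.

```python
# Hypothetical word lists for a one-word context window (illustration only).
DETERMINERS = {"a", "an", "the", "this", "that", "my", "your"}
PRONOUNS = {"i", "you", "he", "she", "we", "they", "it"}


def tag_saw(tokens, i):
    """Tag tokens[i], assumed to be 'saw', using only the preceding word."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    if prev in DETERMINERS:
        return "noun"   # "a saw", "the saw": the hand tool
    if prev in PRONOUNS:
        return "verb"   # "she saw": past tense of "see"
    return "unknown"    # the window is too narrow to decide


print(tag_saw("I need a saw".split(), 3))       # noun
print(tag_saw("She saw the house".split(), 1))  # verb
```

The "unknown" branch shows the trade-off described above: a one-word window is fast, but it cannot settle every case.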

LinguistX especially shines in handling phrases. If you search for "home run records" on most search services, you'll get a lot of dross about building a house, athletic footwear, and music albums. But LinguistX can tell that you are looking for exceptional batting performances in baseball.
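
The difference between matching a phrase and matching its separate words can be sketched in Python. This simple adjacency check is only a stand-in for whatever phrase recognition LinguistX performs, which the article does not detail; the function names and sample sentences are hypothetical.

```python
def contains_all_words(doc_tokens, query_tokens):
    """Bag-of-words match: every query word occurs somewhere in the document."""
    return all(word in doc_tokens for word in query_tokens)


def contains_phrase(doc_tokens, query_tokens):
    """Phrase match: the query words occur adjacent to each other and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


query = "home run records".split()
baseball = "he broke several home run records".split()
studio = "they run a home studio that records albums".split()

print(contains_all_words(baseball, query), contains_phrase(baseball, query))  # True True
print(contains_all_words(studio, query), contains_phrase(studio, query))      # True False
```

Both documents contain all three words, but only the baseball sentence contains them as a phrase.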

All this has come out of years of linguistic research, turned into practical software. The result is a collection of ANSI C libraries that are platform-agnostic and eminently portable. For implementors, LinguistX saves time and space and delivers sharper query results. If this seems good to you, be sure to inquire whether commercial search engines incorporate LinguistX.

