Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesManaging Mission-Critical Text


September 1997 / Core Technologies / Managing Mission-Critical Text

You can integrate text searches and relational data with ConText Cartridge and Oracle universal data server.

Ann O'Leary

Over the past decade, organizations have invested heavily in systems that enable rapid access to structured data stored in relational database management systems (RDBMSes). Surprisingly, this information represents only about 10 percent of a firm's data, while the other 90 percent exists as text in documents, manuals, reports, e-mail, Web pages, faxes, and presentations. Until recently this textual data couldn't be accessed with the same ease and efficiency as structured data could be accessed.

Traditionally, organizations that needed access to both of these types of data used SQL-based relational databases to access structured data, while text was stored in flat file systems and accessed through customized search/retrieval engines. But such engines typically lacked key data-management functions found in major relational databases, such as security and transaction integrity. Another problem with this scheme was that any text stored in relati onal databases could not be searched natively using SQL.Such dual-system solutions are expensive, often requiring custom middleware to coordinate data changes and queries between the text-search engine and the database system.

Oracle resolved this dilemma with its ConText technology, which makes text a native data type of an Oracle RDBMS. This eliminates the costly dual-system approach, since the same scalable and secure relational database houses both the structured data and the text. It also provides a un ified interface where textual information can be rapidly searched and retrieved through SQL queries, often combined with associated relational data.

Getting There from Here

Oracle's ConText Cartridge enables full text retrieval within Oracle's universal data server. ConText accomplishes this by processing text into a format usable by an RDBMS and accessible via SQL ( see the figure ). It also provides linguistic capabilities that deal with the ambiguities and language dependencies of text. The term Cartridge indicates that the ConText engine is a software module that extends a database's capabilities to manage the new text data type.

The ConText Cartridge supports several different document-storage methods. The simplest one is the direct data store , which enables text storage in a text column of a database, with one document -- or any logical body of text -- per row. Depending on the column types chosen, each row can contain a text string as brief as a few names or as long as a book containing hundreds of thousands of pages.

Another storage method involves keeping the documents in a file system or anywhere on the Web. In this type of situation, the text column stores only the file-system or URL pointers to those documents. ConText searches and retrieves the text contents at the end of these pointers. While storing text in the database provides more robust data integrity, keeping files on the Web or in the file system may be more expedient. This feature is especially useful for constructing intranet search applications without having to copy existing Web pages to a central location.

As the database stores the text document, ConText builds a text index that contains critical information about the document's contents, as illustrated in the figure . Multiple ConText servers can run in parallel to reduce the index-generation time for large documents. The index is stored and managed as a set of relational tables in the databa se, which enables ConText to search and retrieve text information as rapidly as it can retrieve relational data.

The index's relational tables contain information about every word contained in the documents, such as which documents a certain word comes from, how many times it occurs in each document, what other words it's near, and by how much. Because text searches execute against these indexes (rather than directly against the full text), they can be as rapid as relational queries.

The generation of a text index is governed by a user-designated or default text policy supplied with ConText. The policy tells ConText how to index the text. A group of user preferences defines the characteristics of a given index. For example, there are preferences for languages (e.g., English or Japanese), document format (e.g.,Word or WordPerfect), document-location information (e.g., in the RDBMS or pointed to by a URL), and so on. The defaults are usually sufficient, but you can create highly customized po licies for special search requirements.

Because ConText Cartridge is integrated into the Oracle universal data server, all development and administration tasks can be accomplished through the use of straightforward SQL and PL/SQL calls. Oracle's SQL engine handles ConText queries, while ConText background processes automatically update text indexes and handle other administrative tasks. Developers don't need to use special API calls to manage text, and they can take advantage of the data server's capabilities, such as scalability and security.

Using ConText

The primary method of accessing ConText is through the CONTAINS function, which enables SQL to query the contents of a document. This function can be included in a WHERE clause that also includes conditions that are based on structured fields. Here's a sample SQL call that uses this function:

SELECT title
FROM docs_tbl
WHERE organization = 'Support'
   AND CONTAINS(docs_tbl.text, 'WebServer') > 0;

The column docs_tbl.text holds the documents that you wish to search. Based on the text-query expression, the CONTAINS function returns a numeric "score" that indicates the degree to which a given document matches the expression. In this example, the expression requests an exact match on the word WebServer ; the score will be higher for those documents in which WebServer appears the most times.

The SQL call above simply returns the title of every document in which the target word WebServer appears at least once, given that the document also meets the structured field condition ( organization = 'Support' ). When the user chooses from the title list, the application then retrieves and displays the specified document.

To rank the results of a query according to the score, the SCORE function can be used, as in the following example:

SELECT SCORE(0), title
FROM docs_tbl
WHERE organization = 'Support'
   AND CONTAINS(docs_tbl.text, '
WebServer',0) > 0
ORDER BY SCORE(0)

This call returns an ordered list of documents. (The argument to SCORE is simply an index, which is included as the third argument to the CONTAINS function. It places no restriction on the rows returned.) For more complex ConText searches, the text-query expression can get more complicated, and the SQL call might have multiple CONTAINS clauses.

ConText has been designed to easily upgrade existing Oracle-based applications with text-search capability, thus offering users direct access to text that's already stored in Oracle databases. Electronic-commerce applications that manage quantities, prices, and product-ID numbers can be upgraded to include searches on text descriptions of content or to offer the text data itself for sale.


A Tale of Two Architectures

illustration_link (22 Kbytes)

ConText eliminates the dual-system design often used to integrate a text search with relational data.


Turning Text into a Native Data Type

illustration_link (28 Kbytes)

Text itself (or file/URL pointers to text files) can be stored in database columns.


Ann O'Leary, a senior manager at Oracle, is responsible for marketing the Oracle universal data server and the ConText Cartridge. She can be reached at aoleary@us.oracle.com .

Up to the Core Technologies section contentsGo to next article: Managing Devices with the WebSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network