As files proliferate and become containers for multimedia objects, document management is more necessity than luxury. Now, it's becoming part of the system software, with profound implications for networks and the user interface.
Andy Reinhardt
Today, the primary use of computers by far is for document processing. According to Dataquest (San Jose, CA), 98 percent of business computer users employ word processing software on their PCs; many use their PCs only for word processing. Says Frank Gilbane, president of Publishing Technology Management (Cambridge, MA) and editor of The Gilbane Report on Open Information and Document Systems, at least 80 percent of corporate electronic information is in the form of documents, as opposed to structured database records.
Now, the role of documents is poised to become even more central. Documents are no longer merely an electronic analog to paper, but rather dynamic, modular, multimedia entities. At the same time, documents are becoming the focal point for the user interface and the design center of software programs. This is being done through initiatives such as Microsoft's OLE and the OpenDoc standard from Apple, IBM, WordPerfect, and others.
The rise of documents also has a dark side: the information glut. The explosion of desktop documents has spilled over onto servers, and many people are hooking up to the Internet and other on-line services, where millions more documents reside, ready for the taking and misplacing. The average user has enough trouble creating directories or structuring files into folders, much less remembering later where he or she has put files. The inadequacies of contemporary file systems--especially the limited file attributes and ``8.3'' naming convention of DOS--have never been more apparent, nor the need for powerful document management tools greater.
``If you go into most companies and ask them to track their capital assets, they can do so with unfailing detail,'' says Scott Wells, product line manager for the NetWare applications services group of Novell (Provo, UT). ``But if you ask them to do the same with intellectual assets--documents, memos, letters--they can't.''
Given the importance of documents, it's ironic that PCs have handled them so badly until now. The dominant paradigm of operating systems, files arranged in rigid hierarchical directories, is fundamentally computer-based, not human-based. People arrange their desktops and documents in ad hoc folders and piles, clipping together related papers and rearranging groupings to reflect changing priorities and tasks.
The new document-computing model reflects and embraces this reality while at the same time adding a uniquely computational capability: Documents can carry with them information about their origin and identity, as well as executable code that knows how to manipulate or render them. No piece of paper can match that.
As documents become the center of computing activity, users will require new tools to identify, store, track, retrieve, and present them. Operating systems now provide these functions in only the most rudimentary fashion, so users resort to third-party software or even dedicated systems. Eventually, operating systems will take on some of the management functions now assumed by stand-alone packages. ``Users see document management as a tool, not an application,'' says Bruce Silver, vice president of BIS Strategic Decisions (Norwell, MA).
At the same time, document management is following the architectural model exemplified by databases and mail systems, toward a layered design in which client tools, middleware, and back-end services are separated and wrapped in published interfaces. Document management clients are being rewritten to support APIs such as ODBC (Open Database Connectivity), MAPI, and Lotus Notes, and document engines are migrating from proprietary to industry-standard platforms.
Two standards efforts are occurring in document management. ODMA (Open Document Management API) is an interface that will let any program talk to a document client. The Shamrock group, led by Saros and IBM, has proposed a wrapper for document engines (see the text box below).
This convergence of emerging technologies gives rise to an intriguing scenario. Growing demands on file systems are driving them to become more like distributed databases. Eventually, in operating systems such as Microsoft's Cairo or Taligent from the Apple/IBM joint venture of the same name, file systems will become ``universal'' object stores able to contain documents, messages, data records, and executable program modules.
Meanwhile, on the desktop, traditional file managers are blending with query tools, which are typically forms-based front ends for databases, such that you may end up needing only a single dialog box to access any distributed object. With a unified front end and open back ends, the battle shifts to middleware such as Lotus Notes or the new Document-Enabled Networking initiative from Novell and Xerox, and to client differentiators such as better retrieval techniques or a more intuitive and informative user interface.
The New Document
The document has traditionally been static: a memo, a book, or a photograph. On PCs, documents were typically owned by a given application and stored in a unique format. Until PCs were networked, these files usually belonged to only one user and passed from one person to another in printed form. Documents also tended to be ``dumb,'' knowing nothing of themselves.
The emerging definition is more dynamic. Old distinctions between different data types are fading away, as all of them find their way into document containers, such as those used in the object-oriented OpenDoc technology. Explains Alan Adamson, director of product management for Symantec/Peter Norton Group (Santa Monica, CA): ``A document will no longer be a single file, but rather a book of pointers to text objects, data objects, images, fonts, and so on.''
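Adamson's ``book of pointers'' can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the class names, the `kind` and `location` fields, and the sample store are hypothetical and are not drawn from OpenDoc or OLE.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One component object; 'kind' and 'location' are illustrative fields."""
    kind: str       # e.g. "text", "data", "image"
    location: str   # where the object actually lives


@dataclass
class CompoundDocument:
    """A document that owns no content, only ordered pointers to parts."""
    title: str
    part_ids: list = field(default_factory=list)

    def render(self, store):
        # Resolve each pointer against the object store at view time.
        return [store[part_id] for part_id in self.part_ids]


# Hypothetical object store that many documents could share.
store = {
    "t1": Part("text", "server1/intro.txt"),
    "d1": Part("data", "server2/q3-sales.tbl"),
    "i1": Part("image", "server1/logo.tif"),
}

report = CompoundDocument("Q3 report", ["t1", "d1", "i1"])
kinds = [part.kind for part in report.render(store)]
```

Because the document holds only identifiers, replacing the object stored under `"d1"` changes what every document pointing at it will show the next time it is rendered.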
New documents are also multidimensional. In the temporal domain, their component parts can be linked back to other documents and updated with fresh content. In the spatial domain, work-flow software can automatically route documents, some with built-in intelligence, around a network and present them to users through a variety of forms. Taken together, these attributes define virtual documents, which exist only at the time you view them and via the lens through which you are able (or allowed) to do so.
Responsibility for managing documents is normally shared between operating systems and applications. Some applications, especially databases, have traditionally implemented their own storage systems, optimized for performance and security. Others, such as spreadsheets and word processors, leave the job of file I/O to the operating system, which means that the only information stored about the file is what fits in the fields built into the file system.
More up-to-date programs, such as Microsoft Word, go beyond the limited fields built into DOS and attach a summary box to each file. The data in this summary box, including author, title, keywords, version number, description, and file statistics, is bound into the file and isn't readable by other applications. Document management packages essentially perform the same function, but in a nonapplication-specific way. They usually employ proprietary user interfaces and file repositories.
Merging Definitions
Document management has traditionally been divided into two broad categories: products for cataloging and retrieving editable files stored locally or on a server; and products for inputting, tagging, storing, and recalling the images of documents, usually from paper originals created outside the organization. These uses have led to different feature sets and a bifurcation of suppliers, but observers believe the distinctions will disappear over time.
``Document management, work flow, images, forms, and OCR are all getting married together,'' says Scott Cooper, a senior product manager for Lotus Development Corp. (Cambridge, MA). One example is PageKeeper from Caere Corp. (Los Gatos, CA), a desktop-class document manager designed to handle both editable files and images.
User requirements for document management and image management differ for two reasons. First, in imaging applications, the documents are static, imported into the system in their final form. In document management systems, they are dynamic. Image managers thus focus on moving and handling fixed files, while document managers concern themselves with policing the creation of content. ``The purpose of document management,'' says Alvin Tedjamulia, executive vice president of technology at SoftSolutions Technology (Orem, UT), ``is to know that an original is an original, and who touches it, and when.''
Second, in imaging applications, the documents are bit maps, and, as such, are faithfully reproducible but not editable or searchable. In document management, they're editable, which can mean that their appearance is not consistent to all users. Being able to render electronic documents accurately across multiple platforms is of growing importance. It's driving interest in portable file formats such as Adobe Acrobat, Farallon Computing's Replica, No Hands Software's Common Ground, and WordPerfect Envoy.
Both imaging and document management systems generally run on networked infrastructures that should provide protocol independence, locationless file access, security, and storage management. Ideally, server replication, link tracking, and extended file attributes are also built in, which is one reason that environments such as Lotus Notes, NetWare 4.x, and Microsoft's pending Windows NT-based Microsoft Exchange Server (known until recently as the Enterprise Messaging Server, or EMS) are becoming such attractive platforms for document management.
A third class of document management products, favored by big engineering firms, supports document assembly, or the creation and presentation of large, fast-changing, or customized documents. These high-end systems, from suppliers such as Documentum, Frame Technology, and Interleaf, combine attributes of desktop publishing and databases. Source materials are maintained in huge repositories and assembled into customized views for electronic or paper distribution. The document, as such, is not a fixed entity; rather, it exists only as a slice or snapshot of a flexible, evolving information base.
This publishing model will become more prevalent as compound document architectures move onto the desktop. Document assembly will no longer be a high-end application, but rather the way you put together a routine report. Some low-end products are already starting to appear. For instance, Capsoft Development (American Fork, UT) sells a $99 utility called HotDocs that lets you turn Word, WordPerfect, and Ami Pro documents into templates for custom publishing.
``What we are headed for is an integrated desktop where you can work on spreadsheets, documents, data, voice, and it doesn't make any difference,'' says imaging consultant Harvey Spencer (East Northport, NY). If document management is now a niche market, soon it will be synonymous with file management, data access, and data presentation. The document manager will be the user interface.
However, compound document architectures also present difficulties that still must be addressed. For instance, says Mark Walter, a senior editor for Seybold Publications (Media, PA), OLE links among documents are fine for work in progress, but they're impossible for archived documents. Once a document is committed to a tape or WORM medium, it can't be reliant on objects outside the archival medium.
Most desktop document managers are designed to work closely with popular word processing programs. Users can select documents to edit and then launch a word processor. An alternative approach, typified by market leaders PC DOCS (Tallahassee, FL) and SoftSolutions, ties the document manager directly to a word processor's file I/O operations. When you open or save a file within WordPerfect, these packages intercede and take over the function.
For a save operation, the document manager forces you to fill out an on-screen form, or profile, which specifies information such as the author, title, and subject of the document; a job or case number; and keywords for categorizing the file. Sophisticated packages fill in some of these fields by default, such as author, typist, date, and version. Advanced packages also store a complete inverted-tree index of the document for full-text retrieval.
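The full-text index mentioned above can be sketched in a few lines. This is a minimal illustration that assumes a plain inverted index (each word mapped to the set of documents containing it); the actual index structures used by the commercial packages are not described here, and the document IDs and texts are made up.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return IDs of documents containing every given word (Boolean AND)."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


# Made-up sample documents for illustration only.
documents = {
    "memo-001": "Lease agreement for the Provo office",
    "memo-002": "Office supply order and lease renewal",
    "memo-003": "Quarterly sales summary",
}
index = build_index(documents)
hits = search(index, "lease", "office")
```

Here `hits` contains the two memos that mention both words. The index is built once at save time, so a later search touches only the word lists rather than rereading every file.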
File-open operations invoke the opposite action, presenting you with a blank copy of the profile form that you use to query the document database. Using a QBE (query by example) technique, you fill in one or more fields with search criteria that you use to locate a document.
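A query-by-example lookup of this kind might be sketched as follows. The field names, the sample profiles, and the case-insensitive substring matching rule are all assumptions made for illustration; neither PC DOCS nor SoftSolutions is known to match fields this way.

```python
def query_by_example(profiles, example):
    """Return profiles that match every non-blank field of the example form."""
    criteria = {key: value.lower() for key, value in example.items() if value}
    return [
        profile for profile in profiles
        if all(value in str(profile.get(key, "")).lower()
               for key, value in criteria.items())
    ]


# Hypothetical profile database.
profiles = [
    {"author": "Smith", "title": "Asset tracking memo", "case": "1042", "keywords": "assets"},
    {"author": "Jones", "title": "Groupware positioning", "case": "2210", "keywords": "groupware"},
    {"author": "Smith", "title": "Imaging rollout plan", "case": "2210", "keywords": "imaging"},
]

blank_form = {"author": "", "title": "", "case": "", "keywords": ""}
filled_form = dict(blank_form, author="smith", case="2210")
matches = query_by_example(profiles, filled_form)
```

Only the fields the user fills in become criteria, so the same form serves as both the data-entry screen at save time and the search screen at open time. A form left entirely blank matches every profile.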
Changing Model
Both PC DOCS and SoftSolutions now use client/server architectures as a means of supporting multiple platforms, improving performance and robustness, and tapping into industry standards. Like most DOS-based packages, PC DOCS used to have a monolithic architecture, providing both the client interface (a TSR program for DOS) and the document store (Btrieve). The networked version used Btrieve to store the document profiles, which pointed to documents stored on a NetWare server.
In the latest version, PC DOCS Open, the company has taken a huge step toward platform diversity and openness. The client portion now runs on Windows (DOS and Macintosh versions will follow), and the back end runs on a plethora of servers. Documents can be stored on Banyan Vines, DEC Pathworks, LAN Manager, NetWare, or NT Advanced Server. And the profiles live in SQL databases such as Microsoft SQL Server for OS/2, NT SQL Server, Oracle, Sybase, and Watcom.
SoftSolutions has also made the transition from a DOS-based solution to a multiplatform client/server model. Clients are available for DOS and Windows; servers run on NetWare and various flavors of Unix. On Windows, you can run the SoftSolutions Document Desktop, a Norton Desktop-like home page that eliminates the Windows Program Manager/File Manager duality and hosts documents, applications, folders, search tools, and saved searches.
Through DLLs, SoftSolutions is able to work from inside several Windows programs, including Ami Pro, Excel, Lotus 1-2-3, Microsoft Mail, Word, WordPerfect, and WordPerfect Office. Through OLE 2, it can link directly with other Windows applications. For instance, SoftSolutions bundles in a copy of Watermark Software's Discovery Edition, a set of low-end imaging tools (i.e., compression, fax, and OCR support; optical media management), and uses OLE to communicate with it.
Another company embracing a client/server model is Apple, whose $1800 AppleSearch tightly couples text retrieval into the Mac OS. Implemented as an AppleShare server engine with front-end clients, AppleSearch lets desktop users search across the network for documents using fairly conventional criteria (e.g., Boolean with proximity, wild cards, and creation date) and see results ranked by relevance, using technology licensed from Personal Library Software (Rockville, MD).
Middleware Is Key
One of the most powerful ways to use SoftSolutions is in conjunction with Lotus Notes. Notes provides useful middleware services--multiplatform support, user administration, security, a messaging transport, form views, and, most important, database replication--but by itself, it's not suited for document management. SoftSolutions fills in where Notes falls short, providing library services, such as document checkout and revision control. ``Your entire world view is through Notes, and the SoftSolutions document profile becomes a form,'' explains Tedjamulia of SoftSolutions.
Riding on the Notes database, profiles are automatically replicated throughout the Notes network. But the original documents are kept on a single SoftSolutions server. ``If you store the documents themselves in Notes, they're replicated and you lose control of them,'' Tedjamulia says. This hybrid architecture increases security and preserves bandwidth, he says, yet still allows Notes users to search for documents and call them up from the SoftSolutions server across the LAN or WAN.
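The hybrid arrangement can be pictured as small profiles that travel everywhere and document bodies that stay put. The sketch below is a toy model under that assumption; the class and method names are hypothetical and do not correspond to Notes or SoftSolutions interfaces.

```python
class DocumentServer:
    """Single home for document bodies (stands in for the central server)."""

    def __init__(self):
        self._bodies = {}

    def put(self, doc_id, body):
        self._bodies[doc_id] = body

    def fetch(self, doc_id):
        return self._bodies[doc_id]


class ProfileReplica:
    """A site-local copy of the profile database; it holds no document bodies."""

    def __init__(self):
        self.profiles = {}

    def replicate_from(self, other):
        # Only the small profile records are copied between sites.
        self.profiles.update(other.profiles)

    def find(self, keyword):
        return [doc_id for doc_id, profile in self.profiles.items()
                if keyword in profile["keywords"]]


server = DocumentServer()
server.put("doc-7", "Full text of the contract ...")

head_office = ProfileReplica()
head_office.profiles["doc-7"] = {"title": "Contract", "keywords": ["legal"]}

branch = ProfileReplica()
branch.replicate_from(head_office)

found = branch.find("legal")      # the search runs against the local replica
body = server.fetch(found[0])     # the body is pulled on demand from one place
```

The branch office can search without crossing the WAN, and there is still exactly one copy of the document to secure and control.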
The SoftSolutions engine is now accessible via ODBC drivers. Later this year, the company says it will add support, including fast text searching, for 19 third-party databases. The company is committed to supporting both OS/2 and NT, as well as the ODMA interface.
Lotus sees big opportunities for Notes in document management. ``We don't position Notes as a document manager, per se,'' says Chris Reed, director of market development for Lotus Notes. ``Rather, it's a layer of services that anyone doing document management can leverage off of.''
Lotus argues that low-level file managers are not the best tools for document management. ``You need a middleware/groupware layer,'' says Judy Jalbert, the Notes product manager for DBMS integration and document management. ``That's the appropriate place for it because you need to be cross-platform, and one operating system won't solve that.''
Through aggressive partnering, Lotus has already added significantly to the basic Notes package. Verity (Mountain View, CA) provided a version of its well-regarded Topic full-text search engine, which is bundled into release 3.0 of Notes. Action Technologies (Alameda, CA) has built a sophisticated server-based work-flow system on Notes. And in conjunction with Kodak, Lotus has delivered Lotus Notes: Document Imaging, or LN:DI (commonly pronounced ``Lindy''), a set of client and server tools that support image files.
LN:DI includes Windows client software that performs basic imaging functions, such as scanning documents, compressing/decompressing files, and zooming, panning, and rotation. The server component, which runs on its own OS/2-based system, implements an image database with integrated HSM (Hierarchical Storage Management). Notes was able to handle images already, but without LN:DI, they were treated like any document and replicated indiscriminately, which had repercussions for WAN bandwidth. With LN:DI, images can be stored centrally and referenced with 100-byte pointers in distributed Notes databases, much as SoftSolutions does with its document manager.
Following the same middleware model, Kodak has also partnered with Novell to enhance NetWare 4.x's image support. The companies have created Image-Enabled NetWare, a set of client components, NLMs, and APIs that implement storage management, server-based imaging, and a document management front end. The storage management piece, written by Kodak, consists of optical media drivers (the High Capacity Storage System) and HSM capabilities (Mass Storage Services) for NetWare 4.x. Document Management Services is a scheme for organizing network files into folders according to keywords or ad hoc groupings.
Image Management Services implements functions on the server such as inbound and outbound fax and mail support; raster operations like cropping, scaling, and rotating; and support for image file types (i.e., TIFF, GIF, and Group 4 fax). ``This means that developers like Kofax can create IMS-aware apps and save having to write all these capabilities themselves,'' says Novell's Wells. ``Using IMS to render or handle images lets most of the work be done on the server.'' IMS also implements client- and server-based scanner drivers. ``We've made them into network services so that ISVs [independent software vendors] don't have to worry about the details,'' he says.
Is This NOSA?
A separate initiative between Novell and Xerox may turn out to be the most significant development of all for document management. The partnership aims to create a middleware layer and published programming interfaces, known collectively as Document-Enabled Networking, that should make it easier for developers to create networked document management applications. ``For document management to become more pervasive, we need broader tools for end users, VARs, and system integrators,'' says Dennis Hamilton, the major architect of DEN and principal software scientist for Xerox's XSoft applications subsidiary. ``DEN empowers them to implement document management solutions more readily.''
Architecturally, DEN bears a striking similarity to the model used in Microsoft's WOSA (Windows Open Services Architecture). Applications talk through an API to a set of middleware services (DLLs in the Windows case, NLMs in Novell's case), and back ends write through an SPI (Service Provider Interface) to the middleware. The result is that any compliant client can talk to any compliant server.
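The API/SPI split can be sketched in a few lines. This is a generic illustration of the pattern, not the DEN or WOSA interfaces themselves: the class names, method names, and provider labels are all hypothetical.

```python
from abc import ABC, abstractmethod


class DocumentProvider(ABC):
    """SPI side: what a back end must implement (hypothetical interface)."""

    @abstractmethod
    def lookup(self, doc_id):
        ...


class FileServerProvider(DocumentProvider):
    def lookup(self, doc_id):
        return f"file-server copy of {doc_id}"


class DatabaseProvider(DocumentProvider):
    def lookup(self, doc_id):
        return f"database copy of {doc_id}"


class Middleware:
    """API side: clients call open_document and never see the provider."""

    def __init__(self):
        self._providers = {}

    def register(self, name, provider):
        self._providers[name] = provider

    def open_document(self, name, doc_id):
        return self._providers[name].lookup(doc_id)


middleware = Middleware()
middleware.register("files", FileServerProvider())
middleware.register("db", DatabaseProvider())
results = [middleware.open_document(name, "memo-42") for name in ("files", "db")]
```

The client code in the last line is identical for both back ends, which is the point of the design: a new store only has to implement the provider interface to become reachable from every existing client.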
Initially slated to ride on NetWare 4.x (it will be ported to other operating systems in the future), DEN consists of network services for accessing and managing documents and development tools. The initial specification will be available by the time you read this, and the software development kit will ship this year, says Hamilton.
The DEN coordination layer, built on NetWare 4.x's distributed file system, is intended to provide a consistent mechanism for getting at documents anywhere on the network, or at least those housed in NetWare servers or DEN-compliant libraries. It will provide integrated text and attribute indexing, security, commenting, and library services (e.g., checkin/checkout, access control, and usage tracking). A Xerox partnership with Mastersoft (Scottsdale, AZ) will also provide file-format conversions. DEN's SPI will let third parties deliver enhanced back-end services, such as indexing or conversion engines.
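Library services such as checkin/checkout and usage tracking boil down to a lock, a version list, and a log. The sketch below is a toy model written for illustration; it is not the DEN specification, and every name in it is hypothetical.

```python
class Library:
    """Toy library service: exclusive checkout, versioned checkin, usage log."""

    def __init__(self):
        self.versions = {}      # doc_id -> list of contents, oldest first
        self.checked_out = {}   # doc_id -> user currently holding the lock
        self.log = []           # (user, action, doc_id) tuples

    def add(self, doc_id, content, user):
        self.versions[doc_id] = [content]
        self.log.append((user, "add", doc_id))

    def checkout(self, doc_id, user):
        if doc_id in self.checked_out:
            raise PermissionError(f"{doc_id} is held by {self.checked_out[doc_id]}")
        self.checked_out[doc_id] = user
        self.log.append((user, "checkout", doc_id))
        return self.versions[doc_id][-1]

    def checkin(self, doc_id, content, user):
        if self.checked_out.get(doc_id) != user:
            raise PermissionError(f"{user} has not checked out {doc_id}")
        self.versions[doc_id].append(content)
        del self.checked_out[doc_id]
        self.log.append((user, "checkin", doc_id))
        return len(self.versions[doc_id])   # new version number


library = Library()
library.add("spec", "draft 1", "ann")
text = library.checkout("spec", "ann")
try:
    library.checkout("spec", "bob")   # a second checkout is refused
    blocked = False
except PermissionError:
    blocked = True
version = library.checkin("spec", text + " (revised)", "ann")
```

Older versions are kept rather than overwritten, and the log records who touched the document and how, which is the kind of audit trail a document manager needs.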
Xerox and Novell also plan enhanced network printing capabilities, including server-based printing, a critical capability for document assembly. Says Hamilton, ``People want to print from the server, not bring the document back to the desktop, load it up, and then spool it back out to a print server.'' High-end publishing systems do their composing on the server, he says, whereas on PC LANs, the client and the application do all the work. ``With big enough documents, you can't even afford to do it on the client.''
NetWare and DEN have some advantages over Notes in the DMS middleware arena. First, Novell controls the underlying operating system, while Lotus is beholden to IBM, Microsoft, Novell, and other platform providers. More important, the automatic replication in NetWare 4.0 distributes directory information but not the data itself, whereas Notes replicates the content of the databases. Obviously, Notes replication is beneficial for messaging and groupware applications, but document management, unless it is enforced at the operating-system level, prefers a more controlled and centralized model.
To strengthen its hand, Novell also plans to add extended file attributes to NetWare's file system. ``Extended attributes attach more information that people can inspect to the raw material,'' says XSoft's Hamilton. ``This could thin the layer you have to build on top of the raw material in order to describe it. But you will still need a layer between the operating system and the document manager, because different search engines look for different things.''
Another Contender
Remarkably quiet so far in document management has been Microsoft, but the company is about to enter the fray with its much-delayed Exchange Server (MXS). Designed to run on NT and to be accessed through the Extended MAPI programming interface, MXS is an ambitious effort to accommodate a range of messaging-based applications on a single dedicated server.
MXS began as a project to create an NT message store--in effect, a high-end post office for Microsoft Mail. Over time, however, it has evolved into a platform for implementing message-enabled client/server applications, such as work flow, forms routing, and group communications. For this reason, it has been called a ``Notes killer,'' a label Microsoft vigorously rejects. Unlike Notes, MXS isn't a programmable database engine. Rather, it's a repository onto which Win32 applications can be layered. Microsoft hopes, for instance, that third parties will develop document management programs that use MXS as a file store.
Like Notes, MXS contains data files, not just pointers to them, and it automatically replicates itself. ``It's a storage system, closer to a database than to a file system,'' says Thom McCann, MXS product manager for Microsoft. The contents of the repository will be visible to MAPI and ODBC, but not directly from the NT file system. But while MXS supports ODBC, it doesn't have a programmable schema. ``We take care of that,'' McCann says. ``We've optimized the info store for the kinds of things we do.''
MXS is the core of Microsoft's push into enterprise messaging. As such, it's designed for heterogeneous environments. It has TCP/IP and NetWare support built in, and it uses native implementations of the ISO's X.400 addressing scheme/message transport agent and X.500 directory services. This means MXS will have a separate user directory from the NT network to which it belongs, but Microsoft will provide tools that let network administrators set up user accounts on both NT and MXS simultaneously. MXS will also be able to import user directories, including those from the NetWare 3.x bindery and NetWare 4.x NetWare Directory Services.
McCann contends that the advantages of MXS over an NLM-based solution accrue in part from NT's inherent strengths: manageability, scalability, and GUI-based administration tools. MXS, he says, ``can be a fairly good platform for document management,'' because out of the box it will take care of basic features such as checkin/checkout and versioning. Advanced capabilities like revision control, full-text indexing, and global file management will have to be provided by ISVs (e.g., Microsoft is working with Watermark on a MAPI-enabled version of its image store).
In support of more advanced groupware applications, MXS can store multimedia data types, custom forms, and calendar/scheduling information. It will maintain the integrity of OLE links among documents, but only within the information store. ``We're trying to take a lot of the functionality that needs to be driven down into the operating system or into the server and put that into MXS,'' says McCann.
Third-party developers mostly applaud the potential for MXS. To effect document management, says Albert Behr, product manager for forms products at Delrina Software (Toronto, Ontario, Canada), ``the plumbing needs to be both in the operating system and the workgroup infrastructure.'' MXS and Notes both provide workgroup capability, but Microsoft also controls NT.
Scott Kadlec, the president of PC DOCS, calls MXS ``very strategic'' for his company. ``In the past, we've been so focused on overcoming problems the operating system should have solved itself that we haven't been able to step up to the next level,'' he says. If the company is freed of responsibility for low-level file management, he says, PC DOCS can concentrate its engineering efforts on creating better ways to find documents or to present search results.
Of course, not everybody welcomes MXS. Lotus's Reed chides Microsoft for being too Windows-based. ``MXS misses the mark,'' he says. ``Microsoft's fundamental approach is to get everyone on one platform, but cross-platform apps are the ones winning in the market.''
Foundation Support
For document management to become a true mass-market capability, better information and object management tools have to migrate down to the average desktop, not just to the server, because many files are still stored and accessed locally. In the Windows marketplace, some functionality will be shifted down into the operating system when Microsoft delivers Chicago, and much more so with the future object-oriented successor to NT, known as Cairo.
Chicago offers limited but important enhancements that improve support for document management, says Rogers Weed, the lead project manager for Chicago. At the lowest level, a small number of additional file attributes have been added to the FAT (file allocation table) file system, using previously reserved but unused fields. Chicago will support long filenames, breaking at last the hard-coded ``8.3'' DOS file-naming scheme and vastly improving your ability to name files with memorable descriptors. And in addition to date/time of the last modification, Chicago will store the date/time of creation and most recent access to files, even if that access produced no changes. These fields will help track file activity and will be especially useful for network management and backup, but also for document management.
Outside the confines of the FAT file system, Chicago will let developers attach additional fields of information (e.g., the contents of the Word and Excel summary boxes) to files and then publish, via an API, the structure of those records. This could be a boon to document management systems, which would gain a standardized way of reading summary boxes and importing the data into document profiles.
At a higher level, Chicago introduces a new user interface, called Explorer, that merges file and program management onto a single desktop, like the Mac or OS/2. This is a critical step toward document-based computing, because it exposes documents at the desktop, rather than burying them inside the context of their creating applications. In conjunction with OLE and OpenDoc, it moves the Microsoft/Intel computing world into a more document-centered user interface (see the text box ``Distributed Document Management with OLE and OpenDoc''). Explorer will also offer an improved finder that lets you search for files over the network, and it will ship with built-in file viewers, a critical aid to document management.
Explorer also supports Mac-style aliasing, which means that an icon on the desktop can be a pointer to another entity anywhere on the network. Called Shortcuts, these desktop-level links are a new data type managed by the operating system. They fully support OLE, which means you can drag and drop an object from the desktop into another application or onto a service such as printing or backup.
It's important to note, however, that the integrity of these links is not ensured at the operating system or network level; you can easily break the link by moving or deleting the target of a pointer. And to perform really fast searches against a large directory of files and objects, you need a better file system than FAT. That's where Cairo comes in. ``Cairo is a fundamental revisiting of the file-system structure,'' says Weed. ``It includes indexing, security, and management of lots of objects.''
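The fragility of path-based links is easy to demonstrate. The sketch below models a shortcut as nothing more than a stored path, which is an assumption made for illustration; how Chicago actually records a Shortcut is not described here.

```python
import os
import tempfile


def resolve(shortcut):
    """Return the target path if it still exists, otherwise None."""
    target = shortcut["target"]
    return target if os.path.exists(target) else None


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "report.txt")
    with open(original, "w") as handle:
        handle.write("quarterly numbers")

    shortcut = {"name": "Report", "target": original}  # pointer stored by path
    before_move = resolve(shortcut)

    # Moving the target does not update the shortcut, so the link dangles.
    moved = os.path.join(folder, "archive.txt")
    os.rename(original, moved)
    after_move = resolve(shortcut)
```

After the rename, `after_move` is `None` even though the file still exists under its new name. Keeping such links intact requires a system that tracks objects by identity rather than by location.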
The Cairo object store won't necessarily be a single entity that contains all data types. Rich Tong, the general manager of product marketing for Microsoft's business systems division, explains that OLE wrapping will be a standardized way of representing what an object is--``a way to label the outside of something''--but that the actual file stores and retrieval engines could vary depending on the nature of the data. ``A half-gigabyte financial database needs a different structure from a thousand documents or a million objects,'' he says. ``With OLE wrappers, you can use any kind of store as your back end: a legacy VAX, a Notes database, or MXS.''
Cairo's OLE file system will support richer attributes than FAT or even NTFS (NT File System), and the definition of those attributes is flexible because of object orientation. At a minimum, Tong says, it might include fields such as object ID, author, and version, while more specialized attributes could be laid down by the host document manager. Document managers like FileNet could write drivers to route document calls into the Cairo file system, thus preserving their customers' existing applications and databases. This tie into legacy systems will be accomplished largely through the use of COM (Common Object Model), which will also tie to object models such as SOM/DSOM (System Object Model/Distributed System Object Model), CORBA (Common Object Request Broker Architecture), and DOE (Distributed Objects Everywhere) through a DEC-authored object broker (see ``Componentware,'' May BYTE).
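To make the idea of a labeled wrapper concrete, here is a minimal sketch in Python. It is not Cairo's or OLE's actual interface, which Microsoft has not detailed; the class name ObjectWrapper, the store field, and the sample values are all hypothetical. It shows only a fixed set of core attributes (object ID, author, version) alongside open-ended attributes that a host document manager might add.

```python
from dataclasses import dataclass, field

# Core attributes named in the article; "store" is a hypothetical label
# for whichever back end actually holds the data.
CORE_FIELDS = ("object_id", "author", "version", "store")

@dataclass
class ObjectWrapper:
    object_id: str
    author: str
    version: int
    store: str
    # Specialized attributes supplied by a host document manager (assumed).
    extra: dict = field(default_factory=dict)

    def attribute(self, name):
        """Look up a core attribute first, then an extended one."""
        if name in CORE_FIELDS:
            return getattr(self, name)
        return self.extra.get(name)

def find(objects, **criteria):
    """Return the IDs of objects whose attributes match every criterion."""
    return [
        obj.object_id
        for obj in objects
        if all(obj.attribute(key) == value for key, value in criteria.items())
    ]

# Invented sample data spanning two different back-end stores.
OBJECTS = [
    ObjectWrapper("doc-1", "author-a", 2, "file-server", {"department": "legal"}),
    ObjectWrapper("doc-2", "author-b", 1, "notes"),
    ObjectWrapper("doc-3", "author-a", 1, "notes", {"department": "sales"}),
]

print(find(OBJECTS, author="author-a"))
print(find(OBJECTS, store="notes", department="sales"))
```

Because every object carries the same outer label, a single query can span back ends that store their data very differently.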
Finding It
Once you have documents stashed in an appropriate file system and wrapped with identifying information, you still need an object browser or some other means of quickly locating the information or function you need. Microsoft hasn't said much about the Cairo user interface, but you can draw some conclusions from other object and information managers.
There will likely be a wealth of choices for accessing distributed object stores. Imagine a query tool that brings together elements of a Mac or Chicago desktop; a Borland or Gupta QBE database-access dialog box; a custom business form from Delrina, JetForm, or WordPerfect; and a customized data view from Lotus Notes.
Things start to get even more interesting when you consider technologies used to formulate queries, organize searches, and represent search results to the user. A leading researcher in these areas is Xerox PARC (Palo Alto Research Center), which is pioneering more effective ways of scanning unstructured textbases and presenting the results. The goal is to help users find the information they need, so the first line of attack addresses formulating queries.
Simple, often inflexible, Boolean searches can produce unintended or incomplete results, including no hits or too many hits, and many people don't understand how to effectively formulate a Boolean query. Researchers are experimenting with natural-language techniques that parse out the meaning of a user's request and search for hits based not only on exact matches but also on word associations and semantics. Most of these tools use thesauri or semantic networks. For instance, if you searched for occurrences of the word tooth, citations for molar, incisor, fang, and tusk might also be returned.
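The tooth example can be sketched in a few lines of Python. This is an illustration of thesaurus-based query expansion in general, not the method of any particular product; the tiny THESAURUS table, the function names, and the sample documents are invented for the purpose.

```python
# Hypothetical toy thesaurus; a real tool would draw on a far larger
# thesaurus or semantic network.
THESAURUS = {
    "tooth": {"molar", "incisor", "fang", "tusk"},
}

def expand_query(term, thesaurus=THESAURUS):
    """Return the search term plus any related words in the thesaurus."""
    term = term.lower()
    return {term} | thesaurus.get(term, set())

def search(term, documents, thesaurus=THESAURUS):
    """Return titles of documents containing the term or a related word."""
    wanted = expand_query(term, thesaurus)
    hits = []
    for title, text in documents.items():
        # Crude tokenization: split on spaces and trim punctuation.
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        if words & wanted:
            hits.append(title)
    return sorted(hits)

# Invented sample documents.
DOCUMENTS = {
    "dentistry": "The molar is the hardest tooth to extract.",
    "elephants": "Poachers prize the tusk above all.",
    "networks": "The server routes packets between segments.",
}

print(search("tooth", DOCUMENTS))
```

An exact-match search for tooth would find only the dentistry document; the expanded query also picks up the elephant article, which never uses the word.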
Sometimes, users don't even know what they're looking for. To address that, the Xerox PARC information-retrieval project, led by Jan Pedersen, is experimenting with a technique called scatter/gather, which reads huge collections of documents and intelligently groups them into categories based on the frequency of word occurrences. Without understanding actual content, scatter/gather can examine unindexed data sets and progressively narrow the universe of choices, so that a query you issue is more likely to result in hits.
Scanning a reference database of 1 million documents maintained by the Federal government, for example, produces ``clusters'' of subjects such as foreign policy, computers, and defense. These can be scanned again for finer clusters. By doing this, searchers can find documents they might otherwise not know to look for.
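The cluster-then-recluster cycle can be illustrated with a short Python sketch. It is not PARC's algorithm, whose details are not given here; it substitutes a simple single-pass grouping by cosine similarity of word counts. The function names, the stopword list, the threshold values, and the sample documents are all assumptions.

```python
import math
from collections import Counter

# Hypothetical stopword list; common words carry little topical signal.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "on", "for"}

def vectorize(text):
    """Count word occurrences, ignoring punctuation and stopwords."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def scatter(documents, threshold=0.2):
    """Group documents whose word counts resemble a cluster's running total."""
    clusters = []  # each entry: {"centroid": Counter, "members": [titles]}
    for title, text in documents.items():
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]),
                   default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(title)
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [title]})
    return [c["members"] for c in clusters]

def gather(documents, titles):
    """Select the chosen documents so they can be scattered again."""
    return {t: documents[t] for t in titles}

# Invented sample documents.
DOCS = {
    "a": "treaty talks on foreign policy and trade",
    "b": "new computers and software for networks",
    "c": "foreign policy debate over the trade treaty",
    "d": "software for computers on office networks",
}

print(scatter(DOCS))
print(scatter(gather(DOCS, ["a", "c"]), threshold=0.9))
```

The first pass separates the foreign-policy items from the computer items without any index or query. Gathering one cluster and scattering it again with a stricter threshold splits it further, which is the progressive narrowing described above.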
The flip side of not finding enough documents is finding too many. Much of text-retrieval research is now directed at ways to make searches more productive by returning only meaningful hits. A new technology from Oracle (Redwood Shores, CA) called ConText goes beyond thesaurus-based tools, performing syntactical analysis that can determine the subject of a sentence by isolating the main clause.
Brett Newbold, senior director of Oracle's text server division, says that ConText helps users find only documents that are really about the subject of the query. If you were looking for information about the Federal Reserve Bank, he says, a conventional search tool might return an unrelated article that merely quotes a bank official. ``ConText knows this document isn't about the Fed, even though the words Federal Reserve Bank appear in the article,'' he says.
What happens if, even after sophisticated filtering, you are inundated with documents that match your search criteria? PARC researchers Stuart Card, Jock Mackinlay, and George Robertson have created an Information Visualizer that explores ways to present orders of magnitude more data on a computer screen than is now possible with GUIs. One technique involves creating 3-D trees of linked objects, which can be rotated in space to select certain topics. Another concept, ``rooms'' of flexibly grouped files and programs, resulted in a commercial product from XSoft called Rooms for Windows.
Perhaps the most promising commercialization of PARC technology is XSoft's Visual Recall, which uses both file trees and a 2-D grid to present query results. The grid, or wall, shows documents arranged along the x-axis according to a linear criterion, such as date, and on the y-axis by another criterion, such as file type, author, or subject. The grid folds back into 3-D space, letting you view a great deal of data, and slides back and forth so you can quickly narrow in on specific clusters of documents or files.
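The data structure behind such a wall can be approximated with a simple two-key grouping, sketched below in Python. This is not XSoft's implementation; the function name build_wall, the field names, and the sample files are hypothetical, and the 3-D folding and sliding are display matters the sketch leaves out.

```python
from collections import defaultdict

def build_wall(documents, x_key, y_key):
    """Bucket documents into grid cells keyed by (x value, y value)."""
    cells = defaultdict(list)
    for doc in documents:
        cells[(doc[x_key], doc[y_key])].append(doc["name"])
    columns = sorted({x for x, _ in cells})  # linear criterion, e.g. date
    rows = sorted({y for _, y in cells})     # second criterion, e.g. type
    return columns, rows, dict(cells)

# Invented query results.
RESULTS = [
    {"name": "budget.xls", "year": 1993, "type": "spreadsheet"},
    {"name": "memo.doc", "year": 1994, "type": "text"},
    {"name": "plan.doc", "year": 1994, "type": "text"},
    {"name": "logo.tif", "year": 1993, "type": "image"},
]

columns, rows, cells = build_wall(RESULTS, "year", "type")

# The fullest cell is where a viewer would see the densest cluster.
densest = max(cells, key=lambda cell: len(cells[cell]))
print(columns, rows, densest)
```

Each occupied cell corresponds to a cluster on the wall; empty cells show at a glance which combinations of criteria returned nothing.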
Technologies such as these, combined with powerful document stores and search engines, will make locating information much easier. (Finding images remains a challenge, however; see the text box ``Image Retrieval for Compound Documents''.) The next step is to make the presentation of that information more consistent and aesthetic, an area now being addressed with cross-platform portable document formats such as Adobe Acrobat and WordPerfect Envoy.
The final step is to enable those portable documents to be encapsulated and linked into other files, which is being addressed by OLE and OpenDoc. When all these technologies are in place, document management, as such, will cease to exist as a category unto itself and will become, as it was always meant to be, synonymous with computing.
Illustration: Spectrum of Document Types
Some documents are smarter than others. Plain image files or faxes are inflexible and unscalable, and they can't be searched. With keywords, images are more easily retrieved. Full-text retrieval is possible only by using OCR or a portable document format that preserves both the appearance and content of the document. Documents prepared with SGML (Standard Generalized Markup Language) or the ODA (Open Document Architecture) know their own content and structure, but they don't necessarily appear the same across platforms. Structure allows more exact searches, such as occurrences of a word in a caption. Formats such as Microsoft Word address content, structure, and appearance, but they are proprietary. Containers are the ``smartest'' documents of all, holding objects or pointers to objects that know their own behavior and characteristics.
Illustration: Classes of Document Management
Systems for managing dynamic documents and image documents share many underlying capabilities, but they offer starkly different feature sets to users. Both also typically ride on top of a rich network infrastructure.
Illustration: The document profile in PC DOCS Open is typical of the kinds of fields you can track with a document manager. These fields, plus a full-text index, are the keys to retrieving filed documents, and they may someday be a part of the operating system.
Illustration: Adding a Mac-like foldering scheme to Windows, the Document Desktop from SoftSolutions is an integrated launch pad for documents or applications. Microsoft's Chicago release of Windows will also do away with the current distinction between the Program Manager and the File Manager.
Illustration: XSoft's Visual Recall uses a 3-D-like wall to show clusters of documents conforming to certain search criteria. Seeing these clusters helps you find documents faster than by scrolling through a 2-D list of hits ranked by relevancy.
Andy Reinhardt is BYTE's West Coast bureau chief. You can reach him on the Internet or BIX at areinhardt@bix.com.