BYTE.com
BYTE.com > Mr. Computer Language Person > 2006

Alarums and Excursions

By Martin Heller

May 1, 2006




Recently, I've been working on an application that reads Word documents, extracts key information, and stuffs that into a database. I know that sounds like a joke definition of a brain, but it's for real.

I have a client that does a lot of technology assessments, all conforming to a small number of Word templates. In the past, investigators could look at old assessment documents on a network share and pick out any information that interested them. That stopped working well once there were thousands of old documents, and it stayed that way until one of the administrators built a Google Desktop index of the archive directory.

The local Google index made finding things very easy, at least for the in-house people. Before we could figure out a solution for the outside contractors, however, the lawyers put in their two cents, and all of a sudden nobody was allowed to read old assessments except the executives and editors, on the theory that the documents might contain proprietary information.

Fortunately, it turned out that the genuinely useful information in old assessments is non-proprietary: market size estimates, the contact information of the people the investigator talked to, the patents the investigator found, and the references the investigator consulted. All of that stuff happens to be contained in tables or footnotes, and all the indexing information that describes the area of the assessment--the Library of Congress headings, the patent classes, and so on--is held in one table on the first page.
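
As a rough illustration of the extract-and-store idea described above, here is a minimal Python sketch. It is not the code from this project and it does not read real Word files: the `document` dictionary is a hypothetical, already-parsed stand-in, and the function names, the SQLite table layout, and the sample values are all assumptions made for the example. The only things taken from the text are that the useful data lives in tables and footnotes, and that the indexing information sits in one table on the first page (modeled here as the first table).

```python
import sqlite3

# Hypothetical, already-parsed stand-in for one assessment document.
# A real version would fill this in from Word's document object model.
document = {
    "name": "assessment-0001.doc",  # made-up file name
    "tables": [
        # Assumption: the first table is the first-page indexing table.
        [("Library of Congress heading", "Example heading"),
         ("Patent class", "Example class")],
        # Assumption: a later table, such as a list of contacts.
        [("Contact", "Example Person")],
    ],
    "footnotes": ["Example reference consulted by the investigator."],
}


def extract_rows(doc):
    """Flatten tables and footnotes into (doc, kind, label, value) rows."""
    rows = []
    for index, table in enumerate(doc["tables"]):
        # Tag rows from the first table as indexing information.
        kind = "index" if index == 0 else "table"
        for label, value in table:
            rows.append((doc["name"], kind, label, value))
    for note in doc["footnotes"]:
        rows.append((doc["name"], "footnote", "", note))
    return rows


def store(conn, rows):
    """Write the extracted rows to a single illustrative SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted "
        "(doc TEXT, kind TEXT, label TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO extracted VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store(conn, extract_rows(document))
for row in conn.execute("SELECT kind, label, value FROM extracted"):
    print(row)
```

Keeping extraction separate from storage means the hypothetical `extract_rows` step could later be swapped for one that walks a real Word document without touching the database code.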

Parsing a Word Document

I've known for a long time that Microsoft Word had an extensive document object model, but as a C++ developer I found that model difficult to use. It was really designed for VB programmers, and it wasn't supported very well with professional development tools.

In the last couple of years, Microsoft has come out with Visual Studio Tools for Office, which makes the experience of programming against Word, Excel, Outlook, and InfoPath more like writing a managed code application.
