BYTE.com > Mr. Computer Language Person > 2006
Alarums and Excursions
By Martin Heller
May 1, 2006
Recently, I've been working on an application that reads Word documents, extracts key information, and stuffs that into a database. I know that sounds like a joke definition of a brain, but it's for real.
I have a client that does a lot of technology assessments, all conforming to a small number of Word templates. In the past, investigators could look at old assessment documents on a network share and pick out any information that interested them. That stopped working well once there were thousands of old documents, at least until one of the administrators built a Google Desktop index of the archive directory.
The local Google index made finding things very easy, at least for the in-house people. Before we could figure out a solution for the outside contractors, however, the lawyers put in their two cents, and all of a sudden nobody was allowed to read old assessments except the executives and editors, on the theory that the documents might contain proprietary information.
Fortunately, it turned out that the genuinely useful information in old assessments is non-proprietary: market size estimates, the contact information of the people the investigator talked to, the patents the investigator found, and the references the investigator consulted. All of that stuff happens to be contained in tables or footnotes, and all the indexing information that describes the area of the assessment--the Library of Congress headings, the patent classes, and so on--is held in one table on the first page.
Parsing a Word Document
I've known for a long time that Microsoft Word had an extensive document object model, but as a C++ developer I found that model difficult to use. It was really designed for VB programmers, and it wasn't supported very well with professional development tools.
In the last couple of years, Microsoft has come out with Visual Studio Tools for Office, which makes the experience of programming against Word, Excel, Outlook, and InfoPath more like writing a managed code application.
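To make the "tables on the first page" idea concrete, here is a minimal sketch of pulling every table out of a Word document. It does not use the Word object model or Visual Studio Tools for Office discussed above. Instead it assumes, purely for illustration, that the assessments have been saved in the .docx (Office Open XML) format, which is a zip archive whose main part is word/document.xml. The function names (cell_text, tables_from_xml, extract_tables) are hypothetical, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx files.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def cell_text(cell):
    """Join the text runs of one table cell, one paragraph per line."""
    paragraphs = []
    for p in cell.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return "\n".join(paragraphs).strip()


def tables_from_xml(document_xml):
    """Return every top-level table as a list of rows of cell strings."""
    body = ET.fromstring(document_xml).find(W + "body")
    tables = []
    for tbl in body.findall(W + "tbl"):  # direct children of the body only
        rows = []
        for tr in tbl.findall(W + "tr"):
            rows.append([cell_text(tc) for tc in tr.findall(W + "tc")])
        tables.append(rows)
    return tables


def extract_tables(docx):
    """Assumption: docx is a path or binary file object holding a .docx file."""
    with zipfile.ZipFile(docx) as archive:
        return tables_from_xml(archive.read("word/document.xml"))
```

Under these assumptions, extract_tables(path)[0] would be the indexing table from the first page, and the remaining entries would be the tables holding market sizes, contacts, patents, and references. Footnotes live in a separate part of the archive and are not handled here.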