THE ARCHIVE RAIDER: PROOF OF CONCEPT & OCR TESTING
Collection & Goals
The Lyndon Baines Johnson Library and Museum (LBJ) hosts a collection of recordings and transcripts of the President's telephone conversations and meetings. These recordings are available online in WAV and MP3 format at the Miller Center, a nonpartisan public policy institution affiliated with the University of Virginia.1 However, the transcripts associated with the recordings are behind a paywall and, to my knowledge, are not synced with the available audio. Furthermore, the LBJ will not permit scanning or copying of the thousands of pages of transcripts without payment of what would amount to a substantial fee. The goal of this project was to capture a large volume of transcript pages as efficiently as possible. The captured images needed to be of sufficiently high quality to permit OCR to render the transcripts into machine-readable text, which could then be synced with audio recordings in GLIFOS.
Set-Up
Using a scanner was out of the question due to LBJ policy, so Konrad Lawson's guide to building an ultra-portable copy stand was used to create an archive raider.2 The archive raider attaches to the researcher's table with a heavy-duty clamp; this project used Manfrotto's Super Clamp, which can fit a table edge up to 2.17 inches wide and bear a load of 33.07 pounds.3 We attached Manfrotto's aluminum 2-Section Articulated Arm with Camera Bracket to this clamp. The arm has a maximum length of 23.82 inches and can manage 3.31 pounds at maximum extension, which is sufficient to hold a professional-quality DSLR camera.4 (See Figure 1 for an image of this setup.) Attached to the camera bracket was my one-year-old Canon PowerShot SX130 IS, a compact zoom camera with 12 MP and a 12x optical zoom range that starts at 28mm. It does well in low light and features continuous autofocus, image stabilization, and a self-timer that can delay the shot by 2
2 Lawson, Konrad. The Chronicle of Higher Education, "The Articulated Arm of an Archive Raider." Last modified December 07, 2010. http://chronicle.com/blogs/profhacker/the-articulated-arm-of-an-archiveraider/29243.
Gaede 2
or 10 seconds. The lack of support for remote capture on this camera was a major hindrance; I would suggest that remote capture is necessary for any long-term, high-volume project for reasons of ergonomics and efficiency. Future work could include testing the Canon Hack Development Kit (CHDK), a third-party firmware enhancement that enables supported camera models to use a USB remote.5 Since remote capture is standard on higher-end camera models but mostly unavailable on budget models, CHDK could potentially reduce the cost of similar projects.
Though I spent considerable time practicing assembling the portable copy stand at home, configuring both it and the camera in the LBJ Library's Reading Room took nearly 45 minutes. The differences in table width and lighting conditions played havoc with my intended set-up. (For the list of camera settings ultimately used for image capture, please see Figure 2.) Images were captured against a piece of white foam board, which helped align transcript pages and provided a neutral background. I used the camera's two-second self-timer to prevent blurring caused by pressing the shutter button and disturbing the articulated arm. I was not permitted to use the flash in the reading room, but the automatic white balance and macro mode functions produced bright, crisp images without it. In an effort to create consistent image sets, I left the zoom alone for most shots.
Program Mode
Automatic White Balance
Macro Mode
2-second self-timer
No flash
Left zoom alone for nearly all images
Figure 2: Camera settings used for image capture
The actual photography proceeded at a brisk pace, taking, on average, 17 seconds per page. This was sometimes interrupted by the need to remove staples and re-staple after capture was complete, which took about 30 seconds per trip. I only had the archivist on duty remove staples for particularly long (>6 page) transcripts; I used a bean bag to weigh down the shorter multi-page transcripts. Adjusting the bean bag for each page added about 7 seconds to capture time; single-page transcripts took only about 10 seconds, compared to the 17-second average. In 90 minutes, I captured 157 transcript pages that cover the period between June 1967 and December 1967, corresponding to 15 audio recordings. Captured images were transferred to a MacBook and exported in TIFF format, creating files roughly 36 MB in size with dimensions of 3000 x 4000 pixels. (See Figure 3 for an example image.) I spent two hours exporting, downloading, and organizing the images, audio files from the Miller Center, and machine-encoded text files created in the second half of this project into folders by month and sub-folders by transcript number.
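As a rough check on the figures reported above, the short Python sketch below reworks the arithmetic: pure shooting time at the 17-second average, the size of one uncompressed image, and the total storage for the session. It assumes uncompressed 8-bit RGB TIFFs (3 bytes per pixel) and decimal megabytes; neither assumption is stated in the text, and the function names are illustrative only.

```python
# Figures reported in the text.
PAGES_CAPTURED = 157
SECONDS_PER_PAGE = 17            # reported average capture time
WIDTH_PX, HEIGHT_PX = 3000, 4000
TIFF_MB = 36                     # reported approximate file size


def capture_minutes(pages, seconds_per_page):
    """Minutes of pure shooting time for a number of pages."""
    return pages * seconds_per_page / 60


def uncompressed_rgb_mb(width, height, bytes_per_pixel=3):
    """Size in decimal MB of an uncompressed image.

    Assumption: 8-bit RGB, i.e. 3 bytes per pixel, no compression.
    """
    return width * height * bytes_per_pixel / 1_000_000


def storage_gb(pages, mb_per_page):
    """Total storage in decimal GB for a set of page images."""
    return pages * mb_per_page / 1000


if __name__ == "__main__":
    print(f"Shooting time: {capture_minutes(PAGES_CAPTURED, SECONDS_PER_PAGE):.1f} min")
    print(f"One image:     {uncompressed_rgb_mb(WIDTH_PX, HEIGHT_PX):.1f} MB")
    print(f"Whole session: {storage_gb(PAGES_CAPTURED, TIFF_MB):.1f} GB")
```

Under these assumptions, shooting alone accounts for roughly half of the 90-minute session, and a 3000 x 4000 pixel image works out to the reported 36 MB. Scaling `PAGES_CAPTURED` up gives a quick estimate of the disk space a larger collection would need.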
Remote capture is vital: obtain a camera that comes standard with this feature, or see if the Canon Hack Development Kit (or a similar project) will enable your camera to support a USB remote. This will reduce the time needed per page, since you will no longer need a time delay to prevent blurring, and it will save you a great deal of back pain.
Take the time to become extremely familiar with your camera: you will not be able to control lighting conditions in your archive and must be able to adapt to natural, fluorescent, incandescent, and low light.
Prepare your file structure ahead of time: exporting your large photos and organizing them can be quite time-intensive. Research your collection and create your file directory before taking your first photograph; dragging and dropping on site is far easier than trying to remember what belongs where after the fact.
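One way to prepare such a file structure in advance is a short script that builds the month and transcript folders before the archive visit. The sketch below is only an illustration: the `build_tree` function, the `transcript_` prefix, and the month and transcript labels are hypothetical and are not the names used in this project.

```python
from pathlib import Path


def build_tree(root, transcripts_by_month):
    """Create root/<month>/transcript_<number>/ folders.

    `transcripts_by_month` maps a month label to a list of transcript
    numbers. Both the labels and the folder naming are placeholders.
    Returns the list of folder paths, in creation order.
    """
    created = []
    for month, transcripts in transcripts_by_month.items():
        for number in transcripts:
            folder = Path(root) / month / f"transcript_{number}"
            # parents=True builds the month folder as needed;
            # exist_ok=True makes the script safe to re-run.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

For example, `build_tree("lbj_transcripts", {"1967-06": ["0001", "0002"], "1967-07": ["0003"]})` would create three transcript folders under two month folders, ready to receive images, audio files, and text files as they are produced.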
There are a number of options for creating machine-readable versions of these images, each of which has potential benefits and disadvantages, depending on the document. With some time spent on training, ABBYY FineReader 11 does very well with more complicated formatting. However, it is susceptible to artifacts (as seen in my issue with the full stop) and has difficulty with handwriting. If you have documents with significant handwritten annotation, you may wish to consider re-speaking instead, though the resulting formatting may not be what you want. Dragon Dictate performed very well with summaries that were mostly text and contained little formatting other than standard punctuation. Re-speaking from audio is to be avoided; a good typist will be much faster and encounter less frustration.