Approaches to Messy Digitized Archival Documents:
Testing Automated Procedures for Enhancing OCR Legibility of 20th Century Business Licenses

Pre- and post-processed image
Project summary
In spring 2020, we, Omar Hammami and Leo Zhang, experimented with computer-based approaches to extracting data from photographs of archival records. Sophisticated OCR programs, like ABBYY FineReader 15, were unable to accurately extract data from these documents because of poor image quality. These problems arose from the condition of the original documents and the conditions under which they were photographed in the archives.
While our peers, Grace Gao and Xinyi Wang (From Archive to Data), worked with a cleaned subset of these images, our goal was to design a process that could be applied to all images in the dataset. Working with thousands of photographs of digitized business license records, we designed procedures for handling messy digitized archival documents.
We worked on making the images more legible to OCR by first cleaning them in a preprocessing stage. We created several new preprocessing approaches, each aiming to isolate the words on the page at the correct orientation. Each approach has its own strengths on certain images, but the inconsistency across our data was our biggest obstacle.
As a second step, we created a protocol for dealing with messy OCR output, even from input images that could be effectively cleaned in preprocessing.
We avoided human decision-making as much as possible, since automating the process has clear efficiency advantages: it would enable researchers to scale these procedures up to even larger datasets.
Our toolbox includes the following:
1. Labelling
2. Edge Detection
3. Post Processing
Labelling
We created a labelling system to preserve this information, enabling us to deal with images by problem area. Labelling sorts the images from the chronological order in which they were photographed into the order in which we want them processed by OCR software.
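As a minimal sketch of such a scheme, one can encode the problem area in the filename and sort on it; the label names and the `__` separator below are hypothetical, not our actual convention.

```python
from pathlib import Path

# Hypothetical problem-area labels; the real label set depended on
# what we observed in the photographs.
LABELS = {"skew", "shadow", "blur", "clean"}

def label_image(path, label):
    """Prefix a filename with its problem label, e.g. skew__0001.jpg."""
    assert label in LABELS
    p = Path(path)
    return p.with_name(f"{label}__{p.name}")

def processing_order(paths):
    """Group labelled files so each problem area is handled together."""
    return sorted(paths, key=lambda p: Path(p).name.split("__", 1)[0])
```

Grouping by label lets each batch go through the preprocessing recipe tuned for its problem area.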
Area size

One approach filtered out noise by removing blobs smaller than a typical letter. Unfortunately, we encountered the problem that a piece of noise could be around the same size as a letter, which undermined filtering by size alone.

Morphology

We chained dilation and erosion passes, with a binary threshold in between, to suppress small noise while keeping letter shapes intact:
import cv2
import numpy as np

# img: grayscale page image loaded earlier,
# e.g. img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
k33 = np.ones((3, 3), np.uint8)  # small kernel for dilation
k55 = np.ones((5, 5), np.uint8)  # larger kernel for erosion
img = cv2.dilate(img, k33)
img = cv2.erode(img, k55)
img = cv2.dilate(img, k33)
ret, img = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)
img = cv2.dilate(img, k33)
img = cv2.erode(img, k55)
Edge Detection
The idea was that cv2's findContours can find every contour in an edge-detected version of the image. Once the contours are found, we can grab the largest one, which should be the outline of the document, and use it to apply a perspective transform to the original image.
However, this turned out to be too difficult for our project: our pictures could not provide a complete contour around the whole document, since they vary in lighting, outline, and document shape. Mobile document scanners rely on a clear outline and contrast between the foreground and background of the picture. This was not the case for our set of pictures, so we ultimately concluded that edge detection for document cropping was not feasible.
Convolution

Gif image source: Wikipedia.




Post Processing
.txt approach
The archival data was tabular (three columns with a wide range of rows) and we wanted tabular output. ABBYY FineReader, however, misread the layout in two ways: 1) mixing up columns and 2) mixing up rows. The most difficult outputs combined both problems.
To process the files produced by ABBYY, we used regular expressions to convert each .txt file into a formatted .csv file. Regular expressions identify patterns within lines of characters, which allows the text to be separated. These separated pieces of text were then used to create one categorized row per entry of data. The pattern recognition had to adapt to ABBYY's irregular outputs, whose .txt files varied between a "horizontal" split (columns of data separated horizontally), a "vertical" split (columns of data separated vertically), and a mix of both.
The aim was to produce a structured .csv output for every .txt output. However, OCR inconsistencies break up the expected patterns, so the regular expressions had to be written carefully to tolerate them.
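As an illustrative sketch of the .txt-to-.csv step: the pattern below assumes each well-formed line holds a license number, a name, and a four-digit year. That schema and the column names are hypothetical; the real ABBYY output required several patterns for the horizontal, vertical, and mixed splits described above.

```python
import csv
import io
import re

# Hypothetical row pattern: a license number, a name, and a four-digit
# year, separated by runs of whitespace.
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d{4})\s*$")

def txt_to_csv(text):
    """Convert OCR .txt lines into CSV rows, skipping lines that
    do not match the expected pattern."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["license_no", "name", "year"])
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            writer.writerow(m.groups())
    return buf.getvalue()
```

Lines the pattern cannot match are skipped here; in practice they would be routed to a second pass with looser patterns rather than dropped.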


Team

Omar Hammami
Omar is majoring in Computer Science, with a minor in Data Science. Born in Ohio and raised in Saudi Arabia as a teenager, he found a kindred interest in a project about people moving across the Atlantic Ocean. Omar also enjoys data visualization, as it helps him fulfill his fantasy of being a real artist.

Leo Zhang
