Approaches to Messy Digitized Archival Documents:

Testing Automated Procedures for Enhancing OCR Legibility of 20th Century Business Licenses

Pre- and post-processed image

Project summary

In spring 2020, we, Omar Hammami and Leo Zhang, experimented with computer-based approaches to extracting data from photographs of archival records. Sophisticated OCR programs like ABBYY FineReader 15 were unable to accurately extract data from these documents because of poor image quality, which stemmed both from the condition of the original documents and from the conditions under which they were photographed in the archive.

While our peers, Grace Gao and Xinyi Wang (From Archive to Data), worked with a cleaned subset of these images, our goal was to design a process that could be applied to every image in the dataset. Working with thousands of photographs of digitized business license records, we designed procedures for handling messy digitized archival documents.

We first worked on making the images more legible to OCR by cleaning them in a preprocessing stage. We created several new preprocessing approaches that aim to algorithmically isolate the words on the page at the right orientation. Each approach has its own strengths on certain images, but the inconsistency across our data was our biggest obstacle.

As a second step, we created a protocol for dealing with the messy output produced by messy input data (images that could not be effectively cleaned in preprocessing).

We avoided human decision-making as much as possible, since automating the process has clear efficiency advantages: it would allow us to scale the procedures up to even larger datasets.

Our toolbox includes the following:

1. Labelling

2. Edge Detection

3. Post Processing

Labelling

The documents were photographed in the order they were arranged in the archive, which meant that the order of the photographs carried information about the content of the documents. To our surprise, this order mattered: borough information appeared on some pages but not on others, because the original record keeper assumed the pages would be read in sequence.

We created a labelling system to preserve this information and to let us deal with images by problem area. The labelling sorts the images from the chronological order in which they were photographed into the order in which we want them processed by OCR software.
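As a rough illustration only: the sketch below sorts photographs by the numeric part of their file names (assumed here to reflect capture order) and writes an ordered manifest. The file-naming pattern and the manifest format are assumptions for the example, not our exact labelling scheme.

import csv
import re
from pathlib import Path

SEQ = re.compile(r"(\d+)")  # hypothetical: sequence number embedded in the file name

def build_manifest(image_dir, manifest_path):
    # sort photographs by their embedded sequence number, i.e. capture order
    paths = sorted(Path(image_dir).glob("*.jpg"),
                   key=lambda p: int(SEQ.search(p.stem).group(1)))
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["processing_order", "file"])
        for i, p in enumerate(paths, 1):
            writer.writerow([i, p.name])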

Area size

One problem area for OCR programs was detecting the boundaries of the content, so we designed an approach to make those boundaries clearer to the computer. The idea behind this approach is to compute the area of each connected region in the image and filter out regions that are too big or too small to be a letter, as shown below.
A successful example is the following. We do not particularly need each letter to be sharp in the result, because we can later use it as a mask over the original image.
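A minimal sketch of this kind of area filter, using cv2.connectedComponentsWithStats; the area thresholds are illustrative and would need tuning for a given image set.

import cv2
import numpy as np

def keep_letter_sized_regions(binary_img, min_area=20, max_area=2000):
    # label each connected region and keep only those whose area is plausible for a letter
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary_img, connectivity=8)
    mask = np.zeros_like(binary_img)
    for i in range(1, n):  # label 0 is the background
        if min_area <= stats[i, cv2.CC_STAT_AREA] <= max_area:
            mask[labels == i] = 255
    return mask  # can be used to mask the original image later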

Unfortunately, we encountered cases where a piece of noise was roughly the same size as a letter, which prevented the filter from removing it.

Morphology

Blurry or fuzzy text was another problem area for OCR programs, so we applied mathematical morphology to deal with those issues. A morphological operation sweeps one image with the shape of another (the structuring element), smudging or shrinking it accordingly. Implemented with cv2's morphological functions, this is very effective at smoothing the edges of letters, as demonstrated below.
We also found that we can chain these operations to create stronger effects:
import cv2
import numpy as np
# load a grayscale page image (the path here is illustrative)
img = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)
# 3x3 and 5x5 square structuring elements
k33 = np.ones((3, 3), np.uint8)
k55 = np.ones((5, 5), np.uint8)
# alternate dilation and erosion to smooth letter edges, binarize, then smooth again
img = cv2.dilate(img, k33)
img = cv2.erode(img, k55)
img = cv2.dilate(img, k33)
ret, img = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)
img = cv2.dilate(img, k33)
img = cv2.erode(img, k55)
Population of the United States, from 1790 to 1870; Population of the Principal Cities of the United States, having a population of over 25,000 in 1870; Population of the Dominion of Canada, from the official census of 1871; Population of the Globe. Image source: The New York Public Library Digital Collections, 1876.

Edge Detection

One of the issues with the photographs was what appeared beyond the borders of the page. We initially tackled this through thresholding, but we reasoned that if we could crop the image to the edges of the document, no outside noise would remain to disrupt the OCR. Once cropped, the document could also be perspective-corrected, fixing the image skew. The goal was to create a kind of "document scanner," like the mobile applications that crop and brighten a document photographed with a phone camera.

The idea was that cv2's findContours can find every contour in an edge-detected version of the image. After the contours are found, we would grab the largest one, which should be the outline of the document, and use it to apply a perspective transform to the original photograph.
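A minimal sketch of that idea, assuming a photograph in which the page outline is clearly visible against the background (which, as noted below, was rarely true for our images):

import cv2
import numpy as np

def crop_to_document(img):
    # edge-detect, find the largest contour, and perspective-warp it to a flat page
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 75, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    quad = cv2.approxPolyDP(largest, 0.02 * cv2.arcLength(largest, True), True)
    if len(quad) != 4:
        return img  # no clean four-corner outline found
    pts = quad.reshape(4, 2).astype("float32")
    # order corners: top-left, top-right, bottom-right, bottom-left
    s, d = pts.sum(axis=1), np.diff(pts, axis=1).ravel()
    src = np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                    pts[np.argmax(s)], pts[np.argmax(d)]], dtype="float32")
    w = int(max(np.linalg.norm(src[0] - src[1]), np.linalg.norm(src[3] - src[2])))
    h = int(max(np.linalg.norm(src[0] - src[3]), np.linalg.norm(src[1] - src[2])))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype="float32")
    return cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))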

However, this turned out to be too difficult for our dataset: the photographs vary in lighting, outline, and document shape, so a complete contour around the whole document often cannot be found. Mobile document scanners rely on a clear outline and a strong contrast between foreground and background, which our pictures did not have. We therefore concluded that edge detection for the purpose of document cropping was not feasible here.

Convolution

Another major problem was skewed text, which we addressed by measuring the skew and correcting it so that the text lines appear straight. Convolution is a mathematical operation that takes one function, multiplies it by a second function shifted by some offset, and integrates the area under the resulting curve: (f * g)(t) = ∫ f(τ) g(t − τ) dτ.

Gif image source: Wikipedia.

If we implement this with f as a binary patch of the image and g as a matrix block whose left half is −1 and whose right half is 1, we get something resembling a "statistical" derivative; we then only need to find its maximum and minimum. In theory, this finds a large, sudden change along a one-dimensional axis without being thrown off by noise, like this:

Gif image source: Wikipedia.
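A minimal sketch of this idea in NumPy, where a [−1 … −1, 1 … 1] step kernel is convolved with the row (or column) profile of a binary patch; the kernel half-width is an illustrative parameter.

import numpy as np

def strongest_transitions(binary_patch, axis=0, half_width=10):
    # project the patch onto one axis, then convolve with a step kernel
    profile = binary_patch.sum(axis=1 - axis).astype(float)
    kernel = np.concatenate([-np.ones(half_width), np.ones(half_width)])
    response = np.convolve(profile, kernel, mode="valid")
    # the extrema mark the largest dark-to-light and light-to-dark changes
    return int(np.argmax(response)), int(np.argmin(response))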

We can also prepare a set of rotated [−1, 1] blocks like the following.
We then use the same convolution technique to find which angle fits best (a code sketch appears at the end of this section).
Here are some examples of final results after combining the translational and angular convolution.
Table of Distances; Post Offices; Population from U.S. Census of 1870. Image source: The New York Public Library Digital Collections, 1875.
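A rough sketch of the angular search described above; the kernel size, angle range, and scoring by peak filter response are assumptions made for illustration.

import cv2
import numpy as np

def best_text_angle(binary_img, angles=np.arange(-10.0, 10.5, 0.5), size=31):
    # build a square kernel whose left half is -1 and right half is +1,
    # rotate it through candidate angles, and keep the angle with the
    # strongest 2-D convolution response
    base = np.ones((size, size), np.float32)
    base[:, : size // 2] = -1.0
    center = (size / 2 - 0.5, size / 2 - 0.5)
    best_angle, best_score = 0.0, -np.inf
    img32 = binary_img.astype(np.float32)
    for angle in angles:
        M = cv2.getRotationMatrix2D(center, float(angle), 1.0)
        kernel = cv2.warpAffine(base, M, (size, size))
        score = np.abs(cv2.filter2D(img32, -1, kernel)).max()
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

The returned angle could then be used with cv2.getRotationMatrix2D and cv2.warpAffine to rotate the page back to horizontal.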

Post Processing

.txt approach

While the above approaches cleaned up many of the images, not all of the problems could be solved. Messy archival images produced messy output.

The archival data was tabular (three columns with a widely varying number of rows), and we wanted tabular output. ABBYY FineReader, however, misread the structure in two ways: 1) mixing up columns and 2) mixing up rows. The most difficult outputs combined both problems.

To process the files produced by ABBYY, we used regular expressions to convert each .txt file into a formatted .csv file. Regular expressions identify patterns within lines of characters, which allows the text to be separated into fields; these separated pieces of text were then used to create a categorized row for each data entry. The pattern matching had to adapt to ABBYY's irregular outputs, where .txt files differed between a "horizontal" split (columns of data separated horizontally), a "vertical" split (columns of data separated vertically), and a mix of both.

The aim was to produce a structured .csv output for every .txt output. However, pattern matching is easily broken by OCR inconsistencies that fragment the expected patterns, so the regular expressions had to be written carefully.
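As a hedged illustration of the .txt-to-.csv step: the row pattern and column names below are hypothetical, and the real ABBYY output required several pattern variants for the horizontal, vertical, and mixed splits described above.

import csv
import re

# hypothetical pattern: "license number  name  address" on a single line
ROW = re.compile(r"^\s*(\d{1,6})\s+(.+?)\s{2,}(\S.*?)\s*$")

def txt_to_csv(txt_path, csv_path):
    with open(txt_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["license_no", "name", "address"])
        for line in src:
            match = ROW.match(line)
            if match:
                writer.writerow(match.groups())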

Team

Omar Hammami

Omar is majoring in Computer Science, with a minor in Data Science. Born in Ohio and raised in Saudi Arabia as a teenager, he found a kindred interest in a project about people moving across the Atlantic Ocean. Omar also enjoys data visualization, as it helps him fulfill his fantasy of being a real artist.

Leo Zhang

Leo is currently studying at Tandon, double majoring in physics and mathematics and minoring in computer science. He enjoys building mathematical models to express visual effects.
