From Archive to Data
Optical Character Recognition and data visualization of documents used in the New York Restaurant Keepers project.
Oyster Stands In Fulton Market
The New York Public Library Digital Collections. 1870.
Data Visualization and Mapping
By the end of the 19th century, a large percentage of the restaurants in New York City were owned by foreign-born citizens, who left behind significant records from which we can learn about the intriguing mobility patterns of immigrants in New York during that period.
Prof. Heather Lee at NYU Shanghai leads a project called New York Restaurant Keepers that aims to explore the migrant experience in New York City through its culinary scene. One problem, however, has been holding back the research: the group needs to digitize data from scanned document images through OCR (Optical Character Recognition), but both the tabular structure and the noise in these images significantly reduce recognition accuracy. The digitized data from these documents is crucial to the project because all further data analysis depends on it.
Our research project, therefore, aims to raise the OCR accuracy rate and provide visualizations based on the extracted data. We propose a tailored preprocessing and recognition pipeline customized for the documents used in New York Restaurant Keepers. Images are categorized into groups by the preprocessing steps they need, and various data cleaning methods were introduced in the post-processing phase. This process achieved more than 85% accuracy on the subset of data from the borough of Manhattan, and the map and chart visualizations were published on the project website for demonstration and future reference.
02 Historical Data
Preprocessing & Visualizing Approved Licensing in Manhattan
Manhattan, 1910 census tabulation tracts
The New York Public Library Digital Collections. 1910.
Gender Distribution in Different Culinary Businesses
Overall Gender Distribution
First and foremost, we decided on an OCR tool. We considered two choices: Tesseract and ABBYY FineReader. Tesseract is an Optical Character Recognition engine, now sponsored by Google. It is a powerful and popular OCR tool that supports many languages. It is executed from the command-line interface and outputs .txt files, and some projects provide a separate GUI for a friendlier user experience (“3rdParty – tesseract-ocr – GUIs and Other Projects using Tesseract OCR”, github.com, retrieved 2017-03-30). ABBYY FineReader, on the other hand, is OCR software with a GUI that allows custom configuration. It provides different recognition modes such as Text and Table, and it lets users preprocess the images appropriately before running the OCR algorithm. After a series of experiments, we decided that for New York Restaurant Keepers we would use ABBYY FineReader as our foundation and optimize the OCR process on top of it, based on its average accuracy and its level of customizability for our project’s specific needs.
With ABBYY FineReader’s customization, we were able to output the recognition results in table format. However, because of the skew and line distortion of the input images, the table recognition results were very poor. We therefore decided to apply deskewing and line-straightening preprocessing to improve recognition accuracy.
We experimented with different approaches to applying preprocessing steps to the images, and we concluded that it was most efficient to group the images into categories by the problems they exhibit and adjust each group accordingly. Given that more than 2,900 images in total must be digitized for the 5 boroughs of New York City, we built a classifier that separates skewed images from unskewed ones. The classifier takes a folder of images placed in the same directory as the Python script and outputs two .txt files: one listing the skewed image names with their detected angles, and the other listing the unskewed image names with their angles. Based on repeated experiments, we set the skew threshold to 2 degrees. With the classifier’s result, we halved the number of images fed into ABBYY FineReader’s deskew step, roughly doubling preprocessing efficiency.
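The grouping step can be sketched as a small Python helper. This is a minimal sketch, not the project's actual script: the angle detection itself is assumed to happen elsewhere (for example via a projection-profile or Hough-transform method), and the function and output file names here are illustrative.

```python
from pathlib import Path

SKEW_THRESHOLD = 2.0  # degrees, per the repeated experiments described above


def group_by_skew(angles, threshold=SKEW_THRESHOLD):
    """Split {image_name: detected_angle} into skewed / not-skewed groups."""
    skewed = {n: a for n, a in angles.items() if abs(a) >= threshold}
    straight = {n: a for n, a in angles.items() if abs(a) < threshold}
    return skewed, straight


def write_groups(angles, out_dir="."):
    """Write the two .txt files the classifier produces: image name + angle."""
    skewed, straight = group_by_skew(angles)
    for fname, group in (("skewed.txt", skewed), ("not_skewed.txt", straight)):
        with open(Path(out_dir) / fname, "w") as f:
            for name, angle in sorted(group.items()):
                f.write(f"{name}\t{angle:.2f}\n")
```

Only the `skewed.txt` group then needs to go through FineReader's deskew step, which is what halves the preprocessing workload.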
We conducted OCR in ABBYY with the necessary manual configuration. We preprocessed the majority of the images according to the grouping result from the classifier: we applied deskewing to the skewed group and ran OCR directly on the other group. Approximately 10% of all the images have overlapping columns, which affects recognition of the address column; for these special cases, we cropped the image into two parts, one for ID numbers and names, the other for addresses. This special case requires an extra step in the data cleaning phase: re-matching the two parts back into one file. The trade-off between accuracy and processing speed is worthwhile here, as cropping saves the entire file from being wrongly recognized.
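The re-matching step can be sketched as follows, under the assumption that the two OCR outputs of a cropped page preserve row order, so row i of the name/ID half corresponds to row i of the address half; the function name and the tab-separated layout are illustrative, not the project's exact format.

```python
import csv


def rematch(names_file, addresses_file, out_file):
    """Zip the two OCR outputs of a cropped page back into one CSV table.

    names_file: tab-separated rows of ID and name (one crop half).
    addresses_file: one address per row (the other crop half).
    """
    with open(names_file) as nf, open(addresses_file) as af:
        names = [line.rstrip("\n") for line in nf]
        addrs = [line.rstrip("\n") for line in af]
    if len(names) != len(addrs):
        # Row counts disagree: OCR dropped or merged a row somewhere,
        # so this page needs manual review instead of blind zipping.
        raise ValueError("row counts differ; page needs manual review")
    with open(out_file, "w", newline="") as out:
        writer = csv.writer(out)
        for n, a in zip(names, addrs):
            writer.writerow([*n.split("\t"), a])
```

The row-count check matters in practice: silently zipping mismatched halves would shift every address below the error onto the wrong owner.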
With the names and addresses of the culinary business owners, we had their basic information, but no other perspective could be obtained. To draw richer historical insight from the data, we used two external APIs, the Gender API and NamSor, to enrich the data set with gender and nationality features. The Gender API takes a person’s first name and returns the most likely gender for that name based on a global dataset lookup. It normalizes the input name and fixes possible typos as much as possible, which in our case effectively reduces the bias caused by incorrect recognition. The NamSor API provides a comprehensive classification of a person’s country of origin based on their full name. Also, since the restaurant owners’ information would eventually be visualized on a map, we converted the addresses into latitude and longitude through the googlemaps Python library.
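The enrichment can be sketched as one function applied per record. This is a sketch, not the project's code: the three lookup callables are stand-ins for the real HTTP calls to the Gender API, NamSor, and the googlemaps geocoder, and the field names are illustrative.

```python
def enrich(record, gender_lookup, origin_lookup, geocode):
    """Attach gender, country of origin, and coordinates to one owner record.

    Each callable stands in for one external service:
      gender_lookup  - Gender API  (first name -> gender)
      origin_lookup  - NamSor      (full name  -> country of origin)
      geocode        - googlemaps  (address    -> (lat, lng))
    """
    first_name = record["name"].split()[0]
    lat, lng = geocode(record["address"])
    return {
        **record,
        "gender": gender_lookup(first_name),
        "origin": origin_lookup(record["name"]),
        "lat": lat,
        "lng": lng,
    }
```

In production each lookup would wrap an HTTP request and should cache its results, since the same first names recur thousands of times across the 2,900+ pages and every API call costs quota.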
With all the information prepared, we moved on to visualizing the data. We hoped to reveal the immigration patterns of New York City from a geographical perspective, to see the gender distribution among culinary business owners, and to dive into the details of each sub-category within the culinary business.
We generated a pie chart depicting the overall gender distribution. It shows that 88.2% of the stores are owned by men, while only 10.9% are owned by women, indicating that male owners make up the overwhelming majority of culinary businesses.
We also made a stacked bar chart showing the percentage of male and female owners in each culinary category. It shows that running a restaurant and selling milk are the two main businesses among the owners, and that in most businesses female owners account for around 10 percent of the population, with the exceptions of the meat & sausages and shellfish categories.
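The tallies behind both charts can be computed in a few lines of standard-library Python; the record fields (`gender`, `category`) are illustrative of how the cleaned data is shaped, not the project's exact schema.

```python
from collections import Counter, defaultdict


def gender_distributions(owners):
    """Tally overall and per-category gender counts from owner records."""
    overall = Counter(o["gender"] for o in owners)
    by_category = defaultdict(Counter)
    for o in owners:
        by_category[o["category"]][o["gender"]] += 1
    return overall, dict(by_category)


def as_percentages(counts):
    """Convert a Counter of raw tallies into rounded percentages."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}
```

The `overall` percentages feed the pie chart, and each entry of `by_category` becomes one bar of the stacked chart.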
Several conclusions can be drawn from the map. A conspicuous cluster of restaurants sits on the Lower East Side, to the southwest of the East Village; the majority of them are owned by German immigrants. There is also a cluster in the West Village, with a more diverse representation of countries of origin. As for individual countries of origin, businesses owned by Jewish immigrants tend to be in the midtown and downtown area, roughly from 51st Street down to Canal Street. Those belonging to French owners have a scattered distribution across the Manhattan borough. A majority of the Italian immigrants chose locations between 23rd Street and Worth Street, with a few of them clustered within a short distance of each other farther north in East Harlem. We anticipate a more comprehensive picture after we extract and plot all of the data.
Having accomplished all of the above, we deployed the visualization on the project website. Because we did not use the entire dataset, we present these results as interim progress toward the complete New York Restaurant Keepers project.
The Next Stage
- Gender API: an API that detects the gender of a name; used to find the gender of each store owner
- NamSor: a name ethnicity and gender classifier API; used to find the nationality of each store owner
- Google Maps API: used to display all store addresses on Google Maps
Xinyi is a senior at NYU Shanghai majoring in Data Science and minoring in Business. She is originally from Shanghai and spent her junior year studying in New York. Most of her previous work involves data analysis, machine learning, and software engineering. She is interested in combining data science with other subjects in the future. Her hobbies include photography, music, and basketball.