New York Restaurant Keepers

This project explores the migrant experience in New York through the city’s culinary scene. By the end of the 19th century, a large percentage of the restaurants in the city were owned by foreign-born residents.

West 18th Street, Nos. 461-463, Manhattan

The New York Public Library Digital Collections. 1938.

Project summary

Data Visualization and Mapping

This project uses data visualization to explore questions of social mobility, the boundaries between personal and private life established through the kitchen, and the networks that restaurant owners built for resources and distribution, placing New York City within the national and transnational networks that the restaurants helped create.

We used ArcGIS to map demographic and geographic trends within restaurant ownership in New York City, based on data from the 1898 and 1913 Trow Business Directories. We coded individual restaurant owners’ names by nationality and gender, and created a map of restaurant addresses. Mapping and comparing this data helps us look for patterns in immigrant restaurant ownership and see how they changed over time.

OCR

Following the introduction of the commercial typewriter in the 1880s, typewriting began replacing handwriting in all forms of written communication in the United States. A typewriter is a machine that produces written documents similar to ones created by professional printers, who published books and newspapers in large quantities. A typewriter has keys with characters that leave ink markings on a blank sheet of paper. These machines produced unique documents more legibly and faster than handwriting. Until the 1980s, when personal computers began replacing typewriters, business and government documents were typically created using typewriters. Writers used typewriters to draft their manuscripts. Individuals used them in their private correspondence.

The application of Optical Character Recognition to typeset documents produced by professional printers has opened the possibility of reading typewritten documents using similar technologies. Public and private archives have an abundance of typewritten documents from the twentieth century. Being able to search these documents and to study these texts using statistical and visualization analyses has the potential to transform their use value for a wide range of scholars. Historians, economists, sociologists, and anthropologists ask questions of the twentieth century and answer them using historical records. The OCR techniques that were developed for typeset documents produced by professional publishers need to be tailored to typewritten documents. Unlike the commercially typeset documents produced by printers, typewritten documents had few standards for margins, line spacing, column width, or indentation. These irregularities create additional complications for existing OCR techniques.

The goal of this project is to experiment with optimal OCR workflows for twentieth-century typewritten documents. The test case is a set of photographed records from NYC Department of Health Business Licenses (ca. 300 images, 1917-1924). In addition to irregularities in the typing, these materials are challenging for OCR because the digital images are skewed, blurry, or low in contrast.

OCR

Preprocessing Images

For the OCR software to perform well, we need to clean up each photo into a black-and-white image that contains only the words on the page, which means removing the page boundaries and stains. Although it is easy for us to tell useful information apart from stains, it is rather complicated to do algorithmically, since images are essentially rectangular arrays of pixel values. The main tools we use are the blur, Canny, and adaptive threshold functions from the Python cv2 (OpenCV) library, together with a flood-fill algorithm that removes the outer edges topologically. We hope to obtain a clean image from a linear combination of the Canny and adaptive threshold results.
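To illustrate why an adaptive threshold suits unevenly lit photographs, here is a minimal sketch in plain Python. It is not the project's code: the function name adaptive_mean_threshold, the tiny sample grid, and the parameter values are assumptions made for illustration, and the function is a simplified stand-in for what cv2.adaptiveThreshold does with a mean neighborhood. Each pixel is compared with the average brightness of its own neighborhood rather than with one global cutoff, so a shadow across the page does not turn a whole region black.

```python
def adaptive_mean_threshold(gray, block=3, c=10):
    """Binarize a grayscale grid (list of rows of 0-255 values).

    A pixel becomes 255 (white) if it is brighter than the mean of its
    block x block neighborhood minus c, otherwise 0 (black).
    """
    rows, cols = len(gray), len(gray[0])
    half = block // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for col in range(cols):
            # Neighborhood around the pixel, clipped at the image border.
            window = [
                gray[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, col - half), min(cols, col + half + 1))
            ]
            local_mean = sum(window) / len(window)
            out[r][col] = 255 if gray[r][col] > local_mean - c else 0
    return out


# Invented sample: a page that gets darker from left to right (a shadow),
# with one dark "ink" pixel on the bright side and one on the dark side.
page = [[200 - 15 * col for col in range(8)] for _ in range(5)]
page[2][1] = 60
page[2][6] = 20

binary = adaptive_mean_threshold(page, block=3, c=10)
adaptive_black = sum(row.count(0) for row in binary)

# For comparison: a single global cutoff at 128 blackens the shaded columns.
global_black = sum(1 for row in page for value in row if value <= 128)
```

In this sample the adaptive version keeps only the two ink pixels black, while the global cutoff also blackens every pixel in the three shaded columns.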

The following is an example of an edge cleaning process:

The above example represents most of the photos in the set. However, we still occasionally encounter images in which the cleaning does not go as well as anticipated, like the following:

Though seemingly trivial, the dots in the left border section critically obstruct the OCR process and disrupt the table structure when the image is fed into the OCR software.

Here are some details:

  • Canny vs. Adaptive threshold
  • PCA approach
  • Flood Fill Algorithm

The flood fill algorithm fills a connected region of same-colored pixels on the page with a single color. It is used in the following scenario: we first blur the edges in the image and take the original letter region away to form a “bubble” around the words.
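The idea of flood fill can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the project's implementation: the function name flood_fill and the small character grid are assumptions, with "#" standing for black pixels and "." for white ones. Starting from a corner, the fill repaints every black pixel connected to the page border, which removes the dark outer edge while leaving the ink inside the page untouched.

```python
from collections import deque


def flood_fill(grid, start, new_value):
    """Repaint the 4-connected region containing `start` with `new_value`.

    `grid` is a list of lists and is modified in place.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    old_value = grid[r0][c0]
    if old_value == new_value:
        return grid
    grid[r0][c0] = new_value
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == old_value:
                grid[nr][nc] = new_value
                queue.append((nr, nc))
    return grid


# Invented sample: a dark frame around the page and two ink pixels inside it.
picture = [
    "#######",
    "#.....#",
    "#.##..#",
    "#.....#",
    "#######",
]
grid = [list(row) for row in picture]
flood_fill(grid, (0, 0), ".")
cleaned = ["".join(row) for row in grid]
```

After the fill, the frame has disappeared and only the two ink pixels in the middle row remain, because they are not connected to the border.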

Tesseract vs. ABBYY

Currently, there are two main OCR software packages: Tesseract, an open-source engine originally developed at Hewlett-Packard and later sponsored by Google, and ABBYY FineReader, developed by ABBYY. This semester, the OCR group explored and compared both of them. We found that, in general, ABBYY performs better at reading files, both in character-level accuracy and in recognizing table structures, yet Tesseract is open-source software favored by the OCR community for its flexibility in Python and C++ code.
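One common way to put a number on character-level accuracy is to compare OCR output with a hand-typed transcription using edit distance. The sketch below is an assumption about how such a comparison could be made, not a description of how the group measured accuracy; the function names and the sample strings are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr_output):
    """Share of the transcription that the OCR output reproduces correctly."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr_output) / len(truth))


# Invented sample: one misread letter in a ten-letter word.
score = character_accuracy("Restaurant", "Rcstaurant")
```

Running the same measure over the output of both tools for the same set of transcribed pages would give directly comparable figures.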

Tesseract example

ABBYY example

The main challenge lies not in the cases where OCR generally works but in the exceptions and their verification. Currently, the more important matter is to develop a better algorithm that accurately recognizes a grid or table structure. This way, a small error will not disturb other places; that is, misreading one cell will not cause an error in the alignment of other cells.
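One possible way to keep a misread or missing cell from shifting its neighbors is to place each recognized word by its position on the page rather than by reading order. The sketch below assumes that word bounding boxes are available from the OCR step and that the x position where each column starts is already known; the function name build_table, the column positions, and the sample names and years are all invented for illustration.

```python
import bisect


def build_table(words, column_starts, row_tolerance=10):
    """Arrange words into rows and columns by position.

    words: list of (text, (x0, y0, x1, y1)) bounding boxes.
    column_starts: sorted x positions where each column begins.
    Words whose vertical centers lie within row_tolerance pixels of the
    first word placed in a row are put in that row.
    """
    def y_center(word):
        _, (_, y0, _, y1) = word
        return (y0 + y1) / 2

    rows = []  # each entry: (y center of the row, list of cells)
    for word in sorted(words, key=y_center):
        text, (x0, _, _, _) = word
        if not rows or abs(y_center(word) - rows[-1][0]) > row_tolerance:
            rows.append((y_center(word), [[] for _ in column_starts]))
        # The column is the last one that starts at or before the word.
        col = max(bisect.bisect_right(column_starts, x0) - 1, 0)
        rows[-1][1][col].append((x0, text))
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in cells]
        for _, cells in rows
    ]


# Invented sample: the second row has no entry in the middle column.
words = [
    ("Bauer", (12, 10, 70, 30)), ("Restaurant", (205, 11, 300, 31)),
    ("1919", (410, 9, 450, 29)),
    ("Rossi", (10, 50, 60, 70)), ("1921", (412, 52, 452, 72)),
    ("Anna", (11, 90, 50, 110)), ("Cohen", (55, 91, 110, 111)),
    ("Bakery", (202, 90, 270, 110)), ("1923", (409, 89, 449, 109)),
]
table = build_table(words, column_starts=[0, 200, 400])
```

In the second row the middle cell stays empty and the year remains in the third column, so one missing cell does not disturb the alignment of the others.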

Tesseract 

Tesseract is an open-source OCR system that uses internal machine learning and offers flexible, customizable attributes that are essential to character recognition across a wide variety of languages, fonts, sizes, and shapes. For this project, the main goal is to develop an automated system in Tesseract into which we could feed thousands of images and receive table-like output automatically. To do this, Tesseract would either have to learn how to accurately recognize tables (or rows and columns), or we would use Tesseract simply to read each file in regular order, from left to right and top to bottom.

Unfortunately, Tesseract often misread images, as shown in the images above, reading one column and then another without understanding that their entries belong together in rows. This highlights a drawback of Tesseract: unlike ABBYY, it does not have table recognition.

One promising Tesseract feature that should not be overlooked is the option to receive the output directly in hOCR format. This format is essentially HTML, with tags that we can use to select various parts of the extracted text. It is a very useful feature that can be used and adapted in future iterations of the project.
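As a minimal sketch of how hOCR tags can be used to select parts of the text, the example below reads word-level entries with Python's standard html.parser module. In hOCR, each recognized word is a span with the class ocrx_word whose title attribute carries a bbox with pixel coordinates. The sample markup, the names in it, and the class name HocrWords are invented for illustration; real Tesseract output contains more attributes and nesting.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # The title holds properties separated by ";", e.g.
            # "bbox 10 10 80 30; x_wconf 91".
            for field in (attrs.get("title") or "").split(";"):
                parts = field.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(p) for p in parts[1:5])
                    self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


# Invented sample in the shape of hOCR output (not taken from the project).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 200'>
 <span class='ocr_line' title='bbox 10 10 450 30'>
  <span class='ocrx_word' title='bbox 10 10 80 30; x_wconf 91'>Bauer</span>
  <span class='ocrx_word' title='bbox 205 10 300 30; x_wconf 88'>Restaurant</span>
  <span class='ocrx_word' title='bbox 410 10 450 30; x_wconf 95'>1919</span>
 </span>
</div>
"""

parser = HocrWords()
parser.feed(SAMPLE_HOCR)
words = parser.words
```

Because each word arrives with its coordinates, later steps can sort words into rows and columns by position instead of relying on the order in which the OCR engine read them.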

ABBYY

ABBYY advantages and drawbacks:

ABBYY is a multinational software company that specializes in document capture and optical character recognition. Its advantage is that no coding is required: the output depends on manually drawing text boxes around the required areas. However, its main disadvantages include an inability to remove noise from the white parts of the paper, even after preprocessing. Some marks get read as text, and blurry letters get read as symbols. It also reads most of the ditto signs in some of the rows as other symbols, not recognizing them as apostrophes. The rows themselves are hard to read because they are split at random, which is one of the biggest problems in both software packages.

Future research/improvements:

The most plausible next step is font training to eliminate the symbol problems, setting the recognizer to a font that matches the input files so that they are easier to read. The eraser tool could also be used to remove the extra noise and make the images cleaner, but this has to be done manually.

Future Research

Preprocessing can only improve OCR performance up to a point unless we also modify the character recognition algorithm itself, for example through font training, a type of machine learning.

Faculty Researchers – Yun Dai and Adrian Hodge

Our role in this project involved a great deal of brainstorming and analysis of ways to approach the problem of accurately interpreting data and scans.

Through a number of iterations and experiments, we tackled the issues of accurately finding and representing rows of text data, extracting tabular data from the ABBYY tool, and handling export formats and other output data, up to using regular expressions to extract the information we needed to organize text into accurate columns.
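As a rough sketch of the regular-expression step, the example below splits one line of exported text into named columns. The layout it assumes (a name, an address, and a four-digit year, separated by runs of two or more spaces) is hypothetical, as are the pattern, the function name parse_row, and the sample line; the actual records and expressions used in the project may look different.

```python
import re

# Hypothetical layout: name, address, year, with columns separated by
# two or more spaces (single spaces are allowed inside a column).
ROW_PATTERN = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<address>.+?)\s{2,}(?P<year>\d{4})\s*$"
)


def parse_row(line):
    """Return a dict of column values, or None if the line does not fit."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


# Invented sample lines.
good = parse_row("Bauer, H.    214 E 14th St     1919")
bad = parse_row("illegible smudge ~~")
```

Lines that do not fit the pattern return None, so they can be set aside for manual checking instead of being forced into the wrong columns.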

Machine Learning

Overall, the next step is to translate what worked for the current cases into usable algorithms, and to apply the model to new images that share the same structure. We may also use ABBYY as a trainable tool in this process. We will collaborate with computer science specialists on the accurate identification, recognition and formatting of historical fonts present in a wide range of archive materials.

Data management questions

  • Storing of gathered data and files
  • Iterative record keeping and archiving
  • Collaboration tools and methods
  • Working towards common output formats
  • Online publishing, web presence and visualisation
Dining in the extremely attractive surroundings of the roof garden restaurant of the Ritz-Carlton Hotel, New York, in which dancing [is] enjoyed during the dinner hour and evening.

The New York Public Library Digital Collections. 1918

Historical Data

1898 Restaurant Ownership by Nationality

1898

Mapping the restaurants of 1898 New York City allowed us to see an overwhelming number of restaurant owners who, in their naturalization records, renounced any allegiance to “The Emperor of Germany.”

Typically, the anglicization of German immigrant names in the United States is associated with the post-WWI period. This map shows us that, at least within the restaurant industry, German immigrants did not cluster in a specific area as other immigrant groups did, such as Italians.

The juxtaposition with a map of current New York City reveals how what is known today as Little Italy was already forming in 1898. The same does not necessarily hold true for German restaurants, which were not clustered at the time and also offered a variety of services, acting as the city’s breweries and bars.

Central Park [Pleasure] Garden, Restaurants and Theodore Thomas Concerts

The New York Public Library Digital Collections. 1876

1913

1913 Restaurant Ownership by Nationality

1913

Even though we coded owners’ political background by identification with the empire or nation of origin, there was an abundance of last names of Jewish origin for which we were unable to find an exact match.

To have coded them purely as Unknown would have left out their significant presence in the city. Therefore, we double-coded these names as either “Jewish, Unknown” or “Jewish, Identified Nationality.”

If we remove the nationality, we can see the abundance of Jewish last names involved in the restaurant industry.

Even though there continues to be a clustering in Lower Manhattan, we can see that restaurants are increasingly located along the city’s subway lines.

There is also no clear indicator of a predominant nationality for the latter, as compared to those located in Lower Manhattan.

Restaurant Industry at a Glance

Interior of a Chinese restaurant, New York

The New York Public Library Digital Collections. 1907 – 1918

Research

Methods

Our project was motivated by a wish to understand the migrant experience in New York City through the role migrants took in the city’s restaurant industry. Moreover, we wanted to understand how far involvement in the restaurant industry reflected migration to urban centers at the time.

The first phase of our project consisted of mapping all restaurants reported in two Trow Business Directories: 1898 and 1913. We chose 1898 because it was the year in which New York City’s five boroughs were unified, and because of the role the 1890s played in establishing New York City as an international food hub. We chose 1913 to show the contrast over time, once a restaurant network was already established. Moreover, because 1913 was the year before the start of World War I, this data set offers two additional points of interest. First, the economy of New York City would change drastically in the coming years. Second, the largest foreign-born population identified in the restaurant industry at the time, the German population, would identify itself differently during and after World War I.

To better understand this phenomenon, we wanted to use the existing data in business directories and immigration records to provide a background story for these migrants’ experience. Therefore, we coded both the 1898 and 1913 Trow Business Directories by gender and background. We understood the challenge of coding affiliation to a previous state, whether a nation or an empire, considering that at the time concepts of identity were increasingly influenced by ideas of race, ethnicity and the formation of nation-states.

Given that most businesses were listed under their owner’s name, our approach was to search for restaurant owners’ names in Ancestry’s “Immigration and Travel” database. Because this database includes naturalization records processed in New York City from 1794 to 1940, and because naturalization records were required to include profession, we were able to match many of the restaurant owners in the business directories. Even though we could not find an exact match for all owners, Ancestry helped us relate last-name origins to specific places and gave us a better approximation for those names whose exact match we could not find.

Once we had coded these two characteristics for all records, we created two maps (1898 and 1913) to visualize the location of the restaurants and understand whether there were any connections with the origin of the migrants themselves. To visualize this data, we used ArcGIS Online, which gave us the tools to analyze the data in those terms.

The Next Stage

The first stage of our project has allowed us to make sense of the valuable primary sources we originally had access to. We hope that our work to visualize such data will make it easier for us and other researchers to understand New York as an immigrant city, in this case through its restaurants.

Our work allowed us to identify many immigration records of restaurant owners, which we hope to analyze further in order to explore questions of social mobility and life beyond the restaurants for New York’s restaurant keepers. Moreover, we understand that while immigration records can tell us a lot about New York residents who immigrated during the 19th century, they do not let us find the owners who identified as born in the United States. Therefore, we also hope to use New York’s census records to recover the silences at our current stage.

Resources

Source: 1898 Trow Business Directory
  • Sections for restaurants, delicatessens
  • Restaurant owners’ names and addresses
  • Coding names by (assumed) nationality and gender
ArcGIS for mapping
  • Data visualization – looking for trends in spatial distribution of restaurants
Other Resources

We would like to share the following resources which were helpful for us in our research process.

Team

Zehra Abacioglu

Zehra is a Computer Science major at the Tandon School of Engineering. She is interested in data sensemaking and data visualization. In her spare time she can be found cafe-hopping around the city.

Jessica Molina Abdala

Jessica is a junior at NYU Abu Dhabi, majoring in History. She is originally from Mexico City, where she discovered her passion for the environment. Her research interests lie in the relationship between media, politics and social movements, with particular emphasis on the role of transnational connections. In her free time, Jessica enjoys reading, playing baseball, going to the cinema and playing piano.

Yun Dai

I like to call myself the data services person at the Library of NYU Shanghai. I support research involving data, broadly speaking, for a variety of projects.

http://shanghai.hosting.nyu.edu/data/

Adrian Hodge

As the head of the research and instructional technologies group, I work with the Library and other technologists in support of a wide range of projects. These include platforms, emerging tech, digital mapping, and, in this project, consulting on programmatic OCR.

Katherine Platz

Katherine is a senior studying History in the College of Arts and Science. She is originally from California, and loves to read and drink boba in her spare time. She is excited to learn how to use digital tools for data visualization, storytelling, and archival preservation.

Shambhavi Sengupta

Shambhavi is an Integrated Digital Media sophomore at the Tandon School of Engineering at NYU. Originally from India, she started off in chemistry but later developed an interest in digital media and presentation. Most of her projects have involved coding, graphics, video editing, and recently, machine learning. Her hobbies include listening to music, creative writing and riding bicycles around the city.

Leo Zhang

Leo is currently studying at Tandon, double majoring in physics and mathematics and minoring in computer science. He enjoys building mathematical models to express visual effects.
