New York Restaurant Keepers
This project explores the migrant experience in New York through the city’s culinary scene. By the end of the 19th century, a large percentage of the restaurants in the city were owned by foreign-born residents.

West 18th Street, Nos. 461-463, Manhattan
The New York Public Library Digital Collections. 1938.
Project summary
Data Visualization and Mapping
This project uses data visualization to explore questions of social mobility, the boundaries between the personal and the private established through the kitchen, and the networks restaurant owners built for resources and distribution, situating New York City within the national and transnational networks that the restaurants helped create.
We used ArcGIS to map demographic and geographic trends within restaurant ownership in New York City, based on data from the 1898 and 1913 Trow Business Directories. We coded individual restaurant owners’ names by nationality and gender, and created a map of restaurant addresses. Mapping and comparing this data helps us look for patterns in immigrant restaurant ownership, and how these changed with time.
OCR
Following the introduction of the commercial typewriter in the 1880s, typewriting began replacing handwriting in all forms of written communication in the United States. A typewriter is a machine that produces written documents similar to those created by professional printers, who published books and newspapers in large quantities. Its keys carry characters that leave ink markings on a blank sheet of paper. These machines produced unique documents more legibly and faster than handwriting. Until the 1980s, when personal computers began replacing typewriters, business and government documents were typically created on typewriters. Writers used them to draft their manuscripts, and individuals used them in their private correspondence.
The application of Optical Character Recognition (OCR) to typeset documents produced by professional printers has opened the possibility of reading typewritten documents with similar technologies. Public and private archives hold an abundance of typewritten documents from the twentieth century. Being able to search these documents and to study them using statistical analysis and visualization could transform their value to a wide range of scholars. Historians, economists, sociologists, and anthropologists ask questions of the twentieth century and answer them using historical records. The OCR techniques developed for typeset documents need to be tailored to typewritten documents. Unlike commercially typeset documents, typewritten documents had few standards for margins, line spacing, column width, or indentation. These irregularities create additional complications for existing OCR techniques.
The goal of this project is to experiment with OCR workflows for twentieth-century typewritten documents. The test case is a set of photographed records from the NYC Department of Health Business Licenses (ca. 300 images, 1917-1924). In addition to irregularities in the typing, these materials are challenging for OCR because the digital images are skewed, blurry, or low in contrast.

OCR
Preprocessing Images
For OCR software to perform well, we need to turn each photo into a black-and-white image that contains only the words on the page, which means cleaning up the page boundaries and stains. Although it is easy for a person to tell useful information from stains, doing so algorithmically is complicated, since an image is essentially a rectangular grid of pixel values. The main tools we use are the blur, Canny edge detection, and adaptive threshold functions from the Python cv2 (OpenCV) library, together with a flood-fill algorithm that removes the outer edges topologically. We hope to obtain a clean image from a linear combination of the Canny and adaptive threshold results.
The following is an example of an edge cleaning process:
Though seemingly trivial, the dots in the left border section seriously obstruct the OCR process and disrupt the table structure when the image is fed into the OCR software.
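To illustrate why an adaptive threshold suits unevenly lit photographs, here is a minimal pure-Python sketch of mean-based adaptive thresholding on a tiny grayscale grid. It is not the project's code and does not call cv2; the function names, the window size `block`, the offset `c`, and the sample pixel values are all assumptions made for illustration. Each pixel is compared with the mean of its own neighborhood rather than with one global cutoff.

```python
def global_threshold(gray, cutoff=128):
    """Binarize with one fixed cutoff for the whole image."""
    return [[255 if v > cutoff else 0 for v in row] for row in gray]


def adaptive_threshold(gray, block=3, c=10):
    """Binarize by comparing each pixel with its local mean minus c.

    gray  : list of rows, each a list of ints in 0..255
    block : odd window size (hypothetical default)
    c     : offset subtracted from the local mean (hypothetical default)
    """
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # The window is clipped at the image border (a simplification).
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(vals) / len(vals)
            out[y][x] = 255 if gray[y][x] > local_mean - c else 0
    return out


# Made-up page: the background fades from 220 to 110 left to right,
# with two darker "ink" strokes at columns 3 and 8.
row = [220, 210, 200, 110, 180, 170, 160, 150, 60, 130, 120, 110]
page = [row[:] for _ in range(5)]

adaptive_row = adaptive_threshold(page)[2]
global_row = global_threshold(page)[2]
```

On this made-up page the global cutoff turns the dim background at the right edge black along with the ink, while the adaptive version keeps the background white and only the two strokes black.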
Here are some details:
- Canny vs. Adaptive threshold
- PCA approach
- Flood Fill Algorithm
The flood fill algorithm recolors a connected region of same-colored pixels on the page with a single new color. We use it in the following scenario: we first blur the edges in the image and take the original letter region away to form a “bubble” around the words.
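The idea of flood fill can be sketched in a few lines of plain Python. This is a generic breadth-first version using 4-connectivity, not the project's implementation; the function name, the grid encoding (1 for dark pixels, 0 for background), and the sample page are assumptions for illustration. Starting the fill at a dark pixel on the page edge erases the dark frame connected to it while leaving the isolated "letter" pixels inside untouched.

```python
from collections import deque


def flood_fill(grid, start, new_value):
    """Return a copy of grid with the region connected to start recolored."""
    rows, cols = len(grid), len(grid[0])
    sr, sc = start
    old_value = grid[sr][sc]
    filled = [row[:] for row in grid]
    if old_value == new_value:
        return filled  # nothing to do

    queue = deque([(sr, sc)])
    filled[sr][sc] = new_value
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors (up, down, left, right).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and filled[nr][nc] == old_value:
                filled[nr][nc] = new_value
                queue.append((nr, nc))
    return filled


# Made-up binary page: a dark frame (1) around a short dark "word".
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

# Fill from the top-left corner with background to remove the frame.
cleaned = flood_fill(page, (0, 0), 0)
```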
Tesseract vs. ABBYY
Currently, there are two main OCR tools: Tesseract, an open-source engine originally developed at Hewlett-Packard and later sponsored by Google, and ABBYY FineReader, developed by ABBYY. This semester, the OCR group explored and compared the two. In general, ABBYY reads files better, both in character-level accuracy and in recognizing table structures, yet Tesseract is open-source software favored by the OCR community because it can be driven flexibly from Python and C++ code.
Tesseract example
ABBYY example
Tesseract
Unfortunately, Tesseract often misread images, as shown above: it would read down one column and then down the next without recognizing that the entries belong together in rows. This points to a drawback of Tesseract: unlike ABBYY, it does not have table recognition.
One promising feature of Tesseract that should not be overlooked is that it can output results directly in hOCR format. This format is essentially HTML, with tags that we can use to select various parts of the extracted text. This is a useful feature that can be explored and adjusted in future iterations of the project.
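As one possible use of hOCR, the sketch below reads word boxes out of an hOCR string with Python's standard `html.parser` and regroups them into rows by vertical position. It relies on the `ocrx_word` class and the `bbox` entry in the `title` attribute that hOCR uses for words; everything else (the class and function names, the `tolerance` value, and the invented sample markup) is an assumption for illustration, not the project's code.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            # The title looks like "bbox 10 10 60 30; x_wconf 91".
            for part in (attrs.get("title") or "").split(";"):
                fields = part.split()
                if fields and fields[0] == "bbox":
                    self._bbox = tuple(int(v) for v in fields[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def parse_hocr(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def group_rows(words, tolerance=10):
    """Group words whose vertical centers are close, then sort left to right."""
    rows = []
    for text, bbox in sorted(words, key=lambda w: (w[1][1] + w[1][3]) / 2):
        mid = (bbox[1] + bbox[3]) / 2
        if rows and abs(mid - rows[-1]["mid"]) <= tolerance:
            rows[-1]["items"].append((text, bbox))
        else:
            rows.append({"mid": mid, "items": [(text, bbox)]})
    return [[t for t, _ in sorted(r["items"], key=lambda w: w[1][0])] for r in rows]


# Invented sample in which the OCR read the left column before the right one.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 400 200'>
 <span class='ocr_line' title='bbox 10 10 70 70'>
  <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 91'>Name</span>
  <span class='ocrx_word' title='bbox 10 50 70 70; x_wconf 88'>Smith</span>
 </span>
 <span class='ocr_line' title='bbox 200 12 290 71'>
  <span class='ocrx_word' title='bbox 200 12 280 31; x_wconf 90'>Address</span>
  <span class='ocrx_word' title='bbox 200 52 290 71; x_wconf 85'>Broadway</span>
 </span>
</div>
"""

words = parse_hocr(SAMPLE)
table = group_rows(words)
```

Because the word boxes carry page coordinates, the reading order Tesseract chose matters less: the rows can be rebuilt from geometry, provided the scan is not too skewed for a fixed tolerance.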
ABBYY
ABBYY advantages and drawbacks:
ABBYY is a multinational software company that specializes in document capture and optical character recognition. Its advantage is that no coding is required: the output depends on manually drawing text boxes around the required areas. Its main disadvantages include its inability to remove noise from the white parts of the paper, even after preprocessing. Some marks get read as text, and blurry letters get read as symbols. It also reads most of the ditto signs in some rows as other symbols rather than as apostrophes. The rows themselves are hard to read because they are split unpredictably, which is one of the biggest problems in both tools.
Future research/improvements:
The next plausible step would be font training to address the symbol problems, setting the recognizer to a font that matches the input files so that they are easier to read. Using the eraser tool to remove the extra noise would also make the images cleaner, but it has to be done manually.
Future Research
Preprocessing can only improve OCR performance up to a point unless we also modify the character recognition algorithm itself, e.g. through font training, a type of machine learning.
Faculty Researchers – Yun Dai and Adrian Hodge
Our role in this project involves a lot of brainstorming and analysis around ways to approach the problem of accurately interpreting data and scans.
Through a number of iterations and experiments we tackled the issues of accurately finding and representing rows of text data, extracting tabular data from the ABBYY tool, and handling export formats and other output data, up to using regular expressions to extract the information we needed to organize text into accurate columns.
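A regular-expression step of this kind might look like the following minimal sketch. The layout it assumes (a record number, a name, and an address separated by runs of two or more spaces) is hypothetical, as are the pattern, the field names, and the sample lines; the actual license records may be laid out differently.

```python
import re

# Hypothetical layout: number, name, address, separated by 2+ spaces.
ROW = re.compile(r"^(?P<number>\d+)\s{2,}(?P<name>.+?)\s{2,}(?P<address>.+)$")


def parse_row(line):
    """Return a dict of columns, or None if the line does not fit the layout."""
    match = ROW.match(line.strip())
    return match.groupdict() if match else None


# Invented OCR output lines for illustration.
good = parse_row("  1042   Example Name    12 Sample St ")
bad = parse_row("garbled ~~ line")
```

Returning `None` for lines that do not fit, rather than guessing, makes it easy to collect the failures and review them by hand.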
Machine Learning
Overall, the next step is to translate what worked for the current cases into usable algorithms and to apply the model to new images that share the same structure. We may also use ABBYY as a trainable tool in this process. We will collaborate with computer science specialists on the accurate identification, recognition, and formatting of historical fonts present in a wide range of archival materials.
Data management questions
- Storing of gathered data and files
- Iterative record keeping and archiving
- Collaboration tools and methods
- Working towards common output formats
- Online publishing, web presence and visualisation
Dining in the extremely attractive surroundings of the roof garden restaurant of the Ritz-Carlton Hotel, New York, in which dancing [is] enjoyed during the dinner hour and evening.
The New York Public Library Digital Collections. 1918

Historical Data
1898 Restaurant Ownership by Nationality
1898
Mapping the restaurants of 1898 New York City revealed a predominance of restaurant owners whose naturalization records renounced any allegiance to “The Emperor of Germany.”
Typically, the anglicization of German immigrant names in the United States is associated with the post-WWI period. This map shows us that, at least within the restaurant industry, German immigrants did not cluster in a specific area as other immigrant groups, such as Italians, did.
1913
The juxtaposition with a map of current New York City reveals that what is known today as Little Italy was already forming in 1898. The same does not necessarily hold true for German restaurants, which were not clustered at the time and offered a variety of services, also acting as the city’s breweries and bars.
Central Park [Pleasure] Garden, Restaurants and Theodore Thomas Concerts
The New York Public Library Digital Collections. 1876

1913
1913 Restaurant Ownership by Nationality
1913
Even though we coded owners’ political background by identification with the empire or nation of origin, there was an abundance of last names of Jewish origin for which we were unable to find an exact match.
To have coded them purely as Unknown would have left out their significant presence in the city. Therefore, we double-coded these names: Jewish, Unknown – or Jewish, Identified Nationality.
If we set nationality aside, we can see the abundance of Jewish last names involved in the restaurant industry.
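One way to represent this double-coding is to keep nationality and the surname code as two separate fields, so that either can be counted on its own. The sketch below is a hypothetical illustration with placeholder records; the field names and values are assumptions and do not come from the directories.

```python
from collections import Counter

# Placeholder records: each owner has a nationality code and a separate flag.
owners = [
    {"name": "Owner A", "nationality": "Unknown", "jewish_surname": True},
    {"name": "Owner B", "nationality": "Germany", "jewish_surname": True},
    {"name": "Owner C", "nationality": "Germany", "jewish_surname": False},
    {"name": "Owner D", "nationality": "Italy", "jewish_surname": False},
]

# Count by nationality alone.
by_nationality = Counter(o["nationality"] for o in owners)

# Set nationality aside and count the surname code on its own.
jewish_total = sum(1 for o in owners if o["jewish_surname"])
```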
Interior of a Chinese restaurant, New York
The New York Public Library Digital Collections. 1907 – 1918

Image Gallery





Research
Methods
Our project was motivated by trying to understand the migrant experience in New York City through the role migrants took in the city’s restaurant industry. Moreover, we wanted to understand how far involvement in the restaurant industry reflected migration to urban centers at the time.
The first phase of our project consisted of mapping all restaurants reported in two Trow Business Directories: 1898 and 1913. We chose 1898 because it was the year New York City’s five boroughs were unified, and because of the role the 1890s played in establishing New York City as an international food hub. We chose 1913 to show the contrast over time, once a restaurant network was already established. Moreover, as the year before the start of World War I, 1913 makes two additional observations possible. The first is that the economy of New York City would change drastically in the coming years. The second is that the largest foreign-born population identified in the restaurant industry at the time, the German population, would identify itself differently during and after World War I.
To better understand this phenomenon, we wanted to use the data in business directories and immigration records to provide a background story to these migrants’ experience. We therefore coded both the 1898 and 1913 Trow Business Directories by gender and background. We recognized the challenge of coding affiliation to a previous state, whether a nation or an empire, considering that at the time concepts of identity were increasingly influenced by ideas of race, ethnicity, and the formation of nation-states.
Given that most businesses were listed under their owner’s name, our approach was to search for restaurant owners’ names in Ancestry’s “Immigration and Travel” database. Because this database includes naturalization records processed in New York City from 1794 to 1940, and because naturalization records were required to include profession, we were able to match many of the restaurant owners in the business directories. Even though we could not find an exact match for all owners, Ancestry helped us relate last-name origins to specific places and gave us a better approximation for those names whose exact match we could not find.
Once we had coded these two characteristics for all records, we created two maps (1898 and 1913) to visualize the location of the restaurants and to understand whether there were any connections with the origins of the migrants themselves. To visualize this data we used ArcGIS Online, which gave us the tools to analyze it in those terms.
The Next Stage
The first stage of our project has allowed us to make sense of the valuable primary sources we originally had access to. We hope that our work to visualize such data will make it easier for us and other researchers to understand New York as an immigrant city, in this case through its restaurants.
Our work allowed us to identify many immigration records of restaurant owners, which we hope to analyze further to explore questions of social mobility and life beyond the restaurants for New York’s restaurant keepers. Moreover, we understand that while immigration records can tell us a lot about New York residents who immigrated during the 19th century, they will not contain the names of owners who identified as born in the United States. We therefore also hope to use New York’s census records to recover the silences at our current stage.
Resources
Source: 1898 Trow Business Directory
- Sections for restaurants, delicatessens
- Restaurant owners’ names and addresses
- Coding names by (assumed) nationality and gender
ArcGIS for mapping
- Data visualization – looking for trends in spatial distribution of restaurants
Other Resources
We would like to share the following resources which were helpful for us in our research process.
- Santlofer, Joy. Food City: Four Centuries of Food-Making in New York. W. W. Norton & Company, 2016.
- The New York Public Library’s Culinary History
- Ancestry.com’s Immigration and Travel Database

Team
Zehra Abacioglu
Zehra is a Computer Science major at the Tandon School of Engineering. She is interested in data sensemaking and data visualization. In her spare time she can be found cafe-hopping around the city.
Jessica Molina Abdala
Jessica is a junior at NYU Abu Dhabi, majoring in History. She is originally from Mexico City, where she discovered her passion for the environment. Her research interests lie in the relationship between media, politics and social movements – particularly emphasizing the role of transnational connections. In her free time, Jessica enjoys reading, playing baseball, going to the cinema and playing piano.
Yun Dai
I like to call myself the data services person at the Library of NYU Shanghai. I support research involving data, broadly speaking, for a variety of projects.
Adrian Hodge
Katherine Platz
Katherine is a senior studying History in the College of Arts and Science. She is originally from California, and loves to read and drink boba in her spare time. She is excited to learn how to use digital tools for data visualization, storytelling, and archival preservation.
Shambhavi Sengupta
Leo Zhang
