Mining DOHMH Restaurants Inspection Records

Detect Risky Violations Correlated to Food Poisoning Incidents

1. Introduction

New York City has over 24,000 restaurants and food retailers that constantly welcome customers from all over the world. Meanwhile, over 6,000 people are hospitalized for foodborne illness in New York City every year [2]. Although the exact proportion of foodborne illness caused by restaurants is unknown, consumption of food prepared outside the home has been linked to an increased risk of sporadic foodborne disease [1]. Food safety in restaurants is therefore a problem that cannot be neglected. In an effort to improve food safety and the transparency of inspections, the Department of Health and Mental Hygiene (DOHMH) in New York launched a letter-grade system in 2010 to indicate how restaurants perform on health and environmental criteria.
DOHMH’s inspection results for each restaurant are open to the public, so residents of the city are promptly informed about the condition of the restaurants. Although the existing system has urged restaurants and food retailers to improve, food poisoning incidents still happen, even at food retailers that received A grades. For example, Bella Blu, an Italian restaurant on the Upper East Side, was subject to a food poisoning incident in mid-January [3] after it had received an A grade during an inspection in late October of the previous year. Hence, we believed that the existing grading mechanism still left room for improvement in controlling food health and public safety, and that some violations in the current grading system might need more emphasis during guidance and inspection.

2. Objective and Goal

In this project, we wanted to find violations that have been underestimated but carry a high risk of causing foodborne illness, by linking and mining the 311 food poisoning reports and the DOHMH restaurant inspection records. Specifically, by cross-referencing the restaurants named in the 311 food poisoning reports with the DOHMH inspection results, we would generate a list of restaurants (our study area) that have food poisoning reports against them. Based on that list, we would identify the most common violations and, ideally, discover patterns in the violations committed by these restaurants according to their inspection results. The final product would be a list of risky violations that deserve more attention.
For example, if we observe that violation A frequently appears in the inspection records of the restaurants on our list, it might imply that A is highly correlated with foodborne illness. In addition, if violation A is non-critical but repeatedly occurs together with violations C, D and E, this would be evidence that several critical or non-critical violations tend to accompany violation A, and that food inspectors and administrators should look into it and provide more guidance on it. DOHMH might also consider moving such a violation into the critical category. Such actions could improve the grading mechanism by categorizing violations more reasonably and could better prevent food poisoning arising from the food production process.

3. Data Cleaning and Transformation

The DOHMH restaurant inspection data and the 311 food poisoning data were the two major datasets used in this project. Since the full 311 dataset on the NYC OpenData website was too big, we first narrowed down its feature space by keeping only the attributes needed for the analysis, such as the date of occurrence, type of event, number of people involved, restaurant name and location, in order to reduce the size of the data to be processed. This was done using the filter function on the NYC OpenData website.
For the DOHMH restaurant inspection data, we used the same process and kept the restaurant name, cuisine type, address, zip code, borough, violation codes, score, letter grade and inspection date.
Furthermore, entries that lacked a score or a letter grade were removed from the dataset.
Since we wanted to cross-reference the restaurants in both datasets by their geocoded locations, we preprocessed the street names in Python. Specifically, we expanded the abbreviations “ST”, “AV” and “RD” to “STREET”, “AVENUE” and “ROAD”, and removed hyphens from the words.
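The street-name preprocessing described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `normalize_street` is hypothetical, and only the three abbreviations named in the text are expanded.

```python
import re

# Abbreviation map taken from the text; a real run would likely need more entries.
ABBREVIATIONS = {"ST": "STREET", "AV": "AVENUE", "RD": "ROAD"}


def normalize_street(name: str) -> str:
    """Upper-case a street name, drop hyphens, and expand abbreviations."""
    cleaned = name.upper().replace("-", "")
    # Treat anything that is not a letter or digit as a token separator.
    tokens = re.split(r"[^A-Z0-9]+", cleaned)
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens if token]
    return " ".join(expanded)


print(normalize_street("37-11 35 Av"))
print(normalize_street("Boston Rd"))
```

One caveat of token-level replacement is that “ST” is also expanded when it stands for “SAINT”, so such names would need separate handling.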

4. Analytical Methods

4.1 Find the validated food poisoning incidents in 311 data:

The first step of our analysis was to find the food poisoning incidents in the 311 dataset, which we did by filtering on a few of its features. The complaint type and descriptor were first limited to “food poisoning” and “3 or more”, respectively, since we only wanted to investigate foodborne illness incidents caused during the food preparation process and to exclude unrelated or isolated reports such as food allergies. To make sure that each report had been confirmed by DOHMH as a case of food poisoning, the data was then narrowed down so that the resolution column only contained one of the following categories:

  • The DOHMH has investigated the complaint and issued a Notice of Violation. The agency will continue to monitor.
  • The Department of Health and Mental Hygiene has investigated the complaint. Owner/manager ordered to abate nuisance. Violations of rules and regulations issued and fines may be levied as appropriate.
  • The Department of Health and Mental Hygiene has investigated the complaint. Violations of rules and regulations issued resulting in the closure of the establishment by DOHMH. The establishment will be re-opened if the violations are corrected.
  • The facility you reported will receive a sanitary inspection.
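The filtering in this step can be sketched with pandas as below. The column names (`Complaint Type`, `Descriptor`, `Resolution Description`) and the function name `validated_incidents` are assumptions for illustration and should be checked against the actual 311 export; the resolution texts are matched by their opening words as listed above.

```python
import pandas as pd

# Assumed column names; verify them against the downloaded 311 file.
COMPLAINT_COL = "Complaint Type"
DESCRIPTOR_COL = "Descriptor"
RESOLUTION_COL = "Resolution Description"

# Opening words of the four accepted resolution texts listed above.
ACCEPTED_PREFIXES = (
    "The DOHMH has investigated the complaint and issued a Notice of Violation",
    "The Department of Health and Mental Hygiene has investigated the complaint. "
    "Owner/manager ordered to abate nuisance",
    "The Department of Health and Mental Hygiene has investigated the complaint. "
    "Violations of rules and regulations issued resulting in the closure",
    "The facility you reported will receive a sanitary inspection",
)


def validated_incidents(df: pd.DataFrame) -> pd.DataFrame:
    """Keep group food poisoning complaints with an accepted resolution."""
    is_food_poisoning = df[COMPLAINT_COL].str.strip().str.lower() == "food poisoning"
    is_group = df[DESCRIPTOR_COL].str.strip().str.lower() == "3 or more"
    # Missing resolutions become empty strings and therefore never match.
    is_confirmed = df[RESOLUTION_COL].fillna("").map(
        lambda text: text.startswith(ACCEPTED_PREFIXES)
    )
    return df[is_food_poisoning & is_group & is_confirmed]
```

Matching on prefixes rather than full strings keeps the filter robust to small differences in trailing punctuation or whitespace.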

4.2 Cross-reference the restaurants in both datasets:

The previous step produced a list of validated food poisoning reports, which we wanted to link with the restaurant inspection data. This was achieved by merging the two datasets on their geographical information, namely the address and the coordinates (longitude, latitude). Although the DOHMH restaurant inspection data did not provide readily usable location information, we found a table containing geocoded locations for the restaurants in it.
The restaurants and the places where food poisoning complaints occurred cannot match perfectly, because the longitude and latitude of 311 complaints carry random noise around the exact location. We therefore developed our own algorithm to cross-reference locations in the two datasets by checking whether the restaurant’s address agrees with the complaint’s address and, at the same time, validating the closeness of the two sets of coordinates. Finally, using the unique CAMIS code of each restaurant identified, we extracted a subset of the DOHMH restaurant inspection data, which was the dataset used in the next step. (We refer to this dataset as the study data in this report.)
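The report does not spell out the matching algorithm, so the following is only one plausible sketch of the idea: a complaint matches a restaurant when their normalized addresses are equal and their coordinates lie within a tolerance. The function names, the record layout (dicts with `address`, `lat`, `lon`, `camis` keys) and the 100 m threshold are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
MAX_DISTANCE_M = 100        # hypothetical tolerance for noise in 311 coordinates


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def match_complaints(complaints, restaurants, max_distance_m=MAX_DISTANCE_M):
    """Return CAMIS codes of restaurants matched by address AND proximity."""
    # Index restaurants by address so each complaint only checks candidates
    # that already agree on the (normalized) address.
    by_address = {}
    for restaurant in restaurants:
        by_address.setdefault(restaurant["address"], []).append(restaurant)

    matched = set()
    for complaint in complaints:
        for restaurant in by_address.get(complaint["address"], []):
            distance = haversine_m(
                complaint["lat"], complaint["lon"],
                restaurant["lat"], restaurant["lon"],
            )
            if distance <= max_distance_m:
                matched.add(restaurant["camis"])
    return matched
```

Requiring both conditions guards against identical street addresses in different boroughs, while the distance tolerance absorbs the noise in the complaint coordinates.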

4.3 Analyze and mine item sets in the study data:

A descriptive data analysis was conducted on the dataset, such as checking the size of the data and finding the most frequent violations. The violation codes were then grouped by CAMIS code and inspection date and turned into a nested list, using Python pandas, for pattern-finding purposes.
In order to discover patterns in the violations committed by these restaurants, we applied association rule learning to the nested list, since this technique is good at identifying strong rules in databases using measures of interestingness. For the algorithm, we chose RElim simply because it computationally outperformed the SaM and FP-growth algorithms and could be conveniently used from Python.
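A minimal sketch of the counting and grouping step, assuming hypothetical column names (`CAMIS`, `INSPECTION DATE`, `VIOLATION CODE`) and a tiny made-up table in place of the study data:

```python
import pandas as pd

# Toy stand-in for the study data; the column names are assumptions.
inspections = pd.DataFrame({
    "CAMIS": [1, 1, 1, 2, 2, 2],
    "INSPECTION DATE": ["2015-01-05", "2015-01-05", "2015-06-01",
                        "2015-02-10", "2015-02-10", "2015-02-10"],
    "VIOLATION CODE": ["10F", "02G", "10F", "08A", "10F", "04L"],
})

# Frequency of each violation code across the whole table.
code_counts = inspections["VIOLATION CODE"].value_counts()

# One transaction per (restaurant, inspection date): the codes cited together.
transactions = (
    inspections.groupby(["CAMIS", "INSPECTION DATE"])["VIOLATION CODE"]
    .apply(lambda codes: sorted(set(codes)))
    .tolist()
)

print(code_counts)
print(transactions)
```

Each inner list then plays the role of one "basket" of violations for the pattern mining that follows.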
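To make the notion of rule confidence concrete, the sketch below computes rules of the form A ==> B between single violation codes by brute-force counting. It is not the RElim algorithm and not the pymining API; it only illustrates the measure, confidence(A ==> B) = (inspections citing both A and B) / (inspections citing A). The function name and the `min_support` default are assumptions; the 0.67 confidence threshold echoes the value mentioned in Section 6.

```python
from collections import Counter
from itertools import permutations


def pairwise_rules(transactions, min_support=2, min_confidence=0.67):
    """Return (antecedent, consequent, confidence) for single-item rules."""
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        item_counts.update(items)
        # Ordered pairs, so (A, B) and (B, A) are counted as separate rules.
        pair_counts.update(permutations(sorted(items), 2))

    rules = []
    for (antecedent, consequent), together in pair_counts.items():
        confidence = together / item_counts[antecedent]
        if together >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, round(confidence, 2)))
    # Strongest rules first, ties broken alphabetically.
    return sorted(rules, key=lambda rule: (-rule[2], rule[0], rule[1]))


# Made-up transactions, one list of violation codes per inspection.
example = [["02G", "10F"], ["02G", "10F"], ["10F"], ["10F", "08A"], ["02G"]]
print(pairwise_rules(example, min_support=2, min_confidence=0.6))
```

Note that confidence is directional: in the example, 02G ==> 10F holds in two of the three inspections citing 02G, while 10F ==> 02G holds in only two of the four inspections citing 10F.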

5. Ethical and Privacy Considerations

Overall, our project involved little personal information and raised few ethical issues. The identities of the restaurants with food poisoning reports were already publicly available. The reporters’ identities, on the other hand, were well protected by the NYC websites, since the dataset did not contain any information about them.
However, while matching the locations from the 311 food poisoning data to the DOHMH restaurant inspection data, we noticed that some reported locations were not quite accurate. Since there was no restaurant name attribute in the 311 reports, restaurants might be mismatched, and reporters might also have provided wrong information about where the incident occurred. These issues could unfairly affect some restaurants’ reputations.
The resulting list of restaurants might also give the impression of discriminating against certain types of cuisine in the city. For example, the food processing of some cuisines is distinctive yet counts as a violation under the inspection criteria. DOHMH should adjust its grading criteria to accommodate the special food processing of certain types of cuisine.

6. Data Outputs and Results

After the cross-referencing, we identified 513 restaurants that had both at least one food poisoning complaint in the 311 data and an inspection history in the DOHMH restaurant inspection records. By counting the violations, we found general violation 10F, “Equipment not easily movable or sealed to floor, adjoining equipment, adjacent walls or ceiling. Aisle or workspace inadequate.”, to be the most common (1752 occurrences), followed by general violation 08A, “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.” (1363 occurrences), and critical violation 02G, “Cold food item held above 41 °F (smoked fish and reduced oxygen packaged foods above 38 °F) except during necessary preparation” (1207 occurrences). This result indicated that although violations 10F and 08A were non-critical, they should be taken seriously, since they were closely correlated with food poisoning risk. According to the analysis, the violations most closely correlated with foodborne illness were 10F, 08A, 02G, 04L, 10B and 06D.


Figure 1. The most frequent violation codes in the study data

Beyond that, association rule learning (ARL) revealed a pattern involving violation 10F: several critical and general violations tended to occur along with it, namely:

  • 02G (Critical): cold storage violation, “Cold food item held above 41 °F (smoked fish and reduced oxygen packaged foods above 38 °F) except during necessary preparation”.
  • 04L (Critical): “Evidence of rats or live rats present in facility’s food and/or non-food areas”.
  • 06C (Critical): “Food not protected from potential source of contamination during storage, preparation, transportation, display or service”.
  • 08G (General): “Facility not vermin proof. Harborage or conditions conducive to vermin exist”.

No.  Confidence  Rule          Category
1    0.91        02G ==> 10F   Critical
2    0.82        04L ==> 10F   Critical
3    0.72        06C ==> 10F   Critical
4    0.71        08G ==> 10F   General

Table 1. Result of association rule learning

According to the ARL results, the confidence of every rule in Table 1 is above 0.7 (about 0.79 on average), which means that in more than 70% of the inspections that cited one of the violations above, violation 10F was cited as well. This result indicated that 10F, though non-critical, frequently accompanies critical violations and might trigger them. Based on this result, we suggested that this violation be included in the critical category because of its recurrent nature and because it might be the root cause of many other critical violations. We also suggested that DOHMH administrators place more emphasis on guidance to correct general violation 10F.
ARL found other, similar patterns with a confidence above 0.67. Nevertheless, since the violations involved in those patterns were all critical ones, we did not address them in this report.

7. Reproducibility

This project is reproducible: part of the data and all scripts used for data treatment are available in the project repository on GitHub. However, several factors might cause the final result to differ.
First, due to GitHub’s file size limit, the 311 reports and the DOHMH restaurant inspection data were not uploaded to the project repository. Although they are accessible on the NYC OpenData website, the data there will differ from the data used in this project because it is updated in real time. To retrieve the same datasets as we used, filter both datasets by created date or inspection date, respectively, to the period from July 2010 to December 2015.
Second, changing the parameters of the geocoding algorithm we used might affect the performance of the cross-referencing process, which would also change the result. A different geocoding method, such as the GeoClient API by DoITT, could also lead to a different result.
Third, we used the pymining library to conduct ARL. Using a newer version of the library, or other libraries, could affect the ARL result. Nonetheless, this should only cause minor differences in the final result.
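As a rough sketch of the date restriction, assuming the downloaded tables are loaded with pandas (the function name and the column names in the usage note are hypothetical):

```python
import pandas as pd

START = pd.Timestamp("2010-07-01")
END = pd.Timestamp("2016-01-01")  # exclusive bound, so all of December 2015 is kept


def restrict_to_study_period(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Keep rows whose date falls between July 2010 and December 2015."""
    # Unparseable dates become NaT and are dropped by the comparisons below.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return df[(dates >= START) & (dates < END)]
```

It would be called once per dataset, for example with a "Created Date" column for the 311 reports and an "INSPECTION DATE" column for the inspection records, if those are the column names in the downloaded files.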

8. Suggested Next Steps

Our analysis provided a list of violation codes that deserve more attention, but we did not find a method to differentiate the risk of those violations. As a next step, we suggest applying more item set mining techniques, such as frequent item set mining, to the dataset. After such an analysis, one might consider constructing a scientific metric to quantify the risk of each violation code and assigning an appropriate weight to it. Finally, DOHMH could use such weights as a reference to improve and refine its grading mechanism, using a weighting algorithm rather than just the three categories of violations in the current grading system [4].
We also suggest investigating the relationship between violation codes and time, so that the weight of each violation could vary as a function of time according to the risk of that violation during the inspection period.


  1. Angulo, F. J., & Jones, T. F. (2006). Eating in restaurants: a risk factor for foodborne disease? Clinical Infectious Diseases, 43(10), 1324-1328.
  2. Community Health Survey. (2011). New York City Department of Health and Mental Hygiene. New York, NY.
  3. Mongelli, L., & Eustachewich, L. (2016, January 19). Birthday girl: Eatery’s meatballs left me ‘sick as a dog’. Retrieved March 14, 2016, from
  4. NYC Health Department. (2010, December). What to Expect When You’re Inspected: A Guide for Food Service Operators. Retrieved March, 2016, from