The dataset was obtained from the online Yelp Dataset Challenge. It consists of five parts, which provide 566,000 pieces of basic business information (e.g., hours, address, ambience), 2.2 million customer reviews, and 519,000 tips from 552,000 users. The total size of the data is about 2.39 GB. For this project, we used the customer review data and the business attribute data, both of which are in JSON format. Attributes in the business data include business id, full address, price range, business categories, etc., while attributes in the review data include review content, rating, business id, etc. The attributes we use are business id, business categories, review content and rating. Specifically, review content is the corpus for our analysis; rating is our indicator for discriminating between positive and negative sentiment; business id serves as the key for data munging; and business categories serve as the key for grouping. The meaning and data type of these attributes are shown in Table 1 below:
|Attribute|Data type|Meaning|
|---|---|---|
|business id|string|the unique identifier for businesses in Yelp|
|business category|string|the type of business (e.g. restaurant, Chinese)|
|review content|string|the text content of the review|
|rating|float|the rating accompanying each review|
Data Cleaning and Munging
The business dataset was first filtered by the attribute “category” so that only restaurants remained. Then the business dataset was merged with the reviews dataset on the attribute “business id”. After that, each review was split into words and the punctuation was removed, so that a “bag of words” was generated for each review. Finally, we stemmed and lemmatized the words and filtered out the stop words in each bag of words using the built-in stop-word list in Python’s NLTK package.
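The filtering, merging and bag-of-words steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: the function names, the field names `business_id` and `categories`, and the category label `"Restaurants"` are assumed for the example, and the small hard-coded `STOP_WORDS` set stands in for NLTK's built-in list so that the sketch runs without any downloads. Stemming and lemmatization are omitted.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of"}

def restaurant_reviews(businesses, reviews):
    """Keep reviews of restaurants and attach the business categories to each."""
    cats = {b["business_id"]: b["categories"]
            for b in businesses if "Restaurants" in b["categories"]}
    return [dict(r, categories=cats[r["business_id"]])
            for r in reviews if r["business_id"] in cats]

def to_bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation, remove stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

bag = to_bag_of_words("The soup was cold, and the service was slow. Cold!")
```

In a fuller pipeline the stop list would come from NLTK and each token would also pass through a stemmer or lemmatizer before counting.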
The merged review-business data were randomly split into training, validation and test sets in a 3:2:5 ratio.
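A random 3:2:5 split can be sketched as below. The function name `split_3_2_5` and the fixed `seed` are assumptions made for reproducibility of the example; the report does not say how the shuffling was done.

```python
import random

def split_3_2_5(records, seed=0):
    """Shuffle a copy of the records and cut it into 30% / 20% / 50% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 3 // 10
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # the remainder, about 50%
    return train, val, test

train, val, test = split_3_2_5(range(100))
```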
A Support Vector Machine (SVM) model was built in Python to differentiate positive and negative words in reviews. The features in the model were the frequencies of the words that appeared in each review, and the labels were “positive” or “negative”, assigned based on the value of the rating. Specifically, we labeled reviews with ratings greater than or equal to 4 as “positive” and the rest as “negative”. This decision was based on our observation of the distribution of ratings.
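The feature and label construction described above can be illustrated as follows. The name `make_example` and the encoding of “positive” and “negative” as +1 and -1 are assumptions chosen to match the usual SVM convention.

```python
from collections import Counter

def make_example(tokens, rating, threshold=4.0):
    """Turn one tokenized review into (word-count features, +1/-1 label)."""
    features = Counter(tokens)                 # word -> frequency in this review
    label = 1 if rating >= threshold else -1   # rating >= 4 counts as positive
    return features, label

x_pos, y_pos = make_example(["friendly", "staff", "friendly"], 4.0)
x_neg, y_neg = make_example(["dry", "chicken"], 3.0)
```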
Since each word was treated as an individual feature, the feature matrix would be sparse and have a very high dimension. To deal with this problem, a dictionary rather than a list (vector) was set up for each review, storing entries only for the words that appear in the review and thereby avoiding a large number of zeros. This representation can save considerable computation time. We also wrote several helper functions for basic dictionary operations that are used in the iteration process described below.
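Two helper operations of the kind mentioned above might look like this. The names `dot` and `increment` are hypothetical; the report does not list its actual helper functions.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {word: value} dicts."""
    if len(u) > len(v):
        u, v = v, u                      # iterate over the shorter dict
    return sum(value * v.get(word, 0.0) for word, value in u.items())

def increment(u, scale, v):
    """In place: u <- u + scale * v, touching only the keys present in v."""
    for word, value in v.items():
        u[word] = u.get(word, 0.0) + scale * value

w = {"friendly": 0.5, "dry": -1.0}
x = {"friendly": 2, "staff": 1}
score = dot(w, x)          # only "friendly" is shared by both dicts
increment(w, 0.1, x)       # adds 0.1 * x to w
```

Both operations cost time proportional to the number of distinct words in a review rather than to the size of the vocabulary.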
In this project, we used the Pegasos algorithm to train the SVM, since this algorithm was shown to be computationally efficient when it was put forward by Shalev-Shwartz, Singer, Srebro and Cotter in 2011. To align with the notation used in the Pegasos paper, we consider the following formulation of the SVM objective function.
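In the notation of the Pegasos paper, the objective is usually written as below, where $w$ is the weight vector, $\lambda$ is the regularization parameter, and $(x_i, y_i)$ for $i = 1, \dots, m$ are the training examples with labels $y_i \in \{+1, -1\}$. This is the standard form from that paper, given here on the assumption that the report uses it unchanged.

```latex
\min_{w} \; \frac{\lambda}{2}\,\lVert w \rVert^{2}
  + \frac{1}{m} \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr)
```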
Pegasos is a stochastic subgradient descent method with a varying step size. The pseudocode is given below.
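A minimal Python sketch of the basic Pegasos update on dictionary features is shown here. It assumes the plain variant without the optional projection step, uses the step size 1/(lambda * t) from the Pegasos paper, and samples one example uniformly at random per iteration. The function names, `n_iters`, `seed` and the toy data are assumptions for illustration and not the project's settings.

```python
import random

def dot(x, w):
    """Sparse dot product: iterate over the review's words only."""
    return sum(count * w.get(word, 0.0) for word, count in x.items())

def pegasos(examples, lam, n_iters=1000, seed=0):
    """examples: list of ({word: count}, label) pairs with label in {+1, -1}."""
    rng = random.Random(seed)
    w = {}
    for t in range(1, n_iters + 1):
        x, y = examples[rng.randrange(len(examples))]
        eta = 1.0 / (lam * t)                 # step size shrinks as 1 / (lambda * t)
        violated = y * dot(x, w) < 1.0        # margin checked before the update
        shrink = 1.0 - eta * lam              # equals 1 - 1/t
        for word in w:
            w[word] *= shrink                 # subgradient of the regularizer
        if violated:
            for word, count in x.items():     # subgradient of the hinge loss
                w[word] = w.get(word, 0.0) + eta * y * count
    return w

toy = [({"good": 1}, 1), ({"bad": 1}, -1),
       ({"good": 1, "food": 1}, 1), ({"bad": 1, "food": 1}, -1)]
w = pegasos(toy, lam=0.1, n_iters=500)
```

Rescaling every weight at each step is simple but costs time proportional to the vocabulary seen so far. A common refinement keeps a single scalar multiplier alongside the dictionary so that each step touches only the words in the sampled review.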
By assessing the performance of different values of lambda on the validation set, we found that the best accuracy was achieved when lambda was set to 0.0003. The corresponding validation error was 11.035%.
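The selection of lambda on the validation set can be sketched as a simple search over candidate values. Here `train_fn` is a hypothetical placeholder for any trainer that returns a weight dictionary (for example a Pegasos implementation), and the names `error_rate` and `select_lambda` are assumptions for illustration.

```python
def error_rate(w, examples):
    """Fraction of examples whose predicted sign disagrees with the label."""
    wrong = 0
    for x, y in examples:
        score = sum(count * w.get(word, 0.0) for word, count in x.items())
        predicted = 1 if score > 0 else -1
        wrong += predicted != y
    return wrong / len(examples)

def select_lambda(candidates, train_fn, train_set, val_set):
    """Train once per candidate and keep the lambda with the lowest validation error."""
    results = {lam: error_rate(train_fn(train_set, lam), val_set)
               for lam in candidates}
    best = min(results, key=results.get)
    return best, results
```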
Figure 1. Curve of Test Error
Finally, we applied the word scores to the test dataset and evaluated the accuracy of our classifier.
In order to find the specific words that indicate the unique characteristics of each restaurant category, we ignored adjectives that simply describe the polarity of sentiment (e.g. “good”, “amazing”, “terrible”) and assumed that the remaining words reflect the characteristics of the different restaurant categories. Finally, to get a total sentiment score (a value that reflects the polarity of sentiment) for each restaurant category, the sentiment score of each word was first multiplied by its frequency and then normalized by the total number of reviews for that category of restaurants.
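One way to read the scoring rule above is as a per-word value for each category: the word's sentiment score times its frequency in that category's reviews, divided by the number of reviews in the category. The sketch below follows that reading. The function name, the `POLARITY_WORDS` list and the toy weights are assumptions for illustration, and summing the returned values would give a single total for the category.

```python
from collections import Counter

# Assumption: a hand-picked list of generic polarity adjectives to ignore.
POLARITY_WORDS = {"good", "amazing", "terrible"}

def category_word_scores(weights, reviews, ignore=POLARITY_WORDS):
    """Per-word score for one category: weight * frequency / number of reviews.

    weights: {word: sentiment score learned by the classifier}
    reviews: list of token lists, all from the same restaurant category
    """
    freq = Counter(tok for review in reviews for tok in review)
    n = len(reviews)
    return {word: weights[word] * count / n
            for word, count in freq.items()
            if word in weights and word not in ignore}

weights = {"fresh": 0.8, "rude": -1.2, "good": 0.5}
reviews = [["fresh", "sushi", "good"], ["fresh", "rude"]]
scores = category_word_scores(weights, reviews)
```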
The accuracy of the SVM classifier on the test dataset is 88.906% with lambda set to 0.0003. We also found that the accuracy of our model is lower after stemming, lemmatizing and deleting stop words. Our guess is that some stop words still carry sentiment. By ignoring these words, potentially useful information was lost, which led to a lower accuracy. This viewpoint is supported in Saif’s paper. For instance, it has been reported that customers tend to use the words “I/me/mine/his” more frequently when they want to express a negative evaluation.
The top 5 negative and positive words found in the reviews of all restaurants are listed in the figures below.
Figure 2. Top 5 Negative Words of All Types of Restaurants
From the list of top 5 negative words, it can easily be observed that dry food is a major problem when it comes to food quality, since “dry” outnumbers the other words by a large margin. We can also conclude from this result that restaurant owners should avoid slow or rude service. As for the word “cold”, its meaning remains ambiguous, since we are not sure whether it refers to the food or to the environment of the restaurant.
Figure 3. Top 5 Positive Words of All Types of Restaurants
Unexpectedly, the flavor of dishes does not rank first among the positive words. Instead, service seems to be the priority for most customers, since the word “friendly” ranks first and the word “attentive” is also in the top 5. It can also be observed that, when it comes to the flavor of food, customers value freshness more than tastiness, and that spicy food seems to appeal to many customers.
The top-ranked words for different types of restaurants were also identified (see Appendix 1 and Appendix 2), giving us a basic understanding of the characteristics of each category of restaurants.
From the negative word list, we can observe that being overpriced is one of the main problems for Italian, French and East Asian restaurants. For Chinese and Vietnamese restaurants, the rude attitude of servers is more likely to be the reason for a low score. On the other hand, we notice that “fresh” ranks first among the positive words for Japanese and Vietnamese food. Some dish names also appear in the positive word list, which might indicate that people prefer certain restaurants for their specific dishes, like pho in Vietnamese food and pizza in Italian food, or maybe that such dishes simply satisfy Yelpers more easily than others.
In this project, we have developed an efficient SVM model for discriminating between positive and negative sentiment in Yelp reviews, with an accuracy of 88.906% on the test set. Our model can also be used to automatically generate ratings for tips (short reviews that are not accompanied by ratings) on Yelp, by scoring tips with the sentiment scores of their words, and thus to give more reasonable overall ratings for restaurants.
Based on our analysis, we found that for most restaurant types, “friendly” ranks first among the positive words, indicating that service might weigh more than taste when people judge a restaurant. In addition, different restaurant categories show different characteristics. Japanese and Vietnamese food received positive feedback because of freshness, while Korean and Thai restaurants received positive reviews for their spicy food. It can also be noticed that the food at most Asian restaurants, including Chinese, Japanese, Southern, Thai and Vietnamese, is considered salty. French, Italian, Japanese and Korean food is regarded as overpriced, which might have something to do with the relatively better environment and service of these restaurants. In contrast, servers in Chinese and Vietnamese restaurants are described as rude.
Although the performance of our model is acceptable, there is still a lot of room for improvement. One suggestion for future work is to try boosting, Naïve Bayes or neural network classifiers and to check whether they outperform our algorithm. For feature selection, the tf-idf measure may also be considered instead of a simple count of each word. Also, considering the large size of the data, we suggest performing the work on a big data framework such as Spark to increase computing efficiency.
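As an illustration of the tf-idf suggestion, the sketch below reweights raw word counts by the logarithm of the inverse document frequency. It uses one common unsmoothed variant, count * log(N / df); the function name `tf_idf` and the toy bags are assumptions, and other tf-idf variants differ in smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(bags):
    """bags: list of {word: count} dicts. Returns a list of {word: count * log(N / df)}."""
    n = len(bags)
    df = Counter(word for bag in bags for word in bag)   # reviews containing each word
    return [{word: count * math.log(n / df[word]) for word, count in bag.items()}
            for bag in bags]

bags = [{"food": 1, "dry": 2}, {"food": 1}]
weighted = tf_idf(bags)
```

Words that occur in every review, such as “food” here, receive weight zero, while rarer words keep more of their count.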
Appendix 1. Top Negative Words for Different Types of Restaurants
Appendix 2. Top Positive Words for Different Types of Restaurants