Introduction

In recent years, sentiment analysis has become a popular method for learning what customers think about products and services. It is used both in academic research and in commerce.

In essence, it mines text data and evaluates them for subjective opinions or sentiments.

Valuable information can come from websites selling products and services, e.g. reviews on Amazon or Tripadvisor. Even larger sets of sentiment data can be obtained by analysing content produced on social media platforms such as Twitter, Instagram and others.

Historically, the first phase of sentiment analysis focused on determining the overall sentiment or sentiment polarity of sentences, paragraphs or entire documents.

However, companies have recently become more demanding: they are no longer interested only in the overall sentiment of texts about their products and services. They want to know in more detail what customers are talking about:

  • which specific products are mentioned in customer opinions
  • what aspects of products or services are mentioned (e.g. for a hotel, possible aspects are location, service and price)
  • what opinion or sentiment customers express about these aspects in their reviews

Aspect Based Sentiment Analysis

This more detailed approach is known as Aspect Based Sentiment Analysis (ABSA).

ABSA is concerned with learning more about specific aspects of products or services. It consists of several steps:

  • identification of relevant entities
  • extraction of their features and aspects (also sometimes called aspect extraction)
  • using so-called aspect terms to find out the sentiment about a particular feature or aspect (sentiment polarities are positive, neutral and negative)

How do we determine the aspects? Several approaches are available, from deep learning to dependency parsing. A great library for dependency parsing and aspect extraction is spaCy. Another widely used dependency parsing library is Stanford CoreNLP.

Sentiment analysis, i.e. determining the sentiment of aspects or whole sentences, can be done by training machine learning or deep learning models. I will show you code for training a fairly large and accurate sentiment classification model yourself.

Training a sentiment classifier using machine learning involves:

  • preparing a suitable labelled data set (we will use the Stanford Sentiment140 data set of labelled tweets)
  • choosing a specific machine learning model, e.g. Support Vector Machines are very suitable for this text classification task
  • training the model on the data set
  • evaluating the results (checking precision, recall, F1-score and accuracy)
  • using the model in production to produce insights

Some companies have built sentiment classification systems and offer sentiment analysis consulting to build a system customised for your needs.

Sentiment analysis can be applied to many types of texts

Sentiment analysis allows you to extract sentiment from a wide array of possible texts:

  • tweets
  • Instagram posts
  • product reviews
  • restaurant reviews
  • hotel reviews
  • surveys
  • emails
  • support tickets

Sentiment analysis, or opinion mining, is a great solution for companies that hold large volumes of unstructured text, e.g. email communications with customers. It allows them to gain valuable information and actionable insights from these repositories of data.

Training a sentiment classifier based on SVM (Support Vector Machines) using the Stanford Sentiment140 data set

We will use the scikit-learn library to train a sentiment classifier based on SVM.

We will be using the Stanford Sentiment140 data set. You can download it from this website:

http://help.sentiment140.com/for-students

Note the data format. Each record has 6 fields:

0 – the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 – the id of the tweet
2 – the date of the tweet
3 – the query. If there is no query, then this value is NO_QUERY.
4 – the user
5 – the text of the tweet

Let us load the libraries:
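The original import listing is not reproduced here. As a minimal sketch, the steps that follow would need roughly these imports, assuming scikit-learn is installed. Reading the CSV file with the standard library rather than another loader is an assumption made for this sketch.

```python
# Standard library: reading the CSV file and cleaning tweet text.
import csv
import re

# scikit-learn: feature extraction, model, data splitting and metrics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
```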

The next step is to load the data from the Stanford Sentiment140 data set and preprocess it:
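One possible way to do this is sketched below. It assumes the six-field layout listed above, with every record on one CSV row and no header. The helper names (clean_tweet, parse_rows, load_sentiment140), the latin-1 encoding, the cleaning rules and the choice to skip neutral tweets are all assumptions of this sketch, not necessarily what the original code did.

```python
import csv
import re

N_FIELDS = 6  # polarity, id, date, query, user, text


def clean_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_rows(rows):
    """Turn raw CSV rows into cleaned texts and binary labels.

    Polarity 0 becomes label 0 (negative) and polarity 4 becomes
    label 1 (positive). Neutral rows (polarity 2) and malformed rows
    are skipped, so the task stays binary.
    """
    texts, labels = [], []
    for row in rows:
        if len(row) != N_FIELDS or row[0] not in ("0", "4"):
            continue
        texts.append(clean_tweet(row[5]))
        labels.append(1 if row[0] == "4" else 0)
    return texts, labels


def load_sentiment140(path, encoding="latin-1"):
    """Read the data set from a CSV file at `path` (encoding is assumed)."""
    with open(path, newline="", encoding=encoding) as handle:
        return parse_rows(csv.reader(handle))


# Hypothetical usage; replace the path with your downloaded file:
# texts, labels = load_sentiment140("path/to/sentiment140_training.csv")
```

Keeping the row parsing separate from the file reading makes it easy to try the cleaning rules on a handful of hand-written rows before running them over the full file.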

We will use a TF-IDF representation of the tweets before feeding them to the SVM model:
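A small sketch of this step with scikit-learn's TfidfVectorizer follows. The three toy tweets stand in for the real data, and the settings shown (unigrams plus bigrams, sublinear term frequency) are illustrative assumptions rather than the values used in the original experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets standing in for the cleaned Sentiment140 texts.
tweets = [
    "i love this phone",
    "i hate this phone",
    "the battery is great",
]

# Unigrams and bigrams; sublinear_tf replaces tf with 1 + log(tf).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)

# Learn the vocabulary and return a sparse document-term matrix.
X = vectorizer.fit_transform(tweets)

print(X.shape)  # (number of tweets, number of distinct n-grams)
print(sorted(vectorizer.vocabulary_)[:5])
```

The result is a sparse matrix with one row per tweet, which keeps memory use manageable even with a large vocabulary.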

We next train the model using the linear SVM from scikit-learn:
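The training step could look like the sketch below. It chains the TF-IDF vectorizer and LinearSVC in one scikit-learn pipeline and fits it on a few toy sentences in place of the real tweets. The function name train_sentiment_svm, the toy data and the default C=1.0 are assumptions of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_sentiment_svm(texts, labels, C=1.0):
    """Fit a TF-IDF + linear SVM pipeline on raw texts and 0/1 labels."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LinearSVC(C=C),
    )
    model.fit(texts, labels)
    return model


# Toy training data: 1 = positive, 0 = negative.
toy_texts = [
    "i love this phone",
    "what a great day",
    "best service ever",
    "i hate this phone",
    "what a terrible day",
    "worst service ever",
]
toy_labels = [1, 1, 1, 0, 0, 0]

model = train_sentiment_svm(toy_texts, toy_labels)
print(model.predict(["great phone", "terrible service"]))
```

Wrapping both steps in a pipeline means the same vectorizer vocabulary is reused automatically when predicting on new text.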

Once training has converged, we can evaluate the model by calculating precision, recall and F1-score:
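The sketch below shows how these metrics can be computed with scikit-learn. The label lists are made up for illustration. In practice y_true would come from a held-out test split and y_pred from the trained model's predict method.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Metrics for the positive class only (average="binary").
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
accuracy = accuracy_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.75 recall=0.75 f1=0.75 accuracy=0.75

# Per-class breakdown for both labels.
print(classification_report(y_true, y_pred,
                            target_names=["negative", "positive"]))
```

Here 3 of the 4 predicted positives are correct and 3 of the 4 actual positives are found, so precision and recall are both 0.75.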

The sentiment classifier trained on the Stanford Sentiment140 data set reaches a good accuracy of 82%.