Introduction
Sentiment analysis has become a popular method in recent years for learning what clients think about products and services. It is used both in academic research and in commerce.
At its core, it is the mining of text data, which are then evaluated for subjective opinions or sentiments.
Valuable information can come from websites selling products and services, e.g. reviews on Amazon or Tripadvisor. Even larger amounts of sentiment data can be obtained by analysing content produced on social media platforms like Twitter, Instagram and others.
Historically, the first phase of sentiment analysis focused on determining the overall sentiment or sentiment polarity of sentences, paragraphs or entire documents.
However, companies have become more demanding recently: they are no longer interested only in the overall sentiment of texts about their products and services. They want to know in more detail what the customers are talking about:
- which specific products are mentioned in customer opinions
- which aspects of the products or services are mentioned (e.g. for a hotel, possible aspects are location, service and price)
- what the opinion or sentiment on these aspects is, as gathered from customer reviews
Aspect Based Sentiment Analysis
This approach is also known under a specific name – Aspect Based Sentiment Analysis (ABSA).
ABSA is essentially interested in learning more about specific aspects of products or services. It consists of several subtasks:
- identification of relevant entities
- extraction of their features and aspects (also sometimes called aspect extraction)
- using these so-called aspect terms to determine the sentiment about a particular feature or aspect (the sentiment polarities being positive, neutral and negative)
How do we determine the aspects? One can use several approaches, ranging from deep learning to dependency parsing. A great library for dependency parsing and aspect extraction is spaCy; another frequently used dependency parser is included in Stanford CoreNLP.
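Below is a minimal sketch of the dependency-parsing approach with spaCy. It assumes the small English model is installed (python -m spacy download en_core_web_sm); the function name and the example sentence are just for illustration, and the exact pairs you get depend on the parser's output:

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_aspect_opinions(text):
    # pair nouns with adjectives attached as attributive modifiers (amod),
    # and subjects with adjectival predicates ("the service was slow")
    doc = nlp(text)
    pairs = []
    for token in doc:
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            pairs.append((token.head.text, token.text))
        elif token.dep_ == "nsubj" and token.head.pos_ == "ADJ":
            pairs.append((token.text, token.head.text))
    return pairs

print(extract_aspect_opinions("The location is great but the service was slow."))
# e.g. [('location', 'great'), ('service', 'slow')]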
Sentiment analysis, i.e. determining the sentiment of aspects or of whole sentences, can be done by training machine learning or deep learning models. I will show you the code with which you can train a rather large and accurate sentiment classification model yourself.
Training a sentiment classifier using machine learning involves:
- preparing a suitable labelled data set (we will use the Stanford Sentiment140 labelled data set of tweets)
- choosing a specific machine learning model, e.g. Support Vector Machines, which are very suitable for this text classification task
- training the model on the data set
- evaluating the results (checking precision, recall, F1-score and accuracy)
- using the model in production to produce insights
There are companies that have built sentiment classification systems and offer sentiment analysis consulting to build a sentiment classification system customised for your needs.
Sentiment analysis can be applied to many types of texts
Sentiment analysis allows you to extract sentiment from a wide array of possible texts:
- tweets
- Instagram posts
- product reviews
- restaurant reviews
- hotel reviews
- surveys
- emails
- support tickets
Sentiment analysis or opinion mining is a great solution for companies that have big data in the form of unstructured texts, e.g. email communications with customers. It allows them to gain valuable information and actionable insights from these repositories of data.
Training a sentiment classifier based on SVM (Support Vector Machines) using the Stanford Sentiment140 data set
We will use the Scikit-learn library to train a sentiment classifier based on SVM.
We will be using the Stanford Sentiment140 data set. You can download it from this website:
http://help.sentiment140.com/for-students
Note the data format; it has 6 fields:
0 – the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 – the id of the tweet
2 – the date of the tweet
3 – the query. If there is no query, then this value is NO_QUERY.
4 – the user
5 – the text of the tweet
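As a quick check of this format (assuming the downloaded training file has been saved as training.csv, as in the code that follows), you can peek at the first few rows with pandas:

import pandas as pd

# the file has no header row, so pass header=None
raw = pd.read_csv('training.csv', encoding='ISO-8859-1', header=None, nrows=3)
print(raw)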
Let us load the libraries:
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
import joblib  # for persisting the trained model (replaces the deprecated sklearn.externals.joblib)
The next step is loading the data from the Sentiment140 data set and preprocessing it:
def clean_tweets(df):
    # lowercase, then strip @mentions, URLs and all non-alphanumeric characters
    df['Text'] = df['Text'].str.lower()
    df['Text'] = df['Text'].apply(lambda x: ' '.join(
        re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split()))
    return df

# load and preprocess the data (the CSV has no header row)
df = pd.read_csv('training.csv', encoding='ISO-8859-1', header=None)
df.columns = ['Sentiment', 'Id', 'Date', 'Query', 'User', 'Text']
del df['Id'], df['Date'], df['Query'], df['User']
df['Sentiment'] = df['Sentiment'].map({0: 'Negative', 2: 'Neutral', 4: 'Positive'})
df = df.sample(frac=0.1).reset_index(drop=True)  # work with a 10% random sample
df = clean_tweets(df)
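To see what clean_tweets does, here is an illustrative before/after on a made-up tweet:

sample = pd.DataFrame({'Text': ["@JohnDoe I LOVE this!!! https://t.co/abc123 #awesome"]})
print(clean_tweets(sample)['Text'][0])
# -> 'i love this awesome'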
We will use the TF-IDF representation of tweets before feeding them to the SVM model:
# preparing training and test data
X = df['Text'].to_list()
y = df['Sentiment'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)

# fit the TF-IDF vectorizer (unigrams and bigrams) on the training split only
Vectorizer = TfidfVectorizer(max_df=0.9, ngram_range=(1, 2))
TfIdf = Vectorizer.fit(X_train)
X_train = TfIdf.transform(X_train)

# encode the string labels as integers
le = LabelEncoder()
y_train = le.fit_transform(y_train)
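As a quick sanity check, you can inspect the size of the learned feature space (the exact numbers depend on the random sample):

print(len(TfIdf.vocabulary_))  # number of distinct unigram/bigram features
print(X_train.shape)           # (number of training tweets, number of features)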
We next train the model using the linear SVM from scikit-learn:
# training the model
model = LinearSVC(C=0.1)
model.fit(X_train, y_train)
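The regularisation strength C=0.1 works well here, but it is worth tuning. Below is a small sketch of a cross-validated grid search over C; the grid values are illustrative:

from sklearn.model_selection import GridSearchCV

# try a few values of C with 3-fold cross-validation on the training split
grid = GridSearchCV(LinearSVC(), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)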
After training, we can evaluate the quality of the model by calculating precision, recall and F1-score on the held-out test set:
# evaluation on the held-out test set
X_test = TfIdf.transform(X_test)
y_test = le.transform(y_test)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
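Since we imported joblib above, we can also persist the fitted vectorizer, label encoder and model for reuse in production; the file names below are just illustrative:

# save the fitted artefacts so the classifier can be reused without retraining
joblib.dump(TfIdf, 'tfidf_vectorizer.joblib')
joblib.dump(le, 'label_encoder.joblib')
joblib.dump(model, 'svm_sentiment.joblib')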
The sentiment classifier trained on the Sentiment140 data set achieves a good accuracy of 82%. Note that the Sentiment140 training file contains only negative and positive tweets, so after label encoding the report shows two classes, 0 (Negative) and 1 (Positive):
              precision    recall  f1-score   support

           0       0.82      0.82      0.82     79725
           1       0.82      0.82      0.82     80275

   micro avg       0.82      0.82      0.82    160000
   macro avg       0.82      0.82      0.82    160000
weighted avg       0.82      0.82      0.82    160000
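Finally, a short sketch of using the trained classifier on new, unseen texts; the example tweets and the printed labels are illustrative:

# classify new texts with the trained pipeline
new = pd.DataFrame({'Text': ["I love this phone, the battery lasts forever!",
                             "Worst customer service I have ever experienced."]})
new = clean_tweets(new)
X_new = TfIdf.transform(new['Text'])
print(le.inverse_transform(model.predict(X_new)))
# e.g. ['Positive' 'Negative']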