Bitcoin sentiment analysis

Crypto markets are much more susceptible to investor hype and mood than, for example, stock markets. In stock markets, a company's value is typically estimated by discounted cash flow valuation and depends largely on observable data, namely financial statements. Financial analysts therefore spend most of their time analysing financial statements and forecasting revenue, net profit and cash flow, e.g. free cash flow.

Cryptocurrencies have nothing comparable. This is why the value of a given coin depends much more on what other people, chiefly investors, think of the coin.

As a result, opinions and sentiment about a coin carry greater weight in the valuation of bitcoin or any given altcoin.

Bitcoin sentiment analysis can thus be an important tool for assessing or predicting bitcoin price developments.

Bitcoin sentiment analysis is done by regularly collecting tweets and other social media posts about bitcoin and then determining the sentiment (positive or negative) of each post. This is usually done with data science and machine learning techniques: first a machine learning model is trained, e.g. with sklearn, to predict the sentiment (positive or negative, 1 or 0) of a given text. The trained model is then deployed on the stream of tweets and other social media posts.

This stream needs to be managed by some kind of data pipeline, e.g. one built on Spark or a similar library.

Sentiment analysis is just one of many text classification tasks. Others include product categorization, news classification and product tagging.

Product categorization is especially important in eCommerce, where online stores often want to determine the categories of the products they sell. This makes it easier for their customers to search for products.

Product tagging, on the other hand, is a more modern variant of product categorization in which an online store assigns one or more tags to a given product rather than a single category. This allows shoppers an even more refined search for products.

Sentiment Analysis with Machine Learning, Opinion Mining

Introduction

Sentiment analysis has become a popular method in recent years for learning what customers think about products and services. It is used both for academic purposes and in commerce.

It is essentially a form of data mining in which the collected text is evaluated for subjective opinions or sentiments.

Valuable information can come from websites selling products and services, e.g. reviews on Amazon or Tripadvisor. Even larger data sets of sentiment data can be obtained from analysing data produced on social media platforms like Twitter, Instagram and others.

Historically, the first phase of sentiment analysis focused on determining the overall sentiment or sentiment polarity of sentences, paragraphs or entire documents.

However, companies have recently become more demanding and are no longer interested only in the overall sentiment of texts about their products and services. They want to know in more detail what customers are talking about:

  • which specific products are mentioned in customer opinions
  • what aspects of products or services are mentioned (e.g. for a hotel, possible aspects are location, service and price)
  • what opinion or sentiment the customer reviews express about these aspects

Aspect Based Sentiment Analysis

This approach is also known under a specific name – Aspect Based Sentiment Analysis (ABSA).

ABSA is essentially concerned with learning more about specific aspects of products or services. It consists of several steps:

  • identification of relevant entities
  • extraction of their features and aspects (also sometimes called aspect extraction)
  • using so-called aspect terms to find out the sentiment about a particular feature or aspect (sentiment polarities are positive, neutral and negative)
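The three steps above can be illustrated with a deliberately simple toy sketch. It does not use dependency parsing or a trained model: it looks for known aspect terms and sums the scores of opinion words within a small window around them. The aspect list, the opinion lexicon, the window size and the function name `aspect_sentiments` are all assumptions invented for this illustration.

```python
import re

# Hypothetical aspect terms and opinion lexicon, invented for this illustration.
ASPECTS = {"location", "service", "price"}
OPINION_WORDS = {
    "great": 1, "friendly": 1, "cheap": 1,
    "slow": -1, "rude": -1, "expensive": -1,
}


def aspect_sentiments(review, window=2):
    """Map each aspect term found in the review to a sentiment polarity."""
    tokens = re.findall(r"[a-z']+", review.lower())
    results = {}
    for i, token in enumerate(tokens):
        if token not in ASPECTS:
            continue
        # Score the opinion words within `window` tokens of the aspect term.
        nearby = tokens[max(0, i - window): i + window + 1]
        score = sum(OPINION_WORDS.get(word, 0) for word in nearby)
        if score > 0:
            results[token] = "positive"
        elif score < 0:
            results[token] = "negative"
        else:
            results[token] = "neutral"
    return results


review = "The location was great but the service was slow."
print(aspect_sentiments(review))
```

A fixed window is crude and easily misattributes opinions in longer sentences, which is one reason dependency parsing or learned models are normally preferred for real aspect extraction.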

How do we determine the aspects? One can use several approaches, from deep learning to dependency parsing. A great library for dependency parsing and aspect extraction is spaCy. Another frequently used library for dependency parsing is Stanford CoreNLP.

Sentiment analysis, i.e. determining the sentiment of aspects or whole sentences, can be done by training machine learning or deep learning models. I will show you code for training a rather large and accurate sentiment classification model yourself.

Training a sentiment classifier using machine learning involves:

  • preparing a suitable labelled data set (we will use the Stanford labelled data set of tweets)
  • choosing a specific machine learning model, e.g. Support Vector Machines are very suitable for this text classification task
  • training the model on the data set
  • evaluating the results (checking precision, recall, f-score and accuracy)
  • using the model in production to produce insights

Some companies have built sentiment classification systems and offer sentiment analysis consulting to build a system customised to your needs.

Sentiment analysis can be applied to many types of text

Sentiment analysis allows you to extract sentiment from a wide array of possible texts:

  • tweets
  • Instagram posts
  • product reviews
  • restaurant reviews
  • hotel reviews
  • surveys
  • emails
  • support tickets

Sentiment analysis or opinion mining is a great solution for companies that hold large volumes of unstructured text, e.g. email communication with customers. It allows them to gain valuable information and actionable insights from these repositories of data.

Training a sentiment classifier based on SVM (Support Vector Machines) using the Stanford 140 data set

We will use the Scikit-learn library to train a sentiment classifier based on SVM.

We will be using the Stanford 140 data set. You can download it from this website:

http://help.sentiment140.com/for-students

Note the data format; each row has 6 fields:

0 – the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 – the id of the tweet
2 – the date of the tweet
3 – the query. If there is no query, then this value is NO_QUERY.
4 – the user
5 – the text of the tweet

Let us load the libraries:
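The original import list is not shown here, so the following is only a plausible set, assuming the workflow described in this section (CSV loading, regex clean-up, TF-IDF features, a linear SVM and standard metrics):

```python
import csv  # reading the comma-separated tweet file
import re   # regex-based tweet clean-up

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
```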

The next step is to load the data from the Stanford 140 data set and preprocess it:
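A minimal sketch of this step, under several assumptions: the column names are our own labels for the six fields listed above, neutral rows (polarity 2) are dropped so that the labels become 0/1, and the clean-up rules (lowercasing, removing URLs, @mentions and punctuation) are one common choice rather than the original preprocessing. Two made-up rows stand in for the real file.

```python
import csv
import io
import re

# Our own labels for the six fields described above.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, punctuation and extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_tweets(handle):
    """Read rows in the six-field format and return (texts, labels) with labels 0/1."""
    texts, labels = [], []
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        polarity = int(record["polarity"])
        if polarity == 2:  # assumption: neutral tweets are skipped
            continue
        texts.append(clean_tweet(record["text"]))
        labels.append(1 if polarity == 4 else 0)
    return texts, labels


# Two made-up rows in the documented format, standing in for the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a",'
    '"@bob I hate Mondays http://t.co/x"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b",'
    '"Loving the sunshine today!"\n'
)
texts, labels = load_tweets(sample)
print(texts, labels)
```

For the downloaded file, the same function can be given an open file handle instead of the `io.StringIO` sample. The file name and encoding have to be checked against the actual download.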

We will use a TF-IDF representation of the tweets before feeding them to the SVM model:
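As a rough sketch, scikit-learn's `TfidfVectorizer` turns a list of cleaned tweets into a sparse matrix with one row per tweet and one column per term. The three example tweets and the parameter choices (unigrams plus bigrams, sublinear term frequency) are assumptions made for illustration, not the settings of the original model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already cleaned tweets used only to show the shape of the output.
texts = [
    "i hate mondays",
    "loving the sunshine today",
    "i love this sunshine",
]

# Assumed settings: unigrams and bigrams, with log-scaled term frequencies.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(texts)

print(X.shape)  # (number of tweets, number of distinct terms)
print(sorted(vectorizer.vocabulary_)[:5])
```

Note that the default tokenizer keeps only tokens of two or more characters, so a word such as "i" does not become a feature.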

Next, we train the model using the linear SVM from scikit-learn:
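A minimal sketch of the training step, assuming `LinearSVC` with its default regularisation strength and a pipeline that applies TF-IDF before the SVM. The eight training sentences are invented. On the real data set the texts and labels would come from the loading step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy training data: 1 = positive, 0 = negative.
train_texts = [
    "great day love it",
    "what a great movie",
    "love this so much",
    "happy and great",
    "terrible day hate it",
    "this is terrible",
    "hate this so much",
    "sad and awful",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feed directly into a linear SVM (C=1.0 is an assumed setting).
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0, random_state=0))
model.fit(train_texts, train_labels)

predictions = model.predict(["great love", "terrible hate"])
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the same TF-IDF vocabulary for training and prediction, which avoids a common source of mismatched features.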

Once training has converged, we can evaluate the model by calculating precision, recall and f1-score:
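The metrics can be computed with `sklearn.metrics`. In this sketch the true and predicted labels are made up so that the numbers are easy to check by hand. In practice `y_true` would be the labels of a held-out test set and `y_pred` the model's predictions for it.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Made-up labels standing in for a held-out test set (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Precision, recall and F1 for the positive class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
accuracy = accuracy_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")

# Per-class breakdown; target_names follow the sorted label order (0, then 1).
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

Here 3 of the 4 predicted positives are correct and 3 of the 4 actual positives are found, so precision, recall and F1 are all 0.75. 6 of the 8 predictions are correct, which gives an accuracy of 0.75.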

The sentiment classifier trained on the Stanford 140 data set reaches a good accuracy of 82%: