The following is a tutorial for conducting a quality sentiment analysis of social media data (in this case Twitter). I describe what sentiment analysis is, how it started, and why it is important. I also offer a sentiment analysis process that I believe sums up the technique. I then introduce a valuable tool called SentiStrength. Following data cleaning and analysis, sentiment is visualized.
SentiStrength has already been employed by researchers and findings have been published in a range of scholarly research journals. I am quite confident that you will find this sentiment analysis tutorial beneficial.
What is Sentiment Analysis?
Sentiment analysis is the automated process of understanding opinions and emotions about a given subject from written or spoken language. Sentiment analysis is also known as opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, and review mining.
According to Merriam-Webster’s Collegiate Dictionary, sentiment is defined as an attitude, thought, or judgment prompted by feeling.
Sentiment analysis is an active area of research in natural language processing (NLP). NLP is considered a sub-field of artificial intelligence in which computers interpret and process human language.
How did it all start?
Sentiment analysis has been used across various disciplines. It is believed to have started from computer science. Later, management and then social sciences adopted sentiment analysis. Sentiment analysis has been extensively used in linguistic and machine learning studies.
Large corporations have built their own in-house capabilities (e.g., Microsoft, Google, IBM, SAP, and SAS).
Basic Sentiment Analysis: Classifying the polarity of a given text at the document, sentence, or tweet level as positive, negative, or neutral.
Advanced Sentiment Analysis: Understanding emotional states. For example, happy, angry, and sad.
Why is it important?
Sentiment analysis has attracted interest from researchers, journalists, companies, and governments. Opinions and sentiments are extracted to create structured and actionable knowledge that can be used by a decision maker.
The advent of social media has increased the value of sentiment analysis. Social networks are not only fueling the digital revolution, but also enabling the expression and spread of emotions and opinions through the network.
Leveraging new media requires constant monitoring of information. In the political arena, sentiments can determine election outcomes. Businesses carefully guard their brand image, so user sentiment on social media needs to be constantly monitored.
Issues in Sentiment Analysis
The most problematic figures of speech in NLP are irony and sarcasm. Another issue is defining rules that detect implicit sentiment (e.g., sentiment expressed through misspellings or exclamation marks).
A sentiment analysis program typically achieves 70% accuracy in classifying sentiment.
Human raters typically agree with each other only about 80% of the time (Ogneva, 2012).
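To make the agreement figure concrete, here is a minimal sketch of how simple percent agreement between two raters could be computed. The function name and the five labels are invented for illustration; they are not taken from Ogneva (2012), which may have used a different agreement measure.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two label sequences agree (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both sequences must label the same items")
    if not labels_a:
        raise ValueError("need at least one labelled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for five tweets, invented for illustration only.
rater_1 = ["positive", "negative", "neutral", "negative", "positive"]
rater_2 = ["positive", "negative", "negative", "negative", "positive"]

print(f"{percent_agreement(rater_1, rater_2):.0%}")
```

The same function can compare a program's labels against one human rater's labels, which is one simple way to read an accuracy figure.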
Sentiment Analysis Process
- Topic Identification
What are you interested in knowing? State the research question. Why does it matter? Who cares?
- Medium Identification
Identify where you want to study the sentiment. Will it be user-generated content on social media (YouTube comments, tweets on Twitter, Facebook posts, blog posts, etc.)?
- Content Search
Define keywords through which you will get the desired data. Clearly defined search parameters are of vital importance in getting the right kind of data that relates to the initial research questions.
- Data Cleaning
Raw data is full of noise. Cleaning data (especially social media data) requires ample sifting. Spam, content from fake accounts, data produced by bots, text in other languages, etc. need to be removed to create a clean data file.
- Sentiment Analysis
The clean data file can then be used to run the sentiment analysis.
Once the sentiment analysis is completed, the data needs to be visualized or put in an organized format to make sense of it.
SentiStrength is free for academic research and can be tried live online or downloaded (Windows only) from http://sentistrength.wlv.ac.uk.
- SentiStrength is a program that compares social media text against a lexicon-based classifier of sentiments.
- SentiStrength measures sentiment strength by assigning scores ranging from -5 to +5.
- Positive numbers indicate favorable attitudes while negative numbers indicate negative attitudes.
- The program also provides a separate score for each word within a sentence thereby giving the average sentiment strength of the content (e.g. tweet).
- Psychologists believe that human emotion can be positive and negative at the same time (Norman et al., 2011). These are commonly known as mixed emotions. Inspired by this psychological reasoning, SentiStrength was created to detect both positive and negative sentiment simultaneously.
- Emotions are socially constructed (Cornelius, 1996; Fox, 2008).
- SentiStrength uses a lexical approach. At its heart is a lexicon of 1,125 words and 1,364 word stems, each with a score for positive or negative sentiment. When one of these matches a word in a text, this suggests the presence of sentiment and its strength.
For example, ailing has a score of -3 in the lexicon, and so sentences containing this word may have a moderate negative sentiment.
- Positive sentiments can include words such as: good, happy, great, fantastic, wonderful, lovely, excited, nice, and kind. Negative sentiments can include words such as: terrible, lazy, crazy, hurt, bad, and disappointed.
- Negation is commonly used when expressing opinions. A positive term that is preceded by a negating word (e.g., not, don’t) has its sentiment flipped by SentiStrength (e.g., I don’t like it), whereas negative terms are neutralized (e.g., I don’t hate you).
- Terms preceded by booster words like very and extremely have their positive or negative sentiment strength increased, whereas quite decreases the sentiment strength of the next word.
There are also rules for questions, idioms, spelling correction, and punctuation, as well as rules that are specific to computer-mediated communication methods of expressing sentiment.
- As part of this, SentiStrength has a list of emoticons, together with sentiment strength scores for them (e.g., smiley faces like =) score +2).
- SentiStrength is very fast (it can process 14,000 tweets per second on a standard PC), is transparent (it shows how its scores were calculated), and includes other languages (Vural, Cambazoglu, Senkul, & Tokgoz, 2013).
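The lexicon, negation, and booster rules described above can be sketched in a few lines of Python. This is a toy illustration of the general idea, not SentiStrength's actual code or data. Only the score for "ailing" (-3) comes from the text; the other lexicon scores, the size of the booster adjustment, and the choice to report 0 when no sentiment word is found are all assumptions made for this sketch.

```python
import re

# Toy lexicon. Only "ailing" (-3) comes from the text above; every other
# score here is invented for illustration and is not SentiStrength's data.
LEXICON = {"ailing": -3, "good": 2, "like": 2, "love": 3, "hate": -4, "bad": -2}
NEGATORS = {"not", "don't"}
# Assumed change in strength caused by a preceding booster word.
BOOSTERS = {"very": 1, "extremely": 1, "quite": -1}


def score_text(text):
    """Return (strongest positive, strongest negative) word score, 0 if none."""
    words = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    positive, negative = 0, 0
    for i, word in enumerate(words):
        score = LEXICON.get(word)
        if score is None:
            continue
        previous = words[i - 1] if i > 0 else ""
        if previous in BOOSTERS:
            # Adjust the strength while keeping the sign, clamped to 1..5.
            strength = max(1, min(5, abs(score) + BOOSTERS[previous]))
            score = strength if score > 0 else -strength
        elif previous in NEGATORS:
            # Negated positive terms flip; negated negative terms neutralize.
            score = -score if score > 0 else 0
        positive = max(positive, score)
        negative = min(negative, score)
    return positive, negative


for example in ["I don't like it", "I don't hate you", "very good", "The ailing airline"]:
    print(example, score_text(example))
```

The sketch only looks at the single word before each sentiment term, so combinations such as "not very good" are not handled, and it omits the emoticon, idiom, and spelling rules entirely.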
Familiarizing Yourself with Twitter Data and Data Cleaning
Before we start the analysis of social media data (in this case tweets), we need to clean the data and convert it to .txt file format so that it can be analyzed for sentiment in SentiStrength.
We will be analyzing the sentiment around the Boeing 737 Max airplane, which made international news headlines. Twitter data was obtained for this purpose using the keywords “737 Max”.
For this tutorial, download the data file: “737_Max.xls”
Open the Excel file, which contains the data for the Boeing 737 Max tweets. View the raw datasheet “737_max_Raw”.
The raw data from Twitter is depicted in the screenshot below.
Now click on “Clean_737_Max” worksheet.
You will notice that the data file has been cleaned of (i) retweets, (ii) tweets in languages other than English, and (iii) text that made no sense.
For example, Spanish language tweets were deleted using the “filter” function in Excel.
We will export this worksheet and save it as a .txt file.
The clean data file just containing the tweets is ready to be analyzed for sentiment in SentiStrength.
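For readers who prefer a script to Excel filters, the same cleaning and export steps could be sketched as follows. This is not the procedure used to prepare the tutorial file. The row keys `text` and `lang`, the "RT @" prefix test for retweets, and the output file name are assumptions for illustration; removing text that makes no sense still requires manual review.

```python
from pathlib import Path


def clean_tweets(rows):
    """Keep English, non-retweet, non-empty tweet texts, one line each."""
    cleaned = []
    for row in rows:
        # Collapse line breaks and repeated spaces so each tweet fits on one line.
        text = " ".join(row.get("text", "").split())
        if not text:
            continue
        if text.startswith("RT @"):  # assumed retweet convention
            continue
        if row.get("lang", "en") != "en":  # hypothetical language column
            continue
        cleaned.append(text)
    return cleaned


def write_one_per_line(texts, path):
    """Write the tweets to a UTF-8 text file, one tweet per line."""
    Path(path).write_text("\n".join(texts) + "\n", encoding="utf-8")


# Hypothetical rows, invented for illustration only.
rows = [
    {"text": "Boeing 737 Max grounded\nagain", "lang": "en"},
    {"text": "RT @user: 737 Max news", "lang": "en"},
    {"text": "El 737 Max sigue en tierra", "lang": "es"},
]
print(clean_tweets(rows))
# write_one_per_line(clean_tweets(rows), "737_Max_clean.txt")  # hypothetical file name
```

Writing one tweet per line matters because the analysis option used later treats each line of the file as a separate text.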
Sentiment Analysis with SentiStrength
Download the program and the zip file SentiStrength_Data.zip from http://sentistrength.wlv.ac.uk/
Fill in the fields above with your name, email, and organization. You will be prompted to save the zip file on your computer. Save it in a new folder on your computer.
Unzip SentiStrength_Data.zip, then start SentiStrength.exe and point to the unzipped SentiStrength_Data folder.
Once SentiStrength has launched, you will notice that the most recent version is 2.3.
Explore the top menus. “Sentiment Strength Analysis” gives a list of options regarding the type of analysis that can be done. The following screenshot depicts the “Sentiment Analysis Options” which allows you to choose how you want your analysis to be done.
For this tutorial, we will be selecting “Analyse All Texts in File (each line separately)”. This is because our data file in .txt format contains all tweets in separate lines.
We can leave the default options selected.
From the “Sentiment Strength Analysis” menu, select “Analyse All Texts in File (each line separately)”. You will be prompted to choose the data file. Select the clean data file in .txt format.
SentiStrength will now analyze the data and prompt you to save a new data file containing the sentiment scores (the file name will have “+results”).
This new file is in .txt format and now has to be imported into Excel so that the analysis can be understood.
Excel has a text import wizard which works when you try to open a .txt file.
You will see a sentiment column for negative and positive. There is also a column for emotion rationale which provides the sentiment score next to each word in the tweet.
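As an alternative to the Excel import wizard, the results file could also be read with a short script. The sketch below assumes the file is tab-separated with a header row and that the two score columns are named "Positive" and "Negative". Those names and the sample rows are assumptions, so check the header of your own results file and adjust the parameters.

```python
import csv
import io


def read_scores(handle, pos_col="Positive", neg_col="Negative"):
    """Return a list of (positive, negative) integer pairs from a tab-separated file."""
    # QUOTE_NONE keeps quotation marks inside tweets from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(int(row[pos_col]), int(row[neg_col])) for row in reader]


# Hypothetical results content, invented for illustration only.
sample = (
    "Positive\tNegative\tText\n"
    "3\t-1\tLove the new cabin\n"
    "1\t-4\tGrounded again, awful\n"
)
print(read_scores(io.StringIO(sample)))

# With a real file (the name here is hypothetical):
# with open("737_Max+results.txt", encoding="utf-8", newline="") as handle:
#     scores = read_scores(handle)
```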
The final step is to visualize the overall sentiment by creating a new worksheet with the two sentiment columns. With the sentiment columns selected, click on “Insert” and then select a “Column” chart to create a chart.
You can create even better visualizations using Excel. As you can see in the above depiction, a simple column chart gives a general idea about the overall sentiment from this dataset.
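If you want a quick overview without Excel, the score columns can also be tallied in a script. The sketch below counts how often each score occurs and prints plain-text bars instead of an Excel column chart. The score pairs are invented for illustration and are not results from the 737 Max dataset.

```python
from collections import Counter


def tally(scores):
    """Count how often each positive and each negative score occurs."""
    positive = Counter(pos for pos, _ in scores)
    negative = Counter(neg for _, neg in scores)
    return positive, negative


def text_chart(counts):
    """Render one '#' bar per score value, sorted by score."""
    return "\n".join(f"{score:>3} {'#' * counts[score]}" for score in sorted(counts))


# Hypothetical (positive, negative) score pairs, invented for illustration only.
scores = [(3, -1), (1, -4), (3, -1)]
positive_counts, negative_counts = tally(scores)
print("Positive scores")
print(text_chart(positive_counts))
print("Negative scores")
print(text_chart(negative_counts))
```

Tallying the positive and negative scores separately keeps the two sentiment columns distinct, mirroring the two-column layout of the worksheet.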