Sentiment Analysis for Textual Data
Data analysis often starts with structured data that's already stored as numbers, dates, and categories. However, unstructured data can yield crucial insights if you use the appropriate techniques; for example, you might need to build an NLP-based report on free-text feedback for your CS team. In this tutorial, we'll run sentiment analysis on a textual dataset to figure out how positive or negative each phrase is, and turn the results into an interactive Datapane report.

The dataset

Let’s imagine we’re a data scientist working for a news company, trying to figure out how ‘positive’ our news headlines are in comparison to the rest of the industry.
We’ll start with the UCI News Aggregator dataset, a collection of news headlines from different publications in 2014. This is a fun dataset because it covers a wide range of publishers and contains useful metadata.
import pandas as pd

raw_data = pd.read_csv("~/uci-news-aggregator.csv")

# Convert UNIX timestamps in milliseconds since 1970 into datetimes
raw_data["TIMESTAMP"] = pd.to_datetime(raw_data["TIMESTAMP"], unit='ms')

# Add more informative category names
di = {"b": "business",
      "t": "science and technology",
      "e": "entertainment",
      "m": "health"}
raw_data.replace({"CATEGORY": di}, inplace=True)

raw_data.info()
raw_data.head()
We have 8 columns and about 400k rows. We’ll use ‘TITLE’ for the actual sentiment analysis, and group the results by ‘PUBLISHER’, ‘CATEGORY’ and ‘TIMESTAMP’.
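If you want a quick feel for those grouping columns before going further, a couple of pandas one-liners will do (an optional step, not part of the original walkthrough):

# How are the articles spread across categories and publishers?
print(raw_data["CATEGORY"].value_counts())
print(raw_data["PUBLISHER"].nunique(), "unique publishers")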

Classifying the headlines

Through the magic of open source, we can use someone else’s hard-earned knowledge in our analysis, in this case a pretrained model called VADER (Valence Aware Dictionary and sEntiment Reasoner) from the popular NLTK library.
To build the model, the authors gathered a list of common words and then asked a panel of human testers to rate each one on valence (is it positive or negative?) and intensity (how strong is the sentiment?). As the original paper says:
[After stripping out irrelevant words] this left us with just over 7,500 lexical features with validated valence scores that indicated both the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word “okay” has a positive valence of 0.9, “good” is 1.9, and “great” is 3.1, whereas “horrible” is –2.5, the frowning emoticon “:(” is –2.2, and “sucks” and “sux” are both –1.5.
To classify a piece of text, the model looks up the valence score for each word, applies some grammatical rules (e.g. distinguishing between ‘great’ and ‘not great’), and then sums up and normalizes the result.
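To see those rules in action before running the model at scale, here's a minimal sketch; the exact scores will vary with your lexicon version:

import nltk
nltk.download('vader_lexicon')  # fetch the pretrained lexicon on first run
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# 'great' on its own contributes positive valence...
print(sia.polarity_scores("The weather is great"))

# ...while the negation rule flips the compound score negative
print(sia.polarity_scores("The weather is not great"))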
Interestingly, this simple lexicon-based approach achieves accuracy equal to or better than machine-learning approaches, and is much faster. Let’s see how it works!
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()

results = [sia.polarity_scores(line) for line in raw_data.TITLE]
scores_df = pd.DataFrame.from_records(results)
df = scores_df.join(raw_data, rsuffix="_right")

df.head()
In this code we import the library, classify each title in our dataset, then join the results back onto our original dataframe. This adds four new columns:
  • pos: the proportion of the text scored as positive
  • neu: the proportion of the text scored as neutral
  • neg: the proportion of the text scored as negative
  • compound: a normalized aggregate score, ranging from -1 (most negative) to +1 (most positive); see the quick check below
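A quick way to convince ourselves of that interpretation (a hedged check, not part of the original analysis): the three proportions should sum to roughly 1 for every row, while compound stays bounded:

# pos, neu and neg are proportions, so each row sums to ~1.0
print(df[["pos", "neu", "neg"]].sum(axis=1).round(2).describe())

# compound is an independent normalized score in [-1, 1]
print(df["compound"].between(-1, 1).all())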
As a further sanity check, let’s take a look at the most positive, neutral and negative headlines in the dataset using pandas’ idxmax:
negative = df.iloc[df.neg.idxmax()]
neutral = df.iloc[df.neu.idxmax()]
positive = df.iloc[df.pos.idxmax()]

print(f'Most negative: {negative.TITLE} ({negative.PUBLISHER})')
print(f'Most neutral: {neutral.TITLE} ({neutral.PUBLISHER})')
print(f'Most positive: {positive.TITLE} ({positive.PUBLISHER})')
Running that code gives us the following result:
Most negative: I hate cancer (Las Vegas Review-Journal \(blog\))
Most neutral: Fed's Charles Plosser sees high bar for change in pace of tapering (Livemint)
Most positive: THANK HEAVENS (Daily Beast)
Fair enough: ‘THANK HEAVENS’ is a lot more positive than ‘I hate cancer’!

Visualizing the results

When we're building our report, we need great visuals.
What does the distribution of our scores look like? Let’s visualize this in a couple of ways using the interactive plotting library Altair:
import altair as alt

df["compound_trunc"] = df.compound.round(1)  # Truncate compound scores into 0.1 buckets

res = (df.groupby(["compound_trunc", "CATEGORY"])["ID"]
       .count()
       .reset_index()
       .rename(columns={"ID": "count"})
       )

hist = alt.Chart(res).mark_bar(width=15).encode(
    alt.X("compound_trunc:Q", axis=alt.Axis(title="")),
    y=alt.Y('count:Q', axis=alt.Axis(title="")),
    color=alt.Color('compound_trunc:Q', scale=alt.Scale(scheme='redyellowgreen')),
    tooltip=['compound_trunc', 'count']
)

stacked_bar = alt.Chart(res).mark_bar().encode(
    x="CATEGORY",
    y=alt.Y('count:Q', stack='normalize', axis=alt.Axis(title="", labels=False)),
    color=alt.Color('compound_trunc', scale=alt.Scale(scheme='redyellowgreen')),
    tooltip=['compound_trunc', 'CATEGORY', 'count'],
    order=alt.Order(
        # Sort the segments of the bars by this field
        'compound_trunc',
        sort='ascending'
    )
).properties(width=150)

hist
Here we’re showing both a histogram of the overall distribution and a 100% stacked bar chart grouped by category. Running that code, we get the following result:
It seems most headlines are neutral, and that health has more negative articles overall than the other categories.
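If you'd like to preview both charts side by side in a notebook before building the report, Altair's | operator concatenates them horizontally (a small optional step):

# Display the histogram and the stacked bar chart next to each other
hist | stacked_bar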
To give more insight into how our model is classifying the articles, we can create two more plots, one showing a sample of how the model classifies particular headlines, and another showing the average sentiment score for our largest publishers over time:
# Plot a random sample of 5k articles
scatter = alt.Chart(df.sample(n=5000, random_state=1)).mark_point().encode(
    alt.X("TIMESTAMP", axis=alt.Axis(title="")),
    y=alt.Y('compound', axis=alt.Axis(title="")),
    color=alt.Color('compound:Q', scale=alt.Scale(scheme='redyellowgreen')),
    tooltip=['TITLE', 'PUBLISHER', 'compound:Q', 'TIMESTAMP']
)

# Get the 10 largest publishers
largest_10 = (df.groupby(by=["PUBLISHER"])["ID"]
              .count()
              .reset_index()
              .rename(columns={"ID": "count"})
              .nlargest(10, 'count')
              )

# Truncate by 30-day periods
df["date"] = df['TIMESTAMP'].dt.floor(freq='30D')

line = alt.Chart(df[df.PUBLISHER.isin(largest_10.PUBLISHER)]).mark_line(clip=True).encode(
    alt.X("date", axis=alt.Axis(title="")),
    y=alt.Y('average(compound)', axis=alt.Axis(title=""), scale=alt.Scale(domain=(-0.15, 0.15))),
    color=alt.Color('PUBLISHER:O'),
    tooltip=['PUBLISHER', 'average(compound):Q', 'date']
)

line
This is where Altair really shines: its declarative syntax means you can change just one or two keywords to get an entirely different view of the data. Running that code gives us the following result:
By creating interactive visualizations, you enable viewers to explore the data directly. They’ll be much more likely to trust your overall conclusions if they can drill down to the original datapoints.
Looking at the publishers chart, it seems that HuffPost is consistently more negative and RTT more positive. Hmm, it seems like they have different editorial policies…
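To see just how little needs to change for a different view, here's a hedged variation on the publishers chart: swapping the aggregate from average to median and the line mark for points is all it takes (illustrative only, not part of the original report):

# Same data and encodings, different aggregate and mark
alt.Chart(df[df.PUBLISHER.isin(largest_10.PUBLISHER)]).mark_point().encode(
    alt.X("date", axis=alt.Axis(title="")),
    y=alt.Y('median(compound)', axis=alt.Axis(title="")),
    color=alt.Color('PUBLISHER:O'),
    tooltip=['PUBLISHER', 'median(compound):Q', 'date']
)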

Creating a Datapane report

The final step is to package the results into an interactive Datapane report so that others can explore and understand the data.
After logging into our Datapane account, we'll wrap our plots inside dp.Plot blocks and add some additional pages and written context:
import datapane as dp

# dp.login(token="yourtoken")

dp.Report(
    dp.Page(
        dp.Text(
            """
# Sentiment Analysis of News Headlines

This report uses a sentiment analysis model to determine the positivity/negativity of news headlines from the [UCI News Dataset](https://www.kaggle.com/uciml/news-aggregator-dataset).
"""
        ),
        dp.Group(
            dp.Plot(hist),
            dp.Plot(stacked_bar),
            columns=2
        ),
        dp.Text("""
Scores are unimodal, with over 50% of headlines classified as 'neutral'. Health appears to be the most negative news category.

## Examples and publishers

To explore individual headlines, hover over the individual scatter points below:
"""),
        dp.Plot(scatter),
        dp.Plot(line, label="Top 10 publishers average monthly sentiment"),
        dp.Text("""
Of our top 10 publishers, it looks like HuffPost is most consistently negative, and RTT Today most positive.

## Next Steps
....
"""),
        title="Charts"
    ),
    dp.Page(
        dp.DataTable(df),
        title="Selected Data"
    )
).upload(name="Distribution_of_Sentiment")
Running that code gives us the following: