Reddit Headline Analysis
Biopsying r/worldnews to see whether the news reaching the general public really is all 'negative'.
- Project At A Glance
- Dependencies
- Dataset Initialization
- Using NLTK's Sentiment Intensity Analyzer
- Generating DataFrame with Polarity Scores
- Labelling and Classification
- Examples
- Results and Visualization
- Exporting Labelled Dataset as (.csv)
Project At A Glance
Objective
: Discover the sentiment of post headlines in the 'Hot' section of r/worldnews and classify each one as Positive, Negative, or Neutral.
Data
: A 754x1 dataset of the subreddit's headlines, scraped using the Reddit API. [View Scraper Notebook] [Download]
Implementation
: Reddit API, PRAW, NLTK's Sentiment Intensity Analyzer (SIA)
Results
:
- More than half of the headlines were classified as Neutral (~55%).
- However, Negative headlines (33%) still outnumber Positive headlines (12%) by a factor of about 2.75.
- A labelled dataset was generated so that more capable models can be built on it in the future.
Deployment
: View this project on GitHub.
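As a quick check on the arithmetic in the Results summary, the sketch below recomputes the negative-to-positive ratio from the rounded percentages quoted above. The name `shares` is introduced here for illustration only, and the numbers are the rounded figures from the summary, not values recomputed from the dataset.

```python
# Rounded percentages as quoted in the Results summary (assumed inputs).
shares = {"Neutral": 55, "Negative": 33, "Positive": 12}

# The three classes should account for every headline.
total = sum(shares.values())

# How many Negative headlines there are for each Positive one.
ratio = shares["Negative"] / shares["Positive"]

print(total)  # 100
print(ratio)  # 2.75
```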
import pandas as pd
from pprint import pprint
# Load the scraped headlines; the CSV is expected to contain a 'headlines' column.
df = pd.read_csv('reddit-headlines.csv')
df.head()
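import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import matplotlib.pyplot as plt
import seaborn as sns
# VADER needs its lexicon; download it once if it is not already installed.
nltk.download('vader_lexicon')
sia = SIA()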
results = []
# Score each headline and keep its text alongside the polarity scores.
for line in df['headlines']:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)
pprint(results[:3], width=100)
df = pd.DataFrame.from_records(results)
df.sample(4)
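# Default to Neutral (0); a compound score above 0.33 is Positive (1), below -0.33 is Negative (-1).
df['label'] = 0
df.loc[df['compound'] > 0.33, 'label'] = 1
df.loc[df['compound'] < -0.33, 'label'] = -1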
df.sample(4)
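The thresholding rule above can also be written as a plain function, which makes the boundary behaviour easier to see. This is a minimal sketch, not the notebook's own code: `label_from_compound` and its `threshold` parameter are hypothetical names, and the 0.33 cut-off is taken from the `df.loc` lines above.

```python
def label_from_compound(compound, threshold=0.33):
    """Map a compound score to 1 (Positive), -1 (Negative) or 0 (Neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    # Scores exactly on a threshold are not strictly beyond it, so they stay Neutral.
    return 0


# Illustrative scores only; these are not taken from the dataset.
for score in (0.6, 0.33, 0.0, -0.33, -0.7):
    print(score, label_from_compound(score))
```

Under the same assumptions, `df['compound'].apply(label_from_compound)` should produce the same labels as the two `df.loc` assignments.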
df.label.value_counts()
df.label.value_counts(normalize=True)*100
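For readers unfamiliar with `value_counts(normalize=True)`, the sketch below reproduces the same percentage calculation using only the standard library. The `labels` list is made-up illustrative data, not the project's results, and `label_percentages` is a hypothetical helper name.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage of all labels}, like value_counts(normalize=True) * 100."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up labels: six Neutral (0), three Negative (-1), one Positive (1).
labels = [0, 0, -1, 1, 0, -1, 0, 0, -1, 0]
percentages = label_percentages(labels)
print(percentages)
```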
print('Positive Headlines:\n')
pprint(list(df[df['label'] == 1].headline)[:5], width=200)
print('\n\n Negative Headlines:\n')
pprint(list(df[df['label'] == -1].headline)[:5], width=200)
print('\n\n Neutral Headlines:\n')
pprint(list(df[df['label'] == 0].headline)[:5], width=200)
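fig, ax = plt.subplots(figsize=(8, 8))
# Sort by label so the bars are ordered -1, 0, 1 and match the tick labels below.
counts = (df.label.value_counts(normalize=True) * 100).sort_index()
sns.barplot(x=counts.index, y=counts.values, ax=ax)
# Fix the tick positions before renaming them to avoid mismatched labels.
ax.set_xticks(range(len(counts)))
ax.set_xticklabels(['Negative', 'Neutral', 'Positive'])
ax.set_ylabel('Percentage')
plt.show()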
df_export = df[['headline', 'label']]
df_export.sample(4)
df_export.to_csv('reddit-headlines-labelled.csv', encoding='utf-8', index=True)
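Because the export uses `index=True`, pandas writes the DataFrame index as an unnamed first column. Below is a minimal sketch of reading such a file back with the standard `csv` module. The two rows are invented placeholders standing in for the real file's contents, and `sample_csv` is a hypothetical name.

```python
import csv
import io

# Placeholder text in the layout pandas produces with index=True:
# the header's first field is empty because the index has no name.
sample_csv = ",headline,label\n0,Example headline A,-1\n1,Example headline B,0\n"

reader = csv.DictReader(io.StringIO(sample_csv))

# Keep only the two exported columns and restore the label to an integer.
rows = [
    {"headline": row["headline"], "label": int(row["label"])}
    for row in reader
]
print(rows)
```

Passing `index=False` to `to_csv` would omit that first column if it is not needed downstream.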