Reddit Data Analysis: Insights from Machine Learning Models

Visesh Agarwal - Aug 21 - Dev Community

Introduction

In the age of social media, Reddit stands out as a unique platform where users engage in discussions across a wide range of topics. This article presents an in-depth analysis of Reddit comments from various subreddits related to data science, programming, and technology. We'll explore the sentiment, emotions, and content of these comments using several machine learning techniques, including sentiment analysis, topic modeling, and text classification.

Data Collection and Preprocessing

Our analysis begins with data collection from eight subreddits: Python, DataScience, MachineLearning, DataAnalysis, DataMining, Data, DataSets, and DataCenter. We used Async PRAW, the asynchronous variant of the PRAW (Python Reddit API Wrapper) library, to scrape comments from these subreddits.

Here's a snippet of the code used for data collection:

# Fetch recent comments from one subreddit; relies on an authenticated
# Async PRAW client being available as a module-level `reddit` instance
async def get_comments(subreddit_name, num_comments=2000):
    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    async for comment in subreddit.comments(limit=num_comments):
        comments.append({
            "subreddit": subreddit_name,
            "comment_body": comment.body,
            "upvotes": comment.score,
        })
    return comments
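The snippet references a module-level `reddit` client that isn't shown here. A minimal sketch of how it might be wired up and run for all eight subreddits is below; the credential placeholders, the SUBREDDITS list, and the main helper are illustrative assumptions rather than code from the original project.

import asyncio

import asyncpraw
import pandas as pd

SUBREDDITS = ["Python", "DataScience", "MachineLearning", "DataAnalysis",
              "DataMining", "Data", "DataSets", "DataCenter"]

async def main():
    global reddit  # get_comments() above looks up a module-level `reddit`
    reddit = asyncpraw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder credentials
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-comment-analysis by u/your_username",
    )
    try:
        results = await asyncio.gather(*(get_comments(name) for name in SUBREDDITS))
    finally:
        await reddit.close()
    # Flatten the per-subreddit lists into one DataFrame
    return pd.DataFrame([c for sub in results for c in sub])

df = asyncio.run(main())

Running the eight fetches concurrently with asyncio.gather is the main benefit of using the async client here.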

After collecting the data, we performed several preprocessing steps to clean and prepare the text for analysis:

  1. Removing missing values and duplicates
  2. Filtering out comments with fewer than three words
  3. Tokenizing the text
  4. Removing special characters and words with digits
  5. Converting to lowercase
  6. Removing stopwords
  7. Lemmatizing the words

Here's a snippet of the preprocessing function:

import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def clean_text(text):
    # Tokenize and strip non-alphanumeric characters from each token
    text = word_tokenize(text)
    text = [re.sub(r"[^a-zA-Z0-9]+", ' ', word) for word in text]
    # Drop tokens containing digits and lowercase the rest
    text = [word for word in text if not any(c.isdigit() for c in word)]
    text = [word.lower() for word in text]
    # Remove stopwords and lemmatize
    text = [word for word in text if word not in stopwords.words('english')]
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word) for word in text]
    text = ' '.join(text)
    text = re.sub(r'[^\w\s]', '', text)
    # Drop URL fragments and Reddit-specific noise words
    words = ['http', 'com', 'www', 'reddit', 'comment', 'comments', 'https', 'org', 'jpg', 'png', 'gif', 'jpeg']
    text = ' '.join(word for word in text.split() if word not in words)
    return text
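The function above covers steps 3 through 7. Steps 1 and 2, and the creation of the cleaned_comment column used in the rest of the analysis, might look roughly like this; the column names are assumed from the collection snippet, so treat it as a sketch rather than the project's exact code.

# Steps 1-2: drop missing values, duplicates, and comments with fewer than three words
df = df.dropna(subset=["comment_body"]).drop_duplicates(subset=["comment_body"])
df = df[df["comment_body"].str.split().str.len() >= 3]

# Steps 3-7: apply the cleaning function to every comment
df["cleaned_comment"] = df["comment_body"].apply(clean_text)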

Exploratory Data Analysis

Comment Distribution Across Subreddits

We first examined the distribution of comments across the different subreddits:

print(df['subreddit'].value_counts())

The results showed:

DataCenter        934
Python            859
DataMining        847
Data              803
DataScience       763
DataSets          763
MachineLearning   733
DataAnalysis      645

This distribution gives us insight into the relative activity levels of these subreddits during our data collection period.

Word Cloud Visualization

To get a quick overview of the most frequent words in our dataset, we created a word cloud:

text = " ".join(comment for comment in df.cleaned_comment)
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

WordCloud Image

The word cloud highlights the most frequent terms across all comments, giving us a visual representation of the dominant topics and terms in our dataset.

Sentiment Analysis

We performed sentiment analysis using the TextBlob library to understand the overall sentiment of the comments:

df['polarity'] = df['cleaned_comment'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['sentiment'] = df['polarity'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')

We then visualized the sentiment distribution:

plt.figure(figsize=(10, 6))
sns.histplot(df['polarity'], kde=True)
plt.title('Sentiment Distribution')
plt.xlabel('Polarity')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(10, 6))
df['sentiment'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Sentiment Distribution')
plt.ylabel('')
plt.show()

Sentiment Polarity Analysis

Sentiment Distribution

The sentiment analysis revealed that the majority of comments had a neutral to slightly positive sentiment. This suggests that discussions in these tech-related subreddits tend to be more informative and objective rather than highly emotional.

Topic Modeling

To uncover the main topics discussed across these subreddits, we employed Latent Dirichlet Allocation (LDA) for topic modeling:

lda_model = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=5, id2word=dictionary, passes=15
)

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
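The LDA snippet assumes that a dictionary and a bag-of-words corpus have already been built from the cleaned comments. A minimal sketch of that preparation with Gensim could look like the following; the tokenization step and the filtering thresholds are assumptions about the upstream pipeline, not the author's exact settings.

from gensim import corpora

# Tokenize the cleaned comments
tokenized_docs = [comment.split() for comment in df["cleaned_comment"]]

# Map each unique token to an integer id, dropping very rare and very common terms
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Represent each document as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]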

The LDA model identified five main topics:

  1. General discussion and etiquette (keywords: post, please, r, message, thank)
  2. Data center infrastructure (keywords: cooling, power, rack, ups, system)
  3. Data analysis and tools (keywords: data, n, like, would, get)
  4. Data science applications (keywords: data, n, use, would, need)
  5. Data center operations (keywords: data, power, center, get, like)

These topics provide insight into the main areas of discussion across the analyzed subreddits, ranging from technical discussions about data center operations to more general data science and analysis topics.

Emotion Analysis

To gain a deeper understanding of the emotional content of the comments, we performed emotion analysis using the NRCLex library:

df["emotions"] = df["cleaned_comment"].apply(analyze_emotions)
emotion_df = df["emotions"].apply(pd.Series).fillna(0)
df = pd.concat([df, emotion_df], axis=1)

emotion_totals = emotion_df.sum().sort_values(ascending=False)
plt.figure(figsize=(12, 8))
sns.barplot(x=emotion_totals.index, y=emotion_totals.values, palette="viridis")
plt.title("Total Emotion Counts in Reddit Comments")
plt.xlabel("Emotion")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
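The analyze_emotions helper used above isn't shown in the article. A plausible minimal implementation with NRCLex, returning a dictionary of emotion counts that apply(pd.Series) can expand into columns, might be:

from nrclex import NRCLex

def analyze_emotions(text):
    # Score the text against the NRC emotion lexicon;
    # raw_emotion_scores maps affect names (trust, joy, fear, ...) to counts
    return NRCLex(text).raw_emotion_scores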

Emotional Analysis

The emotion analysis revealed that the most prevalent emotions in the comments were:

  1. Trust
  2. Anticipation
  3. Joy
  4. Fear
  5. Sadness

This distribution suggests that while the overall sentiment tends to be neutral or slightly positive, there's a complex emotional landscape in these tech-related discussions. The high levels of trust and anticipation might indicate a generally optimistic and collaborative atmosphere in these communities.

Named Entity Recognition

To identify key entities mentioned in the comments, we performed Named Entity Recognition (NER) using the spaCy library:

import spacy

# Load spaCy's small English pipeline (the specific model is assumed here)
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

df["entities"] = df["cleaned_comment"].apply(extract_entities)

# Flatten the per-comment entity lists into one (entity, label) table
all_entities = [ent for ents in df["entities"] for ent in ents]
entities_df = pd.DataFrame(all_entities, columns=["Entity", "Label"])
label_counts = entities_df["Label"].value_counts()

plt.figure(figsize=(12, 6))
label_counts.plot(kind="bar", color="skyblue")
plt.title("Distribution of Entity Labels")
plt.xlabel("Entity Label")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

NER Image

The NER analysis highlighted the most common types of entities mentioned in the comments, which included:

  1. Organizations (ORG)
  2. People (PERSON)
  3. Products (PRODUCT)
  4. Locations (GPE)

This distribution gives us insight into the types of entities that are frequently discussed in these tech-related subreddits, with a focus on organizations and people involved in the field.

Text Classification Models

To predict the subreddit of a given comment, we implemented and compared several machine learning models:

  1. Support Vector Machine (SVM)
  2. Logistic Regression
  3. Random Forest
  4. K-Nearest Neighbors (KNN)
  5. Long Short-Term Memory (LSTM) neural network
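As a rough sketch of how one of these classifiers could be trained, here is a TF-IDF plus Logistic Regression pipeline in scikit-learn; the split ratio, vectorizer settings, and column names are assumptions rather than the exact configuration used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned_comment"], df["subreddit"],
    test_size=0.2, random_state=42, stratify=df["subreddit"],
)

# TF-IDF turns each comment into a sparse term-weight vector,
# which the linear model then maps to one of the eight subreddits
clf = make_pipeline(
    TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

Swapping LogisticRegression for LinearSVC, RandomForestClassifier, or KNeighborsClassifier would give the other classical baselines in the comparison.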

Here's a summary of the performance metrics for each model:

Model                 Accuracy  Precision  Recall    F1 Score  ROC AUC
SVM                   0.523622  0.533483   0.523622  0.525747  0.853577
Logistic Regression   0.541732  0.545137   0.541732  0.536176  0.857676
Random Forest         0.485039  0.488424   0.485039  0.477831  0.819124
KNN                   0.230709  0.326545   0.230709  0.151028  0.720319
LSTM                  0.483302  0.483103   0.483302  0.478265  NaN

We visualized the performance of these models:

# `models`, `accuracies`, `precisions`, `recalls`, `f1_scores`, and
# `roc_auc_scores` are parallel lists collected from the evaluation above
plt.figure(figsize=(12, 6))
plt.plot(models, accuracies, marker="o", label="Accuracy")
plt.plot(models, precisions, marker=".", label="Precision")
plt.plot(models, recalls, marker=".", label="Recall")
plt.plot(models, f1_scores, marker=".", label="F1 Score")
plt.plot(models, roc_auc_scores, marker=".", label="ROC AUC")
plt.title("Model Comparison")
plt.xlabel("Model")
plt.ylabel("Score")
plt.legend()
plt.xticks(rotation=45)
plt.show()

Model Comparison

Model Performance Analysis

  1. Logistic Regression performed the best overall, with the highest accuracy (54.17%), precision (54.51%), recall (54.17%), and F1 score (53.62%). It also had the highest ROC AUC score (0.8577), indicating good discrimination ability.

  2. SVM was a close second, with performance metrics very similar to Logistic Regression. This suggests that both linear models (Logistic Regression and SVM) are well-suited for this text classification task.

  3. Random Forest performed slightly worse than the linear models but still achieved reasonable results. Its lower performance might indicate that the decision tree-based approach is less effective for capturing the nuances in the text data compared to linear models.

  4. The LSTM model showed comparable performance to Random Forest in terms of accuracy, precision, recall, and F1 score. However, we couldn't calculate its ROC AUC score due to limitations in the implementation; one way to compute it from the model's probability outputs is sketched after this list.

  5. KNN performed significantly worse than the other models across all metrics. This poor performance suggests that the nearest neighbor approach might not be suitable for high-dimensional text data.
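For reference, a one-vs-rest ROC AUC can be computed from class-probability outputs with scikit-learn. In the sketch below, y_proba and classes are placeholders for the LSTM's softmax predictions and the corresponding label order, which the original notebook would need to supply.

from sklearn.metrics import roc_auc_score

# y_test: true subreddit labels; y_proba: predicted probabilities with
# shape (n_samples, n_classes), columns ordered like `classes`
lstm_roc_auc = roc_auc_score(y_test, y_proba, multi_class="ovr", labels=classes)
print(f"LSTM ROC AUC (one-vs-rest): {lstm_roc_auc:.4f}")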

The relatively close performance of different models (except KNN) suggests that the task of predicting subreddits based on comment content is challenging. This could be due to overlapping topics across different subreddits or the presence of general discussion that isn't specific to any particular subreddit.

Conclusion

Our analysis of Reddit comments from tech-related subreddits has provided valuable insights into the nature of discussions in these online communities:

  1. Sentiment and Emotions: The overall sentiment tends to be neutral to slightly positive, with trust and anticipation being the dominant emotions. This suggests a generally constructive and forward-looking atmosphere in these tech-focused discussions.

  2. Topics: The main topics identified through LDA include general discussion etiquette, data center infrastructure, data analysis tools, data science applications, and data center operations. This diverse range of topics reflects the broad scope of discussions in these tech-related subreddits.

  3. Entities: Organizations and people are the most frequently mentioned entities, highlighting the importance of industry players and thought leaders in these discussions.

  4. Text Classification: While our models achieved moderate success in predicting subreddits based on comment content, the task proved challenging. Logistic Regression and SVM performed best, suggesting that linear models are well-suited for this type of text classification task.

These findings can help community managers, data scientists, and researchers understand the dynamics of tech-related discussions on Reddit. Future work could explore more advanced natural language processing techniques, such as transformer-based models like BERT, to potentially improve classification performance and extract even more nuanced insights from the text data.

GitHub repo with all the code and detailed analysis: Github Repo Link
