Name: ✈️ Model Comparison for Sentiment Analysis Using the Airline Tweet Dataset
Author: saivishwa

In today's data-driven world, sentiment analysis has become a powerful tool for understanding public opinion. Whether it's gauging customer satisfaction or monitoring brand reputation, the ability to analyze textual data at scale is invaluable. In this blog post, we'll walk through a model comparison using the airline tweet dataset, showcasing how different machine learning models perform on the task of sentiment analysis.

📊 Dataset Overview

The airline tweet dataset consists of tweets directed at various airlines. Each tweet is labeled with one of three sentiments: positive, neutral, or negative. This makes it an ideal dataset for sentiment classification tasks. The goal is to build a model that can accurately classify the sentiment of new, unseen tweets.
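Before modeling, it helps to check how the three sentiment classes are distributed, since a skewed distribution affects both training and how accuracy should be read. The snippet below is a minimal sketch using made-up toy labels (not the real dataset counts) and a hypothetical helper named class_distribution. Once the real data is loaded into a DataFrame, pandas' value_counts(normalize=True) on the label column would give the same kind of summary.

```python
from collections import Counter

def class_distribution(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only; these are NOT the real dataset counts
toy_labels = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1

dist = class_distribution(toy_labels)
for label, share in sorted(dist.items()):
    print(f"{label}: {share:.0%}")
```

If one class dominates, a model can reach a high accuracy simply by favoring that class, which is why the per-class metrics reported later matter.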

🛠️ Importing the Required Libraries

Before diving into the analysis, let's import the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
nltk.download('stopwords')

🧹 Data Loading and Preprocessing

The first step is to load the dataset and perform some basic preprocessing, such as cleaning the text and converting the labels into a format suitable for modeling.

# Load the dataset
data = pd.read_csv("AirlineTwitterData.csv", encoding = "ISO-8859-1")

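# Keep only the tweet text and its sentiment label
df = data[['text', 'airline_sentiment']].copy()

df.rename(columns = {"text" : "tweet", "airline_sentiment" : "sentiment"}, inplace = True)

# Encode the string labels as integers (LabelEncoder sorts the classes alphabetically)
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])

# Keep the label-to-integer mapping so predictions can be read back as sentiments
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))

stop_words = set(stopwords.words('english'))

# Remove English stop words from each tweet (matching is case-insensitive)
df['clean_tweet'] = df['tweet'].apply(lambda x : ' '.join([word for word in x.split() if word.lower() not in stop_words]))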

def clean_text(text):
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)  # Remove mentions, hashtags, and URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)  # Remove non-alphanumeric characters
    return text.strip().lower()  # Strip whitespace and convert to lowercase

# Apply the cleaning function to the DataFrame
df['clean_tweet'] = df['clean_tweet'].apply(clean_text)

df.drop('tweet', axis = 1, inplace = True)
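
To see what the two preprocessing steps above do to a single tweet, here is a small self-contained sketch. The tweet is made up, and DEMO_STOP_WORDS is a tiny stand-in for NLTK's English stop word list so the example runs without downloads; remove_stop_words is a hypothetical helper that mirrors the lambda used above. The clean_text function is copied from the code above.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop word list
DEMO_STOP_WORDS = {"for", "the", "a", "to"}

def remove_stop_words(text, stop_words):
    """Drop whitespace-separated words that appear in stop_words (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def clean_text(text):
    # Remove mentions, hashtags, and URLs
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)
    # Remove non-alphanumeric characters
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)
    # Strip whitespace and convert to lowercase
    return text.strip().lower()

# Made-up tweet for illustration
raw = "@united Thanks for the great flight! #happy https://t.co/abc123"
cleaned = clean_text(remove_stop_words(raw, DEMO_STOP_WORDS))
print(cleaned)  # thanks great flight
```

Note that the hashtag text itself is discarded along with the "#" symbol, so a sentiment-bearing tag such as "#happy" does not reach the model.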

🧠 Feature Extraction

To feed the text data into our machine learning models, we need to convert it into numerical features. We'll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for this purpose.
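
To make the TF-IDF weighting concrete, the sketch below computes it by hand for a made-up three-document corpus. It assumes the smoothed formula that scikit-learn's TfidfVectorizer uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. It handles single words only, whereas the vectorizer configured below also counts two- and three-word phrases. The function name tfidf is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({term: v / norm for term, v in raw.items()})
    return rows

# Made-up corpus for illustration
docs = ["flight delayed again", "great flight", "lost bag again"]
weights = tfidf(docs)
for doc, row in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the first document, "delayed" appears in only one document of the corpus, so it receives a higher weight than "flight" or "again", which each appear in two.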

Tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X = Tfidf.fit_transform(df.clean_tweet).toarray()
y = df['sentiment']

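✂️ Train-Test Split

We'll split the dataset into training and testing sets to evaluate our models. SMOTE is applied after the split, and only to the training set, so the test set contains no synthetic samples and stays a fair measure of performance on real tweets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance by oversampling the training set with SMOTE
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)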
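
For intuition about what SMOTE does when it handles class imbalance, the sketch below shows only its core interpolation step: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its minority-class neighbors. This is a simplified illustration, not the imbalanced-learn implementation; the real algorithm also searches for the k nearest neighbors, while here the neighbor is simply given. The function name smote_like_sample and the feature vectors are made up.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point between x and one of its neighbors."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)

# Two made-up minority-class feature vectors
minority_a = [0.0, 0.2, 0.9]
minority_b = [0.4, 0.0, 0.5]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print([round(v, 3) for v in synthetic])
```

Because synthetic points are derived from existing samples, generating them before splitting would let information from test tweets influence the training data, so oversampling is normally restricted to the training portion.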

🆚 Model Comparison

We'll compare the performance of several machine learning models:

Logistic Regression
Naive Bayes
Random Forest
K-Nearest Neighbors
Decision Tree
XGBoost

Each of these models has its strengths and weaknesses, making it crucial to evaluate them on the same dataset.
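
One way to keep the comparison fair is to run every model through the same fit, predict, and score routine. The sketch below is a minimal, hypothetical harness: compare_models accepts any objects exposing fit and predict methods, which includes the scikit-learn and XGBoost estimators used in the sections that follow. MajorityBaseline is an illustrative stand-in (not part of the original project) that always predicts the most frequent training label, and the toy data is made up.

```python
from collections import Counter

class MajorityBaseline:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model on the same split and return accuracies, best first."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = accuracy(y_test, model.predict(X_test))
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))

# Made-up toy data for illustration
toy_X_train = [[0.1], [0.2], [0.3], [0.9]]
toy_y_train = [0, 0, 0, 2]
toy_X_test = [[0.15], [0.8], [0.25], [0.5]]
toy_y_test = [0, 2, 0, 1]

results = compare_models(
    {"Majority baseline": MajorityBaseline()},
    toy_X_train, toy_y_train, toy_X_test, toy_y_test,
)
print(results)  # {'Majority baseline': 0.5}
```

Including a trivial baseline like this gives a floor for the comparison: a model that cannot beat it has learned little beyond the class frequencies.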

Logistic Regression

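# Raise max_iter above the default of 100 to reduce the chance of a convergence warning
logreg = LogisticRegression(max_iter=1000)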
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

Naive Bayes

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))

Random Forest

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

K-Nearest Neighbors

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("K-Nearest Neighbors Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

Decision Tree

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

XGBoost

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))

📈 Results and Discussion

After running each model, we can compare their performance based on accuracy, precision, recall, and F1-score. Tree ensembles such as XGBoost and Random Forest often outperform simpler models like Naive Bayes, but this is not guaranteed and depends on the dataset and the specific task. Because the sentiment classes are imbalanced, it is worth comparing per-class recall and F1-score alongside overall accuracy. The accuracy comparison graph can be seen below.
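
To show how the metrics named above relate to each other, the sketch below computes per-class precision, recall, and F1-score by hand, along with accuracy and the macro-averaged F1. The labels and predictions are made up for illustration and are not results from any of the models; the function name per_class_metrics is hypothetical. On real predictions, classification_report reports the same quantities.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    metrics = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Made-up labels and predictions (0 = negative, 1 = neutral, 2 = positive)
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 0, 2, 0]

metrics = per_class_metrics(y_true_demo, y_pred_demo)
accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for label, (precision, recall, f1) in metrics.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy_demo:.3f} macro_f1={macro_f1:.3f}")
```

In this toy case the accuracy is 0.625, yet only half of the class 1 and class 2 tweets are recovered, which is the kind of gap that overall accuracy alone can hide.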

The complete Airline Twitter Sentiment Analysis project is available in my GitHub repo: https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis

The app is also live: https://airline-twittersentimentanalysis-1.onrender.com/