Building a Fraud Detection System in Python with Machine Learning

Agustin Bereciartua - Oct 11 - Dev Community

Hello everyone! Today, I'd like to share a step-by-step guide on how to build a simple fraud detection system using Python and machine learning. We'll use pandas, scikit-learn, and imbalanced-learn to train a classifier that flags suspicious patterns in financial transactions.

Introduction

Financial institutions are constantly battling fraud in transactions. Traditional methods often fall short due to the sheer volume and complexity of data. Machine learning offers a promising solution by automatically detecting unusual patterns that may indicate fraudulent activity.

In this post, we'll:

  • Prepare and clean financial transaction data.
  • Handle imbalanced datasets using techniques like oversampling.
  • Implement a machine learning model for fraud detection.
  • Evaluate and validate the model using appropriate metrics.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python 3.7 or higher
  • pandas
  • scikit-learn
  • imbalanced-learn
  • matplotlib and seaborn (for data visualization)

You can install the required libraries using pip:

pip install pandas scikit-learn imbalanced-learn matplotlib seaborn

Step 1: Data Preparation

For this tutorial, we'll use the Credit Card Fraud Detection dataset from Kaggle. It contains credit card transactions made by European cardholders in September 2013. Most features (V1 to V28) are anonymized PCA components; Time and Amount are the only untransformed ones.

Let's start by loading the data:

import pandas as pd

# Load the dataset
df: pd.DataFrame = pd.read_csv('creditcard.csv')

Exploring the Data

print(df.head())
df.info()  # prints its summary directly and returns None
print(df['Class'].value_counts())

  • The dataset has 284,807 transactions.
  • The 'Class' column is the target variable (0 for legitimate, 1 for fraud).
  • The dataset is highly imbalanced: only 492 transactions (about 0.17%) are fraudulent.
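
The imbalance matters for evaluation as well as for training. As a rough illustration, using the class counts from the Kaggle dataset description (492 frauds among 284,807 transactions), a hypothetical baseline that labels every transaction as legitimate already reaches very high accuracy while catching no fraud at all:

```python
# Class counts from the Kaggle dataset description
total_transactions = 284_807
fraud_transactions = 492
legit_transactions = total_transactions - fraud_transactions

# Hypothetical baseline: predict "legitimate" (class 0) for every transaction
baseline_accuracy = legit_transactions / total_transactions
baseline_recall = 0 / fraud_transactions  # no fraud is ever flagged

print(f"Fraud share:       {fraud_transactions / total_transactions:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

This is why the rest of the post leans on precision and recall for the fraud class rather than on accuracy.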

Step 2: Splitting the Data

We'll split the data into training and testing sets before any resampling. Oversampling first would let synthetic points derived from test transactions leak into the training set and inflate the scores. Passing stratify=y keeps the fraud ratio the same in both sets.

from sklearn.model_selection import train_test_split

# Separate features and target
X: pd.DataFrame = df.drop('Class', axis=1)
y: pd.Series = df['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Step 3: Handling Imbalanced Data

Imbalanced data can bias the model towards predicting the majority class. We'll use the Synthetic Minority Over-sampling Technique (SMOTE) to address this, applying it to the training set only so the test set keeps the real class distribution.

from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
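
To make SMOTE less of a black box, here is a minimal pure-Python sketch of its core idea: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The function names and the toy points are made up for illustration, and this is a simplification rather than imbalanced-learn's actual implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two hypothetical features
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_samples(fraud_points, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point is a blend of two real minority points, it always falls inside the region the real fraud cases already occupy. That is also why resampling must never see the test data: a synthetic training point built from a test transaction carries that transaction's information with it.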

Step 4: Building the Model

We'll use a Random Forest Classifier for this task.

from sklearn.ensemble import RandomForestClassifier

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

Step 5: Evaluating the Model

We'll evaluate the model using precision, recall, and F1-score. The report also includes accuracy, but accuracy alone can be misleading when one class is rare.

from sklearn.metrics import classification_report, confusion_matrix

# Make predictions
y_pred = model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt

conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
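
If you want the four cells of the confusion matrix as plain numbers rather than a heatmap, scikit-learn's binary confusion matrix can be unpacked with ravel(). The labels below are hypothetical and only serve to show the layout:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = legitimate, 1 = fraud
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes,
# so ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"True negatives  (legitimate, passed):  {tn}")
print(f"False positives (legitimate, flagged): {fp}")
print(f"False negatives (fraud, missed):       {fn}")
print(f"True positives  (fraud, caught):       {tp}")
```

In fraud detection the false negatives are usually the costly cell, since each one is a fraudulent transaction that went through.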

Step 6: Interpreting the Results

The classification report provides insight into how well our model is performing:

  • Precision: The proportion of positive identifications that were actually correct.
  • Recall: The proportion of actual positives that were identified correctly.
  • F1-Score: The harmonic mean of precision and recall.
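
These three definitions reduce to a few lines of arithmetic on the confusion matrix counts. The helper below is a small sketch with made-up counts, not part of scikit-learn:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 frauds caught, 20 false alarms, 10 frauds missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

With these counts, precision is 0.800, recall is 0.889, and F1 is 0.842. Note that true negatives never appear in the formulas, which is what makes these metrics informative even when legitimate transactions vastly outnumber fraud.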

Conclusion

By following these steps, we've built a basic fraud detection system using machine learning. While this is a simplified example, it serves as a foundation for more complex models.


Full Code

Here's the complete code for reference:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df: pd.DataFrame = pd.read_csv('creditcard.csv')

# Separate features and target
X: pd.DataFrame = df.drop('Class', axis=1)
y: pd.Series = df['Class']

# Split first so the test set contains only real transactions
X_train: pd.DataFrame
X_test: pd.DataFrame
y_train: pd.Series
y_test: pd.Series
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply SMOTE to the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Initialize and train the model
model: RandomForestClassifier = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions (predict returns a NumPy array of class labels)
y_pred = model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix (a 2x2 NumPy array: rows are actual, columns are predicted)
conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

Next Steps

To improve the model:

  • Experiment with different algorithms like XGBoost or Neural Networks.
  • Perform feature engineering to create more informative features, and feature selection to keep the most relevant ones.
  • Use cross-validation for a more robust evaluation.

Feel free to ask questions or share your thoughts. Let's learn together!
