If you are new to machine learning or have just started, you have come to the right place!
Today, we will build a complete machine learning project from start to finish. In this project, you will have the opportunity to work with the following technologies and tools:
- Where we build our model: Google Colab
- Data Preprocessing: NumPy, pandas
- ML Model Creation: scikit-learn
- Validation and Testing of the ML Model: Deepchecks
Now we have the items in our toolbox, but we need a problem statement to show how we will use them to create something wonderful.
So, let's put our technologies to work on Online Payment Fraud Detection.
Now we know what we're going to use and what our final outcome will be. So, let's begin:
Step 1: Import the libraries we use in this project
import pandas as pd
import numpy as np
Step 2: Load Data
In this project, we use a real dataset from Kaggle. The dataset is available for download at this link:
Online Payments Fraud Detection
Once you have downloaded the dataset, rename it if you want and upload it to Google Colab. It is then ready to load:
df = pd.read_csv('payment_fraud_detection.csv')
df.head()
df.head() shows us the first 5 rows of the CSV file.
Step 3: Get familiar with features
Let's explore the features:
step: represents a unit of time where 1 step equals 1 hour
type: type of online transaction
amount: the amount of the transaction
nameOrig: customer starting the transaction
oldbalanceOrg: the customer's balance before the transaction
newbalanceOrig: the customer's balance after the transaction
nameDest: recipient of the transaction
oldbalanceDest: initial balance of recipient before the transaction
newbalanceDest: the new balance of recipient after the transaction
isFraud: whether the transaction is fraudulent
Step 4: Data Cleaning
There are numerous steps involved in the data cleansing process, but we will focus on the most crucial ones here, which are as follows:
- Eliminating null data
- Eliminating columns that don't help determine whether a payment is fraudulent.
Let's check whether there is any null data using:
df.isnull().sum()
The output shows that the dataset contains some null data.
We can handle this in 2 ways:
- Drop those rows from the dataset (not preferable if more than about 1% of the data is null).
- Replace nulls with the mean or median for numerical data, and with the mode for categorical data.
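To make the two options concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are hypothetical, not taken from the Kaggle dataset). It first measures the percentage of nulls per column, then shows dropping versus imputing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the Kaggle dataset)
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0, 500.0],
    "type": ["PAYMENT", "TRANSFER", None, "PAYMENT"],
})

# Percentage of missing values per column
null_pct = toy.isnull().mean() * 100
print(null_pct)

# Option 1: drop every row that contains a null
dropped = toy.dropna()

# Option 2: impute instead of dropping
filled = toy.copy()
# Numerical column: fill with the median
filled["amount"] = filled["amount"].fillna(filled["amount"].median())
# Categorical column: fill with the most frequent value (mode)
filled["type"] = filled["type"].fillna(filled["type"].mode()[0])

print(dropped)
print(filled)
```

Checking the percentage first is what tells you which option is reasonable: dropping is cheap when very little data is affected, while imputing keeps more rows at the cost of introducing estimated values.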
Here, isFraud is our output variable, which tells us whether a payment is fraudulent, so we can't fill it with an average. The affected rows also make up less than 1% of the data, so we can drop them using:
df = df.dropna(subset=['newbalanceDest', 'isFraud', 'isFlaggedFraud'])
Now let's check which columns we can remove without affecting our results.
On careful consideration, we found that the fields 'isFlaggedFraud', 'nameOrig', and 'nameDest' don't help determine whether a payment is fraudulent. So let's remove them:
df.drop(['isFlaggedFraud','nameOrig','nameDest'], axis = 1, inplace = True)
After running that code, let's check the head of our dataframe (df) again using df.head().
Step 5: Convert Categorical Data into Numerical Data
After careful observation, we found that type is the only non-numerical column. So let's check how many distinct values it has:
df["type"].value_counts()
The output shows that it has 5 unique values.
Let's encode this column as numerical data:
data = pd.get_dummies(df, columns=['type'], drop_first=True)
data.head()
In the modified data, the type column is replaced by one indicator column per remaining transaction type.
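As a small illustration of what get_dummies with drop_first=True does, here is a sketch on made-up data (the `toy` frame is hypothetical). The `dtype=int` argument is my addition so the indicators show up as 0/1 rather than True/False; the tutorial's own call does not use it.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "DEBIT", "CASH_OUT"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# One indicator column per category; drop_first=True removes the first
# category (alphabetically), which is then implied by all-zero indicators.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping the first category avoids a redundant column: a row with zeros in every indicator column belongs to the dropped category (CASH_IN in this toy example).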
Step 6: Split Features and Target Values
Features: the input variables.
Target: the output variable.
In supervised machine learning, both the input and the output variables must be supplied to the model so that it can learn from the dataset and predict future values.
From the dataset, we can clearly identify that isFraud is the target variable and the remaining columns are features.
So let's split them:
X = data.loc[:, data.columns.difference(['isFraud'])].values
y = data.loc[:,"isFraud"].values
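One detail worth knowing about the split above: columns.difference returns the remaining column names in sorted order, so the columns of X may not follow the original column order. A minimal sketch on a made-up frame (the `toy` data is hypothetical) shows this:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "amount": [10.0, 20.0, 30.0],
    "isFraud": [0, 1, 0],
})

# Every column except the target; note the result is sorted alphabetically
feature_cols = toy.columns.difference(["isFraud"])
print(list(feature_cols))

X_toy = toy.loc[:, feature_cols].values  # 2-D array of features
y_toy = toy.loc[:, "isFraud"].values     # 1-D array of labels

print(X_toy.shape, y_toy.shape)
```

This does not change what the model learns, but it matters if you later map feature importances or array columns back to column names: use feature_cols, not the original column order.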
Step 6 (continued): Split Data into Training and Test Sets
We divide the data into training and test sets so that we can train our machine learning model on the training data and then use the test data to check how well the trained model performs on data it has not seen.
This can be done easily using scikit-learn:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.3, random_state = 42)
Here, test_size is the proportion of the dataset we want as test data. I want 30% of the entire dataset as test data, so train_test_split will randomly select 30% of the rows for the test set.
random_state fixes the seed of that random selection. The data is still split randomly, but every time we run the project or share it with someone, we get the same training and test sets.
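To see the effect of random_state, here is a minimal sketch on made-up arrays (X_toy and y_toy are hypothetical). It runs the same split twice with the same seed and checks that both runs produce identical sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Two independent calls with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Every returned array (X_train, X_test, y_train, y_test) matches
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(len(X_train_toy), len(X_test_toy))
```

With 10 samples and test_size=0.3, 3 rows go to the test set and 7 to the training set.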
Step 7: Training ML Model
This is the main part of our project, where we actually build our ML model.
From the target variable, we can see that the result will be either fraud or notFraud, so we apply a classification algorithm.
There are several classification algorithms available, but here we will use one model: RandomForestClassifier.
NOTE: As an assignment, you can try out different algorithms and pick the best one.
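For the assignment in the note, one possible way to compare algorithms is sketched below. It uses synthetic data from make_classification rather than the fraud dataset, and the choice of candidate models and their settings are my assumptions, not recommendations from this tutorial:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the fraud dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Candidate classifiers to compare (an illustrative selection)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model and record its accuracy on the held-out test set
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)

best_name = max(scores, key=scores.get)
print(scores)
print("Best on this data:", best_name)
```

Accuracy is used here only because it is the default of score(); which metric is most appropriate for your own experiments is a separate decision.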
Now let's train the RandomForestClassifier using scikit-learn:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
Our model is now ready for evaluation.
Step 8: Model Evaluation
This is a critical step, where we determine whether our model is ready for use or still needs some adjustments.
In the past, evaluating a model required writing a ton of additional code. The Deepchecks platform offers a more modern solution.
With Deepchecks, you can thoroughly validate your data and models from research to production using a single open-source solution for your AI and ML validation needs.
The Deepchecks features that businesses value most are listed below:
Evaluation of Data Quality:
- Finding data that is inconsistent or missing.
- Finding abnormalities and outliers in the dataset.
Validation of the Model:
- Examining the fairness and bias of the model.
- Analysing the model's performance using several metrics.
- Ensuring the model's stability and ability to adapt well to fresh data.
Interpretability and Explainability:
- Supplying clarification on model predictions to improve comprehension.
- Displaying the contribution of features to predictions and their relevance.
Integration and Automation:
- Streamlining the model deployment process by automating the validation procedure.
- Integration with widely used deep learning technologies and frameworks.
Let's integrate it into our project:
The integration procedure consists of only two steps:
- Install the library.
- Copy the code directly from the documentation, adjust the settings, and you're good to go.
Deepchecks provides many solutions; today we use it to evaluate our model. Because we have structured data (CSV format), we use its Tabular section.
Let's install it:
pip install deepchecks --upgrade
After a successful installation, let's use it to validate our model. You can also jump straight into the documentation and try to integrate it into the project yourself, or follow along with me.
- Let's create the Deepchecks Dataset objects:
from deepchecks.tabular import Dataset
train_ds = Dataset(X_train, label=y_train, cat_features=[])
test_ds = Dataset(X_test, label=y_test, cat_features=[])
- Let's Evaluate our model:
from deepchecks.tabular.suites import model_evaluation
evaluation_suite = model_evaluation()
suite_result = evaluation_suite.run(train_ds, test_ds, rf)
suite_result.show()
The suite produces a report with the results.
Let's explore our model's evaluation:
The Didn't Pass section explains that some validations related to the train-test split did not pass, which may affect our model.
No worries, we can easily address this using the Deepchecks Train Test Validation suite.
Integrating it and checking whether the Didn't Pass tests then clear is left as an exercise for the reader. I am sure they will.
Let's explore the Passed section:
The Passed section provides lots of information that we would otherwise produce manually by writing code, and this platform provides it in literally 5-6 lines of code.
- Test case report
- ROC curve plot: the report also provides an explanation of the plot, which is why I love this platform.
- Prediction drift graph
- Simple model comparison
- Most important, the confusion matrix
There are many more items in the report that I haven't included here. If you are following along with the code, I strongly advise you to go through them during your evaluation.
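For comparison, here is roughly what producing two of these items manually looks like with scikit-learn alone. This is a sketch on synthetic, imbalanced data (hypothetical, generated with make_classification), not the output of the Deepchecks suite and not the tutorial's fraud model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% positives (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)
# stratify keeps the class ratio the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_rf.fit(X_tr, y_tr)

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_rf.predict(X_te))

# ROC AUC is computed from the predicted probability of the positive class
auc = roc_auc_score(y_te, demo_rf.predict_proba(X_te)[:, 1])

print(cm)
print(auc)
```

Each additional metric or plot means more code of this kind, which is the manual work the report saves.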
Also, if anything is unclear, you can go directly to the discussion section of their GitHub repository.
With over 3.2K stars on the repository, Deepchecks offers excellent support.
That's all for this blog. You can also fine-tune this model or even use different algorithms to create personalized ML/AI models.