My newsletter was overrun by bots! I decided to try a machine-learning solution. It was my first ML experiment and I learned a lot. Want to know how I built a bot detector and gained some ML skills along the way?
The bot invasion
I have a free newsletter that encourages you to read daily.
There are 100+ subscribers, and recently a lot of bots have signed up too.
Bots are signing up to market their own product, newsletters, etc.
They usually have a link in the name field and the message that they want to convey.
Email | Name |
---|---|
watcher2112@ecocryptolab.com | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶 |
These spammy signups aren't just annoying, they're a real headache!
I was tired of manually blocking bot emails and worrying about how they might hurt my email reputation.
I know I have numerous options to filter out the bot signups by embedding traditional methods like CAPTCHA, Double Opt-in, Regex patterns, or Honeypot Fields in the form.
At the same time, I felt I was not keeping up with newer technology, particularly machine learning. I wanted to get started but had no clue where to begin.
Then one of my mentors, Shrijith, suggested I try solving the bot signup problem with ML.
I felt this was the right experiment I could begin with to learn ML.
And so, I am here with my first machine learning experiment!
What should I expect from the model?
Picture this: You've built a website with a newsletter signup form. You want to make sure your subscribers are real people, not automated bots.
So, you implement a bot detection system. But what does it mean when someone tells you their system is "95% accurate"?
Let me break it down:
Catching true bots
Imagine 100 signups are actually bots.
A 95% sensitive system should correctly identify 95 of them as bots.
5 bots might slip through the cracks and be mistaken for humans (false negatives), which is okay and not a big deal.
Not mistaking humans
Now, imagine 100 signups are from real humans.
A 95% specific system should accurately recognize 95 of these as humans.
However, 5 people could be mistakenly labeled as bots (false positives). This is much worse: a real person gets ignored, which means losing a potential business lead (and, more generally, it is unfair to that person).
The formulas
Sensitivity = True Bots Detected / (True Bots Detected + Bots Missed)
The system's ability to find true bots.
Specificity = True Humans Detected / (True Humans Detected + Humans Mistaken for Bots)
The system's ability to avoid mislabeling real people.
Accuracy = (True Bots Detected + True Humans Detected) / (Total Signups)
Overall correctness, but it can be misleading if your dataset has way more of one type (bots or humans).
If all three are 1.0 then congrats you have the perfect model.
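To make these formulas concrete, here is a minimal Python sketch that plugs in the 100-bot, 100-human example from above. The function and argument names are made up for illustration.

```python
def sensitivity(true_bots_detected, bots_missed):
    """Share of actual bots that were flagged as bots."""
    return true_bots_detected / (true_bots_detected + bots_missed)


def specificity(true_humans_detected, humans_mistaken_for_bots):
    """Share of actual humans that were recognized as humans."""
    return true_humans_detected / (true_humans_detected + humans_mistaken_for_bots)


def accuracy(true_bots_detected, true_humans_detected, total_signups):
    """Share of all signups that were labeled correctly."""
    return (true_bots_detected + true_humans_detected) / total_signups


# Counts from the example above: 100 bots (95 caught), 100 humans (95 recognized).
print(sensitivity(95, 5))
print(specificity(95, 5))
print(accuracy(95, 95, 200))
```

All three come out at 0.95 here because the example is perfectly balanced; with uneven classes the three numbers can drift far apart.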
One big mental mistake
I used to underestimate the power of data when training machine learning models.
I assumed that algorithms would simply "figure it out" no matter what I fed them.
With a small dataset of 103 signups (only 12 bots!), I threw it at Decision Trees, Logistic Regression, and Random Forest models.
I got an initial accuracy of 77%, but that was a classic overfitting trap.
My models were just memorizing the training data, useless for real-world scenarios.
Frustrated, I jumped to transformers, thinking the solution lay in fancy algorithms.
I got a slight boost to 87.4%, which was a relief but still left much to be desired.
To hit that 90% target, I needed to debug. Using a confusion matrix,
I finally saw the light: it was the data, not the models, holding me back.
I used SMOTE to balance my dataset to equal numbers of human and bot signups (90 humans and 90 bots), and my accuracy shot up to 94%!
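A quick bit of arithmetic shows why accuracy alone can hide this kind of imbalance. The sketch below uses the original counts (103 signups, 12 of them bots) and a hypothetical do-nothing classifier that labels every signup as human; it is an illustration, not one of the models I trained.

```python
humans, bots = 91, 12
total = humans + bots

# A "model" that labels every signup as human gets all humans right
# and every bot wrong.
baseline_accuracy = humans / total
baseline_sensitivity = 0 / bots

print(f"always-human accuracy: {baseline_accuracy:.1%}")
print(f"always-human sensitivity: {baseline_sensitivity:.1%}")
```

So on the unbalanced data, a classifier can score roughly 88% accuracy without catching a single bot, which is why sensitivity and specificity matter more here than the headline accuracy.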
Long story short: How I got to the 100% accuracy bot detector
Note: my training data is 180 rows
1. Preparation
- Imports for models and packages
- Extracting data from my newsletter database to CSV.
2. Creating the Dataset
- BERT cannot consume the raw database rows directly.
- I need a tokenizer to break the text into tokens (units that BERT understands).
I created a class (NewsletterCollectionDataset) to do the above.
3. Splitting data and loading
- I split the data into three sets
- training (to teach the model) 144 rows,
- validation (to check progress during training) 18 rows, and
- testing (for final evaluation) 18 rows.
- Then a function (create_data_loader) turns each of those data splits into DataLoaders, which the model can easily train on.
4. Building the model
- BotClassifier is a class where my bot-detection model is defined.
- It's based on BERT but adds some extra layers:
- bert: Loads the pre-trained BERT model.
- drop: A technique called 'dropout' to help prevent overfitting (the model memorizing too much about the training data).
- out: A final output layer to turn BERT's output into the prediction (bot or human).
- Setting up the Model:
- Get the model ready to run.
- Specify an optimizer (AdamW).
- Learning rate scheduler for how the model's learning changes over time.
5. Training the model
- Setting the model to training mode.
- Looping through the data and updating the model's weights (backpropagation) using the optimizer.
6. The main function
- A function (start_training) that contains the training loop.
- This loop runs for a fixed number of epochs (training cycles). In each epoch, it trains on the training set, evaluates on the validation set, and saves the model if validation accuracy improved.
7. Final Evaluation
- A function (evaluate_model) to get the truest sense of how well the model has learned to generalize to unseen data.
- After training was done, I evaluated the model one more time on the held-out testing set (test_data_loader).
- A function (test_with_single_data) to try out a single signup on the model.
Now I will try to explain the above stages as simply as possible.
How did I create the Dataset?
I have mainly name and email fields in the newsletter signup and there is no verification.
Then I manually blacklisted all the bots in the email service Listmonk.
So the raw data was in the format of
Status | Email | Name |
---|---|---|
Available | athreyac4@gmail.com | athreya c |
Blocklisted | watcher2112@ecocryptolab.com | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶 |
This was good enough for me to do an experiment.
I converted the above data into a simpler format, with a combined name_email text column and a bot label column, so that I could train on it easily.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/usrername/repo/dataset.csv')[['name_email', 'bot']]
df.head(2)
What are the numbers for training and testing?
I had 103 signup emails: 91 from humans and 12 from bots.
I used SMOTE to generate data so that I ended up with 90 bots and 90 humans.
Finally, I used 144 signups for training the model,
18 for validation, and 18 for testing.
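The 144 / 18 / 18 numbers fall out of the two split fractions used in the data preparation code (0.2 first, then 0.5 of what was held out). A small sketch of that arithmetic, with no sklearn involved:

```python
TOTAL_ROWS = 180
INITIAL_TEST_SIZE = 0.2   # share held out from training
VALIDATION_SIZE = 0.5     # share of the held-out rows used for final testing

held_out = round(TOTAL_ROWS * INITIAL_TEST_SIZE)
train_rows = TOTAL_ROWS - held_out
test_rows = round(held_out * VALIDATION_SIZE)
val_rows = held_out - test_rows

print(train_rows, val_rows, test_rows)
```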
### Data preparation
We use Pandas, Torch, and Sklearn packages to make use of their utils for splitting data into training and testing sets.
from sklearn.model_selection import train_test_split as tts
INITIAL_TEST_SIZE = 0.2
RANDOM_SEED = 42
VALIDATION_SIZE = 0.5
# Splits the dataset into a training set (for model training) and a testing set (for evaluating its performance).
df_train, df_test = tts(df,
test_size=INITIAL_TEST_SIZE,
random_state=RANDOM_SEED
)
# Further splits the testing set into a validation set (for tuning model parameters) and a final testing set.
df_val, df_test = tts(df_test,
test_size=VALIDATION_SIZE,
random_state=RANDOM_SEED,
)
Custom Dataset: NewsletterCollectionDataset Class
This class defines a dataset that can be used with PyTorch models.
It takes care of preprocessing the raw name email data using a BERT tokenizer and
converting it into suitable input for a machine-learning model.
# Provide tools for creating custom datasets and loading data in batches for machine learning.
import torch
from torch.utils.data import Dataset
class NewsletterCollectionDataset(Dataset):
    """
    Args:
        bots: Labels for each sample (0 = human, 1 = bot).
        name_emails: List of name email text samples.
        tokenizer: BERT tokenizer for preprocessing.
        max_len: Maximum sequence length.
    """

    def __init__(self, bots, name_emails, tokenizer, max_len):
        self.name_emails = name_emails
        self.bots = bots
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.name_emails)
This is the heart of the class. Here's what happens:
- Grabs a name email signup and its bot/human label.
- Uses the BERT tokenizer to turn the text into numbers the model understands.
- Bundles everything neatly with labels ready for PyTorch.
    def __getitem__(self, i):
        name_email = str(self.name_emails[i])
        bot = self.bots[i]
        # Tokenize, then truncate or pad to max_len, and return PyTorch tensors.
        encoding = self.tokenizer.encode_plus(
            name_email,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'name_email': name_email,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'bot': torch.tensor(bot, dtype=torch.long)
        }
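If the padding and attention mask part feels abstract, here is a toy sketch of the idea in plain Python. It is not BERT's tokenizer: the token IDs are made up, pad_and_mask is a hypothetical helper, and the real tokenizer handles truncation more carefully (for example, it keeps the closing special token).

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Cut or pad a list of token IDs to max_len and build the matching mask."""
    ids = token_ids[:max_len]              # naive truncation
    mask = [1] * len(ids)                  # 1 = real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


# Four made-up token IDs padded out to a length of six.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(input_ids)
print(attention_mask)
```

The attention mask is what tells the model to ignore the padded positions, so every signup can have the same length without the padding influencing the prediction.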
Data Loaders: create_data_loader Function
Creates DataLoader objects, which load the training, validation, and testing sets in batches.
Note that shuffle=True is not passed to the DataLoader here, so every epoch iterates over the samples in the same order.
from torch.utils.data import DataLoader
from transformers import BertTokenizer
def create_data_loader(df, tokenizer, max_len, batch_size):
    """
    Args:
        df (pandas.DataFrame): The DataFrame containing email name data and 'bot' labels.
        tokenizer: The BERT tokenizer for text preprocessing.
        max_len (int): The maximum length for tokenized sequences.
        batch_size (int): Number of samples per batch.
    Returns:
        torch.utils.data.DataLoader: A DataLoader instance for iterating over the dataset.
    """
    ds = NewsletterCollectionDataset(
        bots=df['bot'].to_numpy(),
        name_emails=df['name_email'].to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
    )
Creating model data for training, validation, and testing using the data loaders.
# Loads the BERT tokenizer for text preprocessing.
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
TOKENIZER = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
# Maximum sequence length for tokenization.
MAX_LEN=512
# Batch size for training.
BATCH_SIZE=16
train_data_loader = create_data_loader(df_train, TOKENIZER, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, TOKENIZER, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, TOKENIZER, MAX_LEN, BATCH_SIZE)
## The Model: BERT Plus a Bit More
My core model (BotClassifier) isn't crazy complex. Think of it like this:
BERT Does the Heavy Lifting: I feed BERT those name email signups and it turns them into meaningful representations.
import torch.nn as nn
from transformers import BertModel
class BotClassifier(nn.Module):
    """
    Args:
        n_classes (int): The number of output classes (e.g., 2 for bot vs. human).
    """

    def __init__(self, n_classes):
        super(BotClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
Dropout: A Little Bit of Randomness. Dropout randomly zeroes out some of BERT's output values during training, making the model less prone to overfitting.
The Output Layer: "Bot" or "Not"? A simple linear layer takes BERT's output and makes the final prediction.
The forward method defines the forward pass through the bot classification model.
    def forward(self, input_ids, attention_mask):
        """
        Args:
            input_ids (torch.Tensor): Tokenized input sequences.
            attention_mask (torch.Tensor): Attention mask indicating real vs. padded tokens.
        Returns:
            torch.Tensor: The model's output logits (unnormalized class scores).
        """
        # Index 1 of BERT's output is the pooled representation of the whole sequence.
        pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )[1]
        output = self.drop(pooled_output)
        return self.out(output)
# Check for CUDA (GPU) availability; otherwise defaults to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BotClassifier(n_classes=2)
model = model.to(DEVICE)
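The training code that follows expects loss_fn, optimizer, and scheduler to exist already, but their setup isn't shown in this post. Here is a minimal sketch of what that setup could look like. The learning rate, the linear decay schedule, and the nn.Linear stand-in for the model are all assumptions for illustration; only the use of AdamW and a learning rate scheduler comes from the outline above.

```python
import math

import torch
import torch.nn as nn

# Stand-in for the BotClassifier instance so this sketch runs on its own.
model = nn.Linear(768, 2)

EPOCHS = 5
BATCH_SIZE = 16
N_TRAIN = 144
LEARNING_RATE = 2e-5  # assumed value, not taken from the post

# One scheduler step per batch, so count the batches across all epochs.
steps_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# Assumed schedule: shrink the learning rate linearly to zero over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps),
)
```

CrossEntropyLoss fits here because the model returns raw logits for two classes and the labels are integer class IDs.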
## What did the training involve?
The train function is where I teach this model to spot the bots.
import numpy as np
def train(
    model,
    loss_fn,
    optimizer,
    scheduler,
    device,
    data_loader,
    n_examples
):
    """
    Args:
        model (nn.Module): The PyTorch model to train.
        loss_fn (nn.Module): The loss function for calculating error.
        optimizer (torch.optim.Optimizer): The optimizer used for updating model parameters.
        scheduler: A learning rate scheduler to adjust learning rate during training.
        device (torch.device): The device where the model and data should be loaded ('cpu' or 'cuda')
        data_loader (torch.utils.data.DataLoader): A DataLoader providing batches of training data.
        n_examples (int): The total number of training examples in the dataset.
    Returns:
        tuple: A tuple containing:
            * train_acc (float): Training accuracy for the epoch.
            * train_loss (float): Average training loss for the epoch.
    """
    model = model.train()  # Sets the model to training mode
    losses = []
    correct_predictions = 0
For each batch of data, it:
- Feeds data to the model.
    for d in data_loader:
        # Data preparation
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        targets = d['bot'].to(device)
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
- Calculates how wrong the model was (that's the loss).
        # Loss calculation
        loss = loss_fn(outputs, targets)
        # Accuracy calculation
        _, preds = torch.max(outputs, dim=1)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
- Tweaks the model to be better next time (backpropagation and the optimizer).
- Learning rate magic: The scheduler adjusts the learning rate, so the model learns quickly at first and then fine-tunes itself.
        # Back propagation
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
        # Optimization
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    train_acc = correct_predictions.double() / n_examples
    train_loss = np.mean(losses)
    return train_acc, train_loss
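The gradient clipping line in the loop above is easy to skim past, so here is a toy sketch of the idea in plain Python. It rescales a list of gradient values so their overall norm never exceeds max_norm. The helper name is made up, and PyTorch's clip_grad_norm_ differs in its details, so treat this as an illustration only.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients down if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough, leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# The norm of [3.0, 4.0] is 5.0, so both values get scaled down by the same factor.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```

Clipping keeps the direction of the update but caps its size, which helps stop one odd batch from throwing the model off.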
from collections import defaultdict
history = defaultdict(list)
EPOCHS=5
def start_training():
    best_accuracy = 0
    for epoch in range(EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        print('-' * 10)
This is where the core learning happens for one epoch. Accuracy and loss (how wrong the model is) are calculated on your training data.
        train_acc, train_loss = train(
            model,
            loss_fn,
            optimizer,
            scheduler,
            DEVICE,
            train_data_loader,
            len(df_train)
        )
        print(f'Train loss {train_loss} accuracy {train_acc}')
The evaluate_model function tests how well the model is doing on a validation dataset it hasn't been trained on.
Comparing this with the training accuracy helps spot overfitting.
        val_acc, val_loss = evaluate_model(
            model,
            loss_fn,
            DEVICE,
            val_data_loader,
            len(df_val)
        )
        print(f'Validation loss {val_loss} accuracy {val_acc}\n')
If the model beats its previous best performance on the validation set, it's saved.
        history['train_acc'].append(train_acc)
        history['train_loss'].append(train_loss)
        history['val_acc'].append(val_acc)
        history['val_loss'].append(val_loss)
        # Keep the weights from the epoch with the best validation accuracy.
        if val_acc > best_accuracy:
            torch.save(model.state_dict(), 'best_detector_model.bin')
            best_accuracy = val_acc

start_training()
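evaluate_model is called in the loop above, but its body isn't shown in this post. Here is a minimal sketch of what it could look like, assuming it mirrors train without the backward pass and returns accuracy first and loss second, as the call site expects. The details are a guess, not my exact function.

```python
import numpy as np
import torch
import torch.nn as nn


def evaluate_model(model, loss_fn, device, data_loader, n_examples):
    """Return (accuracy, average loss) on a dataset without updating the model."""
    model = model.eval()  # switches off dropout
    losses = []
    correct_predictions = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for d in data_loader:
            input_ids = d['input_ids'].to(device)
            attention_mask = d['attention_mask'].to(device)
            targets = d['bot'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs, targets)
            preds = torch.argmax(outputs, dim=1)
            correct_predictions += torch.sum(preds == targets).item()
            losses.append(loss.item())
    return correct_predictions / n_examples, float(np.mean(losses))
```

The same function can be pointed at test_data_loader after training for the final evaluation.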
Is it working?
Testing the model with a signup
Single Signups: The test_with_single_data function demonstrates how to use the model on one signup at a time.
Prepping the Input: Just like during training, we use our trusty BERT tokenizer (TOKENIZER) to turn a new signup into the right format.
def test_with_single_data(data_to_test):
    """Tests a single signup to determine if it's likely from a bot or human.
    Args:
        data_to_test (str): The name and email data from a newsletter signup.
    Prints:
        The input signup data along with the model's prediction (bot or human).
    """
    # Tokenize and prepare input data for the model
    encoding = TOKENIZER.encode_plus(
        data_to_test,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True,
        return_token_type_ids=False,
        padding="max_length",  # replaces the deprecated pad_to_max_length=True
        return_attention_mask=True,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"].to(DEVICE)
    attention_mask = encoding["attention_mask"].to(DEVICE)
To the Model!: The model spits out a prediction, and we turn its numbers into a probability using torch.nn.functional.softmax.
    # Set model to evaluation mode and run prediction
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        prob = torch.nn.functional.softmax(outputs, dim=1)
    # Get the class prediction (0 = human, 1 = bot)
    prediction = torch.argmax(prob, dim=1).item()
Bot or Not? Based on that probability, we decide whether it's likely a bot or a real human signup.
    # Print the input data and the prediction result
    print(f"Input Name Email: {data_to_test}")
    if prediction == 1:
        print("The signup is likely from a bot. \n")
    else:
        print("The signup is likely from a human. \n")
email = "rishic2013@gmail.com"
name = "Rishi C "
email2 = "lama2@hexmos.com"
name2 = "🔶Lama2. G t 12 "
test_with_single_data(name+email)
test_with_single_data(name2+email2)
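To see what the softmax step does to the model's two output numbers, here is a small sketch in plain Python. The logits are made up for illustration; in the real function they come from the model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]                   # made-up scores for [human, bot]
probs = softmax(logits)
prediction = probs.index(max(probs))   # 0 = human, 1 = bot
print(probs, prediction)
```

Since softmax preserves the ordering of the scores, picking the largest probability gives the same class as picking the largest logit; the probabilities are mainly useful when you want to set your own threshold.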
The method I used for debugging to get from 87% to 94%
When I wanted to gain more accuracy, I didn't know exactly what was going wrong.
Once I implemented and understood the confusion matrix,
I saw it was showing one false positive.
So let me explain what a confusion matrix is.
The confusion matrix is a simple and powerful tool that gives a clear picture of how well the classification is working.
Sklearn provides a function called confusion_matrix to compute it, and the result can then be plotted as a heatmap.
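import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# y_test and y_pred are tensors holding the true and predicted labels for the test set.
cm = confusion_matrix(y_test.numpy(), y_pred.numpy())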
custom_colors = ['#f0a9b1', '#a9f0b9']
sns.heatmap(cm, annot=True, cmap=custom_colors, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
For plotting the confusion matrix, I used Matplotlib and the Seaborn library in Python.
Think of it like a truth table for your model. It lays everything out:
True Negative (Top left): 6 - The model correctly identified 6 human signups.
False Positive (Top right): 0 - The model incorrectly identified 0 human signups as bots.
False Negative (Bottom Left): 1 - The model incorrectly identified 1 bot signup as human.
True Positive (Bottom right): 11 - The model correctly identified 11 bot signups.
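The four cells above are enough to recompute the headline metrics. A minimal sketch, assuming the usual sklearn layout where rows are true labels and columns are predicted labels (0 = human, 1 = bot):

```python
# Rows are true labels, columns are predicted labels (0 = human, 1 = bot).
cm = [[6, 0],
      [1, 11]]
(tn, fp), (fn, tp) = cm

sensitivity = tp / (tp + fn)                 # bots caught out of all bots
specificity = tn / (tn + fp)                 # humans recognized out of all humans
accuracy = (tp + tn) / (tp + tn + fp + fn)   # everything labeled correctly

print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.3f}")
```

That works out to roughly 0.917 sensitivity, 1.000 specificity, and 0.944 accuracy on the 18 test signups.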
Coming back to the original problem: I had one false positive.
That meant the model was wrongly flagging a real person as a bot! After a quick look at my data with my show_misclasified() function,
I realized I had mislabeled data during my balancing act.
A single human mislabeled as a bot was causing the dip.
One fix, one retrain, and done: 94% accuracy!
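My own helper isn't shown here, but the idea behind it is simple: list every row where the prediction and the label disagree, then read those rows by eye. A minimal sketch with a hypothetical list_misclassified function and made-up sample data:

```python
def list_misclassified(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every disagreement."""
    return [
        (text, true, pred)
        for text, true, pred in zip(texts, y_true, y_pred)
        if true != pred
    ]


# Made-up rows; labels are 0 = human, 1 = bot.
texts = [
    "Asha R asha@example.com",
    "Win cash now >>> http://spam.example promo@example.com",
    "Ravi K ravi@example.com",
]
y_true = [0, 1, 1]   # the last row is a human that was wrongly labeled as a bot
y_pred = [0, 1, 0]

for row in list_misclassified(texts, y_true, y_pred):
    print(row)
```

In this toy case the "error" is really a labeling mistake: the model's prediction is right and the label is wrong, which is the kind of problem that only shows up when you look at the individual rows.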
Conclusion
My bot detector caught 91.6% of the bots in the test set (11 of 12), with a perfect score (100%) at identifying real subscribers.
Not bad, since accidentally blocking a real person (false positive) is a much bigger concern than missing a sneaky bot.
This is a good start, but I'm always looking to improve. I'll be gathering more data and experimenting to see if I can boost the accuracy even further.
Want to stay updated on my progress? Subscribe to our journal for next week's content on fine-tuning Stable Diffusion!
Originally published at https://journal.hexmos.com on March 31, 2024.