Using an Open Source WAF to Address Crawlers Occupying Significant Network Bandwidth

Carrie - Sep 19 - Dev Community

1. Background

Some automated bots and malicious crawlers access a website at high frequency and over long periods. When you open the management console of your cloud server, you will often find that most of the network traffic comes from one or a few IPs. This situation can be handled by rate limiting those IPs on the server.

However, IP rate limiting is usually unrelated to business logic, and developers typically do not want to maintain an IP access frequency table themselves. Tracking every visitor correctly in a distributed, concurrent deployment also carries a high development cost.

Chaitin’s SafeLine WAF addresses these issues. SafeLine offers rate limiting, port forwarding, manual IP blacklists and whitelists, and its core function: defending against web attacks.

2. Installing SafeLine

The official website provides several installation methods, including online installation and offline installation. For details, refer to:
SafeLine WAF Installation Guide

3. Logging In and Entering the SafeLine Management Interface

Image description
Image description

4. Configuring Sites and Rate Limiting

4.1 SafeLine Site Configuration
SafeLine’s site configuration is comprehensive: you can upload a TLS certificate and private key, specify multiple forwarding ports, and more, so developers do not need to configure nginx forwarding themselves.

Image description

Image description

4.2 Configuring Rate Limiting
You can customize the blocking strategy. A recommended starting point is a limit of 100 requests per 10 seconds, with an IP that exceeds it blocked for 10 minutes.

Image description
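To make the recommended policy concrete, here is a minimal sketch of a per-IP sliding-window limiter with the same numbers (100 requests per 10 seconds, 10-minute block). It is not SafeLine's implementation; the class name `SlidingWindowLimiter` and its parameters are hypothetical, and it only illustrates the kind of bookkeeping the WAF takes off your hands.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most `limit` requests per `window`
    seconds; an IP that exceeds it is blocked for `block_for` seconds."""

    def __init__(self, limit=100, window=10.0, block_for=600.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.block_for = block_for
        self.clock = clock              # injectable so the logic can be tested
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked_until = {}         # ip -> time at which the block expires

    def allow(self, ip):
        now = self.clock()

        # Reject immediately while the IP is still inside its block period.
        until = self.blocked_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self.blocked_until[ip]

        # Drop timestamps that have fallen out of the window.
        recent = self.hits[ip]
        while recent and now - recent[0] >= self.window:
            recent.popleft()

        if len(recent) >= self.limit:
            # Limit exceeded: start the block and forget the old hits.
            self.blocked_until[ip] = now + self.block_for
            recent.clear()
            return False

        recent.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    results = [limiter.allow("203.0.113.7") for _ in range(101)]
    print(results.count(True), "allowed,", results.count(False), "blocked")
```

Note that this keeps all state in one process. Sharing it across several servers is exactly the distributed bookkeeping problem described in the Background section.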

P.S. If you are testing for personal use or run into false positives, you can disable blocking manually.

5. Testing and Other Notes

5.1 Testing
A simple backend server is prepared that exposes a "hello" endpoint taking an "a" parameter. Here is a simple crawler to test it:

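import random

import requests


def send_request(url, request_method="GET", header=None, data=None):
    """Send one HTTP request; return the response, or None on failure."""
    if header is None:
        header = {"User-Agent": "Mozilla/5.0"}
    try:
        # Pass `data` through and set a timeout so a stalled request cannot hang the loop.
        return requests.request(request_method, url, headers=header,
                                data=data, timeout=10)
    except requests.RequestException as err:
        print(err)
        return None


if __name__ == '__main__':
    # Send 100 requests in quick succession, each with a random "a" value.
    for i in range(100):
        char = random.choice('abcdefghijklmnopqrstuvwxyz')
        resp = send_request("http://a.com/hello?a=" + char)
        # Skip failed requests instead of crashing on None.
        if resp is not None:
            print(resp.content)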

Output examples:

b'{"a":"u"}'
b'{"a":"m"}'
b'{"a":"y"}'
b'{"a":"o"}'
b'<!DOCTYPE html>\n\n<html lang="zh">\n  <head>\n .... (followed by a long HTML text)

At this point, when you revisit the page, you will find that it has been blocked.

Image description
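To see exactly when blocking kicks in during a test run, you can classify each response body. The sketch below assumes, based only on the output examples above, that normal responses are JSON objects and that a blocked request returns an HTML page instead. The helper names `is_blocked` and `first_blocked_index` are hypothetical.

```python
import json


def is_blocked(body: bytes) -> bool:
    """Treat any body that is not a JSON object as a block page (assumption)."""
    try:
        return not isinstance(json.loads(body), dict)
    except ValueError:  # covers JSON decode errors and bad encodings
        return True


def first_blocked_index(bodies):
    """Return the index of the first blocked response, or None if none was blocked."""
    for i, body in enumerate(bodies):
        if is_blocked(body):
            return i
    return None


if __name__ == "__main__":
    bodies = [
        b'{"a":"u"}',
        b'{"a":"m"}',
        b'<!DOCTYPE html>\n\n<html lang="zh">\n  <head>\n',
    ]
    print(first_blocked_index(bodies))  # prints 2
```

In the crawler above, you could collect `resp.content` into a list and pass it to `first_blocked_index` once the loop finishes.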

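5.2 What if Crawlers Spoof the X-Forwarded-For Header?
SafeLine lets you choose how the source IP is obtained under ‘General Settings’. If the source IP is taken from the network connection rather than from the X-Forwarded-For header, a forged header has no effect on rate limiting.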

Image description
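The reasoning behind this setting can be sketched as follows. This is not SafeLine's code; `client_ip` and its parameters are hypothetical and only show a common way to decide which address to rate limit on: use the connection's peer address, and accept X-Forwarded-For only when the connection comes from a proxy you trust.

```python
import ipaddress


def client_ip(peer_ip, xff_header=None, trusted_proxies=()):
    """Pick the address to rate limit on.

    peer_ip: address of the TCP connection.
    xff_header: raw X-Forwarded-For value, or None.
    trusted_proxies: networks whose X-Forwarded-For entries are accepted.
    """
    trusted = [ipaddress.ip_network(n) for n in trusted_proxies]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    # A direct client can put anything in the header, so ignore it
    # unless the connection itself comes from a trusted proxy.
    if not xff_header or not is_trusted(peer_ip):
        return peer_ip

    # Walk the chain right to left, skipping trusted proxies; the first
    # untrusted hop is the closest address that can still be believed.
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            break  # malformed entry: stop trusting the header
    return peer_ip


if __name__ == "__main__":
    # Direct connection with a forged header: the header is ignored.
    print(client_ip("198.51.100.9", "1.2.3.4"))
    # Connection via a trusted proxy: the last untrusted hop is used.
    print(client_ip("10.0.0.5", "1.2.3.4, 203.0.113.7", ["10.0.0.0/8"]))
```

In the second call, the forged leading entry `1.2.3.4` is never reached, because the walk stops at the first address that was appended by a trusted proxy.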

If the crawler spoofs the TCP source IP field instead, the TCP handshake cannot complete, because the server's reply is sent to the forged address rather than to the crawler. No HTTP connection is established, so the traffic is dropped before nginx can process a request, and the crawler loses the ability to scrape information.

This guide should help you effectively manage bandwidth usage caused by crawlers with SafeLine WAF.
