The Drawbacks of Traditional WAF

Traditional Web Application Firewalls (WAFs) typically use regular expressions to define attack characteristics based on keywords.

Take the well-known ModSecurity engine as an example: around 80% of the world’s WAFs are driven by it. However, the rules these WAFs rely on can often be evaded easily by attackers. Let’s dissect a few rules to illustrate:

union[\w\s]*?select: This rule flags any traffic in which the word union is followed by the word select, with only word characters or whitespace in between, as a SQL injection attack.
\balert\s*\(: This rule flags any traffic containing the word alert followed by a left parenthesis “(” as an XSS attack.

Sophisticated attackers can easily bypass these keyword-based rules. Here are some examples of evasion:

union /**/ select: By inserting a comment between “union” and “select” the keyword pattern is broken, and the attack goes undetected.
window['\x61lert'](): By using \x61 to replace the letter “a” the keyword pattern is broken, and the attack goes undetected.
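The evasions above can be checked with Python's re module. This is a minimal sketch, not ModSecurity itself: the XSS pattern is written as \balert\s*\( (an assumption about the rule's intended form), case-insensitive matching is assumed, and the payload strings around the keywords are made up for illustration.

```python
import re

# Keyword rules quoted above. The XSS pattern is written as \balert\s*\(
# here, which is an assumption about the rule's intended form.
SQLI_RULE = re.compile(r"union[\w\s]*?select", re.IGNORECASE)
XSS_RULE = re.compile(r"\balert\s*\(", re.IGNORECASE)


def flagged(payload: str) -> bool:
    """Return True if either keyword rule matches the payload."""
    return bool(SQLI_RULE.search(payload) or XSS_RULE.search(payload))


# Hypothetical payloads built around the examples in the text.
plain_sqli = "1 union select password from users"
evaded_sqli = "1 union /**/ select password from users"
plain_xss = "alert(1)"
evaded_xss = r"window['\x61lert'](1)"  # raw string: the backslash is literal

for payload in (plain_sqli, evaded_sqli, plain_xss, evaded_xss):
    print(f"{payload!r}: flagged={flagged(payload)}")
```

The two plain payloads are flagged, while the two evaded payloads pass through, because neither "/" nor "*" is matched by [\w\s] and the literal text \x61lert never contains the word alert.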

From these examples, we can conclude that traditional regular expression-based WAFs cannot reliably stop attacks: a determined attacker can rewrite the payload until it no longer matches the keywords.

Moreover, regular expressions often cause a significant number of false positives, which can interfere with legitimate users and disrupt normal business operations. Here are some examples of false positives:

The union select members from each department to form a committee: This phrase would be mistakenly flagged as a SQL injection attack, but it is just a simple English sentence.
Her down on the alert(for the man) and walked into a world of rivers: This sentence would be mistakenly flagged as an XSS attack, but it is just a simple English sentence.
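The false positives can be reproduced the same way. The sketch below assumes the same two patterns (with the XSS rule written as \balert\s*\(, which is an assumption about its intended form) and reports which substring of each harmless sentence triggers a rule.

```python
import re

# Same keyword rules as discussed above (the XSS pattern form is assumed).
RULES = {
    "sqli": re.compile(r"union[\w\s]*?select", re.IGNORECASE),
    "xss": re.compile(r"\balert\s*\(", re.IGNORECASE),
}


def matched_rules(text: str) -> dict:
    """Map each matching rule name to the substring that triggered it."""
    hits = {}
    for name, rule in RULES.items():
        match = rule.search(text)
        if match:
            hits[name] = match.group(0)
    return hits


benign = [
    "The union select members from each department to form a committee",
    "Her down on the alert(for the man) and walked into a world of rivers",
]
for sentence in benign:
    print(matched_rules(sentence))
```

Both sentences trigger a rule even though neither contains anything a database or browser would execute as an attack.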

Here are two talks from the Black Hat conference whose slides show how researchers automatically bypass regex-based WAF protection:

AutoSpear: Towards Automatically Bypassing and Inspecting Web Application Firewalls
Web Application Firewalls: Attacking Detection Logic Mechanisms

Principles of Semantic Analysis

The semantic analysis algorithm is the core capability of SafeLine WAF. Instead of using simple attack characteristics to match traffic, it genuinely understands user input in the traffic and deeply analyzes potential attack behaviors.

Example: SQL Injection

To complete a SQL injection attack, there are two prerequisites:

1. The traffic must contain a SQL statement that conforms to the syntax.

union select xxx from xxx where: This is a syntactically correct SQL statement fragment.
union select xxx from xxx xxx xxx xxx where: This is not a syntactically correct SQL statement fragment.
1 + 1 = 2: This is a syntactically correct SQL statement fragment.
1 + 1 is 2: This is not a syntactically correct SQL statement fragment.

2. The SQL statement must exhibit malicious behavior, not just be a useless statement.

union select xxx from xxx where: This shows potential malicious behavior.
1 + 1 = 2: This is meaningless.
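The two prerequisites can be illustrated with a toy checker. This is only a sketch under strong assumptions: the grammar below accepts just two fragment shapes (a "union select ... from ... [where]" fragment and a comparison of numeric sums), it is not real SQL, and it is not SafeLine's engine. All function names are hypothetical.

```python
import re

KEYWORDS = {"union", "select", "from", "where"}
TOKEN = re.compile(r"\s*(\d+|\w+|[=+])")


def tokenize(text):
    """Split text into lowercase tokens; return None on an unknown character."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            return None
        tokens.append(match.group(1).lower())
        pos = match.end()
    return tokens


def is_ident(token):
    return token.isidentifier() and token not in KEYWORDS


def is_sum(tokens):
    """number (+ number)*"""
    return (len(tokens) % 2 == 1
            and all(tok.isdigit() for tok in tokens[0::2])
            and all(tok == "+" for tok in tokens[1::2]))


def parse(text):
    """Prerequisite 1: return the fragment kind, or None if the toy grammar rejects it."""
    t = tokenize(text)
    if not t:
        return None
    if t[:2] == ["union", "select"]:
        rest = t[2:]  # expected: <ident> from <ident> [where]
        if (len(rest) in (3, 4) and is_ident(rest[0]) and rest[1] == "from"
                and is_ident(rest[2]) and (len(rest) == 3 or rest[3] == "where")):
            return "union_select"
        return None
    if t.count("=") == 1:
        i = t.index("=")
        if is_sum(t[:i]) and is_sum(t[i + 1:]):
            return "comparison"
    return None


def verdict(text):
    """Prerequisite 2: only fragments that read data are treated as suspicious."""
    kind = parse(text)
    if kind is None:
        return "not valid syntax"
    return "suspicious" if kind == "union_select" else "harmless"


for sample in ("union select xxx from xxx where",
               "union select xxx from xxx xxx xxx xxx where",
               "1 + 1 = 2",
               "1 + 1 is 2"):
    print(f"{sample!r}: {verdict(sample)}")
```

Only the first sample satisfies both prerequisites in this toy model: it parses and it attempts to read data. The constant comparison parses but does nothing, and the other two samples are rejected at the syntax step.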

SafeLine detects attacks based on the essence of SQL injection attacks with the following process:

1. Parse HTTP traffic to find potential input parameters.
2. Perform deep recursive decoding on parameters to revert to the original user input.
3. Check if the user input conforms to SQL syntax.
4. Determine the possible intent of the SQL syntax.
5. Assign a malicious score based on the actual intent and decide whether to block the request.
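The recursive decoding step can be sketched as decoding until the value stops changing. The example below is an assumption about what such a step might look like, not SafeLine's implementation: it handles only URL percent-encoding and HTML entities, and the function name deep_decode and the round limit are hypothetical.

```python
import html
from urllib.parse import unquote


def deep_decode(value: str, max_rounds: int = 10) -> str:
    """Apply URL and HTML-entity decoding until the value stops changing.

    max_rounds bounds the work spent on deliberately deep nesting.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            break
        value = decoded
    return value


# A double URL-encoded space and a URL-encoded HTML entity both unwrap fully.
print(deep_decode("union%2520select"))
print(deep_decode("%26lt%3Bscript%26gt%3B"))
```

Decoding to a fixed point matters because the later syntax check should see what the backend will eventually see, not the outermost encoding layer.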

SafeLine has built-in compilers covering common programming languages. After deeply decoding the HTTP payload, it checks the input against the syntax of the corresponding language, evaluates the result against a threat model to obtain a threat rating, and then decides whether to block or allow the request.

Why Semantic Analysis is Stronger

Anyone who has studied computer science will be familiar with compiler theory, which introduces the Chomsky hierarchy. It divides formal languages into four types:

Type-0 Grammar (Unrestricted Grammar): Recognized by a Turing machine.
Type-1 Grammar (Context-Sensitive Grammar): Recognized by a linear-bounded automaton.
Type-2 Grammar (Context-Free Grammar): Recognized by a pushdown automaton.
Type-3 Grammar (Regular Grammar): Recognized by a finite state automaton.

The expressive power of these grammars weakens sequentially from Type-0 to Type-3. The programming languages we use daily, like SQL, HTML, and JavaScript, usually fall into Type-2 Grammar (or even include some Type-1 Grammar capabilities). In contrast, regular expressions correspond to the weakest expressive power of Type-3 Grammar.

To illustrate the weakness of regular expressions, a classic example is that regular expressions cannot count. You cannot even use a regular expression to recognize whether the parentheses in a string are correctly balanced.
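A short sketch makes the counting limitation concrete. The is_balanced function below keeps a depth counter, which plays the role of a pushdown automaton's stack when there is only one bracket type. The regex next to it is a hypothetical pattern that hard-codes nesting up to depth 2; going deeper would require writing a longer pattern for every extra level.

```python
import re


def is_balanced(text: str) -> bool:
    """Check parenthesis balance with a counter (a one-symbol stack)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its matching "("
                return False
    return depth == 0


# Hypothetical regex that only accepts balanced strings nested at most 2 deep.
DEPTH_2_ONLY = re.compile(r"(\((\(\))*\))*")

for sample in ("()()", "(())", "((()))", "(()"):
    regex_ok = DEPTH_2_ONLY.fullmatch(sample) is not None
    print(f"{sample!r}: counter={is_balanced(sample)} regex={regex_ok}")
```

The counter accepts "((()))" while the fixed-depth regex rejects it, even though the string is perfectly balanced.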

Using such a weak Type-3 Grammar to match ever-changing attack payloads is inherently impossible. The root cause is the innate limitation of rule-based attack detection. In terms of expressive power, Type-3 Grammar is strictly contained within Type-2 Grammar, so rules written as regular expressions cannot fully cover attack payloads written in a programming language. This is the fundamental reason why rule-based WAFs fall short of expectations.

In contrast, semantic analysis-based threat detection is more accurate and has a lower false-positive rate than regular expression-based threat detection.
