Everything on the internet has a Uniform Resource Locator (URL) that uniquely identifies it — allowing Internet users to gain access to files and other media. For instance, this article has a unique URL that helps search engine optimization (SEO) crawlers index it for users to find.
The first definition of the URL syntax is in the 1994 Request for Comments (RFC) 1738. Since then, the structure of URLs has gone through many revisions to improve their security. However, developers often fail to use the RFC definition as intended, contributing to many malicious attacks.
A recent example is the RCE 0-day exploit found in Log4j, a popular Java logging package. This attack occurred when the Java Naming and Directory Interface (JNDI) evaluated a malicious log string. The JNDI is a Java API for a directory service that allows Java software clients to discover and look up data and resources (in the form of Java objects) using a name. When it evaluates a malicious log string, it connects to a remote server and executes malicious Java code. Therefore, enterprises must continually validate the URLs provided by external agents, such as customers, partners, and so on.
Another example is the server-side request forgery attack that can compromise a server program when it accesses an insecure URL. Malicious users can leverage the provided input fields to cause damage to our web solutions — and worse, our organization’s public image.
In this article, we’ll explore the challenges of corrupted URLs, how they can damage your applications, and ways to tackle the problem. To follow along, make sure you have Python installed and set up on your machine.
Security risks with URLs
URLs can be risky, as they can direct users from a legitimate web page to a malicious one. They can also open multiple attack vectors — including cross-site scripting (XSS), a security vulnerability in some web applications. The XSS vulnerability enables attackers to inject client-side scripts into web pages viewed by other users.
This security risk can have serious consequences in specific industries, such as banking. An unknowing customer can open a genuine-looking page controlled by an attacker because they believe it’s a login page for their banking application. This is a type of phishing attack, built on social engineering, where an attacker sends fraudulent, fake, or otherwise deceptive messages designed to trick a person into revealing sensitive information to the attacker. They could also deploy malicious software on the victim’s infrastructure, including ransomware. If the attacker captures a banking user’s username and password, the attacker can use this information to log in to the user’s bank account and cause damage.
More advanced attacks with URLs can happen due to server-side request forgery. This attack occurs in the servers that send requests to the URLs provided by the customers. When making a request, an attacker can take over the server to perform actions that aren’t allowed, such as scanning the ports and network information, requesting metadata of the infrastructure, printing, and outputting sensitive information such as passwords or tokens.
The following sections demonstrate how to validate and sanitize URLs you don’t control. For example, if we’re building a social platform where users can communicate and share data, we can control which URLs are allowed and which we can hide.
How to perform URL validation in Python
Python, a programming language, is widely used to build web applications and sites for millions of customers worldwide. While Python doesn’t have a built-in URL scanner and validator, community-driven URL parsers and validators are available for Python — which can be used to validate our URLs and make them more secure. Some web application frameworks also provide URL scanners and validators for developing web applications.
Validating URLs
One popular URL validation method in Python is the validators Python package. We can use the validators package to determine whether a URL is valid. To be valid, a URL must:
- Be well-formed, meaning it follows all the rules of the HTTP or HTTPS specifications
- Have a resource at that address, since the URL is invalid without an associated resource
The Python validators package exposes a URL method that verifies the URL, and tests if it is secure, and searches for invalid keywords and characters. If a URL is available in the public domain (meaning it’s not behind a firewall, paywall, or other barrier to access), it skips any internal IP addresses.
To use the validators package, download and set up the dependency on your local Python environment using pip. To download the validators package, run the following code in your local terminal or command-line interface:
$ python3 -m pip install validators
Once this command finishes, we’ll have the validators package on our machine.
Next, create a Python file named main.py
, and write the Python code to test a URL:
import validators
validation = validators.url("http:/www.google.com")
if validation:
print("URL is valid")
else:
print("URL is invalid")
After putting this code in the main.py
file, execute the code in Python interpreter using the Python command-line interface:
$ python main.py
The code prints "URL is invalid"
because there is a /
character missing in the URL after http:/
.
The package can also determine if a URL is publicly accessible, which is helpful when trying to validate whether the user is trying to request an internal IP address. Add the following code in the same Python file and run it to try and validate a URL.
validation = validators.url("https://10.0.0.1", public=True)
if validation:
print("URL is valid")
else:
print("URL is invalid")
Once again, the output states that the URL is invalid. This is because the URL isn’t available in public domains — despite the fact that there aren’t any issues with the URL itself. To learn more about the validators package’s features, consult its documentation.
Parsing using regular expressions
Another approach we can use to validate URLs is using regular expressions. We can, for example, use regular expressions to require a URL to include HTTPS before it can be validated. The code to complete this validation looks like this:
^https:\/\/[0-9A-z.]+.[0-9A-z.]+.[a-z]+$
This regular expression matches the term https://www.google.com
, but not http://www.google.com
, although both are valid URLs. You can learn more about regular expressions on this website. We can try the expression above in Python code:
import re
pattern = "^https:\/\/[0-9A-z.]+.[0-9A-z.]+.[a-z]+$"
result = re.match(pattern, "https://www.google.com")
if result:
print(result)
else:
print("Invalid URL")
The output for the code above is the matched URL. If you modify the URL string above and remove the HTTPS or make it HTTP, you get a None
object — indicating there was no matching URL.
However, a regular expression is complicated and not practical for a real-world scenario. Regular expressions are hard to read and complex to debug and scale. This is why libraries are generally a better solution.
For a more advanced use case of a regular expression that parses the URL with all the conforming structures and syntax, check out this Stack Overflow thread.
One benefit of using regular expressions is that you can also find invalid URLs within input strings. This is only possible with regular expressions — and not with common libraries. The validators package doesn’t work, even if the string contains a trailing or preceding whitespace. To best use the package, you must sanitize the input string and pass it to the validator package.
Using urllib
Another package that parses the URL and exposes the parts of the URL is urllib. We can use it with the Python 3 interpreter.
The following code verifies if a URL is valid:
from urllib.parse import urlparse
result = urlparse("https:/www.google.com")
if result.scheme and result.netloc:
print("Success")
else:
print("Failed")
print(result)
When we set the scheme and the netloc
field for the result
variable, the URL is valid and can be used. Otherwise, the URL is invalid, and we should take caution.
Identifying common vulnerabilities
Some frameworks in Python, such as Django, provide built-in validator packages that allow us to validate URLs within that framework. However, the challenge with relying on these libraries is that, even if we’re familiar with that framework, we still have to ensure the package itself is secure. Andopen source only makes trusting the package more complex.
We can use a security tool like the Snyk Advisor to quickly review all open source packages we’re using — and new packages that we’d like to implement — for common vulnerabilities. From this review, Snyk provides us with a security report that we can use to determine whether or not the package should be included. For example, we can use the Snyk Advisor to ensure our packages, like the validators package we used in the demonstration above, are secure.
We can also use Snyk Open Source alongside Snyk Advisor to discover licensing problems, vulnerabilities, and other security-related concerns that may exist in our open source tool stack. And, we can consult the Snyk Vulnerability Database (VulnDB) to search for known URL-related vulnerabilities — like this vulnerability in the Flask framework, which redirects a user to a location without URL validation. By consulting this database, we can proactively secure our webpages and applications.
Validating URLs in Python
In this article, we explored the challenges that unsanitized URLs bring to a web application, and how to sanitize the URLs as needed. We started with basic validations of the URL using the validator and urllib packages. And demonstrated how to use these packages to confirm whether the URLs are all public, or if some are internal URLs used to attack the web application. Then, we used a primary regular expression to showcase how a simple one-line code can scan the URLs for basic required information, such as all URLs being HTTPS. We also discussed why regular expressions are a complex and challenging way to validate the URLs.
Finally, we covered how to best select secure open-source libraries. Snyk Advisor provides a valuable source of truth when using open source packages. For more information, review the Python security best practices cheat sheet to help prepare the continuous integration and continuous development pipelines (CI/CD) to review any vulnerability added to your code.
Secure your open source software for free
Create a Snyk account today to find and fix licensing problems, vulnerabilities, and other security-related concerns.