Python regular expression methods re.match() and re.sub()

DoriDoro - Sep 19 - - Dev Community

Introduction

Let's go over two functions from Python's re module, re.sub() and re.match(), with examples.

1. re.sub():

The re.sub() function substitutes occurrences of a pattern in a string and returns the result as a new string; the original string is not modified. It takes three main arguments:

  • The pattern you want to replace (a regular expression).
  • The replacement string (what you want to replace it with).
  • The original string in which you want to replace the occurrences of the pattern.

Syntax:

re.sub(pattern, replacement, string, count=0, flags=0)
  • pattern: The regex pattern to search for.
  • replacement: The string to replace the matched pattern.
  • string: The input string where the replacement will occur.
  • count: (Optional) Limits the number of replacements. The default, 0, replaces all occurrences.
  • flags: (Optional) Allows modification of matching behavior (like case-insensitivity).
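
The two optional parameters are easiest to understand side by side. The following is a minimal sketch with an invented sample sentence; count and flags are passed as keyword arguments so they cannot be confused with each other.

```python
import re

text = "Cat, cat, CAT and one more cat."

# count=2 stops after the first two replacements (left to right).
first_two = re.sub(r"cat", "dog", text, count=2, flags=re.IGNORECASE)

# Without count, every case-insensitive match is replaced.
all_of_them = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)

print(first_two)    # dog, dog, CAT and one more cat.
print(all_of_them)  # dog, dog, dog and one more dog.
```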

Example:

Let's replace every run of digits in a string with the word NUM.

import re

text = "The price is 123 dollars and 45 cents."
new_text = re.sub(r'\d+', 'NUM', text)

print(new_text)

Output:

The price is NUM dollars and NUM cents.

Here, \d+ is the regex pattern that matches one or more digits. The re.sub() function replaces all occurrences of this pattern with the string 'NUM'.


2. re.match():

The re.match() function checks for a match only at the beginning of the string. If the match is found at the start of the string, it returns a match object. Otherwise, it returns None.

Syntax:

re.match(pattern, string, flags=0)
  • pattern: The regex pattern to match.
  • string: The input string to be searched.
  • flags: (Optional) Allows modification of matching behavior.

Example:

Let's check if a string starts with word characters followed by digits.

import re

text = "Price123 is the total cost."
match = re.match(r'\w+\d+', text)

if match:
    print(f"Matched: {match.group()}")
else:
    print("No match found")

Output:

Matched: Price123

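Here, \w+ matches one or more word characters (letters, digits, and underscores), and \d+ matches one or more digits. Because \w also matches digits, \w+ first consumes all of "Price123" and then backtracks by one character so that \d+ can match the final "3". The overall match is still "Price123", which sits at the start of the string, so re.match() succeeds and the program prints it.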
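
To see the "only at the beginning" rule in action, it may help to run the same pattern against a string where the matching text appears later. This is a small sketch; the second sentence is made up for the comparison.

```python
import re

pattern = r"\w+\d+"

# The matching text is at position 0, so re.match() succeeds.
at_start = re.match(pattern, "Price123 is the total cost.")

# The same text appears later in the string, so re.match() returns None.
not_at_start = re.match(pattern, "The total cost is Price123.")

print(at_start.group())  # Price123
print(not_at_start)      # None
```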


Key Differences:

  • re.sub() is used for substitution and applies to the whole string.
  • re.match() only tries to match at the very start of the string; it never looks for the pattern at later positions.



Let’s dive deeper into re.sub() and re.match() with more advanced examples and explanations of regular expressions (regex) patterns.

re.sub() Advanced Example:

Suppose we want to reformat phone numbers. We have phone numbers like 123-456-7890 and we want to rewrite them as (123) 456-7890.

Example:

import re

text = "Contact me at 123-456-7890 or 987-654-3210."
formatted_text = re.sub(r'(\d{3})-(\d{3})-(\d{4})', r'(\1) \2-\3', text)

print(formatted_text)

Explanation:

  • \d{3}: This matches exactly 3 digits.
  • (\d{3}): Parentheses () are used for capturing groups. In this case, we're capturing the first three digits as one group.
  • r'(\1) \2-\3': This is the replacement string. It uses \1, \2, and \3 to refer to the captured groups (the area code, the next three digits, and the last four digits, respectively).
  • So, this example finds phone numbers in the 123-456-7890 format and converts them to (123) 456-7890.

Output:

Contact me at (123) 456-7890 or (987) 654-3210.
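
Numbered references become hard to read as patterns grow. As a variation on the example above, groups can be named with (?P<name>...) and referenced in the replacement with \g<name>. The group names area, prefix, and line are chosen here for illustration only.

```python
import re

text = "Contact me at 123-456-7890 or 987-654-3210."

# (?P<name>...) gives each capturing group a name.
phone = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"

# \g<name> in the replacement refers to the group with that name.
formatted = re.sub(phone, r"(\g<area>) \g<prefix>-\g<line>", text)

print(formatted)  # Contact me at (123) 456-7890 or (987) 654-3210.
```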

re.match() Advanced Example:

Let's now look at how we can use re.match() with more complex patterns. Assume we want to check whether a given string starts with something that looks like an email address, without validating the rest of the string.

Example:

import re

email = "someone@example.com sent you a message."

# Basic email pattern matching the start of a string
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

match = re.match(pattern, email)

if match:
    print(f"Valid email found: {match.group()}")
else:
    print("No valid email at the start")

Explanation:

  • ^[a-zA-Z0-9_.+-]+: This part matches one or more alphanumeric characters, dots (.), underscores (_), plus signs (+), or hyphens (-). The ^ anchors the match to the beginning of the string; re.match() already starts there, so the anchor is redundant here but makes the intent explicit.
  • @[a-zA-Z0-9-]+: This matches the @ symbol followed by one or more alphanumeric characters or hyphens (the domain name).
  • \.[a-zA-Z0-9-.]+: Matches a dot (.) followed by alphanumeric characters, hyphens, or additional dots (the rest of the domain, such as com or co.uk).

This pattern matches simple, common email addresses at the beginning of the string. It is a basic check, not a complete validator for every address the email standards allow.

Output:

Valid email found: someone@example.com
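
The match object carries more than the matched text. The sketch below inspects the position of the match with span(), and contrasts re.match() with re.fullmatch(), a related function in the re module that succeeds only when the entire string fits the pattern. Using fullmatch() here is a suggestion for whole-string validation, not something the example above requires.

```python
import re

pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
message = "someone@example.com sent you a message."

# re.match() accepts a match that covers only the start of the string.
prefix = re.match(pattern, message)
print(prefix.group())  # someone@example.com
print(prefix.span())   # (0, 19)

# re.fullmatch() succeeds only if the whole string fits the pattern.
whole_message = re.fullmatch(pattern, message)
whole_address = re.fullmatch(pattern, "someone@example.com")
print(whole_message)              # None
print(whole_address is not None)  # True
```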

Explaining Common Regex Patterns:

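  1. \d: Matches any digit. For str patterns this includes Unicode decimal digits; with the re.ASCII flag it is equivalent to [0-9].
  2. \w: Matches any word character (alphanumeric plus underscore). For str patterns this includes Unicode letters and digits; with the re.ASCII flag it is equivalent to [a-zA-Z0-9_].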
  3. +: Matches 1 or more occurrences of the preceding character or group.
  4. *: Matches 0 or more occurrences of the preceding character or group.
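  5. .: Matches any character except newline (with the re.DOTALL flag, newline as well).
  6. ^: Anchors the pattern to the start of the string (with re.MULTILINE, also to the start of each line).
  7. $: Anchors the pattern to the end of the string, or just before a trailing newline (with re.MULTILINE, also to the end of each line).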
  8. {m,n}: Matches between m and n occurrences of the preceding character or group.
  9. [ ]: Used to define a character set. For example, [a-z] matches any lowercase letter.
  10. (): Used for capturing groups, allowing us to extract parts of the match and reference them later (like in re.sub()).
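
The patterns in this list can be tried out one at a time. The following sketch uses re.findall() and re.search() on small made-up strings; each assertion passes silently when the pattern behaves as described.

```python
import re

# \d and + : one or more digits.
assert re.findall(r"\d+", "a1 b22 c333") == ["1", "22", "333"]

# \w : word characters; re.ASCII restricts it to [a-zA-Z0-9_].
assert re.findall(r"\w+", "caf\u00e9") == ["caf\u00e9"]
assert re.findall(r"\w+", "caf\u00e9", flags=re.ASCII) == ["caf"]

# * : zero or more, so "ac" matches "ab*c" as well.
assert re.findall(r"ab*c", "ac abc abbc") == ["ac", "abc", "abbc"]

# . : any single character except a newline.
assert re.findall(r"c.t", "cat cut c\nt") == ["cat", "cut"]

# ^ and $ : anchor to the start and end of the string.
assert re.search(r"^\w+$", "hello") is not None
assert re.search(r"^\w+$", "hello world") is None

# {m,n} : between m and n repetitions (greedy, so it takes as many as allowed).
assert re.findall(r"\d{2,3}", "1 22 333 4444") == ["22", "333", "444"]

# [ ] : a character set.
assert re.findall(r"[aeiou]", "regex") == ["e", "e"]

# ( ) : capturing groups; findall returns the groups instead of the whole match.
pairs = re.findall(r"(\w+)@(\w+)", "bob@home amy@work")
assert pairs == [("bob", "home"), ("amy", "work")]
```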

Combining re.sub() with Functions:

You can also use a function as the replacement in re.sub() if you want more dynamic behavior. Let’s see how.

Example: Capitalize every word in a sentence.

import re

text = "this is a test sentence."

def capitalize(match):
    return match.group(0).capitalize()

new_text = re.sub(r'\b\w+\b', capitalize, text)

print(new_text)

Explanation:

  • \b: Word boundary.
  • \w+: Matches one or more word characters.
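  • The capitalize() function is called for each match and returns the word with its first letter uppercased. Note that str.capitalize() also lowercases the rest of the word, so "hELLO" would become "Hello".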

Output:

This Is A Test Sentence.
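
A replacement function is most useful when the new text has to be computed from the match. As a further sketch, the hypothetical helper double() below converts each matched number to an integer, doubles it, and returns it as a string (the function must return a string).

```python
import re

text = "Apples cost 3 dollars and pears cost 12 dollars."

def double(match):
    # match.group(0) is the matched text; convert, double, convert back.
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, text)

print(doubled)  # Apples cost 6 dollars and pears cost 24 dollars.
```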

re.match() vs re.search():

If you want to search for a pattern anywhere in the string (not just at the beginning), you should use re.search() instead of re.match().

Example using re.search():

import re

text = "This is my email someone@example.com"

# Search for an email pattern anywhere in the string
pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

search = re.search(pattern, text)

if search:
    print(f"Email found: {search.group()}")
else:
    print("No email found")

Output:

Email found: someone@example.com

Here, re.search() looks for the pattern anywhere in the string, unlike re.match(), which only checks the start.
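
Running both functions on the same input makes the difference concrete. This sketch reuses the text and pattern from the example above and also prints where re.search() found the match.

```python
import re

text = "This is my email someone@example.com"
pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# The string does not begin with an email address, so re.match() fails.
at_start = re.match(pattern, text)
print(at_start)  # None

# re.search() scans forward and finds the address later in the string.
found = re.search(pattern, text)
print(found.group())  # someone@example.com
print(found.start())  # 17
```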

Summary:

  • re.sub(): Replaces matches of a pattern within a string. Can use captured groups for dynamic replacements or even a function.
  • re.match(): Checks for a match at the beginning of a string. Useful for validation or checking the start of a string.
  • re.search(): Searches for a pattern anywhere in the string, not limited to the start.

These examples should give you a more comprehensive understanding of how regex works in Python!
