Python String Matching Without Complex RegEx Syntax

Originally posted on kdnuggets.

Learn how to use human-friendly programmable regular expressions for complex Python string matching.

I have a love-hate relationship with regular expressions (RegEx), especially in Python. I love how you can extract or match strings without writing multiple functions full of conditional logic. It is even better than plain string search methods such as `str.find`.

What I don’t like is how hard RegEx patterns are to learn and understand. I can deal with simple string matching, such as extracting all alphanumeric characters and cleaning text for NLP tasks. Things get harder when it comes to extracting IP addresses, emails, and IDs from junk text: you have to write a complex RegEx pattern to extract the required item.

To make complex RegEx tasks simple, we will learn about a simple Python package called PRegEx. We will also look at a few examples of extracting dates and emails from a long string of text.


Getting Started with PRegEx


PRegEx is a higher-level API built on top of the `re` module. It offers RegEx without complex RegEx patterns, which makes it easy for any programmer to understand and remember regular expressions. Moreover, you don’t have to group patterns or escape metacharacters, and it is modular.
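To make the point about escaping concrete, here is a minimal sketch of what the plain `re` module asks of you when a literal string contains metacharacters. It uses only the standard library; the `literal` string is a made-up example, not something from the PRegEx documentation.

```python
import re

# Hypothetical literal that we want to match verbatim.
literal = "1.5 (approx)"

# Unescaped, "." matches any character, so "1.5" also matches "125".
dot_is_wildcard = re.search("1.5", "125") is not None
print(dot_is_wildcard)

# With plain `re`, the metacharacters have to be escaped, either by hand
# or with re.escape(), before the string can be used as a pattern.
escaped = re.escape(literal)
matches_itself = re.fullmatch(escaped, literal) is not None
print(escaped, matches_itself)
```

This is the bookkeeping that the paragraph above says you no longer have to do yourself.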

You can install the library using pip.

pip install pregex


To test the powerful functionality of PRegEx, we will use modified sample code from the documentation.

In the example below, we extract either an HTTP URL or an IPv4 address with a port number. We don’t have to write complex logic for it; we can use the built-in classes `HttpUrl` and `IPv4`.

  1. Create a port number using AnyDigit(). The first digit of the port should not be zero, and the next three digits can be any digit.
  2. Use Either() to combine the alternatives: match either an HTTP URL or an IP address with a port number.

from pregex.core.pre import Pregex
from pregex.core.classes import AnyDigit
from pregex.core.operators import Either
from pregex.meta.essentials import HttpUrl, IPv4

port_number = (AnyDigit() - '0') + 3 * AnyDigit()

pre = Either(
    HttpUrl(capture_domain=True, is_extensible=True),
    IPv4(is_extensible=True) + ':' + port_number
)
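
For comparison, the port-number rule described above (a first digit that is not zero, followed by three more digits) can be sketched by hand with the standard `re` module. This is only an illustration of the rule; the helper name `is_four_digit_port` and the test values are hypothetical, and the pattern is not claimed to be character-for-character what PRegEx generates.

```python
import re

# Hand-written counterpart of the port-number rule:
# one digit from 1-9, followed by exactly three digits.
PORT_PATTERN = re.compile(r"[1-9]\d{3}")


def is_four_digit_port(candidate: str) -> bool:
    """Return True if the whole string is a four-digit port with no leading zero."""
    return PORT_PATTERN.fullmatch(candidate) is not None


# Hypothetical candidates: only the first one satisfies the rule.
for candidate in ["8000", "0800", "443", "65535"]:
    print(candidate, is_four_digit_port(candidate))
```

Note that the rule only covers four-digit ports, so shorter or longer port numbers are rejected by design.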


We will use a long string of text with characters and descriptions.

text = """IPV4--


Before we extract the matching string, let’s look at the RegEx pattern.

regex_pattern = pre.get_pattern()



As we can see, it is hard to read or even understand what is going on. This is where PRegEx shines: it provides a human-friendly API for performing complex regular expression tasks.



Much like `re.findall`, `.get_matches(text)` returns every matching string, so we will use it to extract the required strings.

results = pre.get_matches(text)
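
For readers more used to the standard library, the sketch below contrasts `re.match` with `re.findall` on a made-up string. `re.findall` is the closer analogue here, since `get_matches` is used above to collect every match into a list. The sample text and pattern are illustrative assumptions, not taken from the original example.

```python
import re

# Hypothetical sample text.
sample = "id=12, id=34, id=56"

# re.match only tries to match at the very start of the string,
# and the string starts with "id=", not with a digit.
first_try = re.match(r"\d+", sample)
print(first_try)

# re.findall scans the whole string and returns every match as a list of strings.
numbers = re.findall(r"\d+", sample)
print(numbers)
```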



We have extracted both the IP address with port number and two web URLs.

['', '', '']


Example 1: Date Format


Let’s look at a couple of examples where we can understand the full potential of PRegEx.

In this example, we will be extracting certain kinds of date patterns from the text below.

text = """


By using Exactly() and AnyDigit(), we will create the day, month, and year of the date. The day and month have two digits each, whereas the year has four digits. They are separated by dashes (“-”).

After creating the pattern, we will run `get_matches` to extract the matching strings.

from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Exactly

day_or_month = Exactly(AnyDigit(), 2) 
year = Exactly(AnyDigit(), 4)

pre = (
    day_or_month +
    "-" +
    day_or_month +
    "-" +
    year
)

results = pre.get_matches(text)



['04-15-2023', '06-20-2023']


Let’s look at the RegEx pattern by using the `get_pattern()` method.

regex_pattern = pre.get_pattern()



As we can see, it has a simple RegEx syntax.
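As a point of comparison, here is a hand-written `re` pattern with the same structure as the date rule: two digits, a dash, two digits, a dash, and four digits. Whether it is character-for-character what `get_pattern()` returns is an assumption, and the sample text is made up for this sketch.

```python
import re

# Hypothetical sample text, not the one used in the example above.
sample_text = (
    "Meeting on 04-15-2023, follow-up on 2023/06/20, "
    "deadline 06-20-2023."
)

# Two digits, dash, two digits, dash, four digits.
date_pattern = re.compile(r"\d{2}-\d{2}-\d{4}")

# Only the dash-separated dates match; the slash-separated one is skipped.
dates = date_pattern.findall(sample_text)
print(dates)
```

The pattern checks only the shape of the date, not whether the day and month values are valid.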



Example 2: Email Extraction


The second example is a bit more complex: we will extract valid email addresses from junk text.

text = """


  • Create a user pattern with `OneOrMore()`. We will use `AnyButFrom()` to exclude “@” and space from the allowed characters.
  • Similar to the user pattern, we create a company pattern that also excludes the additional character “.”.
  • For the domain, we will use `MatchAtLineEnd()` to anchor the match at the end of a line, requiring two or more characters other than “@”, space, and full stop.
  • Combine all three to create the final pattern: user@company.domain.

from pregex.core.classes import AnyButFrom
from pregex.core.quantifiers import OneOrMore, AtLeast
from pregex.core.assertions import MatchAtLineEnd

user = OneOrMore(AnyButFrom("@", ' '))
company = OneOrMore(AnyButFrom("@", ' ', '.'))
domain = MatchAtLineEnd(AtLeast(AnyButFrom("@", ' ', '.'), 2))

pre = (
    user +
    "@" +
    company +
    '.' +
    domain
)

results = pre.get_matches(text)



As we can see, PRegEx has identified two valid email addresses.

['', '']
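
The same user@company.domain structure can be approximated by hand with the standard `re` module, as sketched below. This is an approximation under stated assumptions rather than the pattern PRegEx generates: it excludes all whitespace (`\s`) instead of only the space character, so a match cannot run across line breaks, and it passes `re.MULTILINE` explicitly so that `$` matches at the end of each line. The sample lines are hypothetical.

```python
import re

# Hypothetical junk text with one candidate per line.
sample_text = """alice@example.com
bob@@example.com
carol@example
dave@mail.io
"""

# user:    one or more characters other than "@" and whitespace
# company: one or more characters other than "@", whitespace and "."
# domain:  two or more such characters, anchored at the end of the line
email_pattern = re.compile(
    r"[^@\s]+@[^@\s.]+\.[^@\s.]{2,}$",
    flags=re.MULTILINE,
)

emails = email_pattern.findall(sample_text)
print(emails)
```

Like the PRegEx version, this only checks the overall shape of an address; it is not a full email validator.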


Note: both code examples are modified versions of work by The PyCoach.




If you are a data scientist, analyst, or NLP enthusiast, you should use PRegEx to clean text and create simple matching logic. It will reduce your dependency on NLP frameworks, as most of the matching can be done using a simple API.

In this mini tutorial, we have learned about the Python package PRegEx and its use cases with examples. You can learn more by reading the official documentation or solving a Wordle problem using programmable regular expressions.
