Originally posted on kdnuggets.
Learn how to use human-friendly programmable regular expressions for complex Python string matching.
I have a love-hate relationship with regular expressions (RegEx), especially in Python. I love how you can extract or match strings without writing multiple logical functions. It is even more powerful than the built-in string search methods.
What I don’t like is how hard it is for me to learn and understand RegEx patterns. I can deal with simple string matching, such as extracting all alphanumeric characters and cleaning text for NLP tasks. Things get harder when it comes to extracting IP addresses, emails, and IDs from junk text. You have to write a complex RegEx pattern to extract the required item.
To make complex RegEx tasks simple, we will learn about a simple Python package called PRegEx. We will also look at a few examples of extracting dates and emails from a long string of text.
Getting Started with PRegEx
PRegEx is a higher-level API built on top of the `re` module. It lets you build regular expressions from readable Python objects instead of cryptic pattern strings, which makes them easier for any programmer to understand and remember. Moreover, you don’t have to group patterns or escape metacharacters, and the patterns are modular.
You can install the library using pip.
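To make the point about escaping concrete, here is a small standard-library sketch (it does not use PRegEx) of the mistake that raw `re` patterns invite: an unescaped `.` matches any character, so a literal dot has to be escaped by hand or with `re.escape`. The sample strings are made up for illustration.

```python
import re

# Hypothetical sample strings, chosen only for this illustration.
samples = ["192.168.1.1", "192x168y1z1"]

# Unescaped: "." is a metacharacter that matches any character,
# so both samples match the pattern.
naive = re.compile(r"192.168.1.1")
naive_hits = [s for s in samples if naive.fullmatch(s)]

# Escaped: re.escape turns each "." into a literal dot,
# so only the real address matches.
strict = re.compile(re.escape("192.168.1.1"))
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(naive_hits)   # ['192.168.1.1', '192x168y1z1']
print(strict_hits)  # ['192.168.1.1']
```

This is the class of bug that goes away when plain strings in a pattern are treated literally, as described above.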
pip install pregex
To test the powerful functionality of PRegEx, we will use modified sample code from the documentation.
In the example below, we extract either an HTTP URL or an IPv4 address with a port number. We don’t have to create complex logic for it; we can use the built-in classes `HttpUrl` and `IPv4`.
- Create a port number using AnyDigit(). The first digit of the port should not be zero, and the next three can be any digit.
- Use Either() to match one of several alternatives: either an HTTP URL or an IP address with a port number.
from pregex.core.pre import Pregex
from pregex.core.classes import AnyDigit
from pregex.core.operators import Either
from pregex.meta.essentials import HttpUrl, IPv4

# Port: a non-zero first digit followed by three more digits.
port_number = (AnyDigit() - '0') + 3 * AnyDigit()

# Match either an HTTP URL or an IPv4 address with a port number.
pre = Either(
    HttpUrl(capture_domain=True, is_extensible=True),
    IPv4(is_extensible=True) + ':' + port_number
)
We will use a long string of text with characters and descriptions.
text = """IPV4--192.168.1.1:8000-- address--https://www.abid.works-- website--https://kdnuggets.com--text"""
Before we extract the matching string, let’s look at the RegEx pattern.
regex_pattern = pre.get_pattern()
print(regex_pattern)
The printed pattern is hard to read, let alone understand. This is where PRegEx shines: it provides a human-friendly API for performing complex regular expression tasks.
Much like `re.findall`, we will use `.get_matches(text)` to extract all the matching strings.
results = pre.get_matches(text)
print(results)
We have extracted both the IP address with port number and two web URLs.
['192.168.1.1:8000', 'https://www.abid.works', 'https://kdnuggets.com']
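For comparison, here is a rough hand-written `re` sketch of just the IP-and-port half of the task. It is an approximation written for this comparison, not the pattern PRegEx generates: the IPv4 part is deliberately simplified and does not check that each group is in the 0-255 range, and URL matching is left out entirely.

```python
import re

text = "IPV4--192.168.1.1:8000-- address--https://www.abid.works-- website--https://kdnuggets.com--text"

# Simplified IPv4: four dot-separated groups of 1-3 digits
# (assumption: no 0-255 range check, unlike a full validator).
ipv4 = r"(?:\d{1,3}\.){3}\d{1,3}"

# Port: a non-zero digit followed by exactly three more digits.
port = r"[1-9]\d{3}"

ip_with_port = re.compile(ipv4 + ":" + port)
found = ip_with_port.findall(text)
print(found)  # ['192.168.1.1:8000']
```

Even this reduced version already needs a non-capturing group, escaped dots, and counted repetition, which is the bookkeeping the PRegEx building blocks take care of.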
Example 1: Date Format
Let’s look at a couple of examples where we can understand the full potential of PRegEx.
In this example, we will be extracting certain kinds of date patterns from the text below.
text = """
04-15-2023
2023-08-15
06-20-2023
06/24/2023
"""
By using Exactly() and AnyDigit(), we will create the day, month, and year of the date. The day and month have two digits, whereas the year has 4 digits. They are separated by “-” dashes.
After creating the pattern, we will run `get_matches` to extract the matching strings.
from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Exactly

# Day and month have two digits each; the year has four.
day_or_month = Exactly(AnyDigit(), 2)
year = Exactly(AnyDigit(), 4)

pre = (
    day_or_month +
    "-" +
    day_or_month +
    "-" +
    year
)

results = pre.get_matches(text)
print(results)
Let’s look at the RegEx pattern by using the `get_pattern()` method.
regex_pattern = pre.get_pattern()
print(regex_pattern)
As we can see, it has a simple RegEx syntax.
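As a cross-check, the same date shape can be written directly with the standard `re` module. This is a hand-written sketch of an equivalent pattern; the exact string that `get_pattern()` prints may be formatted differently.

```python
import re

text = """
04-15-2023
2023-08-15
06-20-2023
06/24/2023
"""

# Two digits, a dash, two digits, a dash, four digits.
date_pattern = re.compile(r"\d{2}-\d{2}-\d{4}")

dates = date_pattern.findall(text)
print(dates)  # ['04-15-2023', '06-20-2023']
```

The year-first date and the slash-separated date are skipped because they do not fit the two-two-four dash-separated layout.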
Example 2: Email Extraction
The second example is a bit more complex: we will extract valid email addresses from junk text.
text = """
firstname.lastname@example.org
editorial@@kdnuggets.com
email@example.com.
firstname.lastname@example.org
"""
- Create a user pattern with `OneOrMore()`. We will use `AnyButFrom()` to exclude “@” and space from the allowed characters.
- Similar to the user pattern, we create a company pattern that additionally excludes the “.” character.
- For the domain, we will use `MatchAtLineEnd()` so that the domain must sit at the end of a line. It consists of two or more characters other than “@”, space, and full stop.
- Combine all three to create the final pattern, which has the shape user@company.domain.
from pregex.core.classes import AnyButFrom
from pregex.core.quantifiers import OneOrMore, AtLeast
from pregex.core.assertions import MatchAtLineEnd

# User: one or more characters other than "@" and space.
user = OneOrMore(AnyButFrom("@", ' '))
# Company: same as user, but "." is excluded as well.
company = OneOrMore(AnyButFrom("@", ' ', '.'))
# Domain: at least two such characters, at the end of a line.
domain = MatchAtLineEnd(AtLeast(AnyButFrom("@", ' ', '.'), 2))

pre = (
    user +
    "@" +
    company +
    '.' +
    domain
)

results = pre.get_matches(text)
print(results)
As we can see, PRegEx has identified two valid email addresses.
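The same user, company, and domain rules can be sketched with the standard `re` module. Note that this uses a slightly different technique from the PRegEx code: instead of a line-end assertion, it checks each line separately with `fullmatch`, and it excludes all whitespace rather than only spaces. The addresses below are made-up placeholders for illustration.

```python
import re

# Hypothetical sample addresses, one per line.
text = """
jane.doe@example.org
editorial@@example.com
sales@example.com.
contact@example.net
"""

# user:    one or more characters other than "@" and whitespace
# company: the same, but "." is excluded as well
# domain:  two or more characters other than "@", whitespace, and "."
email = re.compile(r"[^@\s]+@[^@\s.]+\.[^@\s.]{2,}")

# fullmatch requires the whole line to fit the pattern, so a doubled "@"
# or a trailing "." makes the line invalid.
valid = [line for line in text.splitlines() if email.fullmatch(line)]
print(valid)  # ['jane.doe@example.org', 'contact@example.net']
```

Because the company part may not contain a dot, this sketch rejects addresses with subdomains such as user@mail.example.com. That keeps the rules simple, but a production validator would need something more permissive.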
Note: both code examples are modified versions of work by The PyCoach.
If you are a data scientist, analyst, or NLP enthusiast, consider using PRegEx to clean text and build simple matching logic. It can reduce your dependency on NLP frameworks, as most of the matching can be done using a simple API.
In this mini tutorial, we have learned about the Python package PRegEx and its use cases with examples. You can learn more by reading the official documentation or by solving a Wordle problem using programmable regular expressions.