Regular expressions in Python

A guide to using regular expressions efficiently and easily in Python.

Jayashree domala
Level Up Coding

Regular expressions, often shortened to regex, are text-matching patterns written in a formal syntax. This guide uses Python’s built-in regular expression module, “re”.

>>> import re

Pattern searching

Let’s start off by searching for patterns in text.

>>> patterns = ['python', 'fun']
>>> sent = 'This is a sentence which says python is easy'

To search, use “re.search()”. The first argument is the pattern to look for and the second is the text to search in. It returns a match object for the first occurrence, or None if the pattern is not found.

Below, we search for “is” in “python is fun” and get a match.

>>> re.search('is', 'python is fun')
<re.Match object; span=(7, 9), match='is'>
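
When nothing matches, re.search() returns None instead of a match object, which is why its result can be tested directly in a condition. A minimal sketch; the search term 'hard' and the variable names are chosen here only for illustration:

```python
import re

# re.search() scans the string and stops at the first match.
found = re.search('is', 'python is fun')
missing = re.search('hard', 'python is fun')

print(found.span())   # (7, 9)
print(missing)        # None

# A match object is truthy and None is falsy, so both work in a condition.
if missing is None:
    print('"hard" does not occur in the text')
```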

Using re.search(), we can now check whether each pattern in the list occurs in the sentence.

>>> for pattern in patterns:
...     print('Searching for "{}" in: \n"{}"'.format(pattern, sent))
...     if re.search(pattern, sent):
...         print("\nPattern found\n")
...     else:
...         print("\nPattern not found\n")
...
Searching for "python" in:
"This is a sentence which says python is easy"

Pattern found

Searching for "fun" in:
"This is a sentence which says python is easy"

Pattern not found

Now let’s take a closer look at the match object.

>>> match = re.search(patterns[0], sent)
>>> type(match)
<class 're.Match'>

The match object has its own methods. start() returns the index where the match begins, and end() returns the index just past its last character.

>>> match.start()
30
>>> match.end()
36
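
Besides start() and end(), a match object also offers group() for the matched text and span() for both indices at once. A short sketch using the same sentence as above:

```python
import re

sent = 'This is a sentence which says python is easy'
match = re.search('python', sent)

print(match.group())   # the matched text: python
print(match.span())    # (start, end) as a tuple: (30, 36)

# end() is exclusive, so slicing with start() and end() gives back the match.
print(sent[match.start():match.end()])   # python
```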

Splitting

re.split() splits the text at every occurrence of the pattern and returns the pieces as a list.

>>> splitter = ','
>>> text = 'Hey, is this your book, No'
>>> re.split(splitter, text)
['Hey', ' is this your book', ' No']
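
Splitting on a bare comma leaves the leading spaces in place. The sketch below shows one possible way to tidy the pieces, and also the maxsplit argument of re.split(), which caps the number of splits. The variable names are illustrative only:

```python
import re

text = 'Hey, is this your book, No'

# Strip the leading spaces that splitting on a bare comma leaves behind.
parts = [piece.strip() for piece in re.split(',', text)]
print(parts)        # ['Hey', 'is this your book', 'No']

# maxsplit limits the number of splits; the remainder stays in one piece.
first_only = re.split(',', text, maxsplit=1)
print(first_only)   # ['Hey', ' is this your book, No']
```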

Finding all instances of a pattern

re.findall() returns a list of all non-overlapping matches. Its arguments are the pattern you want to match and the text to search.

>>> re.findall('python', 'python is fun and python is easy')
['python', 'python']
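
re.findall() returns only the matched text. If the positions are needed as well, re.finditer() yields match objects instead. A brief sketch, with 'java' picked here only as an example of a term that is absent:

```python
import re

text = 'python is fun and python is easy'

matches = re.findall('python', text)
print(len(matches))   # 2

# findall() returns an empty list, not None, when nothing matches.
none_found = re.findall('java', text)
print(none_found)     # []

# finditer() yields match objects, so the positions are available too.
spans = [m.span() for m in re.finditer('python', text)]
print(spans)          # [(0, 6), (18, 24)]
```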

Finding specific patterns

Metacharacters let you describe how many times a pattern may repeat.

A pattern followed by the metacharacter * may appear zero or more times.

A pattern followed by the metacharacter + must appear at least once.

A pattern followed by ? appears zero or one time.

For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.

Use {m,n} where m is the minimum and n is the maximum number of repetitions. {m,} means the pattern appears at least m times with no maximum.

The helper function below prints the matches for each pattern in a list and is used in the examples that follow.

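>>> def re_find(patterns, phrase):
...     '''
...     Print every match of each regex pattern found in phrase.
...     Input: a list of regex patterns and the phrase to search
...     Output: none; the list of matches for each pattern is printed
...     '''
...     for pattern in patterns:
...         print('Searching the phrase "{}"'.format(pattern))
...         print(re.findall(pattern, phrase))
...         print('\n')
...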

Repetition

>>> txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'
>>> patterns = ['jd*', 'jd+', 'jd?', 'jd{3}', 'jd{2,3}']
>>> re_find(patterns,txt)
Searching the phrase "jd*"
['jd', 'jd', 'j', 'j', 'jddd', 'jddd', 'jddd', 'jd', 'j', 'j', 'j', 'j', 'j', 'j', 'jdddd']
Searching the phrase "jd+"
['jd', 'jd', 'jddd', 'jddd', 'jddd', 'jd', 'jdddd']
Searching the phrase "jd?"
['jd', 'jd', 'j', 'j', 'jd', 'jd', 'jd', 'jd', 'j', 'j', 'j', 'j', 'j', 'j', 'jd']
Searching the phrase "jd{3}"
['jddd', 'jddd', 'jddd', 'jddd']
Searching the phrase "jd{2,3}"
['jddd', 'jddd', 'jddd', 'jddd']
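
The open-ended form {m,} is not part of the list above. As a small sketch on the same text, comparing it with {2,3} shows where the upper bound matters:

```python
import re

txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'

# {2,} has no upper bound, so the final run of four d's is matched in full.
open_ended = re.findall('jd{2,}', txt)
print(open_ended)   # ['jddd', 'jddd', 'jddd', 'jdddd']

# {2,3} stops after three d's, leaving the fourth d unmatched.
bounded = re.findall('jd{2,3}', txt)
print(bounded)      # ['jddd', 'jddd', 'jddd', 'jddd']
```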

Character set

A character set matches any one of a group of characters at a given position. Square brackets are used to construct character sets.

[ab] matches either a or b.

>>> txt = 'jdjd..jjjddd...jdddjddd...djdj...djjjjj...jdddd'
>>> patterns = ['[jd]',      # a single j or d
...             'j[jd]+']    # j followed by one or more j or d
>>> re_find(patterns,txt)
Searching the phrase "[jd]"
['j', 'd', 'j', 'd', 'j', 'j', 'j', 'd', 'd', 'd', 'j', 'd', 'd', 'd', 'j', 'd', 'd', 'd', 'd', 'j', 'd', 'j', 'd', 'j', 'j', 'j', 'j', 'j', 'j', 'd', 'd', 'd', 'd']
Searching the phrase "j[jd]+"
['jdjd', 'jjjddd', 'jdddjddd', 'jdj', 'jjjjj', 'jdddd']
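
As a further sketch, a set of vowels shows that each match is a single character and that the order of the characters inside the brackets makes no difference. The word 'regular' is just a sample input:

```python
import re

word = 'regular'

# Each match is exactly one character taken from the set.
vowels = re.findall('[aeiou]', word)
print(vowels)   # ['e', 'u', 'a']

# The order of the characters inside the brackets does not matter.
reordered = re.findall('[uoiea]', word)
print(reordered == vowels)   # True
```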

Exclusion

To exclude characters, put ^ at the start of the bracket notation.

[^…] matches any single character that is not in the brackets.

>>> txt = 'Hello! My name is liza, what is yours? Sir. '

Check for matches that are not one of ! , . ? or a space. The + sign requires at least one such character, so each match is a run of consecutive characters outside the excluded set. This way we can strip the punctuation and keep only the words.

>>> re.findall('[^!,.? ]+', txt)
['Hello', 'My', 'name', 'is', 'liza', 'what', 'is', 'yours', 'Sir']
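
To get a cleaned sentence rather than a list, the words can be joined back together. The sketch below also illustrates that ^ only negates when it comes first inside the brackets; the sample string 'a^b!' is made up for the demonstration:

```python
import re

txt = 'Hello! My name is liza, what is yours? Sir. '

# Keep the runs of allowed characters and join them with single spaces.
words = re.findall('[^!,.? ]+', txt)
cleaned = ' '.join(words)
print(cleaned)   # Hello My name is liza what is yours Sir

# ^ negates only as the first character inside the brackets;
# anywhere else in the set it is an ordinary character.
literal_caret = re.findall('[!^]', 'a^b!')
print(literal_caret)   # ['^', '!']
```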

Character ranges

A character range defines a character set that includes every character from the start point to the stop point.

[b-j] matches any single letter from b to j, inclusive.

>>> txt = 'Hey. I love coding in python. It is fun and easy.'
>>> patterns = ['[a-z]+', '[A-Z]+',
...             '[a-zA-Z]+',     # lower or upper case
...             '[A-Z][a-z]+']   # one upper followed by lower case
>>> re_find(patterns, txt)
Searching the phrase "[a-z]+"
['ey', 'love', 'coding', 'in', 'python', 't', 'is', 'fun', 'and', 'easy']
Searching the phrase "[A-Z]+"
['H', 'I', 'I']
Searching the phrase "[a-zA-Z]+"
['Hey', 'I', 'love', 'coding', 'in', 'python', 'It', 'is', 'fun', 'and', 'easy']
Searching the phrase "[A-Z][a-z]+"
['Hey', 'It']
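
The [b-j] range mentioned above can be tried directly, and the same notation works for digits. A minimal sketch with made-up sample strings:

```python
import re

# [b-j] matches the letters b through j, with both endpoints included.
letters = re.findall('[b-j]', 'abcdefghijkl')
print(''.join(letters))   # bcdefghij

# Ranges work for digits as well.
numbers = re.findall('[0-9]+', 'room 42, floor 7')
print(numbers)            # ['42', '7']
```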

Escape codes

Escape codes match specific types of characters in data.

\d: digit

\D: non-digit

\s: whitespace (space, tab, newline)

\S: non-whitespace

\w: alphanumeric or underscore

\W: anything other than alphanumeric or underscore

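Backslashes are also used for escape sequences in ordinary Python strings. To keep the two from clashing, regex patterns are written as raw strings, marked with the prefix “r”, so the backslashes reach the regex engine unchanged.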

>>> txt = 'Python is a #easy language and I am using it since 123 months'
>>> patterns = [r'\d+', r'\D+', r'\s+', r'\S+', r'\w+', r'\W+']
>>> re_find(patterns, txt)
Searching the phrase "\d+"
['123']
Searching the phrase "\D+"
['Python is a #easy language and I am using it since ', ' months']
Searching the phrase "\s+"
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Searching the phrase "\S+"
['Python', 'is', 'a', '#easy', 'language', 'and', 'I', 'am', 'using', 'it', 'since', '123', 'months']
Searching the phrase "\w+"
['Python', 'is', 'a', 'easy', 'language', 'and', 'I', 'am', 'using', 'it', 'since', '123', 'months']
Searching the phrase "\W+"
[' ', ' ', ' #', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
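
The effect of the “r” prefix is easiest to see with an escape that Python itself also understands. The sketch below uses \b, which is a backspace character in an ordinary string but the word-boundary escape in a regex; \b is not covered in this guide and is used here only as an illustration:

```python
import re

# A raw string keeps the backslash; a normal string needs it doubled.
print(r'\d+' == '\\d+')   # True
print(len(r'\d'))         # 2

# '\b' in a normal string is a single backspace character, while r'\b'
# reaches the regex engine as the word-boundary escape.
plain = re.findall('\bis', 'this is')
raw = re.findall(r'\bis', 'this is')
print(plain)   # []
print(raw)     # ['is']
```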
