A Comprehensive Guide of Regular Expressions using Python

Handy concepts of the ‘re’ module in text analysis

What is a regular expression?

A regular expression, also known as RegEx, is a sequence of characters that defines a search pattern. It helps us match or find a set of strings, a word, a letter, or even a number. We write the pattern in a special syntax and pass it to a matching function. Python's ‘re’ module provides regular expression matching operations similar to those found in Perl.

Regular expressions rely on special symbols, each of which has its own meaning. They are used in search engines, in the search-and-replace dialogs of word processors and text editors, in text processing utilities, and in lexical analysis.

Topics to be covered:

1. RegEx functions
2. Using r as a prefix before a String
3. Metacharacters
4. Special Sequences
5. Using SETS
6. Match Objects

What to know before writing a RegEx program in python

A RegEx program consists of a module and statements containing functions, metacharacters, sets, and sequences. Let us now look at each of these topics in detail.

RegEx module

Python has a built-in package called “re”. Whenever we use RegEx, we need to import this module first.

Example: import re

RegEx functions

In Python, the re module provides a set of built-in functions that allow us to search a string for a match. Some of these functions are shown below:

1. re.search(): The function returns a match object if there is a match in the string.

Program:

import re
string = "Happy vaccine to us"
result = re.search("to", string)
print(result)
Output:
<re.Match object; span=(14, 16), match='to'>

Explanation: Here, we use the built-in function search() to find a match for “to” in the string. Since “to” occurs in the string, a match object is returned, and its span gives the position of the match (from index 14 up to, but not including, index 16). If there is no match, search() returns None.
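
Since search() returns None when nothing matches, calling a method on the result without checking it first raises an error. A minimal sketch of a safe lookup follows; the helper name find_span is hypothetical and used only for illustration:

```python
import re

def find_span(word, text):
    """Return the (start, end) span of the first match, or None if absent."""
    # re.escape treats the word as literal text rather than as a pattern
    match = re.search(re.escape(word), text)
    if match is None:
        return None
    return match.span()

string = "Happy vaccine to us"
print(find_span("to", string))   # (14, 16)
print(find_span("bad", string))  # None
```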

2. re.findall(): The function returns a list of all the matches present in the string.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("t", string)
print(result)
Output:
['t']

Explanation: Here, the findall() function searches for the letter “t” and returns one list item for each time it occurs in the sentence. The letter occurs only once, so the list has a single item. If there is no match, an empty list is returned.
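
To see the difference between one match, several matches, and none, the same call can be tried with other letters. This is a small sketch on the same sentence:

```python
import re

string = "Happy vaccine to us"
two_matches = re.findall("c", string)  # "c" occurs twice in "vaccine"
no_matches = re.findall("z", string)   # "z" does not occur at all

print(two_matches)  # ['c', 'c']
print(no_matches)   # [] (an empty list, not None)
```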

3. re.split(): The function returns a list where the string splits at each match.

Program:

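import re
string = "Happy vaccine to us"
result = re.split(r"\s", string, maxsplit=1)
print(result)
Output:
['Happy', 'vaccine to us']

Explanation: Here, the split function splits the string only at the first occurrence of a white space character, because maxsplit is set to 1. (\s means white space character. The r prefix before the pattern is covered in the section ‘Using r as a prefix before a String’.)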
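
For comparison, here is a short sketch of what happens when the split limit is left out or raised. It uses the same sentence and nothing beyond re.split() itself:

```python
import re

string = "Happy vaccine to us"
all_parts = re.split(r"\s", string)               # split at every white space
two_splits = re.split(r"\s", string, maxsplit=2)  # stop after two splits

print(all_parts)   # ['Happy', 'vaccine', 'to', 'us']
print(two_splits)  # ['Happy', 'vaccine', 'to us']
```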

4. re.sub(): The function replaces the stated text with our choice.

Program:

import re
string = "Happy vaccine to us"
result = re.sub(r"\s", "-", string)
print(result)
Output:
Happy-vaccine-to-us

Explanation: In this program, we use the sub() function to replace every white space character with the symbol ‘-’.
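
The number of replacements can also be limited or counted. The sketch below uses the count argument of re.sub() and the related re.subn() function, which returns the new string together with the number of replacements made:

```python
import re

string = "Happy vaccine to us"
first_two = re.sub(r"\s", "-", string, count=2)  # replace only the first two
replaced, total = re.subn(r"\s", "-", string)    # also report how many

print(first_two)        # Happy-vaccine-to us
print(replaced, total)  # Happy-vaccine-to-us 3
```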

Using r as a prefix before a String

Let us look at this concept using an example.

Program:

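import re
string = 'Happy vaccine to us'
result = re.findall(r"\AHappy", string)
print(result)
Output:
['Happy']

Explanation: In this program, the r prefix makes r"\AHappy" a raw string, so Python passes the backslash and the ‘A’ to the re module unchanged. The re module then reads ‘\A’ as a special sequence meaning ‘start of the string’. Without the r prefix, Python first tries to interpret the backslash as the start of a string escape (for example, "\b" becomes a backspace character), so the pattern may not reach the re module as written.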
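
The effect of the r prefix is easiest to see with ‘\b’, which Python itself understands as a backspace character. A small sketch comparing the two spellings (the variable names are chosen for illustration):

```python
import re

plain = "\bto"  # Python converts \b into a single backspace character
raw = r"\bto"   # backslash is kept, so re sees the word-boundary sequence \b

string = "Happy vaccine to us"
plain_result = re.findall(plain, string)
raw_result = re.findall(raw, string)

print(len(plain), len(raw))  # 3 4
print(plain_result)          # [] (the string has no backspace character)
print(raw_result)            # ['to']
```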

Metacharacters

In regular expressions, symbols play a major role, as each of them carries a special meaning. Metacharacters are such symbols: characters with a special meaning. Let us look at a few of them and understand what they mean.

1. []: Square brackets define a set of characters; the pattern matches any one character from the set.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("[a-h]", string)
print(result)
Output:
['a', 'a', 'c', 'c', 'e']

Explanation: This program finds all the lowercase letters in the string that fall between ‘a’ and ‘h’. If no such letter is present, the output is an empty list, [].

2. . : The dot matches any single character other than a newline.

Program:

import re
string = "Happy vaccine to us"
r = re.findall("v..", string)
s = re.findall("v.", string)
print(r)
print(s)
Output:
['vac']
['va']

Explanation: Each dot stands for exactly one character (other than a newline). The pattern “v..” matches ‘v’ followed by any two characters, so the first output is ‘vac’. The pattern “v.” matches ‘v’ followed by any one character, so the second output is ‘va’.
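
The newline exception can be checked directly. In the sketch below, the sample text is made up so that a newline follows the letter ‘a’; the re.DOTALL flag is a standard option that lets the dot match newlines as well:

```python
import re

text = "va\nccine"  # a newline sits right after the "a"
default = re.findall("a.", text)
with_dotall = re.findall("a.", text, flags=re.DOTALL)

print(default)      # []
print(with_dotall)  # ['a\n']
```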

3. ^: This character shows whether a string starts with a particular word/character.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("^The", string)
if result:
    print("The string starts with 'The'")
else:
    print("Not matching")
Output:
Not matching

Explanation: Here, the program checks whether the string starts with the word ‘The’. To do this, we use the ‘^’ character. The string starts with ‘Happy’, so findall() returns an empty list and the else branch prints ‘Not matching’.
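
By default ‘^’ only looks at the very start of the whole string. A brief sketch with a made-up two-line text shows how the standard re.MULTILINE flag extends it to the start of every line:

```python
import re

text = "Happy vaccine\nto us"  # two lines
start_only = re.findall("^to", text)
each_line = re.findall("^to", text, flags=re.MULTILINE)

print(start_only)  # []
print(each_line)   # ['to']
```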

4. *: This character finds whether a string contains a particular character zero or more times.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("to*", string)
print(result)
Output:
['to']

Explanation: The program searches for “t” followed by zero or more “o” characters. The string contains one such match, ‘to’.

5. +: This character finds whether a string contains a particular character one or more times.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("a+", string)
print(result)
Output:
['a', 'a']

Explanation: The program searches for runs of one or more consecutive “a” characters. The string contains two separate single “a” characters, so each is returned as its own match.
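
The difference between ‘*’ and ‘+’ shows up when the repeated character is missing. The following sketch runs both on the same sentence; the patterns are picked only to illustrate the two cases:

```python
import re

string = "Happy vaccine to us"
star_many = re.findall("vac*", string)  # "va" plus as many "c" as possible
star_zero = re.findall("ux*", string)   # zero "x" is still a match
plus_none = re.findall("ux+", string)   # at least one "x" is required
plus_many = re.findall("p+", string)    # the run "pp" is a single match

print(star_many)  # ['vacc']
print(star_zero)  # ['u']
print(plus_none)  # []
print(plus_many)  # ['pp']
```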

6. {}: Curly braces match exactly the specified number of occurrences of the preceding character.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("vac{2}", string)
print(result)
Output:
['vacc']

Explanation: This program searches for ‘va’ followed by exactly two “c” characters, which matches ‘vacc’.
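
Curly braces also accept a range in the form {m,n}, or an open-ended minimum {m,}. A short sketch on a made-up test string, using ‘\b’ (word boundary) so that longer runs are not partially matched:

```python
import re

text = "ac acc accc acccc"  # made-up sample with 1 to 4 "c" characters
exactly_two = re.findall(r"ac{2}\b", text)
two_to_three = re.findall(r"ac{2,3}\b", text)
three_or_more = re.findall(r"ac{3,}", text)

print(exactly_two)    # ['acc']
print(two_to_three)   # ['acc', 'accc']
print(three_or_more)  # ['accc', 'acccc']
```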

7. \: This character signals a special sequence (it can also be used to escape a metacharacter).

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\d", string)
print(result)
Output:
['2']

Explanation: Here, the program finds all the digits present in the string, i.e., ‘2’.

8. |: This character matches if at least one of the alternatives on either side of it matches.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("hap | vac", string)
print(result)
if result:
    print("There is a match")
else:
    print("Not matching")
Output:
[' vac']
There is a match

Explanation: This program checks whether at least one of the two alternatives in “hap | vac” matches. Note that the spaces are part of the pattern, so the alternatives are ‘hap ’ and ‘ vac’. Matching is case-sensitive, so ‘hap ’ does not match ‘Happy’; only ‘ vac’ (with its leading space) is found, and it is returned as the output.
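
Two variations are worth trying here: dropping the spaces around ‘|’, and ignoring case with the standard re.IGNORECASE flag. A minimal sketch on the same sentence:

```python
import re

string = "Happy vaccine to us"
no_spaces = re.findall("hap|vac", string)
ignore_case = re.findall("hap|vac", string, flags=re.IGNORECASE)

print(no_spaces)    # ['vac']
print(ignore_case)  # ['Hap', 'vac']
```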

9. $: This character shows whether a string ends with a particular word/character.

Program:

import re
string = "Happy vaccine to us"
result = re.findall("us$", string)
print(result)
if result:
    print("The string ends with 'us'")
else:
    print("Not matching")
Output:
['us']
The string ends with 'us'

Explanation: The above program checks whether the searched text is present at the end of the string. Here, the word “us” is at the end of the string (it matches the search), so the first statement is printed.

Special Sequences

1. \d: This kind of sequence returns a list of all digits present in a string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\d", string)
print(result)
Output:
['2']

2. \D: This kind of sequence returns a list of all characters other than digits present in a string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\D", string)
print(result)
Output:
['H', 'a', 'p', 'p', 'y', ' ', ' ', 'v', 'a', 'c', 'c', 'i', 'n', 'e', ' ', 't', 'o', ' ', 'u', 's']

3. \s: This kind of sequence returns a list containing white space characters present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\s", string)
print(result)
Output:
[' ', ' ', ' ', ' ']

4. \S: This kind of sequence returns a list of characters other than white space characters present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\S", string)
print(result)
Output:
['H', 'a', 'p', 'p', 'y', '2', 'v', 'a', 'c', 'c', 'i', 'n', 'e', 't', 'o', 'u', 's']

5. \w: This kind of sequence returns a list of every word character (a-z, A-Z, 0-9, and _) present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\w", string)
print(result)
Output:
['H', 'a', 'p', 'p', 'y', '2', 'v', 'a', 'c', 'c', 'i', 'n', 'e', 't', 'o', 'u', 's']

6. \W: This kind of sequence returns a list of all non-word characters present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\W", string)
print(result)
Output:
[' ', ' ', ' ', ' ']
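
In practice these sequences are often combined with ‘+’ so that whole words or whole numbers come back instead of single characters. A small sketch; the second sample sentence is made up for illustration:

```python
import re

string = "Happy 2 vaccine to us"
words = re.findall(r"\w+", string)  # runs of word characters
numbers = re.findall(r"\d+", "Dose 1 on day 12, dose 2 on day 40")

print(words)    # ['Happy', '2', 'vaccine', 'to', 'us']
print(numbers)  # ['1', '12', '2', '40']
```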

7. \Z: This sequence returns a match if the specified characters are present at the end of the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"us\Z", string)
print(result)
if result:
    print("There is a match")
else:
    print("Not matching")
Output:
['us']
There is a match

8. \A: This sequence returns a match if the specified characters are present at the beginning of the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\AHap", string)
print(result)
if result:
    print("There is a match")
else:
    print("Not matching")
Output:
['Hap']
There is a match

9. \b: This sequence returns a match if the specified characters are present at the beginning or at the end of a word.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\bat", string)  # "at" at the beginning of a word
output = re.findall(r"at\b", string)  # "at" at the end of a word
print(result)
print(output)
Output:
[]
[]

10. \B: This sequence returns a match if the specified characters are present, but not at the beginning or end of a word.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall(r"\Bac", string)  # "ac" not at the beginning of a word
output = re.findall(r"ap\b", string)  # "ap" at the end of a word: none here
print(result)
print(output)
Output:
['ac']
[]
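
Since the sample sentence contains no word with “at”, the following sketch contrasts ‘\b’ and ‘\B’ using letter pairs that do occur in it. The patterns are chosen only for illustration:

```python
import re

string = "Happy 2 vaccine to us"
word_start = re.findall(r"\bva", string)     # "va" opens the word "vaccine"
not_word_start = re.findall(r"\Bva", string) # "va" never occurs mid-word
inside_word = re.findall(r"ap\B", string)    # "ap" in "Happy" is followed by "p"
word_end = re.findall(r"us\b", string)       # "us" closes the last word

print(word_start)      # ['va']
print(not_word_start)  # []
print(inside_word)     # ['ap']
print(word_end)        # ['us']
```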

Using SETS

1. [act]: This set returns a list containing every occurrence of the specified characters, “a”, “c” or “t”, present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[act]", string)
print(result)
Output:
['a', 'a', 'c', 'c', 't']

2. [a-h]: This set returns a list of all lower case characters present in the string between ‘a’ and ‘h’.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[a-h]", string)
print(result)
Output:
['a', 'a', 'c', 'c', 'e']

3. [^ahf]: This set returns a list of all characters present in the string except ‘a’, ‘h’, and ‘f’.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[^ahf]", string)
print(result)
Output:
['H', 'p', 'p', 'y', ' ', '2', ' ', 'v', 'c', 'c', 'i', 'n', 'e', ' ', 't', 'o', ' ', 'u', 's']

4. [015]: This set returns a list of the specified digits, ‘0’, ‘1’ or ‘5’, present in the string. The string below contains only the digit ‘2’, so the list is empty.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[015]", string)
print(result)
Output:
[]

5. [0-9]: This set returns a list of any digit between 0 and 9 present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[0-9]", string)
if result:
    print("There is a match")
else:
    print("No match")
Output:
There is a match

6. [0-5][0-9]: This set returns a list of any two-digit numbers between 00 and 59 present in the string.

Program:

import re
string = "Happy 22 vaccine to us"
result = re.findall("[0-5][0-9]", string)
print(result)
Output:
['22']
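
A common use of the 00 to 59 range is checking the minutes of a clock time. The sketch below is one possible way to do it; the function name is_valid_time and the HH:MM format are assumptions made for this example, and re.fullmatch() requires the whole string to fit the pattern:

```python
import re

def is_valid_time(text):
    """Check a 24-hour HH:MM time: hours 00-23, minutes 00-59."""
    pattern = r"([01][0-9]|2[0-3]):[0-5][0-9]"
    return re.fullmatch(pattern, text) is not None

print(is_valid_time("09:45"))  # True
print(is_valid_time("23:59"))  # True
print(is_valid_time("12:60"))  # False (minutes out of range)
print(is_valid_time("24:00"))  # False (hours out of range)
```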

7. [a-zA-Z]: This set returns a list of all uppercase and lowercase letters (from a-z or A-Z) present in the string.

Program:

import re
string = "Happy 2 vaccine to us"
result = re.findall("[a-zA-Z]", string)
print(result)
Output:
['H', 'a', 'p', 'p', 'y', 'v', 'a', 'c', 'c', 'i', 'n', 'e', 't', 'o', 'u', 's']

8. [+]: This set returns a list containing the ‘+’ symbol. Inside a set, ‘+’ has no special meaning and is treated as an ordinary character. This also applies to ‘*’, ‘.’, ‘|’, ‘()’, ‘$’, ‘{}’.

Program:

import re
string = "Happy 2 + vaccine to us"
result = re.findall("[+]", string)
print(result)
Output:
['+']
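
Several such symbols can share one set, and outside a set the same symbol needs a backslash to be read literally. A brief sketch on a made-up arithmetic string:

```python
import re

expression = "1+2*3.5"  # made-up sample text
in_set = re.findall("[+*.]", expression)  # all three are literal inside []
escaped = re.findall(r"\+", expression)   # outside a set, escape the "+"

print(in_set)   # ['+', '*', '.']
print(escaped)  # ['+']
```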

Match Objects

A match object is an object that contains information about the search and its result.

1. match.group()

The group() method returns the part of the string where there was a match.

Program:

import re
string = "The book is on the box"
x = re.search(r"\bb\w+", string)
print(x.group())
Output:
book

Explanation: Here, the pattern searches for a word that starts with the letter ‘b’: ‘\b’ marks a word boundary, ‘b’ is the letter itself, and ‘\w+’ matches the remaining word characters. Since the search matches, the first such word, ‘book’, is printed.

2. match.span()

The span() method returns a tuple with the start and end position of the first match. Let us look at the example below:

Program:

import re
string = "The book is on the box"
x = re.search(r"\bb\w+", string)
print(x.span())
Output:
(4, 8)

Explanation: Here, the program searches for the first word that starts with the letter ‘b’. The output (4, 8) means that the match ‘book’ starts at index 4 and ends just before index 8 (indexes start at 0 and include white-space characters).
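
The two numbers from span() can be used directly as slice positions on the original string. A minimal sketch, which also uses the match object's start() and end() methods:

```python
import re

string = "The book is on the box"
x = re.search(r"\bb\w+", string)

start, end = x.span()
matched_text = string[start:end]  # slicing with the span gives the match

print(x.start(), x.end())  # 4 8
print(matched_text)        # book
```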

3. match.string

The string property returns the string that was passed into the function.

Program:

import re
string = "The book is on the box"
x = re.search(r"\bb\w+", string)
print(x.string)
Output:
The book is on the box

Explanation: The above program shows that the string property returns the entire string that was passed into the function, not just the matched part.
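
search() only reports the first match, so ‘box’ never appears in the examples above. One way to get a match object for every occurrence is re.finditer(), sketched here on the same sentence:

```python
import re

string = "The book is on the box"
found = []
for match in re.finditer(r"\bb\w+", string):
    # each item is a full match object with group() and span()
    found.append((match.group(), match.span()))

print(found)  # [('book', (4, 8)), ('box', (19, 22))]
```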

Conclusion

Hey, readers!! This article has covered all the basic concepts of regular expressions. I encourage you to explore more on this topic by executing programs and hope you had fun reading this article.

I hope you like the article. Reach me on my LinkedIn and twitter.
