MarsDevs introduces you to Regular Expressions in Python

A Regular Expression (RegEx or RE) is a special sequence of characters that defines a search pattern for finding a string or a set of strings. Regular expressions are supported by Python, Java, R, and many other languages.

Common uses of regular expressions are given below.

  • To find patterns in a string or file.
  • To find a string or a substring in a file.
  • To split the string into substrings.
  • To replace part of a string with another string.
  • To validate email format.

Python has a module named "re" that supports regular expressions. Its search() function returns either the first match or None. Consider the following example,
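A minimal sketch of such a search. The sample sentence is made up for illustration; only the pattern r'portal' comes from the text.

```python
import re

# Sample sentence is an assumption made for this sketch.
text = "MarsDevs: a developer portal for Python learners"

# re.search() returns a Match object for the first occurrence, or None.
match = re.search(r"portal", text)

if match:
    print("Start index:", match.start())
    print("End index:", match.end())
else:
    print("No match")
```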

The r prefix in r'portal' marks a raw string; it is not part of the regex. A raw string differs from a regular string in that it treats the \ character as a literal backslash rather than as the start of an escape sequence. This matters because regular expressions use the \ character for their own escaping.

Consider another example,
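The difference between raw and regular strings can be sketched as follows. The path value is a hypothetical sample chosen only because it contains a backslash.

```python
import re

# In a regular string, "\n" is one character (a newline).
# In a raw string, r"\n" is two characters: a backslash and the letter n.
print(len("\n"), len(r"\n"))

# To match a literal backslash, the regex itself needs "\\".
# A raw string keeps that readable; a regular string needs four backslashes.
path = r"C:\temp"  # hypothetical sample value
raw_hit = re.search(r"\\temp", path)
plain_hit = re.search("\\\\temp", path)
print(raw_hit.group(), plain_hit.group())
```

Both searches find the same text; the raw-string version is simply easier to read and harder to get wrong.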

Let's discuss how to actually write a regex using metacharacters or special sequences.

Meta Characters

Metacharacters are characters that have a special meaning inside a pattern. They are important because they appear in the patterns passed to the functions of the re module. They are briefly explained below.

(1) Backslash (\)

It is used to remove the special meaning of the character that follows it. For example, the dot (.) is a special character, so to find a literal dot in a string we escape it as \. in the pattern.
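A short sketch of escaping the dot, assuming a made-up sample string:

```python
import re

sample = "marsdevs.com"  # sample string assumed for illustration

# Unescaped dot: matches any character, so the first hit is simply "m".
any_char = re.search(r".", sample)
print(any_char.group())

# Escaped dot: matches only the literal "." character.
dot = re.search(r"\.", sample)
print(dot.group(), dot.start())
```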

(2) Square Brackets ([])

They define a set of characters; the set matches any single character it contains.

For example,

  • [0-4] is the same as [01234]
  • [a-d] is the same as [abcd]
  • [^a-d] matches any character except a, b, c, or d
  • [^0-3] matches any character except 0, 1, 2, or 3

A caret (^) placed as the first character inside the brackets inverts the set.
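The sets listed above can be tried out with re.findall(). This is a small sketch; the sample string is an assumption.

```python
import re

sample = "mars42devs"  # sample string assumed for illustration

digits_0_to_4 = re.findall(r"[0-4]", sample)   # characters 0, 1, 2, 3, 4
letters_a_to_d = re.findall(r"[a-d]", sample)  # characters a, b, c, d
not_a_to_d = re.findall(r"[^a-d]", sample)     # everything except a-d

print(digits_0_to_4)
print(letters_a_to_d)
print(not_a_to_d)
```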

(3) Caret (^)

It is used to match the beginning of the string i.e. it checks whether the string starts with the given character or not.

For example,

  • ^M checks if the input string starts with M or not
  • ^Moh checks if the input string starts with Moh or not.

(4) Dollar ($)

It is used to match the end of the string i.e. it checks whether the string ends with the given character or not.

For example,

  • i$ checks if the input string ends with i or not
  • dhi$ checks if the input string ends with dhi or not.
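A minimal sketch of both anchors, caret and dollar, using a made-up sample name. Note that the dollar sign goes at the end of the pattern.

```python
import re

name = "Mohandas Gandhi"  # sample string assumed for illustration

# ^ anchors the pattern to the start of the string.
starts_with_moh = re.search(r"^Moh", name) is not None

# $ anchors the pattern to the end of the string.
ends_with_dhi = re.search(r"dhi$", name) is not None
ends_with_moh = re.search(r"Moh$", name) is not None

print(starts_with_moh, ends_with_dhi, ends_with_moh)
```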

(5) Dot (.)

It is used to match only a single character except for the newline character (\n).

For example,

  • x.y will allow exactly one character in place of the dot (.), i.e., xay, xby, x1y, etc., but not xy or xaby.
  • .. will match any two consecutive characters.
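To check that the dot stands for exactly one character, here is a small sketch that uses re.fullmatch() so the whole candidate string has to fit the pattern. The candidate strings are chosen for illustration.

```python
import re

candidates = ["xay", "x1y", "xy", "xaby", "x\ny"]

# Map each candidate to True/False depending on whether x.y matches it fully.
dot_results = {c: re.fullmatch(r"x.y", c) is not None for c in candidates}

for candidate, matched in dot_results.items():
    print(repr(candidate), matched)
```

Only xay and x1y match: xy has no character between x and y, xaby has two, and the last candidate has a newline, which the dot does not match by default.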

(6) Or (|)

It is an alternation operator: the match succeeds if either the pattern before the | symbol or the pattern after it is present in the string.

For example, 

  • x|y will match any string that contains x or y such as xxx, yyy, xaby, etc.
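A brief sketch of alternation. The sample strings and the cat|dog pattern are assumptions made for illustration.

```python
import re

# Each x or y found in the string is a separate match.
xy_hits = re.findall(r"x|y", "xaby")

# Alternation also works with longer alternatives.
pet = re.search(r"cat|dog", "hot dog")

# Neither alternative is present, so there is no match.
no_hit = re.search(r"x|y", "abc")

print(xy_hits, pet.group(), no_hit)
```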

(7) Question Mark (?)

It matches zero or one occurrence of the regex before the question mark, i.e. it makes that part of the pattern optional.

For example,

  • xy?z will be matched for the strings xz and xyz (and is therefore also found inside xzy and wxyz), but not for xyyz because there are two y's. Similarly, it will not match xywz because y is not followed by z.

(8) Star (*)

It matches zero or more occurrences of the regex before the * symbol in the sequence.

For example,

  • xy*z will be matched for the string xz, xyz, xyyyz, etc. but not xywz, because the w breaks the sequence.

(9) Plus (+)

It matches one or more occurrences of the regex before the + symbol in the sequence.

For example,

  • xy+z will be matched for the string xyz, xyyz, xyyyz, etc. but not xz, xywz, etc.
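The three quantifiers ?, * and + can be compared side by side. This sketch uses re.fullmatch() so that each candidate must fit the pattern as a whole; the candidate list is an assumption.

```python
import re

candidates = ["xz", "xyz", "xyyz", "xywz"]
patterns = [r"xy?z", r"xy*z", r"xy+z"]

# For each pattern, collect the candidates it matches completely.
quantifier_results = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, matched in quantifier_results.items():
    print(pattern, matched)
```

The string xywz never matches, because none of the patterns allow a w between the y's and the z.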

(10) Braces ( {m,n} )

It matches from m to n repetitions (both inclusive) of the regex before the braces. Write the bounds without a space after the comma.

For example,

  • x{2,4} will be matched for the string xxxy, yxxxxz, fxxd, etc., but not xy, xyz, etc.
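A small sketch of the bounded quantifier on the strings listed above. It also shows that, in Python's re module, a space inside the braces stops them from being read as a quantifier, so the pattern is written as x{2,4}.

```python
import re

samples = ["xxxy", "yxxxxz", "fxxd", "xy", "xyz"]

# Record the matched text for each sample, or None when there is no match.
brace_results = {}
for s in samples:
    hit = re.search(r"x{2,4}", s)
    brace_results[s] = hit.group() if hit else None

print(brace_results)

# With a space, the braces are treated as literal text, not as a quantifier.
spaced = re.search(r"x{2, 4}", "xxx")
print(spaced)
```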

(11) Group ( () )

It is used to group various regular expressions together and then find a match in a string.

For example,

  • (ab) is a group that can be matched in string ababaahabdyy.
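A minimal sketch of the (ab) group on the string from the example. The (ab)+ pattern is an extra illustration of applying a quantifier to a whole group.

```python
import re

sample = "ababaahabdyy"

# With one group in the pattern, findall() returns the text captured by it.
group_hits = re.findall(r"(ab)", sample)

# A quantifier after the group repeats the whole group, not just one letter.
repeated = re.search(r"(ab)+", sample)

print(group_hits)
print(repeated.group(), repeated.group(1))
```

Here group() gives the full match and group(1) gives the text captured by the first pair of parentheses.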

Special Sequences

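A special sequence is a backslash followed by a character. Some of them (\A, \b, \B, \Z) do not match an actual character in the string; rather, they specify the location in the search string where the match should occur. The others (\d, \D, \s, \S, \w, \W) are shorthand for common sets of characters. This makes it easier to write commonly used patterns.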

| Sequence | Description | Syntax | Example |
| --- | --- | --- | --- |
| \A | It matches only at the beginning of the string, i.e. if the string begins with the given regex. | \Amars | marsdevs |
| \b | It matches at a word boundary, i.e. if a word begins or ends with the given regex. \b(string) is for the beginning check; (string)\b is for the ending check. | \bmars | marsdevs |
| \B | It matches only where there is no word boundary, i.e. the given regex must not sit at the beginning (or end) of a word. It is the opposite of \b. | \Bde | marsdevs |
| \d | It matches any decimal digit [0-9]. | \d | marsdevs1 |
| \D | It matches any non-digit character [^0-9]. It is the opposite of \d. | \D | marsdevs1 |
| \s | It matches any whitespace character. | \s | mars devs |
| \S | It matches any non-whitespace character. It is the opposite of \s. | \S | mars devs |
| \w | It matches any alphanumeric character or underscore [a-zA-Z0-9_]. | \w | MarsDevs1 |
| \W | It matches any character that \w does not match. It is the opposite of \w. | \W | #@%< |
| \Z | It matches only at the end of the string, i.e. if the string ends with the given regex. | devs\Z | marsdevs |

To see the above sequences in action, consider the following example code with the given string.
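A compact sketch that exercises several of the sequences from the table. The sample string is an assumption chosen so that it contains words, spaces, and digits.

```python
import re

sample = "mars devs 2024"  # sample string assumed for illustration

sequence_results = {
    "digits": re.findall(r"\d", sample),        # every decimal digit
    "spaces": re.findall(r"\s", sample),        # every whitespace character
    "words": re.findall(r"\w+", sample),        # runs of word characters
    "starts_with_mars": re.search(r"\Amars", sample) is not None,
    "ends_with_2024": re.search(r"2024\Z", sample) is not None,
    "devs_at_word_start": re.search(r"\bdevs", sample) is not None,
    "devs_inside_word": re.search(r"\Bdevs", sample) is not None,
}

for name, value in sequence_results.items():
    print(name, value)
```

The last entry is False because devs follows a space, so it sits on a word boundary, which is exactly what \B rules out.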

Regex Module

Python provides a built-in module named re for working with regular expressions. Import this module by using the import statement.

Syntax

import re

There are various functions provided by this module for working with regex in Python. We will briefly discuss these functions here.

(1) re.findall()

It returns all non-overlapping matches of the pattern in the string as a list. The string is scanned from left to right, and matches are returned in the order they were found.

Consider the following example,
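A minimal sketch of re.findall(). The sample sentence and its numbers are made up.

```python
import re

# Sample sentence assumed for illustration.
text = "Call 555-0134 or 555-0199 after 6 pm"

# \d+ matches each run of one or more digits, scanning left to right.
numbers = re.findall(r"\d+", text)
print(numbers)
```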

(2) re.compile()

It compiles a regular expression into a Pattern object, which provides methods for various tasks such as searching for matches or performing string substitutions. Compiling is useful when the same pattern is used several times.

Consider the following example,
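A short sketch of compiling a pattern once and reusing it. The vowel pattern and the sample word are assumptions.

```python
import re

# Compile once; the Pattern object can then be reused.
vowels = re.compile(r"[aeiou]")

found = vowels.findall("marsdevs")     # all lowercase vowels in the word
masked = vowels.sub("*", "marsdevs")   # replace each vowel with an asterisk

print(found)
print(masked)
```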

(3) re.split()

It splits the string at each occurrence of a character or pattern and returns the remaining pieces (the text between the matches) as a list.

Syntax

re.split(pattern, string, maxsplit=0, flags=0)

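The parameters are

  • pattern - the regular expression to split on.
  • string - the string to be split.
  • maxsplit - the maximum number of splits. The default of 0 means no limit; if 1, the string will be split only once, resulting in a list of length 2. It is an optional parameter.
  • flags - modifies how the pattern is matched, e.g. flags=re.IGNORECASE makes the match case-insensitive. It is an optional parameter.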

Consider the following example,
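A sketch of re.split() covering the maxsplit and flags parameters. All sample strings are made up for illustration.

```python
import re

# Split on runs of non-word characters (commas, spaces, ...).
words = re.split(r"\W+", "Words, words , Words")

# maxsplit=1: split only at the first run of digits.
once = re.split(r"\d+", "on 12th Jan 2016, at 11:02 AM", maxsplit=1)

# flags=re.IGNORECASE: [a-f] also matches the uppercase B.
pieces = re.split(r"[a-f]+", "0a3B9", flags=re.IGNORECASE)

print(words)
print(once)
print(pieces)
```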

(4) re.sub()

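The name sub stands for substitute. It searches for the pattern in the given string and replaces every match with "repl". The optional count parameter limits how many matches are replaced; the default of 0 replaces all of them.

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

Consider the following example,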
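A minimal sketch of re.sub(), including the count and flags parameters. The sample strings are assumptions.

```python
import re

# Replace every run of whitespace with a single hyphen.
slug = re.sub(r"\s+", "-", "mars  devs   blog")

# count=1: replace only the first match.
first_only = re.sub(r"s", "S", "marsdevs", count=1)

# flags=re.IGNORECASE: the uppercase pattern also matches lowercase text.
fixed_case = re.sub(r"MARS", "Mars", "mars devs", flags=re.IGNORECASE)

print(slug)
print(first_only)
print(fixed_case)
```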

(5) re.subn()

It is the same as sub() except for the form of its output. subn() returns a tuple containing the new string and the number of replacements made, in that order.

Syntax

re.subn(pattern, repl, string, count=0, flags=0)

Consider the following example,
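A short sketch of re.subn() on a made-up sample word, unpacking the returned tuple:

```python
import re

# subn() returns (new_string, number_of_replacements).
new_text, replacements = re.subn(r"s", "S", "marsdevs")

print(new_text)
print(replacements)
```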

(6) re.escape()

It returns the string with its special characters escaped by backslashes, which helps when you want to match an arbitrary literal string that may contain regular expression metacharacters.

Syntax

re.escape(pattern)

Consider the following example,
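A sketch of re.escape(), assuming the input sentence is the one whose escaped form is shown in the output. The exact set of escaped characters depends on the Python version; the output shown matches Python 3.7 and later.

```python
import re

# Input sentence inferred from the escaped output shown in this section.
sentence = "The Python programming language was first released on February 20, 1991."

escaped = re.escape(sentence)
print(escaped)

# The escaped string is a pattern that matches the original text literally.
literal_match = re.fullmatch(escaped, sentence)
```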

It prints

The\ Python\ programming\ language\ was\ first\ released\ on\ February\ 20,\ 1991\.

(7) re.search()

It returns either None (if the pattern is not matched) or a re.Match object that contains information about the matched part of the string. This method stops after the first match.

Consider the following example,
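A minimal sketch of re.search() with two groups. The sample sentence and the pattern are assumptions made for illustration.

```python
import re

text = "I was born on June 24"  # sample sentence assumed for illustration

# A word, a space, then a number; each part is captured in its own group.
found = re.search(r"([a-zA-Z]+) (\d+)", text)

if found is not None:
    print("Match at index", found.start(), "to", found.end())
    print("Full match:", found.group(0))
    print("Month:", found.group(1))
    print("Day:", found.group(2))
else:
    print("The pattern was not found")
```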

Match Object

A Match object contains all the information about the search and its result; if no match is found, None is returned instead. The commonly used methods and properties of the Match object are briefly explained below.

  1. Getting the string and regex
  2. Getting the index of the matched object
  3. Getting matched substring

Consider the following example to understand these.
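The three items above can be sketched as follows. The sample string and the pattern are assumptions made for illustration.

```python
import re

text = "Welcome to MarsDevs"  # sample string assumed for illustration
result = re.search(r"\bM\w+", text)

# 1. Getting the string and regex
print(result.string)      # the string that was searched
print(result.re.pattern)  # the pattern that was used

# 2. Getting the index of the matched object
print(result.start(), result.end(), result.span())

# 3. Getting matched substring
print(result.group())
```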