Data Cleaning in Python using Regular Expressions

Using string manipulation to clean strings

In this post, we will go over some regex (regular expression) techniques that you can use in your data cleaning process. Regex is mostly used for string manipulation, which we will get to in a second.

Data science is largely about understanding data, and data cleaning is an essential part of that process. How valuable our data is depends on how much we can get out of it. Let’s start by understanding what string manipulation is and why it is important. Then we will work through a couple of common examples for practice.

What is String Manipulation?

String manipulation is a must in data cleaning because much of the world’s data is unstructured text. It is also a way to make your datasets more consistent with each other, which helps when you combine and work with different datasets.

Let’s get started!

Monetary Values

There are many ways monetary values can be represented. Here are some examples we might come across in our data:

21
$21
$21.56
$21.561

We want to find a way to validate these values and make sure they fit our dataset. Python has built-in methods and libraries to help us accomplish this. We will use the re library, which is mostly used for string pattern matching. Regular expressions give us a formal way to specify those patterns. Now, we will write an expression to match each of the values.

21 (Regex: “\d*”)
$21 (Regex: “\$\d*”)
$21.56 (Regex: “\$\d*\.\d{2}”)
$21.561 (Regex: “^\$\d*\.\d{2}$” rejects this value, because it allows exactly two digits after the decimal point)

We put a caret (^) at the beginning and a dollar sign ($) at the end. The caret tells the pattern to start matching at the beginning of the value, while the dollar sign tells it to match at the end of the value. This way it will match exactly what we specified in our regex. Note that a literal dollar sign has to be escaped as “\$”, since an unescaped $ is the end anchor.
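To make the effect of the anchors concrete, here is a minimal sketch that runs the unanchored and the anchored pattern against the four sample values. The variable names are my own, chosen for illustration.

```python
import re

values = ['21', '$21', '$21.56', '$21.561']

# Unanchored: the pattern only has to be found at the start of the string
loose = re.compile(r'\$\d*\.\d{2}')
# Anchored: the whole string must be a dollar sign, digits, a dot, and exactly two digits
strict = re.compile(r'^\$\d*\.\d{2}$')

loose_results = {v: bool(loose.match(v)) for v in values}
strict_results = {v: bool(strict.match(v)) for v in values}

print(loose_results)
print(strict_results)
```

The unanchored pattern accepts both $21.56 and $21.561, because it stops checking after the first two decimal digits. The anchored pattern accepts only $21.56.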

So, how do we use regular expressions?

We will compile the pattern. (Compiling lets us reuse the same regex object over and over across our dataset.)
Then we will use the compiled pattern to match our values.

This method is especially useful with pandas, because we want to match the same regex against every value in a column.

Here is a basic example of using a regular expression:

import re

# Use a raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'\$\d*\.\d{2}')

result = pattern.match('$21.56')

bool(result)

pattern.match returns a match object (or None when there is no match), which can be converted into a boolean value using Python’s built-in bool function.
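Here is a small sketch of what the match object looks like before it is converted to a boolean, followed by the same compiled pattern reused across several values. The list of prices is made up for illustration, and a plain Python list stands in for a dataset column.

```python
import re

# Compile once, then reuse the same pattern object for every value
price_pattern = re.compile(r'^\$\d*\.\d{2}$')

hit = price_pattern.match('$21.56')   # a match object
miss = price_pattern.match('21.56')   # None, because the dollar sign is missing
print(hit.group(), hit.span())        # matched text and its start/end positions
print(miss is None)

# Hypothetical column of raw price strings (made up for illustration)
prices = ['$21.56', '21', '$3.50', '$21.561', '$.99']
is_valid = [bool(price_pattern.match(p)) for p in prices]
clean = [p for p, ok in zip(prices, is_valid) if ok]
print(is_valid)
print(clean)
```

Notice that '$.99' passes the check, because \d* allows zero digits before the decimal point. If at least one digit should be required, \d+ is the stricter choice. In pandas, Series.str.match applies the same kind of check to every value in a column.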

Let’s do an example of checking the phone numbers in our dataset. In this exercise, we will define a regular expression to match US phone numbers, which means they have to fit the following pattern: “xxx-xxx-xxxx”.

# Import the regular expression module
import re

# Compile the pattern: phone
phone = re.compile(r'\d{3}-\d{3}-\d{4}')

# Check if the pattern matches
result = phone.match("123-456-7890")
print(bool(result))

#True

result2 = phone.match("1231-456-7890")
print(bool(result2))

#False
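One thing to keep in mind is that match only checks from the start of the string, so extra characters after a valid phone number go unnoticed. Below is a minimal sketch of a stricter check using fullmatch, which requires the pattern to cover the entire string. The sample inputs are my own.

```python
import re

phone = re.compile(r'\d{3}-\d{3}-\d{4}')

candidates = ['123-456-7890', '123-456-78901', '1231-456-7890']

# match: the pattern only has to be found at the start of the string
match_results = [bool(phone.match(text)) for text in candidates]
# fullmatch: the pattern has to cover the entire string
fullmatch_results = [bool(phone.fullmatch(text)) for text in candidates]

print(match_results)
print(fullmatch_results)
```

The second candidate has one digit too many at the end. match still accepts it, while fullmatch rejects it. Adding ^ and $ to the pattern, as in the monetary example, would have a similar effect.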

Let’s do another example: extracting numeric values from strings. This is especially helpful when working with log files.

# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'Smoothie ingredients: 3 bananas and 2 strawberries')

# Print the matches
print(matches)

“\d” is the pattern used to find a digit. Following it with a + means the previous element is matched one or more times, so consecutive digits are returned together as one number. Here, the result is ['3', '2'].
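To see what the + changes, here is a short sketch comparing the two patterns on a string with multi-digit numbers. The sample text is made up for illustration.

```python
import re

text = 'Order 66: 12 apples and 250 grams of sugar'  # made-up sample

# Without +, every digit is returned on its own
single_digits = re.findall(r'\d', text)
# With +, each run of consecutive digits is returned as one string
whole_numbers = re.findall(r'\d+', text)
# findall returns strings, so convert them before doing arithmetic
as_ints = [int(n) for n in whole_numbers]

print(single_digits)
print(whole_numbers)
print(as_ints)
```

Since findall always returns strings, the conversion to int is a separate step.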

That’s all for this post. We now have a better understanding of string manipulation and regular expressions. What we have covered here is just meant to give you an idea; if you want to learn more, I highly recommend checking the official Python regex documentation.

Here is another data cleaning post you may find interesting:

Data Cleaning in Python (Data Types)

Thanks for reading this story. I hope you enjoyed it and learnt something new today. I am planning to share more about the data cleaning process in upcoming posts. Follow my blog to stay connected.