Cleaning Data in Python (Data Types)

In this post, we will go through some of the challenging cases we face when cleaning data, and how we can solve them using some helpful pandas methods.

Data science is largely about understanding the data, and data cleaning is a very important part of this process. How valuable the data is depends on how much we can get out of it. Let’s get started!

First, let’s import the pandas library:

import pandas as pd

If you are wondering how to create a dataframe using pandas, here is the code:

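# '-' is the placeholder for Kristen's missing age
data = {'Name': ['Joe', 'Liz', 'Kristen'],
        'Sex': ['Male', 'Female', 'Female'],
        'Age': [30, 32, '-']
}
# columns= fixes the column order; a bare second argument would be read as the index
df = pd.DataFrame(data, columns=['Name', 'Sex', 'Age'])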

And now let’s print out the dataframe that we just created.

print(df)
print(df.dtypes)

As you can see, our data has three columns and one missing value, which is the age of Kristen. We also printed out the data types of each column, and all of them are object.
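
To illustrate why a single placeholder changes the data type of a whole column, here is a minimal sketch that builds the same ages with and without the "-" value. The names `clean_age` and `mixed_age` are only for illustration and are not part of the dataframe above.

```python
import pandas as pd

# All values are integers, so pandas infers an integer data type.
clean_age = pd.Series([30, 32, 27])

# One string placeholder is enough to make pandas fall back to object.
mixed_age = pd.Series([30, 32, "-"])

print(clean_age.dtype)
print(mixed_age.dtype)
```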

We will do three things for this data:

  1. We will handle the string value in a numeric column so that the printed data types give us a better picture of the column.
  2. We will learn about the categorical data type and how to convert a column to it.
  3. We will convert the Name column’s data type from object to string using the method that we will use in the second step.

To Numeric Method

  • Numeric data can end up loaded as a string in our dataframe. (For example, in our data the age of Kristen is loaded as “-”, which is a string.)
  • Since the Age column contains the placeholder “-”, the dataframe doesn’t see that value as empty; it sees it as a string, and this affects the data type of the whole column.
  • We will use the to_numeric method to convert the column’s data type. With errors='coerce', values that cannot be parsed as numbers are replaced with NaN, which is how pandas marks a missing value.

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

print(df.dtypes)
print(df)

As you can see, our Age column now has a float data type, which gives a better picture of the column. It is float rather than integer because NaN is itself a floating-point value. Now we can go to our next step: converting a column to the categorical data type, and the reason behind it.
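
Because errors='coerce' replaces bad values silently, it can be useful to check which original values were turned into NaN. The following is a minimal sketch of one way to do that, assuming the same three ages as in the example; `raw_age`, `converted` and `coerced_mask` are illustrative names.

```python
import pandas as pd

# Same illustrative Age values as above: two numbers and a "-" placeholder.
raw_age = pd.Series([30, 32, "-"], name="Age")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw_age, errors="coerce")

# Rows that had a value originally but are NaN after the conversion.
coerced_mask = converted.isna() & raw_age.notna()

print(converted.dtype)
print(raw_age[coerced_mask].tolist())
```

Inspecting the coerced values before overwriting the column helps confirm that only real placeholders were lost, not typos such as "3O" that could have been repaired.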

Categorical Data Type

  • Can make your dataframe much smaller in memory when a column has only a few distinct values that repeat often
  • Can make your data usable by other Python libraries for analysis
  • For our data, the Sex column’s data type is object; converting it to the categorical data type will help us later when we start working with the data.

df['Sex'] = df['Sex'].astype('category')

print(df.dtypes)

As you can see above, the data type of our Sex column is now category. In this step, we also learnt how to convert the data type of a column: we call the astype method and pass the target data type inside the parentheses. If you want to learn more about the astype method, you can check the official pandas documentation.
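
The memory saving is hard to see on three rows, so here is a rough sketch on a larger, made-up column with the same two labels repeated many times. The 100,000-row size and the variable names are assumptions chosen only for illustration, and the exact byte counts depend on your pandas version.

```python
import pandas as pd

# A hypothetical column: two labels repeated to give 100,000 rows.
sex_object = pd.Series(["Male", "Female"] * 50_000, dtype="object")
sex_category = sex_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = sex_object.memory_usage(deep=True)
category_bytes = sex_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

# The distinct labels are stored once; each row only holds a small integer code.
print(list(sex_category.cat.categories))
```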

Converting Data Type

In this step, we will convert the Name column’s data type from object to string. We will use the same method we used in the previous step.

df['Name'] = df['Name'].astype('string')

print(df.dtypes)
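
As a quick check of what the conversion gives us, the sketch below converts the same three names on their own and confirms the result is the dedicated pandas string type. The variable names are illustrative.

```python
import pandas as pd

# The same names as in the example, stored with the generic object data type.
names = pd.Series(["Joe", "Liz", "Kristen"], dtype="object")

# astype("string") switches to the dedicated pandas string data type.
as_string = names.astype("string")

print(as_string.dtype)
print(isinstance(as_string.dtype, pd.StringDtype))

# The .str accessor works as before, for example to get each name's length.
print(as_string.str.len().tolist())
```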

That’s all for this post. We now have cleaner data, but there are many other methods we need to know to clean data well.

Thanks for reading this story. I hope you enjoyed it and learnt something new today. I am planning to share more about the data cleaning process in upcoming posts. Follow my blog to stay connected.
