It’s my third week as a data scientist mentee with She Codes Africa and I’ve been assigned couple of dirty dataset to clean up.
Do you know this common proverbial phrase that says “When life gives you lemon,make lemonade (i.e turn a difficult situation into something positive or desirable) . Right! But in this context its not a situation ; it is a data , so let get to it.
What is a Data Set?
It is a collection of data. In this case we’re referring to tabular data set that corresponds to one or more database tables, here every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. It includes values such as real numbers or integers, nominal data, missing values etc
What is a Dirty Data Set?
A dirty data set also known as a rogue data is inaccurate, incomplete or inconsistent data in a computer system or database. It consist of mistakes such as spelling or punctuation errors, incorrect data associated with a field, incomplete or outdated data, or even data that has been duplicated in the database.
There are seven types of dirty data :-
- Duplicated data
2 . Outdated data
3. Insecure data
4. Incomplete data
5. Incorrect/Inaccurate data
6. Inconsistent data
7. Too much data
The dirty data sets came in a plain text file format, therefore I had to read the files to csv format using this code df1 = pd.read_csv() as seen in the image below by collecting the plain text files and converting it to CSV but that was after importing the module Pandas as its an important function in the package.
Pandas :-
It is a fast , powerful ,flexible and easy to use package which makes importing and analyzing data much easier. It is an open source data analysis and manipulation tool,built on top of the Python programming language.
Combining of datasetfiles.
Given three dirty data sets files to work with required me combining all of them into one data set to reduce the stress of cleaning one after the other. Combining multiple datasets together can be done in several ways.
1. Merge( ) :-
for combining data on common columns or indices.
2. Join( ):-
for combining data on a key column or an index.
3. Concat( ):-
for combining DataFrames across rows or columns.
I choose to concatenate i.e concat( ) because my datasets were dataframes across rows and columns.
Pandas DataFrames
They are data structures that contain Data organized in two dimensions, rows and column and Labels that correspond to the rows and columns.
The first row contains the column labels (name ,city, age and py-score ). The first column holds the row labels (101, 102 ,103,104, 105,106,107). All other cells are filled with the data values.
After Concatenating the three dataframe sets (df1, df2 and df3 )as seen in the image below.
we have the merged dataframe below by calling the head function merged_df.head( ) which tells you the top records in the data set , by default python shows you only 5 records:-
Checking for the dataset information.
Information [ info( ) ]:-
After merging the dataframes to a single one , it is important you check the information about the data and datatypes of each respective attribute you will be dealing with because this shows us what we need to work on.
we have 21 data columns , 4345 rows ,26897 entries , datatypes (7 float64 , 14 object .
Converting the datatype of a date column to datetime.
From the info( ) we realized the approveddate , birthdate, closeddate, creationdate, firstduedate and firstrepaiddate are in object datatype instead of datetime object datatype but not to worry Pandas has a built-in function called to_datetime()
that can be used to convert strings to datetime so far the date is valid. Learn more about this built in function here.
The date class in the DateTime module of Python deals with dates in the Gregorian calendar. It accepts three integer arguments: year, month, and day as shown above. The good thing about the datetime module is that it helps you to join together both properties i.e date and time into a single object.
Below is the conversion of the columns ; approveddate , birthdate, closeddate, creationdate, firstduedate and firstrepaiddate object to datetime object datatype.
Extraction of Date ,Month and Year
One cool thing that occurs after converting to a datetime datatype is the ease with which you can extract your day, month and year seperately.
line 23
Handling Missing Values
What are missing values?
Missing values are data that occur when no information is provided for one or more item for a whole unit. They are referred to as NA
(Not Available) values in pandas. Many datasets simply arrive with missing data, either because it exists and was not collected or it never existed.
In this dataset , NaN (an acronym for Not a Number), was used to represent our missing values. It is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation
Although there are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame :-
I used the isnull() functionwith my data frame to check the missing values as seen in the image below.
As we can see we have various missing values in the respective columns. Therefore we need to fill them in . We can achieve this by using any of these functions , they replace NaN values with some value of their own.
fillna( ), replace ( ) and interpolate( )
To replace my data set I used the fillna( ) function , to replace it with value zero.
The missing values (NaN) in the column termdays dataframe was filled in with value zero and also for other columns which had a missing value.
All the data Cleaning steps in this article are the ones I implemented while cleaning my datasets,here is a link to the work in my github .There are other processes involved in data Cleaning and analysis but it all boils down to how dirty your data is and your target feature.
I definitely would share more steps as I get my hands dirty , wiping clean more data sets .