Data wrangling with python

As an analyst, you always dream to work with data that was collected properly. But most of the time you will have to deal with messy datasets. In this particular project, we can see in fig 1 that we have user data collected in little tables, one after another.

This kind of format is quite usual in Excel users as they use excel sheets as reports.

Note that data is useless for analysis when it is collected in this way. This data set has 76377 rows and 11 columns, so any manual arrangement in excel is not viable in a simple way.

Here is when data wrangling and python save the day.

fig 1

You can take a peek of the whole python code on my Github profile.

I used Jupyter notebook in Google colab to develop this project.

Let’s start processing the data

After importing the csv file, I noticed that there were 2 empty columns, that were not visible in Excel. I dropped them.

df.drop(columns=['Unnamed: 9','Unnamed: 10'], inplace=True)

We need to relate each table to the user. We can do this by giving a number to each user to identify their particular data.

This numbers numbers will be crucial to join and merge tables a bit later.

# we need the 'row type' column without any NaN value, let's drop all NaN
df=df[df['Row Type'].notna()]

column_values=[]
counter=0
for i in df['Row Type']:
  if 'first name' in i:
    counter+=1
  column_values.append(counter)

#Create a new column with the user numbers
df['iteration']=column_values

Notice in fig 2 that we have rows with the names of the columns throughout the dataframe. We exclude these rows from the data frame (df).

fig 2.
#exclude columns name from the dataframe
df=df[df['Row Type']!='Row Type']

To continue with the data wrangling, we need to extract the first name, last name and date of every user.

I created a new dataframe called ‘names’ to process the user information.

Fig 3 shows the result.

#Extract the rows which contain 'first name' in 'Row Type' column
names=df[df['Row Type'].str.contains('first name')]
names.head()

#Drop unnecessary columns
names.drop(columns=['Speed1','Speed2','Electricity','Effort','Weight','Torque'], inplace=True)

#rename columns
names.rename(columns={'Row Type':'first_name','Iter Number':'last_name','Power1':'date'}, inplace=True)
names

#We don't need the words 'first name:' or 'last name:' in every row. Let's drop this extra words
names['first_name']=names['first_name'].str[12:]
names['last_name']=names['last_name'].str[11:]
names['date']=names['date'].str[6:]
fig 3.

Before joining both tables (names and df), we need the first table without any ‘first name’ or ‘last name’ row, because this information is in the ‘names’ table already.

no_names=df[~df['Row Type'].str.contains('first name')]

Finally, we merge both tables and drop the column ‘iteration’ since it did its job.

Fig 4. shows the final result.

#Now we can merge both tables
final_df=pd.merge(left=names, right=no_names,how='inner',on='iteration')

#iteration column is not necesary at this point, so let's drop it
final_df.drop(columns='iteration',inplace=True)
final_df.head(10)
Fig 4.

If you are interested in data cleaning, you should take a peek of my post of data cleaning with SQL

Are you looking for an analyst? Contact me

Whatsapp: +593 97 907 2682
Email: contact@ivan-condor.com