Data cleaning project with SQL server

Data cleaning with SQL (or other programs like python, R) could be the most important part of a data analysis project, The quality of the data we use determines the quality of the results and insights we get. Many professionals believe that we should dedicate more time to preparing and cleaning the data rather than other processes like EDA, ML, and others. Otherwise, we could finish with a bunch of inaccurate output.

In this project I drive a cleaning data process to prepare data for analysis by modifying incomplete data, removing irrelevant and duplicated rows, splitting addresses, and modifying improperly formatted data.

Data cleaning is not about erasing information to simplify the dataset, but rather finding a way to maximize the accuracy of the collected data.

Let’s go over cleaning techniques with a Housing dataset. It has 56K+ rows. Let’s get started!

You can take a peek of the entire SQL code on my Github profile.

I used Microsoft SQL Server MS to develop this project.

Fist glance to the data

At first glance, I identified the next things to do:

  1. Standarize SalesDate format
  2. Populate Property Address data
  3. Break Address into individual columns (Address, City, State)
  4. Change Y and N to YES and No in ‘SoldAsVacant’ column
  5. Remove Duplicates
  6. Delete Unused Columns
SELECT*
FROM Project.dbo.housing

Standarize date formart in SQL with CONVERT

Sale Date format includes the hour on it, as it is empty, we don’t really need it.

So, I created a new column named SalesDate2 filled with SaleDate converted to date format.

Now it could be use to do further analysis based on date in Tableau, Power BI, etc.

ALTER TABLE Project.dbo.housing
ADD SaleDate2 Date

UPDATE Project.dbo.housing
SET SaleDate2 = CONVERT (Date, SaleDate)

SELECT SaleDate, SaleDate2
FROM Project..housing
data cleaning with sql

Populate missing property addresses

Every property has only one Parcel ID, as well as a unique address.

I realized that missing PropertyAddress had their Parcel ID, so I populated the missing addresses based on the Parcel ID.

UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM Project..housing a
JOIN Project..housing b
	ON a.ParcelID = b.ParcelID
	AND a.[UniqueID ]<>b.[UniqueID ]
WHERE a.PropertyAddress IS NULL
data cleaning with sql to split addresses into city, state

Breaking address into individual columns (Address, city, State)

Usually, we need what analyze the data by zone, regione, city, etc. That’s why it’s more useful to have the house’ address, the citty and the state in differente columns.

I add new columns to split the Property and the owner address.

Note 1: I replaced the ‘,’ to ‘.’ becase PARSENAME recognize dots, not comas.

Note 2: To split the property address I applied a different method based on SUBSTRING, CHARINDEX and LEN, you can check it out in my Github.

UPDATE Project..housing
SET OwnerAddressClean = PARSENAME(REPLACE(OwnerAddress,',','.'),3)

UPDATE Project..housing
SET OwnerCity = PARSENAME(REPLACE(OwnerAddress,',','.'),2)

UPDATE Project..housing
SET OwnerState = PARSENAME(REPLACE(OwnerAddress,',','.'),+1)
data cleaning with sql for address spliting

Modifying inclomplete words

As we can see, there are 52 and 399 values that shows ‘Y’, ‘N’ instead of ‘Yes’, ‘No’.

It is important to modify these values so we could apply effective filters in further analysis.

SELECT Distinct( SoldAsVacant), COUNT(SoldAsVacant) 
FROM Project..housing
GROUP BY SoldAsVacant
ORDER BY 2

UPDATE Project..housing
SET SoldAsVacant=CASE
	WHEN SoldAsVacant = 'Y' THEN 'Yes'
	WHEN SoldAsVacant = 'N' THEN 'No'
	ELSE SoldAsVacant
END

Indentify and remove duplicates usig CTE

CTE are temporary, and can be used to run queries like SELECT, INSERT, UPDATE or DELETE.

ROW_NUMBER assigne a number to every row with information that should be unique in every property as ParcelID, Address, LegalReference, etc, And shows the number of times a row with the same data appears in the dataset.

Once the CTE is defined, it’s quite easy to DELETE the duplicates(rows that appered more than 1 time in the dataset).

with RowNumCTE as(
SELECT *,
	ROW_NUMBER () over (
	Partition by ParcelID,
				 PropertyAddress,
				 SalePrice,
				 SaleDate,
				 LegalReference
				 Order BY
				 UniqueID
				 ) as row_num
From Project..housing
)
DELETE
FROM RowNumCTE
WHERE row_num>1

Droping unused columns

Some of the columns that we modify are not neccesary now, as we splt the same data into separates columns.

Usualy, it’s better to keep these columns, and not delete any data. But in this case, I delete these columns because I have the same information in other columns.

ALTER TABLE Project..housing
DROP COLUMN ownerAddress, PropertyAddress, SaleDate

This was just an example of how to conduct cleaning techniques in SQL. I you are interest in Data cleaning, and how to use SQL to categorize keywords, check out the project below:

Cleaning and categorize keywords for SEO with SQL