In Week 1, we introduced Python and pandas, and learned how to load and explore data.
Now we move into one of the most important parts of data analysis:
👉 Cleaning data so it can be used for statistical analysis and regression
Why? Because real-world datasets are never perfectly clean.
Why Data Cleaning Matters
Before running any statistical model, your data must be:
- readable by Python
- consistent in format
- correctly typed
👉 If your data is messy, your results will be unreliable.
Data cleaning is what transforms raw data into usable data.
Understanding Data Types in Python
Every variable in Python has a type.
Example:
intValue = 10
floatValue = 10.5
stringValue = "Data analysis"
Now load a dataset:
import pandas as pd

df = pd.read_csv('sample_data.csv')
🔍 Checking Data Types
print(type(intValue))
print(type(floatValue))
print(type(stringValue))
print(type(df))
📊 Output Example
<class 'int'>
<class 'float'>
<class 'str'>
<class 'pandas.core.frame.DataFrame'>
🧠 What This Means
- int → whole numbers
- float → decimal numbers
- str → text
- DataFrame → structured dataset
👉 Only numeric types can be used in statistical models.
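To check the type of every column at once, you can use the dtypes attribute (a quick sketch, continuing with the df loaded above):
print(df.dtypes)
Columns shown as object usually contain text and will need converting before they can enter a model.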
Exploring Your Dataset
Before cleaning, we inspect the dataset.
🔍 Structure of the Data
df.info()
📊 Summary Statistics
df.describe()
📊 Include All Columns
df.describe(include='all')
🧠 What You See
For text columns:
- count
- unique values
- most frequent value
- frequency
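Here is a minimal sketch of those statistics on a small made-up DataFrame (the values are illustrative, not from sample_data.csv):
import pandas as pd

# Tiny made-up example to show the text-column statistics
demo = pd.DataFrame({'Program': ['CS', 'CS', 'Comp Sci', 'Electrical Eng']})
print(demo.describe(include='all'))
For this column, count is 4, unique is 3, the most frequent value (top) is 'CS', and its frequency (freq) is 2.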
📁 Column Names
colNames = df.columns
print(colNames)
🔎 Unique Values in a Column
df['Program'].unique()
Example output:
array(['CS', 'Computer Science', 'Comp Sci', 'Electrical Engineering',
       'Electrical Eng'], dtype=object)
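👉 A related check is .value_counts(), which shows how often each label appears — inconsistent spellings usually stand out immediately:
print(df['Program'].value_counts())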
Detecting Data Quality Issues
Common issues include:
- inconsistent labels
- numbers stored as text
- missing values
- sentinel values
Cleaning Text Data (String Manipulation)
Example salary values:
"$50,000"
"$60,000"
🧹 Step 1: Remove symbols
df['Salary'] = df['Salary'].str.replace('$', '', regex=False)
df['Salary'] = df['Salary'].str.replace(',', '', regex=False)
(regex=False makes sure '$' is treated as a literal character rather than a regular-expression pattern.)
🔄 Step 2: Convert to numeric
df['Salary'] = pd.to_numeric(df['Salary'])
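To confirm the conversion worked, check the column's type (a quick sanity check):
print(df['Salary'].dtype)
After a successful conversion you should see int64 or float64 instead of object.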
Standardising Text Data
Example program values:
'CS', 'Computer Science', 'Comp Sci'
🧹 Standardise values
df['Program'] = df['Program'].replace('Comp Sci', 'Computer Science')
🔁 Alternative: Using string methods (.str.replace())
You can also standardise values using pandas string methods:
df['Program'] = df['Program'].str.replace('Comp Sci', 'Computer Science')
🔤 Related approach: Extracting structured values
Sometimes, instead of replacing text, you may need to extract part of a string when values contain both text and structured information.
For example:
df = pd.read_csv('sample_data.csv')
termCol = df['Term']
termCol = termCol.str[5:6]
print(termCol)
🧠 What this does
If values look like:
"Term 1", "Term 2", "Term 3"
Then:
- .str[5:6] extracts the numeric part of the string
- This isolates structured information embedded in text
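Note that the extracted values are still strings. If you need them as numbers, add a conversion step (a small sketch, continuing the code above):
# The slice returns text, so convert it before any calculation
termCol = pd.to_numeric(termCol)
print(termCol)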
🧠 What’s the difference?
- .replace() → works at the value level (best for categories)
- .str.replace() → works at the text level (best for patterns inside strings)
- .str[ ] → works for extracting parts of structured strings
💡 When to use each
Use .replace() when:
- You are fixing specific category values
- You want exact matches
- Your data is structured (e.g., “Comp Sci” → “Computer Science”)
Use .str.replace() when:
- You need to clean parts of a string
- The text may vary slightly
- You are removing patterns (e.g., "$" or ",")
Use .str[ ] when:
- Values follow a fixed format
- You need to extract a specific character or segment
- You want to isolate embedded information
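To make the difference concrete, here is a minimal sketch comparing all three on a made-up Series (the values are illustrative only):
import pandas as pd

s = pd.Series(['Comp Sci', 'Term 1', '$50,000'])

# .replace(): matches the whole value exactly
print(s.replace('Comp Sci', 'Computer Science'))

# .str.replace(): replaces a substring inside each value
print(s.str.replace('$', '', regex=False))

# .str[ ]: slices each value by position
print(s.str[5:6])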
⚠️ Important note
.str.replace() may replace partial matches, which can sometimes lead to unintended results.
👉 For standardising categories, .replace() is usually the safer and clearer choice.
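A small made-up example of the pitfall:
import pandas as pd

s = pd.Series(['CS', 'CSE'])

# .str.replace() also matches 'CS' inside 'CSE',
# producing 'Computer ScienceE' — not what we want
print(s.str.replace('CS', 'Computer Science', regex=False))

# .replace() only changes exact, whole-value matches
print(s.replace('CS', 'Computer Science'))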
🧠 Why this matters
If not fixed:
- duplicate categories appear
- regression creates extra dummy variables
- interpretation becomes unclear

Handling Missing and Mixed Data
Example:
"25", "30", "unknown", "40"
🛠 Safe conversion
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
🧠 What happens?
- valid numbers stay
- invalid values become NaN
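A minimal sketch of this behaviour, using the made-up values above:
import pandas as pd

age = pd.Series(['25', '30', 'unknown', '40'])
print(pd.to_numeric(age, errors='coerce'))
# 'unknown' becomes NaN; the valid entries convert to floats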
⚠️ Sentinel Values
Examples:
- "-"
- "N/A"
- ".."
These represent missing or unavailable data.
🧹 Convert safely
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
💡 Why NaN matters
- prevents bias
- clearly marks missing data
- handled automatically in analysis
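After coercion, you can count how many entries were missing or invalid (a quick check on the Salary column from above):
print(df['Salary'].isna().sum())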
Full Cleaning Workflow (Best Practice)
While pandas allows you to clean data in a single line, it’s often better, especially when learning, to follow a step-by-step approach.
This makes your work easier to debug and validate.
🧹 Step 1: Extract the column
programCol = df['Program']
🔄 Step 2: Clean the values
programCol = programCol.replace('Comp Sci', 'Computer Science')
🔍 Step 2.5: Verify your cleaning
print(programCol.unique())
👉 This helps confirm that inconsistent values have been corrected.
💾 Step 3: Replace the original column
df['Program'] = programCol
🧠 What just happened?
- We extracted a column
- Cleaned it safely
- Verified the result
- Replaced the original data
Now, if you check:
print(df['Program'].unique())
You should no longer see inconsistent values like "Comp Sci".
💡 Why this approach is useful
- Prevents accidental data loss
- Makes debugging easier
- Encourages step-by-step validation
- Reflects how real data workflows are done
👉 This pattern (extract → clean → replace) is something you’ll use often in data analysis.
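Here is the whole pattern in one place (a minimal end-to-end sketch, assuming sample_data.csv contains a 'Program' column as in the earlier examples):
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Extract → clean → verify → replace
programCol = df['Program']
programCol = programCol.replace('Comp Sci', 'Computer Science')
print(programCol.unique())   # verify before overwriting
df['Program'] = programCol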
And remember: not all data issues are about inconsistent labels. Sometimes, as we saw above, the problem is missing or mixed values.
Why Cleaning Matters Before Regression
Before moving to regression:
Your dataset must be:
✔ numeric where needed
✔ consistent in categories
✔ free of formatting issues
✔ with missing values properly handled
⚠️ If you skip cleaning:
- regression becomes misleading
- categories split incorrectly
- results lose meaning
Filtering & Transforming Data
Once your data is clean, you can begin exploring it more deeply.
We will cover this in detail in Week 3, but here is a preview.
🔍 Filtering data
Filtering allows you to focus on specific parts of your dataset:
df[df['Salary'] > 50000]
or:
df[df['Program'] == 'Computer Science']
🔄 Transforming data
Transformation means creating new variables or modifying existing ones:
df['Salary_in_k'] = df['Salary'] / 1000
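Filtering and transforming can also be combined (a small sketch using the column names from the earlier examples):
# Keep Computer Science rows earning above 50,000,
# then express their salaries in thousands
cs = df[(df['Program'] == 'Computer Science') & (df['Salary'] > 50000)]
print(cs['Salary'] / 1000)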
🧠 Why this matters
These techniques help you:
- focus on subsets of data
- prepare variables for analysis
- build better statistical models later
👉 We will expand on this in Week 3.
🧠 Final Takeaway
Data cleaning is not optional.
It is the foundation of all statistical analysis.
Better data → better models → better decisions
Coming Next (Week 3)
Now that your data is clean, we move into:
📊 Filtering datasets
📈 Data transformation
📉 OLS Regression (statsmodels)
🧠 Interpreting regression results
Thank you for reading 🙂