Exploring and Cleaning Data with Python – Week 2


In Week 1, we introduced Python and pandas, and learned how to load and explore data.

Now we move into one of the most important parts of data analysis:

👉 Cleaning data so it can be used for statistical analysis and regression

Why? Because in real-world datasets, nothing is perfectly clean.

Why Data Cleaning Matters

Before running any statistical model, your data must be:

  • readable by Python
  • consistent in format
  • correctly typed

👉 If your data is messy, your results will be unreliable.

Data cleaning is what transforms raw data into usable data.

Understanding Data Types in Python

Every variable in Python has a type.

Example:

intValue = 10
floatValue = 10.5
stringValue = "Data analysis"

Now load a dataset:

import pandas as pd

df = pd.read_csv('sample_data.csv')

🔍 Checking Data Types

print(type(intValue))
print(type(floatValue))
print(type(stringValue))
print(type(df))

📊 Output Example

<class 'int'>
<class 'float'>
<class 'str'>
<class 'pandas.core.frame.DataFrame'>

🧠 What This Means

  • int → whole numbers
  • float → decimal numbers
  • str → text
  • DataFrame → structured dataset

👉 Only numeric types can be used in statistical models.
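A quick way to see which columns are numeric is to check the type of each column in the DataFrame. Here is a minimal sketch (the 'Salary' column is the one used in the examples later on):

# Show the data type of every column
print(df.dtypes)

# Check a single column: numeric columns show int64/float64, text shows object
print(df['Salary'].dtype)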

Exploring Your Dataset

Before cleaning, we inspect the dataset.

🔍 Structure of the Data

df.info()

📊 Summary Statistics

df.describe()

📊 Include All Columns

df.describe(include='all')

🧠 What You See

For text columns:

  • count
  • unique values
  • most frequent value
  • frequency

📁 Column Names

colNames = df.columns
print(colNames)

🔎 Unique Values in a Column

df['Program'].unique()

Example output:

'CS', 'Computer Science', 'Comp Sci', 'Electrical Engineering', 'Electrical Eng'

Detecting Data Quality Issues

Common issues include:

  • inconsistent labels
  • numbers stored as text
  • missing values
  • sentinel values
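Here is a minimal sketch of a quick check that surfaces several of these issues at once (assuming df is the dataset loaded above):

# Numbers stored as text show up as 'object'
print(df.dtypes)

# Missing values per column
print(df.isna().sum())

# Inconsistent labels in a categorical column
print(df['Program'].unique())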

Cleaning Text Data (String Manipulation)

Example salary values:

"$50,000"
"$60,000"

🧹 Step 1: Remove symbols

# Remove the currency symbol and thousands separator (regex=False = literal match)
df['Salary'] = df['Salary'].str.replace('$', '', regex=False)
df['Salary'] = df['Salary'].str.replace(',', '', regex=False)

🔄 Step 2: Convert to numeric

df['Salary'] = pd.to_numeric(df['Salary'])
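You can confirm the conversion worked by checking the column's type afterwards:

# Should now be int64 or float64 instead of object
print(df['Salary'].dtype)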

Standardising Text Data

Example program values:

'CS', 'Computer Science', 'Comp Sci'

🧹 Standardise values

df['Program'] = df['Program'].replace('Comp Sci', 'Computer Science')

🔁 Alternative: Using string methods (.str.replace())

You can also standardise values using pandas string methods:

df['Program'] = df['Program'].str.replace('Comp Sci', 'Computer Science')

🔤 Related approach: Extracting structured values

Sometimes, instead of replacing text, you may need to extract part of a string when values contain both text and structured information.

For example:

df = pd.read_csv('sample_data.csv')

termCol = df['Term']
termCol = termCol.str[5:6]

print(termCol)

🧠 What this does

If values look like:

"Term 1", "Term 2", "Term 3"

Then:

  • .str[5:6] takes the single character at position 5 (here, the term number)
  • This isolates structured information embedded in text

🧠 What’s the difference?

  • .replace() → works at the value level (best for categories)
  • .str.replace() → works at the text level (best for patterns inside strings)
  • .str[ ] → works for extracting parts of structured strings

💡 When to use each

Use .replace() when:

  • You are fixing specific category values
  • You want exact matches
  • Your data is structured (e.g., “Comp Sci” → “Computer Science”)

Use .str.replace() when:

  • You need to clean parts of a string
  • The text may vary slightly
  • You are removing patterns (e.g., $, ,)

Use .str[ ] when:

  • Values follow a fixed format
  • You need to extract a specific character or segment
  • You want to isolate embedded information

⚠️ Important note

.str.replace() may replace partial matches, which can sometimes lead to unintended results.

👉 For standardising categories, .replace() is usually the safer and clearer choice.
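Here is a minimal side-by-side sketch of the three approaches on small example Series (the values are made up for illustration):

import pandas as pd

programs = pd.Series(['CS', 'Comp Sci', 'Computer Science'])
terms = pd.Series(['Term 1', 'Term 2', 'Term 3'])

# 1. replace(): exact, value-level match (good for categories)
print(programs.replace('Comp Sci', 'Computer Science'))

# 2. str.replace(): substring-level match (note the partial-match risk)
print(programs.str.replace('Comp Sci', 'Computer Science', regex=False))

# 3. str[]: positional extraction from a fixed format
print(terms.str[5:6])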

🧠 Why this matters

If not fixed:

  • duplicate categories appear
  • regression creates extra dummy variables
  • interpretation becomes unclear

Handling Missing and Mixed Data

Example:

"25", "30", "unknown", "40"

🛠 Safe conversion

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

🧠 What happens?

  • valid numbers stay
  • invalid values become NaN
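You can count how many values were turned into NaN to see how much of the column was affected:

# Number of values that could not be converted
print(df['Age'].isna().sum())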

⚠️ Sentinel Values

Examples:

  • "-"
  • "N/A"
  • ".."

These represent missing or unavailable data.

🧹 Convert safely

df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

💡 Why NaN matters

  • prevents bias
  • clearly marks missing data
  • handled automatically in analysis
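For example, most pandas calculations skip NaN automatically, so summary statistics are based only on the valid values (a small sketch):

# mean() ignores NaN by default
print(df['Salary'].mean())

# Rows with missing salaries can also be dropped explicitly if needed
print(df.dropna(subset=['Salary']).shape)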

Full Cleaning Workflow (Best Practice)

While pandas allows you to clean data in a single line, it’s often better, especially when learning, to follow a step-by-step approach.

This makes your work easier to debug and validate.

🧹 Step 1: Extract the column

programCol = df['Program']

🔄 Step 2: Clean the values

programCol = programCol.replace('Comp Sci', 'Computer Science')

🔍 Step 2.5: Verify your cleaning

print(programCol.unique())

👉 This helps confirm that inconsistent values have been corrected.

💾 Step 3: Replace the original column

df['Program'] = programCol

🧠 What just happened?

  • We extracted a column
  • Cleaned it safely
  • Verified the result
  • Replaced the original data

Now, if you check:

print(df['Program'].unique())

You should no longer see inconsistent values like "Comp Sci".

💡 Why this approach is useful

  • Prevents accidental data loss
  • Makes debugging easier
  • Encourages step-by-step validation
  • Reflects how real data workflows are done

👉 This pattern (extract → clean → replace) is something you’ll use often in data analysis.
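If you find yourself repeating this pattern, you can wrap it in a small helper function. This is just a sketch; the function name and the mapping are made up for illustration:

def cleanColumn(df, column, replacements):
    # extract -> clean -> verify -> replace
    col = df[column]
    col = col.replace(replacements)
    print(col.unique())        # verify the cleaning
    df[column] = col
    return df

df = cleanColumn(df, 'Program', {'Comp Sci': 'Computer Science'})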

Not all data issues are about inconsistent labels; as we saw earlier, sometimes the problem is missing or mixed values, and both need to be handled before moving on to modelling.

Why Cleaning Matters Before Regression

Before moving to regression:

Your dataset must be:

✔ numeric where needed
✔ consistent in categories
✔ free of formatting issues
✔ complete, with missing values handled properly

⚠️ If you skip cleaning:

  • regression becomes misleading
  • categories split incorrectly
  • results lose meaning

Filtering & Transforming Data

Once your data is clean, you can begin exploring it more deeply.

We will cover this in detail in Week 3, but here is a preview.

🔍 Filtering data

Filtering allows you to focus on specific parts of your dataset:

df[df['Salary'] > 50000]

or:

df[df['Program'] == 'Computer Science']

🔄 Transforming data

Transformation means creating new variables or modifying existing ones:

df['Salary_in_k'] = df['Salary'] / 1000
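You can also combine the two, for example transforming a column only for a filtered subset of rows (a small sketch using the columns from the examples above):

# Salaries in thousands, for Computer Science graduates only
csRows = df[df['Program'] == 'Computer Science']
print(csRows['Salary'] / 1000)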

🧠 Why this matters

These techniques help you:

  • focus on subsets of data
  • prepare variables for analysis
  • build better statistical models later

👉 We will expand on this in Week 3.

🧠 Final Takeaway

Data cleaning is not optional.

It is the foundation of all statistical analysis.

Better data → better models → better decisions

Coming Next (Week 3)

Now that your data is clean, we move into:

📊 Filtering datasets
📈 Data transformation
📉 OLS Regression (statsmodels)
🧠 Interpreting regression results

Thank you for reading 🙂

