In Week 1, we introduced Python and pandas, and learned how to load and explore data.
Now we move into one of the most important parts of data analysis:
👉 Cleaning data so it can be used for statistical analysis and regression
Why? Because real-world datasets are never perfectly clean.
Why Data Cleaning Matters
Before running any statistical model, your data must be:
- readable by Python
- consistent in format
- correctly typed
👉 If your data is messy, your results will be unreliable.
Data cleaning is what transforms raw data into usable data.
Understanding Data Types in Python
Every variable in Python has a type.
Example:
intValue = 10
floatValue = 10.5
stringValue = "Data analysis"
Now load a dataset:
import pandas as pd

df = pd.read_csv('sample_data.csv')
🔍 Checking Data Types
print(type(intValue))
print(type(floatValue))
print(type(stringValue))
print(type(df))
📊 Output Example
<class 'int'>
<class 'float'>
<class 'str'>
<class 'pandas.core.frame.DataFrame'>
🧠 What This Means
- int → whole numbers
- float → decimal numbers
- str → text
- DataFrame → structured dataset
👉 Only numeric types can be used in statistical models.
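To check the type of every column at once, you can use the dtypes attribute (a quick sketch, continuing with the df loaded above):
print(df.dtypes)
Columns shown as object usually contain text and will need converting before they can enter a model.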
Exploring Your Dataset
Before cleaning, we inspect the dataset.
🔍 Structure of the Data
df.info()
📊 Summary Statistics
df.describe()
📊 Include All Columns
df.describe(include='all')
🧠 What You See
For text columns:
- count
- unique values
- most frequent value
- frequency
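Here is a minimal sketch of those statistics on a small made-up DataFrame (the values are illustrative, not from sample_data.csv):
import pandas as pd

# Tiny made-up example to show the text-column statistics
demo = pd.DataFrame({'Program': ['CS', 'CS', 'Comp Sci', 'Electrical Eng']})
print(demo.describe(include='all'))
For this column, count is 4, unique is 3, the most frequent value (top) is 'CS', and its frequency (freq) is 2.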
📁 Column Names
colNames = df.columns
print(colNames)
🔎 Unique Values in a Column
df['Program'].unique()
Example output:
array(['CS', 'Computer Science', 'Comp Sci', 'Electrical Engineering',
       'Electrical Eng'], dtype=object)
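👉 A related check is .value_counts(), which shows how often each label appears — inconsistent spellings usually stand out immediately:
print(df['Program'].value_counts())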
Detecting Data Quality Issues
Common issues include:
- inconsistent labels
- numbers stored as text
- missing values
- sentinel values
Cleaning Text Data (String Manipulation)
Example salary values:
"$50,000"
"$60,000"
🧹 Step 1: Remove symbols
df['Salary'] = df['Salary'].str.replace('$', '', regex=False)
df['Salary'] = df['Salary'].str.replace(',', '', regex=False)
(regex=False makes sure '$' is treated as a literal character rather than a regular-expression pattern.)
🔄 Step 2: Convert to numeric
df['Salary'] = pd.to_numeric(df['Salary'])
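To confirm the conversion worked, check the column's type (a quick sanity check):
print(df['Salary'].dtype)
After a successful conversion you should see int64 or float64 instead of object.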
Standardising Text Data
Example program values:
'CS', 'Computer Science', 'Comp Sci'
🧹 Standardise values
df['Program'] = df['Program'].replace('Comp Sci', 'Computer Science')
🔁 Alternative: Using string methods (.str.replace())
You can also standardise values using pandas string methods:
df['Program'] = df['Program'].str.replace('Comp Sci', 'Computer Science')
🔤 Related approach: Extracting structured values
Sometimes, instead of replacing text, you may need to extract part of a string when values contain both text and structured information.
For example:
df = pd.read_csv('sample_data.csv')
termCol = df['Term']
termCol = termCol.str[5:6]
print(termCol)
🧠 What this does
If values look like:
"Term 1", "Term 2", "Term 3"
Then:
- .str[5:6] extracts the numeric part of the string
- This isolates structured information embedded in text
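Note that the extracted values are still strings. If you need them as numbers, add a conversion step (a small sketch, continuing the code above):
# The slice returns text, so convert it before any calculation
termCol = pd.to_numeric(termCol)
print(termCol)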
🧠 What’s the difference?
- .replace() → works at the value level (best for categories)
- .str.replace() → works at the text level (best for patterns inside strings)
- .str[ ] → works for extracting parts of structured strings
💡 When to use each
Use .replace() when:
- You are fixing specific category values
- You want exact matches
- Your data is structured (e.g., “Comp Sci” → “Computer Science”)
Use .str.replace() when:
- You need to clean parts of a string
- The text may vary slightly
- You are removing patterns (e.g., "$" or ",")
Use .str[ ] when:
- Values follow a fixed format
- You need to extract a specific character or segment
- You want to isolate embedded information
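To make the difference concrete, here is a minimal sketch comparing all three on a made-up Series (the values are illustrative only):
import pandas as pd

s = pd.Series(['Comp Sci', 'Term 1', '$50,000'])

# .replace(): matches the whole value exactly
print(s.replace('Comp Sci', 'Computer Science'))

# .str.replace(): replaces a substring inside each value
print(s.str.replace('$', '', regex=False))

# .str[ ]: slices each value by position
print(s.str[5:6])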
⚠️ Important note
.str.replace() may replace partial matches, which can sometimes lead to unintended results.
👉 For standardising categories, .replace() is usually the safer and clearer choice.
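A small made-up example of the pitfall:
import pandas as pd

s = pd.Series(['CS', 'CSE'])

# .str.replace() also matches 'CS' inside 'CSE',
# producing 'Computer ScienceE' — not what we want
print(s.str.replace('CS', 'Computer Science', regex=False))

# .replace() only changes exact, whole-value matches
print(s.replace('CS', 'Computer Science'))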
🧠 Why this matters
If not fixed:
- duplicate categories appear
- regression creates extra dummy variables
- interpretation becomes unclear

Handling Missing and Mixed Data
Example:
"25", "30", "unknown", "40"
🛠 Safe conversion
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
🧠 What happens?
- valid numbers stay
- invalid values become NaN
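A minimal sketch of this behaviour, using the made-up values above:
import pandas as pd

age = pd.Series(['25', '30', 'unknown', '40'])
print(pd.to_numeric(age, errors='coerce'))
# 'unknown' becomes NaN; the valid entries convert to floats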
⚠️ Sentinel Values
Examples:
- "-"
- "N/A"
- ".."
These represent missing or unavailable data.
🧹 Convert safely
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
💡 Why NaN matters
- prevents bias
- clearly marks missing data
- handled automatically in analysis
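After coercion, you can count how many entries were missing or invalid (a quick check on the Salary column from above):
print(df['Salary'].isna().sum())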
Full Cleaning Workflow (Best Practice)
While pandas allows you to clean data in a single line, it’s often better, especially when learning, to follow a step-by-step approach.
This makes your work easier to debug and validate.
🧹 Step 1: Extract the column
programCol = df['Program']
🔄 Step 2: Clean the values
programCol = programCol.replace('Comp Sci', 'Computer Science')
🔍 Step 2.5: Verify your cleaning
print(programCol.unique())
👉 This helps confirm that inconsistent values have been corrected.
💾 Step 3: Replace the original column
df['Program'] = programCol
🧠 What just happened?
- We extracted a column
- Cleaned it safely
- Verified the result
- Replaced the original data
Now, if you check:
print(df['Program'].unique())
You should no longer see inconsistent values like "Comp Sci".
💡 Why this approach is useful
- Prevents accidental data loss
- Makes debugging easier
- Encourages step-by-step validation
- Reflects how real data workflows are done
👉 This pattern (extract → clean → replace) is something you’ll use often in data analysis.
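Here is the whole pattern in one place (a minimal end-to-end sketch, assuming sample_data.csv contains a 'Program' column as in the earlier examples):
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Extract → clean → verify → replace
programCol = df['Program']
programCol = programCol.replace('Comp Sci', 'Computer Science')
print(programCol.unique())   # verify before overwriting
df['Program'] = programCol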
And remember: not all data issues are about inconsistent labels. Sometimes, as we saw above, the problem is missing or mixed values.
Why Cleaning Matters Before Regression
Before moving to regression:
Your dataset must be:
✔ numeric where needed
✔ consistent in categories
✔ free of formatting issues
✔ with missing values properly handled
⚠️ If you skip cleaning:
- regression becomes misleading
- categories split incorrectly
- results lose meaning
Filtering & Transforming Data
Once your data is clean, you can begin exploring it more deeply.
We will cover this in detail in Week 3, but here is a preview.
🔍 Filtering data
Filtering allows you to focus on specific parts of your dataset:
df[df['Salary'] > 50000]
or:
df[df['Program'] == 'Computer Science']
🔄 Transforming data
Transformation means creating new variables or modifying existing ones:
df['Salary_in_k'] = df['Salary'] / 1000
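Filtering and transforming can also be combined (a small sketch using the column names from the earlier examples):
# Keep Computer Science rows earning above 50,000,
# then express their salaries in thousands
cs = df[(df['Program'] == 'Computer Science') & (df['Salary'] > 50000)]
print(cs['Salary'] / 1000)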
🧠 Why this matters
These techniques help you:
- focus on subsets of data
- prepare variables for analysis
- build better statistical models later
👉 We will expand on this in Week 3.
🧠 Final Takeaway
Data cleaning is not optional.
It is the foundation of all statistical analysis.
Better data → better models → better decisions
Coming Next (Week 3)
Now that your data is clean, we move into:
📊 Filtering datasets
📈 Data transformation
📉 OLS Regression (statsmodels)
🧠 Interpreting regression results
Thank you for reading 🙂