In the world of data science and analytics, the quality of your data directly impacts the quality of your insights and decisions. This is where the crucial process of data cleaning comes into play. Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies in datasets. It’s an essential step that transforms raw, messy data into a clean, analysis-ready format.
Python, with its rich ecosystem of libraries such as pandas and numpy, has become a go-to language for data cleaning tasks. In this comprehensive guide, we’ll explore a wide range of data cleaning techniques using Python, diving deep into each concept with detailed explanations and examples.
Whether you’re a beginner just starting your data science journey or an experienced analyst looking to refine your skills, this guide will provide valuable insights into the art and science of data cleaning.
The Importance of Data Cleaning
Before we dive into the technical aspects, let’s understand why data cleaning is so crucial.
- Accuracy: Clean data leads to more accurate analyses and predictions. Inaccurate data can lead to flawed conclusions and poor decision-making.
- Consistency: It ensures that your data is consistent across all records, making it easier to analyze and interpret.
- Completeness: Data cleaning helps identify and handle missing or incomplete information, providing a more comprehensive dataset for analysis.
- Reliability: Clean data increases the reliability of your results and decisions. You can trust the insights derived from well-cleaned data.
- Efficiency: While it may seem time-consuming initially, data cleaning saves time in the long run by preventing errors in downstream analyses and reducing the need for rework.
Now, let’s explore the various techniques and Python functions we can use to clean our data effectively.
Setting Up Your Python Environment
To get started with data cleaning in Python, you’ll need to set up your environment. Here’s a basic setup.
```python
import pandas as pd
import numpy as np

print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```
- We import pandas as pd and numpy as np. These are conventional aliases that make it easier to use these libraries.
- Pandas is a powerful library for data manipulation and analysis in Python.
- NumPy is a fundamental package for scientific computing in Python.
- We print the versions of these libraries to ensure compatibility and for documentation purposes.
Loading and Inspecting Data
Before we can clean our data, we need to load it into our Python environment. Pandas provides various functions to read data from different sources.
```python
# Reading a CSV file
df = pd.read_csv('your_data.csv')
print("Data loaded from CSV:")
print(df.head())

# Reading an Excel file
df_excel = pd.read_excel('your_data.xlsx')
print("\nData loaded from Excel:")
print(df_excel.head())

# Reading a JSON file
df_json = pd.read_json('your_data.json')
print("\nData loaded from JSON:")
print(df_json.head())
```
Explanation:
- pd.read_csv(): This function reads a comma-separated values (CSV) file into a pandas DataFrame.
- pd.read_excel(): This function reads an Excel file into a pandas DataFrame.
- pd.read_json(): This function reads a JSON file into a pandas DataFrame.
- .head(): This method displays the first few rows of the DataFrame, giving us a quick peek at our data.
Once your data is loaded, it’s crucial to inspect it.
```python
# Display the first few rows
print("First few rows of the data:")
print(df.head())

# Get information about the DataFrame
print("\nDataFrame information:")
df.info()  # info() prints its report directly, so no print() wrapper is needed

# Get summary statistics
print("\nSummary statistics:")
print(df.describe())
```
Explanation:
- df.head(): Shows the first 5 rows of the DataFrame. This gives us a quick look at the structure of our data.
- df.info(): Provides a concise summary of the DataFrame, including column names, non-null counts, and data types. This is useful for identifying missing values and data types that might need conversion.
- df.describe(): Generates descriptive statistics of the DataFrame. For numeric columns, it shows count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values. This helps in understanding the distribution of our data and identifying potential outliers.
Handling Missing Values
Missing values are a common issue in real-world datasets. Pandas provides several methods to deal with them.
```python
# Check for missing values
print("Number of missing values in each column:")
print(df.isnull().sum())

# Remove rows with missing values
df_cleaned = df.dropna()
print("\nShape of DataFrame after removing rows with missing values:")
print(df_cleaned.shape)

# Fill missing values with a specific value
df_filled = df.fillna(0)
print("\nFirst few rows after filling missing values with 0:")
print(df_filled.head())

# Fill missing values with the mean of the column
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())
print("\nFirst few rows after filling missing values with column mean:")
print(df.head())

# Forward fill missing values
df_ffill = df.ffill()
print("\nFirst few rows after forward filling:")
print(df_ffill.head())

# Backward fill missing values
df_bfill = df.bfill()
print("\nFirst few rows after backward filling:")
print(df_bfill.head())
```
Explanation:
- df.isnull().sum(): This checks for missing values in each column and sums them up, giving us a count of missing values per column.
- df.dropna(): This removes all rows that contain any missing values. It's useful when you want to work with only complete cases.
- df.fillna(0): This fills all missing values with 0. This method is useful when 0 is a meaningful placeholder in your data.
- df['column'].fillna(df['column'].mean()): This fills missing values in a specific column with the mean of that column. This is a common imputation technique for numeric data.
- df.ffill(): This performs a forward fill, where missing values are filled with the last valid observation.
- df.bfill(): This performs a backward fill, where missing values are filled with the next valid observation.
The choice of method depends on your specific dataset and the nature of the missing data. Sometimes, it might be appropriate to remove rows with missing values, while in other cases, imputation (filling missing values) might be more suitable.
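A common pattern is to tailor the imputation to each column rather than applying one rule everywhere. Here is a minimal sketch (the column names and values are hypothetical) that uses the median for a skewed numeric column and the mode for a categorical column:

```python
import pandas as pd
import numpy as np

# Hypothetical data: a skewed numeric column and a categorical column
df = pd.DataFrame({
    'income': [40000, 52000, np.nan, 61000, 1_000_000],
    'city': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Pune']
})

# Median is more robust than the mean when a column contains extreme values
df['income'] = df['income'].fillna(df['income'].median())

# Mode (most frequent value) is a common choice for categorical columns
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df)
```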
Handling Duplicate Data
Duplicate records can skew your analysis. Here’s how to identify and remove them.
```python
# Create a sample DataFrame with duplicates
data = {
    'A': [1, 2, 2, 3, 3],
    'B': ['a', 'b', 'b', 'c', 'c']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for duplicates
print("\nNumber of duplicate rows:")
print(df.duplicated().sum())

# Remove duplicate rows
df_cleaned = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_cleaned)

# Remove duplicates based on specific columns
df_cleaned_subset = df.drop_duplicates(subset=['B'])
print("\nDataFrame after removing duplicates based on column 'B':")
print(df_cleaned_subset)
```
Explanation:
- df.duplicated(): This method returns a boolean Series denoting duplicate rows.
- df.duplicated().sum(): This gives us the count of duplicate rows.
- df.drop_duplicates(): This removes all duplicate rows, keeping the first occurrence (the sketch below shows how to control which occurrence is kept).
- df.drop_duplicates(subset=['B']): This removes duplicates based only on the values in column 'B'. This is useful when you want to consider only specific columns for determining duplicates.
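If you need control over which occurrence survives, drop_duplicates() accepts a keep parameter. A short sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'b', 'b', 'c', 'c']})

# keep='first' (the default): keep the first occurrence of each duplicate group
print(df.drop_duplicates(keep='first'))

# keep='last': keep the last occurrence instead
print(df.drop_duplicates(keep='last'))

# keep=False: drop every row that has a duplicate anywhere in the DataFrame
print(df.drop_duplicates(keep=False))
```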
Converting Data Types
Ensuring that your data is of the correct type is crucial for proper analysis.
```python
# Create a sample DataFrame
data = {
    'A': ['1', '2', '3', '4', '5'],
    'B': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.dtypes)

# Convert a column to integer type
df['A'] = df['A'].astype(int)
print("\nColumn 'A' converted to integer:")
print(df.dtypes)

# Convert a column to datetime type
df['B'] = pd.to_datetime(df['B'])
print("\nColumn 'B' converted to datetime:")
print(df.dtypes)

print("\nUpdated DataFrame:")
print(df)
```
Explanation:
- df['A'].astype(int): This converts the 'A' column to integer type. It's important to ensure that all values in the column can be converted to integers (see the sketch below for a more forgiving conversion).
- pd.to_datetime(df['B']): This converts the 'B' column to datetime type. Pandas is quite flexible in recognizing various date formats.
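Real-world columns often contain a few values that cannot be converted at all. A minimal sketch (with hypothetical dirty values) of a more forgiving conversion using errors='coerce', which turns unparseable entries into NaN/NaT instead of raising an error:

```python
import pandas as pd

raw = pd.DataFrame({
    'A': ['1', '2', 'three', '4'],                                  # 'three' cannot become a number
    'B': ['2023-01-01', 'not a date', '2023-01-03', '2023-01-04']   # 'not a date' cannot become a date
})

raw['A'] = pd.to_numeric(raw['A'], errors='coerce')    # 'three' -> NaN
raw['B'] = pd.to_datetime(raw['B'], errors='coerce')   # 'not a date' -> NaT

print(raw)
print(raw.dtypes)
```

The resulting NaN/NaT values can then be handled with the missing-value techniques covered earlier.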
Handling Outliers
Outliers can significantly affect your analysis. Here are a couple of methods to detect and handle outliers.
```python
import numpy as np

# Create a sample DataFrame with outliers
np.random.seed(0)
data = {
    'A': np.random.normal(0, 1, 1000).tolist() + [10, -10]  # Adding outliers
}
df = pd.DataFrame(data)
print("Original DataFrame statistics:")
print(df.describe())

# Using Z-score method
def remove_outliers_zscore(df, column, threshold=3):
    mean = np.mean(df[column])
    std = np.std(df[column])
    df_cleaned = df[(df[column] > mean - threshold * std) & (df[column] < mean + threshold * std)]
    return df_cleaned

df_cleaned_zscore = remove_outliers_zscore(df, 'A')
print("\nDataFrame statistics after removing outliers (Z-score method):")
print(df_cleaned_zscore.describe())

# Using IQR method
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df_cleaned = df[(df[column] >= (Q1 - 1.5 * IQR)) & (df[column] <= (Q3 + 1.5 * IQR))]
    return df_cleaned

df_cleaned_iqr = remove_outliers_iqr(df, 'A')
print("\nDataFrame statistics after removing outliers (IQR method):")
print(df_cleaned_iqr.describe())
```
Explanation:
- Z-score method: This method assumes that the data is normally distributed. It calculates the z-score for each data point and removes those that are more than a certain number of standard deviations (threshold) away from the mean.
- IQR method: This method is less sensitive to extreme values. It calculates the interquartile range (IQR) and removes data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
Both methods are effective for detecting outliers, but the choice depends on your data distribution and the nature of your outliers.
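Removing rows is not the only option. When the extreme values carry information you would rather not discard, you can cap (winsorize) them instead. Here is a minimal sketch that reuses the IQR bounds with pandas' clip():

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'A': np.random.normal(0, 1, 1000).tolist() + [10, -10]})

Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# clip() caps values below `lower` at `lower` and above `upper` at `upper`
df['A_capped'] = df['A'].clip(lower=lower, upper=upper)

print(df[['A', 'A_capped']].describe())
```

This keeps every row while limiting the influence of the extreme values on downstream statistics.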
Normalizing and Scaling Data
Normalization can be crucial for certain types of analyses, especially in machine learning:
```python
# Create a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Min-Max scaling
df['A_normalized'] = (df['A'] - df['A'].min()) / (df['A'].max() - df['A'].min())
print("\nDataFrame after Min-Max scaling of column 'A':")
print(df)

# Z-score normalization
df['B_normalized'] = (df['B'] - df['B'].mean()) / df['B'].std()
print("\nDataFrame after Z-score normalization of column 'B':")
print(df)
```
Explanation:
- Min-Max scaling: This scales the values to a fixed range, typically 0 to 1. The formula is (x - min(x)) / (max(x) - min(x)).
- Z-score normalization: This transforms the data to have a mean of 0 and a standard deviation of 1. The formula is (x - mean(x)) / std(x).
These normalization techniques are particularly useful when you have features on different scales and you want to bring them to a common scale.
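When you need to scale many columns at once, scikit-learn's preprocessing scalers are a convenient alternative to writing the formulas by hand. A minimal sketch, assuming scikit-learn is installed (note that StandardScaler uses the population standard deviation, so its output can differ very slightly from pandas' default .std()):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Min-Max scaling of every column to the [0, 1] range
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Z-score standardization (mean 0, standard deviation 1) of every column
df_standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)
print(df_standard)
```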
Handling Text Data
Text data often requires specific cleaning techniques.
```python
# Create a sample DataFrame with text data
data = {
    'text': [' Hello, World! ', 'Python IS AWESOME', 'data science, machine learning']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Convert to lowercase
df['text_lower'] = df['text'].str.lower()
print("\nText converted to lowercase:")
print(df)

# Remove leading/trailing whitespace
df['text_stripped'] = df['text'].str.strip()
print("\nLeading/trailing whitespace removed:")
print(df)

# Replace specific substrings
df['text_replaced'] = df['text'].str.replace('Python', 'R', regex=False)
print("\nReplaced 'Python' with 'R':")
print(df)

# Remove punctuation
df['text_no_punct'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
print("\nPunctuation removed:")
print(df)

# Check for substrings
df['contains_data'] = df['text'].str.contains('data', case=False)
print("\nChecked for substring 'data':")
print(df)
```
Explanation:
- str.lower(): Converts all characters to lowercase.
- str.strip(): Removes leading and trailing whitespace.
- str.replace('Python', 'R', regex=False): Replaces a specific substring in the text. Note that Series.replace() without .str only matches whole cell values, so .str.replace() is the right tool for substrings.
- str.replace(r'[^\w\s]', '', regex=True): Uses a regular expression to remove all characters that are not word characters or whitespace.
- str.contains(): Checks if a substring is present in the text.
These text cleaning operations are crucial for natural language processing tasks and ensuring consistency in text data.
Renaming and Reorganizing Columns
Sometimes, you need to rename columns for clarity or consistency.
```python
# Create a sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Rename a single column
df = df.rename(columns={'A': 'Alpha'})
print("\nDataFrame after renaming column 'A' to 'Alpha':")
print(df)

# Rename multiple columns
df = df.rename(columns={'B': 'Beta', 'C': 'Gamma'})
print("\nDataFrame after renaming multiple columns:")
print(df)

# Reorder columns
df = df[['Gamma', 'Alpha', 'Beta']]
print("\nDataFrame after reordering columns:")
print(df)
```
Explanation:
- rename(columns={'old_name': 'new_name'}): This method is used to rename columns. You can rename a single column or multiple columns at once.
- df[['col1', 'col2', 'col3']]: This syntax is used to reorder columns. You simply list the columns in the order you want them to appear.
Filtering Data
Filtering allows you to focus on specific subsets of your data.
```python
# Create a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['x', 'y', 'z', 'x', 'y']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Filter based on a condition
df_filtered = df[df['A'] > 3]
print("\nDataFrame filtered where 'A' > 3:")
print(df_filtered)

# Filter based on multiple conditions
df_filtered_multi = df[(df['A'] > 2) & (df['B'] < 40)]
print("\nDataFrame filtered where 'A' > 2 and 'B' < 40:")
print(df_filtered_multi)

# Using query method
df_queried = df.query('A > 3 and C == "x"')
print("\nDataFrame filtered using query method:")
print(df_queried)
```
Explanation:
- df[df['A'] > 3]: This creates a new DataFrame containing only the rows where the value in column 'A' is greater than 3.
- df[(df['A'] > 2) & (df['B'] < 40)]: This filters the DataFrame based on multiple conditions. The & operator represents 'and'. Note the use of parentheses to ensure the correct order of operations.
- df.query('A > 3 and C == "x"'): The query() method allows you to filter using a string expression. It can be more readable for complex conditions.
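Two other filtering helpers worth knowing are isin() for membership tests and between() for inclusive range checks. A short sketch on the same sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 30, 40, 50],
                   'C': ['x', 'y', 'z', 'x', 'y']})

# Keep rows whose 'C' value is one of several allowed categories
print(df[df['C'].isin(['x', 'z'])])

# Keep rows whose 'B' value falls inside an inclusive range
print(df[df['B'].between(20, 40)])
```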
Handling Categorical Data
Categorical data often needs to be encoded for machine learning models.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample DataFrame with categorical data
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'medium', 'small']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['color'])
print("\nDataFrame after one-hot encoding 'color':")
print(df_encoded)

# Label encoding
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
print("\nDataFrame after label encoding 'size':")
print(df)

print("\nEncoding mapping:")
for i, category in enumerate(le.classes_):
    print(f"{category}: {i}")
```
Explanation:
- One-hot encoding: This creates new binary columns for each category in the original column. Each new column represents one category and contains 1 if the row belongs to that category, 0 otherwise.
- Label encoding: This assigns a unique integer to each category. It's most useful when the categories have an ordinal relationship; note that LabelEncoder assigns the integers in alphabetical order rather than in any domain order (see the sketch below for encoding an explicit order).
- pd.get_dummies(): This function performs one-hot encoding on specified columns.
- LabelEncoder(): This is a scikit-learn class for label encoding. The fit_transform() method both fits the encoder to the data and transforms it in one step.
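When the categories do have a meaningful order (for example, small < medium < large), it is usually safer to encode that order explicitly rather than rely on LabelEncoder's alphabetical assignment. A minimal sketch using an ordered pandas Categorical:

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Declare the intended order explicitly
size_order = ['small', 'medium', 'large']
df['size'] = pd.Categorical(df['size'], categories=size_order, ordered=True)

# .cat.codes yields 0 for 'small', 1 for 'medium', 2 for 'large'
df['size_encoded'] = df['size'].cat.codes

print(df)
```

Unlike one-hot encoding, this keeps a single column and preserves the ordering information.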
Combining and Merging Data
Often, you need to combine data from multiple sources.
```python
# Create sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=['K0', 'K2', 'K3'])
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Concatenate DataFrames vertically
df_concat = pd.concat([df1, df2], axis=0)
print("\nConcatenated DataFrame (vertical):")
print(df_concat)

# Concatenate DataFrames horizontally
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print("\nConcatenated DataFrame (horizontal):")
print(df_concat_horizontal)

# Merge DataFrames based on index
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
print("\nMerged DataFrame based on index:")
print(df_merged)

# Create DataFrames with a key column
df3 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df4 = pd.DataFrame({'key': ['K0', 'K1', 'K3'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})

# Merge DataFrames based on a key column
df_merged_key = pd.merge(df3, df4, on='key', how='outer')
print("\nMerged DataFrame based on 'key' column:")
print(df_merged_key)
```
Explanation:
- pd.concat([df1, df2], axis=0): This concatenates DataFrames vertically (along rows). It's useful when you have DataFrames with the same columns but different rows.
- pd.concat([df1, df2], axis=1): This concatenates DataFrames horizontally (along columns). It's useful when you have DataFrames with the same index but different columns.
- pd.merge(df1, df2, left_index=True, right_index=True, how='outer'): This merges DataFrames based on their index. The 'outer' join keeps all indices from both DataFrames.
- pd.merge(df3, df4, on='key', how='outer'): This merges DataFrames based on a specified key column. The 'outer' join keeps all keys from both DataFrames.
Handling Time Series Data
Time series data requires specific handling.
```python
import pandas as pd
import numpy as np

# Create a sample time series DataFrame
dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df = pd.DataFrame({'date': dates, 'value': np.random.randn(len(dates))})
print("Original DataFrame:")
print(df)

# Set date column as index
df.set_index('date', inplace=True)
print("\nDataFrame with date as index:")
print(df)

# Resample time series data
df_resampled = df.resample('3D').mean()
print("\nResampled DataFrame (3-day intervals):")
print(df_resampled)

# Create a DataFrame with missing dates
dates_with_gaps = pd.date_range(start='2023-01-01', end='2023-01-10', freq='2D')
df_gaps = pd.DataFrame({'date': dates_with_gaps, 'value': np.random.randn(len(dates_with_gaps))})
df_gaps.set_index('date', inplace=True)
print("\nDataFrame with missing dates:")
print(df_gaps)

# Fill missing dates in time series
df_filled = df_gaps.asfreq('D')
print("\nDataFrame with all dates (missing values as NaN):")
print(df_filled)

# Forward fill missing values
df_ffilled = df_filled.ffill()
print("\nDataFrame with forward-filled values:")
print(df_ffilled)
```
Explanation:
- pd.date_range(): This function creates a range of datetime objects. It's useful for creating time series data.
- df.set_index('date', inplace=True): This sets the 'date' column as the index of the DataFrame. It's a common practice for time series data.
- df.resample('3D').mean(): This resamples the data to 3-day intervals, taking the mean of the values in each interval.
- df_gaps.asfreq('D'): This ensures the DataFrame has entries for every day, filling in missing dates with NaN values.
- df_filled.ffill(): This forward-fills the NaN values, using the last known value (interpolation, sketched below, is a smoother alternative for numeric series).
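Forward filling simply repeats the last observation. For numeric time series, interpolating between the known points is often a smoother choice. A minimal sketch:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='2D')
df_gaps = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)

# Reindex to daily frequency, then interpolate between the known values
df_daily = df_gaps.asfreq('D')
df_daily['value_interpolated'] = df_daily['value'].interpolate(method='time')

print(df_daily)
```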
Aggregating Data
Data aggregation is often necessary for summary statistics.
```python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'category': ['A', 'B', 'A', 'B', 'A', 'C'],
    'value1': [1, 2, 3, 4, 5, 6],
    'value2': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Group by and aggregate
df_grouped = df.groupby('category').agg({
    'value1': ['mean', 'sum', 'count'],
    'value2': ['min', 'max']
})
print("\nGrouped and aggregated DataFrame:")
print(df_grouped)

# Pivot table (using the string 'sum' avoids deprecation warnings for np.sum)
pivot = pd.pivot_table(df, values=['value1', 'value2'], index=['category'], aggfunc='sum')
print("\nPivot table:")
print(pivot)
```
Explanation:
- df.groupby('category'): This groups the DataFrame by the 'category' column.
- .agg(): This applies different aggregation functions to different columns. In this case, we're calculating the mean, sum, and count of 'value1', and the min and max of 'value2' for each category.
- pd.pivot_table(): This creates a spreadsheet-style pivot table. Here, we're summing 'value1' and 'value2' for each category.
Handling Imbalanced Data
For machine learning tasks, imbalanced datasets can be problematic.
```python
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y
print("Original dataset shape:", df['target'].value_counts())

# Oversample minority class using SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
df_resampled = pd.DataFrame(X_resampled, columns=df.columns[:-1])
df_resampled['target'] = y_resampled
print("\nResampled dataset shape:", df_resampled['target'].value_counts())
```
Explanation:
- make_classification(): This scikit-learn function creates a synthetic classification dataset. We're creating an imbalanced dataset with a 90-10 split between classes.
- SMOTE(): Synthetic Minority Over-sampling Technique (SMOTE) is a method to address imbalanced datasets by creating synthetic examples of the minority class.
- smote.fit_resample(X, y): This applies SMOTE to our dataset, creating new synthetic examples of the minority class until the dataset is balanced.
Data Validation
It’s crucial to validate your data after cleaning.
```python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'A': [1, 2, 3, np.nan, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for any remaining missing values
print("\nMissing values:")
print(df.isnull().sum())

# Verify data types
print("\nData types:")
print(df.dtypes)

# Check for unexpected values in categorical columns
print("\nUnique values in column 'B':")
print(df['B'].unique())

# Verify value ranges for numeric columns
print("\nSummary statistics for numeric columns:")
print(df.describe())

# Check for duplicates
print("\nNumber of duplicate rows:")
print(df.duplicated().sum())
```
Explanation:
- df.isnull().sum(): This checks for and counts missing values in each column.
- df.dtypes: This shows the data type of each column.
- df['B'].unique(): This shows all unique values in column 'B'. It's useful for checking categorical variables.
- df.describe(): This provides summary statistics for numeric columns, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
- df.duplicated().sum(): This counts the number of duplicate rows in the DataFrame.
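In an automated pipeline, you can turn these checks into hard assertions so that a bad dataset fails loudly instead of silently producing wrong results. A minimal sketch; the expected ranges and categories below are illustrative assumptions, not rules from this dataset:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e'],
                   'C': [10, 20, 30, 40, 50]})

# Fail fast if any expectation is violated (the expectations are examples)
assert df.notnull().all().all(), "Unexpected missing values remain"
assert df.duplicated().sum() == 0, "Unexpected duplicate rows"
assert df['C'].between(0, 100).all(), "Column 'C' outside expected range 0-100"
assert set(df['B']).issubset({'a', 'b', 'c', 'd', 'e'}), "Unexpected categories in 'B'"

print("All validation checks passed.")
```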
Handling Large Datasets
When dealing with large datasets, memory management becomes crucial.
```python
import pandas as pd

# Reading data in chunks
chunk_size = 1000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk (e.g., perform some transformation)
    processed_chunk = chunk  # Replace with actual processing
    chunks.append(processed_chunk)

# Combine all chunks
df = pd.concat(chunks, ignore_index=True)
print("Shape of the combined DataFrame:", df.shape)

# Use efficient data types
df = df.astype({'int_column': 'int32', 'float_column': 'float32'})
print("\nMemory usage of the DataFrame:")
df.info(memory_usage='deep')
```
Explanation:
- pd.read_csv('large_file.csv', chunksize=chunk_size): This reads the CSV file in chunks, which is useful when the file is too large to fit into memory.
- The for loop processes each chunk separately. You can perform any necessary operations on each chunk.
- pd.concat(chunks, ignore_index=True): This combines all processed chunks into a single DataFrame.
- df.astype({'column': 'dtype'}): This converts columns to more memory-efficient data types. For example, using 'int32' instead of 'int64' can significantly reduce memory usage for large datasets.
- df.info(memory_usage='deep'): This shows detailed information about the DataFrame's memory usage.
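You can also save memory before the data ever reaches a DataFrame by telling read_csv() which columns to load and which dtypes to use. A minimal sketch (the file and column names are placeholders):

```python
import pandas as pd

# File and column names are hypothetical placeholders
df = pd.read_csv(
    'large_file.csv',
    usecols=['int_column', 'float_column', 'category_column'],  # load only what you need
    dtype={'int_column': 'int32',
           'float_column': 'float32',
           'category_column': 'category'},  # 'category' is compact for repeated strings
)

df.info(memory_usage='deep')
```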
Documenting Your Data Cleaning Process
It’s essential to document your data cleaning steps for reproducibility.
```python
import pandas as pd
import numpy as np
import logging

# Set up logging
logging.basicConfig(filename='data_cleaning_log.txt', level=logging.INFO,
                    format='%(asctime)s - %(message)s', datefmt='%d-%b-%y %H:%M:%S')

# Create a sample DataFrame
data = {'A': [1, 2, np.nan, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

logging.info('Started data cleaning process')
logging.info(f'Original DataFrame shape: {df.shape}')

# Remove missing values
df_cleaned = df.dropna().copy()
logging.info(f'Removed missing values. New shape: {df_cleaned.shape}')

# Normalize column 'B'
df_cleaned['B_normalized'] = (df_cleaned['B'] - df_cleaned['B'].min()) / (df_cleaned['B'].max() - df_cleaned['B'].min())
logging.info('Normalized column B')

logging.info('Completed data cleaning process')
print("Check 'data_cleaning_log.txt' for the cleaning process log.")
```
Explanation:
- logging.basicConfig(): This sets up the logging configuration, including the log file name, logging level, and format of the log messages.
- logging.info(): This logs an informational message. We use it to document each step of our data cleaning process.
- The log file will contain timestamped entries for each step, making it easy to review the process later.
Automating Data Cleaning
For repeated data cleaning tasks, consider creating reusable functions.
```python
import pandas as pd
import numpy as np

def clean_data(df):
    """
    Perform standard cleaning operations on a DataFrame.

    Args:
        df (pandas.DataFrame): The DataFrame to clean.

    Returns:
        pandas.DataFrame: The cleaned DataFrame.
    """
    # Remove missing values
    df = df.dropna()

    # Remove duplicates
    df = df.drop_duplicates()

    # Convert to appropriate data types
    if 'date_column' in df.columns:
        df['date_column'] = pd.to_datetime(df['date_column'])
    if 'numeric_column' in df.columns:
        df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

    # Normalize numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    for col in numeric_columns:
        df[f'{col}_normalized'] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    return df

# Create a sample DataFrame
data = {
    'date_column': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01'],
    'numeric_column': ['1', '2', '3', '1'],
    'other_column': ['A', 'B', 'C', 'A']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Use the cleaning function
df_cleaned = clean_data(df)
print("\nCleaned DataFrame:")
print(df_cleaned)
```
Explanation:
- We define a clean_data function that performs several common cleaning operations:
  - Removing missing values
  - Removing duplicates
  - Converting date columns to datetime type
  - Converting numeric columns to numeric type
  - Normalizing numeric columns
- This function can be easily reused for different datasets, ensuring consistency in your data cleaning process.
- The pd.to_numeric(errors='coerce') call converts the column to numeric, turning any non-numeric values into NaN.
- The normalization step creates new columns with the suffix '_normalized' for each numeric column.
Handling Categorical Variables with Many Categories
Sometimes, categorical variables can have too many unique categories, which can be problematic for certain analyses or machine learning models.
```python
import pandas as pd
import numpy as np

# Create a sample DataFrame with a categorical variable that has many categories
np.random.seed(0)
data = {
    'category': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=1000),
    'value': np.random.randn(1000)
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df['category'].value_counts())

# Function to group less frequent categories
def group_rare_categories(series, threshold):
    value_counts = series.value_counts()
    mask = value_counts < threshold
    # Preserve the original index so the result aligns with the DataFrame
    return pd.Series(np.where(series.isin(value_counts[mask].index), 'Other', series),
                     index=series.index)

# Group categories that appear fewer than 50 times
df['category_grouped'] = group_rare_categories(df['category'], 50)
print("\nDataFrame after grouping rare categories:")
print(df['category_grouped'].value_counts())
```
Explanation:
- We create a sample DataFrame with a categorical variable ‘category’ that has many unique values (all letters of the alphabet).
- The group_rare_categories function takes a series and a threshold as input. It groups all categories that appear less frequently than the threshold into an 'Other' category.
- We apply this function to our 'category' column, grouping all categories that appear fewer than 50 times.
- This technique can be useful when you have categorical variables with many unique values, as it can help reduce the dimensionality of your data and prevent overfitting in machine learning models.
Handling Skewed Data
Skewed data can be problematic for many statistical analyses and machine learning algorithms. Here’s how to detect and transform skewed data.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a sample DataFrame with skewed data
np.random.seed(0)
df = pd.DataFrame({
    'normal': np.random.normal(0, 1, 1000),
    'right_skewed': np.random.exponential(1, 1000),
    'left_skewed': -np.random.exponential(1, 1000)
})

def plot_histogram(data, title):
    plt.figure(figsize=(10, 4))
    plt.hist(data, bins=30)
    plt.title(title)
    plt.show()

# Plot histograms
for column in df.columns:
    plot_histogram(df[column], f'Distribution of {column}')

# Calculate skewness
print("Skewness:")
print(df.skew())

# Apply log transformation to right-skewed data
df['right_skewed_log'] = np.log1p(df['right_skewed'])

# Apply exponential transformation to left-skewed data
df['left_skewed_exp'] = np.exp(df['left_skewed'])

# Plot histograms of transformed data
plot_histogram(df['right_skewed_log'], 'Log-transformed right-skewed data')
plot_histogram(df['left_skewed_exp'], 'Exp-transformed left-skewed data')

# Calculate skewness of transformed data
print("\nSkewness after transformation:")
print(df[['right_skewed_log', 'left_skewed_exp']].skew())
```
Explanation:
- We create a DataFrame with three columns: normal distribution, right-skewed distribution (exponential), and left-skewed distribution (negative exponential).
- The plot_histogram function is a helper to visualize the distribution of our data.
- We use df.skew() to calculate the skewness of each column. A skewness > 0 indicates right skew, < 0 indicates left skew, and 0 indicates no skew.
- For right-skewed data, we apply a log transformation using np.log1p() (which is log(1+x), used instead of log(x) to avoid issues with zero values).
- For left-skewed data, we apply an exponential transformation.
- After transformation, we plot the histograms and calculate skewness again to see the effect of our transformations.
Handling Multicollinearity
Multicollinearity in data can be problematic for certain types of analyses, particularly regression. Here’s how to detect and address it.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create a sample DataFrame with multicollinearity
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000),
    'C': np.random.randn(1000),
    'D': np.random.randn(1000)
})
df['E'] = df['A'] + df['B'] + np.random.randn(1000) * 0.1  # E is highly correlated with A and B

# Calculate correlation matrix
correlation_matrix = df.corr()
print("Correlation Matrix:")
print(correlation_matrix)

# Function to print highly correlated pairs
def print_highly_correlated(correlation_matrix, threshold=0.8):
    high_corr = np.where(np.abs(correlation_matrix) > threshold)
    high_corr_list = [(correlation_matrix.index[x], correlation_matrix.columns[y], correlation_matrix.iloc[x, y])
                      for x, y in zip(*high_corr) if x != y and x < y]
    print("\nHighly correlated pairs:")
    for pair in high_corr_list:
        print(f"{pair[0]} and {pair[1]}: {pair[2]:.2f}")

print_highly_correlated(correlation_matrix)

# Address multicollinearity using PCA
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

pca = PCA()
df_pca = pca.fit_transform(df_scaled)

# Print explained variance ratio
print("\nExplained Variance Ratio:")
print(pca.explained_variance_ratio_)

# Create new DataFrame with PCA components
df_pca = pd.DataFrame(df_pca, columns=[f'PC{i+1}' for i in range(df.shape[1])])
print("\nDataFrame after PCA:")
print(df_pca.head())
```
Explanation:
- We create a DataFrame with 5 columns, where ‘E’ is highly correlated with ‘A’ and ‘B’.
- We calculate the correlation matrix using df.corr().
- The print_highly_correlated function identifies and prints pairs of variables with correlation above a certain threshold.
- To address multicollinearity, we use Principal Component Analysis (PCA):
  - First, we standardize the data using StandardScaler.
  - Then we apply PCA using sklearn.decomposition.PCA.
- The explained variance ratio tells us how much variance each principal component accounts for.
- We create a new DataFrame with the principal components, which are uncorrelated with each other.
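Another common way to quantify multicollinearity is the variance inflation factor (VIF); as a rough rule of thumb, values above about 5-10 flag a problematic feature. A minimal sketch, assuming statsmodels is installed:

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(0)
df = pd.DataFrame({'A': np.random.randn(1000),
                   'B': np.random.randn(1000),
                   'C': np.random.randn(1000)})
df['E'] = df['A'] + df['B'] + np.random.randn(1000) * 0.1  # collinear with A and B

# Add an intercept column, then compute the VIF of each feature
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop('const'))
```

Unlike PCA, dropping or combining only the high-VIF features keeps the remaining columns interpretable.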
Handling Time-based Features
Time-based features can be crucial for many analyses. Here’s how to extract useful features from datetime data.
```python
import pandas as pd
import numpy as np

# Create a sample DataFrame with a datetime column
dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq='D')
df = pd.DataFrame({'date': dates, 'value': np.random.randn(len(dates))})
print("Original DataFrame:")
print(df.head())

# Extract various time-based features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)

# Create seasonal features
df['sin_day'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['cos_day'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

print("\nDataFrame with time-based features:")
print(df.head())

# Calculate rolling statistics
df['rolling_mean_7d'] = df['value'].rolling(window=7).mean()
df['rolling_std_30d'] = df['value'].rolling(window=30).std()

print("\nDataFrame with rolling statistics:")
print(df.head())
```
Explanation:
- We create a DataFrame with a date column spanning two years.
- We extract various time-based features using pandas' datetime accessor (dt):
  - Year, month, day
  - Day of the week (0 = Monday, 6 = Sunday)
  - Quarter
  - Is weekend (1 if Saturday or Sunday, 0 otherwise)
- We create cyclical features for the day of the week using sine and cosine transformations. This can be useful for machine learning models as it preserves the cyclical nature of days.
- We calculate rolling statistics:
  - 7-day rolling mean
  - 30-day rolling standard deviation
- These time-based features can be very useful for time series analysis and forecasting tasks.
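For forecasting tasks, lagged copies of the target are another standard time-based feature: the value one day ago, seven days ago, and so on. A minimal sketch using shift():

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2023-01-01', periods=30, freq='D')
df = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)

# Lag features: the value 1 day ago and 7 days ago (NaN where no history exists)
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)

print(df.head(10))
```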
Conclusion
Data cleaning is a crucial step in any data analysis or machine learning project. It’s often said that data scientists spend up to 80% of their time cleaning and preparing data. The techniques and examples we’ve covered in this comprehensive guide should give you a solid foundation for handling a wide variety of data cleaning tasks in Python.
Remember, the specific cleaning steps you’ll need to take will depend on your particular dataset and the goals of your analysis. Always start by thoroughly understanding your data – its structure, content, and any potential issues. Then, apply the appropriate cleaning techniques to ensure your data is accurate, consistent, and ready for analysis.
As you gain more experience, you'll develop an intuition for common data issues and efficient ways to address them. Keep practicing, stay curious, and always strive for clean, high-quality data.