
Data Cleaning in Python | Pandas & NumPy


    Last Updated on: 25th March 2026, 06:04 pm


    Python, with its rich ecosystem of libraries such as pandas and NumPy, has become a go-to language for data cleaning tasks. In this comprehensive guide, we'll explore a wide range of data cleaning techniques using Python, diving deep into each concept with detailed explanations and examples.

    Whether you’re a beginner just starting your data science journey or an experienced analyst looking to refine your skills, this guide will provide valuable insights into the art and science of data cleaning.

    The Importance of Data Cleaning

    • Accuracy: Clean data leads to more accurate analyses and predictions. Inaccurate data can lead to flawed conclusions and poor decision-making.
    • Consistency: It ensures that your data is consistent across all records, making it easier to analyze and interpret.
    • Completeness: Data cleaning helps identify and handle missing or incomplete information, providing a more comprehensive dataset for analysis.
    • Reliability: Clean data increases the reliability of your results and decisions. You can trust the insights derived from well-cleaned data.
    • Efficiency: While it may seem time-consuming initially, data cleaning saves time in the long run by preventing errors in downstream analyses and reducing the need for rework.

    Setting Up Your Python Environment

    📦 environment setup
    import pandas as pd
    import numpy as np
    print("Pandas version:", pd.__version__)
    print("NumPy version:", np.__version__)
    We import pandas as pd and numpy as np — conventional aliases that make data manipulation effortless. Pandas is the powerhouse for DataFrames, while NumPy enables fast numerical operations. We print versions for compatibility and documentation.

    Loading and Inspecting Data

    read data sources
    # Reading a CSV file
    df = pd.read_csv('your_data.csv')
    print(df.head())
    
    # Reading an Excel file
    df_excel = pd.read_excel('your_data.xlsx')
    print(df_excel.head())
    
    # Reading a JSON file
    df_json = pd.read_json('your_data.json')
    print(df_json.head())
    pd.read_csv(), pd.read_excel(), pd.read_json() load structured data into DataFrames. .head() gives a quick preview — essential first glance at structure, columns and sample values.
    inspect dataframe
    print(df.info())
    print(df.describe())
    df.info() reveals column dtypes, non-null counts, memory usage. df.describe() provides statistical summary (mean, std, percentiles) to spot outliers and distribution quirks.

    Handling Missing Values

    missing data tactics
    # Check missing values
    df.isnull().sum()
    
    # Drop rows with any missing
    df_cleaned = df.dropna()
    
    # Fill with constant
    df_filled = df.fillna(0)
    
    # Fill numeric column with mean
    df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())
    
    # Forward fill & backward fill
    df_ffill = df.ffill()
    df_bfill = df.bfill()
    dropna() removes rows with missing entries. fillna(value) imputes with constant or statistics. ffill() propagates last valid observation, great for time series.
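    A minimal self-contained sketch of these tactics on a toy column (the column name is illustrative):

    ```python
    import pandas as pd
    import numpy as np

    # Toy column with two gaps
    df = pd.DataFrame({'numeric_col': [1.0, np.nan, 3.0, np.nan, 5.0]})

    print(df['numeric_col'].isnull().sum())            # 2 missing values

    # Mean imputation: gaps become the column mean (3.0)
    mean_filled = df['numeric_col'].fillna(df['numeric_col'].mean())
    print(mean_filled.tolist())                        # [1.0, 3.0, 3.0, 3.0, 5.0]

    # Forward fill: gaps take the last valid observation
    ffilled = df['numeric_col'].ffill()
    print(ffilled.tolist())                            # [1.0, 1.0, 3.0, 3.0, 5.0]
    ```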

    Handling Duplicate Data

    deduplication
    data = {'A': [1,2,2,3,3], 'B': ['a','b','b','c','c']}
    df = pd.DataFrame(data)
    
    df.duplicated().sum()          # count duplicates
    df_clean = df.drop_duplicates() # remove all dup rows
    df_subset = df.drop_duplicates(subset=['B'])
    duplicated() flags duplicate rows. drop_duplicates() keeps first occurrence by default. Using subset restricts duplicate detection to specific columns.

    Converting Data Types

    type coercion
    df['A'] = df['A'].astype(int)               # to integer
    df['B'] = pd.to_datetime(df['B'])           # to datetime
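    One caveat worth knowing: astype(int) raises an error if the column contains any non-numeric value. pd.to_numeric with errors='coerce' is the forgiving alternative, turning bad values into NaN so you can handle them with the missing-value tactics above. A quick sketch with illustrative data:

    ```python
    import pandas as pd

    s = pd.Series(['1', '2', 'oops', '4'])

    # s.astype(int) would raise ValueError on 'oops';
    # errors='coerce' converts unparseable values to NaN instead
    nums = pd.to_numeric(s, errors='coerce')
    print(nums.tolist())   # [1.0, 2.0, nan, 4.0]
    ```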

    Handling Outliers

    Z‑score & IQR outlier removal
    # Z-score method (normal distribution assumption)
    def remove_outliers_zscore(df, col, thresh=3):
        mean, std = df[col].mean(), df[col].std()
        return df[(df[col] > mean - thresh*std) & (df[col] < mean + thresh*std)]
    
    # IQR method (robust)
    def remove_outliers_iqr(df, col):
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        return df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]
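    Applied to a toy column with one obvious outlier (repeating the IQR helper so the snippet runs on its own; values are illustrative):

    ```python
    import pandas as pd

    def remove_outliers_iqr(df, col):
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        return df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]

    df = pd.DataFrame({'x': [10, 12, 11, 13, 12, 500]})
    clean = remove_outliers_iqr(df, 'x')
    print(clean['x'].tolist())   # [10, 12, 11, 13, 12] — 500 is dropped
    ```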

    Normalizing and Scaling Data

    min-max & z-score
    # Min-Max scaling (range 0–1)
    df['A_norm'] = (df['A'] - df['A'].min()) / (df['A'].max() - df['A'].min())
    
    # Z-score standardization (mean=0, std=1)
    df['B_std'] = (df['B'] - df['B'].mean()) / df['B'].std()

    Handling Text Data

    text cleaning
    df['text'] = df['text'].str.lower().str.strip()
    df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)   # remove punctuation
    df['has_data'] = df['text'].str.contains('data', case=False)

    Renaming & Reorganizing Columns

    column surgery
    df = df.rename(columns={'A': 'Alpha', 'B': 'Beta'})
    df = df[['Beta', 'Alpha']]   # reorder columns

    Filtering Data

    subset & query
    filtered = df[df['A'] > 3]
    multi = df[(df['A'] > 2) & (df['B'] < 40)]
    query_df = df.query('A > 3 and C == "x"')

    Handling Categorical Data

    one‑hot & label encoding
    # One-hot encoding
    df_encoded = pd.get_dummies(df, columns=['color'])
    
    # Label encoding
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df['size_encoded'] = le.fit_transform(df['size'])

    Combining and Merging Data

    concat & merge
    # Vertical/horizontal concatenation
    df_concat = pd.concat([df1, df2], axis=0)
    df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
    df_key = pd.merge(df3, df4, on='key', how='outer')
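    Since df1 through df4 are placeholders above, here is a self-contained sketch of a key-based outer merge on two tiny frames (names are illustrative):

    ```python
    import pandas as pd

    df3 = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
    df4 = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

    # Outer join keeps every key from both sides;
    # unmatched rows get NaN in the other side's columns
    merged = pd.merge(df3, df4, on='key', how='outer')
    print(merged)
    ```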

    Handling Time Series Data

    datetime index & resample
    df.set_index('date', inplace=True)
    df_resampled = df.resample('3D').mean()
    df_filled = df_gaps.asfreq('D').ffill()   # fillna(method='ffill') is deprecated
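    A self-contained sketch of downsampling with resample (dates and values are illustrative):

    ```python
    import pandas as pd

    # Six daily observations
    idx = pd.date_range('2024-01-01', periods=6, freq='D')
    ts = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6]}, index=idx)

    # Collapse into 3-day bins and average each bin
    resampled = ts.resample('3D').mean()
    print(resampled['value'].tolist())   # [2.0, 5.0]
    ```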

    Aggregating Data

    groupby & pivot
    grouped = df.groupby('category').agg({'value1': ['mean','sum'], 'value2': ['min','max']})
    pivot = pd.pivot_table(df, values='value1', index='category', aggfunc='sum')

    Handling Imbalanced Data (SMOTE)

    synthetic minority oversampling
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    Data Validation

    post‑clean checks
    print(df.isnull().sum())         # missing values
    print(df.dtypes)                 # correct types
    print(df['B'].unique())          # categorical sanity
    print(df.duplicated().sum())     # duplicates remain?

    Handling Large Datasets

    chunked reading & memory optimization
    chunks = []
    for chunk in pd.read_csv('large.csv', chunksize=1000):
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)
    df = df.astype({'int_col': 'int32', 'float_col': 'float32'})

    Documenting Your Data Cleaning Process

    logging reproducibility
    import logging
    logging.basicConfig(filename='clean_log.txt', level=logging.INFO)
    logging.info('Started cleaning')
    df_clean = df.dropna()
    logging.info(f'After dropna: {df_clean.shape}')

    Automating Data Cleaning

    reusable pipeline
    def clean_data(df):
        df = df.dropna().drop_duplicates()
        if 'date_col' in df.columns:
            df['date_col'] = pd.to_datetime(df['date_col'])
        for col in df.select_dtypes(include=[np.number]).columns:
            df[f'{col}_norm'] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        return df
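    Running the pipeline on a toy frame shows the three steps composing (repeating the function so the snippet is self-contained):

    ```python
    import pandas as pd
    import numpy as np

    def clean_data(df):
        df = df.dropna().drop_duplicates()
        if 'date_col' in df.columns:
            df['date_col'] = pd.to_datetime(df['date_col'])
        for col in df.select_dtypes(include=[np.number]).columns:
            df[f'{col}_norm'] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        return df

    # One NaN, one duplicate: both should disappear, then 'a' is min-max scaled
    raw = pd.DataFrame({'a': [1.0, 2.0, 2.0, np.nan, 4.0]})
    clean = clean_data(raw)
    print(clean['a_norm'].tolist())   # [0.0, 0.333..., 1.0]
    ```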

    Handling Categorical Variables with Many Categories

    group rare categories
    def group_rare(series, threshold=50):
        counts = series.value_counts()
        rare = counts[counts < threshold].index
        return series.where(~series.isin(rare), 'Other')
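    Used on a toy categorical column, infrequent labels collapse into a single 'Other' bucket (repeating the helper so the snippet runs standalone; category names are illustrative):

    ```python
    import pandas as pd

    def group_rare(series, threshold=50):
        counts = series.value_counts()
        rare = counts[counts < threshold].index
        return series.where(~series.isin(rare), 'Other')

    colors = pd.Series(['red'] * 60 + ['blue'] * 55 + ['mauve'] * 3)
    grouped = group_rare(colors)
    print(grouped.value_counts().to_dict())   # {'red': 60, 'blue': 55, 'Other': 3}
    ```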

    Handling Skewed Data

    log & exp transformations
    df['right_log'] = np.log1p(df['right_skewed'])   # log(1+x)
    df['left_exp'] = np.exp(df['left_skewed'])       # spreads a compressed left tail

    Handling Multicollinearity (PCA)

    principal component analysis
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    pca = PCA()
    components = pca.fit_transform(scaled)

    Handling Time-based Features

    feature engineering from dates
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['dayofweek'] = df['date'].dt.dayofweek
    df['is_weekend'] = df['dayofweek'].isin([5,6]).astype(int)
    df['rolling_mean'] = df['value'].rolling(7).mean()
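    A self-contained check of the weekend flag on three illustrative dates (in pandas, Monday is 0 and Sunday is 6):

    ```python
    import pandas as pd

    # Friday, Saturday, Sunday
    df = pd.DataFrame({'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-07'])})
    df['dayofweek'] = df['date'].dt.dayofweek
    df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
    print(df['is_weekend'].tolist())   # [0, 1, 1]
    ```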

    Conclusion

    Data cleaning is a crucial step in any data analysis or machine learning project. It’s often said that data scientists spend up to 80% of their time cleaning and preparing data. The techniques and examples we’ve covered in this comprehensive guide should give you a solid foundation for handling a wide variety of data cleaning tasks in Python.

    Remember, the specific cleaning steps you’ll need to take will depend on your particular dataset and the goals of your analysis. Always start by thoroughly understanding your data – its structure, content, and any potential issues. Then, apply the appropriate cleaning techniques to ensure your data is accurate, consistent, and ready for analysis.

    As you gain more experience, you’ll develop an intuition for common data issues and efficient ways to address them. Keep practicing, stay curious, and always strive for clean, high-quality data.
