How to Perform Basic Statistical Analysis Using NumPy in Python

Introduction

Alright, let’s be honest: when you first hear the words “statistical analysis,” what comes to mind? For most people, it’s either terrifying flashbacks to high school math class or that overwhelming feeling of staring at spreadsheets until your eyes cross. But here’s the secret that data scientists know and the rest of us are slowly figuring out: statistics isn’t about complicated math; it’s about telling stories with numbers.

And that’s where NumPy comes in. Picture this: You’ve got a mountain of data, maybe it’s sales figures for your small business, temperature readings from your smart home devices, or survey responses from your latest product launch. The numbers are there, but what do they actually mean? That’s where statistical analysis helps, and NumPy makes it possible without needing a PhD in mathematics.

I remember when I first started learning Python for data analysis. I’d spent hours writing loops to calculate averages, manually finding minimums and maximums, and basically reinventing the wheel for every simple calculation. Then I discovered NumPy, and it was like someone handed me a calculator after I’d been doing long division by hand for months.

What Exactly Is NumPy, and Why Should You Care?

The NumPy Origin Story

Let’s rewind a bit. NumPy stands for Numerical Python, and it was created back in 2005 by Travis Oliphant. Before NumPy, doing scientific computing in Python was… well, kinda painful. People were using lists for everything, and it was slow, like, “go make coffee while your code runs” slow.

NumPy changed all that by giving us arrays, not the fancy kind you wear to a party, but the data structure kind. These aren’t your grandma’s Python lists. NumPy arrays are:

  • Homogeneous (all elements are the same type)
  • Multidimensional (you can have 1D, 2D, 3D, even n-dimensional arrays)
  • Blazing fast (thanks to being implemented in C)
  • Memory efficient (they store data in contiguous blocks)

But here’s what really matters for our statistical journey: NumPy gives us a toolbox of functions that work on entire arrays at once, no loops required. This concept is called vectorization, and it’s the secret sauce that makes NumPy so powerful.
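To make vectorization concrete, here’s a tiny sketch (the prices are made up, purely for illustration) comparing a hand-written loop with the vectorized one-liner:

```python
import numpy as np

# Hypothetical prices, just for illustration
prices = np.array([10.0, 12.5, 9.99, 14.25])

# The loop way: touch one element at a time
discounted_loop = []
for p in prices:
    discounted_loop.append(p * 0.9)

# The vectorized way: one expression applies to the whole array at once
discounted_vec = prices * 0.9

print(discounted_vec)
```

Both produce identical numbers; the vectorized version is shorter, clearer, and dramatically faster on large arrays.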

Why NumPy for Statistics?

You might be wondering: “Can’t I just use Excel? Or Google Sheets? Or regular Python lists?” Sure, you could. But here’s why NumPy wins:

  1. Speed: Operations that take seconds in NumPy might take minutes with regular Python
  2. Consistency: Once you learn the NumPy way, it works the same every time
  3. Integration: NumPy plays nicely with other data science libraries (Pandas, SciPy, Matplotlib)
  4. Precision: Built-in handling of edge cases and numerical stability

Think of it this way: Using Python lists for statistics is like using a butter knife to cut down a tree. It might work eventually, but there are much better tools for the job.

Getting Started – Your First Steps with NumPy

Installation Made Simple

First things first, you need NumPy installed. If you’re using Anaconda (which I recommend for beginners), it comes pre-installed. Otherwise, open your terminal or command prompt and type:

pip install numpy

Or if you’re using Jupyter Notebook, you can run:

!pip install numpy

The exclamation mark tells Jupyter to run a shell command.

The Import Ritual

Every NumPy script, notebook, or project starts the same way:

import numpy as np

That as np part is crucial: it’s a convention that every NumPy user follows. It saves typing and makes your code look familiar to other data scientists. Type out numpy in full instead of np and you’ll immediately look like a beginner. Not that there’s anything wrong with being a beginner (we all start somewhere), but why not look like a pro from day one?

Creating Your First Array

Arrays are where the magic happens. Let’s create a simple one:

# Creating a NumPy array from a list
my_data = np.array([1, 2, 3, 4, 5])
print(my_data)
print(type(my_data))

Output:

[1 2 3 4 5]
<class 'numpy.ndarray'>

See those missing commas in the print output? That’s your first hint that this isn’t a regular list. That ndarray stands for “n-dimensional array,” which is NumPy’s fancy name for its core data structure.

Quick Array Creation Methods

NumPy gives us shortcuts for common array patterns:

# Create an array of zeros
zeros_array = np.zeros(5)  # [0., 0., 0., 0., 0.]

# Create an array of ones
ones_array = np.ones(5)    # [1., 1., 1., 1., 1.]

# Create a range of numbers
range_array = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

# Create evenly spaced numbers
linspace_array = np.linspace(0, 1, 5)  # [0., 0.25, 0.5, 0.75, 1.]

The decimal points in zeros() and ones() outputs? Those tell us NumPy created float arrays by default. Details matter in data analysis!
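If you want integers instead, the creation functions accept a dtype argument; a quick sketch:

```python
import numpy as np

# The default is float64...
print(np.zeros(3).dtype)  # float64

# ...but you can ask for integers explicitly
int_zeros = np.zeros(3, dtype=int)
print(int_zeros)  # [0 0 0] - no decimal points this time
```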

The Fundamentals – Understanding Your Data’s Shape

Before we dive into statistics, we need to understand our data’s structure. It’s like trying to cook without knowing what ingredients you have: possible, but risky.

Checking Array Properties

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Shape tells us the dimensions
print("Shape:", data.shape)  # (10,) - 1D array with 10 elements

# Size tells us total number of elements
print("Size:", data.size)    # 10

# ndim tells us number of dimensions
print("Dimensions:", data.ndim)  # 1

# dtype tells us the data type
print("Data type:", data.dtype)  # int64

Reshaping Arrays

Sometimes your data doesn’t come in the shape you need. Let’s say you have monthly sales data for 3 years:

# 36 months of sales data
monthly_sales = np.arange(1, 37)

# Reshape to 3 years × 12 months
yearly_sales = monthly_sales.reshape(3, 12)
print(yearly_sales)

Output:

[[ 1  2  3  4  5  6  7  8  9 10 11 12]
 [13 14 15 16 17 18 19 20 21 22 23 24]
 [25 26 27 28 29 30 31 32 33 34 35 36]]

Now we can analyze yearly patterns! The reshape method doesn’t change the values; for a contiguous array like this one, it just returns a differently shaped view of the same memory.
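One way to convince yourself of that: a reshaped view’s .base attribute points back at the original array, and edits to the original show through. A small sketch:

```python
import numpy as np

monthly_sales = np.arange(1, 37)
yearly_sales = monthly_sales.reshape(3, 12)

# The reshaped array is a view: its .base is the original array
print(yearly_sales.base is monthly_sales)  # True

# Edits to the original show through the view
monthly_sales[0] = 99
print(yearly_sales[0, 0])  # 99
```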

Measures of Central Tendency – Finding the “Middle”

Alright, let’s get to the actual statistics! Measures of central tendency answer the question: “Where’s the middle of my data?” It’s like finding the center of gravity for your dataset.

Mean

The mean is what most people call the “average,” though statisticians get twitchy when you use those terms interchangeably: strictly speaking, “average” is an umbrella term that can refer to the mean, the median, or the mode.

# Sample data: Daily temperatures for a week (°C)
temperatures = np.array([22, 24, 19, 25, 23, 21, 20])

# Calculate the mean
mean_temp = np.mean(temperatures)
print(f"Mean temperature: {mean_temp:.2f}°C")  # :.2f formats to 2 decimal places

Output: Mean temperature: 22.00°C

What’s happening mathematically?
Sum all values: 22 + 24 + 19 + 25 + 23 + 21 + 20 = 154
Divide by count: 154 ÷ 7 = 22
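The same arithmetic, spelled out in code as a quick sanity check:

```python
import numpy as np

temperatures = np.array([22, 24, 19, 25, 23, 21, 20])

total = np.sum(temperatures)              # 22 + 24 + ... + 20 = 154
mean_by_hand = total / temperatures.size  # 154 / 7 = 22.0

print(mean_by_hand == np.mean(temperatures))  # True
```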

The mean’s superpower: It uses every data point.
The mean’s weakness: It’s sensitive to outliers.

Check this out:

# Add an outlier (heat wave!)
temperatures_with_outlier = np.array([22, 24, 19, 25, 23, 21, 20, 40])
print(f"Mean with outlier: {np.mean(temperatures_with_outlier):.2f}°C")

Output: Mean with outlier: 24.25°C

One unusually hot day (40°C) pulled the average up by over 2 degrees! This is why we need other measures…

Median

The median is the value that splits your data in half: 50% of values are below it, 50% are above.

# Odd number of values
temps_odd = np.array([19, 20, 21, 22, 23, 24, 25])
median_odd = np.median(temps_odd)
print(f"Median (odd count): {median_odd}")  # 22

# Even number of values
temps_even = np.array([19, 20, 21, 22, 23, 24])
median_even = np.median(temps_even)
print(f"Median (even count): {median_even}")  # 21.5 (average of 21 and 22)

The median’s superpower: Resistant to outliers!
Let’s test it:

# Same outlier example
temps_with_outlier = np.array([19, 20, 21, 22, 23, 24, 25, 40])
print(f"Median with outlier: {np.median(temps_with_outlier)}")  # 22.5

The median barely budged! That’s why median house prices or median incomes often give a better picture than means, they’re not skewed by billionaires or mansion sales.

Mode

NumPy doesn’t have a built-in mode function (that’s in SciPy), but we can understand it conceptually. The mode is the most frequent value.

# Let's use SciPy for mode
from scipy import stats

survey_responses = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
mode_result = stats.mode(survey_responses)
print(f"Mode: {mode_result.mode[0]}, Count: {mode_result.count[0]}")

Output: Mode: 3, Count: 3

The mode is great for categorical data (like “most common car color”) or when values cluster around specific points.
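If you’d rather not pull in SciPy, you can also get the mode with plain NumPy using np.unique; a small sketch with the same survey data:

```python
import numpy as np

survey_responses = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Unique values alongside how often each occurs
values, counts = np.unique(survey_responses, return_counts=True)
mode_value = values[np.argmax(counts)]

print(f"Mode: {mode_value}, Count: {counts.max()}")  # Mode: 3, Count: 3
```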

Weighted Average: When Some Values Matter More

Sometimes, not all data points are created equal. Enter the weighted average:

# Student grades with different assignment weights
grades = np.array([85, 92, 78, 95, 88])
weights = np.array([0.1, 0.2, 0.2, 0.3, 0.2])  # Must sum to 1!

weighted_avg = np.average(grades, weights=weights)
print(f"Weighted average: {weighted_avg:.2f}")

# Compare with regular mean
regular_mean = np.mean(grades)
print(f"Regular mean: {regular_mean:.2f}")

Output:

Weighted average: 88.60
Regular mean: 87.60

The final project (weight 0.3, grade 95) pulled up the weighted average! This is how your GPA is calculated, not all courses carry the same weight.
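Under the hood, the weighted average is just “multiply each value by its weight, add them up, divide by the total weight.” Here’s that formula by hand, as a check against np.average:

```python
import numpy as np

grades = np.array([85, 92, 78, 95, 88])
weights = np.array([0.1, 0.2, 0.2, 0.3, 0.2])

# weighted average = sum(value * weight) / sum(weights)
manual = np.sum(grades * weights) / np.sum(weights)

print(np.isclose(manual, np.average(grades, weights=weights)))  # True
```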

Measures of Spread – How “Spread Out” Is Your Data?

Knowing the center is helpful, but it doesn’t tell the whole story. Two datasets can have the same mean but be completely different! Let’s measure spread.

Range

Range = Maximum – Minimum. Simple, but limited.

salaries_company_a = np.array([45000, 48000, 52000, 55000, 60000])
salaries_company_b = np.array([45000, 46000, 47000, 48000, 120000])

print(f"Company A range: {np.ptp(salaries_company_a)}")  # 15000
print(f"Company B range: {np.ptp(salaries_company_b)}")  # 75000

# ptp stands for "peak to peak" - same as max-min
print(f"Manual range A: {np.max(salaries_company_a) - np.min(salaries_company_a)}")

Company B has one executive making way more, creating a huge range. But range only looks at two values, what about everything in between?

Variance

Variance answers the question: “How far, on average, are points from the mean?” Technically, it measures the average squared distance from the mean.

# Let's calculate variance step-by-step to understand it
data = np.array([2, 4, 6, 8, 10])
mean = np.mean(data)  # 6

# Step 1: Find differences from mean
differences = data - mean  # [-4, -2, 0, 2, 4]

# Step 2: Square the differences (removes negatives, emphasizes large deviations)
squared_diffs = differences ** 2  # [16, 4, 0, 4, 16]

# Step 3: Average the squared differences
variance_manual = np.mean(squared_diffs)  # 8.0

# Step 4: Compare with NumPy's built-in
variance_numpy = np.var(data)
print(f"Manual variance: {variance_manual}")
print(f"NumPy variance: {variance_numpy}")

# Population vs Sample variance
print(f"Population variance: {np.var(data)}")  # Default
print(f"Sample variance: {np.var(data, ddof=1)}")  # ddof=1 for sample

Why square the differences?

  1. Gets rid of negatives (distance can’t be negative)
  2. Emphasizes larger deviations (2 units off becomes 4, 3 units becomes 9)

Population vs Sample:

  • Use population variance (ddof=0) when you have ALL the data
  • Use sample variance (ddof=1) when you have a sample estimating a larger population

The ddof stands for “Delta Degrees of Freedom.” For samples, we divide by n-1 instead of n to correct for sampling bias. This is called Bessel’s correction, named after Friedrich Bessel who probably wished his name was easier to spell.
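You can see Bessel’s correction directly by computing both versions by hand; a quick sketch:

```python
import numpy as np

sample = np.array([2, 4, 6, 8, 10])
n = sample.size
squared_devs = (sample - sample.mean()) ** 2  # [16, 4, 0, 4, 16]

pop_var = squared_devs.sum() / n         # divide by n     -> 8.0
samp_var = squared_devs.sum() / (n - 1)  # divide by n - 1 -> 10.0

print(pop_var == np.var(sample))           # True
print(samp_var == np.var(sample, ddof=1))  # True
```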

Standard Deviation

The problem with variance? It’s in squared units. If your data is in dollars, variance is in dollars². Who thinks in squared dollars?

# Same data as before
data = np.array([2, 4, 6, 8, 10])

std_dev = np.std(data)
print(f"Standard deviation: {std_dev:.2f}")
print(f"That's the square root of variance: {np.sqrt(np.var(data)):.2f}")

# Real-world example: Test scores
test_scores = np.array([78, 85, 92, 65, 88, 72, 95, 81])
scores_std = np.std(test_scores)
print(f"Test scores std: {scores_std:.2f} points")

# What does this mean practically?
mean_score = np.mean(test_scores)
print(f"Most scores are within {scores_std:.2f} points of {mean_score:.2f}")

Standard deviation gives us a “typical” distance from the mean. For normally distributed data (the famous bell curve):

  • About 68% of data falls within ±1 standard deviation of the mean
  • About 95% within ±2 standard deviations
  • About 99.7% within ±3 standard deviations

This is the Empirical Rule or 68-95-99.7 Rule, and it’s incredibly useful for understanding distributions.
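You can sanity-check the Empirical Rule yourself on simulated normal data (the exact percentages will wobble a little from sample to sample):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100_000)

mean, std = data.mean(), data.std()

# Fraction of points within k standard deviations of the mean
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mean) <= k * std)
    print(f"Within ±{k} std: {within:.1%}")
```

With 100,000 points, the printed fractions land very close to 68%, 95%, and 99.7%.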

Interquartile Range (IQR)

IQR measures the spread of the middle 50% of your data, making it resistant to outliers.

# House prices in a neighborhood (in thousands)
house_prices = np.array([250, 275, 300, 320, 350, 380, 400, 450, 500, 2000])

# The 2000 is an outlier (maybe a mansion)

# Calculate quartiles
q1 = np.percentile(house_prices, 25)  # 25th percentile
q3 = np.percentile(house_prices, 75)  # 75th percentile
iqr = q3 - q1

print(f"Q1 (25th percentile): ${q1:.0f}K")
print(f"Q3 (75th percentile): ${q3:.0f}K")
print(f"IQR: ${iqr:.0f}K")

# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = house_prices[(house_prices < lower_bound) | (house_prices > upper_bound)]

print(f"\nOutlier bounds: [${lower_bound:.0f}K, ${upper_bound:.0f}K]")
print(f"Outliers detected: {outliers}")

Output:

Q1 (25th percentile): $305K
Q3 (75th percentile): $438K
IQR: $132K

Outlier bounds: [$106K, $636K]
Outliers detected: [2000]

The $2M house gets flagged as an outlier! IQR is used in box plots (those whisker diagrams) and is great for robust data analysis.

Percentiles and Quantiles – Dividing Your Data

Percentiles divide your data into hundredths. The 50th percentile is the median, the 25th is Q1, the 75th is Q3.

# Exam scores for a large class
exam_scores = np.random.normal(75, 10, 1000)  # 1000 scores, mean=75, std=10

# Common percentiles
percentiles = np.percentile(exam_scores, [25, 50, 75, 90, 95])
print("25th percentile (Q1):", percentiles[0])
print("50th percentile (Median):", percentiles[1])
print("75th percentile (Q3):", percentiles[2])
print("90th percentile:", percentiles[3])  # Better than 90% of students
print("95th percentile:", percentiles[4])  # Top 5%!

# What score do you need to beat 80% of students?
score_80th = np.percentile(exam_scores, 80)
print(f"\nTo beat 80% of students, you need: {score_80th:.1f}")

Real-world application: Standardized tests! When you hear “you scored in the 85th percentile,” it means you did better than 85% of test-takers.
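Going the other direction, “what percentile does my score land at?”, is just the fraction of scores below it. A sketch with simulated data and a hypothetical score (scipy.stats.percentileofscore does the same job):

```python
import numpy as np

rng = np.random.default_rng(7)
exam_scores = rng.normal(75, 10, 1000)

my_score = 85  # a hypothetical score, one std above the mean
percentile_rank = np.mean(exam_scores < my_score) * 100

print(f"A score of {my_score} beats roughly {percentile_rank:.0f}% of students")
```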

Multiple Quantiles at Once

# Get deciles (10%, 20%, ..., 90%)
deciles = np.percentile(exam_scores, np.arange(10, 100, 10))
for i, decile in enumerate(deciles, 1):
    print(f"{i*10}th percentile: {decile:.2f}")

Covariance and Correlation (Relationships Between Variables)

Now we’re getting to the fun part! How do two variables move together?

Covariance (Do They Move Together?)

Covariance measures whether two variables tend to increase or decrease together.

# Hours studied vs. Exam scores
hours_studied = np.array([2, 5, 7, 10, 12, 15, 18, 20])
exam_scores = np.array([55, 65, 70, 75, 80, 85, 88, 90])

covariance = np.cov(hours_studied, exam_scores)[0, 1]
print(f"Covariance: {covariance:.2f}")

# What does this number mean?
if covariance > 0:
    print("Positive covariance: As hours increase, scores tend to increase")
elif covariance < 0:
    print("Negative covariance: As hours increase, scores tend to decrease")
else:
    print("No linear relationship")

The problem with covariance: The units are weird (hours × score points), and the magnitude depends on your data’s scale. That’s why we need…

Correlation (Covariance’s Standardized Cousin)

Correlation (specifically Pearson correlation) standardizes covariance to a range of -1 to 1.

# Same data
correlation = np.corrcoef(hours_studied, exam_scores)[0, 1]
print(f"Correlation coefficient: {correlation:.3f}")

# Interpretation guide
if abs(correlation) > 0.7:
    strength = "strong"
elif abs(correlation) > 0.3:
    strength = "moderate"
else:
    strength = "weak"

if correlation > 0:
    direction = "positive"
elif correlation < 0:
    direction = "negative"
else:
    direction = "no"

print(f"This shows a {strength} {direction} correlation.")

# Visualize what different correlations look like
import matplotlib.pyplot as plt

# Generate example data with different correlations
np.random.seed(42)
n = 100

# Strong positive correlation
x1 = np.random.randn(n)
y1 = x1 + np.random.randn(n) * 0.3

# Strong negative correlation  
y2 = -x1 + np.random.randn(n) * 0.3

# No correlation: two independent variables
x3 = np.random.randn(n)
y3 = np.random.randn(n)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, x, y, title in zip(axes,
                          [x1, x1, x3],
                          [y1, y2, y3],
                          [f"Positive (r={np.corrcoef(x1, y1)[0,1]:.2f})",
                           f"Negative (r={np.corrcoef(x1, y2)[0,1]:.2f})",
                           f"None (r={np.corrcoef(x3, y3)[0,1]:.2f})"]):
    ax.scatter(x, y, alpha=0.6)
    ax.set_title(title)
    ax.set_xlabel("Variable X")
    ax.set_ylabel("Variable Y")
plt.tight_layout()
plt.show()

Important caveat: Correlation ≠ Causation! Just because two things are correlated doesn’t mean one causes the other. Ice cream sales and drowning incidents are correlated (both increase in summer), but buying ice cream doesn’t cause drowning!

Working with Real, Messy Data

Real-world data isn’t clean arrays of perfect numbers. Let’s handle some common issues.

Missing Data (NaN Handling)

NumPy uses np.nan for missing values (NaN = “Not a Number”).

# Real data often has missing values
sales_data = np.array([1200, 1350, np.nan, 1420, np.nan, 1550, 1480])

# Problems with NaN
print(f"Mean with NaN: {np.mean(sales_data)}")  # Returns nan!
print(f"Sum with NaN: {np.sum(sales_data)}")     # Also nan!

# Solution: Use nan-aware functions
print(f"Nan-safe mean: {np.nanmean(sales_data):.2f}")
print(f"Nan-safe sum: {np.nansum(sales_data):.0f}")
print(f"Nan-safe std: {np.nanstd(sales_data):.2f}")

# Count non-NaN values
valid_count = np.count_nonzero(~np.isnan(sales_data))
print(f"Valid data points: {valid_count}/7")

Data Type Matters

# Be careful with integer vs float arrays
int_array = np.array([1, 2, 3, 4, 5])
float_array = np.array([1., 2., 3., 4., 5.])

# Mean of integers
print(f"Mean of ints: {np.mean(int_array)} - type: {np.mean(int_array).dtype}")

# Division with integers can be surprising
print(f"Int division: {int_array / 2}")  # Becomes float!
print(f"Floor division: {int_array // 2}")  # Keeps as int

# Convert between types
float_version = int_array.astype(float)
print(f"Converted to float: {float_version} - dtype: {float_version.dtype}")

Handling Large Datasets

# For huge datasets, memory matters
large_data = np.random.randn(10_000_000)  # 10 million points!

# Time different operations
import time

start = time.time()
mean_slow = sum(large_data) / len(large_data)  # Python's sum
time_slow = time.time() - start

start = time.time()
mean_fast = np.mean(large_data)  # NumPy's mean
time_fast = time.time() - start

print(f"Python sum time: {time_slow:.3f} seconds")
print(f"NumPy mean time: {time_fast:.3f} seconds")
print(f"NumPy is {time_slow/time_fast:.1f}x faster!")

# Memory usage
print(f"\nMemory usage: {large_data.nbytes / 1_000_000:.1f} MB")

On my machine, NumPy was about 50x faster! That difference matters when processing millions of records.

Statistical Functions for 2D Arrays (Matrices)

Real data often comes in tables (rows and columns). Let’s analyze multidimensional data.

Column-wise and Row-wise Statistics

# Monthly sales for 3 products across 4 quarters
sales_data = np.array([
    [120, 150, 130, 140],  # Product A
    [80, 90, 85, 95],      # Product B  
    [200, 220, 210, 230]   # Product C
])

print("Sales data shape:", sales_data.shape)  # (3 products, 4 quarters)

# Mean of each product (across quarters)
product_means = np.mean(sales_data, axis=1)
print(f"\nAverage quarterly sales per product: {product_means}")

# Mean of each quarter (across products)
quarter_means = np.mean(sales_data, axis=0)
print(f"Average sales per quarter: {quarter_means}")

# Overall mean
overall_mean = np.mean(sales_data)
print(f"Overall average: {overall_mean:.2f}")

# Variance along different axes
print(f"\nVariance per product: {np.var(sales_data, axis=1)}")
print(f"Variance per quarter: {np.var(sales_data, axis=0)}")

The axis parameter is crucial:

  • axis=0: Aggregate down the columns (collapses the rows, giving one result per column)
  • axis=1: Aggregate across the rows (collapses the columns, giving one result per row)
  • axis=None (the default): Aggregate over all elements at once

Correlation Matrix for Multiple Variables

# Sales, marketing spend, and website traffic
sales = np.array([100, 120, 130, 150, 170])
marketing = np.array([50, 55, 60, 65, 70])
traffic = np.array([1000, 1100, 1200, 1300, 1400])

# Combine into a 2D array
data_matrix = np.column_stack([sales, marketing, traffic])

# Calculate correlation matrix
corr_matrix = np.corrcoef(data_matrix, rowvar=False)  # rowvar=False means columns are variables

print("Correlation Matrix:")
print("                Sales   Marketing   Traffic")
print(f"Sales:       {corr_matrix[0,0]:.2f}     {corr_matrix[0,1]:.2f}       {corr_matrix[0,2]:.2f}")
print(f"Marketing:   {corr_matrix[1,0]:.2f}     {corr_matrix[1,1]:.2f}       {corr_matrix[1,2]:.2f}")
print(f"Traffic:     {corr_matrix[2,0]:.2f}     {corr_matrix[2,1]:.2f}       {corr_matrix[2,2]:.2f}")

# Diagonal is always 1 (variable perfectly correlates with itself)
# Matrix is symmetric (corr(X,Y) = corr(Y,X))

Advanced Statistical Functions

Skewness and Kurtosis

While NumPy doesn’t have built-in skewness and kurtosis (those are in SciPy), let’s understand what they measure:

Skewness: Is your data symmetric?

  • Positive skew: Right tail longer (mean > median)
  • Negative skew: Left tail longer (mean < median)

Kurtosis: How heavy are the tails?

  • High kurtosis: Heavy tails, more outliers
  • Low kurtosis: Light tails, fewer outliers

from scipy import stats

# Generate different distributions
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(1, 1000)

print(f"Normal data - Skewness: {stats.skew(normal_data):.3f}, Kurtosis: {stats.kurtosis(normal_data):.3f}")
print(f"Skewed data - Skewness: {stats.skew(skewed_data):.3f}, Kurtosis: {stats.kurtosis(skewed_data):.3f}")

Cumulative Statistics

# Running sales total
daily_sales = np.array([100, 150, 200, 180, 220])
cumulative_sales = np.cumsum(daily_sales)
print(f"Daily sales: {daily_sales}")
print(f"Cumulative: {cumulative_sales}")

# Running average
running_avg = np.cumsum(daily_sales) / np.arange(1, len(daily_sales) + 1)
print(f"Running average: {running_avg}")

Statistical Summary Function

Create your own summary function:

def data_summary(data, name="Data"):
    """Print comprehensive statistical summary."""
    print(f"\n{'='*50}")
    print(f"Statistical Summary: {name}")
    print(f"{'='*50}")

    # Basic stats
    print(f"Count: {data.size}")
    print(f"Mean: {np.mean(data):.4f}")
    print(f"Median: {np.median(data):.4f}")
    print(f"Std Dev: {np.std(data):.4f}")
    print(f"Variance: {np.var(data):.4f}")
    print(f"Min: {np.min(data):.4f}")
    print(f"Q1 (25%): {np.percentile(data, 25):.4f}")
    print(f"Q2 (50%): {np.percentile(data, 50):.4f}")
    print(f"Q3 (75%): {np.percentile(data, 75):.4f}")
    print(f"Max: {np.max(data):.4f}")
    print(f"Range: {np.ptp(data):.4f}")
    print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25):.4f}")

    # Skewness and Kurtosis if SciPy available
    try:
        from scipy import stats
        print(f"Skewness: {stats.skew(data):.4f}")
        print(f"Kurtosis: {stats.kurtosis(data):.4f}")
    except ImportError:
        print("Note: Install SciPy for skewness and kurtosis")

    print(f"{'='*50}")

# Test it
test_scores = np.random.normal(75, 10, 1000)
data_summary(test_scores, "Exam Scores")

Practical Examples and Case Studies

Example 1: Analyzing Website Traffic

# Simulate 30 days of website traffic
np.random.seed(123)
daily_visitors = np.random.poisson(500, 30)  # Poisson distribution for count data
weekends = np.array([5, 6, 12, 13, 19, 20, 26, 27]) - 1  # Weekend indices (0-based)

print("Website Traffic Analysis")
print("=" * 50)

# Basic stats
print(f"Total visitors: {np.sum(daily_visitors):,}")
print(f"Daily average: {np.mean(daily_visitors):.1f}")
print(f"Best day: {np.max(daily_visitors)} visitors")
print(f"Worst day: {np.min(daily_visitors)} visitors")

# Weekend vs weekday comparison
weekday_visitors = np.delete(daily_visitors, weekends)
weekend_visitors = daily_visitors[weekends]

print(f"\nWeekday average: {np.mean(weekday_visitors):.1f}")
print(f"Weekend average: {np.mean(weekend_visitors):.1f}")
print(f"Difference: {np.mean(weekend_visitors) - np.mean(weekday_visitors):.1f}")

# Trend analysis (7-day moving average)
moving_avg = np.convolve(daily_visitors, np.ones(7)/7, mode='valid')
print(f"\n7-day moving average range: {np.min(moving_avg):.1f} - {np.max(moving_avg):.1f}")

Example 2: A/B Testing Analysis

# Simulate A/B test results
np.random.seed(42)
group_a_conversions = np.random.binomial(1, 0.12, 1000)  # 12% conversion rate
group_b_conversions = np.random.binomial(1, 0.15, 1000)  # 15% conversion rate

print("A/B Test Results: New Design (B) vs Old Design (A)")
print("=" * 60)

# Conversion rates
conv_rate_a = np.mean(group_a_conversions) * 100
conv_rate_b = np.mean(group_b_conversions) * 100
improvement = ((conv_rate_b - conv_rate_a) / conv_rate_a) * 100

print(f"Group A (old): {conv_rate_a:.2f}% conversion rate")
print(f"Group B (new): {conv_rate_b:.2f}% conversion rate")
print(f"Improvement: {improvement:.1f}%")

# Statistical significance (simplified)
from scipy import stats
t_stat, p_value = stats.ttest_ind(group_a_conversions, group_b_conversions)
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant at 95% confidence level")
    print("Conclusion: New design performs better")
else:
    print("Result: Not statistically significant")
    print("Conclusion: Cannot confirm new design is better")

Example 3: Financial Data Analysis

# Simulate stock returns
np.random.seed(101)
days = 252  # Trading days in a year
stock_returns = np.random.normal(0.0005, 0.02, days)  # Daily returns

# Convert to price series
initial_price = 100
stock_prices = initial_price * np.cumprod(1 + stock_returns)

print("Stock Performance Analysis")
print("=" * 50)

# Returns analysis
print(f"Final price: ${stock_prices[-1]:.2f}")
print(f"Total return: {((stock_prices[-1]/initial_price)-1)*100:.2f}%")
print(f"Average daily return: {np.mean(stock_returns)*100:.3f}%")
print(f"Daily volatility (std): {np.std(stock_returns)*100:.3f}%")

# Annualized metrics
annual_return = np.mean(stock_returns) * 252 * 100
annual_volatility = np.std(stock_returns) * np.sqrt(252) * 100
print(f"\nAnnualized return: {annual_return:.2f}%")
print(f"Annualized volatility: {annual_volatility:.2f}%")

# Risk-adjusted return (Sharpe ratio, assuming 2% risk-free rate)
risk_free_rate = 0.02 / 252  # Daily risk-free rate
excess_returns = stock_returns - risk_free_rate
sharpe_ratio = np.mean(excess_returns) / np.std(excess_returns) * np.sqrt(252)
print(f"Sharpe ratio: {sharpe_ratio:.3f}")

# Worst drawdown
cumulative_returns = (stock_prices / initial_price) - 1
running_max = np.maximum.accumulate(stock_prices)
drawdown = (stock_prices - running_max) / running_max
max_drawdown = np.min(drawdown) * 100
print(f"Maximum drawdown: {max_drawdown:.2f}%")

Performance Tips and Best Practices

1. Use Vectorized Operations

# BAD: Using Python loops
def slow_mean(data):
    total = 0
    for value in data:
        total += value
    return total / len(data)

# GOOD: Using NumPy
def fast_mean(data):
    return np.mean(data)

# Even better: Use NumPy directly!

2. Preallocate Arrays When Possible

# BAD: Growing arrays
result = np.array([])
for i in range(10000):
    result = np.append(result, i)  # Creates new array each time!

# GOOD: Preallocate
result = np.zeros(10000)
for i in range(10000):
    result[i] = i

# BEST: Vectorized
result = np.arange(10000)

3. Use Appropriate Data Types

# Float64 is default, but sometimes you can use smaller types
large_array_64 = np.ones(1_000_000, dtype=np.float64)  # 8 MB
large_array_32 = np.ones(1_000_000, dtype=np.float32)  # 4 MB - half the memory!

print(f"Float64 memory: {large_array_64.nbytes / 1_000_000:.1f} MB")
print(f"Float32 memory: {large_array_32.nbytes / 1_000_000:.1f} MB")

# Trade-off: Less precision but more memory efficient

4. Broadcasting for Efficiency

# Calculate z-scores (standardized scores)
data = np.random.randn(1000)
mean = np.mean(data)
std = np.std(data)

# GOOD: Broadcasting
z_scores = (data - mean) / std  # Applies to entire array at once

# Verify
print(f"Mean of z-scores: {np.mean(z_scores):.6f}")  # Should be ~0
print(f"Std of z-scores: {np.std(z_scores):.6f}")     # Should be ~1

Common Pitfalls and How to Avoid Them

Pitfall 1: Ignoring NaN Values

# This will give you NaN
data_with_nan = np.array([1, 2, np.nan, 4, 5])
print(f"Incorrect mean: {np.mean(data_with_nan)}")  # nan

# Always use nan-aware functions
print(f"Correct mean: {np.nanmean(data_with_nan)}")  # 3.0

Pitfall 2: Confusing Axis Parameters

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print("Full matrix:")
print(matrix)
print(f"\nMean along axis=0 (columns): {np.mean(matrix, axis=0)}")  # [2.5, 3.5, 4.5]
print(f"Mean along axis=1 (rows): {np.mean(matrix, axis=1)}")      # [2., 5.]

Mnemonic: axis=0 collapses rows (vertical), axis=1 collapses columns (horizontal).

Pitfall 3: Integer Division

int_array = np.array([1, 2, 3, 4, 5])

# Surprise! This returns floats
result = int_array / 2
print(f"Division result: {result}, dtype: {result.dtype}")

# If you want integer division
int_result = int_array // 2
print(f"Integer division: {int_result}, dtype: {int_result.dtype}")

Pitfall 4: Sample vs Population Statistics

data = np.array([1, 2, 3, 4, 5])

print(f"Population variance: {np.var(data):.4f}")         # 2.0
print(f"Sample variance: {np.var(data, ddof=1):.4f}")     # 2.5

# Rule: Use ddof=1 when working with a sample of a larger population

Beyond Basic Statistics – What’s Next?

Integrating with Pandas

import pandas as pd

# NumPy arrays work seamlessly with Pandas
data = np.random.randn(100, 3)  # 100 rows, 3 columns
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Pandas builds on NumPy
print("DataFrame describe (uses NumPy under the hood):")
print(df.describe())

# You can always go back to NumPy
numpy_array = df.values  # or df.to_numpy()
print(f"\nBack to NumPy array shape: {numpy_array.shape}")

Visualization with Matplotlib

import matplotlib.pyplot as plt

# Generate some data
data = np.random.normal(0, 1, 1000)

# Histogram
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(data), color='red', linestyle='--', label=f'Mean: {np.mean(data):.2f}')
plt.axvline(np.median(data), color='green', linestyle='--', label=f'Median: {np.median(data):.2f}')
plt.title('Distribution with Mean and Median')
plt.legend()

# Box plot
plt.subplot(1, 2, 2)
plt.boxplot(data)
plt.title('Box Plot Showing Quartiles and Outliers')
plt.ylabel('Values')

plt.tight_layout()
plt.show()

Statistical Testing with SciPy

from scipy import stats

# T-test example
group1 = np.random.normal(5, 1, 30)
group2 = np.random.normal(5.5, 1, 30)

t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Chi-square test
observed = np.array([30, 45, 25])  # Observed frequencies
expected = np.array([35, 40, 25])  # Expected frequencies
chi2, p = stats.chisquare(observed, expected)
print(f"Chi-square: χ² = {chi2:.3f}, p = {p:.4f}")

Conclusion: Your Statistical Analysis Toolkit

We’ve covered a lot of ground! Let’s recap what we now have in our toolkit:

  1. Central Tendency: mean(), median(), mode (via SciPy)
  2. Spread/Variability: var(), std(), ptp(), IQR via percentile()
  3. Distribution Shape: Percentiles, quantiles
  4. Relationships: cov(), corrcoef()
  5. Missing Data: nanmean(), nanstd(), etc.
  6. Multidimensional Data: Axis parameter in all functions
  7. Real-world Applications: Financial analysis, A/B testing, trend analysis

The beautiful thing about NumPy is that once you understand these basic statistical functions, you’re not just learning Python, you’re learning the fundamental concepts that underpin all statistical analysis. Whether you move on to machine learning with scikit-learn, deep learning with TensorFlow, or advanced statistics with Statsmodels, you’ll be building on this foundation.

Remember: Statistics isn’t about perfection, it’s about understanding. It’s about looking at a jumble of numbers and seeing patterns, trends, and stories. It’s about making informed decisions even when you don’t have complete information.

As you continue your data journey, keep this in mind: The most sophisticated analysis starts with the basics we covered today. Master these fundamentals, practice with real data (even if it’s just your personal finances or exercise tracking), and you’ll be amazed at what you can discover.

Now go forth and analyze! The world is full of data waiting to tell its story, and you now have the tools to listen.