fbpx
Skip to content

Hotel Booking EDA Data Analysis Python Project

    Hotel Booking EDA Data Analysis Python

    Last Updated on: 22nd May 2024, 04:10 pm

    In this project, we conducted a comprehensive data analysis on a hotel booking dataset to identify key issues affecting booking performance and revenue.

    Using Python and its powerful data analysis libraries such as Pandas, Matplotlib, and Seaborn, we explored various aspects of the dataset, including cancellation rates, booking distribution, pricing strategies, customer types, seasonal trends, lead times, and the reliance on online travel agencies.

    Through visualizations and statistical analysis, we uncovered insights that can help hotels optimize their marketing, pricing, and operational strategies to improve overall performance and customer satisfaction.

    Problem Statement

    The hotel booking dataset reveals several issues that need to be addressed:

    1. High Cancellation Rate:
    • A significant portion of bookings are being canceled, especially for City Hotels.
    1. Uneven Distribution of Bookings:
    • City Hotels receive a higher number of bookings compared to Resort Hotels, indicating a potential imbalance in marketing or pricing strategies.
    1. Pricing Inefficiencies:
    • The Average Daily Rate (ADR) for Resort Hotels is higher, yet they receive fewer bookings. This suggests that pricing strategies may not be optimized.
    1. Customer Type Imbalance:
    • Transient customers dominate the bookings, while other customer types (Contract, Group, etc.) are underrepresented.
    1. Seasonal Booking Trends:
    • There are significant peaks in bookings during the summer months (July and August), which may lead to overbooking issues or underutilization during off-peak seasons.
    1. Lead Time Distribution:
    • Most bookings are made with a lead time of less than 100 days, indicating a potential for optimizing early bird discounts or last-minute deals.
    1. Dependence on Online Travel Agencies (OTAs):
    • The majority of bookings come through Online Travel Agencies, suggesting a heavy reliance on these channels, which may affect profit margins.
    1. Geographic Concentration:
    • The majority of bookings are from a limited number of countries, indicating potential untapped markets.

    Addressing these problems through strategic adjustments in pricing, marketing, customer service, and distribution channels will help improve overall booking performance, reduce cancellations, and optimize revenue.

    EDA Data Analysis

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Load the dataset
    file_path = '/mnt/data/hotel_booking.csv'
    df = pd.read_csv(file_path)
    1. Import Libraries: We import the necessary libraries, pandas for data manipulation, seaborn and matplotlib.pyplot for data visualization.
    2. Load Dataset: The dataset is loaded into a DataFrame df from the specified file path.
    # Display the first few rows of the dataset
    print(df.head())
    1. Preview Data: Display the first few rows of the dataset to get an initial look at the data.
    # Display the structure of the dataset
    print(df.info())
    1. Data Structure: Print the structure of the dataset, showing information about the columns, data types, and non-null counts.
    # Check for missing values
    missing_values = df.isnull().sum()
    print(missing_values[missing_values > 0])
    1. Check for Missing Values: Identify columns with missing values and display the count of missing values for each.
    # Handle missing values (drop rows with missing values for simplicity)
    df = df.dropna()
    
    # Verify no missing values remain
    print(df.isnull().sum())
    1. Handle Missing Values: Drop rows with missing values to simplify the analysis. Verify that no missing values remain.
    # Display the data types of each column
    print(df.dtypes)
    1. Display Data Types: Show the data types of each column in the dataset.
    # Select numerical columns for correlation analysis
    numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
    print("Numerical columns:", numerical_columns)
    1. Select Numerical Columns: Select columns with numerical data types for correlation analysis.
    # Compute the correlation matrix for numerical columns
    correlations = df[numerical_columns].corr()
    print(correlations)
    1. Correlation Matrix: Compute and print the correlation matrix for numerical columns to identify relationships between them.
    # Visualize correlations
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlations, annot=True, cmap='coolwarm')
    plt.title('Correlation Matrix')
    plt.show()
    1. Visualize Correlations: Create a heatmap to visualize the correlation matrix, showing the strength and direction of relationships between numerical variables.
    # Distribution of Bookings by Hotel Type
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='hotel')
    plt.title('Distribution of Bookings by Hotel Type')
    plt.xlabel('Hotel Type')
    plt.ylabel('Number of Bookings')
    plt.show()
    1. Bookings by Hotel Type: Plot the distribution of bookings by hotel type to see which type of hotel receives more bookings.
    # Average Daily Rate (ADR) by Hotel Type
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='hotel', y='adr')
    plt.title('Average Daily Rate (ADR) by Hotel Type')
    plt.xlabel('Hotel Type')
    plt.ylabel('ADR')
    plt.show()
    1. Average Daily Rate by Hotel Type: Create a boxplot to compare the average daily rate between different hotel types.
    # Bookings by Customer Type
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='customer_type')
    plt.title('Bookings by Customer Type')
    plt.xlabel('Customer Type')
    plt.ylabel('Number of Bookings')
    plt.show()
    1. Bookings by Customer Type: Plot the distribution of bookings by customer type to see the most common customer types.
    # Bookings by Market Segment
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='market_segment')
    plt.title('Bookings by Market Segment')
    plt.xlabel('Market Segment')
    plt.ylabel('Number of Bookings')
    plt.show()
    1. Bookings by Market Segment: Visualize the distribution of bookings by market segment to identify the most important market segments.
    # Booking Cancellations
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='is_canceled')
    plt.title('Booking Cancellations')
    plt.xlabel('Is Canceled')
    plt.ylabel('Number of Bookings')
    plt.show()
    1. Booking Cancellations: Plot the distribution of booking cancellations to understand the proportion of canceled bookings.
    # Lead Time Analysis
    plt.figure(figsize=(10, 6))
    sns.histplot(df['lead_time'], bins=30, kde=True)
    plt.title('Lead Time Distribution')
    plt.xlabel('Lead Time')
    plt.ylabel('Frequency')
    plt.show()
    1. Lead Time Analysis: Create a histogram to visualize the distribution of lead times for bookings.
    # Monthly Booking Trends
    df['arrival_date_month'] = pd.Categorical(df['arrival_date_month'], categories=[
        'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], ordered=True)
    
    monthly_bookings = df['arrival_date_month'].value_counts().sort_index()
    
    plt.figure(figsize=(12, 6))
    sns.barplot(x=monthly_bookings.index, y=monthly_bookings.values)
    plt.title('Monthly Booking Trends')
    plt.xlabel('Month')
    plt.ylabel('Number of Bookings')
    plt.xticks(rotation=45)
    plt.show()
    1. Monthly Booking Trends: Plot the number of bookings by month to identify peak booking periods.

    More Visuals

    # Bookings by Distribution Channel
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='distribution_channel')
    plt.title('Bookings by Distribution Channel')
    plt.xlabel('Distribution Channel')
    plt.ylabel('Number of Bookings')
    plt.show()
    
    # Bookings by Meal Type
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='meal')
    plt.title('Bookings by Meal Type')
    plt.xlabel('Meal Type')
    plt.ylabel('Number of Bookings')
    plt.show()
    
    # Average Lead Time by Hotel Type
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='hotel', y='lead_time')
    plt.title('Average Lead Time by Hotel Type')
    plt.xlabel('Hotel Type')
    plt.ylabel('Lead Time')
    plt.show()
    
    # Booking Changes by Hotel Type
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='hotel', y='booking_changes')
    plt.title('Booking Changes by Hotel Type')
    plt.xlabel('Hotel Type')
    plt.ylabel('Number of Booking Changes')
    plt.show()
    
    # Bookings by Country
    top_countries = df['country'].value_counts().head(10)
    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_countries.index, y=top_countries.values)
    plt.title('Top 10 Countries by Number of Bookings')
    plt.xlabel('Country')
    plt.ylabel('Number of Bookings')
    plt.xticks(rotation=45)
    plt.show()
    1. Bookings by Distribution Channel: Plot the distribution of bookings by distribution channel to see which channels are most effective.
    2. Bookings by Meal Type: Visualize the distribution of bookings by meal type to understand customer preferences.
    3. Average Lead Time by Hotel Type: Compare the average lead time for bookings between different hotel types.
    4. Booking Changes by Hotel Type: Compare the number of booking changes between different hotel types.
    5. Bookings by Country: Plot the number of bookings by country, focusing on the top 10 countries.

    Summary of Key Findings and Solutions

    Key Findings:

    1. Distribution of Bookings:
    • Most bookings are for the City Hotel.
    • The average daily rate is higher for the Resort Hotel.
    • The majority of bookings are made by Transient customers.
    • Online TA is the most common market segment.
    • 37% of bookings are canceled.
    • Lead time distribution shows most bookings are made less than 100 days in advance.
    • Peak booking months are July and August.
    • TA/TO is the most common distribution channel.
    • The most common meal type is BB (Bed & Breakfast).
    • Average lead time is higher for City Hotel than Resort Hotel.
    • Booking changes are slightly higher for City Hotel.
    • Portugal is the country with the highest number of bookings.

    Solutions:

    1. Reduce Cancellations:
    • Implement stricter cancellation policies, especially for City Hotels.
    • Offer incentives for non-cancelable bookings.
    1. Optimize Pricing:
    • Adjust pricing strategies for Resort Hotels to increase bookings.
    • Consider dynamic pricing based on booking lead times and peak seasons.
    1. Target Marketing Efforts:
    • Focus marketing efforts on the Transient customer segment.
    • Strengthen partnerships with Online Travel Agencies (OTAs) to leverage their reach.
    1. Improve Customer Experience:
    • Enhance the meal offerings, focusing on the most popular BB meal type.
    • Ensure a smooth booking change process to improve customer satisfaction.
    1. Expand International Reach:
    • Target marketing campaigns to countries with fewer bookings but high potential.
    • Customize offerings based on preferences of top countries like Portugal.

    These insights and solutions should help the hotel improve its booking performance, reduce cancellations, and optimize its marketing and pricing strategies.

    Share this post on social!

    Comment on Post

    Your email address will not be published. Required fields are marked *