Project Foundations for Data Science: FoodHub Data Analysis¶

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [ ]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Understanding the structure of the data¶

In [ ]:
# read the data
df = pd.read_csv('/content/drive/MyDrive/MIT - Data Analytics/Project/foodhub_order.csv')
# returns the first 5 rows
df.head()
Out[ ]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

Observations:¶

The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.

Question 1: How many rows and columns are present in the data? [0.5 mark]¶

In [ ]:
df.shape # Write your code here
Out[ ]:
(1898, 9)

Observations:¶

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶

In [ ]:
df.info() # Use info() to print a concise summary of the DataFrame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶

In [ ]:
df.isnull().sum() # Write your code here
Out[ ]:
0
order_id 0
customer_id 0
restaurant_name 0
cuisine_type 0
cost_of_the_order 0
day_of_the_week 0
rating 0
food_preparation_time 0
delivery_time 0

.isnull returns the data frame with true if the values are missing and .sum counts how many true (missing) values are there in each coloumn. Therefore, based on this there are no missing values in any of the coloumns.

Observations:¶

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

In [ ]:
df.describe(include="all")# Write your code here
Out[ ]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
count 1.898000e+03 1898.000000 1898 1898 1898.000000 1898 1898 1898.000000 1898.000000
unique NaN NaN 178 14 NaN 2 4 NaN NaN
top NaN NaN Shake Shack American NaN Weekend Not given NaN NaN
freq NaN NaN 219 584 NaN 1351 736 NaN NaN
mean 1.477496e+06 171168.478398 NaN NaN 16.498851 NaN NaN 27.371970 24.161749
std 5.480497e+02 113698.139743 NaN NaN 7.483812 NaN NaN 4.632481 4.972637
min 1.476547e+06 1311.000000 NaN NaN 4.470000 NaN NaN 20.000000 15.000000
25% 1.477021e+06 77787.750000 NaN NaN 12.080000 NaN NaN 23.000000 20.000000
50% 1.477496e+06 128600.000000 NaN NaN 14.140000 NaN NaN 27.000000 25.000000
75% 1.477970e+06 270525.000000 NaN NaN 22.297500 NaN NaN 31.000000 28.000000
max 1.478444e+06 405334.000000 NaN NaN 35.410000 NaN NaN 35.000000 33.000000

This gives us the bsic summary statistics for the data frame. From this we can tell the minimum time for food to be prepared is 20.0, the average (mean) time for food to be prepared is 27.371970 and the maximum time for food to be prepared is 35.0

Observations:¶

Question 5: How many orders are not rated? [1 mark]¶

In [ ]:
df ['rating'].value_counts()# Write the code here
Out[ ]:
count
rating
Not given 736
5 588
4 386
3 188

The number of orders that are not rated which means Not given is 736

Observations:¶

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

In [ ]:
print(f"The number of unique order is: {df['order_id'].nunique()}")
print(f"The number of unique customer ID is: {df['customer_id'].nunique()}")
print(f"The number of unique restaurants: {df['restaurant_name'].nunique()}")
print(f"The number of unique cuisine types: {df['cuisine_type'].nunique()}")
print(f"The number of unique day of the week: {df['day_of_the_week'].nunique()}")
print(f"The number of unique rating: {df['rating'].nunique()}")
The number of unique order is: 1898
The number of unique customer ID is: 1200
The number of unique restaurants: 178
The number of unique cuisine types: 14
The number of unique day of the week: 2
The number of unique rating: 4
In [ ]:
sns.histplot(data=df,x="cost_of_the_order",kde=True)
plt.show()
sns.boxplot(data=df,x="cost_of_the_order")
plt.show()
No description has been provided for this image
No description has been provided for this image

From the histogram we can see the majority cost of the order is mostly between 12 to 13 and the mean of the entire dataset for cost of the order is also around 14 as shown by the boxplot. The data is also skewed to the left which indicates the cost size is generally on the lower side. There are a few anomolies with the maximum order going up to 35.

In [ ]:
plt.figure(figsize=(20,10))
sns.countplot(data = df, x = 'cuisine_type')
Out[ ]:
<Axes: xlabel='cuisine_type', ylabel='count'>
No description has been provided for this image

The count plot helps us understand which cusine type is preferred by the customers. Most customers seem to order American food. The other popular options are Japanese, Italian and Chinese food. The other cusines have number of order below 100.

In [ ]:
sns.countplot(data = df, x = 'day_of_the_week')
Out[ ]:
<Axes: xlabel='day_of_the_week', ylabel='count'>
No description has been provided for this image

From this we can infer that weekend is the time when majority of orders around 1300 are placed.

In [ ]:
sns.countplot(data = df, x = 'rating')
Out[ ]:
<Axes: xlabel='rating', ylabel='count'>
No description has been provided for this image

This rating count chart helps us understand how customers rate their order. Majority of customers seem to not rate at all and out of the ones who do decide to rate they do mostly give 5 or 4 stars. However, it is important to notice because majority of customers have not places a rating making a decision soley based on this is not ideal.

In [ ]:
sns.histplot(data=df,x="delivery_time",kde=True)
plt.show()
sns.boxplot(data=df,x="delivery_time")
plt.show()
No description has been provided for this image
No description has been provided for this image

From the chart above we can infer that the data is right skewed which means majority of delivery time seems to be on the higher side. This is important information for the company as they can run eperiments and simulations to try to decrease the delivery time. Furthermore, majority of delivery seems to take around 28 to 29 minutes and the average delivery time is 25 minutes. The box plot can help us spot delays as we can see there are orders that are outliers as they have taken around 32 minutes to complete.

In [ ]:
sns.histplot(data=df,x="food_preparation_time",kde=True)
plt.show()
sns.boxplot(data=df,x="food_preparation_time")
plt.show()
No description has been provided for this image
No description has been provided for this image

This can help us see if the orders are prepared quickly, do restuarants take a long time to prepare them or not. We can see that there is no clear skew on if the orders are quick or late as majority seem to be prepared in the average time of 27 minutues. There is no clear inference that we can make from this.

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶

In [ ]:
df['restaurant_name'].value_counts().head(5)# Write the code here
Out[ ]:
count
restaurant_name
Shake Shack 219
The Meatball Shop 132
Blue Ribbon Sushi 119
Blue Ribbon Fried Chicken 96
Parm 68

Observations:¶

Question 8: Which is the most popular cuisine on weekends? [1 mark]¶

In [ ]:
df_weekend = df[df['day_of_the_week'] == 'Weekend']
print (f" The most popular cusine on weekend is: {df_weekend['cuisine_type'].value_counts().idxmax()}")
# Write the code here
 The most popular cusine on weekend is: American

Observations:¶

Question 9: What percentage of the orders cost more than 20 dollars? ¶

In [ ]:
df_greater_than_20 = df[df['cost_of_the_order']>20]
percentage = (df_greater_than_20.shape[0] / df.shape[0]) * 100
print("Percentage of orders above 20 dollars:", round(percentage, 2),"%")
Percentage of orders above 20 dollars: 29.24 %

Observations:¶

Question 10: What is the mean order delivery time? [1 mark]¶

In [ ]:
mean_delivery_time = df['delivery_time'].mean() # Write the code here
print('Mean order delivery time:', round(mean_delivery_time, 2), 'minutes')
Mean order delivery time: 24.16 minutes

Observations:¶

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶

In [ ]:
df['customer_id'].value_counts().head(3)# Write the code here
Out[ ]:
count
customer_id
52832 13
47440 10
83287 9

Observations:¶

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

In [ ]:
df['cost_bin'] = pd.cut(df['cost_of_the_order'], bins=range(0, 41, 5))
pivot = df.pivot_table(index='cost_bin', columns='day_of_the_week', values='delivery_time', aggfunc='mean')

# Heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(pivot, annot=True, cmap='YlGnBu')
plt.title("Average Delivery Time by Cost and Day")
plt.ylabel("Cost of the order")
plt.xlabel("Day of the Week")
plt.show()
<ipython-input-146-50a6fe26980d>:2: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  pivot = df.pivot_table(index='cost_bin', columns='day_of_the_week', values='delivery_time', aggfunc='mean')
No description has been provided for this image

This is a super helpful heatmap which tells us that regardless of what the cost is the delivery time is less on weekends and more on weekdays

In [ ]:
plt.figure(figsize=(20,10))
sns.boxplot(x = "cuisine_type", y = "cost_of_the_order", data = df, palette = 'YlGn')
plt.xticks(rotation = 60)
plt.show()
<ipython-input-147-8b83706e854f>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = "cuisine_type", y = "cost_of_the_order", data = df, palette = 'YlGn')
No description has been provided for this image

This shows how much people typically spend when ordering specific cusine. We can infer that people tend to spend more when ordering French and Thai cusine. For cusines like Korean people typically tend to spend below 15 and we can consider this cusine as affordable.

In [ ]:
plt.figure(figsize=(20,10))
sns.boxplot(x = "cuisine_type", y = "delivery_time", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
<ipython-input-148-b30c527a9826>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = "cuisine_type", y = "delivery_time", data = df, palette = 'PuBu')
No description has been provided for this image

This is very helpful as it lets us compare the cusine type to delivery time so we can see which cusines take a lot longer to deliver compared to others. The French cusine on average tends to have a higher deliery time compared to others. The quickest cusine on average that is gets delivered the fastest is Korean. One reason for this could be based on geographical location of the restuarants compared to the customer locations.

In [ ]:
plt.figure(figsize=(20,10))
sns.boxplot(x = "cuisine_type", y = "food_preparation_time", data = df, palette = 'coolwarm')
plt.xticks(rotation = 60)
plt.show()
<ipython-input-149-96bb1e48ecb1>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = "cuisine_type", y = "food_preparation_time", data = df, palette = 'coolwarm')
No description has been provided for this image

From this boxplot we can see on average the three cusines that take the longest to prepare are Italian, Thai and Spanish cusines. The two cusines that are quicker to prepare are Korean and Vietnamese.

In [ ]:
df['cost_bucket'] = pd.cut(df['cost_of_the_order'], bins=[0, 15, 30, 45], labels=['Low', 'Medium', 'High'])
sns.violinplot(data=df, x='day_of_the_week', y='delivery_time', hue='cost_bucket', split=True)
plt.title("Delivery Time Spread by Day and Cost Bucket")
plt.show()
No description has been provided for this image

This will help us understand how delivery time is affected on weekend and weekday. We can see that delivery time tends to be a lot higher on weekdays. On weekdays even high cost orders tend to face a lot of delays which is not the case on weekend. Also, low cost order tends to take longer on weekday as well compared to on weekend.

In [ ]:
plt.figure(figsize=(15, 7))
sns.pointplot(x='rating',y='cost_of_the_order',data=df)
plt.show()
No description has been provided for this image

The rating 5 star tends to be given when the cost of the order is higher, and the majority of customers that do not give a rating tend to spend less on the order size.

In [ ]:
df.groupby(['restaurant_name'])['cost_of_the_order'].sum().sort_values(ascending = False).head(14)
Out[ ]:
cost_of_the_order
restaurant_name
Shake Shack 3579.53
The Meatball Shop 2145.21
Blue Ribbon Sushi 1903.95
Blue Ribbon Fried Chicken 1662.29
Parm 1112.76
RedFarm Broadway 965.13
RedFarm Hudson 921.21
TAO 834.50
Han Dynasty 755.29
Blue Ribbon Sushi Bar & Grill 666.62
Rubirosa 660.45
Sushi of Gari 46 640.87
Nobu Next Door 623.67
Five Guys Burgers and Fries 506.47

The highest revenue generating compaies that cross 1000 are Shake Shack, The Meatball Shop, Blue Ribbon Sushi, Blue Ribbon Fried Chicken and Parm

In [ ]:
col_list = ['cost_of_the_order', 'food_preparation_time', 'delivery_time']
plt.figure(figsize=(20, 10))
sns.heatmap(df[col_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="coolwarm")
plt.show()
No description has been provided for this image

Food preperation time, cost of order and delivery time does not really seem to be highly correlated.

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶

In [ ]:
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
Out[ ]:
restaurant_name rating
0 Shake Shack 133
1 The Meatball Shop 84
2 Blue Ribbon Sushi 73
3 Blue Ribbon Fried Chicken 64
4 RedFarm Broadway 41

The restuarants that seem to qualify for the promotional offer are Shake Shack, The Meatball Shop, Blue Ribbon Sushi, Blue Ribbon Fried Chicken and RedFarm Broadway

Observations:¶

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

In [ ]:
def compute_rev(x):
    if x > 20:
        return x*0.25
    elif x > 5:
        return x*0.15
    else:
        return x*0

df['Revenue'] = df['cost_of_the_order'].apply(compute_rev)
df.head()
Out[ ]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time cost_bin cost_bucket Revenue
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20 (30, 35] High 7.6875
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23 (10, 15] Low 1.8120
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28 (10, 15] Low 1.8345
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15 (25, 30] Medium 7.3000
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24 (10, 15] Low 1.7385

Observations:¶

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

In [ ]:
df['total_time'] = df['food_preparation_time'] + df['delivery_time']
df_greater_than_60 = df[df['total_time']>60]
print('The Total orders taking more than 60 minutes are:', df_greater_than_60.shape[0])

percentage = (df_greater_than_60.shape[0] / df.shape[0]) * 100
print("Percentage of orders takinge more than 60 minutes are:", round(percentage, 2), '%')
The Total orders taking more than 60 minutes are: 200
Percentage of orders takinge more than 60 minutes are: 10.54 %

Observations:¶

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

In [ ]:
print('Mean delivery time on weekdays is:',
      round(df[df['day_of_the_week'] == 'Weekday']['delivery_time'].mean()),
     'minutes')

print('Mean delivery time on weekend is:',
      round(df[df['day_of_the_week'] == 'Weekend']['delivery_time'].mean()),
     'minutes')
Mean delivery time on weekdays is: 28 minutes
Mean delivery time on weekend is: 22 minutes

Observations:¶

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶

Conclusions:¶

  • The Cusine type that seems to be popular among people are American, Japanese and Italian.
  • Customer who place higher value orders seem to be more inclined to leave 5 star review.
  • People on weekends tend to order more food compared to the weekedays.
  • There waslLow correlation between order cost, prep time, and delivery time.
  • Most food preparation times clustered around 20–30 minutes, while delivery times mostly ranged between 15–25 minutes.

Recommendations:¶

  • It is important to focus on the High demand cusines like American, Japanese and Italian as these cusines seem to be popular among customers and will likely increase customer engagement further. Therefore, the company should shift its marketing efforts towards these cusines.
  • Need to optimize delivery operations during the weekday as the delivery time is very high during the weekday which could end up resulting in low customer satisfaction in the future if this continues.
  • Incentivize customers (e.g., via small discounts or loyalty points) to submit ratings, as this will help improve the data quality and increase the quality of future business insights.
  • As we now have the average delivery time over the weekday and weekend it would be beneficial to implement a better predictive delivery tracking system which predicts the timings better and more accurately.
  • Since demand is low on weekdays, we can introduce weekday-exclusive deals or bundles which will help increase the revenue during the week as well.