import pandas as pd
import matplotlib.pyplot as plt
import glob
from google.colab import drive
drive.mount('/content/drive')
FIG_SIZE = (8,6)
l = [pd.read_csv(filename) for filename in glob.glob("/content/drive/MyDrive/data/datasets/*.csv")]
print(len(l))
df = pd.concat(l, axis=0)
df.head()
Plot daily sales for all 50 weeks
df["sale_time"].dtype
sale_day = pd.to_datetime(df["sale_time"])
sale_day
daily_sales_df = sale_day.groupby(sale_day.dt.floor('d')).count().reset_index(name='sales_amount')
daily_sales_df.rename(columns={'sale_time':'sale_day'}, inplace=True)
daily_sales_df.head()
daily_sales_df = daily_sales_df.set_index('sale_day')
daily_sales_df.plot(figsize=FIG_SIZE, title="Daily sales over 50 weeks")
plt.show()
It looks like there has been a sudden change in daily sales. What date did it occur?
daily_changes = daily_sales_df['sales_amount'].diff()
daily_changes.head()
change_date = daily_changes.abs().idxmax()
print("Sudden change on:", change_date)
daily_sales_df['prev_day_sale_amount'] = daily_sales_df['sales_amount'].shift(1)
daily_sales_df['difference_btw_prev_day'] = daily_sales_df['sales_amount'] - daily_sales_df['prev_day_sale_amount']
daily_sales_df.head()
max_change_date = daily_sales_df['difference_btw_prev_day'].idxmax()
daily_sales_df.loc[[max_change_date]]
The single day when sales volume jumpedthe most compared to the previous day. There is a surge in demand because of may be a promotion, pay day, product launch or seasonal effect. it is useful for operational planning and helps you identify the peak load days or when you lost sales opportunity
pct_changes = daily_sales_df['sales_amount'].pct_change()
change_date = pct_changes.abs().idxmax()
print("Biggest relative jump on:", change_date)
The day with the largest relative swing (like a doubling or halving), even if the absolute sales weren’t the highest. Business insight:
A spike from a low baseline (e.g., weekend sales were 50 → 100 = +100% even if just 50 units).
Highlights volatility, which absolute change may miss.
Useful for marketing and customer behavior analysis: you can link these days to specific campaigns, events, or pricing changes that triggered a big reaction.
rolling_avg = daily_sales_df['sales_amount'].rolling(7).mean()
deviation = ((daily_sales_df['sales_amount'] - rolling_avg)/rolling_avg)
max_dev_date = deviation.abs().idxmax()
max_dev_value = deviation.loc[max_dev_date]
print("Biggest deviation on:", max_dev_date)
print("Deviation value:", max_dev_value)
The day sales deviated the most from their typical level (7-day average, by default).
Business insight:
Shows outliers in context of the usual sales trend.
Positive deviation → temporary boost beyond expectations.
Negative deviation → temporary dip below expectations.
Useful for forecasting and anomaly detection: tells you when the business experienced a day that doesn’t fit the usual seasonality, which may require explanation (holiday, outage, competitor action).
Is the change in daily sales at the date you selected statistically significant? if so, what is the p-value?
from scipy import stats
change_date = change_date
before = daily_sales_df[daily_sales_df.index < change_date]['sales_amount']
after = daily_sales_df[daily_sales_df.index >= change_date]['sales_amount']
#variance check using levene's test
lev_stat, lev_p = stats.levene(before, after, center="mean")
equal_var = lev_p > 0.05
test_type = "Student's t-test (equal var)" if equal_var else "Welch's t-test(unequal variance)"
# two sample t-test
t_stat, p_val = stats.ttest_ind(before, after, equal_var=equal_var)
print(f"{test_type} p-value: {p_val}")
print(f"t-statistic: {t_stat}")
t-statistic = –45.79
Very large in magnitude (way beyond ±2, which is the usual cutoff for significance).
Negative sign means the after group mean < before group mean (i.e., sales fell after the event date).
p-value = 2.27 × 10⁻¹⁴⁹
That’s essentially 0 (way below 0.05, 0.01, or even 0.001).
Means the probability of seeing such a difference if the true means were equal is astronomically small.
This extremely small p-value provides strong statistical evidence that the difference in daily sales before and after the change date is not due to random chance. Therefore, we can confidently conclude that this particular date marks a statistically significant shift in sales behavior, and it is indeed important for further analysis.
Does the data suggest that the change in daily sales is due to a shift in the proportion of male-vs-female customers? Please use plots to support your answer (a rigorous statistical analysis is not necessary).
df['sale_day'] = pd.to_datetime(df['sale_time']).dt.floor('d')
gender_portions_df = (
df.groupby(['sale_day', 'purchaser_gender'])
.count()
.reset_index()
.rename(columns={'sale_time': 'sales_amount'}))
gender_portions_df
gender_df = (
gender_portions_df
.pivot(index="sale_day", columns="purchaser_gender", values="sales_amount")
.fillna(0)
.rename(columns={"female": "female_sales_amount",
"male": "male_sales_amount"})
.reset_index()
.sort_values("sale_day")
)
gender_df
gender_df.columns.name = None
gender_df
gender_df['gender_ratio'] = gender_df['male_sales_amount'] / gender_df['female_sales_amount']
plot_df = gender_df.copy()
plot_df['total_sales_amount'] = plot_df['male_sales_amount'] + plot_df['female_sales_amount']
print('printing the data frame after adding male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.drop(columns=['female_sales_amount', 'male_sales_amount'], inplace=True)
print('printing dataframe after dropping male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.set_index('sale_day', inplace=True)
plot_df.plot(secondary_y = ['gender_ratio'],figsize=FIG_SIZE, title="Gender ration and sales amount over all 50 weeks")
The increasing trend of the male ratio in the sales amount does not explain the sudden change. The reason should be something different. For example, it might be caused by starting a new marketing campaign or adding a new feature to the website.
Assume a given day is divided into four dayparts: night (12:00AM - 6:00AM), morning (6:00AM to 12:00PM), afternoon (12:00PM to 6:00PM) and evening (6:00PM - 12:00AM). What is the percentage of sales in each daypart over all 50 weeks?
df['sale_hour'] = pd.to_datetime(df['sale_time']).dt.ceil('h').dt.hour
df.head()
bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']
df['daypart'] = pd.cut(df['sale_hour'], bins=bins, labels=labels, right=False )
df.head()
df.drop(columns = ['purchaser_gender', 'sale_hour', 'sale_time'], inplace=True)
df_pct_sales = df.groupby(df['daypart']).count().reset_index()
df_pct_sales.rename(columns={'sale_day': 'day_part_sales_number'}, inplace=True)
df_pct_sales
df_pct_sales['day_part_sales_percentage'] = df_pct_sales['day_part_sales_number'] / df_pct_sales['day_part_sales_number'].sum()
df_pct_sales
Afternoon dominates (40%)
The afternoon is clearly the prime sales window, accounting for nearly 4 in 10 sales.
This suggests peak customer activity and should be the focus period for campaigns, staffing, and inventory readiness.
Morning is strong (27%)
A significant share of sales happens in the morning.
Customers are active early in the day — this could be tied to routines (work commute, morning shopping, breakfast/lunch runs, online purchases before work).
Evening is steady (26%)
Evening is almost as important as morning, showing customers engage after work hours.
Potentially a good slot for targeted promotions (e.g., flash sales, reminders, end-of-day discounts).
Night is minimal (7%)
Very low activity, which is expected.
Night orders might come from niche segments: night-shift workers, global customers in other time zones, or automated/recurring orders.
Not a focus area for marketing spend, but operational support (like customer service, system uptime) must still cover it.
Business Takeaways: bold text
Allocate resources (marketing budget, staffing, delivery logistics) proportional to sales share — heavier in afternoon, steady coverage in morning and evening, minimal effort overnight.
Campaign timing matters:
Launch promotions mid-morning to capture morning + afternoon peaks.
Send reminders in the evening to catch the after-work crowd.
Night orders can be monitored separately — if that 7% grows over time, it may indicate international expansion or new customer behavior patterns.