import pandas as pd
import matplotlib.pyplot as plt
import glob

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

FIG_SIZE = (8,6)

l = [pd.read_csv(filename) for filename in glob.glob("/content/drive/MyDrive/data/datasets/*.csv")]
print(len(l))
df = pd.concat(l, axis=0)
df.head()

50

Question 1¶

Plot daily sales for all 50 weeks

df["sale_time"].dtype

dtype('O')

sale_day = pd.to_datetime(df["sale_time"])

sale_day

daily_sales_df = sale_day.groupby(sale_day.dt.floor('d')).count().reset_index(name='sales_amount')
daily_sales_df.rename(columns={'sale_time':'sale_day'}, inplace=True)
daily_sales_df.head()

daily_sales_df = daily_sales_df.set_index('sale_day')

daily_sales_df.plot(figsize=FIG_SIZE, title="Daily sales over 50 weeks")
plt.show()

Question 2¶

It looks like there has been a sudden change in daily sales. What date did it occur?

Look at daily differences using diff()

daily_changes = daily_sales_df['sales_amount'].diff()

daily_changes.head()

change_date = daily_changes.abs().idxmax()
print("Sudden change on:", change_date)

Sudden change on: 2013-04-29 00:00:00

daily_sales_df['prev_day_sale_amount'] = daily_sales_df['sales_amount'].shift(1)

daily_sales_df['difference_btw_prev_day'] = daily_sales_df['sales_amount'] - daily_sales_df['prev_day_sale_amount']
daily_sales_df.head()

max_change_date = daily_sales_df['difference_btw_prev_day'].idxmax()

daily_sales_df.loc[[max_change_date]]

The single day when sales volume jumpedthe most compared to the previous day. There is a surge in demand because of may be a promotion, pay day, product launch or seasonal effect. it is useful for operational planning and helps you identify the peak load days or when you lost sales opportunity

Look at the percentage change (Biggest Percentage change relative to base level)

pct_changes = daily_sales_df['sales_amount'].pct_change()
change_date = pct_changes.abs().idxmax()
print("Biggest relative jump on:", change_date)

Biggest relative jump on: 2013-04-29 00:00:00

The day with the largest relative swing (like a doubling or halving), even if the absolute sales weren’t the highest. Business insight:

A spike from a low baseline (e.g., weekend sales were 50 → 100 = +100% even if just 50 units).

Highlights volatility, which absolute change may miss.

Useful for marketing and customer behavior analysis: you can link these days to specific campaigns, events, or pricing changes that triggered a big reaction.

Look at the biggest Rolling Deviation vs. smoothed trend

rolling_avg = daily_sales_df['sales_amount'].rolling(7).mean()
deviation = ((daily_sales_df['sales_amount'] - rolling_avg)/rolling_avg)
max_dev_date = deviation.abs().idxmax()

max_dev_value = deviation.loc[max_dev_date]
print("Biggest deviation on:", max_dev_date)
print("Deviation value:", max_dev_value)

Biggest deviation on: 2013-04-29 00:00:00
Deviation value: 0.41703539823008856

The day sales deviated the most from their typical level (7-day average, by default).

Business insight:

Shows outliers in context of the usual sales trend.

Positive deviation → temporary boost beyond expectations.

Negative deviation → temporary dip below expectations.

Useful for forecasting and anomaly detection: tells you when the business experienced a day that doesn’t fit the usual seasonality, which may require explanation (holiday, outage, competitor action).

Question 3:¶

Is the change in daily sales at the date you selected statistically significant? if so, what is the p-value?

from scipy import stats
change_date = change_date
before = daily_sales_df[daily_sales_df.index < change_date]['sales_amount']
after = daily_sales_df[daily_sales_df.index >= change_date]['sales_amount']

#variance check using levene's test
lev_stat, lev_p = stats.levene(before, after, center="mean")
equal_var = lev_p > 0.05
test_type = "Student's t-test (equal var)" if equal_var else "Welch's t-test(unequal variance)"

# two sample t-test
t_stat, p_val = stats.ttest_ind(before, after, equal_var=equal_var)

print(f"{test_type} p-value: {p_val}")
print(f"t-statistic: {t_stat}")

Student's t-test (equal var) p-value: 2.27287631770394e-149
t-statistic: -45.7865380479371

t-statistic = –45.79

Very large in magnitude (way beyond ±2, which is the usual cutoff for significance).

Negative sign means the after group mean < before group mean (i.e., sales fell after the event date).

p-value = 2.27 × 10⁻¹⁴⁹

That’s essentially 0 (way below 0.05, 0.01, or even 0.001).

Means the probability of seeing such a difference if the true means were equal is astronomically small.

This extremely small p-value provides strong statistical evidence that the difference in daily sales before and after the change date is not due to random chance. Therefore, we can confidently conclude that this particular date marks a statistically significant shift in sales behavior, and it is indeed important for further analysis.

Question 4¶

Does the data suggest that the change in daily sales is due to a shift in the proportion of male-vs-female customers? Please use plots to support your answer (a rigorous statistical analysis is not necessary).

df['sale_day'] = pd.to_datetime(df['sale_time']).dt.floor('d')
gender_portions_df = (
    df.groupby(['sale_day', 'purchaser_gender'])
    .count()
    .reset_index()
    .rename(columns={'sale_time': 'sales_amount'}))

gender_portions_df

gender_df = (
    gender_portions_df
      .pivot(index="sale_day", columns="purchaser_gender", values="sales_amount")
      .fillna(0)
      .rename(columns={"female": "female_sales_amount",
                       "male": "male_sales_amount"})
      .reset_index()
      .sort_values("sale_day")
)
gender_df

gender_df.columns.name = None

gender_df

gender_df['gender_ratio'] = gender_df['male_sales_amount'] / gender_df['female_sales_amount']
plot_df = gender_df.copy()
plot_df['total_sales_amount'] = plot_df['male_sales_amount'] + plot_df['female_sales_amount']
print('printing the data frame after adding male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.drop(columns=['female_sales_amount', 'male_sales_amount'], inplace=True)
print('printing dataframe after dropping male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.set_index('sale_day', inplace=True)
plot_df.plot(secondary_y = ['gender_ratio'],figsize=FIG_SIZE, title="Gender ration and sales amount over all 50 weeks")

printing the data frame after adding male and female sales_amount
    sale_day  female_sales_amount  male_sales_amount  gender_ratio  \
0 2012-10-01                  413                101      0.244552   
1 2012-10-02                  379                103      0.271768   
2 2012-10-03                  386                113      0.292746   
3 2012-10-04                  432                114      0.263889   
4 2012-10-05                  368                108      0.293478   

   total_sales_amount  
0                 514  
1                 482  
2                 499  
3                 546  
4                 476  
--------------------------------------------------
printing dataframe after dropping male and female sales_amount
    sale_day  gender_ratio  total_sales_amount
0 2012-10-01      0.244552                 514
1 2012-10-02      0.271768                 482
2 2012-10-03      0.292746                 499
3 2012-10-04      0.263889                 546
4 2012-10-05      0.293478                 476
--------------------------------------------------

<Axes: title={'center': 'Gender ration and sales amount over all 50 weeks'}, xlabel='sale_day'>

The increasing trend of the male ratio in the sales amount does not explain the sudden change. The reason should be something different. For example, it might be caused by starting a new marketing campaign or adding a new feature to the website.

Question 5¶

Assume a given day is divided into four dayparts: night (12:00AM - 6:00AM), morning (6:00AM to 12:00PM), afternoon (12:00PM to 6:00PM) and evening (6:00PM - 12:00AM). What is the percentage of sales in each daypart over all 50 weeks?

df['sale_hour'] = pd.to_datetime(df['sale_time']).dt.ceil('h').dt.hour

df.head()

bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']
df['daypart'] = pd.cut(df['sale_hour'], bins=bins, labels=labels, right=False )
df.head()

df.drop(columns = ['purchaser_gender', 'sale_hour', 'sale_time'], inplace=True)

df_pct_sales = df.groupby(df['daypart']).count().reset_index()
df_pct_sales.rename(columns={'sale_day': 'day_part_sales_number'}, inplace=True)
df_pct_sales

/tmp/ipython-input-4167007817.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df_pct_sales = df.groupby(df['daypart']).count().reset_index()

df_pct_sales['day_part_sales_percentage'] = df_pct_sales['day_part_sales_number'] / df_pct_sales['day_part_sales_number'].sum()

df_pct_sales

Afternoon dominates (40%)

The afternoon is clearly the prime sales window, accounting for nearly 4 in 10 sales.

This suggests peak customer activity and should be the focus period for campaigns, staffing, and inventory readiness.

Morning is strong (27%)

A significant share of sales happens in the morning.

Customers are active early in the day — this could be tied to routines (work commute, morning shopping, breakfast/lunch runs, online purchases before work).

Evening is steady (26%)

Evening is almost as important as morning, showing customers engage after work hours.

Potentially a good slot for targeted promotions (e.g., flash sales, reminders, end-of-day discounts).

Night is minimal (7%)

Very low activity, which is expected.

Night orders might come from niche segments: night-shift workers, global customers in other time zones, or automated/recurring orders.

Not a focus area for marketing spend, but operational support (like customer service, system uptime) must still cover it.

Business Takeaways: bold text

Allocate resources (marketing budget, staffing, delivery logistics) proportional to sales share — heavier in afternoon, steady coverage in morning and evening, minimal effort overnight.

Campaign timing matters:

Launch promotions mid-morning to capture morning + afternoon peaks.

Send reminders in the evening to catch the after-work crowd.

Night orders can be monitored separately — if that 7% grows over time, it may indicate international expansion or new customer behavior patterns.

	sale_time
0	2013-01-07 02:14:07
1	2013-01-07 02:57:53
2	2013-01-07 02:59:49
3	2013-01-07 03:02:53
4	2013-01-07 03:27:36
...	...
3528	2013-03-03 22:47:58
3529	2013-03-03 22:48:51
3530	2013-03-03 22:51:19
3531	2013-03-03 22:56:58
3532	2013-03-03 23:53:58

purchaser_gender	sale_day	female_sales_amount	male_sales_amount
0	2012-10-01	413	101
1	2012-10-02	379	103
2	2012-10-03	386	113
3	2012-10-04	432	114
4	2012-10-05	368	108
...	...	...	...
345	2013-09-11	224	495
346	2013-09-12	225	510
347	2013-09-13	206	537
348	2013-09-14	193	462
349	2013-09-15	230	522

	sale_day	female_sales_amount	male_sales_amount
0	2012-10-01	413	101
1	2012-10-02	379	103
2	2012-10-03	386	113
3	2012-10-04	432	114
4	2012-10-05	368	108
...	...	...	...
345	2013-09-11	224	495
346	2013-09-12	225	510
347	2013-09-13	206	537
348	2013-09-14	193	462
349	2013-09-15	230	522

	sale_time	purchaser_gender	sale_day	sale_hour
0	2013-01-07 02:14:07	female	2013-01-07	3
1	2013-01-07 02:57:53	male	2013-01-07	3
2	2013-01-07 02:59:49	female	2013-01-07	3
3	2013-01-07 03:02:53	male	2013-01-07	4
4	2013-01-07 03:27:36	female	2013-01-07	4

	daypart	day_part_sales_number	day_part_sales_percentage
0	Night	13567	0.066398
1	Morning	56080	0.274459
2	Afternoon	81159	0.397198
3	Evening	53523	0.261945

	sales_amount	prev_day_sale_amount	difference_btw_prev_day
sale_day
2012-10-01	514	NaN	NaN
2012-10-02	482	514.0	-32.0
2012-10-03	499	482.0	17.0
2012-10-04	546	499.0	47.0
2012-10-05	476	546.0	-70.0

	sales_amount	prev_day_sale_amount	difference_btw_prev_day
sale_day
2013-04-29	732	458.0	274.0