In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import glob
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [3]:
FIG_SIZE = (8,6)
In [4]:
l = [pd.read_csv(filename) for filename in glob.glob("/content/drive/MyDrive/data/datasets/*.csv")]
print(len(l))
df = pd.concat(l, axis=0)
df.head()
50
Out[4]:
sale_time purchaser_gender
0 2013-01-07 02:14:07 female
1 2013-01-07 02:57:53 male
2 2013-01-07 02:59:49 female
3 2013-01-07 03:02:53 male
4 2013-01-07 03:27:36 female

Question 1

Plot daily sales for all 50 weeks

In [5]:
df["sale_time"].dtype
Out[5]:
dtype('O')
In [6]:
sale_day = pd.to_datetime(df["sale_time"])
In [7]:
sale_day
Out[7]:
sale_time
0 2013-01-07 02:14:07
1 2013-01-07 02:57:53
2 2013-01-07 02:59:49
3 2013-01-07 03:02:53
4 2013-01-07 03:27:36
... ...
3528 2013-03-03 22:47:58
3529 2013-03-03 22:48:51
3530 2013-03-03 22:51:19
3531 2013-03-03 22:56:58
3532 2013-03-03 23:53:58

204329 rows × 1 columns


In [8]:
daily_sales_df = sale_day.groupby(sale_day.dt.floor('d')).count().reset_index(name='sales_amount')
daily_sales_df.rename(columns={'sale_time':'sale_day'}, inplace=True)
daily_sales_df.head()
Out[8]:
sale_day sales_amount
0 2012-10-01 514
1 2012-10-02 482
2 2012-10-03 499
3 2012-10-04 546
4 2012-10-05 476
In [9]:
daily_sales_df = daily_sales_df.set_index('sale_day')
In [10]:
daily_sales_df.plot(figsize=FIG_SIZE, title="Daily sales over 50 weeks")
plt.show()

Question 2

It looks like there has been a sudden change in daily sales. What date did it occur?

  1. Look at daily differences using diff()
In [11]:
daily_changes = daily_sales_df['sales_amount'].diff()
In [12]:
daily_changes.head()
Out[12]:
sales_amount
sale_day
2012-10-01 NaN
2012-10-02 -32.0
2012-10-03 17.0
2012-10-04 47.0
2012-10-05 -70.0

In [13]:
change_date = daily_changes.abs().idxmax()
print("Sudden change on:", change_date)
Sudden change on: 2013-04-29 00:00:00
In [14]:
daily_sales_df['prev_day_sale_amount'] = daily_sales_df['sales_amount'].shift(1)
In [15]:
daily_sales_df['difference_btw_prev_day'] = daily_sales_df['sales_amount'] - daily_sales_df['prev_day_sale_amount']
daily_sales_df.head()
Out[15]:
sales_amount prev_day_sale_amount difference_btw_prev_day
sale_day
2012-10-01 514 NaN NaN
2012-10-02 482 514.0 -32.0
2012-10-03 499 482.0 17.0
2012-10-04 546 499.0 47.0
2012-10-05 476 546.0 -70.0
In [16]:
max_change_date = daily_sales_df['difference_btw_prev_day'].idxmax()
In [17]:
daily_sales_df.loc[[max_change_date]]
Out[17]:
sales_amount prev_day_sale_amount difference_btw_prev_day
sale_day
2013-04-29 732 458.0 274.0

The single day when sales volume jumpedthe most compared to the previous day. There is a surge in demand because of may be a promotion, pay day, product launch or seasonal effect. it is useful for operational planning and helps you identify the peak load days or when you lost sales opportunity

  1. Look at the percentage change (Biggest Percentage change relative to base level)
In [18]:
pct_changes = daily_sales_df['sales_amount'].pct_change()
change_date = pct_changes.abs().idxmax()
print("Biggest relative jump on:", change_date)
Biggest relative jump on: 2013-04-29 00:00:00

The day with the largest relative swing (like a doubling or halving), even if the absolute sales weren’t the highest. Business insight:

A spike from a low baseline (e.g., weekend sales were 50 → 100 = +100% even if just 50 units).

Highlights volatility, which absolute change may miss.

Useful for marketing and customer behavior analysis: you can link these days to specific campaigns, events, or pricing changes that triggered a big reaction.

  1. Look at the biggest Rolling Deviation vs. smoothed trend
In [19]:
rolling_avg = daily_sales_df['sales_amount'].rolling(7).mean()
deviation = ((daily_sales_df['sales_amount'] - rolling_avg)/rolling_avg)
max_dev_date = deviation.abs().idxmax()

max_dev_value = deviation.loc[max_dev_date]
print("Biggest deviation on:", max_dev_date)
print("Deviation value:", max_dev_value)
Biggest deviation on: 2013-04-29 00:00:00
Deviation value: 0.41703539823008856

The day sales deviated the most from their typical level (7-day average, by default).

Business insight:

Shows outliers in context of the usual sales trend.

Positive deviation → temporary boost beyond expectations.

Negative deviation → temporary dip below expectations.

Useful for forecasting and anomaly detection: tells you when the business experienced a day that doesn’t fit the usual seasonality, which may require explanation (holiday, outage, competitor action).


Question 3:

Is the change in daily sales at the date you selected statistically significant? if so, what is the p-value?

In [20]:
from scipy import stats
change_date = change_date
before = daily_sales_df[daily_sales_df.index < change_date]['sales_amount']
after = daily_sales_df[daily_sales_df.index >= change_date]['sales_amount']

#variance check using levene's test
lev_stat, lev_p = stats.levene(before, after, center="mean")
equal_var = lev_p > 0.05
test_type = "Student's t-test (equal var)" if equal_var else "Welch's t-test(unequal variance)"

# two sample t-test
t_stat, p_val = stats.ttest_ind(before, after, equal_var=equal_var)

print(f"{test_type} p-value: {p_val}")
print(f"t-statistic: {t_stat}")
Student's t-test (equal var) p-value: 2.27287631770394e-149
t-statistic: -45.7865380479371

t-statistic = –45.79

Very large in magnitude (way beyond ±2, which is the usual cutoff for significance).

Negative sign means the after group mean < before group mean (i.e., sales fell after the event date).

p-value = 2.27 × 10⁻¹⁴⁹

That’s essentially 0 (way below 0.05, 0.01, or even 0.001).

Means the probability of seeing such a difference if the true means were equal is astronomically small.

This extremely small p-value provides strong statistical evidence that the difference in daily sales before and after the change date is not due to random chance. Therefore, we can confidently conclude that this particular date marks a statistically significant shift in sales behavior, and it is indeed important for further analysis.


Question 4

Does the data suggest that the change in daily sales is due to a shift in the proportion of male-vs-female customers? Please use plots to support your answer (a rigorous statistical analysis is not necessary).

In [21]:
df['sale_day'] = pd.to_datetime(df['sale_time']).dt.floor('d')
gender_portions_df = (
    df.groupby(['sale_day', 'purchaser_gender'])
    .count()
    .reset_index()
    .rename(columns={'sale_time': 'sales_amount'}))
In [22]:
gender_portions_df
Out[22]:
sale_day purchaser_gender sales_amount
0 2012-10-01 female 413
1 2012-10-01 male 101
2 2012-10-02 female 379
3 2012-10-02 male 103
4 2012-10-03 female 386
... ... ... ...
695 2013-09-13 male 537
696 2013-09-14 female 193
697 2013-09-14 male 462
698 2013-09-15 female 230
699 2013-09-15 male 522

700 rows × 3 columns

In [23]:
gender_df = (
    gender_portions_df
      .pivot(index="sale_day", columns="purchaser_gender", values="sales_amount")
      .fillna(0)
      .rename(columns={"female": "female_sales_amount",
                       "male": "male_sales_amount"})
      .reset_index()
      .sort_values("sale_day")
)
gender_df
Out[23]:
purchaser_gender sale_day female_sales_amount male_sales_amount
0 2012-10-01 413 101
1 2012-10-02 379 103
2 2012-10-03 386 113
3 2012-10-04 432 114
4 2012-10-05 368 108
... ... ... ...
345 2013-09-11 224 495
346 2013-09-12 225 510
347 2013-09-13 206 537
348 2013-09-14 193 462
349 2013-09-15 230 522

350 rows × 3 columns

In [24]:
gender_df.columns.name = None
In [25]:
gender_df
Out[25]:
sale_day female_sales_amount male_sales_amount
0 2012-10-01 413 101
1 2012-10-02 379 103
2 2012-10-03 386 113
3 2012-10-04 432 114
4 2012-10-05 368 108
... ... ... ...
345 2013-09-11 224 495
346 2013-09-12 225 510
347 2013-09-13 206 537
348 2013-09-14 193 462
349 2013-09-15 230 522

350 rows × 3 columns

In [35]:
gender_df['gender_ratio'] = gender_df['male_sales_amount'] / gender_df['female_sales_amount']
plot_df = gender_df.copy()
plot_df['total_sales_amount'] = plot_df['male_sales_amount'] + plot_df['female_sales_amount']
print('printing the data frame after adding male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.drop(columns=['female_sales_amount', 'male_sales_amount'], inplace=True)
print('printing dataframe after dropping male and female sales_amount')
print(plot_df.head())
print('-'*50)
plot_df.set_index('sale_day', inplace=True)
plot_df.plot(secondary_y = ['gender_ratio'],figsize=FIG_SIZE, title="Gender ration and sales amount over all 50 weeks")
printing the data frame after adding male and female sales_amount
    sale_day  female_sales_amount  male_sales_amount  gender_ratio  \
0 2012-10-01                  413                101      0.244552   
1 2012-10-02                  379                103      0.271768   
2 2012-10-03                  386                113      0.292746   
3 2012-10-04                  432                114      0.263889   
4 2012-10-05                  368                108      0.293478   

   total_sales_amount  
0                 514  
1                 482  
2                 499  
3                 546  
4                 476  
--------------------------------------------------
printing dataframe after dropping male and female sales_amount
    sale_day  gender_ratio  total_sales_amount
0 2012-10-01      0.244552                 514
1 2012-10-02      0.271768                 482
2 2012-10-03      0.292746                 499
3 2012-10-04      0.263889                 546
4 2012-10-05      0.293478                 476
--------------------------------------------------
Out[35]:
<Axes: title={'center': 'Gender ration and sales amount over all 50 weeks'}, xlabel='sale_day'>

The increasing trend of the male ratio in the sales amount does not explain the sudden change. The reason should be something different. For example, it might be caused by starting a new marketing campaign or adding a new feature to the website.

In [ ]:

Question 5

Assume a given day is divided into four dayparts: night (12:00AM - 6:00AM), morning (6:00AM to 12:00PM), afternoon (12:00PM to 6:00PM) and evening (6:00PM - 12:00AM). What is the percentage of sales in each daypart over all 50 weeks?

In [36]:
df['sale_hour'] = pd.to_datetime(df['sale_time']).dt.ceil('h').dt.hour
In [37]:
df.head()
Out[37]:
sale_time purchaser_gender sale_day sale_hour
0 2013-01-07 02:14:07 female 2013-01-07 3
1 2013-01-07 02:57:53 male 2013-01-07 3
2 2013-01-07 02:59:49 female 2013-01-07 3
3 2013-01-07 03:02:53 male 2013-01-07 4
4 2013-01-07 03:27:36 female 2013-01-07 4
In [38]:
bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']
df['daypart'] = pd.cut(df['sale_hour'], bins=bins, labels=labels, right=False )
df.head()
Out[38]:
sale_time purchaser_gender sale_day sale_hour daypart
0 2013-01-07 02:14:07 female 2013-01-07 3 Night
1 2013-01-07 02:57:53 male 2013-01-07 3 Night
2 2013-01-07 02:59:49 female 2013-01-07 3 Night
3 2013-01-07 03:02:53 male 2013-01-07 4 Night
4 2013-01-07 03:27:36 female 2013-01-07 4 Night
In [39]:
df.drop(columns = ['purchaser_gender', 'sale_hour', 'sale_time'], inplace=True)
In [46]:
df_pct_sales = df.groupby(df['daypart']).count().reset_index()
df_pct_sales.rename(columns={'sale_day': 'day_part_sales_number'}, inplace=True)
df_pct_sales
/tmp/ipython-input-4167007817.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df_pct_sales = df.groupby(df['daypart']).count().reset_index()
Out[46]:
daypart day_part_sales_number
0 Night 13567
1 Morning 56080
2 Afternoon 81159
3 Evening 53523
In [47]:
df_pct_sales['day_part_sales_percentage'] = df_pct_sales['day_part_sales_number'] / df_pct_sales['day_part_sales_number'].sum()
In [48]:
df_pct_sales
Out[48]:
daypart day_part_sales_number day_part_sales_percentage
0 Night 13567 0.066398
1 Morning 56080 0.274459
2 Afternoon 81159 0.397198
3 Evening 53523 0.261945

Afternoon dominates (40%)

The afternoon is clearly the prime sales window, accounting for nearly 4 in 10 sales.

This suggests peak customer activity and should be the focus period for campaigns, staffing, and inventory readiness.


Morning is strong (27%)

A significant share of sales happens in the morning.

Customers are active early in the day — this could be tied to routines (work commute, morning shopping, breakfast/lunch runs, online purchases before work).


Evening is steady (26%)

Evening is almost as important as morning, showing customers engage after work hours.

Potentially a good slot for targeted promotions (e.g., flash sales, reminders, end-of-day discounts).


Night is minimal (7%)

Very low activity, which is expected.

Night orders might come from niche segments: night-shift workers, global customers in other time zones, or automated/recurring orders.

Not a focus area for marketing spend, but operational support (like customer service, system uptime) must still cover it.


Business Takeaways: bold text

Allocate resources (marketing budget, staffing, delivery logistics) proportional to sales share — heavier in afternoon, steady coverage in morning and evening, minimal effort overnight.


Campaign timing matters:

Launch promotions mid-morning to capture morning + afternoon peaks.

Send reminders in the evening to catch the after-work crowd.

Night orders can be monitored separately — if that 7% grows over time, it may indicate international expansion or new customer behavior patterns.

In [ ]: