Introduction
The Cox Proportional Hazards model is usually associated with healthcare and actuarial modeling. I first learned about it while studying for the CT papers, but I recently came across it in the retail industry. Seeing the approach applied outside insurance fascinated me and led me to dig into it more deeply.
Survival analysis is essential in domains where understanding the timing and risk of an event matters - such as cancer recurrence. In this blog, we will be using a Stratified Cox Proportional Hazards model to predict customer repurchase behavior at the product (SKU) level using the UCI Online Retail dataset.
What Makes It Special in Retail?
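The stratified Cox model is perfect for this use case because it gives each product its own baseline hazard, so repurchase timing can differ by product, while sharing one set of feature coefficients across all products. It also handles censored records, where no repurchase was observed before the data ends.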
The Math (Simplified)
For a customer buying product s:
h(t | X, product = s) = h₀ₛ(t) × exp(β₁ × frequency + β₂ × recency + ... + βₚ × featureₚ)
Here h₀ₛ(t) is the product-specific baseline hazard, and the coefficients β₁, ..., βₚ are shared across all products.
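To make the formula concrete, here is a minimal sketch comparing two customers of the same product. The coefficient and feature values are made up for illustration, and partial_hazard is a hypothetical helper name. Because both customers share the baseline h₀ₛ(t), it cancels in the ratio, so only the covariates matter.

```python
import math

# Hypothetical coefficients for two standardized features (illustrative values only).
betas = {"frequency": 0.40, "recency": -0.10}

def partial_hazard(features, betas):
    """Return exp(beta . x), the part of the hazard that does not depend on time."""
    linear_predictor = sum(betas[name] * features[name] for name in betas)
    return math.exp(linear_predictor)

# Two customers buying the same product s, so they share the baseline h0_s(t).
customer_a = {"frequency": 1.0, "recency": 0.0}
customer_b = {"frequency": 0.0, "recency": 0.0}

# h(t | A) / h(t | B) = exp(beta . (x_A - x_B)); the baseline cancels out.
hazard_ratio = partial_hazard(customer_a, betas) / partial_hazard(customer_b, betas)
print(f"Hazard ratio A vs B: {hazard_ratio:.3f}")
```

The ratio is constant over time, which is exactly the "proportional" part of proportional hazards. It only holds within one product, since different products have different baselines.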
Pipeline Overview
Raw Transactions
|
Data Cleaning (remove cancellations, nulls)
|
Create Survival Dataset
|
For each Customer x Product:
- DURATION_DAYS: Days between purchases
- EVENT: 1 = repurchased, 0 = censored
|
Feature Engineering
|
Customer Features:
- RECENCY: Days since first purchase
- FREQUENCY: Total purchase count
- MONETARY: Average spend
- PRODUCT_FREQUENCY: Times bought this SKU
- DAYS_SINCE_FIRST: Days since first bought this SKU
|
Train Stratified Cox Model
|
strata=['StockCode'] ← Each product gets own baseline
|
Predict & Rank Customers
|
For each product:
- Risk score (relative hazard)
- P(repurchase in 30/60/90 days)
|
Business Insights & Recommendations
Key Code Snippet
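from lifelines import CoxPHFitter

# Train stratified model
# (train_df should hold only the duration, event, strata, and feature columns;
# lifelines treats every other column as a covariate)
cph = CoxPHFitter(penalizer=0.01)
cph.fit(
    train_df,
    duration_col='DURATION_DAYS',
    event_col='EVENT',
    strata=['StockCode'],  # Each product gets its own baseline
)

# Predict repurchase probability
# Relative risk score per row (higher = expected to repurchase sooner)
risk_scores = cph.predict_partial_hazard(test_customers)
# Survival at day 30: one row (t=30), one column per row of test_customers
survival_30d = cph.predict_survival_function(test_customers, times=[30])
repurchase_prob_30d = 1 - survival_30d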
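The link between the risk score and the repurchase probability can be sketched without any library. Under the Cox model, S(t | x) = S₀(t) ^ exp(β·x), so a higher relative hazard pushes survival down and repurchase probability up. The baseline survival value and the risk scores below are invented for illustration, and repurchase_probability is a hypothetical helper. In a stratified model the baseline would be estimated separately for each StockCode.

```python
def repurchase_probability(baseline_survival, partial_hazard):
    """P(repurchase by t) = 1 - S0(t) ** partial_hazard under proportional hazards."""
    return 1.0 - baseline_survival ** partial_hazard

# Hypothetical 30-day baseline survival for one product stratum.
baseline_s30 = 0.80

# Hypothetical risk scores (relative hazards) for three customers.
risk_scores = {"low": 0.5, "average": 1.0, "high": 2.0}

probs_30d = {
    name: repurchase_probability(baseline_s30, score)
    for name, score in risk_scores.items()
}
for name, prob in probs_30d.items():
    print(f"{name:>7}: P(repurchase within 30 days) = {prob:.3f}")
```

A risk score of 2.0 does not double the probability. It doubles the hazard, which here moves the 30-day probability from 0.20 to 0.36.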
Understanding Censored Events
A censored record starts at the last observed purchase, so we don't know what happens next. That means every customer-product pair has exactly one censored record.
Customer A - Shampoo (Two Purchases)
Key Point: Customer A bought 2 times. The first purchase is NOT censored, but the last purchase IS censored.
Timeline:
Feb 15          Apr 1     ...     Dec 9
  |               |                 |
 Buy             Buy        [Observation Ends]
Survival Records Created:
Record 1:
- Start: Feb 15 purchase
- End: Apr 1 purchase (next purchase)
- DURATION_DAYS: 45
- EVENT: 1 (NOT censored - we SAW the repurchase)
Record 2:
- Start: Apr 1 purchase
- End: Dec 9 (observation window ends)
- DURATION_DAYS: 252
- EVENT: 0 (CENSORED - observation ended, never saw next repurchase)
Customer B - Winter Coat (One Purchase Only)
Key Point: Customer B bought only once, so their single purchase IS censored.
Timeline:
Nov 20                  Dec 9
  |                       |
 Buy              [Observation Ends]
Survival Records Created:
Record 1:
- Start: Nov 20 purchase
- End: Dec 9 (observation window ends)
- DURATION_DAYS: 19
- EVENT: 0 (CENSORED - observation ended, never saw repurchase)
The LAST purchase for any customer-product combination is ALWAYS censored. Why? Because we don't know when (or if) they'll repurchase after that.
Any Customer-Product Pair:
Purchase #1 -> Purchase #2 -> Purchase #3 -> ... -> LAST Purchase -> ???
     |              |              |                      |
     '--------------'--------------'                      '--- CENSORED
     ALL these create EVENT=1 records                     This creates EVENT=0
     (we saw what happened next)                          (we DON'T know what's next)
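The rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: build_survival_records is a hypothetical helper, it expects at least one purchase per customer-product pair, and the year 2011 is my assumption because the timelines only give month and day.

```python
from datetime import date

def build_survival_records(purchase_dates, observation_end):
    """Turn one customer-product purchase history into survival rows."""
    dates = sorted(purchase_dates)
    records = []
    # Every purchase followed by another one is an observed event (EVENT = 1).
    for start, next_purchase in zip(dates, dates[1:]):
        records.append(
            {"start": start, "DURATION_DAYS": (next_purchase - start).days, "EVENT": 1}
        )
    # The last purchase is censored at the end of the observation window (EVENT = 0).
    last = dates[-1]
    records.append(
        {"start": last, "DURATION_DAYS": (observation_end - last).days, "EVENT": 0}
    )
    return records

# Assumed year: 2011 (the timelines above only show month and day).
observation_end = date(2011, 12, 9)
customer_a = build_survival_records([date(2011, 2, 15), date(2011, 4, 1)], observation_end)
customer_b = build_survival_records([date(2011, 11, 20)], observation_end)

for row in customer_a + customer_b:
    print(row)
```

Customer A yields durations of 45 and 252 days, and Customer B a single censored record of 19 days, matching the two examples.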
Complete Example: End-to-End Feature Engineering
Input: Survival DataFrame (Before Features)
survival_df (before features):
CustomerID | StockCode | PurchaseDate | DURATION_DAYS | EVENT
12345 | MUG_001 | 2010-01-05 | 7 | 1
12345 | MUG_001 | 2010-01-12 | 39 | 1
12345 | MUG_001 | 2010-02-20 | 657 | 0
12345 | PLATE_02 | 2010-03-01 | 75 | 1
67890 | MUG_001 | 2010-11-15 | 389 | 0
Original Transaction Data
df_original:
CustomerID | StockCode | InvoiceDate | InvoiceNo | Revenue
12345 | MUG_001 | 2010-01-05 | INV001 | 19.98
12345 | MUG_001 | 2010-01-12 | INV002 | 9.99
12345 | MUG_001 | 2010-02-20 | INV003 | 9.99
12345 | PLATE_02 | 2010-03-01 | INV001 | 17.97
12345 | PLATE_02 | 2010-05-15 | INV004 | 11.98
67890 | MUG_001 | 2010-11-15 | INV100 | 9.99
Step 1: Calculate Aggregations
customer_first_purchase:
12345 -> 2010-01-05
67890 -> 2010-11-15
customer_purchase_count:
12345 -> 4 (INV001, INV002, INV003, INV004)
67890 -> 1
customer_avg_revenue:
12345 -> 13.98 ((19.98 + 9.99 + 9.99 + 17.97 + 11.98) / 5)
67890 -> 9.99
customer_product_count:
(12345, MUG_001) -> 3
(12345, PLATE_02) -> 2
(67890, MUG_001) -> 1
customer_product_first:
(12345, MUG_001) -> 2010-01-05
(12345, PLATE_02) -> 2010-03-01
(67890, MUG_001) -> 2010-11-15
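Here is a minimal sketch of these aggregations in plain Python, using the six df_original rows. The variable names are my own, and in a real pipeline this would more likely be a handful of pandas groupby calls. Note one detail that the numbers above imply: the purchase count is over distinct invoices (4), while the average revenue is over line items (5).

```python
from collections import defaultdict
from datetime import date

# df_original rows: (CustomerID, StockCode, InvoiceDate, InvoiceNo, Revenue)
transactions = [
    (12345, "MUG_001", date(2010, 1, 5), "INV001", 19.98),
    (12345, "MUG_001", date(2010, 1, 12), "INV002", 9.99),
    (12345, "MUG_001", date(2010, 2, 20), "INV003", 9.99),
    (12345, "PLATE_02", date(2010, 3, 1), "INV001", 17.97),
    (12345, "PLATE_02", date(2010, 5, 15), "INV004", 11.98),
    (67890, "MUG_001", date(2010, 11, 15), "INV100", 9.99),
]

customer_first_purchase = {}
customer_invoices = defaultdict(set)
customer_revenues = defaultdict(list)
customer_product_count = defaultdict(int)
customer_product_first = {}

for customer, stock, day, invoice, revenue in transactions:
    pair = (customer, stock)
    customer_first_purchase[customer] = min(customer_first_purchase.get(customer, day), day)
    customer_invoices[customer].add(invoice)  # distinct invoices per customer
    customer_revenues[customer].append(revenue)  # one entry per line item
    customer_product_count[pair] += 1
    customer_product_first[pair] = min(customer_product_first.get(pair, day), day)

customer_purchase_count = {c: len(inv) for c, inv in customer_invoices.items()}
customer_avg_revenue = {c: round(sum(r) / len(r), 2) for c, r in customer_revenues.items()}

print(customer_purchase_count)
print(customer_avg_revenue)
print(dict(customer_product_count))
```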
Step 2: Add Features (Raw Values)
survival_df (after adding features, before scaling):
CustomerID | StockCode | PurchaseDate | RECENCY | FREQUENCY | MONETARY | PRODUCT_FREQ | DAYS_SINCE_FIRST
12345 | MUG_001 | 2010-01-05 | 0 | 4 | 13.98 | 3 | 0
12345 | MUG_001 | 2010-01-12 | 7 | 4 | 13.98 | 3 | 7
12345 | MUG_001 | 2010-02-20 | 46 | 4 | 13.98 | 3 | 46
12345 | PLATE_02 | 2010-03-01 | 55 | 4 | 13.98 | 2 | 0
67890 | MUG_001 | 2010-11-15 | 0 | 1 | 9.99 | 1 | 0
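The two date-based columns are simple differences against the Step 1 lookups. A small sketch, with the lookups hard-coded so the snippet runs on its own (the variable names are mine):

```python
from datetime import date

# Lookups from Step 1.
customer_first_purchase = {12345: date(2010, 1, 5), 67890: date(2010, 11, 15)}
customer_product_first = {
    (12345, "MUG_001"): date(2010, 1, 5),
    (12345, "PLATE_02"): date(2010, 3, 1),
    (67890, "MUG_001"): date(2010, 11, 15),
}

# (CustomerID, StockCode, PurchaseDate) for each survival row.
survival_rows = [
    (12345, "MUG_001", date(2010, 1, 5)),
    (12345, "MUG_001", date(2010, 1, 12)),
    (12345, "MUG_001", date(2010, 2, 20)),
    (12345, "PLATE_02", date(2010, 3, 1)),
    (67890, "MUG_001", date(2010, 11, 15)),
]

features = []
for customer, stock, purchase_date in survival_rows:
    features.append({
        "CustomerID": customer,
        "StockCode": stock,
        # Days since the customer's first purchase of anything.
        "RECENCY": (purchase_date - customer_first_purchase[customer]).days,
        # Days since the customer first bought this particular SKU.
        "DAYS_SINCE_FIRST": (purchase_date - customer_product_first[(customer, stock)]).days,
    })

for row in features:
    print(row)
```

The PLATE_02 row shows why the two columns differ: the customer is 55 days old, but the relationship with that product starts at day 0.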
Step 3: Apply Log Transform to MONETARY
LOG_MONETARY = np.log1p(MONETARY)
12345 -> log1p(13.98) = 2.71
67890 -> log1p(9.99) = 2.40
Step 4: Standardize Features
# Calculate means and sample standard deviations (n - 1) over the five rows
RECENCY: mean=21.6, std=26.7
FREQUENCY: mean=3.4, std=1.3
LOG_MONETARY: mean=2.64, std=0.14
PRODUCT_FREQUENCY: mean=2.4, std=0.9
DAYS_SINCE_FIRST: mean=10.6, std=20.0
# Apply standardization: z = (value - mean) / std
survival_df (final, after scaling):
CustomerID | RECENCY | FREQUENCY | LOG_MONETARY | PRODUCT_FREQ | DAYS_SINCE_FIRST
12345 | -0.81 | 0.45 | 0.45 | 0.67 | -0.53
12345 | -0.55 | 0.45 | 0.45 | 0.67 | -0.18
12345 | 0.91 | 0.45 | 0.45 | 0.67 | 1.77
12345 | 1.25 | 0.45 | 0.45 | -0.45 | -0.53
67890 | -0.81 | -1.79 | -1.79 | -1.57 | -0.53
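Steps 3 and 4 can be reproduced with the standard library alone. This is a sketch rather than the exact pipeline code: the raw values are copied from the Step 2 table, standardize is a hypothetical helper, and I am assuming the sample standard deviation (n - 1), which is what pandas uses by default.

```python
import math
import statistics

# Raw feature columns from Step 2, one value per survival row.
raw = {
    "RECENCY": [0, 7, 46, 55, 0],
    "FREQUENCY": [4, 4, 4, 4, 1],
    "MONETARY": [13.98, 13.98, 13.98, 13.98, 9.99],
    "PRODUCT_FREQUENCY": [3, 3, 3, 2, 1],
    "DAYS_SINCE_FIRST": [0, 7, 46, 0, 0],
}

# Step 3: log1p(x) = log(1 + x) compresses the long right tail of spend.
raw["LOG_MONETARY"] = [math.log1p(value) for value in raw.pop("MONETARY")]

def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(value - mean) / std for value in values]

# Step 4: every feature ends up with mean 0 and standard deviation 1.
scaled = {name: standardize(values) for name, values in raw.items()}

for name, values in scaled.items():
    print(f"{name:<18}", [round(v, 2) for v in values])
```

scikit-learn's StandardScaler divides by the population standard deviation instead, so on a sample this small its z-scores come out slightly larger in magnitude. Either choice is fine as long as the same scaling is applied at training and prediction time.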
Why These Features Matter for the Cox Model
RECENCY
Low (new customer) -> Might churn quickly OR become loyal
High (established) -> Stable behavior, predictable repurchase
FREQUENCY
Low (1-2 purchases) -> Uncertain loyalty, might not return
High (10+ purchases) -> Strong engagement, likely to repurchase
MONETARY (LOG)
Low spenders -> Price sensitive, wait for deals
High spenders -> Value quality, repurchase regularly
PRODUCT_FREQUENCY
Bought this product 1 time -> Trying it out
Bought this product 5+ times -> Product loyalty, will repurchase
DAYS_SINCE_FIRST
0 days -> First purchase of product, uncertain
100+ days -> Long relationship with product, stable pattern
Key Takeaways
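Two types of features: customer-level (RECENCY, FREQUENCY, MONETARY) and customer-product-level (PRODUCT_FREQUENCY, DAYS_SINCE_FIRST).
Transformations matter: MONETARY is log-transformed, and every feature is standardized before fitting.
These features become the X in the Cox model:
h(t | X, product = s) = h₀ₛ(t) × exp(β₁ × RECENCY + β₂ × FREQUENCY + ...)
The model learns which features make customers repurchase faster or slower. A fitted coefficient might tell us, for example, "High FREQUENCY customers have a 1.5x higher repurchase hazard."
Overall, the stratified Cox model is a strong choice for product-level repurchase prediction.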
Business Insights & Recommendations
1. Most Important Factors for Repurchase
covariate | coef | exp(coef) | p | abs_coef
PRODUCT_FREQUENCY | 0.3679 | 1.4447 | 0.0000e+00 | 0.3679
DAYS_SINCE_FIRST | 0.0909 | 1.0951 | 1.8065e-20 | 0.0909
RECENCY | -0.0594 | 0.9423 | 1.0159e-05 | 0.0594
LOG_MONETARY | 0.0325 | 1.0330 | 3.6136e-06 | 0.0325
FREQUENCY | 0.0197 | 1.0199 | 5.0609e-02 | 0.0197
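Interpretation: exp(coef) is the hazard ratio for a one-unit increase in a feature, which is one standard deviation here because the features are standardized. PRODUCT_FREQUENCY is by far the strongest driver, raising the repurchase hazard by about 44% per unit. DAYS_SINCE_FIRST (about +10%) and LOG_MONETARY (about +3%) have smaller positive effects, RECENCY lowers the hazard by about 6%, and FREQUENCY is not significant at the 5% level (p ≈ 0.051).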
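Reading the table comes down to exponentiating each coefficient. The sketch below does that for the five coefficients reported above. The "per +1 unit" wording assumes the model was fit on the standardized features from Step 4, so one unit means one standard deviation.

```python
import math

# Coefficients copied from the summary table above.
coefs = {
    "PRODUCT_FREQUENCY": 0.3679,
    "DAYS_SINCE_FIRST": 0.0909,
    "RECENCY": -0.0594,
    "LOG_MONETARY": 0.0325,
    "FREQUENCY": 0.0197,
}

# exp(coef) is the hazard ratio for a one-unit increase in the feature.
hazard_ratios = {name: math.exp(coef) for name, coef in coefs.items()}

# Rank by absolute coefficient, strongest effect first.
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
for name in ranked:
    pct_change = (hazard_ratios[name] - 1.0) * 100.0
    print(f"{name:<18} HR = {hazard_ratios[name]:.4f} ({pct_change:+.1f}% hazard per +1 unit)")
```

A hazard ratio above 1 means faster repurchase and below 1 means slower. Statistical significance still has to be read from the p column, which this snippet does not use.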
2. Product-Level Targeting Opportunities
Products with highest average 30-day repurchase probability:
StockCode | PROB_30D | PROB_60D | PROB_90D | RISK_SCORE | CustomerID
20725 | 0.9533 | 0.9989 | 0.9999 | 15.3214 | 10
20727 | 0.8957 | 0.9932 | 0.9991 | 11.0845 | 10
85099B | 0.8094 | 0.9741 | 0.9937 | 11.4678 | 10
22139 | 0.7032 | 0.8342 | 0.8956 | 10.3694 | 10
47566 | 0.5935 | 0.8458 | 0.9056 | 4.2543 | 10