Github repository here

Introduction

The Cox proportional hazards model is most commonly seen in healthcare and actuarial modeling. I first learned about it while studying for the CT papers, but I recently came across its use in the retail industry. I was fascinated to see it applied to something other than insurance, which led me to dive deeper into it.

Survival analysis is essential in domains where understanding the timing and risk of an event matters - such as cancer recurrence. In this post, we use a Stratified Cox Proportional Hazards model to predict customer repurchase behavior at the product (SKU) level using the UCI Online Retail dataset.

What Makes It Special in Retail?

The stratified Cox model suits this use case well because it:

  • Predicts BOTH likelihood AND timing of repurchase
  • Handles censoring correctly - Treats "hasn't bought again yet" as partial information instead of assuming "will never buy"
  • Product-specific baselines - Each product gets its own repurchase timing curve
  • Shared customer behaviors - Covariate effects (e.g. frequent buyers repurchasing sooner) are estimated once and apply across ALL products
  • Statistical efficiency - Borrows strength across products

    The Math (Simplified)

    For a customer buying product s:

    h(t | X, product=s) = h₀ₛ(t) × exp(β₁ × recency + β₂ × frequency + ... + βₚ × featureₚ)
                          ^              ^
                  Product-specific    Shared across
                  baseline hazard     all products

    Pipeline Overview

    Raw Transactions
        |
    Data Cleaning (remove cancellations, nulls)
        |
    Create Survival Dataset
        |
        For each Customer x Product:
        - DURATION_DAYS: Days between purchases
        - EVENT: 1 = repurchased, 0 = censored
        |
    Feature Engineering
        |
        Customer Features:
        - RECENCY: Days since the customer's first purchase (any product)
        - FREQUENCY: Total purchase count
        - MONETARY: Average spend
        - PRODUCT_FREQUENCY: Times bought this SKU
        - DAYS_SINCE_FIRST: Days since first bought this SKU
        |
    Train Stratified Cox Model
        |
        strata=['StockCode'] ← Each product gets own baseline
        |
    Predict & Rank Customers
        |
        For each product:
        - Risk score (relative hazard)
        - P(repurchase in 30/60/90 days)
        |
    Business Insights & Recommendations

    Key Code Snippet

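    from lifelines import CoxPHFitter

    # train_df holds the duration, event, and strata columns plus the scaled
    # features; every remaining column is treated as a covariate
    cph = CoxPHFitter(penalizer=0.01)
    cph.fit(
        train_df,
        duration_col='DURATION_DAYS',
        event_col='EVENT',
        strata=['StockCode'],  # Each product gets own baseline
    )

    # Relative risk score exp(β·x) for each row
    risk_scores = cph.predict_partial_hazard(test_customers)

    # Survival probability at day 30; test_customers needs the same features
    # plus StockCode so each row is matched to its product's baseline
    survival_30d = cph.predict_survival_function(test_customers, times=[30])
    repurchase_prob_30d = 1 - survival_30d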
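To make the link between the risk score and the 30-day probability concrete: under a Cox model, a customer's survival curve is the product's baseline survival raised to the power of their partial hazard, S(t | x) = S₀ₛ(t)^exp(β·x). The sketch below applies that identity in plain Python. The baseline survival value and the risk scores are made-up numbers for illustration, not output from the fitted model.

```python
def repurchase_probability(baseline_survival: float, partial_hazard: float) -> float:
    """P(repurchase by time t) = 1 - S0(t) ** partial_hazard under a Cox model."""
    return 1.0 - baseline_survival ** partial_hazard

# Hypothetical baseline: for a customer whose partial hazard is 1, there is
# an 80% chance of NOT having repurchased this product by day 30.
s0_30 = 0.80

for risk in (0.5, 1.0, 2.0):  # hypothetical partial hazards exp(beta . x)
    p = repurchase_probability(s0_30, risk)
    print(f"risk score {risk:.1f} -> P(repurchase in 30d) = {p:.3f}")
```

A risk score above 1 pushes the survival curve down faster than the baseline, so the repurchase probability rises. Two customers with the same risk score can still get different probabilities if they buy different products, because each product has its own baseline.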

    Understanding Censored Events

    A censored record is the last observed purchase, where we don't know what happens next. That means every customer-product pair has exactly one censored record.

    Customer A - Shampoo (Two Purchases)

    Key Point: Customer A bought 2 times. The first purchase is NOT censored, but the last purchase IS censored.

    Timeline:
    Feb 15           Apr 1           ...           Dec 9
      |               |                             |
      Buy             Buy                           [Observation Ends]
    
    Survival Records Created:
    
    Record 1:
      - Start: Feb 15 purchase
      - End: Apr 1 purchase (next purchase)
      - DURATION_DAYS: 45
      - EVENT: 1 (NOT censored - we SAW the repurchase)
    
    Record 2:
      - Start: Apr 1 purchase
      - End: Dec 9 (observation window ends)
      - DURATION_DAYS: 252
      - EVENT: 0 (CENSORED - observation ended, never saw next repurchase)

    Customer B - Winter Coat (One Purchase Only)

    Key Point: Customer B bought only once, so their single purchase IS censored.

    Timeline:
    Nov 20                                    Dec 9
      |                                         |
      Buy                                       [Observation Ends]
    
    Survival Records Created:
    
    Record 1:
      - Start: Nov 20 purchase
      - End: Dec 9 (observation window ends)
      - DURATION_DAYS: 19
      - EVENT: 0 (CENSORED - observation ended, never saw repurchase)

    The LAST purchase for any customer-product combination is ALWAYS censored. Why? Because we don't know when (or if) they'll repurchase after that.

    Any Customer-Product Pair:
    
    Purchase #1 -> Purchase #2 -> Purchase #3 -> ... -> LAST Purchase -> ???
       |              |              |                        |
       '--------------'--------------'                        '--- CENSORED
            ALL these create EVENT=1 records          This creates EVENT=0
            (we saw what happened next)               (we DON'T know what's next)
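The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the post's actual pipeline code: the function name `build_survival_records` is hypothetical, and the year 2011 is an assumption, since the timelines above only give day and month.

```python
from datetime import date

def build_survival_records(purchase_dates, observation_end):
    """Turn one customer-product purchase history into survival records.

    Each purchase starts a record. The record ends at the next purchase
    (EVENT=1) or, for the last purchase, at the end of observation (EVENT=0).
    """
    dates = sorted(purchase_dates)
    records = []
    for i, start in enumerate(dates):
        is_last = i == len(dates) - 1
        end = observation_end if is_last else dates[i + 1]
        records.append({
            "start": start,
            "DURATION_DAYS": (end - start).days,
            "EVENT": 0 if is_last else 1,
        })
    return records

# Year 2011 is assumed for illustration
observation_end = date(2011, 12, 9)

# Customer A - Shampoo (two purchases)
customer_a = build_survival_records(
    [date(2011, 2, 15), date(2011, 4, 1)], observation_end
)
# Customer B - Winter Coat (one purchase only)
customer_b = build_survival_records([date(2011, 11, 20)], observation_end)

for rec in customer_a + customer_b:
    print(rec)
```

A real pipeline would likely also collapse multiple purchases of the same product on the same day, since those would otherwise produce zero-length durations.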

    Complete Example: End-to-End Feature Engineering

    Input: Survival DataFrame (Before Features)

    survival_df (before features):
      CustomerID | StockCode | PurchaseDate | DURATION_DAYS | EVENT
      12345      | MUG_001   | 2010-01-05   | 7             | 1
      12345      | MUG_001   | 2010-01-12   | 39            | 1
      12345      | MUG_001   | 2010-02-20   | 657           | 0
      12345      | PLATE_02  | 2010-03-01   | 75            | 1
      67890      | MUG_001   | 2010-11-15   | 389           | 0

    Original Transaction Data

    df_original:
      CustomerID | StockCode | InvoiceDate | InvoiceNo | Revenue
      12345      | MUG_001   | 2010-01-05  | INV001    | 19.98
      12345      | MUG_001   | 2010-01-12  | INV002    | 9.99
      12345      | MUG_001   | 2010-02-20  | INV003    | 9.99
      12345      | PLATE_02  | 2010-03-01  | INV001    | 17.97
      12345      | PLATE_02  | 2010-05-15  | INV004    | 11.98
      67890      | MUG_001   | 2010-11-15  | INV100    | 9.99

    Step 1: Calculate Aggregations

    customer_first_purchase:
      12345 -> 2010-01-05
      67890 -> 2010-11-15
    
    customer_purchase_count (unique invoices):
      12345 -> 4  (INV001, INV002, INV003, INV004)
      67890 -> 1
    
    customer_avg_revenue (mean revenue per transaction line):
      12345 -> 13.98  ((19.98 + 9.99 + 9.99 + 17.97 + 11.98) / 5)
      67890 -> 9.99
    
    customer_product_count:
      (12345, MUG_001)  -> 3
      (12345, PLATE_02) -> 2
      (67890, MUG_001)  -> 1
    
    customer_product_first:
      (12345, MUG_001)  -> 2010-01-05
      (12345, PLATE_02) -> 2010-03-01
      (67890, MUG_001)  -> 2010-11-15

    Step 2: Add Features (Raw Values)

    survival_df (after adding features, before scaling):
    
    CustomerID | StockCode | PurchaseDate | RECENCY | FREQUENCY | MONETARY | PRODUCT_FREQ | DAYS_SINCE_FIRST
    12345      | MUG_001   | 2010-01-05   | 0       | 4         | 13.98    | 3            | 0
    12345      | MUG_001   | 2010-01-12   | 7       | 4         | 13.98    | 3            | 7
    12345      | MUG_001   | 2010-02-20   | 46      | 4         | 13.98    | 3            | 46
    12345      | PLATE_02  | 2010-03-01   | 55      | 4         | 13.98    | 2            | 0
    67890      | MUG_001   | 2010-11-15   | 0       | 1         | 9.99     | 1            | 0
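Steps 1 and 2 can be reproduced with the standard library alone. The sketch below is one possible way to compute the aggregations and raw features from the `df_original` rows shown above. It is not the post's actual code (a real pipeline would more likely use pandas `groupby`), and the helper name `features` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# (CustomerID, StockCode, InvoiceDate, InvoiceNo, Revenue), copied from df_original
transactions = [
    (12345, "MUG_001",  date(2010, 1, 5),   "INV001", 19.98),
    (12345, "MUG_001",  date(2010, 1, 12),  "INV002", 9.99),
    (12345, "MUG_001",  date(2010, 2, 20),  "INV003", 9.99),
    (12345, "PLATE_02", date(2010, 3, 1),   "INV001", 17.97),
    (12345, "PLATE_02", date(2010, 5, 15),  "INV004", 11.98),
    (67890, "MUG_001",  date(2010, 11, 15), "INV100", 9.99),
]

# Step 1: aggregations
invoices = defaultdict(set)       # customer -> unique invoice numbers
revenues = defaultdict(list)      # customer -> revenue of each transaction line
customer_first = {}               # customer -> first purchase date
product_count = defaultdict(int)  # (customer, product) -> number of purchases
product_first = {}                # (customer, product) -> first purchase date

for cust, sku, day, invoice, revenue in transactions:
    invoices[cust].add(invoice)
    revenues[cust].append(revenue)
    customer_first[cust] = min(customer_first.get(cust, day), day)
    product_count[cust, sku] += 1
    product_first[cust, sku] = min(product_first.get((cust, sku), day), day)

# Step 2: raw (unscaled) features for one survival record
def features(cust, sku, purchase_date):
    return {
        "RECENCY": (purchase_date - customer_first[cust]).days,
        "FREQUENCY": len(invoices[cust]),
        "MONETARY": round(sum(revenues[cust]) / len(revenues[cust]), 2),
        "PRODUCT_FREQ": product_count[cust, sku],
        "DAYS_SINCE_FIRST": (purchase_date - product_first[cust, sku]).days,
    }

plate_row = features(12345, "PLATE_02", date(2010, 3, 1))
print(plate_row)
```

As in the tables above, FREQUENCY and PRODUCT_FREQ are computed over the customer's whole history, so every record for the same customer carries the same values.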

    Step 3: Apply Log Transform to MONETARY

    LOG_MONETARY = np.log1p(MONETARY)
    
    12345 -> log1p(13.98) = 2.71
    67890 -> log1p(9.99)  = 2.40

    Step 4: Standardize Features

    # Calculate means and sample stds (ddof=1) over the five rows above
    RECENCY:           mean=21.6,  std=26.7
    FREQUENCY:         mean=3.4,   std=1.3
    LOG_MONETARY:      mean=2.64,  std=0.14
    PRODUCT_FREQUENCY: mean=2.4,   std=0.9
    DAYS_SINCE_FIRST:  mean=10.6,  std=20.0
    
    # Apply standardization: z = (value - mean) / std, using unrounded values
    survival_df (final, after scaling):
    
    CustomerID | RECENCY | FREQUENCY | LOG_MONETARY | PRODUCT_FREQ | DAYS_SINCE_FIRST
    12345      | -0.81   |  0.45     |  0.45        |  0.67        | -0.53
    12345      | -0.55   |  0.45     |  0.45        |  0.67        | -0.18
    12345      |  0.91   |  0.45     |  0.45        |  0.67        |  1.77
    12345      |  1.25   |  0.45     |  0.45        | -0.45        | -0.53
    67890      | -0.81   | -1.79     | -1.79        | -1.57        | -0.53
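Steps 3 and 4 can be checked with a short standard-library script. This is a sketch under one stated assumption: it uses the sample standard deviation (ddof=1, which is what `statistics.stdev` computes). scikit-learn's `StandardScaler`, by contrast, divides by the population standard deviation, so its output would differ slightly on a table this small.

```python
import math
import statistics

# Raw rows from Step 2:
# (RECENCY, FREQUENCY, MONETARY, PRODUCT_FREQ, DAYS_SINCE_FIRST)
rows = [
    (0, 4, 13.98, 3, 0),
    (7, 4, 13.98, 3, 7),
    (46, 4, 13.98, 3, 46),
    (55, 4, 13.98, 2, 0),
    (0, 1, 9.99, 1, 0),
]
names = ["RECENCY", "FREQUENCY", "LOG_MONETARY", "PRODUCT_FREQ", "DAYS_SINCE_FIRST"]

# Step 3: replace MONETARY with log1p(MONETARY) = ln(1 + MONETARY)
logged = [(r, f, math.log1p(m), pf, d) for r, f, m, pf, d in rows]

# Step 4: z-score every column with the sample standard deviation (ddof=1)
columns = list(zip(*logged))
means = [statistics.mean(col) for col in columns]
stds = [statistics.stdev(col) for col in columns]
scaled = [
    tuple((value - mu) / sd for value, mu, sd in zip(row, means, stds))
    for row in logged
]

for name, mu, sd in zip(names, means, stds):
    print(f"{name:<17} mean={mu:.2f}  std={sd:.2f}")
for row in scaled:
    print("  ".join(f"{z:+.2f}" for z in row))
```

After scaling, every column has mean 0, which is why the fitted coefficients can be compared with each other directly.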

    Why These Features Matter for the Cox Model

    RECENCY

    Low (new customer)  -> Might churn quickly OR become loyal
    High (established)  -> Stable behavior, predictable repurchase

    FREQUENCY

    Low (1-2 purchases)   -> Uncertain loyalty, might not return
    High (10+ purchases)  -> Strong engagement, likely to repurchase

    MONETARY (LOG)

    Low spenders  -> Price sensitive, wait for deals
    High spenders -> Value quality, repurchase regularly

    PRODUCT_FREQUENCY

    Bought this product 1 time   -> Trying it out
    Bought this product 5+ times -> Product loyalty, will repurchase

    DAYS_SINCE_FIRST

    0 days    -> First purchase of product, uncertain
    100+ days -> Long relationship with product, stable pattern

    Key Takeaways

    Two types of features:

  • Customer-level (FREQUENCY, MONETARY) → Same for all products
  • Product-level (PRODUCT_FREQUENCY, DAYS_SINCE_FIRST) → Specific to customer-product pair

    Transformations matter:

  • Log transform → Handles skewed distributions
  • Standardization → Puts features on same scale

    These features become the X in the Cox model:

    h(t | X, product=s) = h₀ₛ(t) × exp(β₁ × RECENCY + β₂ × FREQUENCY + ...)

    The model learns which features make customers repurchase faster or slower. For example, a fitted coefficient might translate to "customers one standard deviation higher in FREQUENCY have a 1.5x higher repurchase hazard."

    The stratified Cox model is a strong fit for product-level repurchase prediction.

    Business Insights & Recommendations

    1. Most Important Factors for Repurchase

                        coef   exp(coef)          p       abs_coef
    covariate
    PRODUCT_FREQUENCY  0.3679   1.4447     0.0000e+00     0.3679
    DAYS_SINCE_FIRST   0.0909   1.0951     1.8065e-20     0.0909
    RECENCY           -0.0594   0.9423     1.0159e-05     0.0594
    LOG_MONETARY       0.0325   1.0330     3.6136e-06     0.0325
    FREQUENCY          0.0197   1.0199     5.0609e-02     0.0197

    Interpretation:

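  • PRODUCT_FREQUENCY: INCREASES repurchase hazard by 44.5% per one-standard-deviation increase (Hazard Ratio: 1.445, p-value: <0.0001)
  • DAYS_SINCE_FIRST: INCREASES repurchase hazard by 9.5% per one-standard-deviation increase (Hazard Ratio: 1.095, p-value: <0.0001)
  • RECENCY: DECREASES repurchase hazard by 5.8% per one-standard-deviation increase (Hazard Ratio: 0.942, p-value: <0.0001)
  • LOG_MONETARY and FREQUENCY: much smaller effects (Hazard Ratios: 1.033 and 1.020); FREQUENCY is not significant at the 5% level (p-value: 0.051)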
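The percentages in this interpretation come from exponentiating each coefficient. The short sketch below shows the conversion, using the coefficients copied from the table above. The helper name `hazard_ratio` is just for illustration.

```python
import math

# Coefficients copied from the table above
coefs = {
    "PRODUCT_FREQUENCY": 0.3679,
    "DAYS_SINCE_FIRST": 0.0909,
    "RECENCY": -0.0594,
    "LOG_MONETARY": 0.0325,
    "FREQUENCY": 0.0197,
}

def hazard_ratio(coef: float) -> float:
    """Multiplicative change in hazard for a one-unit increase in the feature."""
    return math.exp(coef)

for name, coef in coefs.items():
    hr = hazard_ratio(coef)
    pct = (hr - 1.0) * 100.0
    print(f"{name:<18} HR={hr:.3f}  change in hazard={pct:+.1f}%")

# Effects multiply: a customer one unit higher on BOTH PRODUCT_FREQUENCY and
# DAYS_SINCE_FIRST has exp(0.3679 + 0.0909) times the baseline hazard.
combined = math.exp(coefs["PRODUCT_FREQUENCY"] + coefs["DAYS_SINCE_FIRST"])
print(f"combined HR = {combined:.3f}")
```

Because the features were standardized before fitting, "one unit" here means one standard deviation of that feature.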
    2. Product-Level Targeting Opportunities

    Products with highest average 30-day repurchase probability:

                PROB_30D  PROB_60D  PROB_90D  RISK_SCORE  CustomerID (count)
    StockCode
    20725       0.9533    0.9989    0.9999    15.3214     10
    20727       0.8957    0.9932    0.9991    11.0845     10
    85099B      0.8094    0.9741    0.9937    11.4678     10
    22139       0.7032    0.8342    0.8956    10.3694     10
    47566       0.5935    0.8458    0.9056     4.2543     10

    3. Customer Segmentation for Targeting

  • HIGH INTENT (>60% prob in 30 days): 43 customers → Gentle reminder email or push notification. They're already primed to buy - don't over-incentivize.
  • MEDIUM INTENT (30-60% prob in 30 days): 7 customers → Offer modest discount (10-15%) to accelerate purchase. Personalized product recommendations.
  • LOW INTENT (<30% prob in 30 days): 0 customers → Skip for now or try re-engagement campaign. Avoid annoying them with irrelevant offers.
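The three segments can be expressed as a small bucketing function. The thresholds come from the list above; the customer IDs and probabilities in the example are hypothetical, and the function name `intent_segment` is only for illustration.

```python
def intent_segment(prob_30d: float) -> str:
    """Bucket a 30-day repurchase probability using the thresholds above."""
    if prob_30d > 0.60:
        return "HIGH"
    if prob_30d >= 0.30:
        return "MEDIUM"
    return "LOW"

# Hypothetical (customer_id, 30-day probability) pairs, for illustration only
scored = [("C1", 0.95), ("C2", 0.45), ("C3", 0.60), ("C4", 0.12)]

segments = {}
for customer_id, prob in scored:
    segments.setdefault(intent_segment(prob), []).append(customer_id)

print(segments)
```

Note the boundary handling: exactly 60% falls into MEDIUM and exactly 30% also falls into MEDIUM, matching ">60%" and "<30%" in the definitions.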
    4. Inventory & Demand Forecasting

  • Expected repurchases in next 30 days: 40
  • Expected repurchases in next 60 days: 46
  • Expected repurchases in next 90 days: 48

    Use these counts for inventory planning and demand forecasting.
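These counts are consistent with simply summing the predicted repurchase probabilities over all scored customer-product pairs. The sketch below checks this against the product-level table from section 2. It assumes that the table's last column (10) is the number of customers scored per product, and that summing probabilities is how the counts were produced; the post does not state either explicitly.

```python
# Mean repurchase probability per product at each horizon (days),
# copied from the product-level table in section 2
mean_prob = {
    "20725":  {30: 0.9533, 60: 0.9989, 90: 0.9999},
    "20727":  {30: 0.8957, 60: 0.9932, 90: 0.9991},
    "85099B": {30: 0.8094, 60: 0.9741, 90: 0.9937},
    "22139":  {30: 0.7032, 60: 0.8342, 90: 0.8956},
    "47566":  {30: 0.5935, 60: 0.8458, 90: 0.9056},
}
customers_per_product = 10  # assumed meaning of the table's last column

# By linearity of expectation, the expected number of repurchases is the sum
# of the individual probabilities: mean probability x customers, per product.
expected = {
    horizon: sum(p[horizon] * customers_per_product for p in mean_prob.values())
    for horizon in (30, 60, 90)
}

for horizon, count in expected.items():
    print(f"Expected repurchases in next {horizon} days: {count:.1f}")
```

The expected count needs no independence assumption between customers, but it says nothing about the spread around that number, which matters when sizing safety stock.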