Spark

Retention by monthly cohort

Each customer is tied to their cohort (first month of purchase), age in months is pivoted, and the M1/M2/M3 columns express retention as a percentage of the original cohort.

Prerequisites

PySpark 3.x

Python
from pyspark.sql import functions as F

base = commandes.withColumn("mois", F.trunc("order_date", "month"))
cohorte = base.groupBy("customer_id").agg(F.min("mois").alias("cohorte"))

retention = (
    base.join(cohorte, "customer_id")
        .withColumn("age_mois", F.months_between("mois", "cohorte").cast("int"))
        .groupBy("cohorte").pivot("age_mois", [0, 1, 2, 3])
        .agg(F.countDistinct("customer_id"))
)

for m in [1, 2, 3]:
    retention = retention.withColumn(
        f"M{m}", F.round(F.col(str(m)) / F.col("0") * 100, 1))

retention.select("cohorte", F.col("0").alias("clients"), "M1", "M2", "M3") \
         .orderBy("cohorte").show()

Result

+----------+-------+----+----+----+
|   cohorte|clients|  M1|  M2|  M3|
+----------+-------+----+----+----+
|2026-01-01|  8,204|34.1|22.8|18.5|
|2026-02-01|  6,917|31.7|21.2|17.9|
|2026-03-01| 12,480|19.4|11.0| 8.2|
|2026-04-01|  7,308|33.2|22.5|null|
|2026-05-01|  7,841|32.9|null|null|
+----------+-------+----+----+----+

La cohorte de mars (soldes, 12 480 clients) retient 2x moins : acquis par
le prix, partis avec lui — recalibrer le coût d'acquisition de l'opération
PySparkCohorteRétentionpivot

Related snippets

Back to the Data Lab