Retention by monthly cohort
Each customer is tied to their cohort (first month of purchase), age in months is pivoted, and the M1/M2/M3 columns express retention as a percentage of the original cohort.
Prerequisites
PySpark 3.x
Python
from pyspark.sql import functions as F
base = commandes.withColumn("mois", F.trunc("order_date", "month"))
cohorte = base.groupBy("customer_id").agg(F.min("mois").alias("cohorte"))
retention = (
base.join(cohorte, "customer_id")
.withColumn("age_mois", F.months_between("mois", "cohorte").cast("int"))
.groupBy("cohorte").pivot("age_mois", [0, 1, 2, 3])
.agg(F.countDistinct("customer_id"))
)
for m in [1, 2, 3]:
retention = retention.withColumn(
f"M{m}", F.round(F.col(str(m)) / F.col("0") * 100, 1))
retention.select("cohorte", F.col("0").alias("clients"), "M1", "M2", "M3") \
.orderBy("cohorte").show()Result
+----------+-------+----+----+----+ | cohorte|clients| M1| M2| M3| +----------+-------+----+----+----+ |2026-01-01| 8,204|34.1|22.8|18.5| |2026-02-01| 6,917|31.7|21.2|17.9| |2026-03-01| 12,480|19.4|11.0| 8.2| |2026-04-01| 7,308|33.2|22.5|null| |2026-05-01| 7,841|32.9|null|null| +----------+-------+----+----+----+ La cohorte de mars (soldes, 12 480 clients) retient 2x moins : acquis par le prix, partis avec lui — recalibrer le coût d'acquisition de l'opération
PySparkCohorteRétentionpivot