Machine Learning

Purged cross-validation with embargo (finance)

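A time-ordered K-fold generator that drops training observations within an embargo window on either side of each test fold. This is essential when labels span several periods (for example, H-period forward returns): labels of neighboring observations overlap, so an unpurged split leaks test information into the training set.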
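To make the overlap concrete, here is a minimal sketch on synthetic data. It assumes the label at time t is the sum of the next H one-period returns; the horizon `H`, the i.i.d. returns `r`, and the helper `lag_corr` are illustrative assumptions, not part of the original snippet.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                                   # hypothetical label horizon, in periods
r = rng.normal(size=5000)                # synthetic i.i.d. one-period returns

# Label at t = r[t+1] + ... + r[t+H], computed from a cumulative sum.
csum = np.concatenate(([0.0], np.cumsum(r)))
labels = csum[H + 1:] - csum[1:-H]       # one label per t with a full window

def lag_corr(x, k):
    """Correlation between x[t] and x[t + k] (illustrative helper)."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

corr_adjacent = lag_corr(labels, 1)      # windows share H - 1 of H returns
corr_disjoint = lag_corr(labels, H)      # windows no longer overlap

print("lag 1:", round(float(corr_adjacent), 2))
print("lag H:", round(float(corr_disjoint), 2))
```

Even with independent returns, adjacent labels are strongly correlated because their windows share most of the same returns, while labels H periods apart are close to uncorrelated. A training point next to the test fold therefore carries almost the same label as its test neighbor.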

Prerequisites

numpy, scikit-learn

Python
import numpy as np
from sklearn.metrics import roc_auc_score

def purged_splits(n_samples, n_splits=5, embargo_pct=0.02):
    """Temporal K-fold: purge training points adjacent to the test fold (embargo).

    Yields (train_idx, test_idx) pairs; rows must be in chronological order.
    """
    embargo = int(n_samples * embargo_pct)   # embargo width, in observations
    bounds = np.linspace(0, n_samples, n_splits + 1, dtype=int)  # fold edges
    for i in range(n_splits):
        t0, t1 = bounds[i], bounds[i + 1]
        test_idx = np.arange(t0, t1)
        train_mask = np.ones(n_samples, dtype=bool)
        lo = max(0, t0 - embargo)
        hi = min(n_samples, t1 + embargo)
        train_mask[lo:hi] = False        # exclude test fold + embargo zones
        yield np.where(train_mask)[0], test_idx

# X, y (NumPy arrays in chronological order) and model (a binary classifier
# exposing predict_proba) are assumed to be defined beforehand.
aucs = []
for tr, te in purged_splits(len(X), n_splits=5):
    model.fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
print("Purged AUC:", np.mean(aucs).round(3))

Result

Purged AUC: 0.563
>>> np.round(aucs, 3)
array([0.581, 0.554, 0.567, 0.549, 0.562])
>>> # same protocol WITHOUT purging or embargo: 0.642 (overlapping labels)
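
The generator above sizes its exclusion zone as a fraction of the sample. When the label horizon is known, the zone can be tied to it directly. The sketch below is a hypothetical variant, not the original author's method: `purged_splits_horizon`, `horizon` and `embargo` are names introduced here. It assumes the label at row t depends on rows t+1 to t+`horizon`, so it purges `horizon` rows on each side of the test fold and adds `embargo` extra rows after it.

```python
import numpy as np

def purged_splits_horizon(n_samples, n_splits=5, horizon=1, embargo=0):
    """Hypothetical variant: purge `horizon` rows around the test fold,
    plus `embargo` extra rows after it. Rows must be chronological."""
    bounds = np.linspace(0, n_samples, n_splits + 1, dtype=int)
    for i in range(n_splits):
        t0, t1 = int(bounds[i]), int(bounds[i + 1])
        # Rows in [t0 - horizon, t0) have labels reaching into the test fold.
        lo = max(0, t0 - horizon)
        # Rows in [t1, t1 + horizon) overlap the label windows of the last
        # test rows; the embargo extends the exclusion further.
        hi = min(n_samples, t1 + horizon + embargo)
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[lo:hi] = False
        yield np.flatnonzero(train_mask), np.arange(t0, t1)

# Usage on a toy index: report fold sizes after purging.
for tr, te in purged_splits_horizon(100, n_splits=5, horizon=3, embargo=2):
    print(len(tr), len(te))
```

Setting `horizon` to the number of periods a label spans keeps the purge no wider than the overlap requires, which preserves more training data than a large percentage-based embargo on long samples.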