Machine Learning

Adversarial validation: are train and test comparable?

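Train a classifier to distinguish training rows from production rows. A cross-validated AUC near 0.5 means it cannot tell them apart, so the two distributions are similar. Above 0.7 the drift is serious, and the features the classifier relies on most point to its source.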
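To make those two readings concrete, here is a minimal sketch on synthetic data. The helper `adversarial_auc`, the `sample` function, the column names `f0`/`f1`, and the 1.5 standard deviation shift are all assumptions chosen for illustration; they are not part of the original snippet.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def adversarial_auc(X_a, X_b, seed=0):
    """Cross-validated AUC of a classifier separating X_a (label 0) from X_b (label 1)."""
    X = pd.concat([X_a, X_b], ignore_index=True)
    y = np.r_[np.zeros(len(X_a)), np.ones(len(X_b))]
    clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=seed)
    # Shuffled stratified folds, so each fold mixes rows from both sources
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Hypothetical two-feature dataset; only f1 depends on `shift`."""
    return pd.DataFrame({
        "f0": rng.normal(0.0, 1.0, n),
        "f1": rng.normal(shift, 1.0, n),
    })

# Case 1: both samples come from the same distribution
auc_same = adversarial_auc(sample(1000), sample(1000))
# Case 2: the mean of f1 moves by 1.5 standard deviations in the second sample
auc_shifted = adversarial_auc(sample(1000), sample(1000, shift=1.5))

print(f"same distribution: {auc_same:.3f}")
print(f"shifted feature:   {auc_shifted:.3f}")
```

The first value should sit close to 0.5 and the second well above 0.7; the exact figures depend on the seed and sample size.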

Prerequisites

scikit-learn, numpy, pandas

Python
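import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X_train and X_prod are assumed to be DataFrames with the same columns.
# Stack them and label each row by origin: 0 = train, 1 = production.
X_adv = pd.concat([X_train, X_prod], ignore_index=True)
y_adv = np.r_[np.zeros(len(X_train)), np.ones(len(X_prod))]

# Shallow forest: enough capacity to detect drift without memorizing rows
clf = RandomForestClassifier(n_estimators=200, max_depth=6,
                             n_jobs=-1, random_state=42)
# Mean cross-validated AUC of the train-vs-production classifier
auc = cross_val_score(clf, X_adv, y_adv, cv=5, scoring="roc_auc").mean()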
print(f"Adversarial AUC: {auc:.3f}")
print("~0.50 = similar distributions | >0.70 = serious drift")

# From 0.6 upward, inspect which features give the source away
if auc > 0.6:
    clf.fit(X_adv, y_adv)
    imp = pd.Series(clf.feature_importances_, index=X_adv.columns)
    print("Features that give away the period/source:")
    print(imp.sort_values(ascending=False).head(5).round(3))

Result

Adversarial AUC: 0.731
~0.50 = similar distributions | >0.70 = serious drift
Features that give away the period/source:
montant        0.412
delai_jours    0.218
solde_moyen    0.097
age            0.054
nb_produits    0.041
dtype: float64
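
One way to check that the top-ranked feature really drives the drift is to drop it and measure the adversarial AUC again. The sketch below does this on synthetic data. The frames `X_train_demo` and `X_prod_demo`, the column names `drifted`, `stable_a`, `stable_b`, and the helper `cv_auc` are hypothetical stand-ins, not the dataset behind the result above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for X_train / X_prod: only "drifted" changes between them
X_train_demo = pd.DataFrame({
    "drifted": rng.normal(0.0, 1.0, n),
    "stable_a": rng.normal(0.0, 1.0, n),
    "stable_b": rng.normal(0.0, 1.0, n),
})
X_prod_demo = pd.DataFrame({
    "drifted": rng.normal(1.5, 1.0, n),
    "stable_a": rng.normal(0.0, 1.0, n),
    "stable_b": rng.normal(0.0, 1.0, n),
})

X_all = pd.concat([X_train_demo, X_prod_demo], ignore_index=True)
y_all = np.r_[np.zeros(n), np.ones(n)]  # 0 = train, 1 = production

def cv_auc(X, y):
    """Mean 5-fold cross-validated AUC of the train-vs-production classifier."""
    clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()

auc_before = cv_auc(X_all, y_all)

# Rank features with a forest fitted on all rows, as in the snippet above
clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
clf.fit(X_all, y_all)
top_feature = pd.Series(clf.feature_importances_, index=X_all.columns).idxmax()

# Re-run the adversarial check without the suspected feature
auc_after = cv_auc(X_all.drop(columns=[top_feature]), y_all)

print(f"Top feature: {top_feature}")
print(f"AUC before: {auc_before:.3f} | after dropping it: {auc_after:.3f}")
```

If the AUC falls back toward 0.5, that feature accounts for most of the drift. On real data the drop is often partial, because correlated features can carry the same signal, so repeating the step on the next feature may be necessary.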
Tags: Adversarial validation, Drift, Diagnostic, Distribution
