Adversarial Validation: Battling Overfitting

Subir Verma
6 min read · Jun 23, 2020

Introduction

In the process of model building, overfitting drains a lot of effort. Diagnosing overfitting is sometimes obvious, when training performance is far better than validation performance, but not always. Especially in a production environment, understanding the root cause of failure becomes hectic and time-consuming. We try a lot of techniques to make sure we do not overfit the training data and generalize well, checking model performance on cross-validation data.
At a high level, overfitting means our model has learned features from the train data that do not generalize well to the test data. There are many reasons for a model to overfit. In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data, or when it has not been exposed to enough variation to cover both train and test scenarios.
But consider the scenario where your test data comes from a different time period in which a lot of patterns have changed. In credit card fraud detection, the fraudsters might have changed their fraud patterns altogether. In a price prediction use case, the market could have changed. These are very real factors when you are building models for the long run.

Consider the ideal situation: training and test examples come from the same distribution, so that the validation error gives a good estimate of the test error and the classifier generalizes well to unseen test examples.

The problem with training examples being different from test examples is that validation is no longer any good for comparing models, because the validation examples originate in the training set. So the cross-validation score, which seems to show the robustness of the model, will be an illusion.

The solution

Photo by Olav Ahrens Røtne on Unsplash

The solution is adversarial validation. It is a fancy term, quite popular in Kaggle competitions, but it is a fairly simple technique for avoiding overfitting and, more importantly, for understanding which variables in your dataset are causing the issue.

What is adversarial validation?

The idea of adversarial validation comes from the FastML post "Adversarial validation". The general idea is to check the degree of similarity between the training and test sets in terms of feature distribution: if they are difficult to distinguish, the distributions are probably similar and the usual validation techniques should work. If they are easy to distinguish, we can suspect they are quite different.
This intuition can be quantified by combining the train and test sets, assigning 0/1 labels (0 for train, 1 for test), and evaluating a binary classification task. For adversarial validation, we want to learn a model that predicts which rows are in the training set and which are in the test set. We therefore create a new target column in which the test samples are labeled 1 and the train samples 0.
Note: the performance of this model is an indicator of how big the problem is.
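Here is a minimal sketch of the idea using scikit-learn, assuming `train_df` and `test_df` are pandas DataFrames with the same numeric feature columns and the original target already removed (the names are placeholders, not from the post):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Label the origin of each row: 0 = train, 1 = test.
adv = pd.concat(
    [train_df.assign(dataset_label=0), test_df.assign(dataset_label=1)],
    ignore_index=True,
)

X = adv.drop(columns="dataset_label")
y = adv["dataset_label"]

# Cross-validated AUC of the "is this row from the test set?" classifier.
# AUC ~ 0.5 -> train and test look alike; AUC near 1.0 -> easy to tell apart.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"Adversarial validation AUC: {auc:.3f}")
```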

CASE 1: Loan Prediction Data Set

In this case, we take the Loan Prediction example dataset from Analytics Vidhya. The task is to predict loan eligibility for the Dream Housing Finance company.

Data Set Description

We drop the ID and Loan Status columns from the train and test DataFrames, concatenate them into a master DataFrame, and add one more column, 'dataset_label'.
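A short pandas sketch of that step, assuming the files are named train.csv and test.csv and the ID and target columns are named "Loan_ID" and "Loan_Status" (adjust the names for your copy of the data):

```python
import pandas as pd

train = pd.read_csv("train.csv").drop(columns=["Loan_ID", "Loan_Status"])
test = pd.read_csv("test.csv").drop(columns=["Loan_ID"])

# dataset_label: 0 = train, 1 = test
train["dataset_label"] = 0
test["dataset_label"] = 1

master = pd.concat([train, test], ignore_index=True)
```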

UMAP plot for Master DataFrame

I looked at the distributions of the train and test sets via a UMAP plot. It is clear from the plot that the train and test sets come from a similar distribution, so ideally it will be very difficult for a model to distinguish between them in the master data.
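A sketch of such a UMAP projection, assuming the umap-learn package and that the master DataFrame has already been made fully numeric (categoricals encoded, missing values imputed):

```python
import umap
import matplotlib.pyplot as plt

features = master.drop(columns="dataset_label")
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(features)

# Colour points by origin: train (0) vs test (1).
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=master["dataset_label"], cmap="coolwarm", s=5, alpha=0.6)
plt.title("UMAP projection of the master DataFrame (train vs test)")
plt.show()
```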

Next, we create the adversarial validation data set from the master data, wrap it in a CatBoost Pool, train a CatBoost classifier, and plot the ROC curve for the validation data. The classifier reaches an AUC of about 0.49.
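A hedged sketch of these steps with CatBoost, reusing the `master` DataFrame from above and assuming missing values have already been handled; the split and hyperparameters are illustrative:

```python
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X = master.drop(columns="dataset_label")
y = master["dataset_label"]
cat_cols = X.select_dtypes(include="object").columns.tolist()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# CatBoost Pools for train and validation splits of the adversarial data.
train_pool = Pool(X_train, y_train, cat_features=cat_cols)
val_pool = Pool(X_val, y_val, cat_features=cat_cols)

model = CatBoostClassifier(iterations=300, eval_metric="AUC", verbose=0)
model.fit(train_pool, eval_set=val_pool)

# ROC curve on the validation split.
proba = model.predict_proba(val_pool)[:, 1]
fpr, tpr, _ = roc_curve(y_val, proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_val, proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```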

The AUC for the validation data comes out close to 0.5, which indicates that it is difficult to distinguish between the train and test sets: they come from the same distribution.

So, in this case, our usual stratified train-validation split should work fine, and the model will perform equally well on the unseen test set.

CASE 2: Russian Housing Market Price Prediction

This is a Kaggle competition dataset for price prediction. In this competition, Sberbank challenged Kagglers to develop algorithms that use a broad spectrum of features to predict realty prices.

Data Set: https://www.kaggle.com/c/sberbank-russian-housing-market

I plotted the prepared adversarial data against the 0/1 train/test label, and we can clearly see that the two sets occupy different regions of feature space, with significant differences between them. So it is fair to assume the model will easily distinguish between them and will have a higher AUC than in Case 1.

3-D plot for Adversarial Data Set
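The post does not say how the 3-D plot was produced; one simple way to get a comparable view is a 3-component PCA projection coloured by the train/test label, sketched below with a placeholder `adv_data` DataFrame standing in for the adversarial data set:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3-D projection)

# Keep only numeric features; fill gaps so PCA can run (illustrative choice).
numeric = adv_data.select_dtypes("number").drop(columns="dataset_label").fillna(0)
coords = PCA(n_components=3).fit_transform(numeric)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
           c=adv_data["dataset_label"], cmap="coolwarm", s=4, alpha=0.5)
ax.set_title("3-D PCA projection of the adversarial data set")
plt.show()
```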

We build the adversarial dataset in the same way, by combining the train and test sets, and train the classifier on it. Below is the ROC curve for the model.

AUC curve for Adversarial Validation Dataset

With an AUC of 0.91 in the graph above, it was pretty easy for the classifier to differentiate between the train and test samples, which simply means they are very different from each other, leaving a high chance of poor performance on the test data. As pointed out at the beginning of the post, there might be a time or market factor in our dataset that makes the feature distributions of the train and test sets very different from each other.

Next Step

What this process cannot do is tell us how to fix the problem; we still need to apply our creativity here. As a next step, we can analyze the adversarial model by looking at its feature importances. The most important features are the ones that help the model differentiate between the labels, so we can drop them (this can create a trade-off with training performance) and watch the adversarial AUC drop. The idea is that you want to remove information that is not important for predicting fraud/price/loan default but is important for separating your training and test sets. As I said at the beginning, the better the adversarial model, the bigger the problem.
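A sketch of that analysis, reusing the `model`, `train_pool`, and `master` objects from the CatBoost example above; the number of features to drop is an arbitrary illustrative choice:

```python
import pandas as pd

# Rank features by how much they help separate train from test.
importances = pd.Series(
    model.get_feature_importance(train_pool),
    index=model.feature_names_,
).sort_values(ascending=False)
print(importances.head(10))

# Drop the top offenders and re-run adversarial validation; the AUC should
# move back towards 0.5 if these features were driving the train/test split.
to_drop = importances.head(3).index.tolist()
reduced_master = master.drop(columns=to_drop)
```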
In some cases, a time-dependent feature, such as a software release date, can be replaced with "days since release".
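For example (the DataFrame and column names here are purely illustrative):

```python
import pandas as pd

# Replace an absolute date with a relative "days since release" feature.
df["days_since_release"] = (
    pd.to_datetime(df["timestamp"]) - pd.to_datetime(df["release_date"])
).dt.days
df = df.drop(columns=["release_date"])
```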

I hope the idea is clear.

In the next post, I will show a hands-on example with a Kaggle dataset. Till then, keep reading, keep practicing.
