
Detect Covariate Shift


Introduction

A supervised machine learning model has two phases: training and testing. When these models are trained, validated, and tested, the train and test data points are normally presumed to follow the same distribution. In the real world, however, the training and test datasets rarely follow the same distribution.

Dataset Shift

Typically, there are three types of shifts:

  1. Covariate Shift: Change in the independent variables.
  2. Prior Probability Shift: Change in the target variable.
  3. Concept Shift: Change in the relationship between the independent variables and the target variable.

The change in the distribution of the input variables between the training and test data is referred to as covariate shift: the training distribution P_train(x) differs from the test distribution P_test(x), while the conditional P(y | x) stays the same. It is the most common form of shift, and it is gaining prominence because some degree of it appears in virtually every real-world dataset.

Let me illustrate this with a hypothetical situation. Suppose you want to predict whether someone has the coronavirus based on their age. The training dataset, however, only contains information about relatively young people, while the test dataset contains information about older people. As a result, the distribution of age in the train and test datasets is significantly different; this difference is the covariate shift.
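
To make this concrete, here is a minimal sketch of that scenario with made-up age distributions (the means and spreads are purely illustrative), using SciPy's two-sample Kolmogorov-Smirnov test to check whether the two samples share one distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_age = rng.normal(loc=30, scale=5, size=1000)  # younger training cohort
test_age = rng.normal(loc=60, scale=5, size=1000)   # older test cohort

# A tiny p-value means the two age samples almost certainly come from
# different distributions, i.e. the feature has shifted
statistic, p_value = stats.ks_2samp(train_age, test_age)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3g}")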

Prior Probability Shift focuses on changes in the distribution of the class variable y, whereas covariate shift focuses on changes in the distribution of the features x. An unbalanced dataset is an intuitive way to think about it. I will only discuss covariate shift in this article, because the other two are still active research areas and little practical work exists to address them.
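
For contrast, here is a tiny illustrative sketch of prior probability shift, where only the class prior P(y) changes between train and test:

import numpy as np

rng = np.random.default_rng(0)
# Train is heavily skewed towards class 0; test is balanced. Only the
# class prior P(y) changes, which is prior probability shift.
y_train = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
y_test = rng.choice([0, 1], size=1000, p=[0.5, 0.5])
print(y_train.mean(), y_test.mean())  # roughly 0.1 vs 0.5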

Covariate Shift

I’ll use an example from my current dissertation project. When I tried to forecast the cryptocurrency trend, the model failed to perform well on the test dataset, so I wondered whether the training and test datasets have different distributions. This is what the dataset looks like.

date        text                                               label
2014-09-18  the dumb money is getting smarter every day        0
2014-09-19  paypal expands acceptance of bitcoin to mercha…    0
2014-09-20  what’s a bitcoin look like? popular photograp…     0
2014-09-21  the kids aren’t into paypal as apple rules mo…     0
2014-09-22  ‘not a buyer of rocket internet shares,’ says …    1

The basic steps that I will follow are:

  1. Text pre-processing, such as removing stopwords, fixing contractions, and lowercasing the text (a sketch of steps 1, 3, and 4 follows below).
  2. Weighted word embeddings to create features for each text (article).
  3. Split the dataset into two parts; the training and test samples should be approximately equal in size, otherwise the classification task in step 5 would be unbalanced.
  4. Add a label called “is_train” to both parts, with a value of 0 for test rows and 1 for train rows.
  5. Combine the training and test datasets and build a classifier to predict the “is_train” label for each row in the combined dataset.
  6. Compute the ROC-AUC of this classifier as an approximation of how much covariate shift the data has.

If the classifier can accurately separate the rows into train and test, the AUC score will be high, in general greater than 0.8 (Dharani et al., 2019), which indicates a significant covariate shift between the train and test sets.
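
Steps 1, 3, and 4 happen before the main snippet below. Here is a minimal sketch of what I mean, assuming NLTK's English stopword list; the contraction map is a toy example, and df is the article DataFrame shown above:

import re

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
# Toy contraction map; the real pipeline uses a much larger one
CONTRACTIONS = {"what's": "what is", "aren't": "are not", "isn't": "is not"}

def preprocess(text: str) -> str:
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

df["text"] = df["text"].apply(preprocess)

# A chronological 50/50 split keeps train and test roughly equal in size
half = len(df) // 2
d_train, d_test = df.iloc[:half].copy(), df.iloc[half:].copy()
d_train["is_train"], d_test["is_train"] = 1, 0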

import numpy as np
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

embedding_model = {
    "word2vec_google_model": word2vec_google_model,
    "glove_25_model": glove_25_model,
    "glove_50_model": glove_50_model,
    "glove_100_model": glove_100_model,
    "glove_200_model": glove_200_model,
    "glove_300_model": glove_300_model,
    "fasttext_model": fasttext_model,
}

# One vectoriser class per weighting scheme
vectorisers = {
    "mean": embedding.MeanEmbeddingVectorizer,
    "tfidf": embedding.TfidfEmbeddingVectorizer,
    "sif": embedding.SifEmbeddingVectorizer,
}

features_set_train, features_set_test = dict(), dict()
results = []
for name, model in embedding_model.items():
    for weight, vectoriser_cls in vectorisers.items():
        # fastText models need the subword-aware lookup
        vectorisor = vectoriser_cls(model, fasttext=(name == "fasttext_model"))

        # Embed the articles, then append "is_train" as the last column
        train_feature = vectorisor.fit_transform(d_train["text"])
        test_feature = vectorisor.transform(d_test["text"])
        d_train_1 = np.concatenate(
            (train_feature, d_train.is_train.values.reshape(-1, 1)), axis=1)
        d_test_1 = np.concatenate(
            (test_feature, d_test.is_train.values.reshape(-1, 1)), axis=1)
        dataset = np.concatenate((d_train_1, d_test_1), axis=0)
        x, y = dataset[:, :-1], dataset[:, -1]

        # Out-of-fold predictions: every row is scored by a model that
        # never saw it during training
        clf = RandomForestClassifier(n_jobs=-1, max_depth=5, min_samples_leaf=5)
        predictions = np.zeros(y.shape)
        skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=100)
        for train_idx, test_idx in skf.split(x, y):
            clf.fit(x[train_idx], y[train_idx])
            predictions[test_idx] = clf.predict_proba(x[test_idx])[:, 1]
        results.append([name, weight, metrics.roc_auc_score(y, predictions)])

        # Keep 2-D PCA projections of the features for visualisation later
        pca = PCA(n_components=2)
        features_set_train[f"{name}-{weight}"] = pca.fit_transform(train_feature)
        features_set_test[f"{name}-{weight}"] = pca.transform(test_feature)
Let’s take a look at the result.

index  embedding              weighting  ROC-AUC
0      word2vec_google_model  mean       0.707722
1      word2vec_google_model  tfidf      0.734097
2      word2vec_google_model  sif        0.634016
3      glove_25_model         mean       0.659759
4      glove_25_model         tfidf      0.703027
5      glove_25_model         sif        0.648547
6      glove_50_model         mean       0.712201
7      glove_50_model         tfidf      0.727277
8      glove_50_model         sif        0.687929
9      glove_100_model        mean       0.724838
10     glove_100_model        tfidf      0.731298
11     glove_100_model        sif        0.685146
12     glove_200_model        mean       0.758758
13     glove_200_model        tfidf      0.732395
14     glove_200_model        sif        0.663374
15     glove_300_model        mean       0.777109
16     glove_300_model        tfidf      0.751629
17     glove_300_model        sif        0.673617
18     fasttext_model         mean       0.736190
19     fasttext_model         tfidf      0.731878
20     fasttext_model         sif        0.667155
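
Every combination lands between about 0.63 and 0.78, so the train and test articles are separable to a worrying degree. The PCA projections saved in the loop can make this visible; here is a minimal sketch with matplotlib, where the key is one of the embedding-weighting combinations from the table:

import matplotlib.pyplot as plt

# Overlay the 2-D PCA projections saved in the loop above; a visible
# offset between the two clouds is the covariate shift itself
key = "glove_300_model-mean"  # the combination with the highest ROC-AUC
train_pca = features_set_train[key]
test_pca = features_set_test[key]

plt.scatter(train_pca[:, 0], train_pca[:, 1], s=8, alpha=0.5, label="train")
plt.scatter(test_pca[:, 0], test_pca[:, 1], s=8, alpha=0.5, label="test")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title(key)
plt.legend()
plt.show()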

Finally, I use HiPlot to understand which parameters influence the metric we want to optimise.

import hiplot as hip
import pandas as pd

df = pd.DataFrame(results, columns=["word embedding", "weighting method", "rocauc"])
data = df.to_dict(orient="records")
hip.Experiment.from_iterable(data).display()