
Detect Covariate Shift


Introduction

A supervised machine learning model has two phases: training and testing. When these models are trained, validated, and tested, the train and test data points are normally presumed to follow the same distribution. In the real world, however, the training and test datasets rarely follow the same distribution.

Dataset Shift

Typically, there are three types of shifts:

  1. Covariate Shift: Change in the independent variables.
  2. Prior Probability Shift: Change in the target variable.
  3. Concept Shift: Change in the relationship between the independent variables and the target variable.

The change in the distribution of the input variables between the training and test data is referred to as covariate shift: the training distribution P_train(x) differs from the test distribution P_test(x), while the conditional P(y | x) stays the same. It is the most common form of shift, and it is gaining prominence because some degree of it appears in virtually every real-world dataset.

Let me illustrate this with a hypothetical situation. Suppose you want to predict whether someone has the coronavirus based on their age. The training dataset, however, only contains information about relatively young people, while the test dataset contains information about older people. As a result, the distribution of age in the train and test datasets is significantly different; this difference is the covariate shift.
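
To make this concrete, here is a minimal sketch of that scenario with made-up age distributions (the means and spreads are purely illustrative), using SciPy's two-sample Kolmogorov-Smirnov test to check whether the two samples share one distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_age = rng.normal(loc=30, scale=5, size=1000)  # younger training cohort
test_age = rng.normal(loc=60, scale=5, size=1000)   # older test cohort

# A tiny p-value means the two age samples almost certainly come from
# different distributions, i.e. the feature has shifted
statistic, p_value = stats.ks_2samp(train_age, test_age)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3g}")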

Prior Probability Shift focuses on changes in the distribution of the class variable y, whereas covariate shift focuses on changes in the distribution of the features x. An unbalanced dataset is an intuitive way to think about it. I will only discuss covariate shift in this article, because the other two are still active research areas and little practical work exists to address them.
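
For contrast, here is a tiny illustrative sketch of prior probability shift, where only the class prior P(y) changes between train and test:

import numpy as np

rng = np.random.default_rng(0)
# Train is heavily skewed towards class 0; test is balanced. Only the
# class prior P(y) changes, which is prior probability shift.
y_train = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
y_test = rng.choice([0, 1], size=1000, p=[0.5, 0.5])
print(y_train.mean(), y_test.mean())  # roughly 0.1 vs 0.5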

Covariate Shift

I’ll use an example from my current dissertation project. When I tried to forecast the cryptocurrency trend, the model failed to perform well on the test dataset, so I wondered whether the training and test datasets have different distributions. This is what the dataset looks like.

date        text                                               label
2014-09-18  the dumb money is getting smarter every day        0
2014-09-19  paypal expands acceptance of bitcoin to mercha…    0
2014-09-20  what’s a bitcoin look like? popular photograp…     0
2014-09-21  the kids aren’t into paypal as apple rules mo…     0
2014-09-22  ‘not a buyer of rocket internet shares,’ says …    1

The basic steps that I will follow are:

  1. Text pre-processing, such as removing stopwords, fixing contractions, and lowercasing the text (a sketch of steps 1, 3, and 4 follows below).
  2. Weighted word embeddings to create features for each text (article).
  3. Split the dataset into two parts; the training and test samples should be approximately equal in size, otherwise the classification task in step 5 would be unbalanced.
  4. Add a label called “is_train” to both parts, with a value of 0 for test rows and 1 for train rows.
  5. Combine the training and test datasets and build a classifier to predict the “is_train” label for each row in the combined dataset.
  6. Compute the ROC-AUC of this classifier as an approximation of how much covariate shift the data has.

If the classifier can accurately separate the rows into train and test, the AUC score will be high, in general greater than 0.8 (Dharani et al., 2019), which indicates a significant covariate shift between the train and test sets.
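
Steps 1, 3, and 4 happen before the main snippet below. Here is a minimal sketch of what I mean, assuming NLTK's English stopword list; the contraction map is a toy example, and df is the article DataFrame shown above:

import re

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
# Toy contraction map; the real pipeline uses a much larger one
CONTRACTIONS = {"what's": "what is", "aren't": "are not", "isn't": "is not"}

def preprocess(text: str) -> str:
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

df["text"] = df["text"].apply(preprocess)

# A chronological 50/50 split keeps train and test roughly equal in size
half = len(df) // 2
d_train, d_test = df.iloc[:half].copy(), df.iloc[half:].copy()
d_train["is_train"], d_test["is_train"] = 1, 0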

import numpy as np
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

embedding_model = {
    "word2vec_google_model": word2vec_google_model,
    "glove_25_model": glove_25_model,
    "glove_50_model": glove_50_model,
    "glove_100_model": glove_100_model,
    "glove_200_model": glove_200_model,
    "glove_300_model": glove_300_model,
    "fasttext_model": fasttext_model,
}

# One vectoriser class per weighting scheme
vectorisers = {
    "mean": embedding.MeanEmbeddingVectorizer,
    "tfidf": embedding.TfidfEmbeddingVectorizer,
    "sif": embedding.SifEmbeddingVectorizer,
}

features_set_train, features_set_test = dict(), dict()
results = []
for name, model in embedding_model.items():
    for weight, vectoriser_cls in vectorisers.items():
        # fastText models need the subword-aware lookup
        vectorisor = vectoriser_cls(model, fasttext=(name == "fasttext_model"))

        # Embed the articles, then append "is_train" as the last column
        train_feature = vectorisor.fit_transform(d_train["text"])
        test_feature = vectorisor.transform(d_test["text"])
        d_train_1 = np.concatenate(
            (train_feature, d_train.is_train.values.reshape(-1, 1)), axis=1)
        d_test_1 = np.concatenate(
            (test_feature, d_test.is_train.values.reshape(-1, 1)), axis=1)
        dataset = np.concatenate((d_train_1, d_test_1), axis=0)
        x, y = dataset[:, :-1], dataset[:, -1]

        # Out-of-fold predictions: every row is scored by a model that
        # never saw it during training
        clf = RandomForestClassifier(n_jobs=-1, max_depth=5, min_samples_leaf=5)
        predictions = np.zeros(y.shape)
        skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=100)
        for train_idx, test_idx in skf.split(x, y):
            clf.fit(x[train_idx], y[train_idx])
            predictions[test_idx] = clf.predict_proba(x[test_idx])[:, 1]
        results.append([name, weight, metrics.roc_auc_score(y, predictions)])

        # Keep 2-D PCA projections of the features for visualisation later
        pca = PCA(n_components=2)
        features_set_train[f"{name}-{weight}"] = pca.fit_transform(train_feature)
        features_set_test[f"{name}-{weight}"] = pca.transform(test_feature)
Let’s take a look at the result.

index  embedding              weighting  ROC-AUC
0      word2vec_google_model  mean       0.707722
1      word2vec_google_model  tfidf      0.734097
2      word2vec_google_model  sif        0.634016
3      glove_25_model         mean       0.659759
4      glove_25_model         tfidf      0.703027
5      glove_25_model         sif        0.648547
6      glove_50_model         mean       0.712201
7      glove_50_model         tfidf      0.727277
8      glove_50_model         sif        0.687929
9      glove_100_model        mean       0.724838
10     glove_100_model        tfidf      0.731298
11     glove_100_model        sif        0.685146
12     glove_200_model        mean       0.758758
13     glove_200_model        tfidf      0.732395
14     glove_200_model        sif        0.663374
15     glove_300_model        mean       0.777109
16     glove_300_model        tfidf      0.751629
17     glove_300_model        sif        0.673617
18     fasttext_model         mean       0.736190
19     fasttext_model         tfidf      0.731878
20     fasttext_model         sif        0.667155
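
Every combination lands between about 0.63 and 0.78, so the train and test articles are separable to a worrying degree. The PCA projections saved in the loop can make this visible; here is a minimal sketch with matplotlib, where the key is one of the embedding-weighting combinations from the table:

import matplotlib.pyplot as plt

# Overlay the 2-D PCA projections saved in the loop above; a visible
# offset between the two clouds is the covariate shift itself
key = "glove_300_model-mean"  # the combination with the highest ROC-AUC
train_pca = features_set_train[key]
test_pca = features_set_test[key]

plt.scatter(train_pca[:, 0], train_pca[:, 1], s=8, alpha=0.5, label="train")
plt.scatter(test_pca[:, 0], test_pca[:, 1], s=8, alpha=0.5, label="test")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title(key)
plt.legend()
plt.show()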

Finally, I use HiPlot to understand which parameters influence the metric we want to optimise.

import hiplot as hip
import pandas as pd

df = pd.DataFrame(results, columns=["word embedding", "weighting method", "rocauc"])
data = df.to_dict(orient="records")
hip.Experiment.from_iterable(data).display()