When I was working on my dissertation project, I found that my model's performance wasn't very good. I believe it's because the domain of the pre-trained GoogleNews-vectors-negative300 embeddings is different from that of my dataset. Hence, I decided to pre-train a word2vec model myself. In this article, I'll use a library called "Koan" released by Bloomberg LP. It implements the CBoW model in C++, which is more efficient than the word2vec and gensim libraries. If you are a Windows user and don't have a Linux system on your computer, please read this [article] I wrote before to set up your WSL.
Introduction
The reason we care about language is that it lets us turn invisible ideas into visible actions. However, language is ambiguous at all levels: lexical, phrasal, and semantic. To address this, we need to build a language model that can convert text into vectors. The most common techniques are Bag of Words (One-Hot Encoding, TF-IDF), Distributional Word Embeddings (Word2Vec, GloVe, FastText), and Contextualised Word Embeddings (ELMo, BERT). In this article, I'm going to train Word2Vec models to generate pre-trained vectors.
Word2Vec
Word2Vec is a method for obtaining word vectors proposed by Tomas Mikolov et al. [4] of Google in 2013. It comes in two flavours, the CBoW model and the Skip-Gram model, both shallow neural networks that map words to a low-dimensional space. The CBoW model predicts the current word from its context, while the Skip-Gram model predicts the context from the current word.
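To make the difference concrete, here is a minimal sketch (purely illustrative, not part of the training pipeline below) of the (input, output) pairs each model is trained on, for a toy sentence and a context window of 2.

sentence = "bitcoin is a digital currency".split()
window = 2

for i, target in enumerate(sentence):
    # words within `window` positions on either side of the target
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBoW: predict the target word from its surrounding context
    print(f"CBoW      {context} -> {target}")
    # Skip-Gram: predict each context word from the target word
    for ctx in context:
        print(f"Skip-Gram {target} -> {ctx}")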
Text Pre-processing
First, you need to read in the CSV file containing your texts.
import pandas as pd

df = pd.read_csv(r"./20061020_20131126_bloomberg_news.csv")
df["title"] = df["title"].apply(str)
df["paragraph"] = df["paragraph"].apply(str)
df.sample(3)
| | title | timestamp | paragraph |
|---|---|---|---|
| 6493 | Coronavirus: Malaysia’s Economy Shows Doing th… | 2020/8/23 | Strict lockdowns, accommodative central banks,… |
| 1833 | Lower Rates: Trump and the Markets Picked Thei… | 2019/8/7 | Collapsing bond yields aren’t exactly a sign … |
| 4376 | Crypto Brokerage Tagomi Gets $12 Million in Se… | 2019/3/4 | Tagomi Holdings Inc., a digital asset brokerag… |
Second, put them into a list.
documents = []
documents.extend(df.loc[:, ["title", "paragraph"]].values.flatten().tolist())
Third, do some text cleaning work.
import re

def regex(text):
    text = re.sub(r"([^a-zA-Z0-9\.\?\,\!\%\']+)", " ", text)  # keep letters, digits and a few punctuation marks
    text = re.sub(r"(?<=\d),(?=\d)", "", text)                 # drop thousands separators, e.g. 1,000 -> 1000
    text = re.sub(r"\,", " , ", text)
    text = re.sub(r"\?", " ? ", text)
    text = re.sub(r"\!", " ! ", text)
    text = re.sub(r"\.", " . ", text)
    text = re.sub(r" +", " ", text)                            # collapse repeated spaces
    text = text.strip()
    return text
docs = [regex(doc) for doc in documents]
docs_cased = [regex(doc.lower()) for doc in documents]
Tokenisation
You’ll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs.
import sys

def progressbar(iter, prefix="", size=50, file=sys.stdout):
    count = len(iter)
    def show(t):
        x = int(size * t / count)
        file.write("{}[{}{}] {}%\r".format(prefix, "█" * x, "." * (size - x), int(100 * t / count)))
        file.flush()
    show(0)
    for i, item in enumerate(iter):
        yield item
        show(i + 1)
    file.write("\n")
    file.flush()
import json
from collections import Counter

class Tokenizer(object):
    def __init__(self,
                 char_level=False,
                 num_tokens=None,
                 pad_token='<PAD>',
                 oov_token='<UNK>',
                 token_to_index=None):
        self.char_level = char_level
        self.separator = '' if self.char_level else ' '
        # reserve space for the <PAD> and <UNK> tokens
        if num_tokens: num_tokens -= 2
        self.num_tokens = num_tokens
        self.oov_token = oov_token
        if not token_to_index:
            token_to_index = {'<PAD>': 0, '<UNK>': 1}
        self.token_to_index = token_to_index
        self.index_to_token = {v: k for k, v in self.token_to_index.items()}

    def __len__(self):
        return len(self.token_to_index)

    def __str__(self):
        return f"<Tokenizer(num_tokens={len(self)})>"

    def fit_on_texts(self, texts):
        if self.char_level:
            all_tokens = [token for text in texts for token in text]
        if not self.char_level:
            all_tokens = [token for text in texts for token in text.split(' ')]
        counts = Counter(all_tokens).most_common(self.num_tokens)
        self.min_token_freq = counts[-1][1]
        for token, count in progressbar(counts, prefix="VOCAB"):
            index = len(self)
            self.token_to_index[token] = index
            self.index_to_token[index] = token
        return self

    def texts_to_sequences(self, texts):
        sequences = []
        for text in progressbar(texts, prefix="TEXT2SEQ"):
            if not self.char_level:
                text = text.split(' ')
            sequence = []
            for token in text:
                sequence.append(self.token_to_index.get(
                    token, self.token_to_index[self.oov_token]))
            sequences.append(sequence)
        return sequences

    def sequences_to_texts(self, sequences):
        texts = []
        for sequence in progressbar(sequences, prefix="SEQ2TEXT"):
            text = []
            for index in sequence:
                text.append(self.index_to_token.get(index, self.oov_token))
            texts.append(self.separator.join([token for token in text]))
        return texts

    def save(self, fp):
        with open(fp, 'w') as fp:
            contents = {
                'char_level': self.char_level,
                'oov_token': self.oov_token,
                'token_to_index': self.token_to_index
            }
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, 'r') as fp:
            kwargs = json.load(fp=fp)
            return cls(**kwargs)
tokeniser = Tokenizer(char_level=False, num_tokens=1000000)
tokeniser.fit_on_texts(docs_cased[:])
sequences = tokeniser.texts_to_sequences(docs_cased[:])
texts = tokeniser.sequences_to_texts(sequences)
print(sequences[0:5])
print("="*50)
print(texts[0:5])
[[21789, 358380, 4, 37272, 4, 61540, 358381, 5009, 1964, 5, 2902, 37914], [21789, 4, 37272, 4, 61540, 9, 1133, 34, 1299, 3, 122, 577, 10, 123, 6313, 1253, 294, 8, 547, 11, 25, 304, 2], [7233, 80031, 1117, 546, 47, 9039, 6, 39, 2225, 7, 29623], [328, 19, 1338, 16712, 6, 126, 179, 2, 305, 241, 14, 11689, 606, 2848, 3368, 4, 3, 1166, 1794, 19, 552, 4, 32651, 34, 259, 4, 2902, 577, 10, 2514, 1352, 8, 252, 2, 9, 596, 13, 18410, 4, 850, 606, 3, 7233, 80031, 2], [304, 6076, 3389, 19, 6, 4488, 90, 1037, 488]]
==================================================
['ethereum xet , xrp , litecoin xlc cryptocurrency alternative to bitcoin btc', 'ethereum , xrp , litecoin and others are giving the world ? s most famous digital currency a run for its money .', 'crypto opportunists create 500 more coins in new phase of mania', 'risk is running rampant in financial markets . stocks trade at dot come era valuations , the ipo pipeline is full , spacs are back , bitcoin ? s headed toward a record . and right on cue , here come the crypto opportunists .', 'money stuff exxon is in trouble over climate change']
After tokenising our corpus, save it to a news.tokens file.
with open('./news.tokens', 'w') as f:
    for item in texts:
        f.write("%s\n" % item)
Training Process
Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus.
CBOW
Move your news.tokens file to your WSL folder. In my case, it is at C:\Users\yangwang\AppData\Local\Packages\CanonicalGroupLimited.Ubuntu18.04onWindows_79rhkp1fndgsc\LocalState\rootfs\home\yang\.
Next, open MobaXterm and execute the following command.
sudo ./build/koan -V 1000000 \
--epochs 10 \
--dim 300 \
--negatives 5 \
--context-size 5 \
-l 0.075 \
--threads 16 \
--cbow true \
--min-count 2 \
--file ./news.tokens
Learned embeddings will be saved to embeddings_${CURRENT_TIMESTAMP}.txt in the present working directory.
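If you train several times, you can pick up the most recent output programmatically. The snippet below is just a small convenience sketch; the embeddings_*.txt naming is Koan's default, and the target file name matches the conversion step later in this article.

import glob
import os

# find the newest embeddings_*.txt written by Koan in the working directory
latest = max(glob.glob("embeddings_*.txt"), key=os.path.getmtime)
os.rename(latest, "news-cbow-negative300.txt")
print(f"renamed {latest} -> news-cbow-negative300.txt")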
Skip-Gram
Similarly, you can get pre-trained vectors with Skip-Gram; just set cbow to false.
sudo ./build/koan -V 1000000 \
--epochs 10 \
--dim 300 \
--negatives 5 \
--context-size 5 \
-l 0.075 \
--threads 16 \
--cbow false \
--min-count 2 \
--file ./news.tokens
Convert GloVe Format to Word2Vec Format
Move your pre-trained vectors back to your Windows folder, and rename the file to news-cbow-negative300.txt (or news-skipgram-negative300.txt, depending on how you trained it). We then convert the GloVe vector format into the word2vec format.
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
_ = glove2word2vec("./news-cbow-negative300.txt", "./news-word2vec-cbow-negative300.txt")
wv_from_text = KeyedVectors.load_word2vec_format("./news-word2vec-cbow-negative300.txt", binary=False)
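Note that glove2word2vec is deprecated in gensim 4.0 and later. If you are on a recent gensim, you can skip the conversion and load the header-less file directly; a sketch, assuming gensim >= 4.0:

from gensim.models import KeyedVectors

# no_header=True tells gensim the file lacks the "count dim" first line (GloVe/Koan style)
wv_from_text = KeyedVectors.load_word2vec_format(
    "./news-cbow-negative300.txt", binary=False, no_header=True)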
Notes
GloVe format (a real example can be found on the Stanford site)
word1 0.123 0.134 0.532 0.152
word2 0.934 0.412 0.532 0.159
word3 0.334 0.241 0.324 0.188
...
word9 0.334 0.241 0.324 0.188
Word2Vec format (a real example can be found in the old w2v repository).
9 4
word1 0.123 0.134 0.532 0.152
word2 0.934 0.412 0.532 0.159
word3 0.334 0.241 0.324 0.188
...
word9 0.334 0.241 0.324 0.188
Voilà! You have successfully obtained a pre-trained word embedding!
wv_from_text.similar_by_word("bitcoin")
[('cryptocurrency', 0.7397603392601013),
('cryptocurrencies', 0.7099655866622925),
('crypto', 0.6509920358657837),
('xrp', 0.5511361360549927),
('ethereum', 0.547865629196167),
('monero', 0.5345401167869568),
("bitcoin's", 0.5305401086807251),
('bitcoins', 0.5253546237945557),
('gold', 0.5229815244674683),
('blockchain', 0.508536159992218)]
Train GloVe on WSL
GloVe (Global Vectors for Word Representation) is an alternative method for creating word embeddings. It is based on matrix factorisation of the word-context co-occurrence matrix.
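As a rough illustration of what that matrix contains, the sketch below (a toy example, independent of the Stanford code) counts weighted word-word co-occurrences within a symmetric window, which is essentially the statistic GloVe factorises.

from collections import defaultdict

corpus = ["bitcoin is a digital currency", "ethereum is a digital asset"]
window = 2
cooc = defaultdict(float)

for line in corpus:
    tokens = line.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                # like GloVe, weight each pair by 1 / distance between the words
                cooc[(w, tokens[j])] += 1.0 / abs(i - j)

print(cooc[("digital", "currency")])  # 1.0 (adjacent words, one sentence)
print(cooc[("is", "digital")])        # 1.0 (distance 2 in both sentences: 0.5 + 0.5)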
Download GloVe
Download the GloVe library from Stanford's GitHub.
git clone https://github.com/stanfordnlp/glove
cd glove && make
./demo.sh
Training Process
To train it on your own corpus, you only have to make changes to one file: demo.sh. After the make line, remove the script from if to fi (the block that downloads the demo text8 corpus). Replace the CORPUS name with your file name, in our case news.tokens. There is another if condition near the end of demo.sh:
if [ "$CORPUS" = 'text8' ]; then
Replace text8 with news.tokens, then run demo.sh once the changes are made.
$ ./demo.sh
Don’t forget to keep your corpus file directly inside the GloVe folder, and make sure it is in the correct format. You’ll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by new line characters.
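A quick sanity check of the corpus layout before launching the script can save a failed training run; the snippet below only assumes the one-document-per-line news.tokens file we wrote earlier.

# print a few basic statistics about news.tokens
with open("./news.tokens", "r") as f:
    lines = f.read().splitlines()

print(len(lines), "documents")          # one document per line
print(lines[0][:80])                    # preview of the first document
print(max(len(line.split()) for line in lines), "tokens in the longest document")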
Sometimes you'll have trouble running ./demo.sh; for instance, you may get sudo: demo.sh: command not found or a Permission denied error.
Here’s a summary of how to troubleshoot the Permission Denied error in our case.
$ ls -l demo.sh    # check the file permissions of demo.sh
---------- 1 yang yang 0 2039-10-21 14:47 demo.sh
 ^^^ ^^^ ^^^
  |   |   |
  |   |   +-- world (others) permissions
  |   +------ group permissions
  +---------- owner permissions
The owner has no read or write access (rw), and the - in the execute position indicates that the executable permission is missing; the chmod command fixes that. (Group and others likewise have no permissions set on the file, so they cannot read, write, or execute it.)
chmod +x demo.sh
chmod +r demo.sh
chmod +w demo.sh
ls -l demo.sh
demo.sh is now executable as far as Linux is concerned.
Convert GloVe Format to Word2Vec Format
_ = glove2word2vec("./news-glove-vectors300.txt", "./news-glove-w2vformat-vectors300.txt")
wv_glove = KeyedVectors.load_word2vec_format("./news-glove-w2vformat-vectors300.txt", binary=False)
Test it on ‘bitcoin’ token.
wv_glove.similar_by_word("bitcoin")
[('cryptocurrency', 0.7422985434532166),
('cryptocurrencies', 0.6949392557144165),
('crypto', 0.6679537296295166),
('blockchain', 0.5640972852706909),
('bitcoins', 0.4695727825164795),
('ethereum', 0.4689256548881531),
('ether', 0.4526808261871338),
('virtual', 0.43389463424682617),
('tokens', 0.42009514570236206),
('coins', 0.418658971786499)]
Train FastText on WSL
FastText is a library for efficient learning of word representations and sentence classification. FastText builds on modern macOS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support.
Download FastText
Install FastText.
wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
unzip v0.9.2.zip
Then move to the fastText directory and build it.
cd fastText-0.9.2
make
pip install .
Training Process
Training word vectors using skipgram:
sudo ./fasttext skipgram -input news.tokens -output news-fasttext-skipgram-vectors300 -minn 3 -maxn 6 -dim 300
Training word vectors using cbow:
sudo ./fasttext cbow -input news.tokens -output news-fasttext-cbow-vectors300 -minn 3 -maxn 6 -dim 300
where news.tokens is the training file containing the UTF-8 encoded text we prepared before. By default the word vectors take into account character n-grams of 3 to 6 characters. At the end of optimization the program saves two files per run: a .bin file and a .vec file (here news-fasttext-skipgram-vectors300 and news-fasttext-cbow-vectors300, each with both extensions). The .vec file is a text file containing the word vectors, one per line. The .bin file is a binary file containing the parameters of the model along with the dictionary and all hyperparameters; it can be used later to compute word vectors or to restart the optimization.
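Since the .vec file already uses the word2vec text format (a header line, then one word and its vector per line), you can load it straight into gensim without any conversion. A sketch using the skip-gram output name from the command above; note that loading the .vec file this way discards the subword information stored in the .bin file.

from gensim.models import KeyedVectors

wv_fasttext_sg = KeyedVectors.load_word2vec_format(
    "./news-fasttext-skipgram-vectors300.vec", binary=False)
print(wv_fasttext_sg.similar_by_word("bitcoin")[:3])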
Instead of training through the command line, you can also train it using Python.
import fasttext
model = fasttext.train_unsupervised('news.tokens',
                                    "cbow",
                                    minn=3,
                                    maxn=6,
                                    dim=300,
                                    epoch=5,
                                    lr=0.05,
                                    thread=12)
model.save_model("news-fasttext-cbow-vectors300.bin")
wv_fasttext_cbow = fasttext.load_model("news-fasttext-cbow-vectors300.bin")
Depending on the quantity of data you have, you may want to change the training parameters. The epoch parameter controls how many times the model loops over your data; by default, it loops over the dataset 5 times. If your dataset is extremely large, you may want to loop over it less often. Another important parameter is the learning rate lr (-lr on the command line). The higher the learning rate, the faster the model converges to a solution, but at the risk of overfitting to the dataset. The default value of 0.05 is a good compromise; if you want to play with it, we suggest staying in the range [0.01, 1]. Finally, fastText is multi-threaded and uses 12 threads by default. If you have fewer CPU cores (say 4), you can easily set the number of threads using the thread flag.
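For example, on a much larger corpus you might lower the number of epochs and match the thread count to your machine; the values below are purely illustrative, not settings tuned for the news corpus used here.

model_sg = fasttext.train_unsupervised('news.tokens',
                                       "skipgram",
                                       dim=300,
                                       epoch=2,    # loop over a massive corpus less often
                                       lr=0.05,    # stay within [0.01, 1]
                                       thread=4)   # match your available CPU cores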
Printing Word Vectors
wv_fasttext_cbow.get_word_vector("bitcoin")
array([-4.72412445e-02, 2.85789132e-01, 3.42660360e-02, 2.09649026e-01,
-4.54065323e-01, -1.91382036e-01, -5.00535131e-01, 1.86818153e-01,
3.03504705e-01, -1.97448403e-01, 1.50050864e-01, 6.53051957e-02,
-7.71196038e-02, -8.81627798e-02, 3.74232829e-02, 1.92417011e-01,
3.55105817e-01, 3.28541487e-01, -3.44138265e-01, -4.90421832e-01,
-2.13972241e-01, 1.74339145e-01, -3.67868505e-02, 1.09374836e-01,
3.75284493e-01, 1.03113867e-01, -1.45857438e-01, -3.04340214e-01,
-2.54121244e-01, 1.69611394e-01, -2.09063217e-01, 2.09711909e-01,
-1.41518816e-01, 1.25664864e-02, 3.95129383e-01, -1.39495045e-01,
8.94690026e-03, 4.83614445e-01, 7.68003613e-02, -1.72020838e-01,
2.65787989e-01, 6.64022043e-02, 1.34228259e-01, 4.24850464e-01,
5.29484272e-01, 7.14946613e-02, -1.55057460e-01, 6.64764345e-02,
-1.79950804e-01, 2.07342580e-02, -5.48851252e-01, 2.00532869e-01,
2.39266697e-02, -3.15076023e-01, 1.58537552e-01, -1.75947800e-01,
-4.23456818e-01, 2.27220535e-01, -1.18757211e-01, -1.85626462e-01,
2.09006771e-01, -1.08534403e-01, 2.79801786e-01, -1.84326231e-01,
3.45385611e-01, 2.19469175e-01, -1.65827513e-01, -9.27144065e-02,
-9.44910273e-02, 4.01960224e-01, 2.21235991e-01, -2.24734709e-01,
5.92879727e-02, 3.68174642e-01, -1.62111774e-01, -3.60321164e-01,
-3.73723418e-01, -2.35717162e-01, -4.61407304e-01, -1.32908091e-01,
6.76851049e-02, 2.14217320e-01, -4.72074896e-01, 1.62981063e-01,
3.71879905e-01, 1.01424217e-01, -2.97889352e-01, -3.91066521e-01,
-2.46688813e-01, 5.42590201e-01, -1.35109276e-01, 3.26993912e-01,
2.32391551e-01, 2.00287759e-01, -1.49581164e-01, -2.75721133e-01,
4.79313314e-01, 2.26864532e-01, -1.83264613e-02, 1.18657842e-01,
1.28447264e-01, -3.34220439e-01, 2.69317508e-01, -2.59843171e-01,
3.10199022e-01, 2.16098920e-01, -1.86288506e-01, 5.94185330e-02,
-4.23078507e-01, 5.34226038e-02, 2.08673358e-01, -1.05236337e-01,
3.77959639e-01, -1.97113946e-01, 3.33479345e-01, 3.94979984e-01,
1.35598034e-01, 7.51101971e-03, 2.95481265e-01, -2.15200692e-01,
2.40353987e-01, 3.65436196e-01, -1.55092150e-01, 1.55085281e-01,
-4.16599452e-01, -3.74957502e-01, -8.32035206e-03, -7.39385858e-02,
2.17583347e-02, -3.48901063e-01, -9.27907787e-03, 1.24386065e-01,
7.21558109e-02, -5.65859616e-01, 2.39448603e-02, -6.12365842e-01,
-3.45480561e-01, 6.63597524e-01, -5.31071126e-01, -3.11197668e-01,
-2.66234726e-01, 4.01567996e-01, 7.12649003e-02, 2.27668926e-01,
3.60199302e-01, 1.40796080e-01, -1.30780600e-02, -4.35646117e-01,
-3.15058351e-01, 1.79761440e-01, -7.38127008e-02, -1.57344565e-01,
-1.30275175e-01, -2.29776427e-01, -3.11963826e-01, 2.51461089e-01,
-7.77154416e-02, -1.93161428e-01, -1.22963764e-01, 1.19474560e-01,
-1.70210376e-02, -6.77634845e-04, 7.12327287e-03, -2.26126343e-01,
2.12814316e-01, 1.10432744e-01, -3.75197530e-01, -2.51778066e-01,
2.61254579e-01, -1.91191047e-01, 1.73024654e-01, -1.69590712e-01,
1.13725312e-01, -4.02675480e-01, -7.49008298e-01, -4.75077957e-01,
4.30675596e-03, -5.70537090e-01, -3.68678004e-01, -1.18338585e-01,
1.02712013e-01, 1.67967491e-02, 5.66727901e-03, 5.40452838e-01,
4.11487877e-01, 6.39163136e-01, 4.11166042e-01, -2.50596225e-01,
-1.04347736e-01, -2.55890310e-01, 1.25067562e-01, 3.32301527e-01,
1.40600502e-01, -2.42391825e-01, -1.40091211e-01, -2.05069736e-01,
-5.73189482e-02, 2.14646116e-01, -2.63260067e-01, 2.00784519e-01,
2.35700160e-01, 3.53334904e-01, 5.38006604e-01, 1.59950554e-01,
1.52627319e-01, -2.47434601e-01, -6.53754920e-02, -1.69809297e-01,
-2.81990021e-01, -4.69022483e-01, -1.67136639e-01, 2.62764134e-02,
-1.31334037e-01, 5.59901476e-01, -1.58817634e-01, -3.86552542e-01,
-3.78590643e-01, 1.53091252e-01, 1.59801438e-01, 3.00560832e-01,
9.51611772e-02, -1.25739768e-01, -2.82772869e-01, -2.11738721e-01,
-1.44721761e-01, 3.01432371e-01, -2.95276958e-02, -4.21232760e-01,
1.95821151e-01, -1.03478849e-01, 3.75818871e-02, 7.30549470e-02,
-1.24263890e-01, 4.21253517e-02, 5.34670353e-02, -6.04710579e-02,
4.18751776e-01, -1.89714432e-01, 7.75871202e-02, 2.64797509e-01,
6.84403598e-01, -2.88427889e-01, 2.65219778e-01, -9.75028351e-02,
-2.16612965e-01, -1.84845805e-01, 3.57705653e-01, 1.84521660e-01,
-2.25650191e-01, -2.41775334e-01, 6.35201484e-02, 1.05721205e-01,
-2.76269794e-01, 7.44905397e-02, -4.05652225e-01, -3.25192034e-01,
1.33607000e-01, -2.70021617e-01, -5.09377658e-01, 8.15921091e-03,
1.39862090e-01, 2.68142492e-01, 3.83002162e-01, 1.91613629e-01,
2.66971558e-01, -2.08550826e-01, -1.84474185e-01, 2.28107542e-01,
-1.41805783e-01, -3.34146500e-01, 5.33484481e-02, 1.27584279e-01,
8.07003453e-02, 1.00570947e-01, -4.74314131e-02, 2.64507622e-01,
5.04497468e-01, 8.56446847e-02, 4.17862684e-01, 1.42475590e-01,
-1.79341078e-01, -2.17798918e-01, 8.03667828e-02, -1.44884512e-01,
-2.44018864e-02, -7.17387274e-02, 8.83749798e-02, 1.36670202e-01,
-1.49312671e-02, -4.16279852e-01, 1.23666152e-01, 4.03715611e-01,
3.15533012e-01, 2.58996665e-01, -2.77972668e-01, 1.68511316e-01,
1.92251951e-01, 1.12253219e-01, -4.47139591e-01, 2.39150673e-01],
dtype=float32)
Nearest neighbor queries
A simple way to check the quality of a word vector is to look at its nearest neighbors. This gives an intuition of the type of semantic information the vectors are able to capture.
wv_fasttext_cbow.get_nearest_neighbors('bitcoin')
[(0.8654916286468506, 'cryptocurrency'),
(0.8515545725822449, 'bitcoins'),
(0.8421329855918884, 'bitcointalk'),
(0.8405554890632629, 'cryptocurrencies'),
(0.8251032829284668, 'tcoin'),
(0.8214054703712463, 'bitcoiners'),
(0.8096168637275696, "cryptocurrency's"),
(0.8051686882972717, 'crypto'),
(0.8023344278335571, "bitcoin's"),
(0.7836618423461914, 'altcoin')]
Even if a word is misspelled, the fastText model can still produce a sensible embedding.
wv_fasttext_cbow.get_nearest_neighbors('bittcoin')
[(0.8647432923316956, 'tcoin'),
(0.8488795161247253, 'bitcoin'),
(0.8280304074287415, 'altcoin'),
(0.8253008127212524, 'virtcoin'),
(0.7866906523704529, 'basecoin'),
(0.7821307182312012, 'gatecoin'),
(0.7780086994171143, 'litecoin'),
(0.7758980989456177, 'estcoin'),
(0.7743834853172302, 'cryptocurrency'),
(0.7679258584976196, 'filecoin')]
To sum up, fastText utilises subword information, while Word2Vec and GloVe don't. With Word2Vec or GloVe, querying an uncommon word like 'weltschmerz' doesn't give a meaningful result: most of the returned words are unrelated, or the query word isn't in the vocabulary at all. With subword information, fastText can still capture variations around the word.
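You can see the difference directly with the vectors trained above; the check below assumes 'weltschmerz' does not occur in the news corpus.

word = "weltschmerz"

# Word2Vec/GloVe vectors only cover words seen during training
if word in wv_from_text:
    print(wv_from_text.similar_by_word(word)[:3])
else:
    print(f"'{word}' is not in the Word2Vec vocabulary")

# fastText always returns neighbours, built from the word's character n-grams
print(wv_fasttext_cbow.get_nearest_neighbors(word)[:3])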
Conclusion
You can now perform various syntactic/semantic NLP word tasks with the trained vectors! Cheers!