
End-to-End Word2Vec Training


Introduction

Word2Vec rests on a straightforward idea: a word's meaning can be inferred from the company it keeps. If two words tend to appear with similar neighbours, their meanings are likely to be similar as well. Building on this assumption, you can use Word2Vec to compute the similarity between two words, and more.

Get Started

In this article, you will learn how to use the Gensim implementation of Word2Vec (written in Python) and make it work efficiently. We need two things to get it working:

  1. Input text data.
  2. Word2Vec parameter settings.

Imports

The following are the libraries we will be using.

import os
import json
import pysbd
import pathlib
import itertools
import unicodedata
import pandas as pd
from glob import glob
from tqdm.auto import tqdm
from collections import Counter
from gensim.models import Word2Vec

Data

In this article, I want to train a Word2Vec model that learns the semantics of financial news, so I use the Bloomberg dataset, which contains full financial news articles. Each file in the D:\nlp-datasets\bloomberg-news folder contains plain text only.
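
Before building the corpus class below, it can help to peek at one raw file and confirm it really is plain text. The snippet is a minimal sketch that reuses the imports above; the actual filenames inside the folder will differ on your machine.

# Pick an arbitrary file from the dataset folder and show its first few non-empty lines.
paths = [p for p in glob(r'D:\nlp-datasets\bloomberg-news\**\*', recursive=True) if os.path.isfile(p)]
with open(paths[0], 'r', encoding='utf-8') as f:
    preview = [line.strip() for line in f if line.strip()]
print(preview[:3])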

class Corpus:
    """Collects the Bloomberg news files, segments them into sentences,
    and yields one tokenised sentence at a time."""

    def __init__(self, file_root=r'D:\nlp-datasets\bloomberg-news'):
        # Walk the dataset folder and collect every file path.
        filepaths = []
        for path, subdirs, files in os.walk(file_root):
            for name in tqdm(files, total=len(files)):
                filepaths.append(os.path.join(path, name))
        print('Start segmenting all the sentences...')
        # pysbd handles sentence boundary detection; BasicTokenizer is defined below.
        seg = pysbd.Segmenter(language="en", clean=False)
        self.tokenizer = BasicTokenizer(
            do_lower_case=True,
            never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")
        )
        self.all_sentences = []
        for filepath in tqdm(filepaths):
            with open(filepath, 'r', encoding='utf-8') as f:
                # Drop empty lines before sentence segmentation.
                lines = list(filter(None, (line.rstrip() for line in f)))
            sentences = list(itertools.chain(*[seg.segment(line) for line in lines]))
            self.all_sentences.extend(sentences)

    def __iter__(self):
        # Gensim expects an iterable of token lists.
        for sentence in self.all_sentences:
            try:
                yield self.tokenizer(sentence)
            except Exception:
                continue

According to the Gensim Word2Vec tutorial, the input to Word2Vec should be a list of tokenised sentences, so a tokeniser is necessary. The BasicTokenizer below is adapted from BERT's reference tokenisation code.

class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
    def __init__(
        self,
        do_lower_case=True,
        never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")
    ):
        """Constructs a BasicTokenizer.
        Args:
          do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case
        self.never_split = never_split

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = self._clean_text(text)
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case and token not in self.never_split:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def __call__(self, text):
        return self.tokenize(text)

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        if text in self.never_split:
            return [text]
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
                (cp >= 0x3400 and cp <= 0x4DBF) or  #
                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or  #
                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a peice of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


def _is_whitespace(char):
    """Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically contorl characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `chars` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
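
Before wiring everything together, a quick sanity check on the tokenizer helps confirm what Word2Vec will actually see. The sentence below is made up; given the code above, the tokenizer lower-cases the text and splits punctuation into separate tokens.

tokenizer = BasicTokenizer(do_lower_case=True)
print(tokenizer("Apple Inc. rose 2.5% on Tuesday."))
# Expected output (roughly):
# ['apple', 'inc', '.', 'rose', '2', '.', '5', '%', 'on', 'tuesday', '.']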

Word2Vec Training

Training the model is fairly straightforward.

sentences = Corpus()
# vector_size: embedding dimensionality; window: context window size;
# min_count: ignore rare words; workers: number of training threads.
model = Word2Vec(sentences=sentences, vector_size=300, window=5, min_count=5, workers=4)
output_txt_file = r'D:\bloomberg-news-summarisation\word2vec-bloomberg-news-300.txt'
model.wv.save_word2vec_format(output_txt_file, binary=False)  # plain-text word2vec format
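
Because the vectors are saved in the plain-text word2vec format, they can be reloaded later without retraining. A minimal sketch using Gensim's KeyedVectors:

from gensim.models import KeyedVectors

# Load only the exported embeddings (no training state).
wv = KeyedVectors.load_word2vec_format(output_txt_file, binary=False)
print(wv.vector_size)  # should be 300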

During the training step above, Gensim first builds the vocabulary and then trains the Word2Vec model. With the default architecture (CBOW), we are really training a shallow neural network with a single hidden layer to predict the current word from its context. After training, we do not use the neural network itself: the objective is the hidden layer's weights, which are exactly the word vectors we set out to learn. These learned vectors are also called embeddings, and they can be viewed as features that characterise the target word. Training on the Bloomberg dataset takes several minutes, so be patient.
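
Once training finishes, the embeddings live in model.wv and can be queried directly. The words below are only examples; depending on min_count they may or may not be in your vocabulary.

# Nearest neighbours of a word in the embedding space.
print(model.wv.most_similar('stock', topn=5))

# Cosine similarity between two words.
print(model.wv.similarity('bank', 'lender'))

# The raw 300-dimensional vector for a single word.
print(model.wv['market'].shape)  # (300,)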

Conclusion

Now that you have completed this Gensim Word2Vec training, think about how you might apply it to real-world problems. For example, to build a sentiment lexicon, you can train a Word2Vec model on a large number of user reviews; the result is a lexicon that covers most of the vocabulary and carries sentiment information as well.
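
As a rough, hypothetical sketch of that idea: the review corpus, the seed words, and the review_model name below are illustrative assumptions, not part of this article's pipeline; the only Gensim calls used are similarity and index_to_key.

# Hypothetical: review_model is a Word2Vec model trained on tokenised user reviews,
# built the same way as the Bloomberg model above.
POSITIVE_SEEDS = ['good', 'great', 'excellent']
NEGATIVE_SEEDS = ['bad', 'poor', 'terrible']

def sentiment_score(wv, word):
    # Average similarity to positive seeds minus average similarity to negative seeds.
    pos = sum(wv.similarity(word, s) for s in POSITIVE_SEEDS) / len(POSITIVE_SEEDS)
    neg = sum(wv.similarity(word, s) for s in NEGATIVE_SEEDS) / len(NEGATIVE_SEEDS)
    return pos - neg

# lexicon = {w: sentiment_score(review_model.wv, w) for w in review_model.wv.index_to_key}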


Author: Yang Wang