Notes on Feature Engineering


Without sufficient data and suitable features, even the most powerful model cannot produce satisfactory output. As the classic saying goes, “Garbage in, garbage out.” For a machine learning problem, the data and features often determine the upper limit of the results, while the choice of model, algorithm, and optimization method only gradually approaches that limit.

Introduction

Feature engineering, as the name implies, is a series of engineering processes on raw data to distill it into features that can be used as input for algorithms and models. In essence, feature engineering is a process of representing the data.

Typically, there are two types of data:

  1. Structured Data: Structured data can be viewed as a table in a relational database. Each column is clearly defined and holds one of two basic types, numeric or categorical, and each row represents a sample.
  2. Unstructured Data: Unstructured data mainly includes text, image, audio, and video. The information it contains cannot be represented by a single numerical value, there is no clear definition of categories, and the size of each piece of data varies.

Feature Normalisation

Normalisation is a scaling technique in which values are shifted and rescaled so that they fall within a fixed interval. Feature normalisation makes different indicators comparable. For example, consider analysing the impact of a person’s height and weight on health: with metres (m) and kilograms (kg) as units, height ranges roughly from 1.6 to 1.8 m while weight ranges from 50 to 100 kg, so the analysis would clearly be dominated by the weight feature, which has the larger numerical spread.

The two most commonly used methods are as follows; a combined sketch of both follows the list.

  1. Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature of 'data' to [0, 1]: x' = (x - min) / (max - min)
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
  2. Z-Score Normalisation
from sklearn.preprocessing import StandardScaler

# Standardise each feature of 'data' to zero mean and unit variance: z = (x - mean) / std
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
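
As a minimal sketch of both scalers on the same data (the values below are made up to mirror the height/weight example above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up sample: height in metres and weight in kilograms, on very different scales
data = np.array([[1.60, 50.0],
                 [1.70, 70.0],
                 [1.80, 100.0]])

# Min-max scaling maps each column to [0, 1]
print(MinMaxScaler().fit_transform(data))

# Z-score normalisation gives each column zero mean and unit variance
print(StandardScaler().fit_transform(data))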

Whether normalisation is needed depends heavily on the algorithm being used. Many estimators are designed under the assumption that each feature takes values close to zero or, more importantly, that all features vary on comparable scales. In particular, metric-based and gradient-based estimators (linear regression, logistic regression, support vector machines, neural networks) often assume approximately standardized data (centered features with unit variances). A notable exception is decision tree-based estimators (random forests, gradient boosted trees), which are robust to arbitrary scaling of the data.

Categorical Feature

A categorical feature is one that takes values from a limited set of options, such as gender (male or female) or blood type (A, B, AB, O). The most commonly used encoding methods are ordinal encoding, one-hot encoding, and binary encoding.

Ordinal Encoding

Ordinal encoding is usually used for categories that have an order relationship. For example, grades can be divided into low, medium, and high, with the ranking “high > medium > low”. Ordinal encoding assigns a numeric ID to each category, e.g., 3 for high, 2 for medium, and 1 for low, so the order relationship is still preserved after conversion.
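
A minimal sketch of this mapping with pandas (the column name and values are hypothetical, following the grade example above):

import pandas as pd

# Hypothetical data: grades with the inherent ranking high > medium > low
df = pd.DataFrame({"grade": ["low", "high", "medium", "high"]})

# Assign numeric IDs that preserve the ranking
order = {"low": 1, "medium": 2, "high": 3}
df["grade_id"] = df["grade"].map(order)

print(df)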

One-hot Encoding

One-hot encoding is usually used for categories with no order relationship. In the “color” example there are 3 categories, so 3 binary variables are needed: a “1” is placed in the binary variable for the sample’s color and “0” in the others, e.g., (1, 0, 0) for red, (0, 1, 0) for green, and (0, 0, 1) for blue.
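
A minimal sketch with pandas (the colour values follow the example above):

import pandas as pd

# Hypothetical data: a colour feature with no order between categories
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# One 0/1 column per category; each row has exactly one "1"
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)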

Binary Encoding

Binary encoding consists of two main steps: first, each category is assigned a category ID using ordinal encoding; then the binary code of that ID is used as the result. Taking color as an example, the ID for red is 1, so its binary representation is 001; the ID for green is 2, so its binary representation is 010; the ID for blue is 3, so its binary representation is 011. Binary encoding is essentially a mapping of IDs to their binary representation, which yields a 0/1 feature vector with fewer dimensions than one-hot encoding and therefore saves storage space.
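
A minimal sketch of the two steps in plain Python (the IDs and the three-digit padding follow the colour example above):

# Step 1: ordinal encoding -- assign a category ID in order of first appearance
colors = ["red", "green", "blue"]
ids = {c: i + 1 for i, c in enumerate(colors)}  # {'red': 1, 'green': 2, 'blue': 3}

# Step 2: use the binary code of each ID, padded to a fixed width (three digits here)
codes = {c: format(i, "03b") for c, i in ids.items()}
print(codes)  # {'red': '001', 'green': '010', 'blue': '011'}

Each binary digit then becomes one 0/1 feature column.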


Author: Yang Wang