Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in machine learning, mostly in supervised learning. It consists of five processes: feature creation, transformations, feature extraction, exploratory data analysis and benchmarking. In this context, a 'feature' is any measurable input that can be used in a predictive model. It could be the sound of an animal, a color, or someone's voice.

This technique enables data scientists to extract the most valuable insights from data which ensures more accurate predictions and actionable insights.

Types of features

As stated above a feature is any measurable point that can be used in a predictive model. Let's go through the types of the feature engineering for machine learning-

Numerical features: These features are continuous variables that can be measured on a scale. For example: age, weight, height and income. These features can be used directly in machine learning.
Categorical features: These are discrete values that can be grouped into categories. They include: gender, zip-code, and color. Categorical features in machine learning typically need to be converted to numerical features before they can be used in machine learning algorithms. You can easily do this using one-hot, label, and ordinal encoding.
Time-series features: These features are measurements that are taken over time. Time-series features include stock prices, weather data, and sensor readings. These features can be used to train machine learning models that can predict future values or identify patterns in the data.
Text features: These are text strings that can represent words, phrases, or sentences. Examples of text features include product reviews, social media posts, and medical records. You can use text features to train machine learning models that can understand the meaning of text or classify text into different categories.
One of the most crucial processes in the machine learning pipeline is: feature selection, which is the process of selecting the most relevant features in a dataset to facilitate model training. It enhances the model's predictive performance and robustness, making it less likely to overfit to the training data. The process is crucial as it helps to reduce overfitting, enhance model interpretability, improve accuracy, and reduce training times.

Techniques in feature engineering

Imputation

This techniques deals with the handling of Missing values/data. It is one of the issues that you will encounter as you prepare your data for cleaning and even standardization. This is mostly caused by privacy concerns, human error, and even data flow interruptions. It can be classified into two categories:

Categorical Imputation: Missing categorical variables are usually replaced by the most commonly occurring value in other records(mode). It works with both numerical and categorical values. However, it ignores feature correlation. You can use Scikit-learn’s 'SimpleImputer' class for this imputation method. This class also works for imputation by mean and median approaches as well as shown below.

# impute Graduated and Family_Size features with most_frequent values

from sklearn.impute import SimpleImputer
impute_mode = SimpleImputer(strategy = 'most_frequent')
impute_mode.fit(df[['Graduated', 'age']])

df[['Graduated', 'age']] = impute_mode.transform(df[['Graduated', 'age']])

Numerical Imputation: Missing numerical values are generally replaced by the mean of the corresponding value in other records. Also called imputation by mean. This method is simple, fast, and works well with small datasets. However this method has some limitations, such as outliers in a column can skew the result mean that can impact the accuracy of the ML model. It also fails to consider feature correlation while imputing the missing values. You can use the 'fillna' function to impute the missing values in the column mean.

# Impute Work_Experience feature by its mean in our dataset

df['Work_Experience'] = df['Work_Experience'].fillna(df['Work_Experience'].mean())

Encoding

This is the process of converting categorical data into numerical(continuous) data. The following are some of the techniques of feature encoding:

Label encoding: Label encoding is a method of encoding variables or features in a dataset. It involves converting categorical variables into numerical variables.
One-hot encoding: One-hot encoding is the process by which categorical variables are converted into a form that can be used by ML algorithms.
Binary encoding: Binary encoding is the process of encoding data using the binary code. In binary encoding, each character is represented by a combination of 0s and 1s.

Scaling and Normalization

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. For example, if you have multiple independent variables like age, salary, and height; With their range as (18–100 Years), (25,000–75,000 Euros), and (1–2 Meters) respectively, feature scaling would help them all to be in the same range, for example- centered around 0 or in the range (0,1) depending on the scaling technique.

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Here, Xmax and Xmin are the maximum and the minimum values of the feature, respectively.

Binning

Binning (also called bucketing) is a feature engineering technique that groups different numerical subranges into bins or buckets. In many cases, binning turns numerical data into categorical data. For example, consider a feature named X whose lowest value is 15 and highest value is 425. Using binning, you could represent X with the following five bins:

Bin 1: 15 to 34
Bin 2: 35 to 117
Bin 3: 118 to 279
Bin 4: 280 to 392
Bin 5: 393 to 425

Bin 1 spans the range 15 to 34, so every value of X between 15 and 34 ends up in Bin 1. A model trained on these bins will react no differently to X values of 17 and 29 since both values are in Bin 1.

Dimensionality Reduction

This is a method for representing a given dataset using a lower number of features (i.e. dimensions) while still capturing the original data’s meaningful properties.1 This amounts to removing irrelevant or redundant features, or simply noisy data, to create a model with a lower number of variables. Basically transforming high dimensional data into low dimensional data. There are two main approaches to dimensionality reduction -

Feature Selection: Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most important features. There are several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods rank the features based on their relevance to the target variable, wrapper methods use the model performance as the criteria for selecting features, and embedded methods combine feature selection with the model training process.
Feature Extraction: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.

Automated Feature Engineering Tools

There are several tools that are used to automate feature engineering, let's look at some of them.

FeatureTools -This is a popular open-source Python framework for automated feature engineering. It works across multiple related tables and applies various transformations for feature generation. The entire process is carried out using a technique called “Deep Feature Synthesis” (DFS) which recursively applies transformations across entity sets to generate complex features.

Autofeat - This is a python library that provides automated feature engineering and feature selection along with models such as AutoFeatRegressor and AutoFeatClassifier. These are built with many scientific calculations and need good computational power. The following are some of the features of the library:

Works similar to scikit learn models using functions such as fit(), fit_transform(), predict(), and score().
Can handle categorical features with One hot encoding.
Contains a feature selector class for selecting suitable features.
Physical units of features can be passed and relatable features will be computed.
Contains Buckingham Pi theorem – used for computing dimensionless quantities. Only used for tabular data.
Only used for tabular data.

AutoML - Automatic Machine Learning in simple terms can be defined as a search concept, with specialized search algorithms for finding the optimal solutions for each component piece of the ML pipeline. It includes: Automated Feature engineering, Automated Hyperparameter Optimization, )Neural Architecture Search (NAS

Common Issues and Best practices in Feature Engineering

Common Issues

Ignoring irrelevant features: This could result in a model with poor predictive performance, as irrelevant features don’t contribute to the output and might even add noise to the data. The mistake is caused by a lack of understanding and analysis of the relationship between different datasets and the target variable.

Imagine a business that wants to use machine learning to predict monthly sales. They input data such as employee count and office size, which have no relationship with sales volume.
Fix: Avoid this by conducting a thorough feature analysis to understand which data variables are necessary and remove those that are not.

Overfitting from too many features: The model may have perfect performance on training data (because it has effectively ‘memorized’ the data) but may perform poorly on new, unseen data. This is known as overfitting. This mistake is usually due to the misconception that “more is better.” Adding too many features to the model can lead to large complexity, making the model harder to interpret.

Consider an app forecasting future user growth that inputs 100 features into their model, but most of them share overlapping information.
Fix: Counter this by using strategies like dimensionality reduction and feature selection to minimize the number of inputs, thus reducing the model complexity.

Not normalizing features: The algorithm may give more weight to features with a larger scale, which can lead to inaccurate predictions. This mistake often happens due to a lack of understanding of how machine learning algorithms work. Most algorithms perform better if all features are on a similar scale.

Imagine a healthcare provider uses patient age and income level to predict the risk of a certain disease but doesn’t normalize these features, which have different scales.
Fix: Apply feature scaling techniques to bring all the variables into a similar scale to avoid this issue.

Neglecting to handle missing values Models can behave unpredictably when confronted with missing values, sometimes leading to faulty predictions. This pitfall often happens because of an oversight or the assumption that the presence of missing values won’t adversely affect the model.

For example, an online retailer predicting customer churn uses purchase history data but does not address instances where purchase data is absent.
Fix: Implement strategies to deal with missing values, such as data imputation, where you replace missing values with statistical estimates.

Best Practices

Make sure to handle missing data in your input features: In a real-world case where a project aims to predict housing prices, not all data entries may have information about a house’s age. Instead of discarding these entries, you may impute the missing data by using a strategy like “mean imputation,” where the average value of the house’s age from the dataset is used. By correctly handling missing data instead of just discarding it, the model will have more data to learn from, which could lead to better model performance.
Use one-hot encoding for categorical data: For instance, if we have a feature “color” in a dataset about cars, with the possible values of “red,” “blue,” and “green,” we would transform this into three separate binary features: “is_red,” “is_blue,” and “is_green.” This strategy allows the model to correctly interpret categorical data, improving the quality of the model’s findings and predictions.
Consider feature scaling: As a real example, a dataset for predicting disease may have age in years (1100) and glucose level measurements (70180). Scaling places these two features on the same scale, allowing each to contribute equally to distance computations like in the K-nearest neighbors (KNN) algorithm. Feature scaling may improve the performance of many machine learning algorithms, rendering them more efficient and reducing computation time.
Create interaction features where relevant: An example could include predicting house price interactions, which may be beneficial. Creating a new feature that multiplies the number of bathrooms by the total square footage may give the model valuable new information. Interaction features can capture patterns in the data that linear models otherwise wouldn’t see, potentially improving model performance.
Remove irrelevant features: In a problem where we need to predict the price of a smartphone, the color of the smartphone may have little impact on the prediction and can be dropped. Removing irrelevant features can simplify your model, make it faster, more interpretable, and reduce the risk of overfitting.

Feature engineering is not just a pre-processing step in machine learning; it’s a fundamental aspect that can make or break the success of your models. Well-engineered features can lead to more accurate predictions and better generalization. Data Representation: Features serve as the foundation upon which machine learning algorithms operate. By representing data effectively, feature engineering enables algorithms to discern meaningful patterns. Therefore, aspiring and even experienced data scientists and machine learning enthusiasts and engineers must recognize the pivotal role feature engineering plays in extracting meaningful insights from data. By understanding the art of feature engineering and applying it well, one can unlock the true potential of machine learning algorithms and drive impactful solutions across various domains.

If you have any questions, or any ways I could improve my article, please leave them in the comment section. Thank you!

Feature Engineering: Unlocking the Power of Data for Superior Machine Learning Models