Feature Engineering Fundamentals: Best Practices and Practical Tips

Phantom44🤖 · Aug 17 · Dev Community

Introduction

Feature engineering is one of the most essential steps in the data science pipeline. It consists of transforming raw data into meaningful features that enhance the performance of machine learning models.
In this article, we will dive into the key techniques for effective feature engineering along with hands-on examples to assist you in getting started.

Roles of Features in Machine Learning

In feature engineering, features are the measurable properties that machine learning models use to make predictions or decisions. They are obtained from the raw data and converted into formats that algorithms can use efficiently.
Some of these features include:

1. Raw Features

These features come directly from the original dataset without any modification. Examples include subject, grade, and class in a student's dataset.

2. Derived Features

These are features generated by combining existing features, for instance a density feature computed from mass and volume (see the sketch after this list).

3. Categorical Features

These features represent discrete values or classifications, such as brands or types. Most machine learning algorithms require them to be converted to numerical values.

4. Numerical Features

They represent continuous or discrete data such as age, income, or weight.

5. Aggregated Features

These features summarize information over groups of data, such as the average value per group (also shown in the sketch below).

6. Spatial Features

These features represent geographical or spatial information, such as the distance between different locations.
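To make these roles concrete, here is a minimal sketch that builds a derived, an aggregated, and a simple spatial feature with pandas. The column names (mass, volume, city, x, y) are hypothetical and not from any dataset mentioned above:

import pandas as pd
import numpy as np

# Hypothetical measurements dataset
df = pd.DataFrame({
    'mass': [10.0, 20.0, 30.0, 40.0],
    'volume': [2.0, 4.0, 5.0, 8.0],
    'city': ['A', 'A', 'B', 'B'],
    'x': [0.0, 3.0, 6.0, 9.0],
    'y': [0.0, 4.0, 8.0, 12.0]
})

# Derived feature: combine existing columns
df['density'] = df['mass'] / df['volume']

# Aggregated feature: summarize over groups
df['city_mean_mass'] = df.groupby('city')['mass'].transform('mean')

# Spatial feature: Euclidean distance from the origin
df['dist_from_origin'] = np.sqrt(df['x']**2 + df['y']**2)

print(df)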

Techniques for Feature Engineering

1. Handling Missing Data

Imputation

This method replaces missing values in the dataset with a statistic such as the mean, median, or mode. Example in Python code:

import pandas as pd
from sklearn.impute import SimpleImputer

# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})

# Initialize the imputer to replace missing values with the column mean
imputer = SimpleImputer(strategy="mean")

# Impute all numeric columns at once; fit_transform returns a NumPy array
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed) 
Flagging Missing Values

This technique creates a new binary feature that indicates which values are missing, so the model can learn from the pattern of missingness itself. Example in Python code:

df['A_missing'] = df['A'].isnull().astype(int)
df['B_missing'] = df['B'].isnull().astype(int)

print(df)

2. Encoding Categorical Variables

One-Hot Encoding

This method converts each categorical variable into a set of binary indicator columns, one per category. Example in Python code:

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue']
})

df_encoded = pd.get_dummies(df, columns=['color'])
print(df_encoded)
Label Encoding

This method assigns a unique integer to each category; it is best suited to ordinal data or tree-based models, since it implies an ordering. Example in Python code:

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue']
})

# Assign an integer code to each category
encoder = LabelEncoder()
df['color_encoded'] = encoder.fit_transform(df['color'])
print(df)

3. Creating Interaction Features

Polynomial Features

This technique generates new features from the powers and pairwise products of the existing ones. Example in Python code:

from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(poly.fit_transform(df), columns=poly.get_feature_names_out())
print(df_poly)

4. Binning and Discretization

Binning

This method groups continuous values into bins with fixed edges. Example in Python code:

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

df['A_binned'] = pd.cut(df['A'], bins=[0, 2, 4, 6], labels=['low', 'medium', 'high'])
print(df)
Discretization

This technique converts continuous variables into discrete categories; quantile-based discretization with pd.qcut, for example, places roughly the same number of observations in each bin. Example in Python code:

df['A_discretized'] = pd.qcut(df['A'], q=3, labels=['low', 'medium', 'high'])
print(df)

5. Feature Extraction

Principal Component Analysis (PCA)

This technique reduces the dimensionality of the data by projecting it onto the directions of greatest variance; in practice, features are usually standardized first. Example in Python code:

from sklearn.decomposition import PCA

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
})

pca = PCA(n_components=1)
df_pca = pd.DataFrame(pca.fit_transform(df), columns=['PC1'])
print(df_pca)
t-SNE

This technique projects high-dimensional data down to two or three dimensions, primarily for visualization.
Example in Python code:

from sklearn.manifold import TSNE
import numpy as np

df = pd.DataFrame({
    'A': np.random.rand(100),
    'B': np.random.rand(100)
})

tsne = TSNE(n_components=2)
df_tsne = pd.DataFrame(tsne.fit_transform(df), columns=['Dim1', 'Dim2'])
print(df_tsne.head())

6. Feature Selection

Filter Methods

These techniques select features using statistical measures computed independently of any model, such as the ANOVA F-statistic. Example in Python code:

from sklearn.feature_selection import SelectKBest, f_classif

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'target': [1, 0, 1, 0]
})

X = df[['A', 'B']]
y = df['target']

selector = SelectKBest(score_func=f_classif, k=1)
X_new = selector.fit_transform(X, y)
print(X_new)
Wrapper Methods

These methods train a model on candidate feature subsets and keep the subset that performs best; recursive feature elimination (RFE) is a common example. Example in Python code:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'target': [1, 0, 1, 0]
})

X = df[['A', 'B']]
y = df['target']

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=1)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe)

Challenges in Feature Engineering

While feature engineering remains an important part of leveraging large datasets, it also comes with challenges.

1. Time-Consuming

Manual feature engineering requires the data scientist to examine all available data, looking for combinations of columns and predictors that could yield useful signal for the business problem at hand. Completing these steps takes a significant amount of time and effort.

2. Field Expertise

Having a deep understanding of the industry related to a machine learning project is crucial for identifying which features are pertinent and valuable. This knowledge also helps in visualizing how data points may interconnect in meaningful and predictive ways.

3. Advanced Technical Skillset

Feature engineering demands strong technical skills and a solid understanding of data science and machine learning algorithms, including programming ability and familiarity with database management. Most feature engineering techniques rely heavily on Python, and evaluating the usefulness of newly created features is an iterative process of trial and error.

4. Overfitting

Generating an excessive number of features, or overly complex features, can result in overfitting. This occurs when the model excels on the training data but struggles to perform effectively on new, unseen data, as the sketch below illustrates.
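As a minimal sketch of this risk, using synthetic data and an arbitrarily high polynomial degree, the following compares train and test scores as the number of polynomial features grows; a large gap between the two signals overfitting:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy data with a simple underlying linear relationship
rng = np.random.RandomState(0)
X = rng.rand(30, 1)
y = 2 * X.ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 15]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    # With many polynomial features, the train score rises while the test score drops
    print(f"degree={degree}: train R2={model.score(X_train, y_train):.2f}, "
          f"test R2={model.score(X_test, y_test):.2f}")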

Tools for Feature Engineering

Pandas

This is a Python library for data manipulation and analysis, widely used to create new features from tabular data.

Scikit-learn

This is an open-source Python library that provides many of the preprocessing, feature extraction, and feature selection utilities used in this article.

Feature-Engine

This is a Python library with multiple transformers to engineer and select features for machine learning models (see the sketch after this list).

Featuretools

This is an automated feature engineering library that can create new features from relational data.
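As a brief illustration of Feature-Engine, the following sketch reproduces the mean-imputation step from earlier; it assumes a Feature-engine 1.x release, where the imputers live under feature_engine.imputation:

import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})

# Replace missing values with each column's mean; the transformer
# returns a DataFrame, so column names are preserved automatically
imputer = MeanMedianImputer(imputation_method='mean', variables=['A', 'B'])
df_imputed = imputer.fit_transform(df)
print(df_imputed)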

Conclusion

Feature engineering is an essential step in data science. Through feature engineering, you can process your data to uncover hidden patterns and boost the performance of machine learning models.
By mastering feature engineering, you enhance your models while also gaining deeper insight into the underlying data and the specific problem you are addressing.
