Scaling and Normalizing Data for Machine Learning Models 🐍🤖
In machine learning, scaling and normalizing your data are crucial preprocessing steps before feeding it into a model. Proper scaling ensures that each feature contributes comparably to the result, and normalization often improves an algorithm's performance. In this post, we'll explore these concepts in detail, focusing on the methods provided by the scikit-learn library, with code snippets and formulas for clarity.
Why Scale and Normalize ❓
- Improves Model Performance: Many machine learning algorithms perform better when features are on a similar scale. Distance-based methods such as SVM and KNN are especially sensitive to feature scales (see the sketch after this list).
- Faster Convergence: Gradient descent converges faster when features are scaled.
- Reduces Bias: Unscaled features bias the model towards features with a larger range.
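To see the first point in action, here's a minimal sketch comparing KNN with and without standardization. We use scikit-learn's bundled wine dataset purely as an illustration (its features span very different ranges); the exact scores depend on the split, but the scaled version typically does noticeably better:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Wine features live on very different scales (e.g. proline in the hundreds, hue near 1)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# KNN on raw features: distances are dominated by the large-range features
knn = KNeighborsClassifier().fit(X_train, y_train)
print("Unscaled accuracy:", knn.score(X_test, y_test))
# Fit the scaler on the training data only, then transform both splits
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("Scaled accuracy:", knn_scaled.score(scaler.transform(X_test), y_test))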
Scaling Techniques
Standardization (Z-score Normalization)
Standardization scales the data to have a mean of zero and a standard deviation of one.
The formula is: z = (x - μ) / σ
Where:
- x is the original value
- μ is the mean of the feature
- σ is the standard deviation of the feature
Code Example
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print("Standardized Data:\n", standardized_data)
Output:
Standardized Data:
[[-1.34164079 -1.34164079]
[-0.4472136 -0.4472136 ]
[ 0.4472136 0.4472136 ]
[ 1.34164079 1.34164079]]
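You can verify the formula by hand with NumPy. Note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default, so the two results match exactly:
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Recompute z = (x - μ) / σ column by column
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(manual, StandardScaler().fit_transform(data)))  # True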
Min-Max Scaling (Normalization)
Min-Max scaling scales the data to a fixed range, usually [0, 1].
The formula is: x' = (x - x_min) / (x_max - x_min)
Where:
- x is the original value
- x_min is the minimum value of the feature
- x_max is the maximum value of the feature
Code Example
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)
Output:
Normalized Data:
[[0. 0. ]
[0.33333333 0.33333333]
[0.66666667 0.66666667]
[1. 1. ]]
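The range [0, 1] is only the default. MinMaxScaler also accepts a feature_range parameter; as a quick sketch, here's the same data scaled into [-1, 1] instead:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# The column minimum maps to -1 and the column maximum maps to 1
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))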
Normalization Techniques
L2 Normalization
L2 normalization scales each sample (row) so that the Euclidean norm (L2 norm) of its feature vector is 1.
The formula is: x' = x / ||x||_2
Where ||x||_2 is the L2 norm of the feature vector.
Code Example
from sklearn.preprocessing import Normalizer
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Normalizing the data using L2 norm
normalizer = Normalizer(norm='l2')
l2_normalized_data = normalizer.fit_transform(data)
print("L2 Normalized Data:\n", l2_normalized_data)
Output:
L2 Normalized Data:
[[0.4472136 0.89442719]
[0.6 0.8 ]
[0.6401844 0.76822128]
[0.65850461 0.75257669]]
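To confirm each row now has unit length, you can check the row norms with NumPy (a quick sketch reusing the data above):
import numpy as np
from sklearn.preprocessing import Normalizer
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l2_normalized_data = Normalizer(norm='l2').fit_transform(data)
# Every row's Euclidean norm should be 1
print(np.linalg.norm(l2_normalized_data, axis=1))  # [1. 1. 1. 1.]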
L1 Normalization
L1 normalization scales each sample (row) so that the Manhattan norm (L1 norm) of its feature vector is 1.
The formula is: x' = x / ||x||_1
Where ||x||_1 is the L1 norm of the feature vector.
Code Example
from sklearn.preprocessing import Normalizer
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Normalizing the data using L1 norm
normalizer = Normalizer(norm='l1')
l1_normalized_data = normalizer.fit_transform(data)
print("L1 Normalized Data:\n", l1_normalized_data)
Output:
L1 Normalized Data:
[[0.33333333 0.66666667]
[0.42857143 0.57142857]
[0.45454545 0.54545455]
[0.46666667 0.53333333]]
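Analogously, the absolute values in each L1-normalized row sum to 1, which you can check directly:
import numpy as np
from sklearn.preprocessing import Normalizer
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l1_normalized_data = Normalizer(norm='l1').fit_transform(data)
# Each row's absolute values should sum to 1
print(np.abs(l1_normalized_data).sum(axis=1))  # [1. 1. 1. 1.]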
Putting It All Together
Here's an end-to-end example that normalizes the iris dataset and trains a logistic regression model:
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the iris dataset
data = load_iris()
X, y = data.data, data.target
# Normalize the features
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)
# Fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model
score = model.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")
Output:
Model Accuracy: 0.91
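One caveat with the example above: the scaler is fit on the full dataset before the train/test split, so statistics from the test set leak into preprocessing. A common remedy is to fit the scaler on the training data only, for example with scikit-learn's Pipeline; here's one way to do it, not the only one:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# The pipeline fits MinMaxScaler on X_train only, then applies the
# learned min/max to X_test when scoring
pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {pipe.score(X_test, y_test):.2f}")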
Conclusion
Scaling and normalizing your data are fundamental steps in preparing it for machine learning models. scikit-learn provides convenient and efficient tools for both scaling and normalization. Here’s a quick summary of the methods discussed:
- Standardization: Adjusts the data to have a mean of 0 and a standard deviation of 1.
- Min-Max Scaling: Scales the data to a fixed range, usually [0, 1].
- L2 Normalization: Scales the data so that the L2 norm of each row is 1.
- L1 Normalization: Scales the data so that the L1 norm of each row is 1.
By applying these techniques correctly, you can improve both the performance and the convergence of your machine learning models.