Predicting House Prices with Scikit-learn: A Complete Guide

Amit Chandra - Sep 6 - Dev Community

Machine learning is transforming many industries, including real estate. A common task is predicting house prices from features such as the number of bedrooms, bathrooms, square footage, and location. In this article, we build a machine learning model with scikit-learn to predict house prices, covering each step from data preprocessing to model deployment.

Table of Contents

  1. Introduction to Scikit-learn
  2. Problem Definition
  3. Data Collection
  4. Data Preprocessing
  5. Feature Selection
  6. Model Training
  7. Model Evaluation
  8. Model Tuning (Hyperparameter Optimization)
  9. Model Deployment
  10. Conclusion

1. Introduction to Scikit-learn

Scikit-learn is one of the most widely used libraries for machine learning in Python. It offers simple and efficient tools for data analysis and modeling. Whether you’re dealing with classification, regression, clustering, or dimensionality reduction, scikit-learn provides an extensive set of utilities to help you build robust machine learning models.

In this guide, we’ll build a regression model using scikit-learn to predict house prices. Let’s walk through each step of the process.


2. Problem Definition

The task at hand is to predict the price of a house based on its features such as:

  • Number of bedrooms
  • Number of bathrooms
  • Area (in square feet)
  • Location

This is a supervised learning problem where the target variable (house price) is continuous, making it a regression task. Scikit-learn provides a variety of algorithms for regression, such as Linear Regression and Random Forest, which we will use in this project.


3. Data Collection

You can either use a real-world dataset like the Kaggle House Prices dataset or gather your own data from a public API.

Here’s a sample of how your data might look:

Bedrooms   Bathrooms   Area (sq.ft)   Location   Price ($)
3          2           1500           Boston     300,000
4          3           2000           Seattle    500,000

The target variable here is the Price.
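
As a minimal sketch, the sample rows above can be loaded into a pandas DataFrame and split into features and target. The column names are shortened versions of the table headers, and the variable names data, X, and y are simply the ones the later snippets assume; a real project would read a full dataset from a file instead.

```python
import pandas as pd

# Two illustrative rows copied from the sample table above
data = pd.DataFrame({
    "Bedrooms": [3, 4],
    "Bathrooms": [2, 3],
    "Area": [1500, 2000],
    "Location": ["Boston", "Seattle"],
    "Price": [300000, 500000],
})

# Features (X) are every column except the target; y is the target
X = data.drop(columns=["Price"])
y = data["Price"]

print(X.shape, y.shape)
```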


4. Data Preprocessing

Before feeding the data into a machine learning model, we need to preprocess it. This includes handling missing values, encoding categorical features, and scaling the data.

Handling Missing Data

Missing data is common in real-world datasets. We can either fill missing values with a statistical measure like the median or drop rows with missing data:

# numeric_only=True skips text columns such as Location, which have no median
data.fillna(data.median(numeric_only=True), inplace=True)
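
The two options mentioned above, filling with the median and dropping incomplete rows, can be compared on a toy frame. The numbers here are made up for illustration, and which option is better depends on how much data is missing.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Area value (illustrative numbers only)
data = pd.DataFrame({
    "Bedrooms": [3, 4, 2],
    "Area": [1500.0, np.nan, 900.0],
    "Location": ["Boston", "Seattle", "Boston"],
})

# Option 1: fill numeric gaps with each column's median
filled = data.fillna(data.median(numeric_only=True))

# Option 2: drop any row that contains a missing value
dropped = data.dropna()

print(filled["Area"].tolist())
print(len(dropped))
```

Filling keeps all three rows, while dropping keeps only the two complete ones.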

Encoding Categorical Features

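Since most machine learning models require numerical input, we need to convert categorical features like Location into numbers. Label Encoding assigns a unique integer to each category. Note that these integers imply an ordering between cities that does not really exist, which can mislead linear models; tree-based models such as Random Forest are less affected: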

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data['Location'] = encoder.fit_transform(data['Location'])
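
A common alternative for unordered categories is one-hot encoding, which creates one 0/1 column per city instead of a single integer column. This is a sketch of that alternative technique, not the approach used in the rest of this article, and the three sample cities are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({"Location": ["Boston", "Seattle", "Boston"]})

# handle_unknown="ignore" encodes cities unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
location_encoded = encoder.fit_transform(data[["Location"]]).toarray()

print(encoder.categories_[0])
print(location_encoded)
```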

Feature Scaling

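It’s important to scale input features such as Area so that they are on comparable scales, especially for algorithms sensitive to feature magnitude. Price is the target rather than a feature, so it is not part of X. In practice, fit the scaler on the training rows only so that test-set statistics do not leak into the model; the snippet below scales the whole numeric feature matrix X for brevity: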

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
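
StandardScaler replaces each value x with (x - mean) / std, using the mean and standard deviation it learned during fit. The sketch below, with made-up Area values and hypothetical X_train and X_test arrays, shows the statistics being learned from training rows and then reused on a test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Area values in square feet
X_train = np.array([[1000.0], [1500.0], [2000.0]])
X_test = np.array([[1750.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuses those statistics, no refit

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```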

5. Feature Selection

Not all features contribute equally to the target variable. Feature selection helps in identifying the most important features, which improves model performance and reduces overfitting.

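In this project, we use SelectKBest with f_regression, which scores each feature with a univariate F-test derived from its correlation with the target and keeps the k highest-scoring features. The value of k should not exceed the number of columns in X, so with only the four features listed earlier, choose a smaller k (or k='all'); k=5 assumes a wider dataset: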

from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)
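
To see which columns SelectKBest actually keeps, you can inspect its get_support mask. The sketch below uses synthetic data in which, by construction, only the first two of four features influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 4 features; only the first two drive the target
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)

# Boolean mask marking which input columns were kept
print(selector.get_support())
print(X_new.shape)
```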

6. Model Training

Now that we have preprocessed the data and selected the best features, it’s time to train the model. We’ll use two regression algorithms: Linear Regression and Random Forest.

Linear Regression

Linear regression fits a linear function of the features (a straight line when there is a single feature), choosing coefficients that minimize the sum of squared differences between the predicted and actual values:

from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
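
After fitting, the learned slope and intercept are available as coef_ and intercept_. As a small sanity check under made-up assumptions, the sketch below generates noise-free prices from a known formula and confirms that the model recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: price = 200 * area + 50_000 (made-up numbers)
area = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 200.0 * area[:, 0] + 50_000.0

linear_model = LinearRegression()
linear_model.fit(area, price)

print(linear_model.coef_, linear_model.intercept_)
```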

Random Forest

Random Forest is an ensemble method that uses multiple decision trees and averages their results to improve accuracy and reduce overfitting:

from sklearn.ensemble import RandomForestRegressor
forest_model = RandomForestRegressor(n_estimators=100)
forest_model.fit(X_train, y_train)
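
A fitted Random Forest also exposes feature_importances_, which gives a rough view of which inputs the trees relied on. The sketch below uses synthetic data where the first column dominates by construction; random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data: the target depends strongly on column 0, not at all on column 2
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest_model = RandomForestRegressor(n_estimators=100, random_state=0)
forest_model.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = forest_model.feature_importances_
print(importances)
```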

Train-Test Split

To evaluate how well our models generalize, we split the data into training and testing sets. This step has to run before the fit calls above, because it creates X_train and y_train:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
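
One way to keep the order of operations safe is to split first and wrap scaling, feature selection, and the model in a scikit-learn Pipeline, so that every step is fit on the training rows only. This is a sketch of that alternative arrangement on synthetic data, not the exact sequence used elsewhere in this article; the step names are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the numeric house features and prices
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each step is fit on the training rows only when pipeline.fit is called
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(pipeline.score(X_test, y_test))  # R² on the held-out rows
```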

7. Model Evaluation

After training the models, we need to evaluate their performance using metrics like Mean Squared Error (MSE) and R-squared (R²).

Mean Squared Error (MSE)

MSE calculates the average squared difference between the predicted and actual values. A lower MSE indicates better performance:

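from sklearn.metrics import mean_squared_error
# Predict on the held-out test set (shown here for the Random Forest model)
y_pred = forest_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)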
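
To make the definition concrete, here is the same calculation done by hand on three made-up prices (in thousands of dollars). Taking the square root gives the RMSE, which is often easier to interpret because it is in the same units as the price.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices in thousands of dollars
y_test = np.array([300.0, 500.0, 700.0])
y_pred = np.array([290.0, 500.0, 720.0])

# Errors are 10, 0, -20, so squared errors are 100, 0, 400
mse_manual = np.mean((y_test - y_pred) ** 2)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # back in the target's own units

print(mse_manual, mse, rmse)
```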

R-squared (R²)

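R² tells us what fraction of the variance in the target variable the model explains. A value of 1 means perfect prediction, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that: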

from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
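
R² can be written as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below checks that formula against r2_score on three made-up values.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([300.0, 500.0, 700.0])
y_pred = np.array([290.0, 500.0, 720.0])

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # spread of the actual values around their mean
r2_manual = 1.0 - ss_res / ss_tot
r2 = r2_score(y_test, y_pred)

print(r2_manual, r2)
```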

To compare the Linear Regression and Random Forest models, compute y_pred with each model's predict method on X_test and compare the resulting MSE and R² values.
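
A side-by-side comparison might look like the sketch below. It runs on synthetic, mostly linear data, so the numbers it prints say nothing about real house prices; the results dictionary is just a hypothetical way to collect the scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, mostly linear data (illustrative only)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.3f}, R2={results[name]['r2']:.3f}")
```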


8. Model Tuning (Hyperparameter Optimization)

To further improve model performance, we can fine-tune the hyperparameters. For Random Forest, hyperparameters like n_estimators (number of trees) and max_depth (maximum depth of trees) can significantly impact performance.

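Here’s how to use GridSearchCV for hyperparameter optimization. It tries every combination in the grid with 5-fold cross-validation and, for a regressor, ranks them by R² unless a scoring argument is given: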

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
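
Once the search has finished, best_params_ and best_score_ show which combination won and how it scored. The sketch below uses synthetic data and a deliberately small grid so that it runs quickly; the choice of neg_mean_squared_error as the scoring metric is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Deliberately small grid so the sketch runs quickly
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
print(grid_search.best_params_)
print(-grid_search.best_score_)          # cross-validated MSE of the best setting
print(best_model.score(X_test, y_test))  # R² on the held-out test set
```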

9. Model Deployment

Once you’ve trained and tuned the model, the next step is deployment. You can use Flask to create a simple web application that serves predictions.

Here’s a basic Flask app to serve house price predictions:

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the trained model
model = joblib.load('best_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
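    data = request.get_json()
    # 'features' must be preprocessed (encoded, scaled, selected) like the training data
    prediction = model.predict([data['features']])
    # Convert the NumPy value to a plain float before serializing
    return jsonify({'predicted_price': float(prediction[0])})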

if __name__ == '__main__':
    app.run()

Before starting the app, save the trained model using joblib so that best_model.pkl exists:

import joblib
joblib.dump(best_model, 'best_model.pkl')

This way, you can make predictions by sending POST requests with a JSON body containing a "features" list to the /predict endpoint.
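
A client call could be sketched with Python's standard library as follows. It assumes the app is running locally on Flask's default port 5000; the helper name build_predict_request and the feature values are hypothetical, and the feature list must match the columns, order, and preprocessing the model was trained with.

```python
import json
import urllib.request


def build_predict_request(features, url="http://127.0.0.1:5000/predict"):
    """Build a POST request carrying the feature list as a JSON body."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature vector; it must match the columns the model was trained on
req = build_predict_request([3, 2, 1500, 0])

# With the Flask app running, the request could be sent like this:
# with urllib.request.urlopen(req) as response:
#     print(json.load(response))
```

The Content-Type header matters here, because Flask only parses the body as JSON when the request declares it as such.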


10. Conclusion

In this project, we explored the entire process of building a machine learning model using scikit-learn to predict house prices. From data preprocessing and feature selection to model training, evaluation, and deployment, each step was covered with practical code examples.

Whether you’re new to machine learning or looking to apply scikit-learn in real-world projects, this guide provides a comprehensive workflow that you can adapt for various regression tasks.

Feel free to experiment with different models, datasets, and techniques to enhance the performance and accuracy of your model.

#Regression #AI #DataAnalysis #DataPreprocessing #MLModel #RandomForest #LinearRegression #Flask #APIDevelopment #RealEstate #TechBlog #Tutorial #DataEngineering #DeepLearning #PredictiveAnalytics #DevCommunity
