In this article, we'll walk through the creation of a real-time property price prediction model focusing on Nairobi County. You can explore my model repository here.
Overview
The project is divided into the following six major parts, plus one optional component:
- Web Scraping: Extract house data from relevant websites.
- Data Cleaning: Clean and preprocess the gathered data.
- Exploratory Data Analysis (EDA): Analyze and visualize the data.
- Modeling: Build and train the predictive models.
- Deployment: Deploy the model using a web framework.
- Chatbot Creation: Develop a chatbot using OpenAI APIs to provide housing information in Kenya.
- Automation with Airflow (Optional): Automate processes using Apache Airflow.
1. Web Scraping
Web scraping involves using a bot or web crawler to extract data from third-party websites. It plays a crucial role in today’s digital landscape, enabling web developers to build impactful applications and data scientists to gather relevant data for modeling.
There are several methods for web scraping. One straightforward approach is using API keys provided by websites, such as the Twitter API. However, these API keys can sometimes be costly, as many are not free. Alternatively, Python libraries offer powerful tools for scraping, including BeautifulSoup, Selenium, and Scrapy. Here’s a brief overview of each:
a. BeautifulSoup: An HTML and XML parser ideal for extracting data from static web pages. It's a great starting point for beginners. For a detailed tutorial, check out this video.
b. Selenium: Best for handling user interactions and JavaScript-heavy websites, making it suitable for dynamic content. For more information, see this tutorial.
c. Scrapy: Designed for large-scale, concurrent data extraction with built-in features for requests, parsing, crawling, and organizing data. Learn more from this tutorial.
In this project, we used BeautifulSoup to extract data from two websites: buyrentkenya.com and propertypro.co.ke. Here’s how BeautifulSoup works:
i. Import the Requests Library:
The requests library allows us to send requests to websites. Here's an example:
import requests
url = 'https://www.buyrentkenya.com/houses-for-sale'
html_text = requests.get(url).text
print(html_text)
The output will be the HTML content of the webpage. A successful HTTP request (status code 200) will return the HTML contents. If the request fails, you will encounter an HTTP error response. For more information on HTTP status codes, refer to this article.
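If you want your script to fail loudly on a bad response rather than silently parsing an error page, you can check the status code explicitly. A minimal sketch using the same request:

import requests

url = 'https://www.buyrentkenya.com/houses-for-sale'
response = requests.get(url)

# raise_for_status() raises an HTTPError for 4xx/5xx responses
response.raise_for_status()
print(response.status_code)  # 200 on success
html_text = response.text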
ii. Import BeautifulSoup:
BeautifulSoup is a Python library that supports several parsers, such as html.parser, lxml, and html5lib. A parser reads and analyzes text to understand its structure and meaning, often converting it into a more usable format. In simple terms, a parser is like an interpreter bridging a language barrier, allowing Python to understand HTML.
Once you have parsed the page, store the resulting object in a variable called soup (a common convention) and use it to find or extract specific text by its HTML classes. To identify the content you're interested in (for example, a property price), right-click on it in the browser, select "Inspect," and note the relevant class attribute.
Here’s a code snippet for extracting prices from buyrentkenya.com:
from bs4 import BeautifulSoup
import requests

url = 'https://www.buyrentkenya.com/houses-for-sale'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')

# Each listing card on the page
properties = soup.find_all('div', class_='relative w-full overflow-hidden rounded-2xl bg-white')

prices = []
for prop in properties:
    # Locate the price element inside the listing card, guarding against missing tags
    price_container = prop.find('p', class_='text-xl font-bold leading-7 text-grey-900')
    price_tag = price_container.find('a', class_='no-underline') if price_container else None
    price_text = price_tag.text.strip() if price_tag else None
    prices.append(price_text)
    print(price_text)
This code retrieves prices from the first page of the website. To handle multiple pages, you will need to include pagination logic.
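As a rough illustration of that pagination logic, here is a sketch that assumes the listing pages are reachable through a ?page=N query parameter; verify the actual URL pattern in your browser before relying on it:

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.buyrentkenya.com/houses-for-sale'
all_prices = []

# Assumption: subsequent pages are served as ?page=2, ?page=3, ...
for page in range(1, 6):
    html_text = requests.get(base_url, params={'page': page}).text
    soup = BeautifulSoup(html_text, 'html.parser')
    cards = soup.find_all('div', class_='relative w-full overflow-hidden rounded-2xl bg-white')
    if not cards:  # stop when a page returns no listings
        break
    for card in cards:
        container = card.find('p', class_='text-xl font-bold leading-7 text-grey-900')
        price_tag = container.find('a', class_='no-underline') if container else None
        if price_tag:
            all_prices.append(price_tag.text.strip())

print(len(all_prices), 'prices scraped')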
iii. Save the Data to a CSV File:
Once the prices are collected, the pandas library lets you write them to a CSV file. Here's a standard way to do it:

import pandas as pd

df = pd.DataFrame(prices, columns=['price'])  # 'prices' is the list built during scraping
df.to_csv('filename.csv', index=False)
In the repository, under the data_collection subfolder, you will find a scraping_code folder containing the code used to extract data from the mentioned sites. There is also a practice script for experimenting with other sites.
My primary focus during the scraping process was on properties, including houses, apartments, and bedsitters, for both rental and sale listings.
2 & 3. Data Cleaning and Exploratory Data Analysis
Data cleaning and exploratory data analysis (EDA) are crucial stages in the data modeling process. After extracting data from the relevant websites, the first step is to consolidate it into a single Excel sheet. Preliminary cleaning, such as removing irrelevant rows, can be performed using Excel.
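If you prefer to consolidate programmatically instead of in Excel, a minimal pandas sketch (the folder and file names are placeholders for wherever your scraped CSVs live):

import glob
import pandas as pd

# Placeholder path: adjust to your scraped CSV location
csv_files = glob.glob('scraped_data/*.csv')

frames = [pd.read_csv(f) for f in csv_files]
combined = pd.concat(frames, ignore_index=True)

combined = combined.drop_duplicates()  # drop duplicate listings
combined.to_csv('combined_listings.csv', index=False)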
Once the data is consolidated and cleaned, the next step is to prepare it for modeling through Exploratory Data Analysis (EDA). Introduced by American mathematician John Tukey in the 1970s, EDA is a fundamental process for understanding and preparing data. There is no standardized approach to EDA; it varies depending on the analyst's preferences and the specific context of the data.
EDA is essential for preparing data for modeling, as it involves various tasks such as statistical analysis, data visualization, and feature engineering. To excel in this stage, you need strong skills in mathematics and statistics, data visualization, domain or market knowledge, and a curious mindset. Asking critical questions about the data is key to uncovering valuable insights.
Domain or market knowledge is particularly important for generating new features—a process known as feature engineering. Introducing new features or refining existing ones helps the model better understand the data, improving its performance. Features are essentially the columns in your dataset, such as location or number of bedrooms.
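For illustration, a small feature-engineering sketch; the file and column names (price, bedrooms) are assumptions about the scraped dataset, so adapt them to your own data:

import numpy as np
import pandas as pd

# Assumed file and columns: price (string), bedrooms (numeric)
df = pd.read_csv('combined_listings.csv')

# Turn price strings such as 'KSh 12,500,000' into numbers
df['price'] = (
    df['price']
    .astype(str)
    .str.replace('KSh', '', regex=False)
    .str.replace(',', '', regex=False)
    .str.strip()
)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Example engineered feature: price per bedroom
df['price_per_bedroom'] = df['price'] / df['bedrooms'].replace(0, np.nan)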
Data cleaning and EDA are often the most time-consuming parts of the modeling process. They require a deep understanding of both the general and statistical aspects of the data. This stage can take days or even weeks to thoroughly analyze and interpret. Its importance cannot be overstated; as the saying goes, "Garbage in, garbage out." Providing the model with poor-quality input will result in poor-quality predictions.
For a practical example, refer to the nairobi_house_price_prediction notebook in the cleaning_eda_modeling subfolder to see how EDA was conducted in this project.
4. Modeling
Modeling is the core of our data science process. Once the data has been thoroughly explored and features have been enhanced, the final dataset is handed over to the data scientist for advanced statistical exploration and modeling.
A data scientist typically possesses advanced statistical knowledge compared to the initial analyst. They use this expertise to extract deeper insights from the data and prepare it for modeling.
The next step is to determine the type of problem at hand. Modeling problems generally fall into two categories:
a. Classification Problems: In these problems, the goal is to predict a discrete class label. The output is a categorical label. For example, if a model is trained with images of boys and girls, it will assign probability scores to the "boy" and "girl" labels for a new image and classify it based on the label with the highest probability.
b. Regression Problems: These problems aim to predict a continuous (numerical) output variable based on one or more input features. For instance, predicting house prices based on various features like location and size is a regression problem.
Different algorithms are used for classification and regression problems, though some algorithms can be applied to both types. It is the data scientist's role to select the appropriate algorithms and train the model accordingly.
A special type of model known as an ensemble model combines two or more models or algorithms. Ensemble models often outperform individual models for many tasks. More information on ensemble models can be found here.
Before modeling, it is essential to pre-process the data. Most algorithms expect numerical input, so pre-processing transforms the data into a format they can train on, for example by encoding categorical features and scaling numerical ones. This pre-processor can be saved as a pickle (.pkl) file and reused during inference to prepare user input for prediction. Next, divide the data into training and testing sets (and sometimes validation sets) using the train_test_split function from scikit-learn:
from sklearn.model_selection import train_test_split
# Dividing into 70% train set and 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
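To make the pre-processing step concrete, here is a minimal sketch of a scikit-learn pre-processor that one-hot encodes categorical columns, scales numerical ones, and is saved as a pickle file. The column names are assumptions; the project's own pre-processor in the repository is the reference:

import pickle
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature columns
categorical_cols = ['location', 'property_type']
numerical_cols = ['bedrooms', 'bathrooms']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numerical_cols),
    ]
)

# Fit on the training set only, then apply to both splits
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Save the fitted pre-processor for use at inference time
with open('preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)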
With the data divided, proceed with modeling. In this project, three main algorithms were used: LinearRegression, RandomForestRegressor, and GradientBoostingRegressor. Selecting appropriate evaluation metrics is crucial for assessing model performance. For this regression task, Mean Squared Error (MSE), R-squared (R²), Cross-Validation Mean Score (CV-Mean), and Cross-Validation Standard Deviation (CV-Std Dev) were used, since classification metrics such as accuracy, precision, and recall do not apply.
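A sketch of how the three models might be trained and compared with these metrics (the notebook in the repository contains the exact code; this is a simplified version):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(random_state=42),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train_processed, y_train)
    preds = model.predict(X_test_processed)

    mse = mean_squared_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=5, scoring='r2')

    print(f'{name}: MSE={mse:.2f}, R2={r2:.3f}, '
          f'CV-Mean={cv_scores.mean():.3f}, CV-Std={cv_scores.std():.3f}')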
Hyperparameter tuning is another critical aspect for improving model performance. Grid Search was employed to tune the Random Forest and Gradient Boosting models, as Linear Regression has fewer hyperparameters to adjust. The ensemble of the two tuned models yielded better results. After training, the models were saved as pickle files (.pkl) containing the trained weights. The model weights can be found in the model_preprocessor_weights subfolder, which also includes the preprocessor. As noted, the ensemble model provided the most accurate results and should be used for inference.
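As a rough sketch of the tuning and ensembling step: the parameter grids below are illustrative assumptions, and a simple averaging ensemble (VotingRegressor) stands in for the project's own ensembling; the actual tuned weights live in the model_preprocessor_weights subfolder:

import pickle
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative (not the project's exact) parameter grids
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 10, 20]},
    cv=5, scoring='r2',
)
gb_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={'n_estimators': [100, 300], 'learning_rate': [0.05, 0.1]},
    cv=5, scoring='r2',
)
rf_search.fit(X_train_processed, y_train)
gb_search.fit(X_train_processed, y_train)

# Combine the two tuned models into a simple averaging ensemble
ensemble = VotingRegressor([('rf', rf_search.best_estimator_), ('gb', gb_search.best_estimator_)])
ensemble.fit(X_train_processed, y_train)

with open('ensemble_model.pkl', 'wb') as f:
    pickle.dump(ensemble, f)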
5. Deployment
Deploy your model using frameworks like FastAPI, Flask, or Streamlit. You can dockerize your application to enhance compatibility across different environments.
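As an illustration, a minimal FastAPI sketch that loads the saved pre-processor and model and serves predictions; the field names are assumptions, and the repository's deployment code is the reference:

import pickle
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the artifacts saved during training
with open('preprocessor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)
with open('ensemble_model.pkl', 'rb') as f:
    model = pickle.load(f)

class House(BaseModel):
    location: str
    property_type: str
    bedrooms: int
    bathrooms: int

@app.post('/predict')
def predict(house: House):
    features = pd.DataFrame([house.dict()])  # use .model_dump() on Pydantic v2
    processed = preprocessor.transform(features)
    price = model.predict(processed)[0]
    return {'predicted_price': float(price)}

# Run with: uvicorn main:app --reload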
Check out the inferencing_and_deployment subfolder in the repo for details on how I did my deployment.
6. Chatbot Creation Using OpenAI API
Incorporating a chatbot gives our model a contemporary edge. While traditional machine learning methods are widely accepted and used in the industry, the advent of modern Large Language Models (LLMs), beginning with OpenAI's GPT-1 in 2018 and popularized by ChatGPT, has significantly disrupted the data field.
Chatbots today often utilize Retrieval Augmented Generation (RAG) applications, which combine vector databases with LLMs like gpt-4-turbo to provide sophisticated responses to user queries. For more information on RAG applications, you can explore this LangChain repository and watch this video on the topic.
OpenAI offers billable API keys to access their models, which you can find here.
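A minimal sketch of calling the OpenAI chat completions API; the system prompt is illustrative, and the repository's chatbot builds retrieval on top of a call like this:

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model='gpt-4-turbo',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant for housing information in Kenya.'},
        {'role': 'user', 'content': 'What should I consider when buying an apartment in Nairobi?'},
    ],
)

print(response.choices[0].message.content)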
For implementation details, refer to the chatbot subfolder in the nairobi_house_price_prediction_model repository.
7. Automation with Airflow (Optional)
Automate your data processes using Apache Airflow. Other tools like Apache Kafka or Redpanda can also be considered for data streaming. This component is still in progress.
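Since this component is still in progress, the following is only a rough sketch of what an Airflow DAG for the pipeline could look like; the task functions and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_listings():
    ...  # placeholder: run the scraping scripts

def clean_and_retrain():
    ...  # placeholder: rerun cleaning and model training

with DAG(
    dag_id='nairobi_house_price_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule='@weekly',  # 'schedule_interval' on older Airflow 2.x versions
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id='scrape_listings', python_callable=scrape_listings)
    retrain = PythonOperator(task_id='clean_and_retrain', python_callable=clean_and_retrain)

    scrape >> retrain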
For a comprehensive view, visit the repository.
For inquiries, connect with me on LinkedIn or email me at kamaugilbert9@gmail.com.