Data sample
data = [
{ "study_time": 1, "salary": 350, "absences": 5, "city": "San Francisco" },
{ "study_time": 2, "salary": 1600, "absences": 4, "city": "London" },
{ "study_time": 3, "salary": 2450, "absences": 3, "city": "Paris" },
{ "study_time": 4, "salary": 5150, "absences": 5, "city": "San Francisco" },
{ "study_time": 5, "salary": 5800, "absences": 4, "city": "London" },
{ "study_time": 6, "salary": 6050, "absences": 3, "city": "Paris" }
]
What is the estimated salary for these values?
{ "study_time": 13, "salary": ???, "absences": 5, "city": "San Francisco" }
Results
Using polynomial regression on the six rows above, the predicted value at position 13 of this sequence would be: 24814
But the correct value was: 19550
Error: 5264
If I were to predict position 49 it would be: 182441
But the correct value was: 77150
Error: 105291
This was the "hidden algorithm" that produces the progression:
x = 0
absences_base = 50
salary_base = 1000
data = []
for i in range(50):
    if x == 0:
        x += 1
        data.append({
            "study_time": i + 1,
            "salary": (i * salary_base + (300 * 2 * (i + 1))) - (5 * absences_base),
            "absences": 5,
            "city": "San Francisco"
        })
    elif x == 1:
        x += 1
        data.append({
            "study_time": i + 1,
            "salary": (i * salary_base + (200 * 2 * (i + 1))) - (4 * absences_base),
            "absences": 4,
            "city": "London"
        })
    else:
        x = 0
        data.append({
            "study_time": i + 1,
            "salary": (i * salary_base + (100 * 2 * (i + 1))) - (3 * absences_base),
            "absences": 3,
            "city": "Paris"
        })
for entry in data:
    print(entry)
{'study_time': 1, 'salary': 350, 'absences': 5, 'city': 'San Francisco'}
{'study_time': 2, 'salary': 1600, 'absences': 4, 'city': 'London'}
{'study_time': 3, 'salary': 2450, 'absences': 3, 'city': 'Paris'}
{'study_time': 4, 'salary': 5150, 'absences': 5, 'city': 'San Francisco'}
{'study_time': 5, 'salary': 5800, 'absences': 4, 'city': 'London'}
{'study_time': 6, 'salary': 6050, 'absences': 3, 'city': 'Paris'}
{'study_time': 7, 'salary': 9950, 'absences': 5, 'city': 'San Francisco'}
{'study_time': 8, 'salary': 10000, 'absences': 4, 'city': 'London'}
{'study_time': 9, 'salary': 9650, 'absences': 3, 'city': 'Paris'}
{'study_time': 10, 'salary': 14750, 'absences': 5, 'city': 'San Francisco'}
{'study_time': 11, 'salary': 14200, 'absences': 4, 'city': 'London'}
{'study_time': 12, 'salary': 13250, 'absences': 3, 'city': 'Paris'}
{'study_time': 13, 'salary': 19550, 'absences': 5, 'city': 'San Francisco'}
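Expanding the three branches of the generator suggests that they collapse into one formula. The sketch below assumes, as the generator implies, that the per-city multiplier (300, 200, 100) always equals 100 * (absences - 2); the function names `generated_salary` and `closed_form_salary` are made up for this illustration.

```python
def generated_salary(i):
    """Salary and absences for loop index i, following the generator's three branches."""
    absences = (5, 4, 3)[i % 3]          # San Francisco, London, Paris
    multiplier = (300, 200, 100)[i % 3]
    salary = i * 1000 + multiplier * 2 * (i + 1) - absences * 50
    return salary, absences


def closed_form_salary(study_time, absences):
    """The same value written as a degree-2 polynomial in study_time and absences."""
    return 600 * study_time + 200 * absences * study_time - 50 * absences - 1000


# The two forms agree on all 50 generated rows.
for i in range(50):
    salary, absences = generated_salary(i)
    assert closed_form_salary(i + 1, absences) == salary

print(closed_form_salary(13, 5))  # 19550
print(closed_form_salary(49, 5))  # 77150
```

The only non-linear term is the product study_time * absences, so the target lies within what a degree-2 polynomial in these variables can represent.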
How to predict the exact value?
Polynomial regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables as a polynomial. However, in cases like this, where several variables are involved (study time, salary, absences and city), polynomial regression may not be sufficient to capture all patterns in the series.
The problem can be treated as a time series: we need to predict future values based on patterns observed in the past.
This problem could be solved with machine learning. It can also be essential to:
- Analyze all relationships between variables
- Test multiple hypotheses to discover what produces the progression
This may include:
- Exploratory analysis: use exploratory techniques to better understand the nature of the time series and identify possible patterns or relationships between variables.
- Statistical tests: check whether the relationships observed between the variables are statistically significant.
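As a small illustration of the exploratory step, the sketch below groups the first nine rows of the sequence by city and computes how much the salary changes per unit of study time inside each city. The variable names (`rows`, `by_city`, `slopes`) are chosen for this example only.

```python
from collections import defaultdict

# (study_time, salary, city) for the first nine rows of the sequence.
rows = [
    (1, 350, "San Francisco"), (2, 1600, "London"), (3, 2450, "Paris"),
    (4, 5150, "San Francisco"), (5, 5800, "London"), (6, 6050, "Paris"),
    (7, 9950, "San Francisco"), (8, 10000, "London"), (9, 9650, "Paris"),
]

# Group (study_time, salary) pairs by city.
by_city = defaultdict(list)
for study_time, salary, city in rows:
    by_city[city].append((study_time, salary))

# Salary change per unit of study_time between consecutive rows of the same city.
slopes = {
    city: [(s2 - s1) / (t2 - t1) for (t1, s1), (t2, s2) in zip(pts, pts[1:])]
    for city, pts in by_city.items()
}

for city, values in slopes.items():
    print(city, values)
# San Francisco [1600.0, 1600.0]
# London [1400.0, 1400.0]
# Paris [1200.0, 1200.0]
```

A constant slope inside each city, with a different slope per city, hints that the salary is linear in study time with a coefficient that depends on the city (or, equivalently here, on the absences).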
Another solution would be to create an algorithm that does this with the most basic hypotheses:
- Test the influence of "relational sums" on the progression: a+b->c, b+c->a, c+a->b, a+b+c->d, etc. (here "->" means "influences" or "produces")
- Test "relational subtractions", "relational divisions", "relational squares", etc.
This algorithm for testing "relational operations" would be a direct (or explicit) machine learning approach: it does not use advanced machine learning techniques, but implements rules and logical structures to learn the patterns of the time series.
Because it tests only basic hypotheses, its limitations would be:
- Overfitting: The algorithm may overspecialize on specific patterns in the trained dataset and not generalize well to new data.
- Limited scalability: If the dataset is very large or complex, the algorithm may not be able to test all possible hypotheses in real time.
While a machine learning model can:
- Learn complex patterns and generalize to new data, without needing them to be explicitly specified.
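The "relational operations" idea described above can be sketched in a few lines. This is only one possible reading of it, not a definitive implementation: each candidate operation between study time (`t`) and absences (`a`) is added as one extra column to a plain least-squares fit, and the candidate with the smallest error on the sample is kept. The candidate list and the `design` helper are assumptions made for this example.

```python
import numpy as np

# First six rows of the sample: study_time (t), absences (a), salary.
t = np.array([1, 2, 3, 4, 5, 6], dtype=float)
a = np.array([5, 4, 3, 5, 4, 3], dtype=float)
salary = np.array([350, 1600, 2450, 5150, 5800, 6050], dtype=float)

# Candidate "relational operations" combining the two inputs (illustrative list).
candidates = {
    "t + a": lambda t, a: t + a,
    "t - a": lambda t, a: t - a,
    "t * a": lambda t, a: t * a,
    "t / a": lambda t, a: t / a,
}


def design(t, a, op):
    """Columns: intercept, t, a, and one candidate term."""
    return np.column_stack([np.ones_like(t), t, a, op(t, a)])


results = {}
for name, op in candidates.items():
    coef, *_ = np.linalg.lstsq(design(t, a, op), salary, rcond=None)
    max_error = np.max(np.abs(design(t, a, op) @ coef - salary))
    results[name] = (coef, max_error)
    print(f"{name}: max error on the sample = {max_error:.4f}")

# Keep the hypothesis that reproduces the sample best and extrapolate with it.
best = min(results, key=lambda name: results[name][1])
coef = results[best][0]
new_t, new_a = np.array([13.0]), np.array([5.0])
prediction = design(new_t, new_a, candidates[best]) @ coef
print(best, round(float(prediction[0])))
```

In this sketch the product term reproduces the six sample rows and extrapolates to 19550, because the hidden generator happens to be linear in t, a and t * a. Real data would rarely offer a hypothesis with zero error, which is where the overfitting and scalability limits listed above come in.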
But what about the sample size?
Before looking for more complex solutions, it is best to make sure that a simpler solution has been adequately tested.
If we include just 3 more rows of the progression, we can predict the exact value using polynomial regression:
data = [
{ "study_time": 1, "salary": 350, "absences": 5, "city": "San Francisco" },
{ "study_time": 2, "salary": 1600, "absences": 4, "city": "London" },
{ "study_time": 3, "salary": 2450, "absences": 3, "city": "Paris" },
{ "study_time": 4, "salary": 5150, "absences": 5, "city": "San Francisco" },
{ "study_time": 5, "salary": 5800, "absences": 4, "city": "London" },
{ "study_time": 6, "salary": 6050, "absences": 3, "city": "Paris" },
{ "study_time": 7, "salary": 9950, "absences": 5, "city": "San Francisco" },
{ "study_time": 8, "salary": 10000, "absences": 4, "city": "London" },
{ "study_time": 9, "salary": 9650, "absences": 3, "city": "Paris" }
]
Now the predictions match the correct values:
- study_time = 13 => Predicted Salary: 19550
- study_time = 49 => Predicted Salary: 77150
So this problem can be solved with polynomial regression, as long as the data sample is sufficient.
It is interesting to note that the model only needs a sample of the data up to row 9 to make accurate predictions. This suggests that there is a regular pattern in the time series that can be captured with a limited amount of data. And there really was.
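One way to see why six rows are too few and nine are enough is to compare the rank of the degree-2 feature matrix with the number of rows. The sketch below builds the bias, linear and pairwise-product columns by hand with NumPy; the helpers `degree2_features` and `sample` are hypothetical names for this example, and the one-hot city encoding is an assumption about how the rows are fed to the model.

```python
from itertools import combinations_with_replacement

import numpy as np


def degree2_features(rows):
    """Bias, linear and pairwise-product columns of a degree-2 expansion."""
    rows = np.asarray(rows, dtype=float)
    n_inputs = rows.shape[1]
    columns = [np.ones(len(rows))] + [rows[:, j] for j in range(n_inputs)]
    for i, j in combinations_with_replacement(range(n_inputs), 2):
        columns.append(rows[:, i] * rows[:, j])
    return np.column_stack(columns)


def sample(n_rows):
    """First n_rows of the sequence: study_time, absences, one-hot city."""
    absences = (5, 4, 3)
    one_hot = ((1, 0, 0), (0, 1, 0), (0, 0, 1))  # San Francisco, London, Paris
    return [(i + 1, absences[i % 3], *one_hot[i % 3]) for i in range(n_rows)]


ranks = {}
for n_rows in (6, 9):
    ranks[n_rows] = int(np.linalg.matrix_rank(degree2_features(sample(n_rows))))
    print(f"{n_rows} rows -> rank {ranks[n_rows]}")
```

With six rows the rank equals the number of rows, so the model can pass through every sample point in more than one way and the leftover freedom (a study_time squared term) distorts the extrapolation. With nine rows the rank is lower than the number of rows, so the extra rows actually constrain the fit.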
Complete code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "study_time": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "absences": [5, 4, 3, 5, 4, 3, 5, 4, 3],
    # Dummy variables, one column per city (rows cycle San Francisco, London, Paris)
    "San Francisco": [1, 0, 0, 1, 0, 0, 1, 0, 0],
    "London": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "Paris": [0, 0, 1, 0, 0, 1, 0, 0, 1],
    "salary": [350, 1600, 2450, 5150, 5800, 6050, 9950, 10000, 9650]
})

# Independent and dependent variables
X = data[["study_time", "absences", "San Francisco", "London", "Paris"]]
y = data["salary"]

# Creating polynomial features of degree 2
characteristics_2 = PolynomialFeatures(degree=2)
x_pol_2 = characteristics_2.fit_transform(X)

# Fitting the linear regression model
model2 = LinearRegression()
model2.fit(x_pol_2, y)

# Fitted values on the training data (the model must be fitted first)
y_pol_2 = model2.predict(x_pol_2)

# New data provided for prediction: study_time 13 in San Francisco
new_data = pd.DataFrame({
    "study_time": [13],
    "absences": [5],
    "San Francisco": [1],
    "London": [0],
    "Paris": [0]
})

# Polynomial transformation of the new data
new_data_pol_2 = characteristics_2.transform(new_data)
predicted_salary = model2.predict(new_data_pol_2)
# round() instead of int(), so a result such as 19549.999... is not truncated
print("Predicted Salary:", round(predicted_salary[0]))

# Plot
plt.subplot(1, 1, 1)
plt.scatter(new_data["study_time"], predicted_salary, color='green', label='Predicted Salary')
plt.scatter(data["study_time"], y, color='blue', label='Real Salary')
plt.scatter(data["study_time"], y_pol_2, color='red', label='Polynomial Fit', marker='x')
plt.title("Polynomial Regression - Salary and Study Time")
plt.xlabel("Study Time")
plt.ylabel("Salary")
plt.legend()
plt.show()
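A more compact variant of the script, without the plot, can check both predictions quoted earlier (study_time 13 and 49) in one run. This is a sketch under one assumption that the script itself does not state: the three-city cycle (San Francisco, London, Paris) keeps repeating, so absences and city can be derived from the study time. The `features` helper is a made-up name for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

ABSENCES = (5, 4, 3)                          # San Francisco, London, Paris
ONE_HOT = ((1, 0, 0), (0, 1, 0), (0, 0, 1))   # one dummy column per city


def features(study_time):
    """Input row for a study_time, assuming the 3-city cycle keeps repeating."""
    k = (study_time - 1) % 3
    return [study_time, ABSENCES[k], *ONE_HOT[k]]


# Training sample: the first nine rows of the sequence.
X = np.array([features(t) for t in range(1, 10)], dtype=float)
y = np.array([350, 1600, 2450, 5150, 5800, 6050, 9950, 10000, 9650], dtype=float)

# Degree-2 polynomial expansion followed by ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

predictions = {}
for study_time in (13, 49):
    row = np.array([features(study_time)], dtype=float)
    predictions[study_time] = round(float(model.predict(row)[0]))
    print(f"study_time = {study_time} => Predicted Salary: {predictions[study_time]}")
```

Wrapping the two steps in a pipeline keeps the polynomial transformation and the regression together, so new rows cannot be passed to the model without being transformed first.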