INTRODUCTION
Descriptive analysis describes the basic features of a dataset and gives a short summary of the sample and its measures.
IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr
Functions Used In Exploratory Data Analysis
df=pd.read_csv('Assingment2/automobileEDA.csv')
df.head()
df.describe()
df.describe(include='all')
VALUE COUNTS
drive_wheels_counts=df['drive-wheels'].value_counts()
drive_wheels_counts
BOX PLOTS
A box plot visualizes the distribution of a numeric variable, showing its median, quartiles, and outliers, and makes it easy to compare distributions across categories.
sns.boxplot(x='drive-wheels',y='price',data=df)
df.info()
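To make the box plot easier to read, the quantities it draws can be computed directly. The following is a minimal sketch on a hypothetical toy table (the values are illustrative and not taken from automobileEDA.csv); it assumes the common convention that whiskers reach at most 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not from automobileEDA.csv
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'fwd', 'fwd', 'rwd', 'rwd', 'rwd', 'rwd'],
    'price': [6000, 7000, 8000, 9000, 12000, 16000, 20000, 45000],
})

# Quartiles per category: the box spans q1..q3, the line inside is the median
quartiles = toy.groupby('drive-wheels')['price'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ['q1', 'median', 'q3']

# Interquartile range and the upper limit beyond which points count as outliers
quartiles['iqr'] = quartiles['q3'] - quartiles['q1']
quartiles['upper_fence'] = quartiles['q3'] + 1.5 * quartiles['iqr']
print(quartiles)
```

In this toy table the rwd price of 45000 lies above its upper fence, so a box plot would draw it as a separate outlier point.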
SCATTER PLOTS
A scatter plot shows each observation as a point, which reveals the relationship between two numeric variables.
y=df['price']
x=df['engine-size']
plt.scatter(x,y)
plt.title('Scatter plot of Engine Size vs Price')
plt.xlabel('Engine Size')
plt.ylabel('Price')
Conclusion: As engine size increases, price also increases.
GROUP BY
Grouping is applied to a categorical variable: it splits the data into subsets, one for each category of that variable.
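As a small illustration of grouping, the sketch below averages a numeric column within each category. The table `toy` and its values are hypothetical, chosen only to keep the result easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not from automobileEDA.csv
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'rwd', 'fwd', 'rwd', '4wd'],
    'price': [8000, 20000, 10000, 24000, 9000],
})

# as_index=False keeps 'drive-wheels' as a regular column instead of the index
toy_grp = toy.groupby('drive-wheels', as_index=False)['price'].mean()
print(toy_grp)
```

Each output row is one category, and the price column holds the mean of the rows that fell into that category.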
GROUP BY DRIVEWHEELS
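df_1=df[['drive-wheels','price']]  # assumption: df_1 is not defined earlier, so these columns are selected here
df_1grp=df_1.groupby(['drive-wheels'],as_index=False).mean()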
df_1grp
GROUP BY BODY STYLE
df[['body-style']].value_counts()
SELECT COLUMNS FOR GROUPING
df_test=df[['drive-wheels','body-style','price']]
df_test.head()
GROUP BASED ON TWO VARIABLES AND FIND MEAN
df_grp=df_test.groupby(['drive-wheels','body-style'],as_index=False).mean()
df_grp
PIVOT TABLE
A pivot table reshapes grouped data into a readable grid, with one variable along the rows and the other along the columns.
df_pivot=df_grp.pivot(index='drive-wheels',columns='body-style')
df_pivot
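The reshaping can be seen on a small grouped table. This is a sketch with hypothetical values; it also shows that a combination missing from the grouped data appears as NaN in the pivoted grid.

```python
import pandas as pd

# Hypothetical grouped means; note there is no (rwd, hatchback) row
toy_grp = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd'],
    'body-style': ['hatchback', 'sedan', 'sedan'],
    'price': [8000.0, 9500.0, 21000.0],
})

# Rows become drive-wheels, columns become body-style, cells hold the price
toy_pivot = toy_grp.pivot(index='drive-wheels', columns='body-style', values='price')
print(toy_pivot)
```

Whether to leave such gaps as NaN or fill them (for example with `fillna(0)`) depends on how the table will be used afterwards.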
HEAT MAP PLOT
A heat map takes a rectangular grid of data and assigns a color intensity based on the data value at each grid point.
plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()
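A heat map drawn with pcolor has numeric tick positions by default, so the categories are not visible on the axes. The sketch below, on a hypothetical 2x2 grid, shows one way to label the cells. The `Agg` backend line is an assumption made so the example runs without a display; in a notebook it can be dropped and `plt.show()` called instead.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical pivoted means; the numbers are illustrative only
toy_pivot = pd.DataFrame(
    [[8000.0, 9500.0], [14000.0, 21000.0]],
    index=['fwd', 'rwd'],
    columns=['hatchback', 'sedan'],
)

fig, ax = plt.subplots()
mesh = ax.pcolor(toy_pivot.values, cmap='RdBu')  # one colored cell per grid value

# Place the ticks at the cell centers and label them with the categories
ax.set_xticks(np.arange(toy_pivot.shape[1]) + 0.5)
ax.set_xticklabels(toy_pivot.columns)
ax.set_yticks(np.arange(toy_pivot.shape[0]) + 0.5)
ax.set_yticklabels(toy_pivot.index)

fig.colorbar(mesh, ax=ax)  # color scale showing which color maps to which value
```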
CORRELATION
Correlation is a statistical metric that measures the extent to which different variables are interdependent.
sns.regplot(x='engine-size',y='price',data=df)
plt.ylim(0,)
WEAK CORRELATION
sns.regplot(x='highway-mpg',y='price',data=df)
plt.ylim(0,)
CORRELATION STATISTICS
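Correlation is assessed with two statistics:
1) Correlation coefficient: it ranges from -1 to +1, where values close to +1 indicate a strong positive relationship, values close to -1 a strong negative relationship, and values close to 0 no linear relationship.
2) P-value: it tells us how certain we are about the calculated correlation; a small p-value (conventionally below 0.05) indicates that the correlation is statistically significant.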
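The two statistics above can be illustrated on a few hypothetical points. The sketch computes the Pearson coefficient from its definition (covariance of the two variables divided by the product of their spreads) and compares it with the value returned by `scipy.stats.pearsonr`, which also returns the p-value. The arrays `x` and `y` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data in which y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson coefficient from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same coefficient from SciPy, together with the p-value
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```

The coefficient is close to +1 because the points lie almost on a straight rising line, and the p-value is small for the same reason.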
REMOVE ROWS WITH NAN OR INFINITE VALUES
df_cleaned=df[['horsepower','price']].dropna()
df_cleaned=df_cleaned[np.isfinite(df_cleaned).all(axis=1)]
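The two cleaning steps remove different rows, which is easier to see on a small table. In this sketch the table `toy` is hypothetical: `dropna` removes rows containing NaN, while the `np.isfinite` filter is what removes the row containing an infinite value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two NaN entries and one infinite entry
toy = pd.DataFrame({
    'horsepower': [111.0, np.nan, 154.0, np.inf, 102.0],
    'price': [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
})

# Step 1: drop rows with any NaN (rows 1 and 2); the inf row survives this step
no_nan = toy.dropna()

# Step 2: keep only rows where every value is finite (drops row 3)
finite_only = no_nan[np.isfinite(no_nan).all(axis=1)]
print(finite_only)
```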
CALCULATE PEARSON CORRELATION
pearson_coef,p_value=stats.pearsonr(df_cleaned['horsepower'],df_cleaned['price'])
pearson_coef,p_value
CORRELATION MATRIX
df_numeric=df.select_dtypes(include=[float,int])
corr_matrix=df_numeric.corr()
corr_matrix
VISUALIZE HEATMAP
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm',fmt='.2f',linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
CONCLUSION
As engine size increases, price also increases.