EXPLORATORY DATA ANALYSIS TECHNIQUES

Evans Jones - Aug 9 - Dev Community

INTRODUCTION
Descriptive analysis describes the basic features of a dataset and gives a short summary of the sample and its measures.

IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr

Functions Used In Exploratory Data Analysis

df=pd.read_csv('Assingment2/automobileEDA.csv')
df.head()
df.describe()
df.describe(include='all')

VALUE COUNTS
drive_wheels_counts=df['drive-wheels'].value_counts()
drive_wheels_counts

BOX PLOTS
A box plot visualizes the distribution of numeric data: the median, the quartiles, and any outliers, here shown separately for each category.

sns.boxplot(x='drive-wheels',y='price',data=df)
df.info()
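
To see which numbers a box plot is built from, here is a minimal sketch that computes them by hand. The toy DataFrame and its values are hypothetical and are not taken from automobileEDA.csv; the 1.5 * IQR fence is the conventional outlier rule, which is also seaborn's default whisker setting.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'drive-wheels' and 'price' columns.
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'fwd', 'fwd', 'rwd', 'rwd', 'rwd', 'rwd'],
    'price': [6000, 7000, 8000, 9000, 12000, 15000, 18000, 40000],
})

# Quartiles per group: the edges of each box and the line in its middle.
quartiles = toy.groupby('drive-wheels')['price'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ['q1', 'median', 'q3']

# Interquartile range and the upper fence used to flag outliers.
quartiles['iqr'] = quartiles['q3'] - quartiles['q1']
quartiles['upper_fence'] = quartiles['q3'] + 1.5 * quartiles['iqr']
print(quartiles)
```

In this toy data the rwd price of 40000 lies above the rwd upper fence, so a box plot would draw it as a separate outlier point.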

SCATTER PLOTS
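A scatter plot draws each observation as a point, with one numeric variable on the x-axis and another on the y-axis, so the relationship between the two becomes visible.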

y=df['price']
x=df['engine-size']
plt.scatter(x,y)
plt.title('Scatter plot of Engine Size vs Price')
plt.xlabel('Engine Size')
plt.ylabel('Price')
Conclusion: As engine size increases, price tends to increase as well.

GROUP BY
Grouping is applied to a categorical variable: it splits the data into subsets, one for each category of that variable.
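
As a rough illustration of what grouping does, the sketch below splits a small table by category manually and compares the result with groupby. The toy values are hypothetical and only stand in for the real columns.

```python
import pandas as pd

# Hypothetical toy data, not the automobile dataset.
toy = pd.DataFrame({
    'drive-wheels': ['fwd', 'rwd', 'fwd', 'rwd', '4wd'],
    'price': [7000, 20000, 9000, 24000, 10000],
})

# Manual version: take the rows of each category, then average that subset.
manual = {}
for category in toy['drive-wheels'].unique():
    subset = toy[toy['drive-wheels'] == category]
    manual[category] = subset['price'].mean()

# groupby performs the same split and aggregation in one call.
grouped = toy.groupby('drive-wheels')['price'].mean()

print(manual)
print(grouped)
```

Both approaches give the same mean price per category; groupby is simply the shorter way to write it.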

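GROUP BY DRIVE WHEELS
# keep only the grouping column and the numeric column to average
df_1=df[['drive-wheels','price']]
df_1grp=df_1.groupby(['drive-wheels'],as_index=False).mean()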
df_1grp

GROUP BY BODY STYLE
df[['body-style']].value_counts()

SELECT THE COLUMNS TO GROUP
df_test=df[['drive-wheels','body-style','price']]
df_test.head()

GROUP BASED ON TWO VARIABLES AND FIND MEAN
df_grp=df_test.groupby(['drive-wheels','body-style'],as_index=False).mean()
df_grp

PIVOT TABLE
A pivot table puts one variable along the rows and the other along the columns, which makes the grouped data easier to read.
df_pivot=df_grp.pivot(index='drive-wheels',columns='body-style')
df_pivot
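
The reshaping is easier to follow on a small table. This is a sketch on hypothetical grouped means shaped like df_grp; unlike the call above, it passes values='price' so that the result has plain column names.

```python
import pandas as pd

# Hypothetical grouped means: one row per (drive-wheels, body-style) combination.
toy_grp = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd'],
    'body-style': ['hatchback', 'sedan', 'sedan'],
    'price': [8000.0, 9500.0, 21000.0],
})

# Rows become drive-wheels, columns become body-style, cells hold the price.
toy_pivot = toy_grp.pivot(index='drive-wheels', columns='body-style', values='price')
print(toy_pivot)

# Count the empty cells: combinations that never occur in the data.
print(toy_pivot.isna().sum().sum())
```

The (rwd, hatchback) combination does not occur in the toy data, so that cell is NaN. The same can happen in the real pivot table when a drive-wheels and body-style pair has no cars.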

HEAT MAP PLOT
A heat map takes a rectangular grid of data and assigns a color intensity to each cell based on the data value at that grid point.

plt.pcolor(df_pivot,cmap='RdBu')
plt.colorbar()
plt.show()
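
To make "color intensity based on the data value" concrete, the following sketch maps a small grid to colors without drawing anything. The grid values are hypothetical; by default plt.pcolor scales the data to the colormap in a similar way.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Hypothetical 2x2 grid of mean prices.
grid = np.array([[8000.0, 9500.0],
                 [21000.0, 30000.0]])

# Step 1: rescale the values so the smallest is 0 and the largest is 1.
norm = Normalize(vmin=grid.min(), vmax=grid.max())
scaled = norm(grid)

# Step 2: look up each scaled value in the colormap to get an RGBA color.
cmap = plt.get_cmap('RdBu')
colors = cmap(scaled)

print(scaled)
print(colors.shape)  # one RGBA color per grid cell
```

With the RdBu colormap the lowest value lands at the red end and the highest at the blue end.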

CORRELATION
Correlation is a statistical metric that measures to what extent different variables are interdependent.

sns.regplot(x='engine-size',y='price',data=df)
plt.ylim(0,)
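
The strength of such a linear relationship is commonly measured with the Pearson correlation coefficient. As a sketch, it can be computed by hand and checked against NumPy; the x and y values below are hypothetical and only stand in for engine size and price.

```python
import numpy as np

# Hypothetical values standing in for engine size (x) and price (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from its mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: co-variation of x and y divided by the product of their spreads.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in correlation matrix gives the same number off the diagonal.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both values come out at about 0.77, a fairly strong positive relationship for this toy data.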

WEAK CORRELATION
sns.regplot(x='highway-mpg',y='price',data=df)
plt.ylim(0,)

CORRELATION STATISTICS
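Correlation is usually summarized with two statistics:
1) Correlation coefficient: a value between -1 and +1. Close to +1 means a strong positive relationship, close to -1 a strong negative relationship, and close to 0 no linear relationship.
2) P-value: it tells us how certain we are about the calculated correlation. The smaller the p-value, the stronger the evidence that the correlation is not due to chance.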
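
The two statistics can be compared on a strong and a weak relationship. This is a minimal sketch on hypothetical toy data, using scipy.stats.pearsonr, which returns the coefficient and the p-value together.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: x runs from 1 to 10.
x = np.arange(1, 11)

# y_strong follows 3*x with a small alternating disturbance.
y_strong = 3 * x + np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])

# y_weak has no clear upward or downward trend in x.
y_weak = np.array([5, 1, 4, 2, 6, 3, 5, 2, 4, 3])

r_strong, p_strong = pearsonr(x, y_strong)
r_weak, p_weak = pearsonr(x, y_weak)

print('strong:', r_strong, p_strong)
print('weak:  ', r_weak, p_weak)
```

The first pair gives a coefficient close to +1 with a very small p-value, while the second gives a coefficient close to 0 with a large p-value. Which p-value counts as "small enough" is a convention; 0.05 is a common choice.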

REMOVE ROWS WITH NAN OR INFINITE VALUES
df_cleaned=df[['horsepower','price']].dropna()
df_cleaned=df_cleaned[np.isfinite(df_cleaned).all(1)]
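
To see what each of the two cleaning steps removes, here is a small sketch on a hypothetical four-row table with one NaN in each column and one infinite value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing and infinite entries.
toy = pd.DataFrame({
    'horsepower': [111.0, np.nan, 154.0, np.inf],
    'price': [13495.0, 16500.0, np.nan, 17450.0],
})

# Step 1: dropna removes every row that contains a NaN (rows 1 and 2).
no_nan = toy.dropna()

# Step 2: keep only rows where all values are finite (drops row 3, which has inf).
finite = no_nan[np.isfinite(no_nan).all(axis=1)]

print(no_nan)
print(finite)
```

Only the first row survives both steps. dropna alone is not enough here, because an infinite value is not treated as missing.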

CALCULATE PEARSON CORRELATION
pearson_coef,p_value=stats.pearsonr(df_cleaned['horsepower'],df_cleaned['price'])
pearson_coef,p_value

CORRELATION MATRIX
df_numeric=df.select_dtypes(include=[float,int])
corr_matrix=df_numeric.corr()
corr_matrix

VISUALIZE HEATMAP
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm',fmt='.2f',linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

CONCLUSION
As engine size increases, price tends to increase as well.
