Understanding Your Data: The Essentials of Exploratory Data Analysis

Lameck Odhiambo - Aug 11 - - Dev Community

Exploratory Data Analysis (EDA) is a data analytics process that aims to understand a dataset in depth and learn its characteristics, often through visual means. It gives the analyst a better feel for the data and helps surface useful patterns.

Types of Exploratory Data Analysis

  1. Univariate Analysis: Focuses on analyzing a single variable at a time. It helps to understand the variable's distribution, central tendency, and spread.

Techniques
• Descriptive statistics (mean, median, mode, variance, standard deviation)
• Visualizations (histograms, box plots, bar charts, pie charts)
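
As a minimal sketch of these techniques, the snippet below computes the descriptive statistics for one numeric column and draws its histogram. The DataFrame and the column name `age` are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({"age": [22, 25, 25, 31, 38, 41, 45]})

# Central tendency and spread of a single variable.
summary = {
    "mean": df["age"].mean(),
    "median": df["age"].median(),
    "mode": df["age"].mode().tolist(),  # a column can have several modes
    "variance": df["age"].var(),        # sample variance (ddof=1)
    "std": df["age"].std(),             # sample standard deviation (ddof=1)
}
print(summary)

# Distribution of the variable.
df["age"].plot(kind="hist", bins=5, title="Distribution of age")
plt.show()
```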

  2. Bivariate Analysis: Examines the relationship between two variables. It helps to understand how one variable affects or is associated with another.

Techniques
• Correlation coefficient
• Visualizations (scatter plots, line plots, etc.)
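
A small sketch of both techniques, assuming two numeric columns. The names `hours_studied` and `exam_score` and their values are invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 70, 74],
})

# Pearson correlation coefficient between the two variables (-1 to 1).
r = df["hours_studied"].corr(df["exam_score"])
print(f"Pearson r = {r:.3f}")

# Scatter plot to inspect the relationship visually.
df.plot(kind="scatter", x="hours_studied", y="exam_score")
plt.show()
```

A coefficient close to 1 or -1 suggests a strong linear association, but it does not show that one variable causes the other.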

Steps involved in Exploratory Data Analysis
1. Understand the Data
Familiarize yourself with the dataset, understand the domain, and identify the objectives of the analysis.

2. Data Collection
Collect the required data from various sources such as databases, web scraping or APIs.
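
As one possible illustration of collecting data from a database, the sketch below loads a query result into a DataFrame. It uses an in-memory SQLite database so that it runs anywhere; the `sales` table and its columns are hypothetical.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 95.5), ("north", 80.0)],
)
conn.commit()

# Load the query result straight into a DataFrame.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
conn.close()

print(df.shape)
```

For flat files, `pd.read_csv` plays the same role.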

3. Data Cleaning
Handle missing values: impute or remove missing data. First, count the missing values in each column:

df.isnull().sum()

To drop the rows that contain missing values:

df_cleaned = df.dropna()
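
Dropping rows is only one option. The sketch below shows the imputation route mentioned above, under the assumption that a median fill suits the numeric column and a most-frequent fill suits the categorical one. The columns `income` and `city` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 61.0],
    "city": ["Nairobi", "Kisumu", None, "Nairobi"],
})

df_imputed = df.copy()
# Numeric column: fill with the median, which is robust to outliers.
df_imputed["income"] = df_imputed["income"].fillna(df_imputed["income"].median())
# Categorical column: fill with the most frequent value.
df_imputed["city"] = df_imputed["city"].fillna(df_imputed["city"].mode()[0])

print(df_imputed.isnull().sum())
```

Which strategy is appropriate depends on why the values are missing, so treat these fills as a starting point.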

Remove duplicates: ensure there are no duplicate records.
To count the duplicate rows:

df.duplicated().sum()

To drop the duplicate rows:

df_cleaned = df.drop_duplicates()

4. Data Transformations
Normalize or standardize the data if necessary.
Create new features through feature engineering.
Aggregate or disaggregate data based on analysis needs.
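
The three transformations above can be sketched with plain pandas as follows. The columns `region`, `price`, and `quantity` are hypothetical, and min-max scaling and z-scores are just two common choices for normalizing and standardizing.

```python
import pandas as pd

# Hypothetical data; replace with your own DataFrame.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [3, 1, 2, 5],
})

# Min-max normalization: rescale to the [0, 1] range.
price_range = df["price"].max() - df["price"].min()
df["price_norm"] = (df["price"] - df["price"].min()) / price_range

# Standardization (z-score): zero mean, unit sample standard deviation.
df["price_std"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Feature engineering: derive a new column from existing ones.
df["revenue"] = df["price"] * df["quantity"]

# Aggregation: total revenue per region.
revenue_by_region = df.groupby("region")["revenue"].sum()
print(revenue_by_region)
```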

5. Data Integration
Integrate data from various sources to create a complete data set.
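
A minimal sketch of integrating two sources with a join, assuming they share a key column. The tables and the `customer_id` key are invented for illustration.

```python
import pandas as pd

# Two hypothetical sources that share a customer_id key.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Amina", "Brian", "Chen"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [25.0, 40.0, 15.0, 60.0],
})

# Left join keeps every order; indicator=True adds a _merge column
# showing whether a matching customer was found.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
print(merged)

# Orders whose customer_id is missing from the customers table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

Checking the unmatched rows after a merge is a cheap way to catch key mismatches between sources before they distort later analysis.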

6. Data Exploration
Perform univariate and bivariate analysis using histograms, box plots, line plots, etc.

7. Data Visualization
Visualize data distribution and relationships using visual tools such as bar charts, line charts, scatter plots, heat maps, and box plots.
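
As an example of one of these tools, the sketch below draws a heat map of pairwise correlations using only pandas and Matplotlib. The data is randomly generated with a fixed seed, and the column names `x`, `y`, and `z` are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric data generated with a fixed seed.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),  # strongly related to x
    "z": rng.normal(size=100),                     # unrelated noise
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation heat map")
plt.show()
```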

8. Descriptive Statistics
Calculate central tendency measures (mean, median, mode) and dispersion measures (range, variance, standard deviation).

df.describe()
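
For numeric columns, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum, but not the mode, range, or variance. A short sketch of computing those separately, using a made-up `score` column:

```python
import pandas as pd

# Hypothetical numeric column; replace with your own data.
scores = pd.Series([4, 8, 8, 10, 15], name="score")

extra_stats = {
    "mode": scores.mode().tolist(),         # a column can have several modes
    "range": scores.max() - scores.min(),
    "variance": scores.var(),               # sample variance (ddof=1)
}
print(extra_stats)
```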

9. Identify Patterns and Outliers
Detect patterns, trends and outliers in data using visualizations and statistical methods.
For example, using a box plot:

import matplotlib.pyplot as plt

# Drop missing values first; boxplot cannot summarize a column that contains NaN
plt.boxplot(df['column_name'].dropna())
plt.show()
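
On the statistical side, a common companion to the box plot is the 1.5 × IQR rule, which is also what Matplotlib uses for the whiskers by default. The sketch below applies it to a made-up series; the threshold of 1.5 is a convention, not a requirement.

```python
import pandas as pd

# Hypothetical column with one obviously extreme value.
values = pd.Series([12, 14, 15, 15, 16, 18, 19, 95], name="column_name")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

A flagged point is only a candidate; whether to keep, correct, or remove it depends on the domain.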


10. Documentation and Reporting
Document the EDA process, findings, and insights in a clear, structured way.
Create reports and presentations to convey results to stakeholders.

Exploratory Data Analysis Tools

With the following tools for exploratory data analysis, data scientists can gain deeper insights and prepare data for advanced analytics and modelling.

  1. Python Libraries
    • Pandas: Provides the data structures and functions needed to manipulate structured data, and is commonly used for summary statistics.
    • Matplotlib: A plotting library that produces static, animated and interactive visualizations.
    • Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
    • SciPy: Builds on NumPy and provides many higher-level scientific algorithms.

  2. R Libraries
    • ggplot2: A framework for creating graphics using principles of the grammar of graphics.
    • dplyr: A set of tools for data manipulation, offering consistent verbs for common data manipulation tasks.
    • tidyr: Provides functions that help you organize data in a tidy format.
