Understanding Your Data: The Essentials of EDA

William Metobo - Aug 11 - Dev Community

The first step in data analytics and data science is understanding your data. What is data? Data simply means facts and figures: statistics, particulars, or any recorded details. Before you start an analysis project, you must understand the facts or details you are dealing with. Understanding your data includes the following:
Knowing the source of your data
Understanding the source of your data is fundamental to assessing its reliability and relevance. Data can be broadly categorized into two types: primary and secondary. Primary data is data that you collect yourself, and secondary data is data that was collected externally (by other people).
Understanding the nature of your data
Data is either quantitative or qualitative. Quantitative data is numerical; qualitative data is descriptive and is typically stored as words or text strings.
Understanding each field/column of your data
The key to unlocking the full potential of your data lies in understanding each field/column in detail and cleaning the data carefully. How you clean the data depends on how well you understand it.
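
As a rough illustration of the points above, the sketch below labels each column of a small data set as quantitative or qualitative. The sample records and the helper names `column_kind` and `profile` are made up for this example, and the rule used (a column is quantitative when every non-missing value is a number) is a simplifying assumption, not a complete definition.

```python
# Hypothetical sample records; in practice these would come from your own source.
rows = [
    {"age": 34, "city": "Nairobi", "income": 52000.0},
    {"age": 29, "city": "Kisumu", "income": 41000.0},
    {"age": 41, "city": "Nairobi", "income": 67000.0},
]


def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def column_kind(values):
    """Label a column quantitative if every non-missing value is a number."""
    present = [v for v in values if v is not None]
    if present and all(is_number(v) for v in present):
        return "quantitative"
    return "qualitative"


def profile(rows):
    """Return {column name: kind} for a list of dict records."""
    columns = rows[0].keys()
    return {col: column_kind([r.get(col) for r in rows]) for col in columns}


kinds = profile(rows)
for col, kind in kinds.items():
    print(f"{col}: {kind}")
```

A profile like this is only a starting point: a numeric column such as a postal code may still be qualitative in meaning, which is why you need to understand each field before cleaning it.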

The Essentials Of Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a process used by data scientists and analysts to analyze and investigate data sets and summarize their main characteristics, often using data visualization methods. It is used to reveal insights beyond the formal modelling of data and provides a better understanding of the variables and the relationships between them.
EDA tools
Tools used include:
Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. In EDA, Python can be used to identify missing values in a data set, which is important because you must decide how to handle missing values before applying machine learning.
R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used by statisticians and data scientists for statistical analysis and data analysis.
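
To make the missing-values point concrete, here is a minimal sketch in plain Python that counts missing entries per column. The records, the helper name `count_missing`, and the convention that `None` or an empty string means "missing" are all assumptions for this example; real data sets often encode missing values differently.

```python
# Hypothetical records; None marks a missing value in this example.
records = [
    {"name": "A", "score": 71, "grade": "B"},
    {"name": "B", "score": None, "grade": "C"},
    {"name": "C", "score": 88, "grade": None},
    {"name": "D", "score": None, "grade": "A"},
]


def count_missing(records):
    """Count None or empty-string values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts.setdefault(column, 0)
            if value is None or value == "":
                counts[column] += 1
    return counts


missing = count_missing(records)
for column, n in missing.items():
    print(f"{column}: {n} missing out of {len(records)}")
```

Knowing how many values are missing in each column helps you decide whether to drop rows, fill in values, or leave the column out.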

Types of EDA

  1. Univariate non-graphical: This is the simplest form of data analysis, because it examines just one variable. The usual goal of univariate non-graphical EDA is to understand the underlying distribution of the sample data and make observations about the population. Outlier detection is also part of this analysis.

  2. Multivariate non-graphical: These techniques show the relationship between two or more variables, in the form of either cross-tabulation or statistics.

  3. Univariate graphical: Non-graphical methods are quantitative and objective, but they cannot give a complete picture of the data. Graphical methods are therefore also required, even though they involve a degree of subjective analysis.

  4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable.
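
The four types listed above can be sketched on a tiny example using only Python's standard library. The data set is invented for illustration, the 1.5 x IQR rule is one common convention for flagging outliers (not the only one), and the text bars stand in for the charts a plotting library would normally draw.

```python
import statistics
from collections import Counter

# Hypothetical data set: one quantitative and two qualitative variables.
data = [
    {"hours": 2, "gender": "F", "passed": "yes"},
    {"hours": 3, "gender": "M", "passed": "no"},
    {"hours": 3, "gender": "F", "passed": "yes"},
    {"hours": 4, "gender": "M", "passed": "yes"},
    {"hours": 5, "gender": "F", "passed": "yes"},
    {"hours": 4, "gender": "M", "passed": "no"},
    {"hours": 20, "gender": "F", "passed": "yes"},  # deliberately extreme value
]
hours = [row["hours"] for row in data]

# 1. Univariate non-graphical: summary statistics and IQR-based outliers.
q1, _, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1
outliers = [h for h in hours if h < q1 - 1.5 * iqr or h > q3 + 1.5 * iqr]
print("mean:", round(statistics.mean(hours), 2))
print("median:", statistics.median(hours))
print("outliers:", outliers)

# 2. Multivariate non-graphical: cross-tabulation of two variables.
crosstab = Counter((row["gender"], row["passed"]) for row in data)
for (gender, outcome), count in sorted(crosstab.items()):
    print(f"gender={gender} passed={outcome}: {count}")

# 3. Univariate graphical: a text histogram of one variable.
histogram = Counter(hours)
for value in sorted(histogram):
    print(f"{value:>3} | {'#' * histogram[value]}")

# 4. Multivariate graphical: grouped bars, one group per gender and
#    one bar per level of the "passed" variable within each group.
for gender in sorted({g for g, _ in crosstab}):
    for outcome in ("yes", "no"):
        print(f"{gender} {outcome:<3} | {'#' * crosstab[(gender, outcome)]}")
```

Even on seven rows, the summary shows how a single extreme value pulls the mean away from the median, and the grouped bars make the difference between groups easier to see than the raw counts do.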

Importance of EDA
EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is crucial for informing decisions by revealing patterns, not by confirming or rejecting assumptions. It is the initial examination of the data and should happen before any assumptions or conclusions are made, to avoid faulty analysis.
