Descriptive Statistics || Part 1 (Central Tendency and Dispersion)

Neha Gupta - May 23 - - Dev Community

Hey reader👋. Hope you are doing well.
In the last post we covered some of the basic terms used in statistics. In this post we are going to talk about measures of central tendency and measures of dispersion.
So let's get started🔥.

Measure of Central Tendency

What is Central Tendency?

Central tendency refers to the measures used to determine the "center of the distribution of data": a single central value that represents a typical observation in our dataset.
For example, suppose we want to summarize the ages of people residing in a colony. We collect the data and find that the average age of the people in the colony is 45. This number, 45, is a measure of central tendency.

1. Mean or Average

The mean (or average) is the sum of all the values in a dataset divided by the number of values.

Example: X = {1, 2, 2, 3, 4, 5} => Sample mean = (1+2+2+3+4+5)/6 = 17/6 ≈ 2.83
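To double-check the example above, here is a minimal Python sketch. It uses the standard library's statistics module; the variable names are just my own choices for illustration.

```python
from statistics import mean

X = [1, 2, 2, 3, 4, 5]

# Mean by definition: sum of the values divided by how many there are
manual_mean = sum(X) / len(X)

# The standard library computes the same quantity
library_mean = mean(X)

print(round(manual_mean, 2))   # 2.83
print(round(library_mean, 2))  # 2.83
```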

2. Median

This is the middle value of a data set when it is ordered from least to greatest. If there is an even number of data points, the median is the average of the two middle numbers. The median is less affected by outliers and skewed data.
To find the median, follow these steps:

  • Sort the data

  • Find the central element: for an odd number of data points the median is the central value itself, and for an even number it is the average of the two central values.
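The two steps above can be sketched in a few lines of Python. The function name find_median is a hypothetical helper I made up for this post, not a library function (the standard library offers statistics.median for real use).

```python
def find_median(values):
    """Return the median of a non-empty list of numbers (illustrative helper)."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)      # Step 1: sort the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: the central value itself
        return ordered[mid]
    # even count: average of the two central values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(find_median([5, 1, 3, 2, 2, 4]))       # 2.5
print(find_median([1, 2, 2, 3, 4, 5, 100]))  # 3
```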

Why is the median more robust to outliers than the mean?

Consider a dataset: X = {1, 2, 2, 3, 4, 5, 100}
Observing this dataset, we can clearly see that 100 is an outlier.

An outlier is a value that is very different from the rest of the values in the dataset.
Let's find the mean and median of this dataset.
Mean = 16.71, Median = 3
Without the outlier, X = {1, 2, 2, 3, 4, 5} has mean 2.83 and median 2.5. So adding a single outlier pulled the mean from 2.83 to 16.71, whereas the median barely moved (2.5 to 3). This is why the median is a more reliable measure of the center than the mean when outliers are present. A large gap between the mean and the median is also a hint that the data may contain outliers or be skewed.
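Here is a small sketch that reproduces this comparison with Python's statistics module. The list names are my own; the numbers are the ones from the example.

```python
from statistics import mean, median

without_outlier = [1, 2, 2, 3, 4, 5]
with_outlier = without_outlier + [100]

for label, data in [("without 100", without_outlier), ("with 100", with_outlier)]:
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data)}")

# How far each measure moved when the outlier was added
mean_shift = mean(with_outlier) - mean(without_outlier)
median_shift = median(with_outlier) - median(without_outlier)
print(f"mean moved by {mean_shift:.2f}, median moved by {median_shift}")
```

The mean moves by almost 14 units while the median moves by only 0.5.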

3. Mode

The mode is the value that appears most frequently in a data set. A data set may have one mode, more than one mode (bimodal or multimodal), or no mode at all if no number repeats.

Example: X = {1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5}
Counting the frequency of each number in the dataset, we find that 5 occurs most often (4 times), so the mode is 5.
The mode is especially helpful for categorical data, where a mean cannot be computed.
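The frequency count can be sketched with collections.Counter, and statistics.multimode (available from Python 3.8) returns every value tied for the highest frequency. The colour list is made-up data, only there to show that the mode also works on categories.

```python
from collections import Counter
from statistics import multimode

X = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5]

# Count how often each value appears, then take the most frequent one
frequencies = Counter(X)
value, count = frequencies.most_common(1)[0]
print(value, count)                # 5 4

# multimode returns a list, so ties (bimodal / multimodal data) are visible
print(multimode(X))                # [5]
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]

# Works on categorical data too (hypothetical example values)
colours = ["red", "blue", "red", "green"]
print(multimode(colours))          # ['red']
```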

Now we know why we measure central tendency for our dataset. Let's have a look at measures of dispersion.

Measure of Dispersion

What is Dispersion or Spread?

Dispersion describes how far the data points are spread out from the central tendency, i.e. the mean or median. This is also called variability.

The common measures of dispersion include:

1. Variance

The average of the squared differences between each data point and the mean. Variance provides a measure of how much the data points spread out around the mean.
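Population variance: σ² = Σ(xᵢ - μ)² / N
Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1)
(Dividing by n - 1 instead of n in the sample formula reflects the degrees of freedom: one degree is used up by estimating the mean from the same sample.)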
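To see the difference between dividing by n and by n - 1, here is a minimal sketch that computes both by hand and compares them with statistics.pvariance (population) and statistics.variance (sample). The dataset is the one from the mean example; the variable names are my own.

```python
from statistics import mean, pvariance, variance

X = [1, 2, 2, 3, 4, 5]
n = len(X)
x_bar = mean(X)

# Squared difference between each data point and the mean
squared_diffs = [(x - x_bar) ** 2 for x in X]

population_var = sum(squared_diffs) / n       # divide by n
sample_var = sum(squared_diffs) / (n - 1)     # divide by n - 1

print(round(population_var, 3), round(sample_var, 3))  # 1.806 2.167
print(round(pvariance(X), 3), round(variance(X), 3))   # 1.806 2.167
```

The sample variance is always a little larger than the population variance for the same numbers, and the gap shrinks as n grows.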

2. Standard Deviation

The square root of the variance. It is in the same units as the data, making it more interpretable. It indicates roughly how far, on average, data points lie from the mean.
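As a short sketch, the standard deviation can be obtained by taking the square root of the variance, or directly with statistics.pstdev (population) and statistics.stdev (sample). The dataset is reused from the earlier examples.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

X = [1, 2, 2, 3, 4, 5]

# Square root of the variance gives the standard deviation
population_sd = math.sqrt(pvariance(X))
sample_sd = math.sqrt(variance(X))

print(round(population_sd, 3), round(sample_sd, 3))  # 1.344 1.472
print(round(pstdev(X), 3), round(stdev(X), 3))       # 1.344 1.472
```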

3. Range

The difference between the highest and lowest values in a data set. It gives a quick sense of the spread but is influenced heavily by outliers.
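A tiny sketch shows how strongly a single outlier changes the range. The helper name value_range is hypothetical, chosen to avoid shadowing Python's built-in range.

```python
def value_range(values):
    """Difference between the highest and lowest value (illustrative helper)."""
    return max(values) - min(values)


print(value_range([1, 2, 2, 3, 4, 5]))       # 4
print(value_range([1, 2, 2, 3, 4, 5, 100]))  # 99
```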

4. Interquartile Range (IQR)

The difference between the third quartile (75th percentile) and the first quartile (25th percentile). It measures the spread of the middle 50% of the data, reducing the influence of outliers.

Percentile: A percentile is a value below which a certain percentage of observations lie. E.g. a student at the 95th percentile has scored better than 95% of all students.
Percentile of score X = (Number of values below X / n) * 100
Position of the value at a given percentile in the sorted data = (Percentile / 100) * (n + 1)

IQR = Q3 (third quartile) - Q1 (first quartile)
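Here is a hedged sketch of the percentile formulas and the IQR. The helpers percentile_rank and position_for are hypothetical names that simply mirror the two formulas above. For the quartiles I use statistics.quantiles, whose default "exclusive" method is, as far as I understand it, based on the same (n + 1) position rule; other tools (and other methods) may give slightly different quartiles.

```python
from statistics import quantiles

data = sorted([1, 2, 2, 3, 4, 5, 100])
n = len(data)


def percentile_rank(score, values):
    """Percentage of values strictly below score."""
    below = sum(1 for v in values if v < score)
    return below / len(values) * 100


def position_for(percentile, count):
    """1-based position in the sorted data for a given percentile."""
    return percentile / 100 * (count + 1)


print(round(percentile_rank(4, data), 2))  # 57.14
print(position_for(25, n))                 # 2.0 -> the 2nd sorted value, which is 2

# Quartiles split the sorted data into four parts: Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)                         # 2.0 5.0 3.0
```

Notice that the outlier 100 has no effect on the IQR here, while it dominates the range.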

5. Mean Absolute Deviation (MAD)

The MAD is the average of the absolute differences between each data point and the mean. It provides a measure of dispersion that, unlike variance, is in the same units as the data.

6. Median Absolute Deviation (also MAD)

The MAD can also refer to the median of the absolute differences between each data point and the median of the data set. This measure is robust to outliers and skewed data.
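Since both measures share the abbreviation MAD, a side-by-side sketch may help. The two function names are my own (hypothetical helpers), and the data is the outlier example from the median section.

```python
from statistics import mean, median


def mean_absolute_deviation(values):
    """Average absolute distance of each point from the mean."""
    center = mean(values)
    return mean([abs(x - center) for x in values])


def median_absolute_deviation(values):
    """Median absolute distance of each point from the median."""
    center = median(values)
    return median([abs(x - center) for x in values])


clean = [1, 2, 2, 3, 4, 5]
with_outlier = clean + [100]

print(round(mean_absolute_deviation(clean), 2),
      round(mean_absolute_deviation(with_outlier), 2))  # 1.17 23.8
print(median_absolute_deviation(clean),
      median_absolute_deviation(with_outlier))          # 1.0 1
```

One outlier inflates the mean absolute deviation about twentyfold, while the median absolute deviation stays at 1.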

7. Quartile Deviation (Semi-Interquartile Range)

This is half of the IQR, calculated as (Q3 - Q1) / 2. It provides a measure of dispersion around the median.

8. Coefficient of Variation (CV)

The CV is the ratio of the standard deviation to the mean, often expressed as a percentage. It is useful for comparing the relative dispersion between data sets with different units or widely different means.
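As an illustration, the sketch below compares two datasets measured in different units. The heights and weights are made-up numbers purely for demonstration, and coefficient_of_variation is a hypothetical helper that uses the sample standard deviation.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100


# Hypothetical data: different units, so raw standard deviations are not comparable
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"heights: {cv_heights:.2f}%")  # heights: 4.65%
print(f"weights: {cv_weights:.2f}%")  # weights: 22.59%
```

Relative to their means, the weights in this made-up sample vary far more than the heights. Keep in mind that the CV only makes sense for data whose mean is not zero (or close to it).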

Why do we use measures of dispersion?

Measuring dispersion is important in statistics and data analysis because it shows how spread out the data points are. This helps us understand how consistent the data is and how it relates to the average values like the mean, median, or mode. It also lets us compare different data sets, find unusual values (outliers), choose the right statistical tests, and assess risk, especially in finance. Additionally, dispersion helps spot data quality issues, guides decision-making by showing stability or problems, and provides context beyond just the average values. Overall, understanding dispersion gives a clearer and more accurate picture of the data, leading to better analysis and decisions.

Example: Class A: {80, 85, 90, 95, 100}, Mean(A) = 90
Class B: {70, 75, 85, 95, 105}, Mean(B) = 86
Absolute deviations(A) = {|80-90|, |85-90|, |90-90|, |95-90|, |100-90|}
=> {10, 5, 0, 5, 10}
Absolute deviations(B) = {16, 11, 1, 9, 19}
Mean absolute deviation(A) = 6
Mean absolute deviation(B) = 11.2
So the calculations above show that B has a higher spread than A, even though the two means are fairly close.
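The class comparison can be reproduced with a short sketch. Besides the mean absolute deviation used above, it also prints the range and the sample standard deviation to show that the other measures point the same way. The summarize helper is a hypothetical name for this post.

```python
from statistics import mean, stdev

class_a = [80, 85, 90, 95, 100]
class_b = [70, 75, 85, 95, 105]


def summarize(scores):
    """Return (mean, range, mean absolute deviation, sample standard deviation)."""
    center = mean(scores)
    abs_devs = [abs(s - center) for s in scores]
    return center, max(scores) - min(scores), mean(abs_devs), stdev(scores)


for name, scores in [("A", class_a), ("B", class_b)]:
    avg, spread, mad, sd = summarize(scores)
    print(f"Class {name}: mean={avg}, range={spread}, MAD={mad}, stdev={sd:.2f}")
```

Every measure of spread is larger for Class B, which matches the conclusion from the hand calculation.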

I hope you have understood this blog. In the next blog we are going to look at probability distribution curves. Till then, stay connected, and don't forget to follow. If you liked this post, please leave a reaction.💙
