Statistics is the discipline of designing and evaluating strategies for gathering, examining, organizing and presenting data. It finds a place in virtually every field of science and research, where it is used to manage data and draw better conclusions from it. Statisticians rely on a range of mathematical and computational tools to do this work.
Statistics forms the building blocks of data science, machine learning and automation. In this learning tutorial, we will cover the basics of statistics and probability in data science.
Two basic ideas in statistics are uncertainty and variation. There are many situations, in science and in everyday life, where we are uncertain about an outcome. Sometimes the uncertainty exists because the outcome has not been determined yet: for example, we don’t know whether it will rain tomorrow. In other cases the outcome is known but its causes are uncertain: for example, it is raining today even though the weather experts had forecast a sunny day.
This uncertainty gives rise to the concept of “probability”. Probability is a mathematical language for discussing uncertain events or causes, and it plays a central role in statistics. Every measurement or collection of data also comes with some margin of error. If we collect or measure the same data again, we may get somewhat different values than before; this is variation, and variance is one of the measures used to quantify it.
Data science is about analyzing data and making predictions from it. The foundation of that work lies in the techniques a data scientist uses to analyze the data before predicting anything. A data scientist first tries to understand the data by applying statistical techniques, which gives a fair idea of the type of distribution the data follows. Only after that can algorithms be applied to make predictions.
This tutorial aims to answer the following questions:
1. What is Descriptive Statistics?
2. Types of Descriptive Statistics?
3. Measure of Central Tendency (Mean, Median, Mode)
4. Measure of Spread / Dispersion (Standard Deviation, Mean Deviation, Variance, Percentile, Quartiles, Interquartile Range)
5. What is Skewness and how is it computed?
6. What is Kurtosis and how is it computed?
7. What is Correlation and how is it computed?
Let’s start with Descriptive Statistics:
What is Descriptive Statistics?
Descriptive statistics involves summarizing and organizing data so that it can be easily understood. Unlike inferential statistics, descriptive statistics seeks only to describe the data and does not attempt to make inferences from the sample to the whole population. Here, we typically describe the data in a sample. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory.
Types of Descriptive Statistics?
Descriptive statistics are broken down into two categories: measures of central tendency and measures of variability (spread).
Measure of Central Tendency
Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.
Mean / Average
The mean, or average, is a measure of the central tendency of the data, i.e. a number around which the whole data set is spread out. In a way, it is a single number that can represent the value of the whole data set. As we learned in school, the average is simply the sum of all the values divided by the number of values.
Let’s calculate the mean of a data set of 10 integers, 1 to 10.
Mean/Average = (1+2+3+4+5+6+7+8+9+10)/10 = 55/10 = 5.5
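The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the variable names are arbitrary.

```python
# The ten integers 1 to 10 from the example.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mean = sum of all values divided by the number of values.
mean = sum(data) / len(data)
print(mean)  # 5.5
```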
The median of a data set is the middle value, the number that sits halfway into the set. The data should first be arranged in order from least to greatest.
Note: Sorting the data in descending order instead has no effect on the median. The quartiles, however, swap places, so an interquartile range (IQR) computed from a descending list comes out negative. The IQR is another statistical term that we will discuss later in this tutorial.
The median is the middle term if the number of terms is odd.
So, the median of these numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 is 6.
The median is the average of the middle 2 terms if the number of terms is even.
So, for these numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, the two middle terms are 5 and 6, and their average is (5+6)/2 = 11/2 = 5.5
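The odd and even cases can be written as one small function. This is a minimal sketch; the function name `median` is chosen here for illustration, and Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data,
    or the average of the two middle values if the count is even."""
    ordered = sorted(values)      # arrange from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: a single middle term
        return ordered[mid]
    # even count: average of the two middle terms
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median(range(1, 12)))  # 6
print(median(range(1, 11)))  # 5.5
```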
Let’s discuss Arithmetic Progression here as well
Arithmetic progression is simply a data set wherein the differences between the consecutive terms are constant.
So, if we consider these data:
a) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
b) 2, 4, 6, 8, 10, 12, 14, 16, 18
In the first data set, the difference between successive terms is 1, while in the second data set it is 2. So, both data sets are arithmetic progressions.
Let’s now compute the median of both data sets:
a) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10: mean of the two middle terms = (5+6)/2 = 11/2 = 5.5
b) 2, 4, 6, 8, 10, 12, 14, 16, 18: middle term = 10
Now let’s compute the mean of both data sets:
a) (1+2+3+4+5+6+7+8+9+10)/10 = 55/10 = 5.5
b) (2+4+6+8+10+12+14+16+18)/9 = 90/9 = 10
As you can see, the mean and the median are the same for both data sets.
When values are in arithmetic progression, the mean is always equal to the median.
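A quick check of this property, using Python's standard `statistics` module on the two progressions discussed here (1 to 10 in steps of 1, and 2 to 18 in steps of 2). The variable names are illustrative only.

```python
from statistics import mean, median

set_a = list(range(1, 11))      # 1, 2, ..., 10 (common difference 1)
set_b = list(range(2, 19, 2))   # 2, 4, ..., 18 (common difference 2)

# For each progression, print the mean next to the median.
for name, data in (("a", set_a), ("b", set_b)):
    print(name, mean(data), median(data))
```

The two agree because an arithmetic progression is symmetric about its middle: each term below the centre is balanced by a term the same distance above it.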
The mode is the value that appears the maximum number of times in a data set. In mathematical terms, the mode of a data set is the value with the highest frequency.
For example, in a data set where the value 4 appears 4 times, more often than any other value, the mode is 4.
If 5 also appeared 4 times in that data set, the data set would be bimodal, because two values share the highest frequency. Likewise, if three values share the highest frequency the data set is trimodal, and in general a data set with several modes is multimodal.
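As a small illustration, Python's standard `statistics.multimode` returns every value that shares the highest frequency. Both data sets below are hypothetical, made up only to show the unimodal and bimodal cases.

```python
from statistics import multimode

# Hypothetical data: 4 appears four times, more than any other value.
unimodal = [1, 2, 4, 4, 4, 4, 5, 5, 6]
# Hypothetical data: 4 and 5 both appear four times.
bimodal = [1, 2, 4, 4, 4, 4, 5, 5, 5, 5, 6]

print(multimode(unimodal))  # [4]
print(multimode(bimodal))   # [4, 5]
```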
Measure of Spread / Dispersion
Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item). Measures of spread include the range, quartiles and the interquartile range, variance and standard deviation.
Standard deviation measures how spread out the values are around the mean. A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
In the case of individual observations, standard deviation can be computed in either of two ways:
1. Take the deviations of the items from the actual mean
2. Take the deviations of the items from an assumed mean
In case of a discrete series, any of the following methods can be used to calculate Standard Deviation:
1. Actual mean method
2. Assumed mean method
3. Step deviation method
Calculate the standard deviation for the following sample data using all methods: 2, 4, 8, 6, 10, and 12.
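One possible way to work this exercise is sketched below. It makes three assumptions that the text does not state: the data is treated as a sample, so the sum of squares is divided by n - 1; the assumed mean A = 8 and the common factor c = 2 are arbitrary choices; and the values are treated as individual observations without frequencies.

```python
import math

data = [2, 4, 8, 6, 10, 12]
n = len(data)

# 1. Actual mean method: deviations are taken from the true mean.
mean = sum(data) / n
ss_actual = sum((x - mean) ** 2 for x in data)

# 2. Assumed mean method: deviations d = x - A from a convenient value A,
#    corrected by (sum of d)^2 / n.
A = 8  # assumed mean (arbitrary choice for this sketch)
d = [x - A for x in data]
ss_assumed = sum(v * v for v in d) - sum(d) ** 2 / n

# 3. Step deviation method: also divide the deviations by a common factor c,
#    then scale the result back by c^2.
c = 2  # common factor (arbitrary choice for this sketch)
u = [(x - A) / c for x in data]
ss_step = c ** 2 * (sum(v * v for v in u) - sum(u) ** 2 / n)

# Sample standard deviation: divide the sum of squares by n - 1.
sd_actual = math.sqrt(ss_actual / (n - 1))
sd_assumed = math.sqrt(ss_assumed / (n - 1))
sd_step = math.sqrt(ss_step / (n - 1))
print(sd_actual, sd_assumed, sd_step)
```

All three methods give the same sum of squared deviations (70), so they agree on the standard deviation, roughly 3.74. Dividing by n instead of n - 1 would give the population standard deviation, roughly 3.42.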
Mean Deviation / Mean Absolute Deviation
It is the average of the absolute differences between each value in a set and the mean of all values in that set.
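A minimal sketch of this definition, reusing the six values 2, 4, 8, 6, 10 and 12 from the standard deviation exercise. The choice of data is an assumption made for illustration.

```python
data = [2, 4, 8, 6, 10, 12]

mean = sum(data) / len(data)                 # 7.0
# Average of the absolute distances from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)
print(mean_abs_dev)  # 3.0
```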
The variance measures how far the numbers in the set are from the mean. Variance is calculated by taking the differences between each number in the set and the mean, squaring the differences (to make them positive) and dividing the sum of the squares by the number of values in the set.
For example, the marks obtained by a few students are: 75, 83, 54, 90, 61
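Assuming these marks are meant as the worked example for variance, the steps described above (subtract the mean, square, then average over the number of values) can be sketched as follows. This is the population form of the variance, matching the division by the number of values in the description.

```python
marks = [75, 83, 54, 90, 61]

mean = sum(marks) / len(marks)                    # 72.6
squared_diffs = [(m - mean) ** 2 for m in marks]  # squared distance of each mark from the mean
variance = sum(squared_diffs) / len(marks)        # divide by the number of values
print(round(variance, 2))  # 179.44
```

Taking the square root of this variance gives the standard deviation of the marks, about 13.4.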
Range is one of the simplest techniques of descriptive statistics. It is the difference between the highest and lowest values.
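For instance, applied to the student marks 75, 83, 54, 90 and 61 (reused here as an assumption, for illustration):

```python
marks = [75, 83, 54, 90, 61]

# Range = highest value minus lowest value.
value_range = max(marks) - min(marks)
print(value_range)  # 36
```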
Percentile is a way to represent the position of a value in a data set. To calculate percentiles, the values in the data set should always be in ascending order.
A percentile is a comparison score between a particular score and the scores of the rest of a group. It shows the percentage of scores that a particular score surpassed. For example, if you score 75 points on a test and are ranked in the 85th percentile, it means that the score of 75 is higher than 85% of the scores.
The rank of the value at a given percentile is calculated using the formula:
R = P*(N)/100
where P is the desired percentile, N is the number of data points and R is the position of the value in the sorted data.
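The formula can be turned into a small lookup, sketched below. Two things here are assumptions rather than part of the formula: a fractional rank is rounded up to the next whole position (the "nearest rank" convention; other conventions interpolate between neighbours), and the function name `value_at_percentile` is made up for this example.

```python
import math

def value_at_percentile(values, p):
    """Return the value at rank R = P * N / 100 in the ascending data."""
    ordered = sorted(values)          # percentiles need ascending order
    n = len(ordered)
    rank = math.ceil(p * n / 100)     # rounding up is an assumption of this sketch
    rank = min(max(rank, 1), n)       # keep the rank inside 1..N
    return ordered[rank - 1]          # ranks are 1-based, list indexes are 0-based

marks = [75, 83, 54, 90, 61]
print(value_at_percentile(marks, 40))  # 61
print(value_at_percentile(marks, 80))  # 83
```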
Interquartile range (IQR)
When a data set has outliers or extreme values, we summarize a typical value using the median as opposed to the mean. In the same situation, variability is often summarized by a statistic called the interquartile range, which is the difference between the third and first quartiles. The first quartile, denoted Q1, is the value in the data set that holds 25% of the values below it. The third quartile, denoted Q3, is the value in the data set that holds 25% of the values above it. The quartiles can be determined following the same approach that we used to determine the median, but we now consider each half of the data set separately. The interquartile range is defined as follows:
IQR = Q3 - Q1
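The "median of each half" approach can be sketched as follows. One detail is an assumption: when the number of values is odd, the middle value is left out of both halves (some textbooks include it). The function name `quartiles` is chosen for this example only.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # lower half of the data
    upper = ordered[-half:]     # upper half; skips the middle value when the count is odd
    return median(lower), median(upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 3 8 5
```

Library routines such as `statistics.quantiles` use interpolation rules and may return slightly different quartiles for the same data.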
A measure of skewness determines the extent of asymmetry, or lack of symmetry. A distribution is said to be asymmetric if its graph does not look the same to the right and to the left of the central position. In more statistical language, skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean. When the number of observations is small, skewness can often be spotted by inspecting the data directly.
For example, when the numbers 9, 10, 11 are given, we can easily see that the values are equally distributed about the mean, 10. But if we add the number 5, so that the data becomes 5, 9, 10, 11, then the distribution is no longer symmetric; it is skewed.
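To put a number on this example, the sketch below uses the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). That particular formula is an assumption chosen for illustration; the text itself goes on to use Pearson's coefficients.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

print(moment_skewness([9, 10, 11]))     # symmetric data
print(moment_skewness([5, 9, 10, 11]))  # the added 5 pulls the left tail out
```

The first result is zero and the second is negative, since the extra low value stretches the distribution to the left.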
Skewness can be seen by looking at the graph of the distribution. It can be of two types: positive skew and negative skew.
Positive Skew: When the bulk of the distribution is concentrated on the left side of the graph and the right tail is longer, it is known as positive skew. This is also called a right-tailed or right-skewed distribution.
Negative Skew: When the bulk of the distribution is concentrated on the right side of the graph and the left tail is longer, it is known as negative skew. This is also called a left-tailed or left-skewed distribution.
How to compute the skewness coefficient?
To calculate the skewness coefficient of a sample, there are two methods:
1] Pearson First Coefficient of Skewness (Mode skewness) = (Mean - Mode) / Standard deviation
2] Pearson Second Coefficient of Skewness (Median skewness) = 3 * (Mean - Median) / Standard deviation
· The direction of skewness is given by the sign. A zero means no skewness at all.
· A negative value means the distribution is negatively skewed. A positive value means the distribution is positively skewed.
· The coefficient compares the sample distribution with a normal distribution. The larger the absolute value, the more the distribution differs from a normal distribution.
Sample problem: Use Pearson’s Coefficient #1 and #2 to find the skewness for data with the following characteristics:
· Mean = 50.
· Median = 56.
· Mode = 60.
· Standard deviation = 8.5.
Pearson’s First Coefficient of Skewness: (50 - 60) / 8.5 ≈ -1.18.
Pearson’s Second Coefficient of Skewness: 3 * (50 - 56) / 8.5 ≈ -2.12.
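The sample problem can be reproduced with the standard definitions of the two coefficients: (mean - mode) / standard deviation for the first, and 3 * (mean - median) / standard deviation for the second. The variable names below are illustrative.

```python
# Characteristics given in the sample problem.
mean, median, mode, sd = 50, 56, 60, 8.5

first = (mean - mode) / sd          # Pearson's first coefficient (mode skewness)
second = 3 * (mean - median) / sd   # Pearson's second coefficient (median skewness)
print(first, second)
```

Both values are negative, so by the sign rule the data is negatively skewed.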
Measures of skewness are used very commonly, since skewed data appears in many different situations. In commerce, for example, skewness is measured frequently because incomes are typically skewed to the right or to the left.
Similarly, data describing the lifetime of a commodity such as a tubelight is right skewed. The smallest lifetime may be zero, whereas the long-lasting tubelights give the distribution its positive skewness.
The degree of tailedness of a distribution is measured by kurtosis. It tells us the extent to which the distribution is more or less outlier-prone (heavier- or lighter-tailed) than the normal distribution. Three different types of curves, courtesy of Investopedia, are shown as follows −
It is difficult to discern different types of kurtosis from the density plots (left panel) because the tails are close to zero for all distributions. But differences in the tails are easy to see in the normal quantile-quantile plots (right panel).
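The normal curve is called a mesokurtic curve. If the curve of a distribution is more outlier-prone (or heavier-tailed) than a normal or mesokurtic curve, it is referred to as a leptokurtic curve. If a curve is less outlier-prone (or lighter-tailed) than a normal curve, it is called a platykurtic curve. Kurtosis is measured by moments and is given by the following formula:
Kurtosis = μ4 / (μ2)^2
where μ2 and μ4 are the second and fourth central moments of the distribution. A normal distribution has a kurtosis of 3; values above 3 indicate a leptokurtic curve and values below 3 a platykurtic curve.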
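A minimal sketch of the moment-based computation, dividing the fourth central moment by the square of the second. It uses population moments (division by n), and both data sets are hypothetical, made up to contrast a flat spread with a single extreme value.

```python
def moment_kurtosis(values):
    """Kurtosis as m4 / m2**2, using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]           # evenly spread, no outliers
with_outlier = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]   # hypothetical data with one extreme value

print(moment_kurtosis(flat))          # below 3: lighter-tailed than a normal curve
print(moment_kurtosis(with_outlier))  # above 3: heavier-tailed than a normal curve
```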
The main difference between skewness and kurtosis is that the skewness refers to the degree of symmetry, whereas the kurtosis refers to the degree of presence of outliers in the distribution.
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.
The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.
If r is close to 0, it means there is no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).
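As an illustration, the Pearson correlation coefficient (the usual meaning of "r") can be computed directly from its definition: the sum of the products of the deviations, divided by the square root of the product of the two sums of squared deviations. The three data series below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]              # hypothetical data
score = [52, 55, 61, 64, 70]         # rises with hours
remaining = [40, 32, 24, 16, 8]      # falls in a straight line as hours rise

print(pearson_r(hours, score))       # close to +1
print(pearson_r(hours, remaining))   # -1, up to floating-point rounding
```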
I hope that by now you have a basic understanding of Descriptive Statistics.