"By Statistics, we mean methods specially adopted to the elucidation of quantitative data affected to a marked extent by multiplicity of causes. --Yule and Kendal
It is different from Business Analytics, which can be defined as:
Business Analytics (BA) can be defined as the broad use of data and quantitative analysis for decision making within organizations. -- Thomas Davenport
Statistics is broadly divided into two categories:
Descriptive Statistics involves describing the data, summarising the information it contains, and providing a convenient, brief overview. It is like online dating: it is definite but can be misleading.
For example, if we try to answer the question of who is the best batsman of all time in cricket, there are many factors that could determine this, which leads to confusion. A descriptive-statistics-based approach may involve taking the average of all the scores of a batsman and providing it as a single convenient value to evaluate that batsman's performance. This helps present the data in layman's terms, giving users a definitive perspective.
Inferential Statistics involves making predictions from the given data, or extracting unusual patterns that can prove useful in decision making. It is concerned with estimating population parameters from a sample.
For example, studying stock market prices in order to predict future prices, and performing risk analysis to determine the optimal amount to invest so as to maximise profits and prevent loss of the investment.
There is also a third category called 'Prescriptive Statistics'. This is used in prescribing the test subject with a treatment to affect the subject's behaviour, and observing the results to get better interpretations that can be applied to the population.
For example, a doctor treating a set of patients with a certain illness tries a different treatment on each patient for a week and observes how they react. Whichever treatment gets the best results is then recommended, or 'prescribed', to others who have similar symptoms.
The first major division between types of data is structured versus unstructured data. We are mostly concerned with structured data, which has an organized nature and can be used to extract meaningful information and patterns.
A broad distinguishing feature of data is its scale of measurement. Hence, we will now get into the details of the different types of such data: nominal, ordinal, interval, and ratio.
Examples: nominal (blood group), ordinal (movie ratings), interval (temperature in Celsius), and ratio (height or weight).
Now, why are we learning statistics? Our aim is to get information from raw data.
Raw Data represents numbers and facts in the original format in which they were collected. We need to convert raw data into information for decision making.
When we mine useful patterns after structuring this data, we turn the data into information.
Population Mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ and Sample Mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where N = total number of records in the population and n = total number of records in the sample (the sample being a subset of the population).
Note: the mean is known to be prone to outliers or extreme values. For example, suppose we are calculating the average income of the people sitting in a restaurant, with the lowest income being $100 and the highest $500, so the mean income is about $300. Suddenly, Bill Gates enters the restaurant and takes a seat. Assume his income is $1 billion; the new mean shoots up into the millions, way higher than the incomes of the others in the same dataset or population. Hence, the mean is sensitive to outliers.
Thus, we use a trimmed mean, which drops these extreme values: Trimmed Mean $= \frac{1}{n-2p}\sum_{i=p+1}^{n-p} x_{(i)}$, where $p$ is the number of extreme values dropped from each end of the sorted data $x_{(1)}, \dots, x_{(n)}$.
Weighted Mean $= \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ is used when the data does not represent all groups equally and some observations are more important than others, so they need to be given higher weights.
Median: the middle value of the sorted data (or the average of the two middle values when n is even). Unlike the mean, it is robust to outliers.
Notebook for Measures of Central Tendency: Link
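A minimal sketch of these measures in Python (the restaurant incomes and the weights below are made-up illustrative values):

```python
import numpy as np
from scipy import stats

# Made-up restaurant incomes from the outlier example above
incomes = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500])

print(np.mean(incomes))       # 300.0 -- plain mean
print(np.median(incomes))     # 300.0 -- middle value of the sorted data

# Add Bill Gates: the mean explodes, the median barely moves
with_gates = np.append(incomes, 1_000_000_000)
print(np.mean(with_gates))    # ~100,000,270
print(np.median(with_gates))  # 325.0

# Trimmed mean: drop the most extreme 20% from each end before averaging
print(stats.trim_mean(with_gates, proportiontocut=0.2))  # 325.0

# Weighted mean: give some observations more weight than others
weights = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3])
print(np.average(incomes, weights=weights))
```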
Mean Absolute Deviation $= \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Standard Deviation $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
A robust estimate of variability is the median absolute deviation (MAD): $\text{MAD} = \text{Median}(|x_1 - m|, |x_2 - m|, \dots, |x_n - m|)$, where $m$ is the median of the data. In general, Standard Deviation ≥ Mean Absolute Deviation ≥ Median Absolute Deviation. The range is given as: Range = maximum value − minimum value.
Further, we divide the sorted data into four quarters using the quartiles: Q1 (25th percentile), Q2 (the median, 50th percentile), and Q3 (75th percentile).
The interquartile range, IQR = Q3 − Q1, is the middle half of the data; it is not susceptible to outliers, which makes it well suited for dealing with skewed distributions.
When you have a skewed distribution, the median is a better measure of central tendency, and it makes sense to pair it with either the interquartile range or other percentile-based ranges because all of these statistics divide the dataset into groups with specific proportions.
For normally distributed data, or even data that aren't terribly skewed, the tried-and-true combination of reporting the mean and the standard deviation is the way to go. This combination is by far the most common. You can still supplement this approach with percentile-based ranges as needed.
Notebook for Measures of Variability: Link
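A minimal sketch of these variability measures on a small made-up array (note that scipy's `median_abs_deviation` is unscaled by default):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 4, 4, 5, 5, 7, 9])

print(np.var(x, ddof=1))                 # sample variance (n-1 denominator)
print(np.std(x, ddof=1))                 # sample standard deviation
print(np.mean(np.abs(x - np.mean(x))))   # mean absolute deviation
print(stats.median_abs_deviation(x))     # median absolute deviation (robust)

q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                           # interquartile range
print(x.max() - x.min())                 # range
```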
Pandas provides a simple five-number summary of the data (its describe() method also adds the count, mean, and standard deviation) that gives insight into different aspects:
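For example, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"income": [100, 150, 200, 250, 300, 350, 400, 450, 500]})

# describe() reports count, mean, std, min, the quartiles (25%/50%/75%) and max,
# which contains the five-number summary: min, Q1, median, Q3, max
print(df.describe())
```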
Boxplot: a quick way to visualize the distribution of the data using the five-number summary.
Frequency Table or Histogram: a frequency distribution of the data, with values stacked together into bins. If the bins are too small, the result is too granular and the ability to see the bigger picture is lost.
Quantile Plot: Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Scatter Plot: shows the relationship between the dependent and independent variables in the data (see the sketch after this list).
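A minimal matplotlib/pandas sketch of three of these plots, using synthetic hours-studied vs GPA data (the column names and the generating model are made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"hours_studied": rng.uniform(0, 10, 200)})
df["gpa"] = 2.0 + 0.15 * df["hours_studied"] + rng.normal(0, 0.3, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.boxplot(column="gpa", ax=axes[0])                      # five-number summary at a glance
df["gpa"].hist(bins=20, ax=axes[1])                       # histogram; bin width matters
df.plot.scatter(x="hours_studied", y="gpa", ax=axes[2])   # relationship between variables
plt.tight_layout()
plt.show()
```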
Correlation refers to the simple idea of determining the relationship between a pair of features; it can be positive (both move together), negative (one rises as the other falls), or absent.
A fun example is the amount of time you study versus your GPA. People naturally believe that the more you study, the higher your marks and hence your GPA; but many surveys and samples suggest that if you study too much, your GPA can actually be lower than that of candidates who study fewer hours yet score more. This is, of course, subject to interpretation: there are many other factors associated with the example, not to mention the variance in the concentration of different students. But the idea of correlation is to find unusual patterns, say something as weird as the cost of a car wash and the time it takes to buy soda at the car wash station! That is where the actual data analysis comes into the picture.
The numerator of the correlation coefficient is the covariance. Covariance tells us how two variables vary together around their means: $\text{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$. Its sign gives the direction of the relationship, but because its magnitude depends on the units of the variables, it does not by itself quantify the strength of association; dividing by the product of the two standard deviations gives the correlation coefficient, which always lies between −1 and 1.
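A quick numpy check of the relationship between covariance and correlation, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)   # positively related to x

# Sample covariance matrix; the off-diagonal entry is Cov(x, y)
print(np.cov(x, y)[0, 1])

# Pearson correlation: covariance scaled by both standard deviations,
# so it always lies in [-1, 1]
print(np.corrcoef(x, y)[0, 1])
```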
A hypothesis is an assumption about a population parameter, tested to see whether the selected sample is a generic representation of the entire population. Hypothesis Testing simply refers to the decision-making process of accepting or rejecting the hypothesis.
The population can be a very large dataset, making it computationally expensive to work on all the data at once, so we make use of sampling. Sampling is the process of selecting a subset of the data that is an appropriate representative of the whole, so that results generated from this subset can be generalized to the entire population. There are two types of sampling available for any data:
Probability Sampling refers to techniques where every data point has a known, non-zero chance of being selected from the population (an equal chance, in the case of simple random sampling).
Non-probability sampling does not give every data point an equal (or even known) chance of being selected; selection is driven by convenience or judgment instead.
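A small pandas illustration of the difference (the income population below is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
population = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=10_000)})

# Probability (simple random) sampling: every row has the same chance of selection
random_sample = population.sample(n=500, random_state=42)

# Non-probability (convenience) sampling: just take the first 500 rows;
# every other row has zero chance of being selected
convenience_sample = population.head(500)

print(population["income"].mean(),
      random_sample["income"].mean(),
      convenience_sample["income"].mean())
```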
Our goal is to reject the null hypothesis $H_0$ and accept the alternative hypothesis $H_1$. Many statisticians argue that there is no such thing as 'acceptance', because we cannot accept a NULL hypothesis: we either reject $H_0$ or fail to reject $H_0$. From a simple perspective, the process of Hypothesis Testing has four broad steps: state $H_0$ and $H_1$, choose a significance level, compute the test statistic (and p-value) from the sample, and reject or fail to reject $H_0$.
Hypothesis Testing plays an integral part in deciding whether the patterns a data scientist assumes to be present in the data are real or just due to random chance, and different errors can arise depending on the decision taken. In summary, hypothesis testing is a kind of inferential statistics, where we try to establish an assumption about the entire population through a sample of observations drawn from it. A good example for understanding the idea of hypothesis testing is as follows: consider trying to determine the percentage of community spread in Bangalore, given the rising number of COVID-19 cases. The Health Minister of Karnataka says that on average there is a 75% spread in infection cases. However, we cannot simply agree without proof, so here we are!
$H_0$: the mean percentage of COVID-19 spread in Bangalore, Karnataka is 75% ($\mu = 75$).
$H_1$: the mean percentage of COVID-19 spread in Bangalore, Karnataka is not 75% ($\mu \neq 75$).
Now, we will perform the Z-Test: $Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$, where $\bar{x}$ = sample mean, $\mu_0$ = hypothesized population mean, $\sigma$ = population standard deviation (so $\sigma/\sqrt{n}$ is the standard error), and $n$ = sample size. Through this, we are standardizing the sample mean: if the sample mean is close enough to the hypothesized mean, then Z will be close to 0 and we will fail to reject the Null Hypothesis.

Now, how do we reject it? We have to determine how big the Z-value should be in order to reject the Null Hypothesis. For this, we carry out a two-tailed test, which is generally used when the hypothesis involves equality versus inequality. A two-tailed test is a test of a statistical hypothesis where the region of rejection lies on both sides of the sampling distribution and the region of acceptance is in the middle; the shaded tails in such plots are the rejection regions.

The threshold values depend on the significance level chosen. For example, with $\alpha = 0.05$, each tail of a two-tailed test gets $0.05/2 = 0.025$; looking this up in the Z-table, we find critical values of −1.96 and 1.96, which become the thresholds on either side. Therefore, if the Z value we get from the test is lower than −1.96 or higher than 1.96, we reject the null hypothesis; otherwise, we fail to reject it.

It can turn out that the decision made from your hypothesis test is wrong. Thus, in general, two types of errors can arise: a Type I error (rejecting $H_0$ when it is actually true, with probability $\alpha$) and a Type II error (failing to reject $H_0$ when it is actually false, with probability $\beta$).
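A minimal sketch of this two-tailed Z-test in Python (the sample mean, standard deviation, and sample size below are hypothetical survey numbers, not real data):

```python
import math
from scipy import stats

mu0 = 75.0    # hypothesized population mean (H0)
xbar = 72.5   # sample mean (hypothetical survey result)
sigma = 10.0  # assumed known population standard deviation
n = 100       # sample size

z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized sample mean

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)      # two-tailed critical value, ~1.96

print(f"z = {z:.2f}, critical = ±{z_crit:.2f}")
if abs(z) > z_crit:
    print("Reject H0")       # z = -2.50 falls in the rejection region
else:
    print("Fail to reject H0")
```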
Now, before going further, we have to understand the concept of a Point Estimate, in terms of the Sample Mean ($\bar{x}$) and the Population Mean ($\mu$), in the context of Hypothesis Testing.
As an aside: assuming you are reading this to build statistical foundations for machine learning, I am 90% confident that you will learn machine learning after going through this curated content!
Now we come to the case where we do not want to base hypothesis testing on a predefined significance level, and so we make use of a p-value: the probability, assuming $H_0$ is true, of observing a result at least as extreme as the one obtained. Equivalently, it is the smallest significance level at which we would still reject the NULL hypothesis.
HOW TO CALCULATE THE P-VALUE? After calculating the Z-value of the sample, we compute the p-value as follows: for a two-tailed test, $p = 2 \cdot P(Z \geq |z|)$; for a right-tailed test, $p = P(Z \geq z)$; and for a left-tailed test, $p = P(Z \leq z)$.
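In Python, using the standard normal distribution from scipy (continuing the z = −2.5 example above):

```python
from scipy import stats

z = -2.5   # z-statistic from the test above

# Two-tailed: probability of a result at least this extreme in either direction
p_two_tailed = 2 * stats.norm.sf(abs(z))

# One-tailed alternatives
p_left = stats.norm.cdf(z)   # H1: mean < hypothesized value
p_right = stats.norm.sf(z)   # H1: mean > hypothesized value

print(p_two_tailed, p_left, p_right)   # p ≈ 0.0124 two-tailed; reject at alpha = 0.05
```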
So, let us see the simple difference between a Z-Test and a T-Test: a Z-test is used when the population standard deviation is known (or the sample is large, say n ≥ 30), while a T-test is used when the population standard deviation is unknown and must be estimated from the sample; the T-test uses the t-distribution with n − 1 degrees of freedom.
So, before we proceed any further, let us be clear about these tests and when to use them.
Simple Linear Regression is a method to determine the linear relationship between a dependent variable and a single independent variable. In simple words, it fits a straight line: with a positive slope, the dependent variable increases as the independent variable increases; with a negative slope, it decreases.
The major difference between correlation and regression is that correlation measures the strength of association between two variables, while regression quantifies the nature of that relationship.
A few key terms in describing the regression are the intercept ($b_0$), the slope or coefficient ($b_1$), the fitted (predicted) values, and the residuals.
The Regression Equation is as follows:
$y = b_0 + b_1 x$
The observed values generally do not fall exactly on the regression line; the difference between an observed value and its fitted value is the residual, which we capture with an explicit error term:
$y = b_0 + b_1 x + e$
Residual Sum of Squares $= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The method of minimizing the sum of squared residuals is called Ordinary Least Squares Regression (OLSR).
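A minimal sketch of the closed-form OLS estimates on synthetic data (the true coefficients here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)   # true b0 = 3, b1 = 2, plus noise

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x
rss = np.sum((y - fitted) ** 2)   # the residual sum of squares being minimized
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, RSS = {rss:.2f}")
```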
So far, we have discussed how linear relationships are determined using regression; there is also a non-linear side to regression.
Profiling: the term used to describe the relationship between prediction and explanation. A regression model can act as a predictive model and give us the predicted values for the input features. However, it does not give us an explicit explanation of why the prediction is what it is; that explanation comes from our own domain knowledge.
For example, we can predict the number of purchases based on the number of ad clicks on a website, but in the end it is our marketing knowledge that tells us whether the sales increase was profitable, not the other way around. This is called Profiling.
Hence, we can conclude that a regression model that fits the data well is set up such that changes in X are associated with changes in Y. However, by itself, the regression equation does not prove the direction of causation; conclusions about causation must come from a broader context of understanding about the relationship.
Thus, we come across three types of modelling in total as follows:
When multiple predictors are combined to fit a single regression equation, we refer to it as multiple linear regression.
It can simply be represented as:
$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n + e$, where $e$ is the error (residual) term.
Key Terms for Multiple Linear Regression:
RMSE is the square root of the average squared residual: $RMSE = \sqrt{\frac{\sum_{i}(y_i - \hat{y}_i)^2}{n}}$. If we replace the denominator $n$ with $n - p - 1$, where $p$ is the number of predictors, we get the Residual Standard Error (RSE); the adjustment accounts for the degrees of freedom lost in estimating the coefficients.
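A sketch comparing RMSE and RSE on synthetic data with scikit-learn (the coefficients and noise level are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.8, n)

model = LinearRegression().fit(X, y)
resid = y - model.predict(X)

rmse = np.sqrt(np.sum(resid ** 2) / n)            # denominator n
rse = np.sqrt(np.sum(resid ** 2) / (n - p - 1))   # denominator n - p - 1 (degrees of freedom)
print(f"RMSE = {rmse:.3f}, RSE = {rse:.3f}")
```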