Sunday, 3 March 2019

Regression Analysis

Inferential Statistics

Regression analysis is a common method of prediction. It is used whenever there is a causal relationship between variables.

Points to note
  • Correlation doesn't imply causation: some variables are strongly correlated purely by coincidence, while others we would expect to be related show no correlation.
Linear Regression is a linear approximation of a causal relationship between two or more variables.
  • Regression models are highly valuable as they are one of the most common ways to make inferences and predictions. 
  • Process of linear regression
    • Get sample data
    • Design a model that works for the sample
    • Make predictions for the whole population
    • The dependent variable is predicted from the independent variables:
      Y = F(x1, x2, x3, ...)
      The dependent variable Y is a function of the independent variables x1, x2, ...
Simple Linear Regression is the simplest regression model.
  • ŷ = b0 + b1*x1 (the hat stands for an estimated or predicted value)
    • b0 is the intercept of the line (the predicted value when x1 = 0)
    • b1 is the slope of the line (the change in ŷ per one-unit change in x1)
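A minimal sketch of estimating b0 and b1 with NumPy, on made-up sample data (the x and y values below are hypothetical, chosen only for illustration):

    import numpy as np

    # Hypothetical sample: x = years of experience, y = salary in $1000s
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([35, 42, 47, 55, 58, 66, 70, 79], dtype=float)

    # Closed-form estimates for simple linear regression:
    #   b1 = cov(x, y) / var(x),  b0 = ybar - b1 * xbar
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    print(f"y_hat = {b0:.2f} + {b1:.2f} * x1")

    # Cross-check with NumPy's least-squares fit of a degree-1 polynomial
    slope, intercept = np.polyfit(x, y, deg=1)
    print(intercept, slope)  # should match b0 and b1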
Correlation vs Regression
  • Correlation measures the degree of relationship between two variables; it doesn't imply causation.
  • Regression analysis is about how one variable affects another.
  • Regression is based on causality. It shows not just a degree of connection but cause and effect.
  • Correlation is symmetric: corr(x, y) is the same as corr(y, x).
  • Regression is one-way: regressing y on x is not the same as regressing x on y (see the sketch below).
  • Graphically, a correlation is represented by a single point.
  • Regression is the best-fitting line through the data points, the one that minimizes the distances between the points and the line.
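The symmetry vs one-way distinction is easy to check numerically. A small sketch on synthetic data (the data-generating setup is an assumption for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2 * x + rng.normal(size=200)     # y depends on x, plus noise

    # Correlation is symmetric: corr(x, y) == corr(y, x)
    print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])

    # Regression is one-way: the slope of y-on-x is not the reciprocal
    # of the slope of x-on-y (unless the correlation is perfect)
    slope_yx = np.polyfit(x, y, 1)[0]    # regress y on x
    slope_xy = np.polyfit(y, x, 1)[0]    # regress x on y
    print(slope_yx, 1 / slope_xy)        # these differ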
Decomposing the linear model
  • Sum of squares total (SST) = Σ(yi - ȳ)² - squared differences between the observed values and the mean
  • Sum of squares regression (SSR) = Σ(ŷi - ȳ)² - squared differences between the predicted values and the mean
  • Sum of squares error (SSE) = Σ(yi - ŷi)² - squared differences between the observed and predicted values
SST = SSR + SSE
Total variability = explained variability + unexplained variability
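The identity can be verified numerically; it holds exactly for OLS with an intercept. A sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=50)
    y = 3 + 2 * x + rng.normal(scale=2, size=50)

    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x

    sst = np.sum((y - y.mean()) ** 2)      # total variability
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
    sse = np.sum((y - y_hat) ** 2)         # unexplained (residual)

    print(sst, ssr + sse)  # equal up to floating-point error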

R² (R-squared) = SSR/SST = variability explained by the regression / total variability
    The R-squared shows how much of the total variability of the dataset is explained by your regression model. This may be expressed as: how well your model fits your data. It is incorrect to say your regression line fits the data, as the line is just the geometrical representation of the regression equation. It is also incorrect to say the data fits the model or the regression line, as you are trying to explain the data with a model, not vice versa.
  • R-squared measures the goodness of fit of your model (see the snippet below).
  • The more factors you include, the higher the R-squared.
  • R-squared ranges between 0 and 1; 1 means the model explains the entire variability of the data.
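Given the decomposition above, R-squared is a one-liner. For simple linear regression it also equals the squared Pearson correlation between x and y, which gives a handy cross-check (synthetic data again, for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=50)
    y = 3 + 2 * x + rng.normal(scale=2, size=50)

    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x

    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - sse / sst              # same as SSR / SST

    # Cross-check: for simple linear regression, R^2 = corr(x, y)^2
    print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)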

Ordinary least squares (OLS) minimizes the SSE:
    min Σ ei² = min Σ(yi - xi^T b)²
    S(b) = Σ(yi - xi^T b)² = (y - Xb)^T (y - Xb)
    S(b) is the objective function; the OLS estimator of beta is the b that minimizes it, b = (X^T X)^(-1) X^T y (see the sketch below).
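A sketch of the matrix form with NumPy, solving the normal equations (X^T X)b = X^T y directly; the data-generating coefficients are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

    # Design matrix with a leading column of ones for the intercept
    X = np.column_stack([np.ones(n), x1, x2])

    # b = (X^T X)^(-1) X^T y minimizes S(b) = (y - Xb)^T (y - Xb)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b)  # approximately [1.0, 2.0, -0.5]

    # SSE at the minimum
    print(np.sum((y - X @ b) ** 2))

In practice np.linalg.lstsq (or a library such as statsmodels) is preferred over explicitly forming and inverting X^T X, since it is numerically more stable.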

Regression Tables
  • Model summary
    • Multiple R
    • R square
    • Adjusted R Square
    • Standard error - sqrt(SSE/(n-2))
    • Observations
  • ANOVA table (Analysis of Variance)
    • SSE
    • SSR
    • SST 
  • Table with coefficients (this is the heart of the regression)
    • Intercept (beta 0)
    • Independent variable coefficient (beta 1)
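All of the above can be produced with statsmodels' OLS summary (assuming statsmodels is installed; the data here are synthetic):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=60)
    y = 5 + 1.5 * x + rng.normal(scale=3, size=60)

    X = sm.add_constant(x)        # adds the intercept column
    model = sm.OLS(y, X).fit()
    print(model.summary())        # R-squared, adjusted R-squared, F-statistic,
                                  # and the coefficient table with beta 0 and beta 1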

Adjusted R Square
  • It penalizes the excessive use of variables.
  • The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
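A sketch of the penalty in action, using the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors; the junk predictor below is pure noise by construction:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 50
    x = rng.uniform(0, 10, size=n)
    junk = rng.normal(size=n)                # predictor unrelated to y
    y = 3 + 2 * x + rng.normal(scale=2, size=n)

    def r2_and_adjusted(X, y):
        # Fit OLS via least squares and return (R^2, adjusted R^2)
        Xd = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        y_hat = Xd @ b
        r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
        p = Xd.shape[1] - 1                  # number of predictors
        adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
        return r2, adj

    print(r2_and_adjusted(x.reshape(-1, 1), y))
    print(r2_and_adjusted(np.column_stack([x, junk]), y))
    # R^2 creeps up with the junk predictor; adjusted R^2 typically drops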
