Regression

Confidence intervals

The confidence interval for a point estimate is the interval within which we have a particular degree of confidence that the true value resides. For example, the 95% confidence interval for the mean height in a population may be [1.78m, 1.85m].

A confidence interval for a sample mean can be calculated in this way:

  1. Let \alpha be the specified confidence level, e.g. \alpha = 0.95 for the 95% confidence level.
  2. Let F(x; n-1) be the CDF of Student's t distribution, parameterised by the number of degrees of freedom, which is the sample size (n) minus 1.
  3. Calculate t = F^{-1}((1 + \alpha)/2; n-1), i.e. the (1 + \alpha)/2 quantile of that distribution (the 0.975 quantile for \alpha = 0.95).
  4. Then the confidence interval for the point estimate is:

\bar{x} - t \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + t \frac{s}{\sqrt{n}}

Where \bar{x} is the sample mean (the point estimate), \mu is the true population mean, s is the sample standard deviation and n is the sample size.
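
As an illustration (the sample data below is made up), a minimal NumPy/SciPy sketch of this calculation:

```python
import numpy as np
from scipy import stats

# Made-up sample of heights in metres (illustrative only).
sample = np.array([1.79, 1.83, 1.81, 1.85, 1.78, 1.82, 1.80, 1.84])

alpha = 0.95                  # confidence level
n = len(sample)
x_bar = sample.mean()         # sample mean
s = sample.std(ddof=1)        # sample standard deviation

# Critical value: the (1 + alpha) / 2 quantile of Student's t with n - 1 dof.
t = stats.t.ppf((1 + alpha) / 2, df=n - 1)

half_width = t * s / np.sqrt(n)
print(f"{alpha:.0%} CI for the mean: [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```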

Isotonic regression

Fits a step-wise (piecewise-constant) monotonic function to the data. This is a useful way to avoid overfitting when there is a strong theoretical reason to believe that the function y = f(x) is monotonic, for example the relationship between the floor area of houses and their prices.
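
A minimal sketch using scikit-learn's IsotonicRegression on made-up floor-area and price data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Made-up data: floor area (m^2) vs price (thousands), noisy but increasing.
area = np.array([30, 45, 50, 60, 75, 90, 110, 130])
price = np.array([150, 160, 155, 200, 230, 225, 280, 310])

iso = IsotonicRegression(increasing=True)
iso.fit(area, price)

# Predictions respect the monotonicity constraint: they never decrease
# as the floor area increases.
print(iso.predict([40, 70, 120]))
```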

Linear regression

The simplest form of regression. Estimates a model with the equation:

\hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n

where the \beta_i are parameters to be estimated by the model and the x_i are the features.

The loss function is usually the squared error.
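
A brief sketch, assuming scikit-learn and made-up data, of fitting such a model with the squared-error objective:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features, one target.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # beta_0 and [beta_1, beta_2]
print(model.predict([[6.0, 6.0]]))     # y-hat for a new point
```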

Normal equation

The equation that gives the optimal parameters for a linear regression.

Rewrite the regression equation in matrix form as \hat{y} = X \beta, where X is the design matrix with a leading column of ones so that \beta_0 acts as the intercept.

Then the formula for \beta which minimizes the squared error is:

\beta = (X^T X)^{-1} X^T y
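
A minimal NumPy sketch of the normal equation on made-up data, with a leading column of ones in X so that \beta_0 acts as the intercept:

```python
import numpy as np

# Made-up data: 5 observations, 2 features.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Prepend a column of ones so beta[0] acts as the intercept beta_0.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# beta = (X^T X)^{-1} X^T y.  Solving the linear system is preferred to
# explicitly inverting X^T X for numerical stability.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```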

Logistic regression

Used for modelling probabilities. It uses the sigmoid function, \sigma(z) = \frac{1}{1 + e^{-z}}, to ensure the predicted values are between 0 and 1; values outside this range would not make sense when predicting a probability. The functional form is:

\hat{y} = \sigma(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)
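A minimal sketch, assuming scikit-learn and made-up data; predict_proba returns the modelled probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up binary classification data: one feature, labels 0/1.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Predicted probabilities of class 1, squashed into (0, 1) by the sigmoid.
print(clf.predict_proba([[1.8], [3.2]])[:, 1])
```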

Multicollinearity

When one of the features is (exactly or approximately) a linear function of one or more of the others. This makes X^T X singular or ill-conditioned, so the coefficient estimates from the normal equation become undefined or unstable.
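
A small NumPy illustration with made-up data: when one column of the design matrix is an exact linear function of the others, X^T X is singular and the normal equation has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = rng.normal(size=20)
x3 = 2 * x1 + 3 * x2          # x3 is an exact linear function of x1 and x2

X = np.column_stack([np.ones(20), x1, x2, x3])

# Rank is 3 rather than 4, so X^T X is singular.
print(np.linalg.matrix_rank(X))
# A huge condition number is the typical symptom with near-collinearity.
print(np.linalg.cond(X.T @ X))
```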

P-values

Measure the statistical significance of the coefficients of a regression. The closer the p-value is to 0, the more statistically significant that result is.

The p-value is the probability of seeing an effect at least as extreme as the one observed if there is in fact no relationship, i.e. under the null hypothesis that the true coefficient is zero.

In a regression, the p-value of a coefficient \hat{\beta}_j is calculated from its t-statistic:

t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}, \qquad p_j = 2 \left( 1 - F(|t_j|; n - k - 1) \right)

where SE(\hat{\beta}_j) = \sqrt{s^2 \left[ (X^T X)^{-1} \right]_{jj}} is the standard error of the coefficient, s^2 is the residual sum of squares divided by n - k - 1, F is the CDF of Student's t distribution, n is the number of observations and k is the number of features.
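
A NumPy/SciPy sketch of this calculation on made-up data, using the same design-matrix convention as the normal-equation example above:

```python
import numpy as np
from scipy import stats

# Made-up data: 20 observations, 2 features.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X_raw[:, 0] + 0.1 * X_raw[:, 1] + rng.normal(scale=0.5, size=20)

n, k = X_raw.shape
X = np.column_stack([np.ones(n), X_raw])      # add intercept column

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

residuals = y - X @ beta
s2 = residuals @ residuals / (n - k - 1)      # residual variance
se = np.sqrt(s2 * np.diag(XtX_inv))           # standard errors of the coefficients

t_stats = beta / se
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n - k - 1))
print(p_values)
```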