The main goal of this document is to provide a refresher on regression methods and model selection in statistical learning. This guide also introduces several useful R functions that are common in machine learning.
A second goal of this document is to provide an R code companion for the ‘prostate cancer’ dataset that is introduced in The Elements of Statistical Learning (ESL). The dataset can be obtained here: https://web.stanford.edu/~hastie/ElemStatLearn/data.html.
The original dataset comes from a study by Stamey et al. (1989). Although real-world data sets may cover difficult topics, they are often the ones with the highest utility for case studies.
In linear regression, we assumed that we can create a model of the form:
\[Y=\beta_{0}+\sum_{j=1}^{p}\beta_{j}X_{j}\]
Recall for least squares regression, we wish to minimize the Residual Sum of Squares
\[RSS(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X} \beta)\]
To minimize RSS with respect to \(\beta\), we use calculus to find the critical points (set the first derivative to zero). Note that this is truly a minimization, since RSS is a quadratic (and hence convex) function of \(\beta\).
This gives
\[\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0\]
Hence, assuming \(\mathbf{X}^{T}\mathbf{X}\) is non-singular, we obtain the unique solution:
\[\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X^{T}}\mathbf{y}\]
Thus, we can form our model with our fitted values:
\[\hat Y = \hat \beta_{0}+\sum_{j=1}^{p}X_{j}\hat \beta_{j}\]
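As a quick sanity check, the closed-form solution can be computed directly and compared with R's lm() function. The following is a minimal sketch using simulated data (the prostate data is loaded later in this document), so all variable names here are illustrative:
# Minimal sketch on simulated data: beta-hat = (X'X)^{-1} X'y computed by hand
set.seed(42)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)   # true coefficients: 1, 2, -0.5
X  <- cbind(1, x1, x2)                   # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat
coef(lm(y ~ x1 + x2))                    # should agree up to numerical error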
Linear regression requires several assumptions. One important assumption is that the outcomes \(y_{i}\) are uncorrelated and have constant variance \(\sigma^{2}\). We also assume that the \(x_{i}\) are fixed (non-random).
We have a set of predictors \(\mathbf{X}^{T}=(X_{1},X_{2},\ldots,X_{p})\) and an outcome of interest \(\mathbf{Y}\). We wish to form a model for predicting the response \(\mathbf{Y}\). Our model includes a coefficient, \(\beta_{j}\), for each predictor that we include, as well as an intercept term, \(\beta_{0}\).
The response of interest (\(\mathbf{Y}\)) in this data set is lpsa (log prostate specific antigen). The predictors are log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason) and percentage of Gleason scores 4 or 5 (pgg45).
One key concept of Machine learning is that of “Train and Test sets”. Typically, researchers split the dataset themselves in order to obtain a train and test set. The prostate data set comes prelabeled with which observations should belong to the training set and which belong to the test set. The test set will allow us to determine how well our model performs on data that it was not trained on. It is key that test data is NOT used to build the initial model!
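For data sets that do not come with a predefined split, researchers typically create one at random. The snippet below is a hypothetical sketch (the data frame name df and the 70/30 proportion are illustrative choices, not part of the prostate example):
# Hypothetical example: randomly assign about 70% of the rows of a data frame df to training
set.seed(123)
train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]   # everything not sampled goes to the test set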
Note the following: \(N\) refers to the number of examples, while \(p\) refers to the number of predictors included in the model.
In Part 2, we will move on to more advanced regression topics such as shrinkage. In that chapter we will learn about other types of regression such as Ridge Regression.
In the final Part, we will discuss more advanced techniques such as the Lasso, Principal Component Regression and Partial Least squares.
options(digits=2)
suppressMessages(library(tidyverse)) # Used for piping %>%
prostate_data=read.table(file='prosdats.txt',header=FALSE) # Read the data in.
names(prostate_data)=c('id','lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45','lpsa', 'train') # Read in the names manually
prostate_data=prostate_data%>% # We don't need to consider ID variable so we can drop it.
dplyr::select(-(id))
prostate_data_train=prostate_data%>% # We only wish to keep the training data to model.
filter(train==TRUE)
prostate_data_test=prostate_data%>% # Keep the test observations for later evaluation.
filter(train==FALSE)
# Reorder the variables for better visualization. The following pairs plot should look similar to the pairs plot in ESL page 3.
prostate_data_train=prostate_data_train[ ,c('lpsa','lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]
predictors_train=prostate_data_train[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]
predictors_test=prostate_data_test[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]
dim(prostate_data_train)
## [1] 67 9
dim(prostate_data_test)
## [1] 30 10
Our training data set has 67 observations (N=67). Our test set has 30 observations. (The test data frame still carries the train indicator column, which is why it shows 10 columns.)
Often, researchers are not simply interested in how the predictors relate to an outcome, but also in the correlations among the predictors themselves. One way to visualize these relationships is known as a pairs plot.
pairs(prostate_data_train,col="darkorchid3")
Notice that lcavol, lweight and age appear to be positively correlated with lpsa.
Although most researchers focus on building a model and deciding which predictors should be included, it is also wise to consider the correlations among the predictors themselves, as seen in the pairs plot above.
cor(prostate_data_train)
## lpsa lcavol lweight age lbph svi lcp gleason pgg45
## lpsa 1.00 0.733 0.485 0.23 0.263 0.56 0.489 0.342 0.448
## lcavol 0.73 1.000 0.300 0.29 0.063 0.59 0.692 0.426 0.483
## lweight 0.49 0.300 1.000 0.32 0.437 0.18 0.157 0.024 0.074
## age 0.23 0.286 0.317 1.00 0.287 0.13 0.173 0.366 0.276
## lbph 0.26 0.063 0.437 0.29 1.000 -0.14 -0.089 0.033 -0.030
## svi 0.56 0.593 0.181 0.13 -0.139 1.00 0.671 0.307 0.481
## lcp 0.49 0.692 0.157 0.17 -0.089 0.67 1.000 0.476 0.663
## gleason 0.34 0.426 0.024 0.37 0.033 0.31 0.476 1.000 0.757
## pgg45 0.45 0.483 0.074 0.28 -0.030 0.48 0.663 0.757 1.000
The correlations support our hypothesis based on the pairs plot.
It can be shown by deriving the variance-covariance matrix that \(Var(\hat \beta)=(\mathbf{X}^{T}\mathbf{X})^{-1} \sigma^{2}\)
The most common estimate of \(\sigma^{2}\) is \(\hat \sigma^{2}=\frac{1}{N-p-1} \sum_{i=1}^{N}(y_{i}-\hat y_{i})^{2}\). Note that this is an unbiased estimator.
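To make these two formulas concrete, here is a minimal sketch (again on simulated data, with illustrative variable names) checking that RSS/(N-p-1) matches the squared residual standard error reported by summary(), and that \((\mathbf{X}^{T}\mathbf{X})^{-1}\hat \sigma^{2}\) matches vcov():
# Sketch: unbiased estimate of sigma^2 and the variance-covariance matrix of beta-hat
set.seed(7)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 - x2 + rnorm(n, sd = 1.5)
fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
sigma2_hat <- sum(resid(fit)^2) / (n - 2 - 1)  # RSS / (N - p - 1), here p = 2
c(sigma2_hat, summary(fit)$sigma^2)            # the two estimates should agree
vcov(fit)                                      # equals (X'X)^{-1} * sigma2_hat
solve(t(X) %*% X) * sigma2_hat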
To draw conclusions from our model, we must also assume that it is valid, i.e. that the response really can be represented as a linear combination of our predictors. That is, we assume that Y can be written in the form \(Y=\beta_{0}+\sum_{j=1}^{p}X_{j}\beta_{j}+\epsilon\), where \(\epsilon\) is assumed to be \(N(0,\sigma^{2})\). With these assumptions, it can be shown using properties of mathematical statistics that \(\hat \beta \sim N(\beta,(\mathbf{X}^{T}\mathbf{X})^{-1} \sigma^{2})\)
It follows that \((N-p-1)\hat \sigma^{2} \sim \sigma^{2}\chi^{2}_{N-p-1}\) (a chi-squared distribution with \(N-p-1\) degrees of freedom); in addition, \(\hat \beta\) and \(\hat \sigma^{2}\) are independent.
Now that we have laid out these assumptions, we can discuss how they can be used to form hypothesis tests.
To test whether a coefficient of the linear model is zero, i.e. \(\beta_{j}=0\), we form the standardized Z score,
\(Z_{j}= \frac{ \hat \beta_{j}}{\hat \sigma \sqrt{v_{j}}}\) Note that \(v_{j}\) is the jth diagonal of \((\mathbf{X}^{T}\mathbf{X})^{-1}\).
If we assume that \(\beta_{j}=0\), then \(z_{j}\) is t distributed with \(N-p-1\) degrees of freedom. (If we actually knew \(\sigma\), we could use a normal distribution instead.) If the absolute value of \(z_{j}\) is large, we reject the null hypothesis and conclude that the predictor is important.
To test whether a predictor with \(k\) levels can be excluded from the model, we test the hypothesis that all of its associated coefficients (the coefficients of its dummy variables if it is categorical) can be set to zero simultaneously. This can be accomplished using the F statistic,
\(F=\frac{\frac{RSS_{0}-RSS_{1}}{p_{1}-p_{0}}}{\frac{RSS_{1}}{N-p_{1}-1}}\)
The F statistic measures the change in residual sum of squares per additional parameter in the larger model, normalized by an estimate of \(\sigma^{2}\).
Under the typical Gaussian assumptions, if the null hypothesis that the smaller model is correct holds, then the F statistic has an \(F_{p_{1}-p_{0},\,N-p_{1}-1}\) distribution. For large values of N, the quantiles of this F distribution approach those of \(\frac{\chi^{2}_{p_{1}-p_{0}}}{p_{1}-p_{0}}\).
Hence, we can form a \(1-2\alpha\) confidence interval for each \(\beta_{j}\), namely
\[(\hat \beta_{j}-z^{(1-\alpha)} v_{j}^{0.5}\hat \sigma,\hat \beta_{j}+z^{(1-\alpha)}v_{j}^{0.5}\hat \sigma)\]
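Both the Z score and the interval above can be computed by hand from a fitted model. The sketch below reuses the simulated fit from the previous sketch and uses the normal quantile, as in the formula (confint() uses t quantiles, so its intervals are slightly wider):
# Sketch: manual standard errors, Z scores and a 1 - 2*alpha interval for a fitted lm
fit <- lm(y ~ x1 + x2)                     # simulated data from the sketch above
X   <- model.matrix(fit)
p   <- ncol(X) - 1
sigma_hat <- sqrt(sum(resid(fit)^2) / (nrow(X) - p - 1))
v   <- diag(solve(t(X) %*% X))             # v_j, the diagonal of (X'X)^{-1}
z   <- coef(fit) / (sigma_hat * sqrt(v))   # matches the t values in summary(fit)
alpha <- 0.025                             # gives a 95% interval
cbind(lower = coef(fit) - qnorm(1 - alpha) * sqrt(v) * sigma_hat,
      upper = coef(fit) + qnorm(1 - alpha) * sqrt(v) * sigma_hat)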
It is often wise to standardize predictors before analysis. This is especially important when predictors are on different scales. We standardize each predictor by subtracting its mean and dividing by its standard deviation; we do NOT scale the response variable. In larger datasets, standardization also speeds up learning algorithms.
#Scale function to standardize mean and sd
predictors_scaled=as.data.frame(scale(predictors_train))
prostate_data_train=data.frame(prostate_data_train$lpsa,predictors_scaled)
names(prostate_data_train)= c('lpsa', 'lcavol', 'lweight', 'age', 'lbph' , 'svi' , 'lcp', 'gleason', 'pgg45')
predictors_scaled_test=as.data.frame(scale(predictors_test)) # Note: scales the test predictors with their own means/sds; reusing the training means/sds is the more standard choice.
prostate_data_test=data.frame(prostate_data_test$lpsa,predictors_scaled_test)
names(prostate_data_test)= c('lpsa', 'lcavol', 'lweight', 'age', 'lbph' , 'svi' , 'lcp', 'gleason', 'pgg45')
The first model we fit is a model using least squares regression and all predictors.
prostate_linear=lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate_data_train)
summary(prostate_linear)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## gleason + pgg45, data = prostate_data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6487 -0.3415 -0.0542 0.4494 1.4868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4523 0.0870 28.18 < 2e-16 ***
## lcavol 0.7164 0.1335 5.37 1.5e-06 ***
## lweight 0.2926 0.1064 2.75 0.0079 **
## age -0.1425 0.1021 -1.40 0.1681
## lbph 0.2120 0.1031 2.06 0.0443 *
## svi 0.3096 0.1254 2.47 0.0165 *
## lcp -0.2890 0.1548 -1.87 0.0670 .
## gleason -0.0209 0.1426 -0.15 0.8839
## pgg45 0.2773 0.1596 1.74 0.0875 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.71 on 58 degrees of freedom
## Multiple R-squared: 0.694, Adjusted R-squared: 0.652
## F-statistic: 16.5 on 8 and 58 DF, p-value: 2.04e-12
The above table summarizes the basic information of our linear model. The most important predictor appears to be lcavol, which makes sense. lweight, lbph and svi also appear significant. A positive coefficient estimate indicates a positive association with the response, holding the other predictors fixed.
Suppose we wish to compare the above full model to a reduced model containing only the significant predictors indicated above; that is, we remove age, lcp, gleason and pgg45.
In that case, \(p_{1}=9\) and \(p_{0}=5\) and \(N=67\). (Note that the intercept counts as a predictor).
To calculate the Residual Sum of Squares (RSS) for the training data we use the predict function. RSS for the full model is about 29.
predictions=predict(prostate_linear,newdata=data.frame(prostate_data_train))
RSS1=sum((predictions-prostate_data_train$lpsa)^2)
prostate_linear_reduced=lm(lpsa~lcavol+lweight+lbph+svi,data=prostate_data_train)
summary(prostate_linear_reduced)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + lbph + svi, data = prostate_data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8709 -0.3903 -0.0172 0.5676 1.4227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4523 0.0889 27.59 < 2e-16 ***
## lcavol 0.6282 0.1150 5.46 8.9e-07 ***
## lweight 0.2568 0.1052 2.44 0.018 *
## lbph 0.2049 0.1031 1.99 0.051 .
## svi 0.2822 0.1148 2.46 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.73 on 62 degrees of freedom
## Multiple R-squared: 0.659, Adjusted R-squared: 0.637
## F-statistic: 30 on 4 and 62 DF, p-value: 6.91e-14
predictions_reduced=predict(prostate_linear_reduced,newdata=data.frame(prostate_data_train))
RSS0=sum((predictions_reduced-prostate_data_train$lpsa)^2)
Thus we have \(RSS_{1}\approx 29\) and \(RSS_{0}\approx 33\).
Hence we can calculate \(F=\frac{\frac{(33-29)}{(9-5)}}{\frac{29}{67-9}}=2\)
But \(Pr(F_{4,58} > 2)\approx 0.1\) (this tail probability can be computed with the pf function in R; the unrounded RSS values give a slightly smaller F statistic and a tail probability of about 0.15). Recall that the null hypothesis is that the smaller, reduced model is correct. Since we cannot reject the null hypothesis, it would be appropriate to work with the reduced model.
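The same comparison can be carried out without hand arithmetic. A short sketch using the unrounded RSS values computed above, the pf function for the tail probability, and anova() for the nested-model F test:
# The F statistic from the unrounded RSS values, and its upper-tail probability
F_stat <- ((RSS0 - RSS1) / (9 - 5)) / (RSS1 / (67 - 9))
pf(F_stat, df1 = 4, df2 = 58, lower.tail = FALSE)
# anova() performs the same nested-model F test directly and should agree:
anova(prostate_linear_reduced, prostate_linear)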
predict_test=predict(prostate_linear,newdata=data.frame(prostate_data_test)) # Predictions of the full model on the held-out test set
RSS_test=sum((predict_test-prostate_data_test$lpsa)^2) # Test-set residual sum of squares
Note that the Gauss-Markov theorem shows that the least squares estimate of \(\beta\) has the smallest variance among all linear unbiased estimators.
However, least squares is not without issues. First, least squares estimates often have low bias but high variance; prediction accuracy can sometimes be improved by shrinking coefficients or setting some to zero, trading a little bias for a larger reduction in variance.
The second issue is interpretation: when the number of predictors is large, reasoning out an explanation of the coefficients becomes difficult.
The next method we consider is “Best Subset Selection”. In this method, for each subset size \(k\) ranging from 0 to \(p\), we find the subset of \(k\) predictors that gives the smallest RSS.
Adding extra predictors comes with a cost. Although extra predictors will always lower the training error, they complicate the model and increase its variance. Including too many extra predictors can also hurt performance on the test set. Thus, researchers need to decide between more and less complicated models. The deciding factor usually comes down to what is known as the “Bias-Variance Tradeoff”.
library(caret)
library(leaps)
best_subsets <- regsubsets(lpsa~., data = prostate_data_train, nvmax = 8,nbest=1)
summary(best_subsets)
## Subset selection object
## Call: regsubsets.formula(lpsa ~ ., data = prostate_data_train, nvmax = 8,
## nbest = 1)
## 8 Variables (and intercept)
## Forced in Forced out
## lcavol FALSE FALSE
## lweight FALSE FALSE
## age FALSE FALSE
## lbph FALSE FALSE
## svi FALSE FALSE
## lcp FALSE FALSE
## gleason FALSE FALSE
## pgg45 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## lcavol lweight age lbph svi lcp gleason pgg45
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " "*" " " " " " "
## 4 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 5 ( 1 ) "*" "*" " " "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
For example, if we consider the best possible subset containing two predictors, our model would contain lcavol and lweight. A subset size of two can be justified by the “one standard error” rule, whereby we choose the simplest model whose estimated prediction error is within one standard error of the minimum. The stars indicate which variables are included in the best subset of each size. The argument nbest returns the best nbest subsets of each size up to nvmax; only the single best model of each size is shown here, but you can play around with the parameter to see more models.
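In practice, the one standard error rule is applied to cross-validated estimates of prediction error. The following is a minimal sketch of how the rule could select a subset size, assuming a 10-fold split of the training data (the seed and the number of folds are arbitrary choices):
# Sketch: estimate the prediction error of each subset size by 10-fold cross-validation
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(prostate_data_train)))
cv_err <- matrix(NA, nrow = k, ncol = 8)
for (fold in 1:k) {
  train_fold <- prostate_data_train[folds != fold, ]
  test_fold  <- prostate_data_train[folds == fold, ]
  fit <- regsubsets(lpsa ~ ., data = train_fold, nvmax = 8)
  for (size in 1:8) {
    coefs  <- coef(fit, id = size)
    X_test <- model.matrix(lpsa ~ ., data = test_fold)[, names(coefs), drop = FALSE]
    cv_err[fold, size] <- mean((test_fold$lpsa - X_test %*% coefs)^2)
  }
}
mean_err <- colMeans(cv_err)
se_err   <- apply(cv_err, 2, sd) / sqrt(k)
best     <- which.min(mean_err)
# Smallest subset size whose mean CV error is within one standard error of the best:
min(which(mean_err <= mean_err[best] + se_err[best]))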
linear_best_subset=lm(lpsa~lcavol+lweight,data=prostate_data_train)
summary(linear_best_subset)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = prostate_data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.589 -0.442 0.013 0.526 1.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4523 0.0930 26.37 < 2e-16 ***
## lcavol 0.7799 0.0982 7.94 4.1e-11 ***
## lweight 0.3519 0.0982 3.58 0.00066 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.76 on 64 degrees of freedom
## Multiple R-squared: 0.615, Adjusted R-squared: 0.603
## F-statistic: 51.1 on 2 and 64 DF, p-value: 5.54e-14
# We need the RSS of the intercept-only model first (subset size k = 0).
prostate_intercept=lm(lpsa~1,data=prostate_data_train)
intercept_pred=predict(prostate_intercept,newdata=prostate_data_train) # fitted values of the intercept-only model
RSS_intercept=sum((intercept_pred-prostate_data_train$lpsa)^2)
plot(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),xlim=c(0,8),ylim=c(0,100),xlab="Subset Size k",ylab="Residual Sum-of-Squares",col='red')
lines(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),col='red')
As discussed before, too many parameters can complicate a model. Thus, we may wish to penalize the size of the coefficient estimates. Shrinkage methods force the coefficient sizes down (this idea is also known as “weight decay” in deep learning).
These methods require a shrinkage or complexity parameter, often denoted as \(\lambda\). \((\lambda \ge 0)\)
One popular form of shrinkage is “Ridge Regression”.
Instead of simply minimizing the regular RSS, in Ridge Regression we form the following estimate:
\[\hat \beta^{ridge} = \underset{\beta}{\operatorname{argmin}}\left\{\sum_{i=1}^{N}\Big(y_{i}-\beta_{0}-\sum_{j=1}^{p}x_{ij}\beta_{j}\Big)^{2}+\lambda \sum_{j=1}^{p}\beta_{j}^{2}\right\}\]
Notice that we penalize large coefficient values; this pushes, or “shrinks”, the estimated coefficients toward zero.
In other words, the above is equivalent to finding the coefficients that minimize the following (with the inputs centered, so that the intercept is left out of the penalty): \[RSS(\lambda)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^{T}\beta\]
Using calculus as before, this can be shown to have the solution
\[\hat \beta^{ridge}=(\mathbf{X}^{T}\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y}\]
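As a sanity check on the closed form, the ridge coefficients can be computed directly on the standardized training predictors. Note that glmnet, used below, parameterizes its penalty differently (its Gaussian objective scales the RSS by 1/(2N)), so its coefficients at a given lambda will not match this formula exactly:
# Sketch: ridge coefficients from the closed form; with centered inputs the intercept is mean(y)
X <- as.matrix(predictors_scaled)        # standardized (hence centered) training predictors
y <- prostate_data_train$lpsa
lambda <- 0.5                            # illustrative value
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% (y - mean(y)))
beta_ridge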
suppressMessages(library(glmnet))
# Choose an intermediate value of lambda, say 0.5
model_ridge <- glmnet(x=as.matrix(predictors_scaled), y=as.matrix(prostate_data_train$lpsa), alpha = 0, lambda = 0.5, standardize = TRUE)
coef(model_ridge)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2.452
## lcavol 0.410
## lweight 0.246
## age -0.037
## lbph 0.163
## svi 0.228
## lcp 0.021
## gleason 0.045
## pgg45 0.129
Notice the coefficients have shrunk when compared to the original linear model! If we increase lambda, we would shrink the coefficients even more.
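To see the effect of increasing lambda, one option is to refit over a small grid of values and inspect how the coefficients move toward zero (the grid below is an arbitrary illustrative choice; in practice lambda is usually chosen by cross-validation, e.g. with cv.glmnet()):
# Sketch: coefficients shrink toward zero as lambda grows
lambda_grid <- c(10, 2, 0.5, 0.1)        # decreasing sequence, as glmnet prefers
ridge_path <- glmnet(x = as.matrix(predictors_scaled),
                     y = as.matrix(prostate_data_train$lpsa),
                     alpha = 0, lambda = lambda_grid, standardize = TRUE)
coef(ridge_path)                         # one column of coefficients per lambda value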
In the next tutorial, we will cover the Lasso and more advanced techniques.