Logistic regression models a Bernoulli distributed response variable in terms of linear combinations of explanatory variables.
This extends least squares regression to the case where the response variable captures a “success” or “failure” type outcome.
If \(Y \sim \mbox{Bernoulli}(p)\), then its pmf is:
\[ \begin{aligned} f(y; p) & = p^{y} (1-p)^{1-y} \\ & = \exp\left\{ \log\left(\frac{p}{1-p}\right)y + \log(1-p) \right\} \end{aligned} \]
In exponential family distribution (EFD) notation,
\[ \eta(p) = \log\left(\frac{p}{1-p}\right) \equiv {\operatorname{logit}}(p), \]
\(A(\eta(p)) = \log(1 + \exp(\eta)) = -\log(1-p)\), and \(y\) is the sufficient statistic.
\(({\boldsymbol{X}}_1, Y_1), ({\boldsymbol{X}}_2, Y_2), \ldots, ({\boldsymbol{X}}_n, Y_n)\) are distributed so that \(Y_i | {\boldsymbol{X}}_i \sim \mbox{Bernoulli}(p_i)\), where \(\{Y_i | {\boldsymbol{X}}_i\}_{i=1}^n\) are jointly independent and
\[ {\operatorname{logit}}\left({\operatorname{E}}[Y_i | {\boldsymbol{X}}_i]\right) = \log\left( \frac{\Pr(Y_i = 1 | {\boldsymbol{X}}_i)}{\Pr(Y_i = 0 | {\boldsymbol{X}}_i)} \right) = {\boldsymbol{X}}_i {\boldsymbol{\beta}}. \]
From this it follows that
\[ p_i = \frac{\exp\left({\boldsymbol{X}}_i {\boldsymbol{\beta}}\right)}{1 + \exp\left({\boldsymbol{X}}_i {\boldsymbol{\beta}}\right)}. \]
The MLE \(\hat{{\boldsymbol{\beta}}}\) is obtained by maximizing the log-likelihood:
\[ \begin{aligned} \ell\left({\boldsymbol{\beta}}; {\boldsymbol{y}}, {\boldsymbol{X}}\right) & = \sum_{i=1}^n \log\left(\frac{p_i}{1-p_i}\right) y_i + \log(1-p_i) \\ & = \sum_{i=1}^n ({\boldsymbol{x}}_i {\boldsymbol{\beta}}) y_i - \log\left(1 + \exp\left({\boldsymbol{x}}_i {\boldsymbol{\beta}}\right) \right) \end{aligned} \]
There is no closed-form solution, so the MLE is computed by iteratively reweighted least squares (IRLS). Initialize \({\boldsymbol{\beta}}^{(1)}\).
For each iteration \(t=1, 2, \ldots\), set \[ p_i^{(t)} = {\operatorname{logit}}^{-1}\left( {\boldsymbol{x}}_i {\boldsymbol{\beta}}^{(t)} \right), \ \ \ \ z_i^{(t)} = {\operatorname{logit}}\left(p_i^{(t)}\right) + \frac{y_i - p_i^{(t)}}{p_i^{(t)}(1-p_i^{(t)})} \] and let \({\boldsymbol{z}}^{(t)} = \left\{z_i^{(t)}\right\}_{i=1}^n\).
Form \(n \times n\) diagonal matrix \(\boldsymbol{W}^{(t)}\) with \((i, i)\) entry equal to \(p_i^{(t)}(1-p_i^{(t)})\).
Obtain \({\boldsymbol{\beta}}^{(t+1)}\) by performing the weighted least squares regression (see GLS from earlier)
\[ {\boldsymbol{\beta}}^{(t+1)} = \left({\boldsymbol{X}}^T \boldsymbol{W}^{(t)} {\boldsymbol{X}}\right)^{-1} {\boldsymbol{X}}^T \boldsymbol{W}^{(t)} {\boldsymbol{z}}^{(t)}. \]
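A minimal R sketch of these IRLS updates may help make the algorithm concrete. The function name irls_logistic, the starting value, and the convergence tolerance are choices made here for illustration; the sketch assumes X is an n x p design matrix that already includes an intercept column and y is a 0/1 response vector.

# Sketch of IRLS for logistic regression (illustration only)
irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                       # initialize beta^(1)
  for (t in seq_len(maxit)) {
    eta <- drop(X %*% beta)                     # linear predictor x_i beta^(t)
    p <- 1 / (1 + exp(-eta))                    # p_i^(t) = logit^{-1}(x_i beta^(t))
    w <- p * (1 - p)                            # diagonal entries of W^(t)
    z <- eta + (y - p) / w                      # working response z_i^(t)
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# Sanity check against R's glm(), assuming X and y exist:
# cbind(irls = irls_logistic(X, y), glm = coef(glm(y ~ X - 1, family = binomial)))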
For exponential family distribution response variables, the generalized linear model is
\[ \eta\left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}\]
where \(\eta(\theta)\) is the function that maps the expected value \(\theta\) to the natural parameter. This is called the canonical link function in the GLM setting.
The iteratively reweighted least squares algorithm presented above for calculating (local) maximum likelihood estimates of \({\boldsymbol{\beta}}\) generalizes to a large class of exponential family distribution response variables.
glm() Function in R
> mydata <-
+ read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> dim(mydata)
[1] 400 4
> head(mydata)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
Data and analysis courtesy of http://www.ats.ucla.edu/stat/r/dae/logit.htm.
> apply(mydata, 2, mean)
admit gre gpa rank
0.3175 587.7000 3.3899 2.4850
> apply(mydata, 2, sd)
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
>
> table(mydata$admit, mydata$rank)
1 2 3 4
0 28 97 93 55
1 33 54 28 12
> ggplot(data=mydata) +
+ geom_boxplot(aes(x=as.factor(admit), y=gre))
> ggplot(data=mydata) +
+ geom_boxplot(aes(x=as.factor(admit), y=gpa))
> mydata$rank <- factor(mydata$rank, levels=c(1, 2, 3, 4))
> myfit <- glm(admit ~ gre + gpa + rank,
+ data = mydata, family = "binomial")
> myfit
Call: glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Coefficients:
(Intercept) gre gpa rank2
-3.989979 0.002264 0.804038 -0.675443
rank3 rank4
-1.340204 -1.551464
Degrees of Freedom: 399 Total (i.e. Null); 394 Residual
Null Deviance: 500
Residual Deviance: 458.5 AIC: 470.5
> summary(myfit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
> anova(myfit, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: admit
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 399 499.98
gre 1 13.9204 398 486.06 0.0001907 ***
gpa 1 5.7122 397 480.34 0.0168478 *
rank 3 21.8265 394 458.52 7.088e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> anova(myfit, test="LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: admit
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 399 499.98
gre 1 13.9204 398 486.06 0.0001907 ***
gpa 1 5.7122 397 480.34 0.0168478 *
rank 3 21.8265 394 458.52 7.088e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> cuse <-
+ read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
+ header=TRUE)
> dim(cuse)
[1] 16 5
> head(cuse)
age education wantsMore notUsing using
1 <25 low yes 53 6
2 <25 low no 10 4
3 <25 high yes 212 52
4 <25 high no 50 10
5 25-29 low yes 60 14
6 25-29 low no 19 10
Data and analysis courtesy of http://data.princeton.edu/R/glms.html.
Note that in this data set there are multiple observations per explanatory variable configuration.
The last two columns of the data frame count the successes and failures per configuration.
> head(cuse)
age education wantsMore notUsing using
1 <25 low yes 53 6
2 <25 low no 10 4
3 <25 high yes 212 52
4 <25 high no 50 10
5 25-29 low yes 60 14
6 25-29 low no 19 10
When this is the case, we call the glm() function slightly differently.
> myfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
+ data=cuse, family = binomial)
> myfit
Call: glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
family = binomial, data = cuse)
Coefficients:
(Intercept) age25-29 age30-39 age40-49
-0.8082 0.3894 0.9086 1.1892
educationlow wantsMoreyes
-0.3250 -0.8330
Degrees of Freedom: 15 Total (i.e. Null); 10 Residual
Null Deviance: 165.8
Residual Deviance: 29.92 AIC: 113.4
> summary(myfit)
Call:
glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
family = binomial, data = cuse)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5148 -0.9376 0.2408 0.9822 1.7333
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8082 0.1590 -5.083 3.71e-07 ***
age25-29 0.3894 0.1759 2.214 0.02681 *
age30-39 0.9086 0.1646 5.519 3.40e-08 ***
age40-49 1.1892 0.2144 5.546 2.92e-08 ***
educationlow -0.3250 0.1240 -2.620 0.00879 **
wantsMoreyes -0.8330 0.1175 -7.091 1.33e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 165.772 on 15 degrees of freedom
Residual deviance: 29.917 on 10 degrees of freedom
AIC: 113.43
Number of Fisher Scoring iterations: 4
> anova(myfit)
Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(using, notUsing)
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 15 165.772
age 3 79.192 12 86.581
education 1 6.162 11 80.418
wantsMore 1 50.501 10 29.917
See http://data.princeton.edu/R/glms.html for more on fitting logistic regression to this data set.
A number of interesting choices are made that reveal more about the data.
The generalized linear model (GLM) builds from OLS and GLS to allow for the case where \(Y | {\boldsymbol{X}}\) is distributed according to an exponential family distribution. The estimated model is
\[ g\left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}\]
where \(g(\cdot)\) is called the link function. This model is typically fit by numerical methods to calculate the maximum likelihood estimate of \({\boldsymbol{\beta}}\).
Recall that if \(Y\) follows an EFD then it has pdf of the form
\[f(y ; \boldsymbol{\theta}) = h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) - A(\boldsymbol{\eta}) \right\} \]
where \(\boldsymbol{\theta}\) is a vector of parameters, \(\{T_k(y)\}\) are sufficient statistics, and \(A(\boldsymbol{\eta})\) is the cumulant generating function.
The functions \(\eta_k(\boldsymbol{\theta})\) for \(k=1, \ldots, d\) map the usual parameters \({\boldsymbol{\theta}}\) (often moments of the rv \(Y\)) to the natural parameters or canonical parameters.
\(\{T_k(y)\}\) are sufficient statistics for \(\{\eta_k\}\) due to the factorization theorem.
\(A(\boldsymbol{\eta})\) is sometimes called the log normalizer because
\[A(\boldsymbol{\eta}) = \log \int h(y) \exp \left\{ \sum_{k=1}^d \eta_k(\boldsymbol{\theta}) T_k(y) \right\} dy.\]
A natural single parameter EFD simplifies to the scenario where \(d=1\) and \(T(y) = y\)
\[f(y ; \eta) = h(y) \exp \left\{ \eta(\theta) y - A(\eta(\theta)) \right\} \]
where without loss of generality we can write \({\operatorname{E}}[Y] = \theta\).
The family of distributions for which GLMs are most typically developed are dispersion EFDs. An example of a dispersion EFD that extends the natural single parameter EFD is
\[f(y ; \eta) = h(y, \phi) \exp \left\{ \frac{\eta(\theta) y - A(\eta(\theta))}{\phi} \right\} \]
where \(\phi\) is the dispersion parameter.
Let \(Y \sim \mbox{Normal}(\mu, \sigma^2)\). Then:
\[ \theta = \mu, \eta(\mu) = \mu \]
\[ \phi = \sigma^2 \]
\[ A(\mu) = \frac{\mu^2}{2} \]
\[ h(y, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{y^2}{\sigma^2}} \]
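As a quick check of these components, the Normal pdf factors directly into the dispersion EFD form above:
\[ \begin{aligned} f(y; \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\} \\ & = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{y^2}{\sigma^2}} \exp\left\{ \frac{\mu y - \mu^2/2}{\sigma^2} \right\} \end{aligned} \]
which identifies \(h(y, \sigma^2)\), \(\eta(\mu) = \mu\), \(A(\mu) = \mu^2/2\), and \(\phi = \sigma^2\) as stated.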
There has been a very broad development of GLMs and extensions. A common setting for introducing GLMs is the dispersion EFD with a general link function \(g(\cdot)\).
See the classic text Generalized Linear Models, by McCullagh and Nelder, for such a development.
Random: The particular exponential family distribution. \[ Y \sim f(y ; \eta, \phi) \]
Systematic: The determination of each \(\eta_i\) from \({\boldsymbol{X}}_i\) and \({\boldsymbol{\beta}}\). \[ \eta_i = {\boldsymbol{X}}_i {\boldsymbol{\beta}}\]
Parametric Link: The connection between \(E[Y_i|{\boldsymbol{X}}_i]\) and \({\boldsymbol{X}}_i {\boldsymbol{\beta}}\). \[ g(E[Y_i|{\boldsymbol{X}}_i]) = {\boldsymbol{X}}_i {\boldsymbol{\beta}}\]
Even though the link function \(g(\cdot)\) can be considered in a fairly general framework, the canonical link function \(\eta(\cdot)\) is often utilized.
The canonical link function is the function that maps the expected value to the natural parameter.
In this case, \(Y | {\boldsymbol{X}}\) is distributed according to an exponential family distribution with
\[ \eta \left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}. \]
Given the model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}]\right) = {\boldsymbol{X}}{\boldsymbol{\beta}}\), the EFD should be fully parameterized. The Newton-Raphson method or Fisher’s scoring method can be utilized to find the MLE of \({\boldsymbol{\beta}}\).
Newton-Raphson method: Initialize \({\boldsymbol{\beta}}^{(0)}\). For \(t = 1, 2, \ldots\)
Calculate the score \(s({\boldsymbol{\beta}}^{(t)}) = \nabla \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\) and observed Fisher information \[H({\boldsymbol{\beta}}^{(t)}) = - \nabla \nabla^T \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\]. Note that the observed Fisher information is also the negative Hessian matrix.
Update \({\boldsymbol{\beta}}^{(t+1)} = {\boldsymbol{\beta}}^{(t)} + H({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)})\).
Iterate until convergence, and set \(\hat{{\boldsymbol{\beta}}} = {\boldsymbol{\beta}}^{(\infty)}\).
Fisher’s scoring method: Initialize \({\boldsymbol{\beta}}^{(0)}\). For \(t = 1, 2, \ldots\)
Calculate the score \(s({\boldsymbol{\beta}}^{(t)}) = \nabla \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}}\) and expected Fisher information \[I({\boldsymbol{\beta}}^{(t)}) = - {\operatorname{E}}\left[\nabla \nabla^T \ell({\boldsymbol{\beta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \mid_{{\boldsymbol{\beta}}= {\boldsymbol{\beta}}^{(t)}} \right].\]
Update \({\boldsymbol{\beta}}^{(t+1)} = {\boldsymbol{\beta}}^{(t)} + I({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)})\).
Iterate until convergence, and set \(\hat{{\boldsymbol{\beta}}} = {\boldsymbol{\beta}}^{(\infty)}\).
When the canonical link function is used, the Newton-Raphson algorithm and Fisher’s scoring algorithm are equivalent.
Exercise: Prove this.
For the canonical link, Fisher’s scoring method can be written as an iteratively reweighted least squares algorithm, as shown earlier for logistic regression. Note that the Fisher information is
\[ I({\boldsymbol{\beta}}^{(t)}) = {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}}\]
where \({\boldsymbol{W}}\) is an \(n \times n\) diagonal matrix with \((i, i)\) entry equal to \({\operatorname{Var}}(Y_i | {\boldsymbol{X}}; {\boldsymbol{\beta}}^{(t)})\).
The score function is
\[ s({\boldsymbol{\beta}}^{(t)}) = {\boldsymbol{X}}^T \left( {\boldsymbol{Y}}- {\operatorname{E}}[{\boldsymbol{Y}}| {\boldsymbol{X}}; {\boldsymbol{\beta}}^{(t)}] \right) \]
and the current coefficient value \({\boldsymbol{\beta}}^{(t)}\) can be written as
\[ {\boldsymbol{\beta}}^{(t)} = ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}}{\boldsymbol{\beta}}^{(t)}. \]
Putting this together we get
\[ {\boldsymbol{\beta}}^{(t)} + I({\boldsymbol{\beta}}^{(t)})^{-1} s({\boldsymbol{\beta}}^{(t)}) = ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{z}}^{(t)} \]
where
\[ {\boldsymbol{z}}^{(t)} = {\boldsymbol{X}}{\boldsymbol{\beta}}^{(t)} + {\boldsymbol{W}}^{-1} \left( {\boldsymbol{Y}}- {\operatorname{E}}[{\boldsymbol{Y}}| {\boldsymbol{X}}; {\boldsymbol{\beta}}^{(t)}] \right). \]
This is a generalization of the iteratively reweighted least squares algorithm we showed earlier for logistic regression.
For the simple dispersion model above, it is typically straightforward to calculate the MLE \(\hat{\phi}\) once \(\hat{{\boldsymbol{\beta}}}\) has been calculated.
Given that \(\hat{{\boldsymbol{\beta}}}\) is a maximum likelihood estimate, we have the following CLT result on its distribution as \(n \rightarrow \infty\):
\[ \sqrt{n} (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}) \stackrel{D}{\longrightarrow} \mbox{MVN}_{p}({\boldsymbol{0}}, \phi ({\boldsymbol{X}}^T {\boldsymbol{W}}{\boldsymbol{X}})^{-1}) \]
The previous CLT gives us the following two approximations for pivotal statistics. The first statistic facilitates getting overall measures of uncertainty on the estimate \(\hat{{\boldsymbol{\beta}}}\).
\[ \hat{\phi}^{-1} (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}})^T ({\boldsymbol{X}}^T \hat{{\boldsymbol{W}}} {\boldsymbol{X}}) (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}) {\; \stackrel{.}{\sim}\;}\chi^2_p \]
This second pivotal statistic allows for performing a Wald test or forming a confidence interval on each coefficient, \(\beta_j\), for \(j=1, \ldots, p\).
\[ \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\phi} [({\boldsymbol{X}}^T \hat{{\boldsymbol{W}}} {\boldsymbol{X}})^{-1}]_{jj}}} {\; \stackrel{.}{\sim}\;}\mbox{Normal}(0,1) \]
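These Wald quantities are what summary() reports for a fitted GLM. A small sketch using the myfit logistic regression object fitted earlier (where \(\hat{\phi} = 1\) for the binomial family):

# Wald z-statistics: estimate divided by standard error
est <- coef(summary(myfit))[, "Estimate"]
se  <- coef(summary(myfit))[, "Std. Error"]
est / se                                              # matches the "z value" column
# Approximate 95% Wald confidence intervals for each coefficient
cbind(lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)
# confint.default(myfit) returns these same Wald intervals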
Let \(\hat{\boldsymbol{\eta}}\) be the estimated natural parameters from a GLM. For example, \(\hat{\boldsymbol{\eta}} = {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}\) when the canonical link function is used.
Let \(\hat{\boldsymbol{\eta}}_n\) be the saturated model where \(Y_i\) is directly used to estimate \(\eta_i\) without model constraints. For example, in the Bernoulli logistic regression model \(\hat{\boldsymbol{\eta}}_n = {\boldsymbol{Y}}\), the observed outcomes.
The deviance for the model is defined to be
\[ D\left(\hat{\boldsymbol{\eta}}\right) = 2 \ell(\hat{\boldsymbol{\eta}}_n; {\boldsymbol{X}}, {\boldsymbol{Y}}) - 2 \ell(\hat{\boldsymbol{\eta}}; {\boldsymbol{X}}, {\boldsymbol{Y}}) \]
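For ungrouped 0/1 data, the saturated model sets \(p_i = y_i\), so its log-likelihood is zero and the deviance reduces to \(-2 \ell(\hat{\boldsymbol{\eta}}; {\boldsymbol{X}}, {\boldsymbol{Y}})\). A quick check in R using the myfit object fitted earlier:

deviance(myfit)                  # residual deviance reported by glm()
-2 * as.numeric(logLik(myfit))   # same value, since the saturated log-likelihood is 0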
Let \({\boldsymbol{X}}_0\) be a subset of \(p_0\) columns of \({\boldsymbol{X}}\) and let \({\boldsymbol{X}}_1\) be a subset of \(p_1\) columns, where \(1 \leq p_0 < p_1 \leq p\). Also, assume that the columns of \({\boldsymbol{X}}_0\) are a subset of \({\boldsymbol{X}}_1\).
Without loss of generality, suppose that \({\boldsymbol{\beta}}_0 = (\beta_1, \beta_2, \ldots, \beta_{p_0})^T\) and \({\boldsymbol{\beta}}_1 = (\beta_1, \beta_2, \ldots, \beta_{p_1})^T\).
Suppose we wish to test \(H_0: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) = {\boldsymbol{0}}\) vs \(H_1: (\beta_{p_0+1}, \beta_{p_0 + 2}, \ldots, \beta_{p_1}) \not= {\boldsymbol{0}}\)
We can form \(\hat{\boldsymbol{\eta}}_0 = {\boldsymbol{X}}_0 \hat{{\boldsymbol{\beta}}}_0\) from the GLM model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}_0]\right) = {\boldsymbol{X}}_0 {\boldsymbol{\beta}}_0\). We can analogously form \(\hat{\boldsymbol{\eta}}_1 = {\boldsymbol{X}}_1 \hat{{\boldsymbol{\beta}}}_1\) from the GLM model \(g\left({\operatorname{E}}[Y | {\boldsymbol{X}}_1]\right) = {\boldsymbol{X}}_1 {\boldsymbol{\beta}}_1\).
The \(2\log\) generalized LRT statistic can then be formed from the difference of the two deviances
\[ 2 \log \lambda({\boldsymbol{X}}, {\boldsymbol{Y}}) = 2 \log \frac{L(\hat{\boldsymbol{\eta}}_1; {\boldsymbol{X}}, {\boldsymbol{Y}})}{L(\hat{\boldsymbol{\eta}}_0; {\boldsymbol{X}}, {\boldsymbol{Y}})} = D\left(\hat{\boldsymbol{\eta}}_0\right) - D\left(\hat{\boldsymbol{\eta}}_1\right) \]
where the null distribution is \(\chi^2_{p_1-p_0}\).
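In R, this difference of deviances is exactly what anova() computes for nested GLM fits. A sketch using the admissions fit from earlier, testing whether the rank coefficients are zero (the name myfit0 is introduced here for the reduced model):

# Reduced (null) model omits rank; the full model is myfit from earlier
myfit0 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial")
deviance(myfit0) - deviance(myfit)            # 2 log generalized LRT statistic
pchisq(deviance(myfit0) - deviance(myfit),
       df = df.residual(myfit0) - df.residual(myfit),
       lower.tail = FALSE)                    # chi-squared p-value
# anova(myfit0, myfit, test = "LRT") carries out the same comparison

Since rank is the last term added, this should reproduce the rank line of the sequential analysis of deviance table shown above.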
Let’s revisit a logistic regression example now that we know how the various statistics are calculated.
> mydata <-
+ read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> dim(mydata)
> head(mydata)
Fit the model with basic output. Note the argument family = "binomial".
> mydata$rank <- factor(mydata$rank, levels=c(1, 2, 3, 4))
> myfit <- glm(admit ~ gre + gpa + rank,
+ data = mydata, family = "binomial")
> myfit
Call: glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Coefficients:
(Intercept) gre gpa rank2
-3.989979 0.002264 0.804038 -0.675443
rank3 rank4
-1.340204 -1.551464
Degrees of Freedom: 399 Total (i.e. Null); 394 Residual
Null Deviance: 500
Residual Deviance: 458.5 AIC: 470.5
This shows the fitted coefficient values, which are on the link function scale (the logit, i.e., log odds, here). In the summary output below, a Wald test is also performed for each coefficient.
> summary(myfit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
Here we perform a generalized LRT on each variable. Note the rank
variable is now tested as a single factor variable as opposed to each dummy variable.
> anova(myfit, test="LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: admit
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 399 499.98
gre 1 13.9204 398 486.06 0.0001907 ***
gpa 1 5.7122 397 480.34 0.0168478 *
rank 3 21.8265 394 458.52 7.088e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mydata <- data.frame(mydata, probs = myfit$fitted.values)
> ggplot(mydata) + geom_point(aes(x=gpa, y=probs, color=rank)) +
+ geom_jitter(aes(x=gpa, y=admit), width=0, height=0.01, alpha=0.3)
> ggplot(mydata) + geom_point(aes(x=gre, y=probs, color=rank)) +
+ geom_jitter(aes(x=gre, y=admit), width=0, height=0.01, alpha=0.3)
> ggplot(mydata) + geom_boxplot(aes(x=rank, y=probs)) +
+ geom_jitter(aes(x=rank, y=probs), width=0.1, height=0.01, alpha=0.3)
glm() Function
The glm() function has many different options available to the user.
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
To see the different link functions available, type:
help(family)
Recall the set up for simple linear regression. For random variables \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\), simple linear regression estimates the model
\[ Y_i = \beta_1 + \beta_2 X_i + E_i \]
where \({\operatorname{E}}[E_i | X_i] = 0\), \({\operatorname{Var}}(E_i | X_i) = \sigma^2\), and \({\operatorname{Cov}}(E_i, E_j | X_i, X_j) = 0\) for all \(1 \leq i, j \leq n\) and \(i \not= j\).
Note that in this model \({\operatorname{E}}[Y | X] = \beta_1 + \beta_2 X.\)
In simple nonparametric regression, we define a similar model while eliminating the linear assumption:
\[ Y_i = s(X_i) + E_i \]
for some function \(s(\cdot)\) with the same assumptions on the distribution of \({\boldsymbol{E}}| {\boldsymbol{X}}\). In this model, we also have
\[ {\operatorname{E}}[Y | X] = s(X). \]
Suppose we consider fitting the model \(Y_i = s(X_i) + E_i\) with the restriction that \(s \in C^2\), the class of functions with continuous second derivatives. We can set up an objective function that regularizes how smooth vs wiggly \(s\) is.
Specifically, suppose for a given set of observed data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) we wish to identify a function \(s \in C^2\) that minimizes for some \(\lambda\)
\[ \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s''(x)|^2 dx \]
When minimizing
\[ \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx \]
it follows that if \(\lambda=0\) then any function \(s \in C^2\) that interpolates the data is a solution.
As \(\lambda \rightarrow \infty\), then the minimizing function is the simple linear regression solution.
For an observed data set \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) where \(n \geq 4\) and a fixed value \(\lambda\), there is an exact solution to minimizing
\[ \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx. \]
The solution is called a natural cubic spline, which is constructed to have knots at \(x_1, x_2, \ldots, x_n\).
Suppose without loss of generality that we have ordered \(x_1 < x_2 < \cdots < x_n\). We assume all \(x_i\) are unique to simplify the explanation here, but ties can be dealt with.
A natural cubic spline (NCS) is a function constructed from a set of piecewise cubic functions over the range \([x_1, x_n]\) joined at the knots so that the second derivative is continuous at the knots. Outside of the range (\(< x_1\) or \(> x_n\)), the spline is linear and it has continuous second derivatives at the endpoint knots.
Depending on the value of \(\lambda\), a different NCS will be constructed, but the entire family of NCS solutions over \(0 < \lambda < \infty\) can be constructed from a common set of basis functions.
We construct \(n\) basis functions \(N_1(x), N_2(x), \ldots, N_n(x)\) with coefficients \(\theta_1(\lambda), \theta_2(\lambda), \ldots, \theta_n(\lambda)\). The NCS takes the form
\[ s(x) = \sum_{i=1}^n \theta_i(\lambda) N_i(x). \]
Define \(N_1(x) = 1\) and \(N_2(x) = x\). For \(i = 3, \ldots, n\), define \(N_i(x) = d_{i-1}(x) - d_{i-2}(x)\) where
\[ d_{i}(x) = \frac{(x - x_i)_+^3 - (x - x_n)_+^3}{x_n - x_i}, \]
where \((a)_+ = \max(a, 0)\).
Recall that we’ve labeled indices so that \(x_1 < x_2 < \cdots < x_n\).
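A small R sketch of this basis construction, assuming the knot locations x are sorted and unique as above (the function name ncs_basis is made up here for illustration):

# Natural cubic spline basis functions N_1, ..., N_n evaluated at points t
ncs_basis <- function(x, t = x) {
  n <- length(x)
  cube_plus <- function(a) pmax(a, 0)^3                      # (a)_+^3
  d <- function(i) (cube_plus(t - x[i]) - cube_plus(t - x[n])) / (x[n] - x[i])
  N <- matrix(0, nrow = length(t), ncol = n)
  N[, 1] <- 1                                                # N_1(t) = 1
  N[, 2] <- t                                                # N_2(t) = t
  for (i in 3:n) N[, i] <- d(i - 1) - d(i - 2)               # N_i = d_{i-1} - d_{i-2}
  N
}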
Let \({\boldsymbol{\theta}}_\lambda = (\theta_1(\lambda), \theta_2(\lambda), \ldots, \theta_n(\lambda))^T\) and let \({\boldsymbol{N}}\) be the \(n \times n\) matrix with \((i, j)\) entry equal to \(N_j(x_i)\). Finally, let \({\boldsymbol{\Omega}}\) be the \(n \times n\) matrix with \((i,j)\) entry equal to \(\int N_i^{\prime\prime}(x) N_j^{\prime\prime}(x) dx\).
The solution \({\boldsymbol{\theta}}_\lambda\) is the set of coefficient values that minimizes
\[ ({\boldsymbol{y}}- {\boldsymbol{N}}{\boldsymbol{\theta}})^T ({\boldsymbol{y}}- {\boldsymbol{N}}{\boldsymbol{\theta}}) + \lambda {\boldsymbol{\theta}}^T {\boldsymbol{\Omega}}{\boldsymbol{\theta}}, \]
which results in
\[ \hat{{\boldsymbol{\theta}}}_\lambda = ({\boldsymbol{N}}^T {\boldsymbol{N}}+ \lambda {\boldsymbol{\Omega}})^{-1} {\boldsymbol{N}}^T {\boldsymbol{y}}. \]
Letting
\[ {\boldsymbol{S}}_\lambda = {\boldsymbol{N}}({\boldsymbol{N}}^T {\boldsymbol{N}}+ \lambda {\boldsymbol{\Omega}})^{-1} {\boldsymbol{N}}^T \]
it follows that the fitted values are
\[ \hat{{\boldsymbol{y}}} = {\boldsymbol{S}}_\lambda {\boldsymbol{y}}. \]
Thus, the fitted values from an NCS are constructed by taking linear combinations of the response variable values \(y_1, y_2, \ldots, y_n\).
Recall that in OLS, we formed projection matrix \({\boldsymbol{P}}= {\boldsymbol{X}}({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T\) and noted that the number of columns \(p\) of \({\boldsymbol{X}}\) is also found in the trace of \({\boldsymbol{P}}\) where \(\operatorname{tr}({\boldsymbol{P}}) = p\).
The effective degrees of freedom for a model fit by a linear operator is calculated as the trace of the operator.
Therefore, for a given \(\lambda\), the effective degrees of freedom is
\[ \operatorname{df}_\lambda = \operatorname{tr}({\boldsymbol{S}}_\lambda). \]
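Given the basis matrix \({\boldsymbol{N}}\) and penalty matrix \({\boldsymbol{\Omega}}\) (assumed already computed, e.g., from the basis construction above plus numerical integration of the second derivatives), the smoother matrix, fitted values, and effective degrees of freedom follow in a few lines of R; a sketch:

# N, Omega, y, and lambda are assumed to be available in the workspace
S_lambda <- N %*% solve(t(N) %*% N + lambda * Omega) %*% t(N)   # smoother matrix
y_hat <- drop(S_lambda %*% y)                                    # fitted values S_lambda y
df_lambda <- sum(diag(S_lambda))                                 # effective df = tr(S_lambda)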
Minimizing
\[ \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx \]
is equivalent to maximizing
\[ \exp\left\{ -\frac{\sum_{i=1}^n (y_i - s(x_i))^2}{2\sigma^2} \right\} \exp\left\{-\frac{\lambda}{2\sigma^2} \int |s^{\prime\prime}(x)|^2 dx\right\}. \]
Therefore, the NCS solution can be interpreted as calculating the MAP where \(Y | X\) is Normal and there’s an Exponential prior on the smoothness of \(s\).
Typically we will choose some \(0 < \lambda < \infty\) in an effort to balance the bias and variance. Let \(\hat{Y} = \hat{s}(X; \lambda)\) where \(\hat{s}(\cdot; \lambda)\) minimizes the above for some chosen \(\lambda\) on an independent data set. Then
\[ \begin{aligned} {\operatorname{E}}\left[\left(Y - \hat{Y}\right)^2\right] & = {\rm E}\left[\left(s(x) + E - \hat{s}(x; \lambda)\right)^2 \right] \\ \ & = {\rm E}\left[\left( s(x) - \hat{s}(x; \lambda) \right)^2 \right] + {\rm Var}(E) \\ \ & = \left( s(x) - {\rm E}[\hat{s}(x; \lambda)]\right)^2 + {\rm Var}\left(\hat{s}(x; \lambda)\right) + {\rm Var}(E) \\ \ & = \mbox{bias}^2_{\lambda} + \mbox{variance}_{\lambda} + {\rm Var}(E) \end{aligned} \]
where all of the above calculations are conditioned on \(X=x\).
In minimizing
\[ \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int |s^{\prime\prime}(x)|^2 dx \]
the relationship is such that:
\[ \uparrow \lambda \Longrightarrow \mbox{bias}^2 \uparrow, \mbox{variance} \downarrow \]
\[ \downarrow \lambda \Longrightarrow \mbox{bias}^2 \downarrow, \mbox{variance} \uparrow \]
There are several approaches that are commonly used to identify a value of \(\lambda\), including ordinary cross-validation and generalized cross-validation (GCV).
We investigated one type of nonparametric regression model here, the NCS. However, in general there are many such “smoother” methods available in the simple nonparametric regression scenario.
Splines are particularly popular since splines are constructed from putting together polynomials and are therefore usually tractable to compute and analyze.
There are several functions and packages available in R for computing smoothers and tuning smoothness parameters. Examples include the splines library, smooth.spline(), loess(), and lowess().
> y2 <- smooth.spline(x=x, y=y, df=2)
> y5 <- smooth.spline(x=x, y=y, df=5)
> y25 <- smooth.spline(x=x, y=y, df=25)
> ycv <- smooth.spline(x=x, y=y)
> ycv
Call:
smooth.spline(x = x, y = y)
Smoothing Parameter spar= 0.5162045 lambda= 0.0002730906 (11 iterations)
Equivalent Degrees of Freedom (Df): 7.293673
Penalized Criterion: 14.80602
GCV: 1.180651
Recall that OLS estimates the model
\[ \begin{aligned} Y_i & = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + E_i \\ & = {\boldsymbol{X}}_i {\boldsymbol{\beta}}+ E_i \end{aligned} \]
where \({\operatorname{E}}[{\boldsymbol{E}}| {\boldsymbol{X}}] = {\boldsymbol{0}}\) and \({\operatorname{Cov}}({\boldsymbol{E}}| {\boldsymbol{X}}) = \sigma^2 {\boldsymbol{I}}\).
The additive model (which could also be called “ordinary nonparametric additive regression”) is of the form
\[ \begin{aligned} Y_i & = s_1(X_{i1}) + s_2(X_{i2}) + \ldots + s_p(X_{ip}) + E_i \\ & = \sum_{j=1}^p s_j(X_{ij}) + E_i \end{aligned} \]
where the \(s_j(\cdot)\) for \(j = 1, \ldots, p\) are a set of nonparametric (or flexible) functions. Again, we assume that \({\operatorname{E}}[{\boldsymbol{E}}| {\boldsymbol{X}}] = {\boldsymbol{0}}\) and \({\operatorname{Cov}}({\boldsymbol{E}}| {\boldsymbol{X}}) = \sigma^2 {\boldsymbol{I}}\).
The additive model can be fit through a technique called backfitting.
Note that some extra steps have to be taken to deal with the intercept.
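A minimal R sketch of backfitting, using smooth.spline() as the smoother for each coordinate. The function name backfit and the centering step (which keeps the intercept in a separate term, as noted above) are illustrative choices; x is assumed to be an n x p numeric matrix and y a numeric response.

# Backfitting sketch for the additive model (illustration only)
backfit <- function(x, y, maxit = 20, tol = 1e-6) {
  n <- nrow(x); p <- ncol(x)
  alpha <- mean(y)                                      # intercept term
  s <- matrix(0, n, p)                                  # current estimates of s_j(x_ij)
  for (it in seq_len(maxit)) {
    s_old <- s
    for (j in seq_len(p)) {
      r <- y - alpha - rowSums(s[, -j, drop = FALSE])   # partial residuals
      fit_j <- smooth.spline(x[, j], r)
      s[, j] <- predict(fit_j, x[, j])$y
      s[, j] <- s[, j] - mean(s[, j])                   # center each s_j to handle the intercept
    }
    if (max(abs(s - s_old)) < tol) break
  }
  list(alpha = alpha, s = s, fitted = alpha + rowSums(s))
}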
\(Y | {\boldsymbol{X}}\) is distributed according to an exponential family distribution. The extension of additive models to this family of response variable is called generalized additive models (GAMs). The model is of the form
\[ g\left({\operatorname{E}}[Y_i | {\boldsymbol{X}}_i]\right) = \sum_{j=1}^p s_j(X_{ij}) \]
where \(g(\cdot)\) is the link function and the \(s_j(\cdot)\) are flexible and/or nonparametric functions.
Fitting GAMs involves putting together three tools covered above: the exponential family GLM framework with its link function, nonparametric smoothers such as splines, and the backfitting algorithm for additive models.
Three common ways to fit GAMs in R are: glm() on explanatory variables constructed from ns() or bs(); the gam library; and the mgcv library.
> set.seed(508)
> x1 <- seq(1, 10, length.out=50)
> n <- length(x1)
> x2 <- rnorm(n)
> f <- 4*log(x1) + sin(x1) - 7 + 0.5*x2
> p <- exp(f)/(1+exp(f))
> summary(p)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001842 0.074170 0.310700 0.436200 0.860400 0.944800
> y <- rbinom(n, size=1, prob=p)
> mean(y)
[1] 0.42
> df <- data.frame(x1=x1, x2=x2, y=y)
Here, we use the gam()
function from the mgcv
library. It automates choosing the smoothness of the splines.
> library(mgcv)
> mygam <- gam(y ~ s(x1) + s(x2), family = binomial(), data=df)
> library(broom)
> tidy(mygam)
term edf ref.df statistic p.value
1 s(x1) 1.870282 2.374897 12.742930 0.005308795
2 s(x2) 1.000024 1.000048 1.163059 0.280836876
> summary(mygam)
Family: binomial
Link function: logit
Formula:
y ~ s(x1) + s(x2)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1380 0.6723 -1.693 0.0905 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x1) 1.87 2.375 12.743 0.00531 **
s(x2) 1.00 1.000 1.163 0.28084
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.488 Deviance explained = 47%
UBRE = -0.12392 Scale est. = 1 n = 50
True probabilities vs. estimated probabilities.
> plot(p, mygam$fitted.values, pch=19); abline(0,1)
Smoother estimated for x1.
> plot(mygam, select=1)
Smoother estimated for x2.
> plot(mygam, select=2)
Here, we use the glm()
function and include as an explanatory variable a NCS built from the ns()
function from the splines
library. We include a df
argument in the ns()
call.
> library(splines)
> myglm <- glm(y ~ ns(x1, df=2) + x2, family = binomial(), data=df)
> tidy(myglm)
term estimate std.error statistic p.value
1 (Intercept) -10.9228749 5.3078903 -2.057856 0.039603941
2 ns(x1, df = 2)1 21.3848301 10.1317747 2.110670 0.034800710
3 ns(x1, df = 2)2 6.3266262 2.1103294 2.997933 0.002718174
4 x2 0.7341812 0.6089442 1.205663 0.227947633
The spline basis evaluated at the x1 values.
> ns(x1, df=2)
1 2
[1,] 0.00000000 0.00000000
[2,] 0.03114456 -0.02075171
[3,] 0.06220870 -0.04138180
[4,] 0.09311200 -0.06176867
[5,] 0.12377405 -0.08179071
[6,] 0.15411442 -0.10132630
[7,] 0.18405270 -0.12025384
[8,] 0.21350847 -0.13845171
[9,] 0.24240131 -0.15579831
[10,] 0.27065081 -0.17217201
[11,] 0.29817654 -0.18745121
[12,] 0.32489808 -0.20151430
[13,] 0.35073503 -0.21423967
[14,] 0.37560695 -0.22550571
[15,] 0.39943343 -0.23519080
[16,] 0.42213406 -0.24317334
[17,] 0.44362840 -0.24933170
[18,] 0.46383606 -0.25354429
[19,] 0.48267660 -0.25568949
[20,] 0.50006961 -0.25564569
[21,] 0.51593467 -0.25329128
[22,] 0.53019136 -0.24850464
[23,] 0.54275927 -0.24116417
[24,] 0.55355797 -0.23114825
[25,] 0.56250705 -0.21833528
[26,] 0.56952943 -0.20260871
[27,] 0.57462513 -0.18396854
[28,] 0.57787120 -0.16253131
[29,] 0.57934806 -0.13841863
[30,] 0.57913614 -0.11175212
[31,] 0.57731586 -0.08265339
[32,] 0.57396762 -0.05124405
[33,] 0.56917185 -0.01764570
[34,] 0.56300897 0.01802003
[35,] 0.55555939 0.05563154
[36,] 0.54690354 0.09506722
[37,] 0.53712183 0.13620546
[38,] 0.52629468 0.17892464
[39,] 0.51450251 0.22310315
[40,] 0.50182573 0.26861939
[41,] 0.48834478 0.31535174
[42,] 0.47414005 0.36317859
[43,] 0.45929198 0.41197833
[44,] 0.44388099 0.46162934
[45,] 0.42798748 0.51201003
[46,] 0.41169188 0.56299877
[47,] 0.39507460 0.61447395
[48,] 0.37821607 0.66631397
[49,] 0.36119670 0.71839720
[50,] 0.34409692 0.77060206
attr(,"degree")
[1] 3
attr(,"knots")
50%
5.5
attr(,"Boundary.knots")
[1] 1 10
attr(,"intercept")
[1] FALSE
attr(,"class")
[1] "ns" "basis" "matrix"
Plot of basis function values vs. x1.
> summary(myglm)
Call:
glm(formula = y ~ ns(x1, df = 2) + x2, family = binomial(), data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0214 -0.3730 -0.0162 0.5762 1.7616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.9229 5.3079 -2.058 0.03960 *
ns(x1, df = 2)1 21.3848 10.1318 2.111 0.03480 *
ns(x1, df = 2)2 6.3266 2.1103 2.998 0.00272 **
x2 0.7342 0.6089 1.206 0.22795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 68.029 on 49 degrees of freedom
Residual deviance: 35.682 on 46 degrees of freedom
AIC: 43.682
Number of Fisher Scoring iterations: 7
> anova(myglm, test="LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 49 68.029
ns(x1, df = 2) 2 30.755 47 37.274 2.097e-07 ***
x2 1 1.592 46 35.682 0.207
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
True probabilities vs. estimated probabilities.
> plot(p, myglm$fitted.values, pch=19); abline(0,1)
Let’s first discuss how one can utilize the bootstrap on any of the three homoskedastic models covered above: OLS, simple nonparametric regression (e.g., the NCS), and the additive model.
In each of these scenarios we sample data \(({\boldsymbol{X}}_1, Y_1), ({\boldsymbol{X}}_2, Y_2), \ldots, ({\boldsymbol{X}}_n, Y_n)\). Let’s suppose we calculate fitted values \(\hat{Y}_i\) and they are unbiased:
\[ {\operatorname{E}}[\hat{Y}_i | {\boldsymbol{X}}] = {\operatorname{E}}[Y_i | {\boldsymbol{X}}]. \]
We can calculate residuals \(\hat{E}_i = Y_i - \hat{Y}_i\) for \(i=1, 2, \ldots, n\).
One complication is that the residuals have a covariance. For example, in OLS we showed that
\[ {\operatorname{Cov}}(\hat{{\boldsymbol{E}}}) = \sigma^2 ({\boldsymbol{I}}- {\boldsymbol{P}}) \]
where \({\boldsymbol{P}}= {\boldsymbol{X}}({\boldsymbol{X}}^T {\boldsymbol{X}})^{-1} {\boldsymbol{X}}^T\).
To correct for this induced heteroskedasticity, we studentize the residuals by calculating
\[ R_i = \frac{\hat{E}_i}{\sqrt{1-P_{ii}}} \]
which gives \({\operatorname{Var}}(R_i) = \sigma^2\) for each \(i\); the \(R_i\) remain weakly correlated, but they now share a common variance.
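For an lm() fit in R, these studentized residuals can be computed from the hat values; a small sketch, where fit stands in for any fitted homoskedastic linear model object:

# Studentized residuals R_i = E_i-hat / sqrt(1 - P_ii)
e <- resid(fit)          # raw residuals
h <- hatvalues(fit)      # diagonal entries P_ii of the projection matrix
r <- e / sqrt(1 - h)     # studentized residuals used for resampling
# (rstandard(fit) additionally divides by the estimate of sigma)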
The following is a bootstrap procedure for calculating a confidence interval on some statistic \(\hat{\theta}\) calculated from a homoskedastic model fit; an example is \(\hat{\beta}_j\) in an OLS. For \(b = 1, 2, \ldots, B\): sample \(n\) studentized residuals with replacement, add them to the fitted values \(\hat{Y}_i\) to form a bootstrap response vector, refit the model on this bootstrap data set, and recompute the statistic to obtain \(\hat{\theta}^{*(b)}\).
The bootstrap statistics \(\hat{\theta}^{*(1)}, \hat{\theta}^{*(2)}, \ldots, \hat{\theta}^{*(B)}\) are then utilized through one of the techniques discussed earlier (percentile, pivotal, studentized pivotal) to calculate a bootstrap CI.
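A minimal sketch of this residual bootstrap for the slope in an OLS fit, with simulated data and a percentile interval used purely for illustration:

# Residual bootstrap confidence interval for an OLS slope (illustration only)
set.seed(1)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)        # toy data set
fit <- lm(y ~ x)
yhat <- fitted(fit)
r <- resid(fit) / sqrt(1 - hatvalues(fit))          # studentized residuals
r <- r - mean(r)                                    # center before resampling
B <- 1000
theta_star <- replicate(B, {
  ystar <- yhat + sample(r, replace = TRUE)         # bootstrap responses
  coef(lm(ystar ~ x))["x"]                          # refit and recompute the statistic
})
quantile(theta_star, c(0.025, 0.975))               # percentile bootstrap CI for the slope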
Suppose we are testing the hypothesis \(H_0: {\operatorname{E}}[Y | {\boldsymbol{X}}] = f_0({\boldsymbol{X}})\) vs \(H_1: {\operatorname{E}}[Y | {\boldsymbol{X}}] = f_1({\boldsymbol{X}})\). Suppose it is possible to form unbiased estimates \(\hat{f}_0({\boldsymbol{X}})\) and \(\hat{f}_1({\boldsymbol{X}})\) of \(f_0({\boldsymbol{X}})\) and \(f_1({\boldsymbol{X}})\) given \({\boldsymbol{X}}\), and \(f_0\) is a restricted version of \(f_1\).
Suppose also we have a statistic \(T(\hat{f}_0, \hat{f}_1)\) for performing this test so that the larger the statistic, the more evidence there is against the null hypothesis in favor of the alternative.
The big picture strategy is to bootstrap studentized residuals from the unconstrained (alternative hypothesis) fitted model and then add those to the constrained (null hypothesis) fitted model to generate bootstrap null data sets.
The p-value is then calculated as
\[ \frac{\sum_{b=1}^B 1\left(T(\hat{f}^{*(b)}_0, \hat{f}^{*(b)}_1) \geq T(\hat{f}_0, \hat{f}_1) \right) }{B} \]
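A hedged sketch of this strategy in R, testing a linear fit (null) against a smoothing spline fit (alternative); the simulated data, the RSS-difference statistic, and the use of raw rather than studentized residuals are simplifications made here for illustration:

# Bootstrap null distribution for comparing a linear fit to a smoother (illustration only)
set.seed(1)
x <- runif(100, 0, 10); y <- sin(x) + rnorm(100, sd = 0.5)
fit0 <- lm(y ~ x)                                    # constrained (null) fit
fit1 <- smooth.spline(x, y)                          # unconstrained (alternative) fit
f0 <- fitted(fit0); f1 <- predict(fit1, x)$y
T_stat <- function(f0, f1, y) sum((y - f0)^2) - sum((y - f1)^2)
T_obs <- T_stat(f0, f1, y)                           # larger => more evidence against H0
e1 <- y - f1                                         # residuals from the alternative fit
e1 <- e1 - mean(e1)
B <- 500
T_star <- replicate(B, {
  ystar <- f0 + sample(e1, replace = TRUE)           # bootstrap null data set
  b0 <- fitted(lm(ystar ~ x))
  b1 <- predict(smooth.spline(x, ystar), x)$y
  T_stat(b0, b1, ystar)
})
mean(T_star >= T_obs)                                # bootstrap p-value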
For more complex scenarios, such as GLMs, GAMs, and heteroskedastic models, it is typically more straightforward to utilize a parametric bootstrap.
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.4
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] mgcv_1.8-17 nlme_3.1-131 broom_0.4.2
[4] dplyr_0.5.0 purrr_0.2.2 readr_1.1.0
[7] tidyr_0.6.2 tibble_1.3.0 ggplot2_2.2.1
[10] tidyverse_1.1.1 knitr_1.15.1 magrittr_1.5
[13] devtools_1.12.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 highr_0.6 cellranger_1.1.0
[4] plyr_1.8.4 forcats_0.2.0 tools_3.3.2
[7] digest_0.6.12 lubridate_1.6.0 jsonlite_1.4
[10] evaluate_0.10 memoise_1.1.0 gtable_0.2.0
[13] lattice_0.20-35 Matrix_1.2-10 psych_1.7.5
[16] DBI_0.6-1 yaml_2.1.14 parallel_3.3.2
[19] haven_1.0.0 xml2_1.1.1 withr_1.0.2
[22] stringr_1.2.0 httr_1.2.1 revealjs_0.9
[25] hms_0.3 rprojroot_1.2 grid_3.3.2
[28] R6_2.2.0 readxl_1.0.0 foreign_0.8-68
[31] rmarkdown_1.5 modelr_0.1.0 reshape2_1.4.2
[34] backports_1.0.5 scales_0.4.1 htmltools_0.3.6
[37] rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[40] colorspace_1.3-2 labeling_0.3 stringi_1.1.5
[43] lazyeval_0.2.0 munsell_0.4.3