1 Overview

This tutorial reviews regression analysis, a cornerstone method that supports many of the other methods we will consider. We begin by translating the basic equations into matrix form in order to connect to basic knowledge of matrix algebra (covered in class). Then, we go through examples of how to fit regression models for continuous, count, and binary outcomes.

1.0.0.1 Prelim - Loading libraries used in this script.
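The loading chunk itself is not displayed; a minimal sketch, assuming the packages used later in the tutorial (psych for descriptives, ggplot2 for plots), is:

```r
# Minimal sketch of the setup chunk; the package list is an assumption based
# on the functions used later (describe() from psych, ggplot() from ggplot2).
library(psych)    # descriptive statistics via describe()
library(ggplot2)  # plotting
```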

2 Mathematical Model Basics

In basic form, a regression model describes a single outcome variable as a function of a set of predictor variables. \[ outcome_i = f(inputs_i) = f(\cdot) \] Generally, the regression model is written as

\[ y_{i} = \beta_0(1_i) + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_{i} \]

where \(y_{i}\) is the value of the outcome variable for individual \(i\); \(\beta_0\) is an intercept parameter that indicates the expected value of \(y_i\) when the predictor variables, \(\left\{ x_{1i}, ..., x_{qi}\right\}\), are all \(= 0\); \(\left\{\beta_1,..., \beta_q\right\}\) are regression parameters that each indicate the unique relation between a specific predictor variable \(x_{qi}\) and the outcome variable, \(y_i\); and \(\epsilon_{i}\) is the residual (i.e., random) portion of \(y_i\) that is unaccounted for by the deterministic (i.e., fixed) part of the model, \(\beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi})\).

Given our long-term goals, we make connections between this representation of the model and the matrix algebra representation.

2.1 Regression and Matrix Notation

We can translate the regression model into matrix form. First, let’s take a simpler form of the model, an intercept-only model where

\[ y_i = 1_i \cdot \beta_0 + \epsilon_i\] Note that we have made the “silent” 1 explicit. This will become important later (e.g., when fitting growth models).

Translating into matrix form, \(y_i\) can be written as an \(N\) x 1 matrix (a column vector). More specifically, for \(i = 1\) to \(N\) individuals, \[ y_i = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \boldsymbol{Y}\]

(matrices are often designated as bold capital letters).

Doing the same for all the other parts of the model, we get

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] [\beta_0] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Note that we have taken care that each matrix is of an order that will allow for matrix multiplication.

What is the interpretation of \([\beta_0]\)?

What is the interpretation of \[\left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]?

Now, let’s expand our regression model by adding a predictor \(x_{1i}\). Our model becomes

\[ y_i = (1_i)\beta_0 + (x_{1i})\beta_1 + \epsilon_i \]

Written out explicitly in matrix form, the model is
\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cc} 1, x_{11}\\ 1, x_{12}\\ \vdots \\ 1, x_{1N}\end{array} \right] \left[ \begin{array}{c}\beta_0\\ \beta_1\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Finally, extending the model to the general case with \(q\) predictor variables, we have \[ y_i = \beta_0 + \beta_1(x_{1i}) + \beta_2(x_{2i}) + ... + \beta_q(x_{qi}) + \epsilon_i \]

which is written out in matrix form as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cccc} 1, x_{11}, \ldots, x_{q1}\\ 1, x_{12}, \ldots, x_{q2}\\ \vdots \\ 1, x_{1N}, \ldots, x_{qN}\end{array} \right] \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

When we implement this model in R, it will be important to know the portions of the model that are in our data frame, \(y_i\) and \(\left\{x_{1}, ..., x_{q}\right\}\), and to have them structured properly. This will become clear in the examples below.

Now that we have the model written out explicitly as matrices, we can easily simplify the notation.

\[ \boldsymbol{Y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] ;\boldsymbol{X} = \left[ \begin{array}{cccc} 1, x_{11}, \ldots, x_{q1}\\ 1, x_{12}, \ldots, x_{q2}\\ \vdots \\ 1, x_{1N}, \ldots, x_{qN}\end{array} \right]; \boldsymbol{\beta} = \left[ \begin{array}{c}\beta_0\\ \beta_1\\ \vdots\\ \beta_q\end{array}\right]; \boldsymbol{\epsilon} = \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right] \]

In compact matrix notation, the regression model then can be written as

\[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

2.2 Solving the Regression Equation

In practice, we would like to know the contents of (i.e., solve for) \(\boldsymbol{\beta}\).

Assuming the model is correct, the expected value of \(\boldsymbol{\epsilon}\) is \(\boldsymbol{0}\), so in expectation \[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}\] Then we just need to solve for \(\boldsymbol{\beta}\).

Premultiplying both sides of the equation by \(\boldsymbol{X'}\), we obtain

\[ \boldsymbol{X'}\boldsymbol{Y} = \boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} \] This gives us a quantity, \(\left(\boldsymbol{X'}\boldsymbol{X}\right)\), a square matrix that contains information about the relations among the \(x_q\) variables. When this matrix has an inverse, we can premultiply both sides by \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\), to obtain

\[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} \left( \boldsymbol{X'}\boldsymbol{Y}\right) = \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1} (\boldsymbol{X'}\boldsymbol{X}) \boldsymbol{\beta} \]

Because \[\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{X}\right) = \boldsymbol{I}\] the equation simplifies to \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{I}\boldsymbol{\beta} \] and \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{X'}\boldsymbol{Y}\right) = \boldsymbol{\beta} \] Yay! We’ve isolated the unknowns, \(\boldsymbol{\beta}\), on one side of the equation.

Quite literally, this algebra is what allows for estimation of the parameters when fitting a regression model to data.
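As a small illustration (not part of the original script), the closed-form solution can be computed directly in R with simulated data and compared against lm():

```r
# Illustration: the matrix-algebra solution equals lm()'s estimates
set.seed(1234)
N  <- 100
x1 <- rnorm(N, mean = 10, sd = 2)
y  <- 5 + 0.5 * x1 + rnorm(N, sd = 1)

X <- cbind(1, x1)                          # N x 2 design matrix (with the "silent" 1)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                                   # intercept and slope from the matrix algebra
coef(lm(y ~ 1 + x1))                       # same estimates from lm()
```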

We will now work through some practical examples - staying aware that this kind of matrix algebra is being done in the background.

2.2.0.1 Prelim - Reading in Data File

The examples make use of a new WISC data file. For didactic purposes, the data have been supplemented with some simulated values; in particular, we have added a female variable that indicates each participant’s gender.

Reading in these example data.

Subsetting to variables of interest.

Sample-level descriptives for the chosen variables.
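The data-preparation chunks are not displayed; a sketch of these three steps (the file name is hypothetical, and describe() from the psych package is assumed for the descriptives) is:

```r
# Sketch of the data-preparation steps; the file name is a placeholder
wisc <- read.csv("wisc_data.csv", header = TRUE)

# Subset to the variables of interest
wiscsub <- wisc[, c("id", "verb1", "verb2", "verb4", "verb6",
                    "momed", "grad", "female")]

# Sample-level descriptives
describe(wiscsub)
```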

vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 204 102.5000000 59.0338886 102.500 102.5000000 75.612600 1.00 204.00 203.00 0.0000000 -1.2176609 4.1331989
verb1 2 204 19.5850490 5.8076950 19.335 19.4976829 5.411490 3.33 35.15 31.82 0.1301071 -0.0528376 0.4066200
verb2 3 204 25.4153431 6.1063772 25.980 25.4026220 6.567918 5.95 39.85 33.90 -0.0639307 -0.3445494 0.4275319
verb4 4 204 32.6077451 7.3198841 32.820 32.4240244 7.183197 12.60 52.84 40.24 0.2317998 -0.0776524 0.5124944
verb6 5 204 43.7499020 10.6650511 42.545 43.4560976 11.297412 17.35 72.59 55.24 0.2356459 -0.3605210 0.7467029
momed 6 204 10.8112745 2.6982790 11.500 10.9969512 2.965200 5.50 18.00 12.50 -0.3586143 0.0056095 0.1889173
grad 7 204 0.2254902 0.4189328 0.000 0.1585366 0.000000 0.00 1.00 1.00 1.3040954 -0.3007374 0.0293312
female 8 204 0.5000000 0.5012300 0.500 0.5000000 0.741300 0.00 1.00 1.00 0.0000000 -2.0097799 0.0350931

3 Multiple Regression - Empirical example with a Normally Distributed (Gaussian) outcome variable

For the first example, we focus on verbal ability at Grade 2 as an outcome (verb2 in the data frame wiscsub).

Examining the distribution for ‘verb2’.
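A describe() call along these lines produces the summary below:

```r
describe(wiscsub$verb2)
```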

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 204 25.41534 6.106377 25.98 25.40262 6.567918 5.95 39.85 33.9 -0.0639307 -0.3445494 0.4275319

The simplest model is an intercept only model. In this case, we would fit the model \[ verb_{2i} = b_0 + \epsilon_{i}\]

Written out explicitly with the “silent” 1 in it, we get \[ verb_{2i} = b_0(1_i) + \epsilon_{i}\] This is helpful for explicit translation into the R code, specifically the formula within the lm() function.

We fit the model using the following code … note that the code has the ‘1’ predictor variable stated explicitly.
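The call can be reconstructed from the Call line in the output; ‘model1’ is our label for the fitted object.

```r
# Intercept-only model
model1 <- lm(verb2 ~ 1, data = wiscsub, na.action = na.exclude)
summary(model1)
summary(model1)$r.squared   # presumably the source of the trailing "[1] 0" below
```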

## 
## Call:
## lm(formula = verb2 ~ 1, data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4653  -4.6403   0.5647   4.2822  14.4347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.4153     0.4275   59.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.106 on 203 degrees of freedom
## [1] 0

The output indicates that \(b_0\) = 25.4153, and its standard error = 0.4275.

The intercept reflects the expected value of the outcome variable when all of the predictor variables (i.e. \(\left\{ x_{1i}, ..., x_{qi}\right\}\)) = 0. So, in the absence of any additional information other than the descriptive statistics of \(verb_{2i}\), what is our best guess for a person’s \(verb_{2i}\) score? It is the mean of \(verb_{2i}\). The regression above confirms this notion; regressing the outcome on a vector of 1s allows us to ‘recover’ the mean.
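For comparison, computing the sample mean directly:

```r
mean(wiscsub$verb2, na.rm = TRUE)
```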

## [1] 25.41534

Yes - we recovered the mean, but we did not attempt to explain any of the variance. It thus makes sense that the R-squared = 0.

Let’s build up the model further.

For example, we could attempt to explain some of the between-person variance in the Grade 2 verbal score from the Grade 1 verbal scores. But, before we do, let’s examine the distribution of the between-person differences in the Grade 1 verbal scores.

And the relation between the Grade 2 and Grade 1 verbal ability scores.

Our regression model becomes

\[ verb_{2i} = b_0(1_i) + b_1(verb_{1i}) + \epsilon_{i}\]
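Fitting this model (the call is reconstructed from the Call line in the output; ‘model2’ is our label):

```r
model2 <- lm(verb2 ~ 1 + verb1, data = wiscsub, na.action = na.exclude)
summary(model2)
```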

## 
## Call:
## lm(formula = verb2 ~ 1 + verb1, data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5305  -3.0362   0.2526   2.7147  12.5020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.62965    1.05164   10.11   <2e-16 ***
## verb1        0.75495    0.05149   14.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.261 on 202 degrees of freedom
## Multiple R-squared:  0.5156, Adjusted R-squared:  0.5132 
## F-statistic:   215 on 1 and 202 DF,  p-value: < 2.2e-16

How do we interpret the parameters here?

The intercept, \(b_0\), is the expected value for the outcome variable when all of the predictor variables equal zero. So, we would expect a child to have a Grade 2 verbal score of 10.62965 if they have a Grade 1 verbal score of 0.

The slope, \(b_1\) is the expected difference in the outcome variable for each 1-unit difference in the predictor variable. So, across children for each 1-point difference in a child’s Grade 1 verbal score, we would expect a 0.75495 point difference in the Grade 2 verbal score.

We can plot the relation between ‘verb1’ and ‘verb2’, and include the predicted line from the analysis.

Note that the lm() fit is run within the ggplot() + geom_smooth() call. We have to be a bit careful about this as the models get more complicated, but it is very useful for communication.
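A sketch of such a plot (the exact plotting code is not shown in this document):

```r
# Scatterplot of verb1 vs. verb2 with the lm()-based regression line overlaid
ggplot(wiscsub, aes(x = verb1, y = verb2)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
```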

In this case, and in many other cases, the intercept does not have a ‘useful’ interpretation for the empirical example. This is because none of the students had a Grade 1 verbal score equal to 0. Therefore, if we want to make the intercept more meaningful, we need to give the Grade 1 verbal score a more meaningful 0 point. Typically, we center the predictor variables in regression analysis. For example, we create a centered variable, \(x^{*}_{1i}\), by subtracting the sample mean, \(\bar{x_1}\), from each observation,
\[ x^{*}_{1i} = x_{1i} - \bar{x_1} \] Our model becomes
\[ y_i = b_0(1_i) + b_1(x^{*}_{1i}) + \epsilon_i \]

Sample-mean centering \(verb_{1i}\).
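One way to do the centering (a sketch; scale(..., scale = FALSE) would also work):

```r
wiscsub$verb1_star <- wiscsub$verb1 - mean(wiscsub$verb1, na.rm = TRUE)
```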

Then we can fit a new model using \(verb^{*}_{1i}\), such that

\[ verb_{2i} = b_0(1_i) + b_1(verb^{*}_{1i}) + \epsilon_i \]
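Fitting the centered-predictor model (call reconstructed from the output; ‘model3’ is our label):

```r
model3 <- lm(verb2 ~ 1 + verb1_star, data = wiscsub, na.action = na.exclude)
summary(model3)
```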

## 
## Call:
## lm(formula = verb2 ~ 1 + verb1_star, data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5305  -3.0362   0.2526   2.7147  12.5020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.41534    0.29831   85.20   <2e-16 ***
## verb1_star   0.75495    0.05149   14.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.261 on 202 degrees of freedom
## Multiple R-squared:  0.5156, Adjusted R-squared:  0.5132 
## F-statistic:   215 on 1 and 202 DF,  p-value: < 2.2e-16

Note that the estimate for the slope \(b_1\) stays the same, but the estimate for the intercept is different. This is because the variable ‘verb1_star’ equals 0 when a child has an average 1st grade verbal score. Therefore the expected value for the 2nd grade verbal score, for a child with an average 1st grade verbal score, is expected to be 25.41534.

Note the change of scale on the x-axis.

Now, let’s include a second predictor. We have information on the number of years of education for the children’s mothers, variable momed. The values in momed indicate the number of years of education each mother completed. First, let’s take a look at the distribution of this new predictor variable.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 204 10.81127 2.698279 11.5 10.99695 2.9652 5.5 18 12.5 -0.3586143 0.0056095 0.1889173
## Warning: Removed 2 rows containing missing values (geom_bar).

And the relation between Grade 2 verbal scores and momed.

Our model now becomes

\[ verb_{2i} = b_0(1_{i}) + b_1(verb^{*}_{1i}) + b_2(momed^{*}_{i}) + \epsilon_{i}\]

Where \(verb^{*}_{1i}\) is the sample-centered version of \(verb_{1i}\), and \(momed^{*}_{i}\) is the sample-centered version of \(momed_{i}\)

Calculate centered momed variable.
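A sketch of the centering step:

```r
wiscsub$momed_star <- wiscsub$momed - mean(wiscsub$momed, na.rm = TRUE)
```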

Fit model to the data.
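The model4 fit (call reconstructed from the output):

```r
model4 <- lm(verb2 ~ 1 + verb1_star + momed_star,
             data = wiscsub, na.action = na.exclude)
summary(model4)
```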

## 
## Call:
## lm(formula = verb2 ~ 1 + verb1_star + momed_star, data = wiscsub, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4354 -2.9189 -0.1542  2.3746 11.1678 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.41534    0.29069  87.430  < 2e-16 ***
## verb1_star   0.66786    0.05626  11.872  < 2e-16 ***
## momed_star   0.41454    0.12108   3.424 0.000749 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.152 on 201 degrees of freedom
## Multiple R-squared:  0.5422, Adjusted R-squared:  0.5377 
## F-statistic: 119.1 on 2 and 201 DF,  p-value: < 2.2e-16

Now we have an intercept and two slopes.

\(b_0\) is the expected value of the outcome variable when all other variables are 0. Therefore, in this case, \(b_0\) is the expected Grade 2 verbal score for a child with an average Grade 1 verbal score (i.e. \(verb^{*}_{1i}\) = 0) and whose mother had an average education (i.e. \(momed^{*}_{i}\) = 0, where \(\bar{momed_{i}}\) = 10.81 years of education).

\(b_1\) is the expected difference in the outcome for a 1-unit difference in \(x_{1i}\). In this example (i.e. ‘model4’), \(b_1\) is the expected difference in Grade 2 verbal score (outcome variable, \(y_i\) = \(verb_{2i}\)) for a 1 point difference in the Grade 1 verbal score (\(x_{1i}\) = \(verb^{*}_{1i}\)), assuming average level of mother’s education.

\(b_2\) is the expected difference in the outcome for a 1-unit difference in \(x_{2i}\). For this example (i.e. ‘model4’), \(b_2\) is the expected difference in Grade 2 verbal score (outcome variable, \(y_i\) = \(verb_{2i}\)) for each year difference in mother’s education (\(x_{2i}\) = \(momed^{*}_i\)).

To better understand the implied relation between \(verb_{2i}\), \(verb^{*}_{1i}\), and \(momed^{*}_i\), we can plot the predicted values implied by the model. This will be a 3-dimensional plot because we are interested in the relations among three variables: \(\left\{verb_{2i}, verb^{*}_{1i}, momed^{*}_{i}\right\}\)

To make a 3-d plot, we create a matrix of the ‘decision space’. The rows correspond to levels of one predictor variable \((verb^{*}_{1i})\), the columns of the other predictor variable \((momed^{*}_{i})\), and the value at the intersections is the predicted value of \(verb_{2i}\).

The range of observed values of \((momed^{*}_{i})\) is [-5, 7], and the range of observed values of \((verb^{*}_{1i})\) is [-15, 16]. This implies a 32 row (for the integer values of \((verb^{*}_{1i})\)) by 13 column (for the integer values of \((momed^{*}_{i})\)) decision-space matrix.
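A sketch of how such a surface can be built and drawn; the original plotting code is not shown, and persp() is just one option (scatterplot3d or plotly are alternatives):

```r
# Grids of integer values for the two centered predictors
verb1_vals <- seq(-15, 16, by = 1)   # 32 values
momed_vals <- seq(-5, 7, by = 1)     # 13 values

# Predicted verb2 at every combination, using the model4 coefficients
pred_surface <- outer(verb1_vals, momed_vals,
                      function(v, m) coef(model4)[1] + coef(model4)[2] * v + coef(model4)[3] * m)

# 3-d surface plot
persp(x = verb1_vals, y = momed_vals, z = pred_surface,
      xlab = "verb1_star", ylab = "momed_star", zlab = "Predicted verb2",
      theta = -45, phi = 20, ticktype = "detailed")
```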

Along the right face of the cube one can see the linear relation between Grade 1 verbal score and Grade 2 verbal score. Along the left face one can see the linear relation between years of mother’s education and Grade 2 verbal score. The surface of the plane indicates how the two variables contribute to the Grade 2 verbal scores.

Ok, let’s move on to the topic of an ‘interaction’ which uses the product of two predictor variables as a new predictor.

Working up a slightly different example with the ‘grad’ variable (whether mom graduated high school),

\[ verb_{2i} = b_0(1_i) + b_1(verb^{*}_{1i}) + b_2(grad_{i}) + b_3(verb^{*}_{1i})(grad_{i}) + \epsilon_{i}\]

Where \(verb^{*}_{1i}\) is the mean-centered version of \(verb_{1i}\), and \(grad_i\) is a dummy coded variable that equals 0 if the child’s mother did not graduate high school, and equals 1 if the child’s mother did graduate high school.

Please note that we did not sample-mean center \(grad_i\) in this example because a value of 0 already has substantive meaning for the current example (i.e. when \(grad_i\) equals 0, the mother did not graduate high school).

Often, we describe phenomena in terms of moderation; that is, the relation between two variables (i.e. \(y_i\) and \(x_{1i}\)) is moderated by a third variable (i.e. \(x_{2i}\)). For example, the relation between Grade 1 and Grade 2 verbal scores may be moderated by mother’s graduation status. More specifically, the relation between 1st and 2nd grade verbal score may be different for children whose mothers did not or did graduate from high school.

The inclusion of product terms (i.e. interactions) allows for a direct investigation of a moderation hypothesis.

When we use a product term, we should define one of the variables as the moderator and one of the variables as the predictor of interest. Let’s call \(verb^{*}_{1i}\) the predictor of interest, and \(grad_{i}\) the moderator. When the moderator is a dummy variable then the form of the moderation becomes fairly simple; we will have one equation for \(grad_{i} = 0\), and a second equation for \(grad_i = 1\). To illustrate the notion of two equations, let’s rewrite the regression equation

\[ verb_{2i} = b_0(1_i) + b_1(verb^{*}_{1i}) + b_2(grad_{i}) + b_3(verb^{*}_{1i})(grad_{i}) + \epsilon_{i}\] as \[ verb_{2i} = [b_0(1_i) + b_2(grad_{i})] + [b_1 + b_3(grad_{i})](verb^{*}_{1i}) + \epsilon_{i}\]

Then, we insert the two possible values for \(grad_i\). For example, if \(grad_i = 0\) , then the equation simplifies to

\[ verb_{2i} = [b_0](1_i) + [b_1](verb^{*}_{1i}) + \epsilon_{i}\]

Please note that the equation above represents the relation between \(verb_{2i}\) and \(verb^{*}_{1i}\) for children whose mothers did not graduate high school. Therefore, we can interpret the parameter estimates \(b_0\) and \(b_1\) with respect to those children. More specifically, the expected Grade 2 verbal score for a child whose mother did not graduate high school and who had an average Grade 1 verbal score is \(b_0\). Also, for a child whose mother did not graduate high school, \(b_1\) is the expected difference in their Grade 2 verbal score for a one-point difference in their Grade 1 verbal score.

Now, we return to the original equation and insert \(grad_i = 1\) in order to make meaning out of \(b_2\) and \(b_3\). If \(grad_i = 1\), then we have

\[ verb_{2i} = [b_0(1_i) + b_2(1)] + [(b_1 + b_3(1))](verb^{*}_{1i}) + \epsilon_{i}\]

This new equation above represents the relation between \(verb_{2i}\) and \(verb^{*}_{1i}\) for children whose mothers did graduate high school. The parameter estimates \(b_0\) and \(b_1\) maintain their interpretation from before. But now each of them is moderated (i.e. shifted or altered) by \(b_2\) or \(b_3\).

Specifically, the expected Grade 2 verbal score for a child whose mother did graduate high school and who earned an average Grade 1 verbal score is \(b_0 + b_2\). And, for a child whose mother did graduate high school, \(b_1 + b_3\) is the expected difference in their Grade 2 verbal score for a one-point change in their Grade 1 verbal score.

OK - let’s fit the model! Note that within this model we use the code ‘I(verb1_star * grad)’. This produces the interaction term within the model. The wrapper function ‘I()’ tells R to perform this computation as-is; otherwise we would need to perform the computation (i.e. the multiplication of ‘verb1_star’ by ‘grad’) outside of the lm() function. A word of caution: the model may run without I(), but it may not give back the intended result. Therefore, it is best practice to use the I() function to be explicit about which variables are multiplied together to form the interaction. Alternatively, one can also use the code ‘verb1_star:grad’ (with the colon). Be careful of the assumptions made by the lm() function, and always check that the output makes sense in relation to the intended model.
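The model5 fit (call reconstructed from the output):

```r
model5 <- lm(verb2 ~ 1 + verb1_star + grad + I(verb1_star * grad),
             data = wiscsub, na.action = na.exclude)
summary(model5)
```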

## 
## Call:
## lm(formula = verb2 ~ 1 + verb1_star + grad + I(verb1_star * grad), 
##     data = wiscsub, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3433  -3.0761  -0.0825   2.5689  10.7289 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           25.2663     0.3416  73.956   <2e-16 ***
## verb1_star             0.7861     0.0604  13.015   <2e-16 ***
## grad                   1.4632     0.8107   1.805   0.0726 .  
## I(verb1_star * grad)  -0.2430     0.1324  -1.836   0.0678 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.232 on 200 degrees of freedom
## Multiple R-squared:  0.5268, Adjusted R-squared:  0.5197 
## F-statistic: 74.22 on 3 and 200 DF,  p-value: < 2.2e-16

The parameter estimates from this model indicate that, for children whose mother did not graduate high school, the expected Grade 2 verbal score for a child that earned an average 1st grade verbal score equals 25.2663 (\(b_0\)). Also, for children whose mother did not graduate high school, a 1-point difference in their Grade 1 verbal score is expected to correspond with a 0.7861 (\(b_1\)) point difference in the Grade 2 verbal score.

Moreover, the parameter estimates indicate that, for children whose mother did graduate high school, the expected Grade 2 verbal score for a child that earned an average Grade 1 verbal score is 25.2663 + 1.4632 = 26.7295 (\(b_0 + b_2\)). Also, for children whose mother did graduate high school, a 1-point difference in their Grade 1 verbal score is expected to correspond with a (\(b_1 + b_3\)) = 0.7861 - 0.2430 = 0.5431 point difference in the Grade 2 verbal score.

The example from ‘model5’ contained an interaction using a dummy variable (i.e., \(grad_i\)). Interactions may also occur between two continuous variables (e.g., \(verb^{*}_{1i}\) and \(momed^{*}_{i}\)). We do not cover those here, but note that it is still very useful to consider and communicate such interactions as moderation. There are many resources on interactions of two (or more) continuous variables.

See the regression bible … Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, London, Sage. and http://www.quantpsy.org/interact/mlr2_instructions.pdf

Do we have everything that is needed for reporting results in (1) text, (2) table, and (3) figure?

Which should we report - R-squared or adjusted R-squared? See, for example, http://stackoverflow.com/questions/2870631/what-is-the-difference-between-multiple-r-squared-and-adjusted-r-squared-in-a-si

Now, let’s move on to non-Gaussian outcome variables.

4 Generalized Linear Model for non-normal outcomes

Earlier we noted that simple linear regression hinges on the assumption that the outcome variable is normally distributed. However, many of our research enterprises contain measures of constructs that are not continuous. Therefore, we need to be able to adjust our analytic approach to accommodate non-normal outcome variables.

When the outcome is not normally distributed, we set up a pair of equations. The first defines a linear predictor, \(\boldsymbol{\eta}\), that corresponds to the right-hand side of the original regression equation,

\[ \boldsymbol{\eta} = \boldsymbol{X}\boldsymbol{\beta} \]

(the random part of the model is now captured by the assumed distribution of the outcome rather than by an additive residual term).

The second equation connects \(\boldsymbol{\eta}\) to the outcome through a link function, \(g(\cdot)\), which transforms \(\boldsymbol{Y}\) onto the same scale as \(\boldsymbol{\eta}\). That is,

\[g(\boldsymbol{Y})=\boldsymbol{\eta}\]

Each link function also has an inverse, \(h(\cdot)=g^{-1}(\cdot)\), which allows us to define

\[\boldsymbol{Y}=g^{-1}(\boldsymbol{\eta})=h(\boldsymbol{\eta})\]

Therefore, we use the link function to model the conditional expectation of \(\boldsymbol{Y}\) (conditional because it is the expected value of \(\boldsymbol{Y}\) given the levels of the predictors and the chosen link). From a conceptual point of view, the link function \(g(\cdot)\) “transforms” the outcome variable \(\boldsymbol{Y}\) into an outcome that can be modeled in a linear, normal-theory space.

4.2 Interpretation

After inclusion of the link function (i.e., \(g(\cdot)\)), the modeling proceeds as usual in a linear space. In this linearized space the interpretation of the significance and sign of parameters is straightforward. One just has to remember that the outcome is in log or logit units. For substantive interpretation, it is often easier to back-transform the results to the original metric (using the inverse link function, \(h(\cdot)\)). For example, in a Poisson regression, one might want to communicate about the expected count (e.g., number of alcoholic drinks) rather than about the expected log count. Similarly, in a logistic regression, one might want to communicate about the probability of an event (e.g., relapse) given some specific values of the predictors rather than the logarithm of the odds of the event. These transformations complicate matters because they are nonlinear, so intercepts and slopes no longer have a strictly additive role. Good explanations can be found, among other places, at

http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm

https://onlinecourses.science.psu.edu/stat504/node/165

http://www.ats.ucla.edu/stat/r/dae/logit.htm

5 Poisson Regression: Empirical example with a Count (Poisson distributed) outcome variable

Returning to the ‘wiscsub’ data set, notice that the variable ‘momed’ represents the number of years of mothers’ education. We can round this variable to obtain a count variable - the integer number of years of education. We use ‘momed_count’ as an outcome variable, and use Poisson regression to understand the relation between ‘momed_count’ and some predictor variables.

Forming ‘momed_count’ by rounding the variable ‘momed’. We do not usually suggest degrading a measurement scale; we do so here only for didactic illustration.
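A sketch of the rounding step (the exact rounding rule used originally is not shown):

```r
wiscsub$momed_count <- round(wiscsub$momed, digits = 0)
```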

Let’s look at the distribution for number of years of mother’s education

## Warning: Removed 2 rows containing missing values (geom_bar).

Alright, let’s start with the simplest model for examining number of years of mother’s education. In this case we have

\[ log_{e}(momedCount_i) = b_0 \]

Note that within Poisson regression we no longer have a residual term. In the Poisson distribution, the mean and variance are assumed to be equal, so we do not need an additional variance term.
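The fit (call reconstructed from the output; ‘model6’ is our label):

```r
model6 <- glm(momed_count ~ 1, family = "poisson",
              data = wiscsub, na.action = na.exclude)
summary(model6)
```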

## 
## Call:
## glm(formula = momed_count ~ 1, family = "poisson", data = wiscsub, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6968  -0.3555   0.2464   0.2464   1.8765  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.41293    0.02095   115.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 131.12  on 203  degrees of freedom
## Residual deviance: 131.12  on 203  degrees of freedom
## AIC: 997.07
## 
## Number of Fisher Scoring iterations: 4

For Poisson regression, the intercept has ‘log’ units for the outcome variable. Therefore, if we want to make better sense of the parameter estimate, we need to exponentiate the coefficient (i.e., the inverse link function is exponentiation).

So, to interpret the intercept, we do the following \[ e^{b_0} \]
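In R, using our model6 object:

```r
exp(coef(model6)[[1]])   # back-transform the intercept to the count scale
```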

## [1] 11.16663

So, according to the model, the expected number of years of mother’s education is 11.16663. Let’s confirm that the back-transformed parameter from this intercept-only Poisson regression matches the expectation we get from the descriptives of the raw data (i.e., the mean of the observed counts).
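Computing the sample mean of the counts directly:

```r
mean(wiscsub$momed_count, na.rm = TRUE)
```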

## [1] 11.16667

The expected number of years of education matches across the two calculations.

When preparing output from a statistical analysis for publication, we often want to include confidence intervals. For the Poisson regression, 95% confidence intervals for the parameter estimates may be obtained using the ‘confint()’ function. More specifically, we may use
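A call on the model6 object along these lines (the profiling message in the output comes from confint() applied to a glm fit):

```r
confint(model6, level = 0.95)
```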

## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 2.371585 2.453718

Ok, now let’s include a predictor \(verb^{*}_{1i}\) into the model. More specifically, we will have

\[ log_{e}(momedCount_i) = b_0 + b_1(verb^{*}_{1i}) \]

First, let’s examine the relation between mothers’ education and children’s 1st grade verbal scores. But, since we use the log link function for count outcomes, let’s also plot the natural logarithm of the number of years of mothers’ education.

Now, to run the model.
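The call is reconstructed from the output; ‘model7’ is the label used later in the text.

```r
model7 <- glm(momed_count ~ 1 + verb1_star, family = "poisson",
              data = wiscsub, na.action = na.exclude)
summary(model7)
```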

## 
## Call:
## glm(formula = momed_count ~ 1 + verb1_star, family = "poisson", 
##     data = wiscsub, na.action = na.exclude)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.21689  -0.33002   0.08319   0.39723   1.42025  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 2.407459   0.021066 114.283  < 2e-16 ***
## verb1_star  0.018020   0.003592   5.016 5.28e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 131.12  on 203  degrees of freedom
## Residual deviance: 106.07  on 202  degrees of freedom
## AIC: 974.02
## 
## Number of Fisher Scoring iterations: 4

Within this model the intercept (\(b_0\)) is the expected ‘log’ of number of years of education for a child that earned an average 1st grade verbal score. Exponentiate to obtain an estimate on the original ‘count’ scale.
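For example:

```r
exp(coef(model7)[[1]])   # expected count when verb1_star = 0
```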

## [1] 11.10571

Therefore, a mother is expected to have 11.10571 years of education if their child had an average 1st grade verbal score.

Within model7, \(b_1\) is the difference in ‘log’ number of years of education for each 1 point difference in their child’s first grade verbal score. Therefore, we expect a 0.018020 difference in the log-count of mothers’ number of years of education for a 1-point difference in 1st grade verbal score. This relation may be difficult to conceptualize because the outcome variable is in terms of logarithm units, so a plot may be a more intuitive display of the results.

Again, we may obtain 95% confidence intervals by using the function ‘confint’.
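As before, a call on the fitted object suffices:

```r
confint(model7)
```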

## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) 2.36588150 2.44846189
## verb1_star  0.01097259 0.02505496

Now, let’s plot the model-implied relationship, which follows the form \[ log_{e}(momedCount_i) = 2.407459 + 0.018020(verb^{*}_{1i}) \]

Now, let’s transform back to the relation between the raw count of mother’s number of years of education and the sample-mean centered 1st grade verbal score, which follows the form \[ momedCount_i = exp(2.407459 + 0.018020(verb^{*}_{1i})) \]

Ok, now let’s add another variable into the model: gender. Specifically, the variable \(female_i\) equals 0 if the child was male, and 1 if the child was female. Then we have the following Poisson regression.

\[ log_{e}(momedCount_i) = b_0 + b_1(female_i)+ b_2(verb^{*}_{1i})+ b_3(verb^{*}_{1i})(female_i) \]
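Fitting the model (call reconstructed from the output; ‘model8’ is our label):

```r
model8 <- glm(momed_count ~ 1 + female + verb1_star + I(verb1_star * female),
              family = "poisson", data = wiscsub, na.action = na.exclude)
summary(model8)
```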

## 
## Call:
## glm(formula = momed_count ~ 1 + female + verb1_star + I(verb1_star * 
##     female), family = "poisson", data = wiscsub, na.action = na.exclude)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28473  -0.33641   0.07876   0.42523   1.46899  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            2.4071832  0.0298161  80.734  < 2e-16 ***
## female                 0.0008384  0.0421333   0.020 0.984124    
## verb1_star             0.0162466  0.0048763   3.332 0.000863 ***
## I(verb1_star * female) 0.0039115  0.0072258   0.541 0.588278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 131.12  on 203  degrees of freedom
## Residual deviance: 105.78  on 200  degrees of freedom
## AIC: 977.72
## 
## Number of Fisher Scoring iterations: 4

Coefficients \(b_0\) and \(b_2\) describe the relation between \(verb^{*}_{1i}\) and \(momedCount_i\) for males, and \(b_0 + b_1\) and \(b_2 + b_3\) describe the relation for females. Note that neither \(b_1\) nor \(b_3\) is significantly different from zero, so boys and girls have about the same relation between 1st grade verbal scores and the number of years of mothers’ education.

We can see this first on the ‘log’ scale

\[ log_{e}(momedCount_i) = 2.4071832 + 0.0008384(female_i)+ 0.0162466(verb^{*}_{1i})+ 0.0039115(verb^{*}_{1i})(female_i) \]

And we can also see this relation on the count scale after application of the link function. \[ momedCount_i = exp(2.4071832 + 0.0008384(female_i)+ 0.0162466(verb^{*}_{1i})+ 0.0039115(verb^{*}_{1i})(female_i)) \]

6 Logistic Regression: Empirical example with a Binomial (Bernoulli) distributed outcome variable

In logistic regression, we are interested in how various predictors are related to the probability of a specific outcome, \(Pr(Y_i = 1) = \pi_i\). Making use of the logit link function, the general equation for logistic regression is
\[logit(\pi_i) = \beta_{0} + \beta_{1}x_1 + ... + \beta_{q}x_q\]

Which after back transformation gives us … \[ Pr(Y_i = 1) = \pi_i = \frac{e^{\beta_{0} + \beta_{1}x_1 + ... + \beta_{q}x_q}}{1+e^{\beta_{0} + \beta_{1}x_1 + ... + \beta_{q}x_q}} \]

For the logistic regression example, let’s shift our focus to the outcome variable \(grad_i\), which indicates whether the mother graduated from high school. The variable ‘grad’ in dataframe ‘wiscsub’ equals 0 if the mother did not graduate high school, and 1 if the mother did graduate high school. Let’s start with the simplest model for predicting ‘grad’, the intercept-only model. More specifically, we have

\[ logit(\pi_i) = b_0(1_i)\] where \(\pi_i = Pr(grad_i = 1)\).

We can use the glm() function to fit the model to the data
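The call is reconstructed from the output; ‘model9’ is our label.

```r
model9 <- glm(grad ~ 1, family = "binomial",
              data = wiscsub, na.action = na.exclude)
summary(model9)
```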

## 
## Call:
## glm(formula = grad ~ 1, family = "binomial", data = wiscsub, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7149  -0.7149  -0.7149  -0.7149   1.7260  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2340     0.1675  -7.365 1.77e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.78  on 203  degrees of freedom
## Residual deviance: 217.78  on 203  degrees of freedom
## AIC: 219.78
## 
## Number of Fisher Scoring iterations: 4

The intercept, \(b_0\), reflects the expected log-odds of the outcome variable.

To make better sense of the intercept, we can use the inverse link function \[ Pr(grad_i = 1) = \pi_i = \frac{e^{b_0}}{1+e^{b_0}} \]
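A sketch of the calculation on our model9 object (plogis() would give the same result):

```r
exp(coef(model9)[[1]]) / (1 + exp(coef(model9)[[1]]))   # probability scale
```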

## [1] 0.2254821

So the expected probability of the mother graduating is 0.2254821. Let’s confirm that the back-transformed parameter from this intercept-only logistic regression matches the expectation we get from the descriptives of the raw data.
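The observed proportion:

```r
mean(wiscsub$grad, na.rm = TRUE)   # proportion of mothers who graduated
```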

## [1] 0.2254902

The answers match, both indicate that the expected probability of mother having graduated from high school in this sample is about .225.

Ok, let’s include a predictor. Let’s include \(verb^{*}_{1i}\) such that

\[ logit(\pi_i) = b_0(1_i) + b_1(verb^{*}_{1i}) \] where \(\pi_i = Pr(grad_i = 1)\). (As with the Poisson model, there is no additive residual term.)

And now let’s fit the model
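The call is reconstructed from the output; ‘model10’ is our label.

```r
model10 <- glm(grad ~ 1 + verb1_star, family = "binomial",
               data = wiscsub, na.action = na.exclude)
summary(model10)
```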

## 
## Call:
## glm(formula = grad ~ 1 + verb1_star, family = "binomial", data = wiscsub, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5030  -0.6956  -0.5529  -0.3288   2.2344  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.39038    0.18954  -7.336  2.2e-13 ***
## verb1_star   0.13681    0.03278   4.174  3.0e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.78  on 203  degrees of freedom
## Residual deviance: 197.87  on 202  degrees of freedom
## AIC: 201.87
## 
## Number of Fisher Scoring iterations: 4

The parameter estimate \(b_0\) reflects the expected log-odds of a mother graduating for a child with an average Grade 1 verbal score. Using the inverse link function …
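That is, applying the same inverse-logit calculation to the model10 intercept:

```r
exp(coef(model10)[[1]]) / (1 + exp(coef(model10)[[1]]))
```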

## [1] 0.1993471

We note that, for a child with an average Grade 1 verbal score (in this sample), there is a .2 probability of having a mother that graduated from high school.

Going back to the model parameters, the estimate for \(b_1\) indicates the expected difference in the log-odds of a mother graduating for a 1-point difference in their child’s Grade 1 verbal score. Therefore, we expect a 0.13681 difference in the log-odds of mothers graduating for a 1-point difference in children’s 1st grade verbal score.

Parameter estimates from a logistic regression are often reported in terms of ‘odds’ rather than ‘log-odds’. To obtain parameters in odds units, we simply exponentiate the coefficients. Note that this is just one of the steps of the inverse link function (which would take us all the way to probability units). Therefore, we have
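For example:

```r
exp(coef(model10))   # coefficients on the odds(-ratio) scale
```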

## (Intercept)  verb1_star 
##   0.2489803   1.1466101

And to obtain 95% confidence intervals for these coefficients, we use
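A common idiom (matching the column labels in the output below) is:

```r
exp(cbind(OR = coef(model10), confint(model10)))
```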

## Waiting for profiling to be done...
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2489803 0.1685443 0.3554798
## verb1_star  1.1466101 1.0776626 1.2262119

Using these parameter estimates, we can say that the odds of a mother having graduated are 0.2489 if her child had an average first grade verbal score. Also, for each 1-point difference in the child’s 1st grade verbal score, the odds of the mother having graduated are expected to be multiplied by 1.146 (i.e., a 14.6% increase in the odds). Communicating results from logistic regression is always difficult, but translation to odds or probabilities can help tremendously. Do what you can to make it easy for your readers.

Using the inverse link function, we can produce an informative plot of the relation between 1st grade verbal scores and probability of mothers’ graduation in the original 0 to 1 scale.

\[ Pr(grad_i=1) = \pi_i = \frac{e^{-1.39038+0.13681(verb^{*}_{1i})}}{1+e^{-1.39038+0.13681(verb^{*}_{1i})}}\]

We can see that with higher 1st grade verbal scores, the probability of mothers having graduated is higher.

Ok, let’s include another predictor. Let’s look at verbal score and gender.

\[ logit(\pi_i) = b_0 + b_1(female_i) + b_2(verb^{*}_{1i}) + b_3(verb^{*}_{1i})(female_i) \] where \(\pi_i = Pr(grad_i = 1)\).
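Fitting the model (call reconstructed from the output; ‘model11’ is our label):

```r
model11 <- glm(grad ~ 1 + female + verb1_star + I(verb1_star * female),
               family = "binomial", data = wiscsub, na.action = na.exclude)
summary(model11)
```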

## 
## Call:
## glm(formula = grad ~ 1 + female + verb1_star + I(verb1_star * 
##     female), family = "binomial", data = wiscsub, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4224  -0.6894  -0.5705  -0.2692   2.4419  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.38195    0.25825  -5.351 8.73e-08 ***
## female                 -0.07282    0.38885  -0.187   0.8514    
## verb1_star              0.09119    0.04146   2.200   0.0278 *  
## I(verb1_star * female)  0.11490    0.06987   1.644   0.1001    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.78  on 203  degrees of freedom
## Residual deviance: 194.79  on 200  degrees of freedom
## AIC: 202.79
## 
## Number of Fisher Scoring iterations: 5

Parameters \(b_0\) and \(b_2\) describe the relation between \(verb^{*}_{1i}\) and \(grad_i\) for males, and \(b_0 + b_1\) and \(b_2 + b_3\) describe the relation for females. Note that neither \(b_1\) nor \(b_3\) is significantly different from zero, so boys and girls have about the same relation between 1st grade verbal scores and the probability of their mother having graduated high school.

We can see this first on the log-odds scale as

\[ logit(\pi_i) = -1.38195 - 0.07282(female_i) + 0.09119(verb^{*}_{1i}) + 0.11490(verb^{*}_{1i})(female_i) \] where \(\pi_i = Pr(grad_i = 1)\).

Transformed to the probability (0 to 1) scale, this is

\[ Pr(grad_i = 1) = \pi_i = \frac{e^{-1.38195 - 0.07282(female_i) + 0.09119(verb^{*}_{1i}) + 0.11490(verb^{*}_{1i})(female_i)}}{1+e^{-1.38195 - 0.07282(female_i) + 0.09119(verb^{*}_{1i}) + 0.11490(verb^{*}_{1i})(female_i)}} \]

So, in summary, fitting linear models to binary outcomes is very similar to fitting models to Gaussian outcomes. We use the logit link function to normalize the outcome variable, and then all of the parameter estimates are in terms of log-odds. Transformation back into the probability space may provide for easier interpretation.

What the equivalent of an R-squared is for logistic regression is not clear. Relevant info and discussion can be found in the following places -

http://statisticalhorizons.com/r2logistic

http://stats.stackexchange.com/questions/3559/which-pseudo-r2-measure-is-the-one-to-report-for-logistic-regression-cox-s?rq=1

with general info on pseudo R-squared here http://statistics.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm

and it looks like R-specific implementations are available: Harrell’s rms package gives Nagelkerke’s R2 (http://stats.stackexchange.com/questions/8511/how-to-calculate-pseudo-r2-from-rs-logistic-regression), and the pR2() function in the pscl package gives McFadden’s R2 (as noted in the second answer).

7 Some Take-home Points

  1. We use regression analysis to understand how a focal construct differs in relation to other variables. Regression, at its core, can be performed through simple matrix manipulation.

  2. The regression analysis produces an ‘intercept’ and a set of ‘slopes’. We derive meaning from these parameters by connecting them to a specific phenomenon. For example, the ‘intercept’ represents the expected value of the outcome variable when all predictor variables are zero; centering the predictor variables may add interpretive value. The ‘slopes’ represent the expected difference in the outcome variable given a 1-unit difference in a predictor variable.

  3. Many non-Gaussian outcome variables can be accommodated within the Generalized Linear Modeling framework through use of link-functions that “transform” the outcome into a linear space. For count outcomes we often use the log link to invoke Poisson Regression. For binary outcomes we often use a logit link to invoke Logistic Regression. Interpretation of results is often facilitated through back-transformation using inverse link functions.

  4. The estimation programs (e.g., R) do a lot of the heavy lifting for us, but there is no substitute for understanding how the data are being manipulated to produce parameter estimates. Readers will be appreciative of concise and clear communication of results in meaningful and interpretable units.