Overview

In this tutorial we walk through the very basics of conducting confirmatory factor analysis (CFA) in R. This is not a comprehensive coverage, just something to get started. This is a one-off done as part of a guest lecture.

Outline

In this session we cover …

A. Introduction to the Factor Analysis Model
B. Example 1: Basic CFA orientation & interpretation
C. Example 2: A Different Set of Variables

Prelim - Loading libraries used in this script.

library(psych)
library(ggplot2)
library(corrplot) #plotting correlation matrices
library(lavaan)  #for fitting structural equation models
library(semPlot)  #for automatically making diagrams 

A. Introduction to the Factor Analyis Model

Recall from last time that the basic factor analysis model is written as series of equations of the form …

\[y_{pi} = \lambda_{pq} f_{qi} + u_{pi}\] where \(y_{pi}\) is individual i’s score on the pth observed variable, \(f_{qi}\) is individual i’s score on the qth latent common factor, \(u_{pi}\) is individual i’s score on the pth latent unique factor, and \(\lambda_{pq}\) is the factor loading that indicates the relation between the pth observed variable and the qth latent common factor.

Typically, we have multiple observed variables and one or more common factors. For instance in the 4 variable, 2 factor case we would have …

\[y_{1i} = \lambda_{11} f_{1i} + \lambda_{12} f_{2i} + u_{1i}\] \[y_{2i} = \lambda_{21} f_{1i} + \lambda_{22} f_{2i} + u_{2i}\] \[y_{3i} = \lambda_{31} f_{1i} + \lambda_{32} f_{2i} + u_{3i}\] \[y_{4i} = \lambda_{41} f_{1i} + \lambda_{42} f_{2i} + u_{4i}\]

which can be written in a compact matrix form as

\[ \boldsymbol{Y_{i}} = \boldsymbol{\Lambda}\boldsymbol{F_{i}} + \boldsymbol{U_{i}} \] where \(\boldsymbol{Y_{i}}\) is a \(p\) x 1 vector of observed variable scores, \(\boldsymbol{\Lambda}\) is a p x q matrix of factor loadings, \(\boldsymbol{F_{i}}\) is a \(q\) x 1 vector of common factor scores, and \(\boldsymbol{U_{i}}\) is a p x 1 vector of unique factor scores.

Extension to multiple persons provided for mapping to the observed correlation matrix, \(\boldsymbol{\Sigma} = \boldsymbol{Y}'\boldsymbol{Y}\) and the common factor model becomes

\[ \boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta} \] where \(\boldsymbol{\Sigma}\) is a p x p covariance (or correlation) matrix of the observed variables, \(\boldsymbol{\Lambda}\) is a p x q matrix of factor loadings, \(\boldsymbol{\Psi}\) is a q x q covariance matrix of the latent factor variables, and \(\boldsymbol{\Theta}\) is a diagonal matrix of unique factor variances.

Confirmatory Factor Analysis is usually conduced within a Structural Equation Modeling (SEM) framework. There are a number of SEM packages in R. We currently tend to be using the lavaan package.
Lots of good information and instruction can be found on the package website … http://lavaan.ugent.be , including one of the examples we use here.

Prelim - Reading in Repeated Measures Data

For this example, we use classic data from Osborne & Sudick (1972) that has been the basis for many examples in Jack McArdles and colleagues papers, courses, and workshops.
The data were obtained from 204 participants (rows) who completed a vareity of cognitive ability tests that are part of the WISC.

#set filepath for data file
filepath <- "https://quantdev.ssri.psu.edu/sites/qdev/files/WISC_CFAexample.csv"
#read in the .csv file using the url() function
dat <- read.csv(file=url(filepath),header=TRUE)

#subsetting to the variables of interest 
dat <- dat[,c("id","info_06","comp_06","simi_06","voca_06","picc_06","pica_06","bloc_06","obje_06")]

Lets have a quick look at the data file and the descriptives.

#data structure
head(dat,10)
##    id  info_06   comp_06   simi_06  voca_06   picc_06  pica_06    bloc_06
## 1   1 31.28655 25.626566 22.932331 22.21491 25.877193 8.310249  9.4098884
## 2   2 13.80117 14.786967  7.581454 15.37281 10.526316 7.417667  3.8915470
## 3   3 34.96970 34.675325 28.051948 26.84091 42.818182 6.443381  4.9586777
## 4   4 24.79532 31.390977  8.208020 20.19737 50.000000 2.708526  8.8676236
## 5   5 25.26316 30.263158 15.977444 35.41667 44.298246 9.295168  8.7400319
## 6   6 15.40230 23.399015 11.453202 20.56034 18.103448 8.832426  1.5047022
## 7   7 15.38012 -1.253133  2.318296 13.00439  8.596491 1.662050 -0.3827751
## 8   8 19.88304  6.704261 14.160401 14.86842 39.912281 6.463527  7.0813397
## 9   9 12.63158 13.847118 10.275689 23.22368 19.736842 6.401970 -0.3189793
## 10 10 23.69048 25.446429 11.415816 21.78571 35.178571 7.675439  7.8571429
##      obje_06
## 1  39.164087
## 2   4.179567
## 3  41.764706
## 4  63.054696
## 5  43.085655
## 6  13.793103
## 7   4.437564
## 8  19.762642
## 9  25.902993
## 10 12.710084
#descriptives
describe(dat)
##         vars   n   mean    sd median trimmed   mad   min    max  range
## id         1 204 102.50 59.03 102.50  102.50 75.61  1.00 204.00 203.00
## info_06    2 204  19.78  6.12  19.13   19.77  5.53  1.04  34.97  33.93
## comp_06    3 204  21.80  9.74  21.90   21.98 10.32 -1.25  46.49  47.74
## simi_06    4 204  14.90  7.56  14.54   14.57  7.83 -4.52  37.76  42.28
## voca_06    5 204  20.40  6.29  19.84   20.11  6.52  2.77  39.66  36.89
## picc_06    6 204  28.25 12.15  29.20   28.60 12.68 -3.42  55.18  58.60
## pica_06    7 204   9.31  8.41   7.40    7.97  5.77 -1.85  54.30  56.14
## bloc_06    8 204   6.94  6.60   5.81    6.12  4.62 -5.58  42.19  47.78
## obje_06    9 204  25.26 16.39  22.34   23.89 15.60 -4.54  71.10  75.64
##          skew kurtosis   se
## id       0.00    -1.22 4.13
## info_06 -0.06     0.17 0.43
## comp_06 -0.13    -0.49 0.68
## simi_06  0.39    -0.14 0.53
## voca_06  0.37     0.02 0.44
## picc_06 -0.24    -0.40 0.85
## pica_06  1.97     5.36 0.59
## bloc_06  2.05     6.77 0.46
## obje_06  0.66    -0.24 1.15

B. Example 1: Basic CFA orientation & interpretation

Our goal is to code a model that matches an a priori hypothesis about the structure of the data, and evaluate the match between that model, specifically the mean and variance-covariance expectations, and the observed data (i.e., the observed means and variance-covaraince matrix).

We go through a series of models here that match those examined in the lecture slides.

Of particular interest for what we are doing is the correlation matrix of the first 4 substantive variables in the data set. We can look at both the numeric and visual versions of the correlation matrix.

round(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]),2)
##         info_06 voca_06 comp_06 simi_06
## info_06    1.00    0.56    0.51    0.54
## voca_06    0.56    1.00    0.58    0.44
## comp_06    0.51    0.58    1.00    0.45
## simi_06    0.54    0.44    0.45    1.00

And a compact visual version from the corrplot package.

#visusal correlation matrix
corrplot(cor(dat[,c("info_06","voca_06","comp_06","simi_06")]), order = "original", tl.col='black', tl.cex=.75) 

We will consider a set of 3 models …
* 0-Factor model * 1-Factor model * 2-Factor model

Each time we engage with … Model Specification, Model Estimation, and Model Evaluation.

0-Factor Model.

We start with a 0-factor model - basically a null model. But it is still a set of assumptions. Namely, that scores on one test are NOT related to scores on any of the other tests - i.e., that there are no covariances among the 4 tests.

Model Specification

First, code the model into a lavaan model object. Note that there are a variety of short-hand ways to program these models (i.e, defualts). We take a very explicit and comprehensive approach - where we basically ‘draw’ every single arrow into the model.

Basically, there are four formula types that correspond to different kinds of arrows.

latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.

We set up a simple zero factor model …

#Model with 0 common factors 
verbal_0factor <- ' #start of model

# latent variable definitions (common factors)
# latent variable variances
# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  info_06 ~~ info_06
  voca_06 ~~ voca_06
  comp_06 ~~ comp_06
  simi_06 ~~ simi_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  info_06 ~ 1
  voca_06 ~ 1
  comp_06 ~ 1
  simi_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit0 <- lavaan(verbal_0factor, data=dat, mimic = "mplus")
summary(fit0, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after   8 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.000
##   Tucker-Lewis Index (TLI)                       0.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2777.649
##   Loglikelihood unrestricted model (H1)      -2647.095
## 
##   Number of free parameters                          8
##   Akaike (AIC)                                5571.297
##   Bayesian (BIC)                              5597.842
##   Sample-size adjusted Bayesian (BIC)         5572.496
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.457
##   90 Percent Confidence Interval          0.410  0.505
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.337
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     info_06          19.776    0.427   46.273    0.000   19.776    3.240
##     voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##     comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##     simi_06          14.903    0.528   28.224    0.000   14.903    1.976
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     info_06          37.261    3.689   10.100    0.000   37.261    1.000
##     voca_06          39.390    3.900   10.100    0.000   39.390    1.000
##     comp_06          94.434    9.350   10.100    0.000   94.434    1.000
##     simi_06          56.881    5.632   10.100    0.000   56.881    1.000
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit0, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Note that we asked for standardized estimates in the diagram.

We can also look more closely at how well the model fit, and interpret the model more generally. * Note that there are 6 degrees of freedom in this model. * Note how bad the fit of the model is. * There is room for improvement. Do we have another theory?

1-Factor Model

We have another theory - that the 4 tests are indicators of a single latent factor we’ll call Verbal ability.

Model Specification

We specify a 1-factor model by defining a latent variable and adding in some additional arrows. The factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).

Basically, there are four formula types that correspond to different kinds of arrows.

latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.

We set up a 1-factor model …

#Model with 1 common factor 
verbal_1factor <- ' #start of model

# latent variable definitions (common factors)
  verbal =~ NA*info_06 + 
            NA*voca_06 +
            NA*comp_06 +
            NA*simi_06
  
# latent variable variances
  verbal ~~ 1*verbal

# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  info_06 ~~ info_06
  voca_06 ~~ voca_06
  comp_06 ~~ comp_06
  simi_06 ~~ simi_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  info_06 ~ 1
  voca_06 ~ 1
  comp_06 ~ 1
  simi_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit1 <- lavaan(verbal_1factor, data=dat, mimic = "mplus")
summary(fit1, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  40 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                7.147
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.028
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.980
##   Tucker-Lewis Index (TLI)                       0.939
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2650.669
##   Loglikelihood unrestricted model (H1)      -2647.095
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                5325.338
##   Bayesian (BIC)                              5365.155
##   Sample-size adjusted Bayesian (BIC)         5327.136
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.112
##   90 Percent Confidence Interval          0.032  0.206
##   P-value RMSEA <= 0.05                          0.088
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.025
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbal =~                                                             
##     info_06           4.582    0.412   11.112    0.000    4.582    0.751
##     voca_06           4.714    0.421   11.192    0.000    4.714    0.751
##     comp_06           7.013    0.659   10.642    0.000    7.013    0.722
##     simi_06           4.843    0.527    9.190    0.000    4.843    0.642
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .info_06          19.776    0.427   46.273    0.000   19.776    3.240
##    .voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##    .comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##    .simi_06          14.903    0.528   28.224    0.000   14.903    1.976
##     verbal            0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     verbal            1.000                               1.000    1.000
##    .info_06          16.271    2.419    6.725    0.000   16.271    0.437
##    .voca_06          17.167    2.517    6.820    0.000   17.167    0.436
##    .comp_06          45.258    6.176    7.328    0.000   45.258    0.479
##    .simi_06          33.426    4.031    8.293    0.000   33.426    0.588
Model Evaluation

Some basic questions to ask after fitting a model - are Did it work?
Does it make sense? Are the parameters viable?

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit1, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Then, we can look more closely at how well the model fit, and interpret the model more generally.

  • Note that there are 2 degrees of freedom in this model.
  • Note how much better the fit of the model is.
  • We can formalize the improvement in fit

We do model comparison…
Looking at fit statistics

#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures')),3)
##                            m1        m2
## npar                    8.000    12.000
## fmin                    0.640     0.018
## chisq                 261.106     7.147
## df                      6.000     2.000
## pvalue                  0.000     0.028
## baseline.chisq        261.106   261.106
## baseline.df             6.000     6.000
## baseline.pvalue         0.000     0.000
## cfi                     0.000     0.980
## tli                     0.000     0.939
## nnfi                    0.000     0.939
## rfi                     0.000     0.918
## nfi                     0.000     0.973
## pnfi                    0.000     0.324
## ifi                     0.000     0.980
## rni                     0.000     0.980
## logl                -2777.649 -2650.669
## unrestricted.logl   -2647.095 -2647.095
## aic                  5571.297  5325.338
## bic                  5597.842  5365.155
## ntotal                204.000   204.000
## bic2                 5572.496  5327.136
## rmsea                   0.457     0.112
## rmsea.ci.lower          0.410     0.032
## rmsea.ci.upper          0.505     0.206
## rmsea.pvalue            0.000     0.088
## rmr                    21.820     1.531
## rmr_nomean             21.820     1.531
## srmr                    0.337     0.025
## srmr_bentler            0.337     0.025
## srmr_bentler_nomean     0.399     0.029
## srmr_bollen             0.337     0.025
## srmr_bollen_nomean      0.399     0.029
## srmr_mplus              0.337     0.025
## srmr_mplus_nomean       0.399     0.029
## cn_05                  10.838   172.016
## cn_01                  14.135   263.893
## gfi                     0.953     0.998
## agfi                    0.889     0.984
## pgfi                    0.408     0.143
## mfi                     0.535     0.987
## ecvi                       NA        NA

Formal model test

#Chi-square difference test for nested models 
anova(fit0, fit1)
## Chi Square Difference Test
## 
##      Df    AIC    BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
## fit1  2 5325.3 5365.2   7.147                                  
## fit0  6 5571.3 5597.8 261.106     253.96       4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis that the models fit the same.

2-Factor Model

We have yet another theory - that the 4 tests are indicators of two correlated latent factor we’ll call F1 and F2.

Model Specification

We specify a 2-factor model by defining the latent variables and adding in some additional arrows. Each factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
We won’t go into it here, but identification constraints are needed and important.

We set up a 2-factor model …

#Model with 1 common factor 
verbal_2factor <- ' #start of model

# latent variable definitions (common factors)
  f1 =~ NA*info_06 + 
        NA*voca_06 
  f2 =~ NA*comp_06 +
        NA*simi_06
  
# latent variable variances
  f1 ~~ 1*f1
  f2 ~~ 1*f2

# latent variable covariances
  f1 ~~ f2
# latent variable means

# manifest variable variances (uniquenesses)
  info_06 ~~ info_06
  voca_06 ~~ voca_06
  comp_06 ~~ comp_06
  simi_06 ~~ simi_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  info_06 ~ 1
  voca_06 ~ 1
  comp_06 ~ 1
  simi_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit2 <- lavaan(verbal_2factor, data=dat, mimic = "mplus")
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
##                 is not positive definite;
##                 use inspect(fit,"cov.lv") to investigate.
summary(fit2, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  55 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                6.805
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.009
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              261.106
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.977
##   Tucker-Lewis Index (TLI)                       0.863
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2650.498
##   Loglikelihood unrestricted model (H1)      -2647.095
## 
##   Number of free parameters                         13
##   Akaike (AIC)                                5326.996
##   Bayesian (BIC)                              5370.132
##   Sample-size adjusted Bayesian (BIC)         5328.944
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.169
##   90 Percent Confidence Interval          0.067  0.298
##   P-value RMSEA <= 0.05                          0.030
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.024
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     info_06           4.553    0.413   11.032    0.000    4.553    0.746
##     voca_06           4.675    0.424   11.015    0.000    4.675    0.745
##   f2 =~                                                                 
##     comp_06           6.882    0.696    9.889    0.000    6.882    0.708
##     simi_06           4.778    0.538    8.886    0.000    4.778    0.634
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 ~~                                                                 
##     f2                1.035    0.061   17.067    0.000    1.035    1.035
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .info_06          19.776    0.427   46.273    0.000   19.776    3.240
##    .voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##    .comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##    .simi_06          14.903    0.528   28.224    0.000   14.903    1.976
##     f1                0.000                               0.000    0.000
##     f2                0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     f1                1.000                               1.000    1.000
##     f2                1.000                               1.000    1.000
##    .info_06          16.529    2.423    6.821    0.000   16.529    0.444
##    .voca_06          17.537    2.562    6.846    0.000   17.537    0.445
##    .comp_06          47.070    6.912    6.810    0.000   47.070    0.498
##    .simi_06          34.047    4.174    8.158    0.000   34.047    0.599
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?

NOTICE THERE WAS A WARNING Something is wrong with this model. Look at the correlation between the factors, r = 1.03. That is not viable.

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit2, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Then, we can look more closely at how well the model fit, and interpret the model more generally.

  • Note that there are 2 degrees of freedom in this model.
  • Note how much better the fit of the model is.
  • We can formalize the improvement in fit

We do model comparison…
Looking at fit statistics

#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures'),m3=inspect(fit2, 'fit.measures')),3)
##                            m1        m2        m3
## npar                    8.000    12.000    13.000
## fmin                    0.640     0.018     0.017
## chisq                 261.106     7.147     6.805
## df                      6.000     2.000     1.000
## pvalue                  0.000     0.028     0.009
## baseline.chisq        261.106   261.106   261.106
## baseline.df             6.000     6.000     6.000
## baseline.pvalue         0.000     0.000     0.000
## cfi                     0.000     0.980     0.977
## tli                     0.000     0.939     0.863
## nnfi                    0.000     0.939     0.863
## rfi                     0.000     0.918     0.844
## nfi                     0.000     0.973     0.974
## pnfi                    0.000     0.324     0.162
## ifi                     0.000     0.980     0.978
## rni                     0.000     0.980     0.977
## logl                -2777.649 -2650.669 -2650.498
## unrestricted.logl   -2647.095 -2647.095 -2647.095
## aic                  5571.297  5325.338  5326.996
## bic                  5597.842  5365.155  5370.132
## ntotal                204.000   204.000   204.000
## bic2                 5572.496  5327.136  5328.944
## rmsea                   0.457     0.112     0.169
## rmsea.ci.lower          0.410     0.032     0.067
## rmsea.ci.upper          0.505     0.206     0.298
## rmsea.pvalue            0.000     0.088     0.030
## rmr                    21.820     1.531     1.481
## rmr_nomean             21.820     1.531     1.481
## srmr                    0.337     0.025     0.024
## srmr_bentler            0.337     0.025     0.024
## srmr_bentler_nomean     0.399     0.029     0.029
## srmr_bollen             0.337     0.025     0.024
## srmr_bollen_nomean      0.399     0.029     0.029
## srmr_mplus              0.337     0.025     0.024
## srmr_mplus_nomean       0.399     0.029     0.029
## cn_05                  10.838   172.016   116.154
## cn_01                  14.135   263.893   199.891
## gfi                     0.953     0.998     0.998
## agfi                    0.889     0.984     0.971
## pgfi                    0.408     0.143     0.071
## mfi                     0.535     0.987     0.986
## ecvi                       NA        NA        NA

Formal model test

#Chi-square difference test for nested models 
anova(fit1, fit2)
## Chi Square Difference Test
## 
##      Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
## fit2  1 5327.0 5370.1 6.8053                              
## fit1  2 5325.3 5365.2 7.1470    0.34172       1     0.5588

We cannot reject the null hypothesis that the models fit the same. And - the 2-factor model is bad anyway.

We conclude that the 1-factor model is the optimal model, of those evaluated.

C. Example 2: A Different Set of Variables

Again, our goal is to code a model that matches an a priori hypothesis about the structure of the data, and evaluate the match between that model, specifically the mean and variance-covariance expectations, and the observed data (i.e., the observed means and variance-covaraince matrix).

We go through a series of models here that match those examined in the lecture slides.

Of particular interest for what we are doing is the correlation matrix of 4 substantive variables in the data set (CO, VO, BL, OB). We can look at both the numeric and visual versions of the correlation matrix.

round(cor(dat[,c("comp_06","voca_06","bloc_06","obje_06")]),2)
##         comp_06 voca_06 bloc_06 obje_06
## comp_06    1.00    0.58    0.32    0.31
## voca_06    0.58    1.00    0.39    0.45
## bloc_06    0.32    0.39    1.00    0.55
## obje_06    0.31    0.45    0.55    1.00

And a compact visual version from the corrplot package.

#visusal correlation matrix
corrplot(cor(dat[,c("comp_06","voca_06","bloc_06","obje_06")]), order = "original", tl.col='black', tl.cex=.75) 

We will consider a set of 3 models …
* 0-Factor model * 1-Factor model * 2-Factor model

Each time we engage with … Model Specification, Model Estimation, and Model Evaluation.

0-Factor Model.

We start with a 0-factor model - basically a null model. But it is still a set of assumptions. Namely, that scores on one test are NOT related to scores on any of the other tests - i.e., that there are no covariances among the 4 tests.

Model Specification

First, code the model into a lavaan model object. Note that there are a variety of short-hand ways to program these models (i.e, defualts). We take a very explicit and comprehensive approach - where we basically ‘draw’ every single arrow into the model.

Basically, there are four formula types that correspond to different kinds of arrows.

latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.

We set up a simple zero factor model …

#Model with 0 common factors 
General_0factor <- ' #start of model

# latent variable definitions (common factors)
# latent variable variances
# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  comp_06 ~~ comp_06
  voca_06 ~~ voca_06
  bloc_06 ~~ bloc_06
  obje_06 ~~ obje_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  comp_06 ~ 1
  voca_06 ~ 1
  bloc_06 ~ 1
  obje_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit0 <- lavaan(General_0factor, data=dat, mimic = "mplus")
summary(fit0, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after   8 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              214.189
##   Degrees of freedom                                 6
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              214.189
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.000
##   Tucker-Lewis Index (TLI)                       0.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2950.942
##   Loglikelihood unrestricted model (H1)      -2843.848
## 
##   Number of free parameters                          8
##   Akaike (AIC)                                5917.885
##   Bayesian (BIC)                              5944.430
##   Sample-size adjusted Bayesian (BIC)         5919.084
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.412
##   90 Percent Confidence Interval          0.366  0.461
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.292
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##     voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##     bloc_06           6.942    0.461   15.063    0.000    6.942    1.055
##     obje_06          25.258    1.145   22.059    0.000   25.258    1.544
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     comp_06          94.434    9.350   10.100    0.000   94.434    1.000
##     voca_06          39.390    3.900   10.100    0.000   39.390    1.000
##     bloc_06          43.333    4.291   10.100    0.000   43.333    1.000
##     obje_06         267.462   26.483   10.100    0.000  267.462    1.000
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit0, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Note that we asked for standardized estimates in the diagram.

We can also look more closely at how well the model fit, and interpret the model more generally. * Note that there are 6 degrees of freedom in this model. * Note how bad the fit of the model is. * There is room for improvement. Do we have another theory?

1-Factor Model

We have another theory - that the 4 tests are indicators of a single latent factor we’ll call G (general ability).

Model Specification

We specify a 1-factor model by defining a latent variable and adding in some additional arrows. The factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).

Basically, there are four formula types that correspond to different kinds of arrows.

latent variable definition =~ “is measured by” regression ~ “is regressed on” (residual) (co)variance ~~ “is correlated with” intercept ~ 1 “intercept” A complete lavaan model syntax is simply a combination of these formula types, enclosed between single quotes.

We set up a 1-factor model …

#Model with 1 common factor 
General_1factor <- ' #start of model

# latent variable definitions (common factors)
  G =~ NA*comp_06 + 
       NA*voca_06 +
       NA*bloc_06 +
       NA*obje_06
  
# latent variable variances
  G ~~ 1*G

# latent variable covariances
# latent variable means

# manifest variable variances (uniquenesses)
  comp_06 ~~ comp_06
  voca_06 ~~ voca_06
  bloc_06 ~~ bloc_06
  obje_06 ~~ obje_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  comp_06 ~ 1
  voca_06 ~ 1
  bloc_06 ~ 1
  obje_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit1 <- lavaan(General_1factor, data=dat, mimic = "mplus")
summary(fit1, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  51 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               31.453
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              214.189
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.859
##   Tucker-Lewis Index (TLI)                       0.576
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2859.574
##   Loglikelihood unrestricted model (H1)      -2843.848
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                5743.148
##   Bayesian (BIC)                              5782.966
##   Sample-size adjusted Bayesian (BIC)         5744.946
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.269
##   90 Percent Confidence Interval          0.191  0.355
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.064
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   G =~                                                                  
##     comp_06           6.409    0.721    8.891    0.000    6.409    0.660
##     voca_06           4.888    0.482   10.151    0.000    4.888    0.779
##     bloc_06           3.788    0.544    6.961    0.000    3.788    0.575
##     obje_06          10.053    1.322    7.606    0.000   10.053    0.615
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##    .voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##    .bloc_06           6.942    0.461   15.063    0.000    6.942    1.055
##    .obje_06          25.258    1.145   22.059    0.000   25.258    1.544
##     G                 0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     G                 1.000                               1.000    1.000
##    .comp_06          53.359    7.333    7.276    0.000   53.359    0.565
##    .voca_06          15.493    3.415    4.536    0.000   15.493    0.393
##    .bloc_06          28.985    3.880    7.470    0.000   28.985    0.669
##    .obje_06         166.402   23.404    7.110    0.000  166.402    0.622
Model Evaluation

Some basic questions to ask after fitting a model - are Did it work?
Does it make sense? Are the parameters viable?

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit1, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Then, we can look more closely at how well the model fit, and interpret the model more generally.

  • Note that there are 2 degrees of freedom in this model.
  • Note how much better the fit of the model is.
  • We can formalize the improvement in fit

We do model comparison…
Looking at fit statistics

#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures')),3)
##                            m1        m2
## npar                    8.000    12.000
## fmin                    0.525     0.077
## chisq                 214.189    31.453
## df                      6.000     2.000
## pvalue                  0.000     0.000
## baseline.chisq        214.189   214.189
## baseline.df             6.000     6.000
## baseline.pvalue         0.000     0.000
## cfi                     0.000     0.859
## tli                     0.000     0.576
## nnfi                    0.000     0.576
## rfi                     0.000     0.559
## nfi                     0.000     0.853
## pnfi                    0.000     0.284
## ifi                     0.000     0.861
## rni                     0.000     0.859
## logl                -2950.942 -2859.574
## unrestricted.logl   -2843.848 -2843.848
## aic                  5917.885  5743.148
## bic                  5944.430  5782.966
## ntotal                204.000   204.000
## bic2                 5919.084  5744.946
## rmsea                   0.412     0.269
## rmsea.ci.lower          0.366     0.191
## rmsea.ci.upper          0.461     0.355
## rmsea.pvalue            0.000     0.000
## rmr                    31.629     8.300
## rmr_nomean             31.629     8.300
## srmr                    0.292     0.064
## srmr_bentler            0.292     0.064
## srmr_bentler_nomean     0.345     0.076
## srmr_bollen             0.292     0.064
## srmr_bollen_nomean      0.345     0.076
## srmr_mplus              0.292     0.064
## srmr_mplus_nomean       0.345     0.076
## cn_05                  12.993    39.860
## cn_01                  17.012    60.738
## gfi                     0.946     0.988
## agfi                    0.875     0.916
## pgfi                    0.406     0.141
## mfi                     0.600     0.930
## ecvi                       NA        NA

Formal model test

#Chi-square difference test for nested models 
anova(fit0, fit1)
## Chi Square Difference Test
## 
##      Df    AIC    BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
## fit1  2 5743.1 5783.0  31.453                                  
## fit0  6 5917.9 5944.4 214.189     182.74       4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis that the models fit the same.

2-Factor Model

We have yet another theory - that the 4 tests are indicators of two correlated latent factor we’ll call Gf and Gc.

Model Specification

We specify a 2-factor model by defining the latent variables and adding in some additional arrows. Each factor must be indicated by manifest variables, and must have a variance (maybe covariance and mean).
We won’t go into it here, but identification constraints are needed and important.

We set up a 2-factor model …

#Model with 1 common factor 
General_2factor <- ' #start of model

# latent variable definitions (common factors)
  Gc =~ NA*comp_06 + 
        NA*voca_06 
  Gf =~ NA*bloc_06 +
        NA*obje_06
  
# latent variable variances
  Gc ~~ 1*Gc
  Gf ~~ 1*Gf

# latent variable covariances
  Gc ~~ Gf
# latent variable means

# manifest variable variances (uniquenesses)
  comp_06 ~~ comp_06
  voca_06 ~~ voca_06
  bloc_06 ~~ bloc_06
  obje_06 ~~ obje_06
            
# manifest variable covariances (uniquenesses)

#manifest variable means 
  comp_06 ~ 1
  voca_06 ~ 1
  bloc_06 ~ 1
  obje_06 ~ 1
' #end of model
Model Fitting

Second, we fit the specified model to the data to obrain parameter estimates.

fit2 <- lavaan(General_2factor, data=dat, mimic = "mplus")
summary(fit2, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  71 iterations
## 
##   Number of observations                           204
## 
##   Number of missing patterns                         1
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                1.167
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.280
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              214.189
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.999
##   Tucker-Lewis Index (TLI)                       0.995
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2844.431
##   Loglikelihood unrestricted model (H1)      -2843.848
## 
##   Number of free parameters                         13
##   Akaike (AIC)                                5714.862
##   Bayesian (BIC)                              5757.998
##   Sample-size adjusted Bayesian (BIC)         5716.810
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.029
##   90 Percent Confidence Interval          0.000  0.191
##   P-value RMSEA <= 0.05                          0.394
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.010
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Gc =~                                                                 
##     comp_06           6.420    0.758    8.466    0.000    6.420    0.661
##     voca_06           5.550    0.526   10.547    0.000    5.550    0.884
##   Gf =~                                                                 
##     bloc_06           4.576    0.517    8.851    0.000    4.576    0.695
##     obje_06          12.829    1.320    9.718    0.000   12.829    0.784
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Gc ~~                                                                 
##     Gf                0.641    0.074    8.642    0.000    0.641    0.641
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .comp_06          21.797    0.680   32.036    0.000   21.797    2.243
##    .voca_06          20.396    0.439   46.416    0.000   20.396    3.250
##    .bloc_06           6.942    0.461   15.063    0.000    6.942    1.055
##    .obje_06          25.258    1.145   22.059    0.000   25.258    1.544
##     Gc                0.000                               0.000    0.000
##     Gf                0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     Gc                1.000                               1.000    1.000
##     Gf                1.000                               1.000    1.000
##    .comp_06          53.220    7.931    6.710    0.000   53.220    0.564
##    .voca_06           8.592    4.511    1.905    0.057    8.592    0.218
##    .bloc_06          22.391    3.717    6.025    0.000   22.391    0.517
##    .obje_06         102.876   25.562    4.025    0.000  102.876    0.385
Model Evaluation

Some basic questions to ask after fitting a model -
Did it work?
Does it make sense? Are the parameters viable?

Lets look at a diagram of the model to check if it matches what we intended (this should also be done with the summary information.)

semPaths(fit2, what="std", 
         sizeLat = 7, sizeMan = 7, edge.label.cex = .75)

Then, we can look more closely at how well the model fit, and interpret the model more generally.

  • Note that there are 2 degrees of freedom in this model.
  • Note how much better the fit of the model is.
  • We can formalize the improvement in fit

We do model comparison…
Looking at fit statistics

#Compare model fit statistics
round(cbind(m1=inspect(fit0, 'fit.measures'), m2=inspect(fit1, 'fit.measures'),m3=inspect(fit2, 'fit.measures')),3)
##                            m1        m2        m3
## npar                    8.000    12.000    13.000
## fmin                    0.525     0.077     0.003
## chisq                 214.189    31.453     1.167
## df                      6.000     2.000     1.000
## pvalue                  0.000     0.000     0.280
## baseline.chisq        214.189   214.189   214.189
## baseline.df             6.000     6.000     6.000
## baseline.pvalue         0.000     0.000     0.000
## cfi                     0.000     0.859     0.999
## tli                     0.000     0.576     0.995
## nnfi                    0.000     0.576     0.995
## rfi                     0.000     0.559     0.967
## nfi                     0.000     0.853     0.995
## pnfi                    0.000     0.284     0.166
## ifi                     0.000     0.861     0.999
## rni                     0.000     0.859     0.999
## logl                -2950.942 -2859.574 -2844.431
## unrestricted.logl   -2843.848 -2843.848 -2843.848
## aic                  5917.885  5743.148  5714.862
## bic                  5944.430  5782.966  5757.998
## ntotal                204.000   204.000   204.000
## bic2                 5919.084  5744.946  5716.810
## rmsea                   0.412     0.269     0.029
## rmsea.ci.lower          0.366     0.191     0.000
## rmsea.ci.upper          0.461     0.355     0.191
## rmsea.pvalue            0.000     0.000     0.394
## rmr                    31.629     8.300     1.160
## rmr_nomean             31.629     8.300     1.160
## srmr                    0.292     0.064     0.010
## srmr_bentler            0.292     0.064     0.010
## srmr_bentler_nomean     0.345     0.076     0.012
## srmr_bollen             0.292     0.064     0.010
## srmr_bollen_nomean      0.345     0.076     0.012
## srmr_mplus              0.292     0.064     0.010
## srmr_mplus_nomean       0.345     0.076     0.012
## cn_05                  12.993    39.860   672.610
## cn_01                  17.012    60.738  1160.992
## gfi                     0.946     0.988     1.000
## agfi                    0.875     0.916     0.994
## pgfi                    0.406     0.141     0.071
## mfi                     0.600     0.930     1.000
## ecvi                       NA        NA        NA

Formal model test

#Chi-square difference test for nested models 
anova(fit1, fit2)
## Chi Square Difference Test
## 
##      Df    AIC  BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
## fit2  1 5714.9 5758  1.1668                                  
## fit1  2 5743.1 5783 31.4527     30.286       1  3.728e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis that the models fit the same.

We conclude that the 2-factor model is the optimal model, of those evaluated.

Thanks for playing!