
In this vignette, we derive analytic formulas for the reliability of factor scores and evaluate the resulting estimators using simulated data.

library(lavaan)
## This is lavaan 0.6-18
## lavaan is FREE software! Please report any bugs.
library(R2spa)
library(tidyr)
library(ggplot2)

Analytic Formula for Factor Scores

Theoretical Values

i.e., using true parameter values to compute the weights

From a unidimensional factor model $\mathbf{y} = \boldsymbol{\nu} + \boldsymbol{\lambda}\eta + \mathbf{e}$, factor scores take the form

$$\begin{aligned} \tilde\eta &= \mathbf{a}'(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{a}'(\boldsymbol{\nu} + \boldsymbol{\lambda}\eta + \mathbf{e} - \boldsymbol{\nu}) \\ &= \mathbf{a}'\boldsymbol{\lambda}\eta + \mathbf{a}'\mathbf{e}, \end{aligned}$$

where we set $\mathrm{E}(\eta) = 0$ so that $\hat{\boldsymbol{\mu}} = \boldsymbol{\nu}$. Assuming no sampling error in the weights and $\mathrm{Var}(\eta) = 1$, the theoretical reliability of the factor scores is

$$\rho = \frac{(\mathbf{a}'\boldsymbol{\lambda})^2}{(\mathbf{a}'\boldsymbol{\lambda})^2 + \mathbf{a}'\boldsymbol{\Theta}\mathbf{a}}.$$
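
As a quick numeric illustration, here is a minimal R sketch of this formula; the loadings, error variances, and weights below are made-up values for illustration, not estimates from any data:

# Illustrative true parameters (made-up values)
lambda <- c(.7, .6, .5)      # factor loadings
theta <- diag(1 - lambda^2)  # error variance-covariance matrix
a <- c(.3, .25, .2)          # an arbitrary vector of scoring weights
# rho = (a'lambda)^2 / [(a'lambda)^2 + a' Theta a]
al <- c(crossprod(a, lambda))
al^2 / (al^2 + c(t(a) %*% theta %*% a))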

Regression factor scores

With $\mathbf{a} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda}$,

$$\begin{aligned} \text{Rel} &= \frac{(\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda})^2}{\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda}\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda} + \boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\Theta}\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda}} \\ &= \frac{(\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda})^2}{\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\lambda}\boldsymbol{\lambda}' + \boldsymbol{\Theta})\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda}} \\ &= \frac{(\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda})^2}{\boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda}} \\ &= \boldsymbol{\lambda}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\lambda} = \mathbf{a}'\boldsymbol{\lambda}. \end{aligned}$$
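
As a sanity check on the algebra, the sketch below (reusing the illustrative parameters above) verifies that the general reliability formula and the shortcut $\mathbf{a}'\boldsymbol{\lambda}$ agree for the regression weights:

# Regression weights: a = Sigma^{-1} lambda, with Sigma = lambda lambda' + Theta
sigma <- tcrossprod(lambda) + theta
a_reg <- solve(sigma, lambda)
al <- c(crossprod(a_reg, lambda))
# The two expressions should match
c(general = al^2 / (al^2 + c(t(a_reg) %*% theta %*% a_reg)),
  shortcut = al)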

Bartlett factor scores

With $\mathbf{a} = \boldsymbol{\Theta}^{-1}\boldsymbol{\lambda}(\boldsymbol{\lambda}'\boldsymbol{\Theta}^{-1}\boldsymbol{\lambda})^{-1}$, and thus $\mathbf{a}'\boldsymbol{\lambda} = 1$,

$$\begin{aligned} \text{Rel} &= \frac{1}{1 + (\boldsymbol{\lambda}'\boldsymbol{\Theta}^{-1}\boldsymbol{\lambda})^{-1}\boldsymbol{\lambda}'\boldsymbol{\Theta}^{-1}\boldsymbol{\Theta}\boldsymbol{\Theta}^{-1}\boldsymbol{\lambda}(\boldsymbol{\lambda}'\boldsymbol{\Theta}^{-1}\boldsymbol{\lambda})^{-1}} \\ &= \frac{1}{1 + (\boldsymbol{\lambda}'\boldsymbol{\Theta}^{-1}\boldsymbol{\lambda})^{-1}}. \end{aligned}$$
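
And the analogous check for the Bartlett weights (again with the illustrative parameters):

# Bartlett weights: a = Theta^{-1} lambda (lambda' Theta^{-1} lambda)^{-1}
thinv_lam <- solve(theta, lambda)
a_bart <- thinv_lam / c(crossprod(lambda, thinv_lam))
c(crossprod(a_bart, lambda))  # equals 1 by construction
al <- c(crossprod(a_bart, lambda))
# General formula and the shortcut 1 / (1 + (lambda' Theta^{-1} lambda)^{-1})
c(general = al^2 / (al^2 + c(t(a_bart) %*% theta %*% a_bart)),
  shortcut = 1 / (1 + 1 / c(crossprod(lambda, thinv_lam))))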

Sample Estimators

When factor scores are computed using sample estimates of $\boldsymbol{\lambda}$ and $\boldsymbol{\Theta}$, we have $\hat{\boldsymbol{\mu}} = \hat{\boldsymbol{\nu}}$, and

$$\begin{aligned} \tilde\eta &= \hat{\mathbf{a}}'(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \hat{\mathbf{a}}'(\boldsymbol{\nu} + \boldsymbol{\lambda}\eta + \mathbf{e} - \hat{\boldsymbol{\nu}}) \\ &= \hat{\mathbf{a}}'\boldsymbol{\lambda}\eta + \hat{\mathbf{a}}'(\boldsymbol{\nu} - \hat{\boldsymbol{\nu}}) + \hat{\mathbf{a}}'\mathbf{e}, \end{aligned}$$

where

  • $\hat{\mathbf{a}}$ = vector of estimated weights of length $p$,
  • $\mathbf{a}$ = vector of true weights of length $p$,
  • $\mathbf{y}$ = vector of observed scores of length $p$,
  • $\boldsymbol{\lambda}$ = vector of factor loadings of length $p$,
  • $\eta$ = latent variable,
  • $\mathbf{e}$ = vector of measurement errors of length $p$,
  • $\boldsymbol{\Theta}$ = $\mathrm{Var}(\mathbf{e})$, the $p \times p$ variance-covariance matrix of the errors,
  • $p$ = number of items.

If we assume that $\hat{\mathbf{a}}$ and $\hat{\boldsymbol{\nu}}$ are independent, with $\mathrm{E}(\hat{\mathbf{a}}) = \mathbf{a}$, $\mathrm{Var}(\hat{\mathbf{a}}) = V_{\hat a}$, $\mathrm{E}(\hat{\boldsymbol{\nu}}) = \boldsymbol{\nu}$, and $\mathrm{Var}(\hat{\boldsymbol{\nu}}) = V_{\hat\nu}$, then by the law of total variance,

$$\begin{aligned} \mathrm{Var}(\tilde\eta) &= \mathrm{E}[\mathrm{Var}(\tilde\eta \mid \hat{\mathbf{a}})] + \mathrm{Var}[\mathrm{E}(\tilde\eta \mid \hat{\mathbf{a}})] \\ &= \mathrm{E}[\hat{\mathbf{a}}'\boldsymbol{\lambda}\boldsymbol{\lambda}'\hat{\mathbf{a}} + \hat{\mathbf{a}}' V_{\hat\nu} \hat{\mathbf{a}} + \hat{\mathbf{a}}'\boldsymbol{\Theta}\hat{\mathbf{a}}] + \mathrm{Var}[\hat{\mathbf{a}}'\mathbf{0}] \\ &= \mathbf{a}'\boldsymbol{\lambda}\boldsymbol{\lambda}'\mathbf{a} + \mathrm{Tr}(V_{\hat a}\boldsymbol{\lambda}\boldsymbol{\lambda}') + \mathbf{a}' V_{\hat\nu}\mathbf{a} + \mathrm{Tr}(V_{\hat a} V_{\hat\nu}) + \mathbf{a}'\boldsymbol{\Theta}\mathbf{a} + \mathrm{Tr}(V_{\hat a}\boldsymbol{\Theta}). \end{aligned}$$
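
The trace terms in the last line come from the identity $\mathrm{E}[\hat{\mathbf{a}}' M \hat{\mathbf{a}}] = \mathbf{a}' M \mathbf{a} + \mathrm{Tr}(V_{\hat a} M)$ for any fixed matrix $M$. A quick Monte Carlo check of that identity, with made-up values for $\mathbf{a}$ and $V_{\hat a}$:

set.seed(123)
a <- c(.3, .25, .2)
v_a <- diag(c(.010, .020, .015))  # hypothetical Var(a-hat)
m <- tcrossprod(c(.7, .6, .5))    # a fixed matrix, here lambda lambda'
a_draws <- MASS::mvrnorm(1e5, mu = a, Sigma = v_a)
mean(rowSums((a_draws %*% m) * a_draws))    # simulated E[a-hat' M a-hat]
c(t(a) %*% m %*% a) + sum(diag(v_a %*% m))  # analytic value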

So, accounting for sampling error,

$$\rho = \frac{(\mathbf{a}'\boldsymbol{\lambda})^2 + \boldsymbol{\lambda}' V_{\hat a}\boldsymbol{\lambda}}{(\mathbf{a}'\boldsymbol{\lambda})^2 + \mathbf{a}'\boldsymbol{\Theta}\mathbf{a} + \boldsymbol{\lambda}' V_{\hat a}\boldsymbol{\lambda} + \mathbf{a}' V_{\hat\nu}\mathbf{a} + \mathrm{Tr}(V_{\hat a} V_{\hat\nu}) + \mathrm{Tr}(\boldsymbol{\Theta} V_{\hat a})}.$$

The sample estimates can be obtained by substituting in the sample estimates of $\boldsymbol{\lambda}$ and $\boldsymbol{\Theta}$ and using the corresponding form of $\mathbf{a}$.
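
In code, a direct transcription of this corrected formula might look like the sketch below. The helper name `rel_corrected` is ours; `v_a` and `v_nu` are the sampling variance-covariance matrices of $\hat{\mathbf{a}}$ and $\hat{\boldsymbol{\nu}}$, with `v_nu` defaulting to zero for models without a mean structure:

rel_corrected <- function(a, lambda, theta, v_a,
                          v_nu = matrix(0, length(a), length(a))) {
  al2 <- c(crossprod(a, lambda))^2
  # Numerator: (a'lambda)^2 + lambda' V_a lambda
  num <- al2 + c(t(lambda) %*% v_a %*% lambda)
  # Denominator adds a'Theta a, a' V_nu a, Tr(V_a V_nu), and Tr(Theta V_a)
  num / (num + c(t(a) %*% theta %*% a) + c(t(a) %*% v_nu %*% a) +
           sum(diag(v_a %*% v_nu)) + sum(diag(theta %*% v_a)))
}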

Quadratic formula for regression factor scores

For regression factor scores, one can show that the error variance of the factor scores equals $\text{Rel}(1 - \text{Rel})\,\psi$, where $\psi = \mathrm{Var}(\eta)$, so the reliability solves the quadratic equation $\text{Rel}^2 - \text{Rel} + \theta_{\text{corrected}}/\psi = 0$, giving

$$\text{Rel} = \frac{1 \pm \sqrt{1 - 4\theta_{\text{corrected}}/\psi}}{2},$$

where $\theta_{\text{corrected}}$ is the corrected error term for the factor scores.
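
A small helper for this version (again, the name is ours; following the simulation code later in this vignette, it takes the larger root, which is the relevant one when reliability is above .5):

rel_quadratic <- function(theta_corrected, psi = 1) {
  # Larger root of Rel^2 - Rel + theta_corrected / psi = 0
  (1 + sqrt(1 - 4 * theta_corrected / psi)) / 2
}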

Empirical Example

Regression Factor Scores

# Three items from the classic Holzinger and Swineford (1939) data
cfa_fit <- cfa("
  visual =~ x1 + x2 + x3
",
  data = HolzingerSwineford1939,
  std.lv = TRUE
)
# Uncorrected reliability of regression factor scores
fsr <- lavPredict(cfa_fit, fsm = TRUE, acov = TRUE)
attr(fsr, "fsm")[[1]] %*% lavInspect(cfa_fit, what = "est")$lambda
##           visual
## visual 0.6598382
# Empirical reliability of regression factor scores
var(fsr) / (var(fsr) + attr(fsr, "acov")[[1]])
##           visual
## visual 0.6605848
# Corrected reliability of regression factor scores
get_a <- function(lambda, theta) {
  solve(tcrossprod(lambda) + theta, lambda)
}
lam <- lavInspect(cfa_fit, what = "est")$lambda
th <- lavInspect(cfa_fit, what = "est")$theta
jac_a <- lavaan::lav_func_jacobian_complex(
  \(x) get_a(x[1:3], diag(x[4:6])),
  x = c(lam, diag(th))
)
# Estimated sampling variance of weights
va <- jac_a %*% vcov(cfa_fit)[1:6, 1:6] %*% t(jac_a)
aa <- tcrossprod(get_a(lam, th)) + va
sum(diag(tcrossprod(lam) %*% aa)) /
  sum(diag(cfa_fit@implied$cov[[1]] %*% aa))
## [1] 0.64774
# Or using `get_fs_lavaan`
get_fs_lavaan(cfa_fit, reliability = TRUE) |>
  attr(which = "reliability")
## [1] 0.64774

Bartlett Factor Scores

# Uncorrected reliability of Bartlett factor scores
fsb <- lavPredict(cfa_fit, method = "Bartlett", fsm = TRUE, acov = TRUE)
1 / (1 + attr(fsb, "acov")[[1]])
##           visual
## visual 0.6598382
# Empirical reliability of Bartlett factor scores
(var(fsb) - attr(fsb, "acov")[[1]]) / var(fsb)
##           visual
## visual 0.6609684
# Corrected reliability of Bartlett factor scores
get_a <- function(lambda, theta) {
  thinv_lam <- solve(theta, lambda)
  solve(crossprod(lambda, thinv_lam), t(thinv_lam))
}
lam <- lavInspect(cfa_fit, what = "est")$lambda
th <- lavInspect(cfa_fit, what = "est")$theta
jac_a <- lavaan::lav_func_jacobian_complex(
  \(x) get_a(x[1:3], diag(x[4:6])),
  x = c(lam, diag(th))
)
# Estimated sampling variance of weights
va <- jac_a %*% vcov(cfa_fit)[1:6, 1:6] %*% t(jac_a)
aa <- tcrossprod(t(get_a(lam, th))) + va
sum(diag(tcrossprod(lam) %*% aa)) /
  sum(diag(cfa_fit@implied$cov[[1]] %*% aa))
## [1] 0.6477805
# Or using `get_fs_lavaan`
R2spa::get_fs_lavaan(cfa_fit, method = "Bartlett", reliability = TRUE) |>
  attr(which = "reliability")
## [1] 0.6477805

Multiple-group example

mod <- "
  visual =~ x1 + x2 + x3
"
get_fs(HolzingerSwineford1939, model = mod, group = "school",
       reliability = TRUE, std.lv = TRUE, method = "regression") |>
  attr(which = "reliability")
## $Pasteur
## [1] 0.6473336
## 
## $`Grant-White`
## [1] 0.6789831
## 
## $overall
## [1] 0.66258
get_fs(HolzingerSwineford1939, model = mod, group = "school",
       reliability = TRUE, std.lv = TRUE, method = "Bartlett") |>
  attr(which = "reliability")
## $Pasteur
## [1] 0.6474885
## 
## $`Grant-White`
## [1] 0.6790979
## 
## $overall
## [1] 0.6627156

Simulations

set.seed(2356)
lambda <- seq(.3, .9, length.out = 6)
theta <- 1 - lambda^2
num_obs <- 100

num_sims <- 2000
out <- matrix(ncol = 7, nrow = num_sims)

# Copy function for computing a
get_a <- function(lambda, theta) {
  solve(tcrossprod(lambda) + theta, lambda)
}

# Function for getting all versions
all_rel <- function(lam, th, vc) {
  sigma <- tcrossprod(lam) + th
  ahat <- crossprod(lam, solve(sigma))
  # Formula 1: no adjustment
  rel1 <- ahat %*% lam

  jac_a <- lavaan::lav_func_jacobian_complex(
    function(x, p) {
      get_a(x[seq_len(p)], theta = diag(x[-(seq_len(p))]))
    },
    x = c(lam, diag(th)),
    p = length(lam)
  )
  va <- jac_a %*% vc %*% t(jac_a)
  aa <- crossprod(ahat) + va
  # Formula 2: adjust for both error in weights and true variances
  rel2 <- sum(diag(tcrossprod(lam) %*% aa)) / sum(diag(sigma %*% aa))

  ev_fs <- sum(diag(th %*% aa))
  # Formula 3: solve quadratic equation with adjusted error
  rel3 <- (1 + sqrt(1 - 4 * ev_fs)) / 2

  c(rel1, rel2, rel3)
}

# Function for getting all bias-corrected versions
all_bc_rel <- function(fit, nsim = 500) {
  vc <- vcov(fit)
  mc_sim <- MASS::mvrnorm(nsim, mu = coef(fit), Sigma = vc)
  mc_rel <- apply(mc_sim, MARGIN = 1,
                  FUN = function(x) {
                    all_rel(x[1:6], th = diag(x[7:12]), vc = vc)
                  })
  2 * with(
    lavInspect(fit, what = "est"),
    all_rel(lambda, th = theta, vc = vc)
  ) - rowMeans(mc_rel) # bias-corrected
}

# Progress bar created once, outside the loop
pb <- txtProgressBar(min = 0, max = num_sims, style = 3, width = 50, char = "=")
for (i in seq_len(num_sims)) {
  eta <- rnorm(num_obs)
  err <- matrix(
    rnorm(num_obs * length(lambda), sd = sqrt(theta)),
    ncol = num_obs
  )
  y <- t(
    tcrossprod(lambda, eta) + err
  )
  # Run cfa
  fit <- cfa("f =~ y1 + y2 + y3 + y4 + y5 + y6",
    data = data.frame(y) |> setNames(paste0("y", seq_along(lambda))),
    std.lv = TRUE
  )
  pars_fit <- lavInspect(fit, what = "est")
  tilde_eta <- lavPredict(fit, fsm = TRUE)
  true_rel <- cor(tilde_eta, eta)^2
  est_a <- attr(tilde_eta, "fsm")[[1]]
  out[i, ] <- c(
    true_rel,
    all_rel(pars_fit$lambda, pars_fit$theta, vcov(fit)),
    all_bc_rel(fit)
  )
  setTxtProgressBar(pb, i)
}
close(pb)
colnames(out) <- c("true_rel", "no_adj_rel", "adj_rel", "quadratic",
                   "no_adj_rel_bc", "adj_rel_bc", "quadratic_bc")

# Save the results
saveRDS(out, "sim_results_reliability.RDS")
# Mean bias
out <- readRDS("sim_results_reliability.RDS")
data.frame(
  Type = c("no_adj_rel", "adj_rel", "quadratic",
           "no_adj_rel_bc", "adj_rel_bc", "quadratic_bc"),
  Raw_Biases = c(mean(out[, "no_adj_rel"] - out[, "true_rel"]), 
                 mean(out[, "adj_rel"] - out[, "true_rel"]), 
                 mean(out[, "quadratic"] - out[, "true_rel"]), 
                 mean(out[, "no_adj_rel_bc"] - out[, "true_rel"]), 
                 mean(out[, "adj_rel_bc"] - out[, "true_rel"]), 
                 mean(out[, "quadratic_bc"] - out[, "true_rel"], na.rm = TRUE))
) |>
  knitr::kable(digits = 3)
Type            Raw_Biases
no_adj_rel           0.019
adj_rel              0.011
quadratic            0.008
no_adj_rel_bc        0.012
adj_rel_bc           0.005
quadratic_bc         0.001
# RMSE: 
data.frame(
  Type = c("no_adj_rel", "adj_rel", "quadratic",
           "no_adj_rel_bc", "adj_rel_bc", "quadratic_bc"),
  RMSE = c(sqrt(mean((out[, "no_adj_rel"] - out[, "true_rel"])^2)),
           sqrt(mean((out[, "adj_rel"] - out[, "true_rel"])^2)),
           sqrt(mean((out[, "quadratic"] - out[, "true_rel"])^2)),
           sqrt(mean((out[, "no_adj_rel_bc"] - out[, "true_rel"])^2)),
           sqrt(mean((out[, "adj_rel_bc"] - out[, "true_rel"])^2)),
           sqrt(mean((out[, "quadratic_bc"] - out[, "true_rel"])^2, 
                     na.rm = TRUE)))
) |>
  knitr::kable(digits = 3)
Type                 RMSE
no_adj_rel          0.046
adj_rel             0.042
quadratic           0.042
no_adj_rel_bc       0.042
adj_rel_bc          0.040
quadratic_bc        0.040
# Boxplot to compare the bias across formula
bias_tab <- as.data.frame(out[, -1] - out[, 1]) |>
  pivot_longer(everything(), names_to = "formula", values_to = "bias")
bias_tab |>
  ggplot(aes(x = formula, y = bias, col = formula)) +
  geom_boxplot() +
  geom_hline(yintercept = 0) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).