This vignette walks through ipca_est() on the Grunfeld
(1958) investment panel: 11 US firms observed annually from 1935 to
1954. The dataset is small, balanced, and has only two characteristics,
which makes it useful as a check that the R implementation matches the
Python ipca package of Kelly, Pruitt, and Su (2019). The
IPCA estimator was developed for asset returns, but the algorithm only
needs a panel and a set of observable instruments, so the
investment-on-fundamentals example serves as a transparent reference
case.
The IPCA model
Instrumented Principal Components Analysis (IPCA) extracts latent factors from a panel by letting observable, asset-specific characteristics act as instruments for otherwise-unobservable conditional loadings. Without that link the loadings have to be either pre-specified or estimated as free parameters; IPCA instead ties them to data the researcher already has.
In the version used here (no intercept, no anomaly term) the model is
\[r_{i,t} = \mathbf{z}_{i,t}^\top \mathbf{\Gamma}_{\!\beta}\, \mathbf{f}_t + \varepsilon_{i,t},\]
where \(r_{i,t}\) is the outcome of unit \(i\) at time \(t\) (gross investment, in the Grunfeld example), \(\mathbf{z}_{i,t}\) is the \(L\)-vector of observable characteristics, \(\mathbf{\Gamma}_{\!\beta}\) is the \(L \times K\) matrix that maps characteristics into factor loadings, and \(\mathbf{f}_t\) is the \(K\)-vector of latent factors. The unit-specific, time-varying loading is therefore \(\beta_{i,t} = \mathbf{z}_{i,t}^\top \mathbf{\Gamma}_{\!\beta}\), i.e. a linear projection of characteristics onto the \(K\)-dimensional factor space.
Estimation alternates between two ordinary-least-squares steps: solve for \(\mathbf{f}_t\) given \(\mathbf{\Gamma}_{\!\beta}\), then update \(\mathbf{\Gamma}_{\!\beta}\) given \(\mathbf{f}_t\). After each pass the solution is rotated by an SVD so that \(\mathbf{\Gamma}_{\!\beta}^\top \mathbf{\Gamma}_{\!\beta} = \mathbf{I}_K\), which fixes the otherwise-arbitrary scaling and rotation of the factors.
Data preparation
The Grunfeld dataset ships with sdim. The dependent
variable is gross investment (invest); the two
characteristics are market value (value) and capital stock
(capital).
library(sdim)
data(grunfeld)
str(grunfeld)
#> 'data.frame': 220 obs. of 5 variables:
#> $ firm : chr "American Steel" "American Steel" "American Steel" "American Steel" ...
#> $ year : int 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 ...
#> $ invest : num 2.94 5.64 10.23 4.05 3.33 ...
#> $ value : num 30.3 43.9 107 68.3 84.2 ...
#> $ capital: num 52 52.9 54.5 59.7 61.7 ...ipca_est() expects a \(T
\times N\) outcome matrix and a \(T
\times N
\times L\) characteristics array. The loop below reshapes the
long panel into that wide form, keeping the year and firm labels on the
margins so the output is easy to read back:
firms <- sort(unique(grunfeld$firm))
years <- sort(unique(grunfeld$year))
N <- length(firms)
TT <- length(years)
ret <- matrix(NA_real_, TT, N, dimnames = list(years, firms))
Z <- array(NA_real_, dim = c(TT, N, 2),
dimnames = list(years, firms, c("value", "capital")))
for (i in seq_along(firms)) {
idx <- grunfeld$firm == firms[i]
ret[, i] <- grunfeld$invest[idx]
Z[, i, 1] <- grunfeld$value[idx]
Z[, i, 2] <- grunfeld$capital[idx]
}
cat("ret:", nrow(ret), "x", ncol(ret), "\n")
#> ret: 20 x 11
cat("Z: ", paste(dim(Z), collapse = " x "), "\n")
#> Z: 20 x 11 x 2Fitting IPCA
We start with a single latent factor. With \(K = 1\) the loadings \(\beta_{i,t}\) are a single linear
combination of value and capital, and \(f_t\) is one common time-series:
fit <- ipca_est(ret, Z, nfac = 1)
print(fit)
#> <sdim_fit [ipca]>
#> Observations : 20
#> Characteristics : 2
#> Factors : 1
#> Factor mean : zero
summary(fit)
#> Instrumented Principal Components Analysis (IPCA)
#> ----------------------------------------
#> Call: ipca_est(ret = ret, Z = Z, nfac = 1)
#>
#> Dimensions
#> ----------------------------------------
#> Observations 20
#> Characteristics 2
#> Factors 1
#> Factor mean zero
#>
#> Eigenvalues
#> ----------------------------------------
#> F1
#> Eigenvalue 0.019
#> Var. expl. (%) 100.000The returned object stores the characteristic loadings (\(\mathbf{\Gamma}_{\!\beta}\), named
lambda) and the estimated factor path:
# How each characteristic maps onto the factor
fit$lambda
#> [,1]
#> [1,] 0.9916601
#> [2,] 0.1288805
# Factor realisations over time
data.frame(year = years, factor = fit$factors[, 1])
#> year factor
#> 1 1935 0.10319684
#> 2 1936 0.08844895
#> 3 1937 0.08384966
#> 4 1938 0.08450699
#> 5 1939 0.07225234
#> 6 1940 0.09950682
#> 7 1941 0.12288401
#> 8 1942 0.14226238
#> 9 1943 0.11975320
#> 10 1944 0.11797240
#> 11 1945 0.10875619
#> 12 1946 0.13575212
#> 13 1947 0.15793483
#> 14 1948 0.16605454
#> 15 1949 0.14849233
#> 16 1950 0.15866343
#> 17 1951 0.15960074
#> 18 1952 0.17593792
#> 19 1953 0.19216956
#> 20 1954 0.21110659Validation against the Python ipca package
The Python ipca package uses the same Grunfeld panel as
its built-in example and runs the identical alternating-least-squares
algorithm. With one factor and no intercept it reports the following
loadings and factor path, which we hard-code below as a reference:
py_gamma <- c(0.99166014, 0.12888046)
py_factors <- c(
0.1031968381, 0.0884489515, 0.0838496628, 0.0845069923, 0.0722523449,
0.0995068155, 0.1228840058, 0.1422623752, 0.1197532025, 0.1179724004,
0.1087561863, 0.1357521189, 0.1579348267, 0.1660545375, 0.1484923276,
0.1586634303, 0.1596007400, 0.1759379247, 0.1921695585, 0.2111065868
)Factors are identified only up to sign, so before comparing we flip the R output if the loading vectors are negatively correlated:
r_gamma <- as.numeric(fit$lambda)
r_factors <- as.numeric(fit$factors)
if (cor(r_gamma, py_gamma) < 0) {
r_gamma <- -r_gamma
r_factors <- -r_factors
}
cat("Gamma max |diff|: ", sprintf("%.2e", max(abs(r_gamma - py_gamma))), "\n")
#> Gamma max |diff|: 3.58e-09
cat("Factor max |diff|: ", sprintf("%.2e", max(abs(r_factors - py_factors))), "\n")
#> Factor max |diff|: 4.99e-11
cat("Factor correlation:", sprintf("%.10f", cor(r_factors, py_factors)), "\n")
#> Factor correlation: 1.0000000000The maximum absolute differences are at numerical-tolerance levels and the factor correlation is one to ten decimals.
Multiple factors
The same call extracts more factors by increasing nfac.
With \(K = 2\) each characteristic
effectively contributes its own factor dimension (since there are only
two instruments):
fit2 <- ipca_est(ret, Z, nfac = 2)
summary(fit2)
#> Instrumented Principal Components Analysis (IPCA)
#> ----------------------------------------
#> Call: ipca_est(ret = ret, Z = Z, nfac = 2)
#>
#> Dimensions
#> ----------------------------------------
#> Observations 20
#> Characteristics 2
#> Factors 2
#> Factor mean zero
#>
#> Eigenvalues
#> ----------------------------------------
#> F1 F2
#> Eigenvalue 0.0073 0.0197
#> Var. expl. (%) 27.0200 72.9800