Skip to contents

This vignette walks through ipca_est() on the Grunfeld (1958) investment panel: 11 US firms observed annually from 1935 to 1954. The dataset is small, balanced, and has only two characteristics, which makes it useful as a check that the R implementation matches the Python ipca package of Kelly, Pruitt, and Su (2019). The IPCA estimator was developed for asset returns, but the algorithm only needs a panel and a set of observable instruments, so the investment-on-fundamentals example serves as a transparent reference case.

The IPCA model

Instrumented Principal Components Analysis (IPCA) extracts latent factors from a panel by letting observable, asset-specific characteristics act as instruments for otherwise-unobservable conditional loadings. Without that link the loadings have to be either pre-specified or estimated as free parameters; IPCA instead ties them to data the researcher already has.

In the version used here (no intercept, no anomaly term) the model is

\[r_{i,t} = \mathbf{z}_{i,t}^\top \mathbf{\Gamma}_{\!\beta}\, \mathbf{f}_t + \varepsilon_{i,t},\]

where \(r_{i,t}\) is the outcome of unit \(i\) at time \(t\) (gross investment, in the Grunfeld example), \(\mathbf{z}_{i,t}\) is the \(L\)-vector of observable characteristics, \(\mathbf{\Gamma}_{\!\beta}\) is the \(L \times K\) matrix that maps characteristics into factor loadings, and \(\mathbf{f}_t\) is the \(K\)-vector of latent factors. The unit-specific, time-varying loading is therefore \(\beta_{i,t} = \mathbf{z}_{i,t}^\top \mathbf{\Gamma}_{\!\beta}\), i.e. a linear projection of characteristics onto the \(K\)-dimensional factor space.

Estimation alternates between two ordinary-least-squares steps: solve for \(\mathbf{f}_t\) given \(\mathbf{\Gamma}_{\!\beta}\), then update \(\mathbf{\Gamma}_{\!\beta}\) given \(\mathbf{f}_t\). After each pass the solution is rotated by an SVD so that \(\mathbf{\Gamma}_{\!\beta}^\top \mathbf{\Gamma}_{\!\beta} = \mathbf{I}_K\), which fixes the otherwise-arbitrary scaling and rotation of the factors.

Data preparation

The Grunfeld dataset ships with sdim. The dependent variable is gross investment (invest); the two characteristics are market value (value) and capital stock (capital).

library(sdim)

data(grunfeld)
str(grunfeld)
#> 'data.frame':    220 obs. of  5 variables:
#>  $ firm   : chr  "American Steel" "American Steel" "American Steel" "American Steel" ...
#>  $ year   : int  1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 ...
#>  $ invest : num  2.94 5.64 10.23 4.05 3.33 ...
#>  $ value  : num  30.3 43.9 107 68.3 84.2 ...
#>  $ capital: num  52 52.9 54.5 59.7 61.7 ...

ipca_est() expects a \(T \times N\) outcome matrix and a \(T \times N \times L\) characteristics array. The loop below reshapes the long panel into that wide form, keeping the year and firm labels on the margins so the output is easy to read back:

firms <- sort(unique(grunfeld$firm))
years <- sort(unique(grunfeld$year))
N  <- length(firms)
TT <- length(years)

ret <- matrix(NA_real_, TT, N, dimnames = list(years, firms))
Z   <- array(NA_real_, dim = c(TT, N, 2),
             dimnames = list(years, firms, c("value", "capital")))

for (i in seq_along(firms)) {

  idx <- grunfeld$firm == firms[i]
  ret[, i]  <- grunfeld$invest[idx]
  Z[, i, 1] <- grunfeld$value[idx]
  Z[, i, 2] <- grunfeld$capital[idx]

}

cat("ret:", nrow(ret), "x", ncol(ret), "\n")
#> ret: 20 x 11
cat("Z:  ", paste(dim(Z), collapse = " x "), "\n")
#> Z:   20 x 11 x 2

Fitting IPCA

We start with a single latent factor. With \(K = 1\) the loadings \(\beta_{i,t}\) are a single linear combination of value and capital, and \(f_t\) is one common time-series:

fit <- ipca_est(ret, Z, nfac = 1)
print(fit)
#> <sdim_fit [ipca]>
#>  Observations    : 20 
#>  Characteristics : 2 
#>  Factors         : 1 
#>  Factor mean     : zero
summary(fit)
#> Instrumented Principal Components Analysis (IPCA) 
#> ---------------------------------------- 
#> Call: ipca_est(ret = ret, Z = Z, nfac = 1)
#> 
#> Dimensions
#> ---------------------------------------- 
#>  Observations     20
#>  Characteristics  2
#>  Factors          1
#>  Factor mean      zero
#> 
#> Eigenvalues
#> ---------------------------------------- 
#>                     F1
#> Eigenvalue       0.019
#> Var. expl. (%) 100.000

The returned object stores the characteristic loadings (\(\mathbf{\Gamma}_{\!\beta}\), named lambda) and the estimated factor path:

# How each characteristic maps onto the factor
fit$lambda
#>           [,1]
#> [1,] 0.9916601
#> [2,] 0.1288805

# Factor realisations over time
data.frame(year = years, factor = fit$factors[, 1])
#>    year     factor
#> 1  1935 0.10319684
#> 2  1936 0.08844895
#> 3  1937 0.08384966
#> 4  1938 0.08450699
#> 5  1939 0.07225234
#> 6  1940 0.09950682
#> 7  1941 0.12288401
#> 8  1942 0.14226238
#> 9  1943 0.11975320
#> 10 1944 0.11797240
#> 11 1945 0.10875619
#> 12 1946 0.13575212
#> 13 1947 0.15793483
#> 14 1948 0.16605454
#> 15 1949 0.14849233
#> 16 1950 0.15866343
#> 17 1951 0.15960074
#> 18 1952 0.17593792
#> 19 1953 0.19216956
#> 20 1954 0.21110659

Validation against the Python ipca package

The Python ipca package uses the same Grunfeld panel as its built-in example and runs the identical alternating-least-squares algorithm. With one factor and no intercept it reports the following loadings and factor path, which we hard-code below as a reference:

py_gamma   <- c(0.99166014, 0.12888046)
py_factors <- c(
  0.1031968381, 0.0884489515, 0.0838496628, 0.0845069923, 0.0722523449,
  0.0995068155, 0.1228840058, 0.1422623752, 0.1197532025, 0.1179724004,
  0.1087561863, 0.1357521189, 0.1579348267, 0.1660545375, 0.1484923276,
  0.1586634303, 0.1596007400, 0.1759379247, 0.1921695585, 0.2111065868
)

Factors are identified only up to sign, so before comparing we flip the R output if the loading vectors are negatively correlated:

r_gamma <- as.numeric(fit$lambda)
r_factors <- as.numeric(fit$factors)

if (cor(r_gamma, py_gamma) < 0) {

  r_gamma   <- -r_gamma
  r_factors <- -r_factors

}

cat("Gamma max |diff|:  ", sprintf("%.2e", max(abs(r_gamma - py_gamma))), "\n")
#> Gamma max |diff|:   3.58e-09
cat("Factor max |diff|: ", sprintf("%.2e", max(abs(r_factors - py_factors))), "\n")
#> Factor max |diff|:  4.99e-11
cat("Factor correlation:", sprintf("%.10f", cor(r_factors, py_factors)), "\n")
#> Factor correlation: 1.0000000000

The maximum absolute differences are at numerical-tolerance levels and the factor correlation is one to ten decimals.

Multiple factors

The same call extracts more factors by increasing nfac. With \(K = 2\) each characteristic effectively contributes its own factor dimension (since there are only two instruments):

fit2 <- ipca_est(ret, Z, nfac = 2)
summary(fit2)
#> Instrumented Principal Components Analysis (IPCA) 
#> ---------------------------------------- 
#> Call: ipca_est(ret = ret, Z = Z, nfac = 2)
#> 
#> Dimensions
#> ---------------------------------------- 
#>  Observations     20
#>  Characteristics  2
#>  Factors          2
#>  Factor mean      zero
#> 
#> Eigenvalues
#> ---------------------------------------- 
#>                     F1      F2
#> Eigenvalue      0.0073  0.0197
#> Var. expl. (%) 27.0200 72.9800

References

Grunfeld, Y. (1958). The Determinants of Corporate Investment. Ph.D. thesis, Department of Economics, University of Chicago.

Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are Covariances: A Unified Model of Risk and Return. Journal of Financial Economics, 134(3), 501–524.