Multi-collinearity amongst Smart Betas

August 2020

Yakun (Diana) Deng, Oleg Kolesnikov & Milind Sharma [1]

The issue of multi-collinearity often arises in the analysis of financial time series data. Given that our Enhanced Smart Betas (ESBs) are not pure statistical constructs (orthogonalized by construction) but are largely based on intuitive, industry-standard clusters, it is worth checking whether multi-collinearity may become an issue. A high degree of linear dependence amongst the explanatory variables in a multiple regression may not affect the overall predictive efficacy of the model, but it can make the estimated coefficients unreliable, as they can change erratically in response to small changes in the inputs.
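To make the point about unreliable coefficients concrete, the following is a minimal simulation sketch rather than anything drawn from the ESB data. It assumes two synthetic predictors with a chosen correlation `rho`, true slopes of 1, and 240 observations (to mimic 20 years of monthly data); the function names and all parameters are hypothetical. It measures how much the estimated slope varies from one simulated sample to the next.

```python
import numpy as np

def ols_slopes(X, y):
    """Least-squares slopes of y on the columns of X (an intercept is fitted, then dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

def slope_spread(rho, n_obs=240, n_trials=200, seed=0):
    """Std. dev. of the first estimated slope across simulated samples
    whose two predictors have population correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    slopes = []
    for _ in range(n_trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n_obs)
        # Synthetic response: both true slopes equal 1, plus unit-variance noise.
        y = X[:, 0] + X[:, 1] + rng.normal(0.0, 1.0, size=n_obs)
        slopes.append(ols_slopes(X, y)[0])
    return float(np.std(slopes))

low = slope_spread(rho=0.0)    # uncorrelated predictors
high = slope_spread(rho=0.95)  # nearly collinear predictors
print(f"slope std at rho=0.00: {low:.3f}")
print(f"slope std at rho=0.95: {high:.3f}")
```

Under these assumptions the slope estimate is noticeably less stable when the predictors are nearly collinear, even though the model being fitted is the same.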

Enhanced Smart Betas (ESBs)

QuantZ provides ML-enhanced Smart Betas, allowing one to express almost any linear view on equities, as well as plug-and-play Composite Signals constructed from such ESBs.

For most multi-factor models in the financial industry, companies in the estimation universe are ranked on each of the factors individually and then in conjunction via some function of these factors. A factor is a security attribute which has been identified as a potential driver of return/risk. Factor investing is therefore an investment strategy in which securities are chosen based on certain characteristics, with the goal of increasing exposure to those factors or improving risk-adjusted long-term return.

While factor ranks are developed based on a single definition, our Enhanced Smart Betas combine a range of factors within each cohort with the objective of outperforming the naïve benchmark factor. Our researchers have drawn upon their decades of collective experience to identify, clean and test 18 factor cohorts, from which we construct the 18 Enhanced Smart Betas. These building blocks can be deployed directly and cost-effectively towards the creation of quant equity strategies, for which our Composite Signals (based on such ESBs) are good proxies. Given that the set of N choose k combinations in this case is quite large, it is particularly instructive to focus on the curated composites we have created based on the factor ESBs.

BEST-FLAVOR-OF-THE-MONTH (BFOM) investable strategies

We have a suite of 18 Best-Flavor-of-the-Month (BFOM) investable strategies based on our 18 ESBs, each of which systematically switches between the 5 ESB models underlying the ensemble learner. While our ESB heatmaps only display the best of the 5 models YTD/LTD for each ESB (for the sake of brevity), we do provide all 5 models, both as full history and in live mode, so you can pick any flavor. These 5 alternate models for the same Smart Beta can be viewed as an “ensemble” of learners on an expanding window (to prevent look-ahead bias). Furthermore, these models re-optimize monthly (using only data through the T-1 month-end) to solve for the optimal factor weights, which are used in calculating performance from T to T+1. To prevent look-ahead bias when representing a single 20-year time series for a given ESB such as DV (Deep Value), we must ensure that the time series represents the performance of a BFOM trading strategy which picks the “Best Flavor of the Month” based on cumulative LTD return (through T-1, ex-ante), since we do not know a priori whether that flavor will in fact be the best performer out-of-sample in the coming month. Regardless, clients have access to all five flavors as well as the BFOM ranks, and hence can choose as they deem appropriate.
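The ex-ante selection rule described above can be sketched as follows. This is only an illustration of the "pick the flavor with the best cumulative return through T-1, then earn its return in month T" logic; it does not cover the monthly re-optimization of factor weights. The function name `bfom_returns`, the flavor labels and the toy return series are all hypothetical.

```python
def bfom_returns(flavor_returns):
    """Walk forward month by month, picking the flavor with the highest
    cumulative return through the previous month.

    flavor_returns: dict mapping flavor name -> list of monthly returns
                    (all lists the same length).
    Returns a list of (chosen_flavor, realised_return) pairs, starting from
    the second month (the first month serves only as history).
    """
    names = sorted(flavor_returns)
    n_months = len(flavor_returns[names[0]])
    cumulative = {name: 1.0 for name in names}  # growth of 1 unit, life to date
    picks = []
    for t in range(n_months):
        if t > 0:
            # The choice uses only information available through month t-1.
            best = max(names, key=lambda name: cumulative[name])
            picks.append((best, flavor_returns[best][t]))
        # Only after the pick is made do we fold month t into the history.
        for name in names:
            cumulative[name] *= 1.0 + flavor_returns[name][t]
    return picks

# Toy data for two hypothetical flavors of one ESB.
flavors = {
    "model_a": [0.02, -0.01, 0.03, 0.00],
    "model_b": [0.01, 0.04, -0.02, 0.01],
}
picks = bfom_returns(flavors)
for flavor, realised in picks:
    print(flavor, realised)
```

In this toy run the flavor chosen ex-ante is not the better performer in any of the three months, which is exactly why the selection has to be made on T-1 information rather than with hindsight.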

Correlation of ESB BFOMs

When we use ESB BFOMs as predictors in regression models, multi-collinearity becomes an immediate concern (as is often the case with financial time series data) because our ESB BFOMs are based on intuitive, fundamentally guided categories that are not orthogonalized. If the degree of correlation between variables is high enough, it can cause problems when we fit the model and interpret the results: one may not be able to distinguish between the individual effects of the ESB BFOMs.

In this note we strive to quantify the degree of multi-collinearity and show how one may deal with it. The ESB data used are 20y of monthly total returns for the 18 ESB BFOMs from December 1999 to December 2019.


We present a heatmap of the 20y+ LTD return correlations (as of 07/31/20). While certain ESBs like SIRF & SIZE have mostly low to negative correlations with the rest, other pairs are extremely highly correlated: EFF and DV at 0.89, MOM and ENMOM at 0.91, RV and DV at 0.80, etc. This clearly suggests that multi-collinearity ought to be a concern. Indeed, looking at the matrix we see reassuring patches of red (negative correlations) but also correlated clusters of green.
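As a rough sketch of how such a correlation screen might be run, the snippet below computes a correlation matrix from a matrix of monthly returns and lists the pairs at or above a cutoff. The data are synthetic, the column names are placeholders rather than actual ESBs, and both the helper `high_correlation_pairs` and the 0.8 cutoff are illustrative assumptions.

```python
import numpy as np

def high_correlation_pairs(returns, names, threshold=0.8):
    """Return (name_i, name_j, correlation) for every pair of columns whose
    absolute correlation is at or above `threshold`, strongest first."""
    corr = np.corrcoef(returns, rowvar=False)  # columns are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in for 240 months of returns on three placeholder factors:
# the first two share a common driver, the third is independent.
rng = np.random.default_rng(0)
common = rng.normal(size=240)
returns = np.column_stack([
    common + 0.3 * rng.normal(size=240),
    common + 0.3 * rng.normal(size=240),
    rng.normal(size=240),
])
names = ["FACTOR_A", "FACTOR_B", "FACTOR_C"]
pairs = high_correlation_pairs(returns, names)
print(pairs)
```

Pairwise correlations are only a first screen: a variable can be well explained by a combination of several others without any single pairwise correlation being extreme.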

20y Return correlations for QMIT Enhanced Smart Betas:

Detecting Multi-collinearity using VIFs

In order to assess multi-collinearity, one can calculate the VIFs (Variance Inflation Factors). For a given ESB BFOM, VIF(i) is calculated by taking the ith BFOM and regressing it against every other BFOM to see how much of it can be explained by the other BFOMs. If the R² of that OLS regression, denoted R²(i), is high then, as per the standard definition, VIF(i) will be high, suggesting collinearity:

VIF(i) = 1 / (1 - R²(i))
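A minimal sketch of this calculation is given below, assuming the returns are held in a NumPy array with one column per predictor. The function name `vif` and the synthetic data are hypothetical, and this is one straightforward way of coding the regression-based definition rather than the code used to produce the results in this note.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column of X.

    Column i is regressed (with an intercept) on all the other columns,
    and VIF(i) = 1 / (1 - R^2) of that regression.
    """
    X = np.asarray(X, dtype=float)
    n_obs, n_vars = X.shape
    out = np.empty(n_vars)
    for i in range(n_vars):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n_obs), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        tss = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - (resid @ resid) / tss
        out[i] = 1.0 / (1.0 - r2)  # blows up as r2 approaches 1
    return out

# Synthetic example: columns 0 and 1 share a common driver, column 2 does not.
rng = np.random.default_rng(0)
common = rng.normal(size=240)
X = np.column_stack([
    common + 0.2 * rng.normal(size=240),
    common + 0.2 * rng.normal(size=240),
    rng.normal(size=240),
])
values = vif(X)
print(np.round(values, 2))
```

The same numbers can be obtained as the diagonal of the inverse of the correlation matrix, which is a convenient shortcut when many predictors are involved.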

By a common rule of thumb, a VIF(i) exceeding 10 indicates unacceptably high multi-collinearity for the ith BFOM. The ESB data used are the monthly BFOM returns of the 18 ESBs from December 1999 to December 2019. The corresponding VIFs are shown below:

As per the table above, five ESBs (ENMOM, RV, EFF, DV and MOM) have VIFs clearly in excess of 10. This is no surprise since RV & DV are closely related (as are MOM & ENMOM), and such closely related ESBs can logically be replaced by the corresponding composites.

VIFs after using Composites

In order to mitigate the multi-collinearity evidenced above, we replace DV and RV with the Value Composite (VAL) based on these two BFOMs. In addition, we replace ARS, ART, ENMOM and GROH with the Growth-Momentum Composite (GrowthMo), which leaves us with 14 explanatory variables: CSU, DIV, EFF, EQ, LEV, MOM, PROF, REV, RISK, SIRF, SIZE, STAB, VAL, and GrowthMo. We then re-calculate the VIFs for these 14 variables and note that all are now under 10, with only EFF & VAL over 9:

While some sources consider VIFs < 10 to suffice, others set a much more conservative threshold. Hence, we next investigate the VIFs when every ESB that belongs to a composite has been replaced by that composite (which incrementally amounts to collapsing 6 ESBs into Qual) and observe that all VIFs are now well under 10, with only 4 over 5.
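The substitution of composites for their constituents can be sketched as below. How the actual composites are constructed is not described in this note, so the equal-weighted average used here is purely an assumption for illustration; the helper names `collapse` and `vifs` are hypothetical, and the data are random numbers that merely borrow ESB labels.

```python
import numpy as np

def vifs(returns):
    """VIFs computed as the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(returns, rowvar=False)))

def collapse(returns, names, members, composite_name):
    """Replace the `members` columns with a single composite column.

    ASSUMPTION: the composite is an equal-weighted average of its members;
    the real composite construction may well differ.
    """
    keep = [k for k, n in enumerate(names) if n not in members]
    drop = [k for k, n in enumerate(names) if n in members]
    composite = returns[:, drop].mean(axis=1, keepdims=True)
    new_returns = np.hstack([returns[:, keep], composite])
    new_names = [names[k] for k in keep] + [composite_name]
    return new_returns, new_names

# Synthetic data only: "DV" and "RV" share a common driver, the rest do not.
rng = np.random.default_rng(0)
value = rng.normal(size=240)
names = ["DV", "RV", "SIZE", "DIV"]
returns = np.column_stack([
    value + 0.3 * rng.normal(size=240),
    value + 0.3 * rng.normal(size=240),
    rng.normal(size=240),
    rng.normal(size=240),
])
before = vifs(returns)
new_returns, new_names = collapse(returns, names, {"DV", "RV"}, "VAL")
after = vifs(new_returns)
print(dict(zip(names, np.round(before, 2))))
print(dict(zip(new_names, np.round(after, 2))))
```

The design choice worth noting is that the overlapping information is pooled into one variable rather than discarded, so the regression loses redundant columns but not the underlying theme.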

Finally, noting the high correlation between MOM & GrowthMo, we recompute the VIFs without MOM. As expected, the GrowthMo VIF drops sharply, by 52% to 2.90, once we remove MOM, given that MOM and ENMOM (a GrowthMo constituent) have a 0.91 correlation. The following table shows the logical progression & deflation of VIFs along this journey, which in the end leads us to a set of 8 predictors (in order of ascending VIF) where collinearity is reasonably under control:

{SIRF, REV, SIZE, GrowthMo, DIV, VAL, RISK, Qual}
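The effect of removing a single predictor on the remaining VIFs can be checked with a small before/after comparison such as the sketch below. It uses synthetic data with borrowed labels, and the helper `vif_change_on_drop` is a hypothetical name; it illustrates the mechanics only and does not reproduce the figures quoted above.

```python
import numpy as np

def vifs(returns):
    """VIFs computed as the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(returns, rowvar=False)))

def vif_change_on_drop(returns, names, dropped):
    """For each remaining predictor, return (VIF before, VIF after, % change)
    when the column named `dropped` is removed from the set."""
    before = dict(zip(names, vifs(returns)))
    keep = [k for k, n in enumerate(names) if n != dropped]
    kept_names = [names[k] for k in keep]
    after = dict(zip(kept_names, vifs(returns[:, keep])))
    return {
        n: (before[n], after[n], 100.0 * (after[n] / before[n] - 1.0))
        for n in kept_names
    }

# Synthetic data only: "MOM" and "GrowthMo" share a common driver.
rng = np.random.default_rng(1)
trend = rng.normal(size=240)
names = ["MOM", "GrowthMo", "SIZE", "DIV"]
returns = np.column_stack([
    trend + 0.4 * rng.normal(size=240),
    trend + 0.4 * rng.normal(size=240),
    rng.normal(size=240),
    rng.normal(size=240),
])
report = vif_change_on_drop(returns, names, "MOM")
for name, (b, a, pct) in report.items():
    print(f"{name}: {b:.2f} -> {a:.2f} ({pct:+.0f}%)")
```

Removing a predictor can never raise the VIF of those that remain, since each is then regressed on a smaller set of variables. By the same arithmetic, if the 52% figure quoted above is read as a relative decline, the GrowthMo VIF before removing MOM would have been roughly 2.90 / (1 - 0.52), or about 6.0.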

VIFs without MOM

The punchline is that a parsimonious subset of predictors which still encapsulates the most diverse/non-overlapping alphas would be the following 8 variables (in order of ascending VIFs, i.e. Variance Inflation Factors), of which VAL, Qual & GrowthMo are composites, each with multiple ESBs embedded & several hundred constituent factors therein:

{SIRF, REV, SIZE, GrowthMo, DIV, VAL, RISK, Qual}

This analysis is key towards a determination of the minimum set of ESBs & composites that one may wish to include amongst their set of alphas.



[1] We thank Xinyi Xu for her key input.


©2019 by QuantZ Machine Intelligence Technologies.