In this post I take an R-view of “Backtesting – Harvey & Liu (2015).” The authors propose an alternative to the commonly practiced 50% discount applied to reported Sharpe ratios when evaluating backtests of trading strategies. The reason for the discount is the inevitable data mining inherent in research, both past and present. Harvey & Liu (HL) propose a multiple hypothesis testing framework with a more stringent requirement for determining a test’s significance level than single tests. HL suggest that the single hypothesis test used by most researchers leads to too many false discoveries (Type I errors). This makes intuitive sense if you consider that the probability of finding a significant factor increases with the number of factors tested or the number of tests conducted.

To give an example quoted in Harvey & Liu – Evaluating Trading Strategies (which you may have heard before), imagine you receive an email from a manager promoting a particular stock. The email says to track the manager’s recommendations over time. Each recommendation is either to go long or short. The manager is correct 10 weeks in a row. The probability of such a feat, you figure, is very small: in fact a 0.09765% probability of occurring by chance (0.5^10 = 0.0009765625) based on single hypothesis testing. But what if the manager sent a similar email to 100,000 people, telling half to go long and the other half to go short the same stock? Every week the manager trims from his mailing list the half for whom the recommendation did not work. Every week for 10 weeks the list is trimmed. By the end of week 10 a total of 97 lucky people have received nothing but perfect picks. The results were random, yet the recipients would have been fooled.
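The arithmetic behind this example is quick to verify in R (the 100,000 recipients and 10 weeks are taken from the example above):

```r
# Single-test view: probability of 10 correct weekly calls in a row by chance
p_single <- 0.5^10
p_single            # 0.0009765625, i.e. ~0.098%

# Multiple-test view: 100,000 recipients, half dropped each week for 10 weeks
recipients <- 100000
survivors  <- floor(recipients / 2^10)
survivors           # 97 people see a "perfect" record purely by chance
```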

In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate “haircut Sharpe ratio” is non-linear, in that the highest Sharpe ratios (SR’s) are only moderately penalized whilst marginal SR’s more so. The implication is that high SR’s are more likely true discoveries in a multiple hypothesis testing framework.

If you would like to jump straight to my R implementation it is at the end of this post. The authors make their Matlab code for Exhibits 5 & 6 available here. I simply converted it to R and added some extra bits for illustrative purposes.

HL mention 5 caveats to their framework, namely:

- Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
- Sharpe ratios normalize returns based on their volatility (ie. market risk), which may not be the most appropriate reflection of risk for a strategy.
- Determining the appropriate significance level for multiple testing is a judgement call (in single tests, 5% is the usual cutoff).
- The multiple testing method you choose could yield different conclusions. HL propose 3 methods together with an average.
- Lastly, and I would say most sensitive, is the assumed number of tests.

**Linking Sharpe ratio with t-statistic**

To explain the link between the Sharpe ratio and the t-stat, and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume a null hypothesis that the strategy’s mean return is zero, implying a 2-sided alternative hypothesis that it is significantly different from zero. A strategy can be regarded as profitable if its mean returns are either side of zero, since investors can generally go long or short. Since returns will at least be asymptotically normally distributed (thanks to the Central Limit Theorem), a t-statistic following a t-distribution can be constructed and tested for significance. However, due to the link between the Sharpe ratio and the t-stat, it is possible to assess the significance of a strategy’s excess returns directly using the Sharpe ratio. Assume $\hat{\mu}$ denotes the mean of your sample of historical returns (daily or weekly etc.) and $\hat{\sigma}$ denotes their standard deviation, then

$$ t\text{-ratio} = \frac{\hat{\mu}}{\hat{\sigma}/\sqrt{T}} $$

where $T - 1$ is the degrees of freedom, and since

$$ \widehat{SR} = \frac{\hat{\mu}}{\hat{\sigma}} $$

it can be shown that

$$ t\text{-ratio} = \widehat{SR} \times \sqrt{T} $$

By implication a higher Sharpe ratio equates to a higher t-ratio, implying a higher significance level (lower p-value) for an investment strategy. If we denote the p-value of the single test as $p^S$, then we can present it as

$$ p^S = \Pr\left(|r| > t\text{-ratio}\right) $$

or

$$ p^S = 2\left(1 - \Phi(t\text{-ratio})\right) $$

Now if the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value, which, if we denote it as $p^M$, could be represented as

$$ p^M = \Pr\left(\max\{|r_i|,\ i = 1, \dots, N\} > t\text{-ratio}\right) $$

or, assuming the $N$ tests are independent,

$$ p^M = 1 - (1 - p^S)^N $$

By equating the p-value of a single test to a multiple test p-value we get the defining equation of the haircut Sharpe ratio $\widehat{SR}^{adj}$, which is

$$ p^M = 2\left(1 - \Phi\left(\widehat{SR}^{adj} \times \sqrt{T}\right)\right) $$

Using the example in HL, assuming a strategy with 20 years of monthly returns (T = 240) and an annual Sharpe ratio (SR) of 0.75, you would have a p-value for the single test of

$$ p^S = 2\left(1 - \Phi\left(\tfrac{0.75}{\sqrt{12}} \times \sqrt{240}\right)\right) \approx 0.0008 $$

To compute the single p-value and the multiple testing p-value assuming researchers have tried N = 200 strategies, you can use the R code below to compute the adjusted or “haircut” Sharpe ratio:

```r
# Set the variables
N = 200        # Number of strategies tested
T = 240        # Number of observations or degrees of freedom
SR_ann = 0.75  # Annualised Sharpe ratio

# Compute the single test p-value
p_single = 2 * (1 - pnorm(SR_ann * sqrt(T/12)))

# Compute the multiple test p-value
p_multiple = 1 - (1 - p_single)^N

# Compute haircut Sharpe ratio (HSR)
z_stat = qt(1 - p_multiple/2, N - 1)
HSR = (z_stat/sqrt(T)) * sqrt(12)

# Sharpe ratio haircut
haircut = (SR_ann - HSR)/SR_ann
```

The code should output an HSR of 0.325 (HL yields 0.32 due to rounding the single and multiple p-values) which implies a haircut to the original Sharpe ratio of 56.6% (HL report a 60% haircut). The example above is no longer appropriate where t-statistics are dependent. More on this later.

**Multiple Multiple-Testing methods**

HL mention 3 well-known adjustment methods from the statistics literature, originally prescribed in the paper “…and the Cross-Section of Expected Returns” by Harvey, Liu and Zhu. These are Bonferroni; Holm; and Benjamini, Hochberg, and Yekutieli (BHY).

**Bonferroni’s adjustment:**

Bonferroni applies the same adjustment to the p-value of each test, inflating the p-value by the number of tests. The multiple testing p-value is the minimum of each inflated p-value and 1 where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies where the p-values are (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). According to a 5% significance cutoff the first 5 tests would be considered significant. Using the p.adjust function in R we can get the multiple adjusted p-values and according to Bonferroni only the first test would be considered significant.

```r
# Set the variables
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)

# Assign to adj.p_values the output of p.adjust()
adj.p_values = p.adjust(p_values, "bonferroni")

# Print
adj.p_values
```
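As a quick sanity check, Bonferroni’s adjustment can also be computed by hand; the `pmin()` call below reproduces what `p.adjust(..., "bonferroni")` does, inflating each p-value by the number of tests and capping at 1:

```r
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)

# Inflate each p-value by the number of tests, capping at 1
manual_bonf = pmin(length(p_values) * p_values, 1)

all.equal(manual_bonf, p.adjust(p_values, "bonferroni"))  # TRUE
```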

**Holm’s adjustment:**

p-value adjustments can be categorized into 2 types, namely: single-step and sequential. Single-step corrections adjust all p-values equally, as in Bonferroni. Sequential adjustments are adaptive, based on the distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982), and section 7.3 of this paper gives a useful example of an application of multiple testing hypothesis diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure. For Holm, with the p-values sorted in ascending order $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(M)}$, the equivalent adjusted p-value is

$$ p^{Holm}_{(i)} = \min\left\{ \max_{j \leq i}\left[(M - j + 1)\, p_{(j)}\right],\ 1 \right\} $$

Using the same 6 strategies illustrated above, we can use p.adjust() to compute the Holm adjusted p-values.

```r
# Run p.adjust() for Holm
HOLM_adj.p_values = p.adjust(p_values, "holm")
HOLM_adj.p_values
```
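The sequential formula can also be verified by hand. This sketch applies the step-down rule directly and matches `p.adjust`'s Holm output:

```r
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
M = length(p_values)
o = order(p_values)                            # ascending order of p-values

stepdown = (M - seq_len(M) + 1) * p_values[o]  # (M - j + 1) * p_(j)
manual_holm = pmin(cummax(stepdown), 1)        # enforce monotonicity, cap at 1
manual_holm = manual_holm[order(o)]            # back to original order

all.equal(manual_holm, p.adjust(p_values, "holm"))  # TRUE
```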

We see in the output above that the first 2 tests are significant at the 5% cutoff, compared to only 1 with Bonferroni. This is because Bonferroni adjusts all tests equally, whereas Holm applies a sequential approach. It should therefore not surprise you that adjusted Sharpe ratios under Bonferroni will be lower than under Holm. At this point it is useful to mention that both Holm and Bonferroni attempt to prevent even 1 Type I error occurring, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY, and the main difference from the previous 2 methods is that BHY attempts to control the false discovery rate (FDR), implying more lenience than Holm and Bonferroni and therefore expected to yield higher adjusted Sharpe ratios.

**BHY – (Benjamini, Hochberg, and Yekutieli):**

BHY’s formulation of the FDR can be represented as follows. Firstly, all p-values are sorted in descending order $p_{(1)} \geq p_{(2)} \geq \dots \geq p_{(M)}$, and the adjusted p-value sequence is defined by pairwise comparisons:

$$ p^{BHY}_{(i)} = \begin{cases} p_{(1)} & \text{if } i = 1 \\[4pt] \min\left\{ p^{BHY}_{(i-1)},\ \dfrac{M \times c(M)}{M - i + 1}\, p_{(i)} \right\} & \text{if } i \geq 2 \end{cases} $$

where $c(M)$ is defined as

$$ c(M) = \sum_{j=1}^{M} \frac{1}{j} $$

making $c(M) = 2.45$ for the example of M = 6 strategies (ie. the sum 1/1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6). Using p.adjust() we can see that there are now 4 significant strategies at the 5% cutoff. Note the output from p.adjust() below differs slightly from HL’s BHY implementation: HL leave the least significant p-value unadjusted at 0.06, whereas p.adjust() scales it too.

```r
# Run p.adjust() for BHY
BHY_adj.p_values = p.adjust(sort(p_values, decreasing = TRUE), "BY")
BHY_adj.p_values
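To reproduce HL’s version exactly, the pairwise recursion above can be coded directly (working in ascending order for convenience); note how it leaves the largest p-value untouched:

```r
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
M  = length(p_values)
cM = sum(1 / seq_len(M))      # c(M) = 1 + 1/2 + ... + 1/M, here 2.45

p_sorted = sort(p_values)     # ascending: p_(1) <= ... <= p_(M)
bhy = numeric(M)
bhy[M] = p_sorted[M]          # largest p-value is left as-is
for (i in (M - 1):1) {
  bhy[i] = min(bhy[i + 1], M * cM / i * p_sorted[i])
}
round(bhy, 4)  # 0.0496 0.0496 0.0496 0.0496 0.0600 0.0600
```

The first 4 adjusted p-values fall below 5%, consistent with HL’s count of 4 significant strategies.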

To summarise these methods: BHY leads to 4 significant discoveries versus Holm’s 2 and Bonferroni’s 1. We expect BHY to be more lenient as it controls the false discovery rate, whereas Holm and Bonferroni control the family-wise error rate, trying to avoid even 1 false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus Holm’s sequential approach. With these 3 methods HL adjust p-values to account for multiple testing, then convert these to haircut Sharpe ratios, and in so doing control for data mining. For both Holm and BHY you need the empirical distribution of p-values from previously tried strategies. Harvey, Liu and Zhu (HLZ) model over 300 risk factors documented in the finance literature. However, using this sample for the distribution of p-values is incomplete, since many tried strategies would not have been documented (referred to as publication bias), plus the tests are potentially correlated, violating the requirement for independence between tests. HLZ propose a new distribution to overcome these shortfalls.

**Harvey, Liu and Zhu (HLZ)**

As previously mentioned HLZ study over 300 factors tested for explaining the cross section of return patterns. They publish a list of their sources here in spreadsheet format. Just looking at the history of published factors they show that factor discovery increases every decade.

Using the t-statistics published with those papers (assuming they are economically and statistically sound), HLZ perform the 3 multiple testing procedures described above (Bonferroni, Holm and BHY). They conclude that, under the assumption that all tried factors are published, an appropriate minimum threshold t-statistic for 5% significance is 2.8, which equates to a p-value of 0.50% for single tests. Whilst the assumption that all tried factors are published is not reasonable, HLZ argue that the analysis does serve to provide a minimum threshold for accepting the significance of future tests.

HLZ limit their factor sample to unique risk factors so as to keep test dependence to a minimum, a requirement for the 3 multiple testing procedures described above. But we know the requirements for being published are fairly stringent and most likely limited to tests that show significant t-statistics. HLZ estimate that 71% of tried tests are not published (see Appendix B of HLZ for details), and based on this, together with their implementation of the 3 adjustment procedures, they propose a benchmark t-statistic of 3.18. Intuitively this is greater than the earlier threshold of 2.8, which assumed all tried factors were published.
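To put these benchmark t-statistics in Sharpe ratio terms, the link between the Sharpe ratio and the t-stat can be inverted. This is my own illustration, not a calculation from HLZ: the 20-year monthly sample (T = 240) is carried over from the earlier haircut example.

```r
# Convert benchmark t-statistics into annualised Sharpe ratio hurdles,
# assuming 20 years of monthly returns (T = 240): SR = t / sqrt(T), annualised
t_hurdles = c(all_published = 2.80, with_unpublished = 3.18)
T = 240
SR_hurdle_ann = t_hurdles / sqrt(T) * sqrt(12)
round(SR_hurdle_ann, 2)  # 0.63 0.71
```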

What about test dependence ie. correlation among test statistics?

This will be the case where factors tested model similar forms of risk and are therefore correlated. Think of the multitude of price multiple factors. In the extreme case where all factors are 100% correlated, there is effectively only one test, so no multiple testing adjustment would be required; adjusting would penalize Type I errors (false discoveries) too harshly at the cost of an increased Type II error rate (missed discoveries). In multiple testing, the statistics literature attempts to account for correlation through various methods, including resampling the entire dataset. In finance research, bootstrap procedures are being used to assess the significance of individual tests and infer fund manager skill versus chance. Indeed, this is what we are aiming to achieve with mcsim() and txnsim() in the R:blotter package. However, since HLZ do not have access to the datasets underlying the majority of the published factors, they instead infer a distribution from the published t-statistics. They suggest a mixed distribution, in which the null hypothesis that mean returns are zero is drawn from a normal distribution and the alternative hypothesis that mean returns are non-zero is drawn from an exponential distribution. This should make economic sense if one assumes more profitable strategies are less likely to exist in an economically scarce world. Using some statistical magic (Section 4 and Appendix A in HLZ, which I hope to address more specifically in a future post) HLZ propose model parameter estimates as follows:

Using the baseline assumptions for the number of unobserved tests, one can use Harvey & Liu’s code in Exhibit 5 & 6 (replicated in R below) to determine the appropriate haircut Sharpe ratio (HSR) or the required mean return for a given significance level, both taking into account an assumed number of tests.

**Conclusion**

Data mining is unavoidable and in some respects represents knowledge gain. Accounting for it when measuring statistical significance is a requirement. Judgement has to be exercised when choosing significance levels and the most appropriate multiple testing technique. Bonferroni and Holm are appropriate for mission-critical tests, such as medical research or space missions, where controlling the chance of even a single false discovery is desirable. Where the application is not a matter of life or death, controlling the rate of false discoveries may be more appropriate. Perhaps this applies to finance. Perhaps not. Either way, the requirements for statistical significance of trading strategies are becoming more stringent over time. This is natural: most of the true factors will have been discovered, and the cost of mining data has diminished and continues to do so at an impressive rate. In other sciences, such as physics and medicine/genomics, tests can require the equivalent of 5 standard deviations to be considered significant. The field of finance is headed in the same direction as more factors are published. Assuming the widely used Sharpe ratio is an applicable metric for your new strategy’s evaluation, you would be well served to consider its haircut equivalent.

**References**

- Backtesting – Harvey & Liu
- …and the Cross-Section of Expected Returns – Harvey, Liu and Zhu
- Multiple Testing in Economics – Harvey & Liu
- Evaluating Trading Strategies – Harvey & Liu

**R implementation of Exhibit 5 (Haircut_SR.m and sample_random_multests.m)**

**R implementation of Exhibit 6 (Profit_Hurdle.m)**

*Please note you will need the sample_random_multests() function from above to run the Profit_Hurdle R code*