R-view: Backtesting – Harvey & Liu (2015)

In this post I take an R-view of “Backtesting – Harvey & Liu (2015).” The authors propose an alternative to the commonly practiced 50% discount applied to reported Sharpe ratios when evaluating backtests of trading strategies. The discount compensates for the data mining that is inevitable in research, both past and present. Harvey & Liu (HL) propose a multiple hypothesis testing framework with a more stringent requirement for declaring a test significant than the one used for single tests. HL argue that the single hypothesis tests used by most researchers lead to too many false discoveries (Type I errors). This makes intuitive sense if you consider that the probability of finding a seemingly significant factor increases with the number of factors tested or the number of tests conducted.

To give an example quoted in Harvey & Liu – Evaluating Trading Strategies (which you may have heard before), imagine you receive an email from a manager promoting a particular stock. The email says to track the manager’s recommendations over time. The recommendation is either to go long or to go short, and the manager is correct 10 weeks in a row. The probability of such a feat, you figure, is very small: roughly a 0.098% chance of it happening by luck (0.5^10 = 0.0009765625) based on a single hypothesis test. But what if the manager sent a similar email to 100,000 people, telling half to go long and the other half to go short the same stock? Every week the manager trims the mailing list to the half for whom the recommendation worked. After 10 weeks of trimming, roughly 97 lucky people have been the recipients of these perfect picks. The results were random, and the recipients would have been fooled.
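The arithmetic behind the example is a couple of one-liners in R:

# Probability of a 10-week streak under a single test, and the number of the
# 100,000 recipients still on the list after 10 weekly halvings
0.5 ^ 10 # 0.0009765625, or roughly 0.098%
100000 / 2 ^ 10 # roughly 97 recipients left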

In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate “haircut Sharpe ratio” is non-linear, in that the highest Sharpe ratios (SRs) are only moderately penalized whilst marginal SRs are penalized more heavily. The implication is that high SRs are more likely to be true discoveries in a multiple hypothesis testing framework.

If you would like to jump straight to my R implementation it is at the end of this post. The authors make their Matlab code for Exhibits 5 & 6 available here. I simply converted it to R and added some extra bits for illustrative purposes.

HL mention 5 caveats to their framework, namely:

  • Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
  • Sharpe ratios normalize returns by their volatility (i.e. total risk), which may not be the most appropriate measure of risk for a strategy.
  • Determining the appropriate significance level for multiple testing (whereas 5% is the usual cutoff for single tests).
  • The choice of multiple testing method, since different methods can yield different conclusions. HL propose 3 methods together with an average of the three.
  • Lastly, and I would say the most sensitive input of all, the assumed number of tests.

Linking Sharpe ratio with t-statistic

To explain the link between the Sharpe ratio and the t-statistic, and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume the null hypothesis is that the strategy’s mean return is zero; the alternative hypothesis, that the mean return differs from zero, is therefore 2-sided. A strategy can be regarded as profitable whether its mean return is above or below zero, since investors can generally go long or short. Since returns will at least be asymptotically normally distributed (thanks to the Central Limit Theorem), a t-statistic following a t-distribution can be constructed and tested for significance. However, due to the link between the Sharpe ratio and the t-statistic, it is possible to assess the significance of a strategy’s excess returns directly using the Sharpe ratio. If \hat{\mu} denotes the mean of your sample of historical returns (daily, weekly, etc.) and \hat{\sigma} the standard deviation, then

t-statistic = \frac{ \hat{\mu}}{(\hat{\sigma}/\sqrt{T})}

where T is the number of return observations (giving T-1 degrees of freedom), and since

\widehat{SR} =  \hat{\mu}/\hat{\sigma}

it can be shown that

\widehat{SR} = t-statistic/\sqrt{T}

By implication a higher Sharpe ratio equates to a higher t-ratio, implying greater significance (a lower p-value) for an investment strategy. If we denote the p-value of the single test as {p^s}, then we can write it as

{p^s} = Pr( |r| > t-ratio)

or

{p^s} = Pr( |r| > \widehat{SR}.\sqrt{T})

Now if the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value which, if we denote it as {p^m}, could be represented as

{p^m} = Pr( max{|{r_i}|, i = 1,...,N} > t-ratio)

or

{p^m} = 1 - (1 - {p^s})^N

By equating the p-value of a single test to a multiple testing p-value we get the defining equation of {p^m}, which is

{p^m} = Pr( max{|{r_i}|, i = 1,...,N} > \widehat{SR}.\sqrt{T})

Using the example in HL, for a strategy with 20 years of monthly returns (T = 240) and an annual Sharpe ratio (SR) of 0.75, the p-value for the single test would be

{p^s} = Pr( |r| > t-ratio) \approx 0.0008

To compute the single-test p-value and the multiple testing p-value, assuming researchers have tried N = 200 strategies, you can use the R code below to derive the adjusted or “haircut” Sharpe ratio:

# Set the variables
N = 200 # Number of strategies tested
T = 240 # Number of monthly observations (20 years)
SR_ann = 0.75 # Annualised Sharpe ratio
# Compute the single test p-value (two-sided, normal approximation)
p_single = 2 * (1 - pnorm(SR_ann * sqrt(T/12)))
# Compute the multiple test p-value assuming N independent tests
p_multiple = 1 - (1 - p_single) ^ N
# Convert the adjusted p-value back to an annualised haircut Sharpe ratio (HSR)
z_stat = qt(1 - p_multiple/2, N-1)
HSR = (z_stat/sqrt(T)) * sqrt(12)
# Sharpe ratio haircut
haircut = (SR_ann - HSR)/SR_ann

The code should output an HSR of 0.325 (HL report 0.32 due to rounding of the single and multiple p-values), which implies a haircut to the original Sharpe ratio of 56.6% (HL report a 60% haircut). Note that the example above is no longer appropriate when t-statistics are dependent. More on this later.

Multiple Multiple-Testing methods

HL mention 3 well-known adjustment methods from the statistics literature, originally described in the paper “…and the Cross-Section of Expected Returns” by Harvey, Liu and Zhu. These are Bonferroni; Holm; and Benjamini, Hochberg, and Yekutieli (BHY).

Bonferroni’s adjustment:

{p_i^{Bonferroni}} = min[M * {p_i}, 1]

Bonferroni applies the same adjustment to the p-value of each test, inflating each p-value by the number of tests. The adjusted p-value is the minimum of the inflated p-value and 1, where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies: (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). At a 5% significance cutoff the first 5 tests would be considered significant on their own. Using the p.adjust function in R we can get the multiple-testing adjusted p-values, and according to Bonferroni only the first test remains significant.

# Set the variables
p_values = c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
# Assign to adj.p_values the output of p.adjust()
adj.p_values = p.adjust(p_values, "bonferroni")
# Print
adj.p_values

[1] 0.0300 0.0540 0.0768 0.0810 0.2700 0.3600

Holm’s adjustment:

p-value adjustments fall into 2 categories: single-step and sequential. Single-step corrections adjust all p-values equally, as in Bonferroni. Sequential corrections are adaptive procedures based on the ordered distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982), and section 7.3 of this paper gives a useful example of multiple testing diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure, and its adjusted p-value is

{p_{(i)}^{Holm}} = min[\max_{j \le i} \{(M - j + 1)*{p_{(j)}}\}, 1]

Using the same 6 strategies illustrated above, we can use p.adjust() to compute the Holm adjusted p-values.

# Run p.adjust() for Holm
HOLM_adj.p_values = p.adjust(p_values, "holm")
HOLM_adj.p_values

[1] 0.0300 0.0450 0.0512 0.0512 0.0900 0.0900
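As a cross-check of the formula above, the same numbers can be computed by hand: sort the p-values in ascending order, scale the j-th smallest by (M - j + 1), take the running maximum and cap at 1. A quick sketch:

# Manual Holm adjustment (sketch); matches p.adjust(p_values, "holm")
sp <- sort(p_values)
M <- length(sp)
pmin(cummax((M - seq_along(sp) + 1) * sp), 1)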

We see in the output above that the first 2 tests are significant at the 5% cutoff, compared with only 1 under Bonferroni. This is because Bonferroni adjusts all tests equally, whereas Holm applies a sequential approach. It should therefore not surprise you that haircut Sharpe ratios under Bonferroni will be lower than under Holm. At this point it is useful to mention that both Holm and Bonferroni attempt to prevent even 1 Type I error from occurring, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY, and its main difference from the previous 2 methods is that BHY attempts to control the false discovery rate (FDR) instead. It is therefore more lenient than Holm and Bonferroni and can be expected to yield higher haircut Sharpe ratios.

BHY – (Benjamini, Hochberg, and Yekutieli):

BHY’s formulation of the FDR adjustment can be represented as follows. First, the p-values are ordered from smallest to largest, {p_{(1)}} \le \ldots \le {p_{(M)}}, and the adjusted sequence is defined by pairwise comparisons, working backwards from the least significant (largest) p-value:

{p_{(M)}^{BHY}} = {p_{(M)}}

{p_{(i)}^{BHY}} = min[{p_{(i+1)}^{BHY}}, \frac{M \times c(M)}{i} {p_{(i)}}], \quad i \le M - 1

c(M) is defined as follows

c(M) = {\sum_{j=1}^M} \frac{1}{j}

making c(M) = 2.45 for the example of M = 6 strategies (i.e. 1/1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6). Using p.adjust() we can see that there are now 4 significant strategies at the 5% cutoff. The output from p.adjust() below is slightly different to HL’s BHY implementation, as p.adjust() does not equate {p_{(6)}^{BHY}} with the least significant p-value, 0.06.

# Run p.adjust() for BHY
BHY_adj.p_values = p.adjust(sort(p_values, decreasing = TRUE), "BY")
BHY_adj.p_values

[1] 0.1470000 0.1323000 0.0496125 0.0496125 0.0496125 0.0496125
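To reproduce HL’s own BHY sequence, which leaves the least significant p-value at 0.06 rather than scaling it, the pairwise comparisons above can be hand-rolled. A sketch:

# Sketch of HL's BHY sequence, working backwards from the largest p-value
sp <- sort(p_values)            # ascending, so sp[M] is the least significant
M <- length(sp)
c_M <- sum(1 / (1:M))           # c(M) = 2.45 for M = 6
bhy <- numeric(M)
bhy[M] <- sp[M]                 # the least significant p-value is left as is
for (i in (M - 1):1) {
  bhy[i] <- min(bhy[i + 1], (M * c_M / i) * sp[i])
}
round(bhy, 4)                   # 0.0496 0.0496 0.0496 0.0496 0.0600 0.0600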

To summarise these methods: BHY leads to 4 significant discoveries versus Holm’s 2 and Bonferroni’s 1. We expect BHY to be more lenient as it controls the false discovery rate, whereas Holm and Bonferroni control the family-wise error rate, trying to avoid even 1 false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus Holm’s sequential approach. With these 3 methods HL adjust p-values to account for multiple testing and then convert them to haircut Sharpe ratios, thereby controlling for data mining. For both Holm and BHY you need the empirical distribution of p-values from previously tried strategies. Harvey, Liu and Zhu (HLZ) model over 300 risk factors documented in the finance literature. However, this sample of p-values is incomplete, since many tried strategies were never documented (publication bias), and the documented ones are potentially correlated, violating the assumption of independence between tests. HLZ propose a new distribution to overcome these shortfalls.

Harvey, Liu and Zhu (HLZ)

As previously mentioned, HLZ study over 300 factors tested for explaining the cross-section of return patterns. They publish a list of their sources here in spreadsheet format. Just looking at the history of published factors, they show that the rate of factor discovery has increased every decade.

[Figure: history of factor discoveries per decade, from HLZ]

Using the t-statistics published with those papers (assuming they are economically and statistically sound), HLZ perform the 3 multiple testing procedures described above (Bonferroni, Holm and BHY). They conclude that, under the assumption that all tried factors are published, an appropriate minimum threshold t-statistic for 5% significance is 2.8, which equates to a p-value of 0.50% for single tests. Whilst the assumption that all tried factors are published is not realistic, HLZ argue that the analysis does serve to provide a minimum threshold for accepting the significance of future tests.
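A quick check of that equivalence in R, using the normal approximation for a two-sided test:

# Two-sided p-value for a t-statistic of 2.8
2 * (1 - pnorm(2.8)) # approximately 0.0051, i.e. roughly 0.5%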

HLZ limit their factor sample to unique risk factors so as to keep test dependence to a minimum; independence is an assumption of the 3 multiple testing procedures described above. But we know the requirements for being published are fairly stringent and most likely limited to tests that show significant t-statistics. HLZ estimate that 71% of tried tests are not published (see Appendix B of HLZ for details), and based on this, together with their implementation of the 3 adjustment procedures, HLZ propose a benchmark t-statistic of 3.18. Intuitively this is higher than the earlier threshold of 2.8, which assumed that all tried factors were published.

What about test dependence, i.e. correlation among test statistics?

This will be the case where the factors tested model similar forms of risk and are therefore correlated; think of the multitude of price-multiple factors. In the extreme case where all factors are 100% correlated, no adjustment to the single test would be required, and adjusting would penalize Type I errors (false discoveries) too harshly at the cost of increasing the Type II error rate (missed discoveries). In multiple testing, the statistics literature attempts to account for correlation through various methods, including resampling the entire dataset. In finance research, bootstrap procedures are used to assess the significance of individual tests and to separate fund manager skill from luck. Indeed, this is what we are aiming to achieve with mcsim() and txnsim() in the R blotter package. However, since HLZ do not have access to the underlying datasets for most of the published factors, they instead infer a distribution from the published t-statistics. They suggest a mixed distribution, in which factors under the null hypothesis (a mean return of zero) generate pure noise drawn from a normal distribution, while the mean returns of true factors (the alternative hypothesis of a non-zero mean) are drawn from an exponential distribution. This makes economic sense if one assumes that highly profitable strategies are scarce. Using some statistical magic (Section 4 and Appendix A in HLZ, which I hope to address more specifically in a future post) HLZ propose model parameter estimates as follows:

[Figure: HLZ model parameter estimates]
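To get a feel for what such a mixture implies, below is a purely illustrative simulation. The share of true factors, the monthly volatility and the exponential mean are made-up values for the sake of the sketch; they are not HLZ’s estimates (which appear in the exhibit above).

# Illustrative only: t-statistics from a mixture of null (zero-mean) factors and
# true factors whose mean returns are drawn from an exponential distribution.
# All parameter values here are assumptions, not HLZ's estimates.
set.seed(7)
n_tests <- 5000
p_signal <- 0.1 # assumed share of genuinely profitable factors
sigma <- 0.03 # assumed monthly volatility
T_obs <- 240 # 20 years of monthly observations
se <- sigma / sqrt(T_obs) # standard error of the mean return
true_mu <- ifelse(runif(n_tests) < p_signal,
                  rexp(n_tests, rate = 1 / 0.006), # assumed mean of 0.6% per month
                  0)
t_stats <- (true_mu + rnorm(n_tests, sd = se)) / se
hist(t_stats, breaks = 80, main = "t-statistics from a mixture (illustrative)")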

Using the baseline assumptions for the number of unobserved tests, one can use Harvey & Liu’s code from Exhibits 5 & 6 (replicated in R below) to determine the appropriate haircut Sharpe ratio (HSR) or the required mean return for a given significance level, both taking into account an assumed number of tests.

Conclusion

Data mining is unavoidable and in some respects represents knowledge gain. Accounting for it when measuring statistical significance is a requirement. Judgement has to be exercised when choosing significance levels and the most appropriate multiple testing technique. Bonferroni and Holm are appropriate for mission-critical applications, such as medical research or space missions, where controlling the chance of even a single false discovery is desirable. Where the application is not a matter of life or death, controlling the rate of false discoveries may be more appropriate. Perhaps this applies to finance, perhaps not. Either way, the requirements for statistical significance of trading strategies are becoming more stringent over time. This is natural: most of the true factors have arguably been discovered, and the cost of mining data has diminished and continues to do so at an impressive rate. In other fields, such as particle physics and genomics, tests require the equivalent of roughly 5 standard deviations to be considered significant, and finance is headed in the same direction as more factors are published. Assuming the widely used Sharpe ratio is an applicable metric for evaluating your new strategy, you would be well served to consider its haircut equivalent.

References

R implementation of Exhibit 5 (Haircut_SR.m and sample_random_multests.m)

R implementation of Exhibit 6 (Profit_Hurdle.m)

Please note you will need the sample_random_multests() function from above to run the Profit_Hurdle R code

Quantitative Strategy Development Overview – Brian Peterson

I have had the pleasure of getting to know and work with Brian Peterson of late, building out the blotter::mcsim function in the blotter package. I will be writing soon about this function and where it is headed, but in this post I wanted to share a presentation Brian gave to the CapeR User Group last week on Developing and Backtesting Systematic Trading Strategies with an application in R, in particular using the quantstrat and blotter packages. Enjoy!

 

Link to the slides here!

Block Bootstrapped Monte Carlo – in R


A few weeks back I wrote a post including the source code for a Monte Carlo simulation function in R. The idea was to randomly sample the daily returns produced by a backtest and build a confidence-interval distribution from the middle 50% and 90% of returns. Since then Brian Peterson got in touch, asking if I would work with him on getting some form of Monte Carlo simulation functionality standardized within the R blotter package. The blotter package forms part of the TradeAnalytics project on R-Forge (which includes quantstrat), where you will read the project description as…

“Transaction-oriented infrastructure for defining instruments, transactions, portfolios and accounts for trading systems and strategy simulation. Provides portfolio support for multi-asset class and multi-currency portfolios. Still in heavy development.”

Whilst working with Brian, he suggested a few enhancements to the function in the original post, one of which was the option to perform a block bootstrap of the original backtest in order to capture any autocorrelation effects. This leaves 2 options as far as existing R packages go, namely the meboot and boot packages.

 

MEBOOT

Now meboot is a really neat package which I found out about thanks to this post by James Picerno over at The Capital Spectator. It eliminates the requirement for stationarity in a time series. Below is an example of what a meboot plot looks like using the same simSample file I used in the previous post, with 1,000 samples (or replicates, as the meboot package refers to them).

[Figure: meboot replicates of the simSample backtest]

Looking at the above chart, I am not sure it is the most appropriate tool for the job, since the sampling is done within a very narrow range, giving large swings which tend to revert back to the original backtest. To make sure, though, I also ran the meboot() function over the backtest results generated by the “longtrend” demo strategy script in the blotter package. Again, the samples track the original backtest quite closely.

[Figure: meboot replicates of the longtrend demo backtest]

Whether or not meboot is the best tool for the job will ultimately come down to what the sample statistics say and which metrics you use to decide whether to take a strategy into production. Either way, it may be a useful option to include in any standardized Monte Carlo simulation functionality within blotter.
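If you want to try it yourself, a minimal sketch of generating and plotting meboot replicates might look like the following, where ret is my own placeholder for a numeric vector of daily backtest returns:

# Minimal meboot sketch; 'ret' is assumed to be a numeric vector of daily returns
library(meboot)
set.seed(123)
mb <- meboot(ret, reps = 1000) # maximum entropy bootstrap replicates
dim(mb$ensemble) # one column per replicate
# Compare replicate equity curves (grey) with the original backtest (black)
matplot(apply(1 + mb$ensemble, 2, cumprod), type = "l", lty = 1, col = "grey80",
        ylab = "Growth of 1")
lines(cumprod(1 + ret), lwd = 2)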

 

BOOT

The boot package includes functions and datasets for bootstrapping from “Bootstrap Methods and Their Application” by A. C. Davison and D. V. Hinkley (1997, CUP). Of particular interest for time series bootstrapping is the tsboot() function, whose purpose is described in the documentation on CRAN as to “Generate R bootstrap replicates of a statistic applied to a time series. The replicate time series can be generated using fixed or random block lengths or can be model based replicates.”
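A minimal sketch of calling tsboot() on a return series follows; ret is again my own placeholder for a numeric vector of backtest returns, and the statistic and block length are arbitrary choices for illustration:

# Sketch: fixed-length block bootstrap with boot::tsboot
library(boot)
set.seed(123)
tb <- tsboot(ret, statistic = sum, R = 1000, l = 5, sim = "fixed")
head(tb$t) # the statistic (here, the sum of returns) for each replicate
idx <- boot.array(tb) # R x length(ret) matrix of the sampled indices
dim(idx)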

Now, the tsboot object returned by the function does not include the samples themselves; instead you have to call the boot.array() function to get the sample indexing. Off the bat, though, I wasn’t sure the tsboot function was appropriate for simply block-sampling a time series, albeit it very elegantly and efficiently achieves just that on top of calculating a statistic for each sampled series. So I spent some time debugging the function to learn how it performs the block bootstrapping, and managed to come up with my own function, which produces the graph below after sampling simSample 1,000 times (R = 1000) using a block length of 5 (l = 5) [full source code below the graph].

[Figure: block-bootstrapped equity curves of simSample, R = 1000, l = 5]
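Purely to illustrate the idea of fixed-length block resampling (this is not the mcblocksim() source; function and argument names below are my own), a stripped-down sketch could look like this:

# Illustrative sketch of fixed-length block resampling (not the blotter implementation)
block_resample <- function(ret, R = 1000, l = 5) {
  n <- length(ret)
  n_blocks <- ceiling(n / l)
  sapply(seq_len(R), function(i) {
    starts <- sample(seq_len(n - l + 1), n_blocks, replace = TRUE)
    idx <- as.vector(sapply(starts, function(s) s:(s + l - 1)))[1:n]
    cumprod(1 + ret[idx]) # one resampled equity curve per column
  })
}
# eq <- block_resample(ret, R = 1000, l = 5)
# matplot(eq, type = "l", lty = 1, col = "grey80"); lines(cumprod(1 + ret), lwd = 2)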

Your choice of block length (l) can be anything (as long as it is less than the number of periods in your backtest), and how long you make your blocks could be based on a number of factors, including the average length of time your strategy is in the market or the average period over which your strategy experiences autocorrelation. Remember, the whole purpose of adding block bootstrap functionality is to capture the autocorrelation characteristics of your backtest.

So, whilst the mcblocksim() function is likely an improvement on the mcsimr() function I wrote about in my previous post, there is still quite a lot of work to do to get it ready for testing and committing to the blotter package. Particular to-do items include:

  • Revisiting the tsboot() function and determining whether or not some form of statistic can help users identify the optimal number of samples required
  • Potentially using the drawdowns() function in the PerformanceAnalytics package as a statistic in the tsboot() function
  • Applying confidence intervals to the samples, thereby eliminating outlying returns (as in mcsimr())
  • Adding any other strategy-specific noise to the samples

So, lots to do, but as they say, Rome wasn’t built in a day. Hopefully you find some use for the above source code in the meantime.

Thanks to Brian Peterson for his valuable input thus far.

 

Thanks for reading.

 

A Monte Carlo Simulation function for your back-test results – in R

In this post on bettersystemtrader.com, Andrew Swanscott interviews Kevin Davey from KJ Trading Systems, who discusses why looking at your backtest’s historical equity curve alone might not give you a true sense of a strategy’s risk profile. Kevin Davey also writes on the topic here for futuresmag.com. So I wrote a Monte Carlo-type simulation function (in R) to see graphically how my backtest results would have looked had the order of returns been randomized. I use ‘n’ to denote the number of simulations I would like to run, and exclude the highest and lowest 5% of return streams to form the blue ribbon. For the middle red ribbon I use the middle 50% of return streams (i.e. excluding the first and last quartiles). The second and last parameter (default TRUE) is a boolean which, if TRUE, runs the sampling procedure with replacement, and without replacement if set to FALSE. See this link if you need an intuitive explanation of what sampling with and without replacement means. Below is an example of the output for 10k simulations, with and without replacement, for a random set of returns on a hypothetical strategy spanning 7 months. The black line is the actual historical equity curve. Whether or not the return distributions satisfy your risk appetite will come down to a more refined drawdown analysis, together with the host of other risk metrics available for consideration. At least with the function below you have the stats in a dataframe and a graph as the starting point for any statistical analysis.

[Figures: mcsimr() output for 10k simulations, without and with replacement]

I read the data into R using the read.csv function. You can use the same data here (simSample.xlsx); just be sure to change the file type to CSV first, as WordPress only allows storing .xls and .xlsx Excel files. Below is the source code to add the function to your environment. I load the required packages and then create the mcsimr() function to do the import, manipulate the data and build the graph using ggplot. Remember to set your working directory to the directory in which you have stored the simSample.csv file.

If you would like to run the simulation 1,000 times without replacement, then the function should read:

mcsimr(1000, FALSE)

If you would like to run with replacement, include TRUE for the second argument or leave it blank, as the default is TRUE.

mcsimr(1000)

OK, the full source code for mcsimr() is hosted on GitHub and embedded below.
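As a rough illustration of what such a function involves (this is not the actual mcsimr() source; the returns column name and the defaults are my own assumptions), a minimal sketch might look like:

# Illustrative sketch only; not the actual mcsimr() source hosted on GitHub.
# Assumes simSample.csv has a single column of daily returns named 'returns'.
library(ggplot2)

mcsim_sketch <- function(n = 1000, replacement = TRUE, file = "simSample.csv") {
  rets <- read.csv(file)$returns
  n_obs <- length(rets)
  # Each column is one resampled equity curve
  eq <- sapply(seq_len(n), function(i) cumprod(1 + sample(rets, n_obs, replace = replacement)))
  bands <- t(apply(eq, 1, quantile, probs = c(0.05, 0.25, 0.75, 0.95)))
  df <- data.frame(period = seq_len(n_obs), actual = cumprod(1 + rets),
                   lo90 = bands[, 1], lo50 = bands[, 2],
                   hi50 = bands[, 3], hi90 = bands[, 4])
  ggplot(df, aes(x = period)) +
    geom_ribbon(aes(ymin = lo90, ymax = hi90), fill = "blue", alpha = 0.3) +
    geom_ribbon(aes(ymin = lo50, ymax = hi50), fill = "red", alpha = 0.4) +
    geom_line(aes(y = actual)) +
    labs(y = "Growth of 1", title = "Resampled equity curves (sketch)")
}

# mcsim_sketch(1000, FALSE) # 1,000 simulations without replacement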

I am aiming to have the function code on GitHub soon [EDIT: the function is now hosted (with love) on GitHub, however enhancements are still a work in progress], hopefully with some enhancements allowing the user to specify an xts object for the raw data (i.e. backtest results) as well as letting the user decide which percentages to use for the bands. Some drawdown stats would be useful too. I have built something which incorporates a drawdown analysis of every ‘n’ return stream; however, it needs some more work to speed it up, as you can imagine 10k xts objects with 500+ daily return observations each could take some time to analyse individually.

Lastly, I wonder if something like this could fit into the quantstrat or TradeAnalytics packages… or if there already is something similar, do let me know. [EDIT: Since publishing this post, Brian Peterson (maintainer of multiple finance R packages such as ReturnAnalytics, quantmod and quantstrat) got in touch and asked if I would work with him on getting this function into the blotter package, a dependency of quantstrat. Brian has also suggested a few improvements to the function, including using a block bootstrap to account for autocorrelation and plot.xts instead of ggplot for faster graphical rendering. So look out for an updated post on that.]

Happy hacking!

 

How I used 18 lines of R code to crack the Triangle Challenge

I recently came across a question that required logic and coding skills to solve. It went a little something like this:

Let there be a stick of length 1. Pick 2 points uniformly at random along the stick, and break the stick at those points. What is the probability of the three resulting pieces being able to form a triangle?

Interesting, I thought. I do enjoy these types of challenges and had fun trying to solve this problem. It took me a while to understand that the longest side (or one of the longest sides, in the case of an isosceles triangle) had to be smaller than the sum of the 2 smaller sides. In pseudo code, a triangle is possible IF:

Length of longest side < sum of other 2 sides, which is true IF

Length of longest side < 0.5 (half the length of the stick)

When I pictured (in my mind’s eye) breaking a stick in half and folding it so that both pieces overlap, it became obvious that I needed one of the pieces to be longer, and then to break that longer piece into 2 pieces of any length to form a would-be triangle. As a side note, I have been hoping to create my first gif for a while, and this might be a useful one to animate: a stick being broken in half, folded over, then one half growing in length and breaking into 2 to form a triangle.

I turned to the statistical computing language R to solve the problem, knowing I could randomly draw values from a uniform distribution with the runif() function.
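Before writing out the full solution, a quick Monte Carlo sanity check of the logic above takes only a few lines (the analytical answer to this classic problem is 1/4):

# Monte Carlo check: break the stick at two uniform points and test whether the
# longest of the three pieces is shorter than 0.5
set.seed(1)
n <- 1e6
a <- runif(n); b <- runif(n)
lo <- pmin(a, b); hi <- pmax(a, b)
longest <- pmax(lo, hi - lo, 1 - hi)
mean(longest < 0.5) # converges to the analytical answer of 0.25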
