Backtesting: The Discipline of Honest Historical Inquiry

Every event, seen from the outside, appears unique. The behavioral signatures beneath events are ancient. Studying human history with honest neutrality means studying the one force that holds constant across time: how human beings respond to uncertainty, opportunity, and fear.

This study, applied to the historical price record with defined rules and measured accountability, is what the systematic trading world calls backtesting.

Paul Tudor Jones demonstrated this in 1987. His research director, Peter Borish, overlaid the 1929 pre-crash market trajectory onto the 1987 market and found structural similarity between the two periods: parabolic run-ups driven more by optimism than by fundamentals. They positioned with put options on equity indexes. When Black Monday arrived on October 19, 1987, the Dow dropped 22.61% in a single day.

Tudor Investment tripled their money. The reason matters more than the outcome. Jones studied the historical pattern, recognized structural similarity, and prepared with guardrails. The team looked for structural rhyme across history, and they sized their positions to survive in case they were wrong.

Paul Tudor Jones, 2006. His 1987 positioning came from historical pattern recognition, backed by position sizing that assumed the possibility of being incorrect.

Crowds outside the New York Stock Exchange, 1929. The outer architecture transforms. The inner behavioral sequence persists across generations.

Backtesting, operating at its highest level, is a defined observation about recurring human behavior applied to a current market condition, measured against the historical record, and backed by position sizing that accounts for the possibility of being incorrect.

Chapter One

Backtesting is the practice of applying a defined set of trading rules to historical price data and observing what would have happened. A strategy with specific conditions (when to enter, when to exit, how to size the position) is applied to a historical data set as though the trader were living through it in real time, with no knowledge of what comes next. Every trade, every win, every loss, every drawdown gets recorded.
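A minimal sketch of that loop in Python may make the mechanics concrete. The rule here is a hypothetical one chosen only for illustration (enter when the close rises above the average of the previous three closes, exit when it falls back below), fills are assumed to happen at the signal bar's close, and the names `Trade` and `backtest` are assumptions, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    entry_index: int
    exit_index: int
    entry_price: float
    exit_price: float

    @property
    def pnl(self) -> float:
        # Profit or loss for one unit held long.
        return self.exit_price - self.entry_price


def backtest(prices, lookback=3):
    """Walk through prices bar by bar, seeing only the past at each step.

    Hypothetical rule: go long one unit when the close is above the average
    of the previous `lookback` closes; exit when it falls below that average.
    A position still open at the end of the data is not recorded.
    """
    trades = []
    entry_index = None
    for i in range(lookback, len(prices)):
        window = prices[i - lookback:i]  # strictly earlier bars only
        average = sum(window) / lookback
        price = prices[i]
        if entry_index is None and price > average:
            entry_index = i
        elif entry_index is not None and price < average:
            trades.append(Trade(entry_index, i, prices[entry_index], price))
            entry_index = None
    return trades


prices = [10, 10, 10, 11, 12, 13, 14, 12, 11, 10]
trades = backtest(prices)
for trade in trades:
    print(trade, "pnl:", trade.pnl)
```

The structural point is the slice `prices[i - lookback:i]`: at every bar the rule sees only what came before it, which is what "no knowledge of what comes next" means in code.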

The output is a measurement. How did this set of rules interact with this set of market conditions?

The backtest is where the theoretical becomes empirical. It shows whether the defined rules have demonstrated, under rigorous conditions, that they capture something genuine in the behavioral record.

A deeper question lives beneath the mechanics. Why would patterns from fifty years ago, or five hundred years ago, carry any information about markets today? The instruments have changed. The technology has changed. The speed of information flow has transformed beyond recognition.

A Cotton Office in New Orleans, Edgar Degas, 1873. Fear operated in these rooms the same way it operates in electronic markets today.

Flora’s Wagon of Fools, Hendrik Gerritsz Pot, c.1637. Euphoria overextends in the same sequence it followed during the tulip mania of 1637.

The outer architecture has transformed completely across generations. The inner architecture remains. Collective human behavior follows structural patterns across time because the consciousness driving it operates on the same principles it always has. A backtest works because the record beneath the price data is a record of human response, and human response at the collective level holds constant across centuries even as every surface condition transforms.

“History doesn’t repeat itself, but it often rhymes.”

Attributed to Mark Twain · The principle that makes backtesting possible

The Mirror

Narcissus, Caravaggio (c. 1597–1599). The myth of falling in love with your own reflection carries a direct parallel to overfitting. A strategy that only sees what it wants to see in the data has already begun to drown.

There is a concept in systematic trading called overfitting. Some call it curve fitting. It is the single most important thing to understand about how backtesting fails, because it is the failure that feels like success.

A strategy gets designed. It gets tested on historical data. The results disappoint. So a filter is added. The results improve. A parameter gets adjusted, and the results improve further. Another condition is added, a stop is tightened, and the adjustments continue until the equity curve is smooth, the drawdowns are small, and the returns are extraordinary.

What has happened is that the strategy has memorized the specific noise of the particular dataset instead of learning the behavioral pattern beneath it.

Signal vs. Noise — Recurring behavioral patterns versus random variation

Signal vs. Noise

Signal is a recurring behavioral pattern: the structural tendency that persists across time because it is rooted in how consciousness responds to uncertainty. Noise is random variation: the specific sequence of events that occurred once due to unique circumstances and will never repeat in the same configuration.

Overfitting — When a strategy memorizes noise instead of learning signal

Research has demonstrated that a strategy can show a Sharpe ratio of 1.2 in backtesting and drop to negative 0.2 on data it has never seen. A strategy that appeared to generate consistent risk-adjusted returns revealed itself as performing worse than random once it faced data that did not contain the specific noise it had been fitted to.

Three versions of this problem exist. The first is obvious curve fitting, where optimization across thousands of parameter combinations yields the one that performed best. The more knobs that get tuned, the more likely the system has been tuned to historical noise instead of a real signal.

The second is more subtle. Implicit fitting. It does not show up in code. It shows up in decisions. Choosing momentum over mean reversion because the trader already knows momentum did well in that period. Selecting certain instruments based on historical behavior already observed. Every time a design decision is influenced by information that would not have been available at the moment of the trade, future knowledge quietly leaks into the past. Researchers call this data snooping bias.

The third is selection bias. Testing ten variations of a strategy, discarding nine, and trading the one that looked exceptional. The problem is mathematical. If enough variations are tested, one will eventually look extraordinary purely by chance. Probability at work.
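The arithmetic of selection bias can be seen in a small simulation. This is a sketch under stated assumptions: every "strategy" below is pure noise (normally distributed per-period returns with a true mean of zero), the Sharpe ratio is computed per period without annualizing, and the function names and parameter values are illustrative.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio: mean return divided by its standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)


def best_of_random_strategies(n_strategies, n_periods, seed=0):
    """Simulate strategies with no edge and report the best and average Sharpe."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_strategies):
        # True expected return is zero: any apparent edge is luck.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        sharpes.append(sharpe(returns))
    return max(sharpes), statistics.mean(sharpes)


best, average = best_of_random_strategies(n_strategies=100, n_periods=250)
print(f"best Sharpe: {best:.3f}, average Sharpe: {average:.3f}")
```

None of these strategies has an edge, yet the best of the hundred stands well above the average. Keeping only that one and discarding the rest is the selection step the paragraph above describes.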

The honest question every backtest must answer: are these results showing a real behavioral pattern in the record, or the noise of a specific dataset shaped by a process that was quietly looking for confirmation?

The Expected Value Formula

Twenty Seven Centuries Apart

This question was answered with extraordinary clarity by two observers separated by twenty seven centuries. One studied the sky. The other studied debt.

Babylonian boundary stone (Kudurru), Musée du Louvre. Temple scribes in Babylon documented celestial movements night after night, year after year, for over seven hundred years. The longest continuous research program in recorded history.

Twenty seven hundred years ago, temple scribes in Babylon began recording the position of the moon every night on clay tablets. They documented eclipses, planetary movements, every visible celestial body, night after night, year after year, for over seven hundred years. From that record, they extracted a rule.

Eclipses recur every 223 lunar months, or about eighteen years. They applied this rule backward across centuries of observation and forward into dates that had not yet arrived. They predicted when eclipses should occur, then they watched the sky and compared prediction to reality. When the prediction was accurate, the rule stood. When it failed, the record overruled the theory. One astronomer was even arrested for an incorrect eclipse prediction that triggered an expensive ritual. Accuracy mattered. The record enforced accountability.

A rule extracted from historical data, applied systematically, measured against what occurred, refined when wrong, trusted when repeatedly right. A complete backtest, twenty seven centuries ago.

The Saros Cycle — Eclipse prediction across 223 lunar months

“He who studies the record with patience inherits the authority of the record itself.”

The principle embodied by Babylonian astronomical tradition

Ray Dalio, founder of Bridgewater Associates, speaking at the Web Summit, 2018. His study of 48 major debt crises across centuries, continents, and political systems produced a template that anticipated 2008.

Twenty seven centuries later, the same architecture produced one of the most important investment outcomes of our lifetime. Ray Dalio studied debt crises across centuries. Forty eight major crises, multiple continents, multiple currencies, multiple political systems. What he found was structural. Debt crises follow a recognizable behavioral sequence. Healthy growth becomes extrapolation. Extrapolation becomes leverage. Leverage becomes speculation. Speculation becomes a bubble, then tightening, then contraction, sometimes depression.

The outer details changed every time. The inner behavioral sequence held constant. Dalio turned that into a template, a framework derived from the historical record, so that at any moment Bridgewater could identify where it stood in the cycle. When 2008 arrived, Bridgewater was positioned accordingly. The template held.

The Death of Caesar, Vincenzo Camuccini. Power, deception, collective fear. The behavioral signature beneath historical events persists because the consciousness driving it operates on the same principles across millennia.

The Babylonians studied seven hundred years of celestial movement and extracted a rule that predicted eclipses. Dalio studied centuries of debt cycles and extracted a template that anticipated crisis. Both accumulated an honest historical record. Both observed recurring patterns. Both extracted rules. Both tested those rules against reality. Both held themselves accountable to the outcome.

The surface environment transforms across time. The behavioral architecture beneath it persists.

The Protocol

The Backtesting Protocol — Five components of honest validation

Honest backtesting is structured. It requires discipline at every stage, and it requires active resistance to the natural human desire for confirmation.

The first component is definition. Writing complete strategy rules before seeing any data. Entry, exit, position sizing parameters. Everything defined in advance. This addresses overfitting at the root. If rules are shaped after seeing outcomes, the time machine has already been used. Measurement must exist before observation.

The second is separation. Dividing data into in-sample and out-of-sample segments. Developing on one, validating on the other. If performance collapses out of sample, the strategy learned the noise of the training period rather than the underlying pattern.

The third is walking forward through time. Optimizing on past data, testing on the next unseen segment, recording results, shifting forward, and repeating. This simulates reality. Parameters come from the past. Testing happens in the unknown. One strong backtest can be noise. Repeated out of sample consistency is signal.
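The rolling structure of a walk-forward test can be sketched as follows. This is one possible layout, assuming fixed-size training and test windows that advance by one test window at a time; `walk_forward_windows`, `walk_forward`, `candidates`, and `score` are hypothetical names, and `score(params, data)` stands in for whatever performance measure the strategy uses.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield (train, test) index ranges that roll forward through time."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # shift forward by one test window


def walk_forward(data, candidates, score, train_size, test_size):
    """Pick the best candidate on each training window, then record how
    that choice performs on the following unseen window."""
    results = []
    for train, test in walk_forward_windows(len(data), train_size, test_size):
        train_data = data[train.start:train.stop]
        test_data = data[test.start:test.stop]
        best = max(candidates, key=lambda params: score(params, train_data))
        results.append((best, score(best, test_data)))
    return results


windows = list(walk_forward_windows(n_bars=10, train_size=4, test_size=2))
for train, test in windows:
    print("optimize on", list(train), "then test on", list(test))
```

Each test window begins exactly where its training window ends, so parameters always come from the past and are judged on bars they never saw. The list returned by `walk_forward` is the out-of-sample record whose consistency is being examined.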

The fourth is stress testing. Monte Carlo simulations. Randomizing trade order. Performing sensitivity analysis. Adjusting parameters slightly. Robust systems tolerate variation. Fragile ones collapse. Testing across markets reveals whether the edge generalizes, because behavioral edges tend to persist where data-specific artifacts do not.
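One simple form of the Monte Carlo step is to reshuffle the order of the recorded trades and look at the spread of drawdowns that results. The sketch below assumes trade results are independent enough that reordering them is meaningful, measures drawdown in P&L units on a cumulative curve starting at zero, and uses made-up trade values; the function names are illustrative.

```python
import random


def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for pnl in pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def shuffled_drawdowns(pnls, n_runs=1000, seed=0):
    """Max drawdown of the same trades under many random orderings, sorted."""
    rng = random.Random(seed)
    drawdowns = []
    for _ in range(n_runs):
        order = list(pnls)
        rng.shuffle(order)
        drawdowns.append(max_drawdown(order))
    return sorted(drawdowns)


# Hypothetical per-trade results, for illustration only.
trade_pnls = [2.0, -1.0, 3.0, -1.0, -1.0, 2.0, -1.0, 4.0, -1.0, 1.0]
historical = max_drawdown(trade_pnls)
drawdowns = shuffled_drawdowns(trade_pnls)
percentile_95 = drawdowns[(95 * len(drawdowns)) // 100 - 1]
print("historical:", historical, "95th percentile of shuffles:", percentile_95)
```

The historical sequence is only one ordering of the same trades. If the capital plan survives the historical drawdown but not the upper tail of the shuffled ones, the plan was relying on the order in which the trades happened to arrive.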

The fifth is measurement. Expected value per trade, after costs, after slippage, after realistic execution. Maximum drawdown, and whether the capital protocol survives it.
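Expected value per trade is commonly written as win rate times average win plus loss rate times average loss. A minimal sketch, assuming a flat per-trade cost that stands in for commissions and slippage and counting breakeven trades as losses; the function name and the sample trades are illustrative.

```python
def expected_value(pnls, cost_per_trade=0.0):
    """Expected value per trade after subtracting a flat cost from each trade.

    Computed as win_rate * average_win + loss_rate * average_loss, where
    average_loss is negative. Trades that net to zero count as losses.
    """
    net = [pnl - cost_per_trade for pnl in pnls]
    wins = [pnl for pnl in net if pnl > 0]
    losses = [pnl for pnl in net if pnl <= 0]
    win_rate = len(wins) / len(net)
    average_win = sum(wins) / len(wins) if wins else 0.0
    average_loss = sum(losses) / len(losses) if losses else 0.0
    return win_rate * average_win + (1 - win_rate) * average_loss


# Hypothetical trades: two winners of 3.0 and two losers of 1.0.
trade_pnls = [3.0, -1.0, 3.0, -1.0]
gross = expected_value(trade_pnls)
net = expected_value(trade_pnls, cost_per_trade=0.5)
print("gross:", gross, "after costs:", net)
```

In this made-up sample a cost of 0.5 per trade cuts the expected value in half, from 1.0 to 0.5. An edge that exists only before costs is not an edge.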

Sample size carries weight in this measurement. Thirty trades reveal very little. Three hundred begin to establish structure. The discipline of honest backtesting lives in accepting what the record shows, even when it contradicts what the trader hoped to find.
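One way to see why sample size matters is the t-statistic of the average trade, which divides the mean result by its standard error. The sketch below assumes trades are independent and uses a made-up pattern of results repeated to two different lengths, so the per-trade edge is identical in both samples.

```python
import math
import statistics


def t_statistic(pnls):
    """Mean trade result divided by its standard error (stdev / sqrt(n))."""
    standard_error = statistics.stdev(pnls) / math.sqrt(len(pnls))
    return statistics.mean(pnls) / standard_error


# Same hypothetical per-trade pattern, observed over 30 and over 300 trades.
pattern = [3.0, -1.0, -1.0]
t_30 = t_statistic(pattern * 10)
t_300 = t_statistic(pattern * 100)
print(f"30 trades: t = {t_30:.2f}, 300 trades: t = {t_300:.2f}")
```

With thirty trades the statistic stays below 1, which is hard to tell apart from chance. With three hundred trades of the same quality it rises above 3. The edge did not change. Only the amount of evidence did.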

Geometric Growth — The relationship between expected return and variance

The Complete Checklist

Define before testing. Separate data. Walk forward. Stress results. Measure expected value, drawdown, geometric growth, and statistical significance. This is honest testing. Everything else is confirmation dressed as research.
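Geometric growth is the item on that list most often skipped, so a small worked example may help. The sketch compares the exact compound growth rate with a common approximation, arithmetic mean minus half the variance. The approximation is an assumption that works best for small returns, and the return sequence is deliberately extreme and hypothetical.

```python
import math
import statistics


def geometric_growth(returns):
    """Exact per-period compound growth rate of a sequence of returns."""
    log_growth = sum(math.log1p(r) for r in returns) / len(returns)
    return math.expm1(log_growth)


def approx_geometric_growth(returns):
    """Approximation: arithmetic mean minus half the (population) variance."""
    return statistics.mean(returns) - statistics.pvariance(returns) / 2


# Hypothetical sequence: gain 50%, then lose 50%. The average return is zero.
returns = [0.5, -0.5]
exact = geometric_growth(returns)
approx = approx_geometric_growth(returns)
print(f"arithmetic mean: {statistics.mean(returns):.3f}")
print(f"exact geometric growth: {exact:.3f}, approximation: {approx:.3f}")
```

The average return is zero, yet capital compounds downward because 1.5 times 0.5 is 0.75. Variance is a cost paid in growth, which is why expected return and drawdown have to be measured together.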

The outer events are always unique. The behavioral signatures beneath them are always ancient. A backtest does not predict the future. It asks whether the rules being carried into the unknown have been tested with enough honesty, enough rigor, and enough humility to deserve the capital being placed behind them.

Studying the record honestly is what separates being positioned from being surprised.

This is the truth as I have found it. Your path may reveal more.

Think in odds. Act with discipline.

— Ashim

These lessons are part of my ongoing public research on System R AI.