Data Validation
Catch data quality issues before they corrupt your backtest results.
DataValidator
Validate a pandas DataFrame:
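import pandas as pd
from replaybt import DataValidator

# Load 1-minute bars and parse the timestamp column as datetimes
df = pd.read_csv("ETH_1m.csv", parse_dates=["timestamp"])

validator = DataValidator(df, symbol="ETH", timeframe="1m")
issues = validator.validate()  # the DataIssue objects found
print(validator.report())      # print a summary of the findings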
Checks Performed
| Check | Severity | What it catches |
|---|---|---|
| Missing columns | ERROR | OHLCV columns not present |
| Duplicate timestamps | ERROR | Same timestamp appears twice |
| Time gaps | WARNING | Missing bars in the sequence |
| OHLC violations | WARNING | High < Low, Open/Close outside High/Low range |
| Negative values | ERROR | Negative prices or volume |
| Zero values | WARNING | Zero prices (volume can be zero) |
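To make the table concrete, here is a minimal sketch of how a few of these checks could be expressed in plain pandas. It is not replaybt's implementation: the function name `basic_checks`, the result keys, and the lowercase column names (`timestamp`, `open`, `high`, `low`, `close`, `volume`) are assumptions made for illustration.

```python
import pandas as pd

# Assumed column names; adjust to match your data.
REQUIRED = ["timestamp", "open", "high", "low", "close", "volume"]


def basic_checks(df: pd.DataFrame, bar: pd.Timedelta = pd.Timedelta(minutes=1)) -> dict:
    """Count rows that fail a few of the checks listed in the table."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return {"missing_columns": missing}

    prices = df[["open", "high", "low", "close"]]
    # A bar is inconsistent if high < low, or open/close fall outside [low, high].
    ohlc_bad = (
        (df["high"] < df["low"])
        | (prices[["open", "close"]].max(axis=1) > df["high"])
        | (prices[["open", "close"]].min(axis=1) < df["low"])
    )
    # A gap is any step between consecutive timestamps longer than one bar.
    deltas = df["timestamp"].sort_values().diff().dropna()
    return {
        "duplicate_timestamps": int(df["timestamp"].duplicated().sum()),
        "time_gaps": int((deltas > bar).sum()),
        "ohlc_violations": int(ohlc_bad.sum()),
        "negative_values": int((df[REQUIRED[1:]] < 0).any(axis=1).sum()),
        "zero_prices": int((prices == 0).any(axis=1).sum()),  # zero volume is allowed
    }


# Small hand-made sample with one of each problem except zero prices.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:01",
        "2024-01-01 00:01", "2024-01-01 00:04",  # duplicate, then a gap
    ]),
    "open":   [100.0, 101.0, 101.0, 103.0],  # last open sits above its high
    "high":   [101.0, 102.0, 102.0, 102.5],
    "low":    [99.0, 100.0, 100.0, 102.0],
    "close":  [100.5, 101.5, 101.5, 102.2],
    "volume": [10.0, 0.0, -5.0, 12.0],       # one negative volume
})
result = basic_checks(sample)
print(result)
```

Counting failing rows keeps the sketch short; a real validator would also record which rows failed, which is what the per-issue fields described in the next section are for.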
DataIssue
Each issue returned has:
issue.severity # "ERROR", "WARNING", "INFO"
issue.check # issue type identifier
issue.message # human-readable description
issue.row_index # row where issue occurs (if applicable)
issue.timestamp # timestamp of the row (if applicable)
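A common next step is to group issues by severity and decide whether to abort. The sketch below uses a hypothetical `Issue` dataclass as a stand-in that mirrors the fields above, so it runs without replaybt installed; the `check` identifiers shown are invented for illustration and may not match the library's actual values.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Issue:
    """Hypothetical stand-in with the same fields as DataIssue."""
    severity: str
    check: str
    message: str
    row_index: Optional[int] = None
    timestamp: Optional[Any] = None


def summarize(issues):
    """Return (count per severity, list of ERROR-level issues)."""
    counts = Counter(i.severity for i in issues)
    errors = [i for i in issues if i.severity == "ERROR"]
    return counts, errors


# Illustrative issues; the check names here are assumptions.
issues = [
    Issue("ERROR", "duplicate_timestamp", "timestamp repeated", row_index=2),
    Issue("WARNING", "time_gap", "2 bars missing", row_index=3),
    Issue("WARNING", "zero_value", "zero price", row_index=7),
]
counts, errors = summarize(issues)

for issue in errors:
    print(f"[{issue.severity}] {issue.check} at row {issue.row_index}: {issue.message}")
```

With real results, the same `summarize` function should work on the objects returned by `validator.validate()`, provided they expose a `severity` attribute as listed above.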
ValidatedProvider
Wrap any provider to validate bars on the fly:
from replaybt import BacktestEngine, CSVProvider, ValidatedProvider

data = ValidatedProvider(
    inner=CSVProvider("ETH_1m.csv", symbol_name="ETH"),
    stop_on_error=False,  # True = raise on first ERROR
)

# Issues are logged as bars are yielded.
# my_strategy and config are assumed to be defined elsewhere.
engine = BacktestEngine(strategy=my_strategy, data=data, config=config)
results = engine.run()
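The wrap-and-validate pattern can be illustrated with a plain generator. This is a sketch of the general idea only, not how `ValidatedProvider` is implemented: the names `validated` and `check_bar`, the dict-based bar format, and the single negative-value check are all assumptions.

```python
import logging
from typing import Dict, Iterable, Iterator, List

log = logging.getLogger("validation_sketch")

Bar = Dict[str, float]


def check_bar(bar: Bar) -> List[str]:
    """Return ERROR-level problems for one bar (empty list if clean)."""
    problems = []
    if any(bar[k] < 0 for k in ("open", "high", "low", "close", "volume")):
        problems.append("negative value")
    return problems


def validated(bars: Iterable[Bar], stop_on_error: bool = False) -> Iterator[Bar]:
    """Yield bars unchanged, logging each problem or raising on the first one."""
    for i, bar in enumerate(bars):
        for problem in check_bar(bar):
            if stop_on_error:
                raise ValueError(f"bar {i}: {problem}")
            log.error("bar %d: %s", i, problem)
        yield bar


bars = [
    {"open": 100.0, "high": 101.0, "low": 99.0, "close": 100.5, "volume": 10.0},
    {"open": 100.5, "high": 101.0, "low": 100.0, "close": 100.8, "volume": -1.0},
]

# Lenient mode: every bar still reaches the consumer; the bad one is logged.
passed = list(validated(bars))
```

Because validation happens lazily as each bar is pulled, the consumer sees problems in the order they occur, and strict mode stops the run before a bad bar reaches the strategy.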