Skip to content

Data Validation

Catch data quality issues before they corrupt your backtest results.

DataValidator

Validate a pandas DataFrame:

import pandas as pd
from replaybt import DataValidator

df = pd.read_csv("ETH_1m.csv", parse_dates=["timestamp"])
validator = DataValidator(df, symbol="ETH", timeframe="1m")

issues = validator.validate()
print(validator.report())

Checks Performed

Check Severity What it catches
Missing columns ERROR OHLCV columns not present
Duplicate timestamps ERROR Same timestamp appears twice
Time gaps WARNING Missing bars in the sequence
OHLC violations WARNING High < Low, Open/Close outside High/Low range
Negative values ERROR Negative prices or volume
Zero values WARNING Zero prices (volume can be zero)

DataIssue

Each issue returned has:

issue.severity   # "ERROR", "WARNING", "INFO"
issue.check      # issue type identifier
issue.message    # human-readable description
issue.row_index  # row where issue occurs (if applicable)
issue.timestamp  # timestamp of the row (if applicable)

ValidatedProvider

Wrap any provider to validate bars on the fly:

from replaybt import ValidatedProvider, CSVProvider

data = ValidatedProvider(
    inner=CSVProvider("ETH_1m.csv", symbol_name="ETH"),
    stop_on_error=False,  # True = raise on first ERROR
)

# Issues are logged as bars are yielded
engine = BacktestEngine(strategy=my_strategy, data=data, config=config)
results = engine.run()

Convenience Functions

from replaybt import validate_dataframe, validate_provider

# Validate a DataFrame directly
issues = validate_dataframe(df, symbol="ETH", timeframe="1m")

# Validate a provider
issues = validate_provider(CSVProvider("ETH_1m.csv"), symbol="ETH")