Speed Is a Feature: 10x Faster Data Loading with Parquet
High-performance data caching with Apache Parquet, delivering 10x faster data loading than CSV.
The CSV Bottleneck
V7's backtest processes 7.5 years of M15 data across multiple instruments. In CSV format, loading this data takes 45 seconds. When you are iterating on strategy parameters and running backtests dozens of times per day, 45 seconds per load adds up to hours of wasted developer time.
S38 converts all historical data to Apache Parquet, a compressed columnar format. The same dataset loads in 4.5 seconds. That 10x speedup comes from columnar storage (only the columns you need are read), efficient compression (Snappy by default), and memory-mapped reading.
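The conversion itself is short in pandas. Here is a minimal sketch, assuming pandas with pyarrow installed; the file names and column names are illustrative, not S38's actual schema:

```python
import pandas as pd

# One-time conversion: read the CSV source and write a Snappy-compressed
# Parquet file (Snappy is pyarrow's default codec).
df = pd.read_csv("eurusd_m15.csv", parse_dates=["timestamp"])
df.to_parquet("eurusd_m15.parquet", compression="snappy")

# Subsequent loads can request only the columns a given backtest needs.
prices = pd.read_parquet("eurusd_m15.parquet", columns=["timestamp", "close"])
```

Selecting columns at read time is where the columnar layout pays off: the reader skips entire column chunks on disk instead of parsing every field of every row, as a CSV parser must.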
Implementation Details
The Parquet cache is managed by a simple write-once-read-many pattern. When data is first requested, S38 checks for a Parquet cache file. If it exists and the checksum matches the source data, it loads from Parquet. If not, it loads from the source, writes a Parquet cache, and returns the data.
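A minimal sketch of that pattern follows. The function names (`load_cached`, `source_checksum`) are hypothetical, not S38's actual API, and `source_checksum` is sketched in the next paragraph's example:

```python
from pathlib import Path

import pandas as pd


def load_cached(source: Path, cache: Path) -> pd.DataFrame:
    """Load from the Parquet cache if it is valid, else rebuild it."""
    meta = cache.with_suffix(".md5")  # checksum recorded at cache-write time
    if cache.exists() and meta.exists() and meta.read_text() == source_checksum(source):
        return pd.read_parquet(cache)  # fast path: cache hit

    # Slow path: load from source, write the cache once, record the checksum.
    df = pd.read_csv(source, parse_dates=["timestamp"])
    df.to_parquet(cache)
    meta.write_text(source_checksum(source))
    return df
```

Keeping the checksum in a small sidecar file means a cache-validity check costs two tiny file reads before any heavy I/O happens.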
Cache invalidation uses MD5 checksums of the source data. When new data is added (daily updates for live trading), the checksum changes and the cache is rebuilt. The rebuild takes 8 seconds, amortized over hundreds of subsequent loads.
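A sketch of the checksum itself, assuming the hash is taken over the raw bytes of the source file (the write-up does not specify exactly what S38 hashes); chunked reading keeps memory use bounded on large histories:

```python
import hashlib
from pathlib import Path


def source_checksum(source: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of the source file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.md5()
    with source.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

MD5 is a reasonable choice here because the goal is change detection on trusted local files, not cryptographic integrity.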
Why Performance Infrastructure Matters
A 10x speedup in data loading does not improve trading performance. It improves development velocity. Faster backtests mean more experiments per day, and more experiments mean faster convergence on optimal parameters. S38 is not a trading module; it is a developer productivity module.

The decision to include it in the S-series acknowledges that development infrastructure is part of the system. A slow development loop leads to undertested changes and shortcuts; a fast loop encourages thorough testing and validation. The 10x speedup paid for itself within the first week of development through increased experiment throughput.