Documentation

Methodology

Data sources

Primary source (pre-March 2022): PiTrading, which provides historical consolidated-tape data derived from the CTA (NYSE-listed) and UTP (Nasdaq-listed) feeds. These are the same feeds that underlie CRSP and TAQ. The data arrives as one-minute OHLCV bars, adjusted for stock splits and dividends.

Secondary source (post-March 2022): IEX Exchange HIST, which provides full-depth pcap files from the IEX exchange. I parse these into one-minute OHLCV bars using the same aggregation logic. IEX represents approximately 2–3% of consolidated volume, so these bars capture IEX-only activity, not the full tape; volume in particular is not comparable across the March 2022 boundary.
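As an illustration of the trade-to-bar aggregation described above, here is a minimal sketch using pandas. It is not the production parser: the function name trades_to_minute_bars, the column names price and size, and the assumption that the pcap files have already been decoded into a trade table with a DatetimeIndex are all hypothetical.

```python
import pandas as pd


def trades_to_minute_bars(trades: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a trade table into one-minute OHLCV bars.

    Assumes (hypothetically) a DatetimeIndex and `price` / `size` columns.
    Each bar is labeled with the start of its minute.
    """
    grouped = trades.resample("1min", label="left", closed="left")
    bars = grouped["price"].ohlc()           # columns: open, high, low, close
    bars["volume"] = grouped["size"].sum()   # shares traded in the minute
    # Minutes with no trades come back with NaN prices; drop them here
    # rather than emitting empty bars.
    return bars.dropna(subset=["open"])
```

Minutes without trades are dropped rather than emitted as empty bars, which matches the idea that a missing bar means no trade occurred.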

The splice boundary (March 2022) is documented, and the source column in every file records which source each bar came from.

Nine-step cleaning pipeline

Version 2 (Clean) applies the following pipeline. Each step is applied in order. A bar that fails any step is removed.

Step	Rule	Rationale
1	Remove bars outside 09:30–16:00 ET	Pre/post-market bars have different microstructure properties
2	Remove bars with non-positive prices (Open, High, Low, or Close ≤ 0)	Clearly erroneous
3	Remove bars where High < Low	OHLC violation — impossible in valid trade data
4	Remove bars where Open or Close falls outside [Low, High]	OHLC violation — Open and Close must be within the bar's range
5	Remove duplicate timestamps (keep first occurrence)	Source data occasionally contains exact duplicates
6	Remove bars with zero volume	No trade occurred — price is stale or indicative
7	Remove bars where |log return| > 25% (relative to previous close)	Extreme outlier filter: returns exceeding 25% in one minute are almost certainly errors
8	Brownlees-Gallo filter: remove bars where |close − median| > 3 × MAD, computed over a 50-bar centered window	Adaptive outlier detection that accounts for local price level
9	Splice-boundary adjustment: verify continuity at the PiTrading/IEX transition (March 2022)	Ensures no artificial price jump at the source-change boundary
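Steps 1 through 7 of the table above can be sketched with pandas as follows. This is an illustrative sketch, not the pipeline's actual code. It assumes a single ticker, a chronologically sorted DatetimeIndex already in Eastern Time, bars stamped by their start time (so the last regular bar is 15:59), and lowercase column names open, high, low, close, volume. The function name apply_rule_filters is hypothetical.

```python
import numpy as np
import pandas as pd

PRICE_COLS = ["open", "high", "low", "close"]


def apply_rule_filters(bars: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 1-7 in order; a bar failing any step is dropped."""
    df = bars
    # Step 1: regular session only (assumes start-stamped bars in ET).
    df = df.between_time("09:30", "15:59")
    # Step 2: every price must be strictly positive.
    df = df[(df[PRICE_COLS] > 0).all(axis=1)]
    # Step 3: High must not be below Low.
    df = df[df["high"] >= df["low"]]
    # Step 4: Open and Close must lie inside [Low, High].
    in_range = (df["open"].between(df["low"], df["high"])
                & df["close"].between(df["low"], df["high"]))
    df = df[in_range]
    # Step 5: drop duplicate timestamps, keeping the first occurrence.
    df = df[~df.index.duplicated(keep="first")]
    # Step 6: drop zero-volume bars.
    df = df[df["volume"] > 0]
    # Step 7: drop bars whose absolute log return versus the previous
    # surviving close exceeds 0.25. The first bar has no previous close
    # (NaN return) and is kept.
    log_ret = np.log(df["close"] / df["close"].shift(1))
    df = df[~(log_ret.abs() > 0.25)]
    return df
```

Note that this single-pass version of step 7 measures each return against the previous bar even when that bar is itself an outlier, so the bar immediately after a spike can also be dropped. Whether the real pipeline recomputes returns after removal is not stated here.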

Bars removed: 388,559 (0.025% of total). The cleaning is conservative by design.
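The median/MAD rule in step 8 can be sketched as below. This is one possible reading of the rule as written in the table (a 50-bar centered window, threshold 3 × MAD), not the pipeline's actual implementation. The function name mad_filter, the min_periods=1 handling of window edges, and the lowercase close column are assumptions.

```python
import numpy as np
import pandas as pd


def _mad(window: np.ndarray) -> float:
    """Median absolute deviation of one window."""
    return float(np.median(np.abs(window - np.median(window))))


def mad_filter(bars: pd.DataFrame, window: int = 50, k: float = 3.0) -> pd.DataFrame:
    """Drop bars whose close is more than k * MAD from the local median.

    Median and MAD are both computed over a centered rolling window.
    Near the start and end of the series the window is truncated
    (min_periods=1), which is an assumption of this sketch.
    """
    close = bars["close"]
    roll = close.rolling(window, center=True, min_periods=1)
    local_median = roll.median()
    local_mad = roll.apply(_mad, raw=True)
    is_outlier = (close - local_median).abs() > k * local_mad
    return bars[~is_outlier]
```

Two caveats on this sketch: rolling.apply with a Python function is slow on long series, and when prices are flat the MAD is zero, so any nonzero deviation is flagged. The table does not say whether the pipeline adds a minimum-tick tolerance for that case.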

Gap-filling (Version 3: Filled)

The regular trading day has 390 one-minute bars (09:30–15:59 ET). When a bar is missing in the Clean version, Version 3 fills it using Last Observation Carried Forward (LOCF): the previous bar's Close price is propagated to Open, High, Low, and Close, and Volume is set to zero. Every filled bar is flagged with is_filled = True.
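The LOCF fill described above can be sketched for a single ticker and a single trading day as follows. This is a hedged illustration, not the code that produced Version 3. The function name fill_locf, the lowercase column names, and the tz-naive Eastern Time index are assumptions.

```python
import pandas as pd

PRICE_COLS = ["open", "high", "low", "close"]


def fill_locf(clean: pd.DataFrame, session_date: str) -> pd.DataFrame:
    """Place one day's Clean bars on the 390-bar grid and fill gaps by LOCF."""
    # 09:30 through 15:59 inclusive gives 390 one-minute stamps.
    grid = pd.date_range(f"{session_date} 09:30", f"{session_date} 15:59", freq="1min")
    filled = clean.reindex(grid)
    # A bar is "filled" if it was absent from the Clean version.
    is_filled = filled["close"].isna()
    # Previous observed Close, carried forward across gaps.
    prev_close = filled["close"].ffill()
    for col in PRICE_COLS:
        filled[col] = filled[col].where(~is_filled, prev_close)
    filled["volume"] = filled["volume"].where(~is_filled, 0)
    filled["is_filled"] = is_filled
    return filled
```

If the first bars of the day are missing there is no earlier observation within the day to carry forward, so in this sketch their prices remain NaN. How Version 3 treats such leading gaps (for example, by using the prior day's Close) is not specified in this section.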

LOCF introduces known biases (stale prices suppress volatility, compress spreads, and dampen autocorrelation). These biases are documented and quantified in the accompanying paper. Researchers who need a regular grid should use Version 3 with awareness of these effects.

Coverage

1,391 tickers — U.S. equities and ETFs
Majority begin December 30, 2002 — 45 tickers have history back to January 1991
Updated weekly — new bars added every Sunday via automated pipeline
Trading hours only — 09:30–16:00 Eastern Time
Split/dividend adjusted — all prices adjusted for corporate actions
