Documentation

Everything you need to know about the data: sources, cleaning, variables, and quality.

Methodology

Data sources

Primary source (pre-March 2022): PiTrading, which provides historical consolidated-tape data derived from the CTA (NYSE-listed) and UTP (Nasdaq-listed) feeds. These are the same feeds that underlie CRSP and TAQ. The data arrives as one-minute OHLCV bars, adjusted for stock splits and dividends.

Secondary source (post-March 2022): IEX Exchange HIST, which provides full-depth pcap files from the IEX exchange. I parse these into one-minute OHLCV bars using the same aggregation logic. These bars capture IEX-only activity, which is approximately 2–3% of consolidated volume, not the full tape.
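
As a rough illustration of the aggregation step, trade prints can be rolled up into one-minute OHLCV bars as sketched below. This is a minimal sketch, not the parser actually used: it assumes the pcap files have already been decoded into a pandas DataFrame, and the `price` and `shares` column names are hypothetical.

```python
import pandas as pd


def trades_to_minute_bars(trades: pd.DataFrame) -> pd.DataFrame:
    """Aggregate trade prints into one-minute OHLCV bars.

    Assumes `trades` has a DatetimeIndex and `price` / `shares` columns
    (hypothetical names, not the dataset's actual schema).
    """
    trades = trades.sort_index()
    # Label each bar by the start of its minute: [09:30:00, 09:31:00) -> 09:30
    grouped = trades.resample("1min", label="left", closed="left")
    bars = grouped["price"].ohlc()          # columns: open, high, low, close
    bars["volume"] = grouped["shares"].sum()
    # Minutes without trades have NaN prices; drop them rather than invent a bar.
    bars = bars.dropna(subset=["open"])
    bars.columns = ["Open", "High", "Low", "Close", "Volume"]
    return bars
```

Minutes with no trades are dropped here rather than filled, so the output has gaps wherever the exchange printed nothing.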

The splice boundary (March 2022) is documented, and both sources are flagged in the source column of every file.

Nine-step cleaning pipeline

Version 2 (Clean) applies the following steps in order. A bar that fails any step is removed.

| Step | Rule | Rationale |
|---|---|---|
| 1 | Remove bars outside 09:30–16:00 ET | Pre/post-market bars have different microstructure properties |
| 2 | Remove bars with non-positive prices (Open, High, Low, or Close ≤ 0) | Clearly erroneous |
| 3 | Remove bars where High < Low | OHLC violation: impossible in valid trade data |
| 4 | Remove bars where Open or Close falls outside [Low, High] | OHLC violation: Open and Close must be within the bar's range |
| 5 | Remove duplicate timestamps (keep first occurrence) | Source data occasionally contains exact duplicates |
| 6 | Remove bars with zero volume | No trade occurred, so the price is stale or indicative |
| 7 | Remove bars where abs(log return) > 25% (relative to previous close) | Extreme outlier filter: returns exceeding 25% in one minute are almost certainly errors |
| 8 | Brownlees-Gallo filter: remove bars where abs(close − median) > 3 × MAD, computed over a 50-bar centered window | Adaptive outlier detection that accounts for local price level |
| 9 | Splice-boundary adjustment: verify continuity at the PiTrading/IEX transition (March 2022) | Ensures no artificial price jump at the source-change boundary |
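
Steps 1 through 7 of the pipeline above can be sketched in pandas roughly as follows. This is an illustrative sketch rather than the code that produced the dataset. It assumes a sorted DatetimeIndex in ET whose timestamps mark the start of each minute, and it assumes the step 7 return is computed within each trading day so that overnight gaps are not filtered; the source does not specify either point.

```python
import numpy as np
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]


def basic_clean(bars: pd.DataFrame) -> pd.DataFrame:
    """Sketch of cleaning steps 1-7 on one-minute OHLCV bars."""
    # Step 1: keep the regular session; the last bar starts at 15:59.
    out = bars.between_time("09:30", "15:59")
    # Step 2: every price must be strictly positive.
    out = out[(out[PRICE_COLS] > 0).all(axis=1).to_numpy()]
    # Step 3: High must not be below Low.
    out = out[(out["High"] >= out["Low"]).to_numpy()]
    # Step 4: Open and Close must lie inside [Low, High].
    oc = out[["Open", "Close"]]
    inside = (oc.min(axis=1) >= out["Low"]) & (oc.max(axis=1) <= out["High"])
    out = out[inside.to_numpy()]
    # Step 5: drop duplicate timestamps, keeping the first occurrence.
    out = out[~out.index.duplicated(keep="first")]
    # Step 6: drop bars in which no trade occurred.
    out = out[out["Volume"] > 0]
    # Step 7: drop bars whose absolute log return versus the previous
    # surviving close exceeds 0.25 (assumption: computed per trading day).
    log_ret = np.log(out["Close"]).groupby(out.index.normalize()).diff()
    out = out[~(log_ret.abs() > 0.25)]
    return out
```

Step 7 is applied in a single pass here, so the bar immediately after a removed spike is compared with the spike's close and may be removed as well. Whether the actual pipeline iterates or recomputes returns after removal is not stated in this document.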

Bars removed: 388,559 (0.025% of total). The cleaning is conservative by design.
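
The median/MAD rule in step 8 can be illustrated with a rolling window, as in the sketch below. The function name and the `gamma` argument are my own additions for illustration: `gamma` mirrors the granularity term in the original Brownlees-Gallo filter and defaults to zero so that the sketch matches the rule as stated in the table. How the actual pipeline treats window edges is an assumption here (shorter windows are used).

```python
import numpy as np
import pandas as pd


def mad_outlier_mask(close: pd.Series, window: int = 50,
                     k: float = 3.0, gamma: float = 0.0) -> pd.Series:
    """Return True where |close - rolling median| > k * rolling MAD + gamma.

    The window is centered on each bar; near the edges it simply
    contains fewer bars (min_periods=1).
    """
    roll = close.rolling(window, center=True, min_periods=1)
    med = roll.median()
    # MAD: median absolute deviation from the window's own median.
    mad = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    return (close - med).abs() > k * mad + gamma
```

A usage example would be `clean = bars[~mad_outlier_mask(bars["Close"])]`. Note that with `gamma = 0` a window of identical prices has a MAD of zero, so any deviation at all is flagged; a small positive `gamma` avoids that on tick-constrained prices.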

Gap-filling (Version 3: Filled)

The regular trading day has 390 one-minute bars (09:30–15:59 ET). When a bar is missing in the Clean version, Version 3 fills it using Last Observation Carried Forward (LOCF): the previous bar's Close price is propagated to Open, High, Low, and Close, and Volume is set to zero. Every filled bar is flagged with is_filled = True.
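
The fill rule can be sketched for a single trading day as follows. This is a minimal sketch under stated assumptions: the index is a DatetimeIndex of bar start times, the columns are named Open, High, Low, Close, and Volume, and bars missing before the first observed bar of the day are left empty because the source does not say how leading gaps are handled.

```python
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]


def fill_day_locf(day: pd.DataFrame) -> pd.DataFrame:
    """Reindex one trading day onto the 390-bar grid and fill gaps by LOCF."""
    date = day.index[0].normalize()
    start = date + pd.Timedelta(hours=9, minutes=30)
    grid = pd.date_range(start, periods=390, freq="1min")  # 09:30 ... 15:59
    out = day.reindex(grid)
    # Last observed Close, carried forward across missing bars.
    prev_close = out["Close"].ffill()
    # A bar counts as filled only if it was missing and a prior Close exists.
    out["is_filled"] = out["Close"].isna() & prev_close.notna()
    for col in PRICE_COLS:
        out[col] = out[col].fillna(prev_close)
    out.loc[out["is_filled"], "Volume"] = 0
    return out
```

Applied per day (for example with `groupby` on the calendar date), this yields a regular grid in which every synthetic bar is flat at the previous Close and carries zero volume.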

LOCF introduces known biases (stale prices suppress volatility, compress spreads, and dampen autocorrelation). These biases are documented and quantified in the accompanying paper. Researchers who need a regular grid should use Version 3 with awareness of these effects.
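
One quick way to see the volatility effect on a given file is to compare the dispersion of one-minute log returns on the full grid with that of returns between observed bars only. The sketch below is a simple diagnostic of my own, not the quantification used in the accompanying paper; it assumes Close and is_filled columns as described above.

```python
import numpy as np
import pandas as pd


def locf_volatility_check(filled: pd.DataFrame) -> dict:
    """Compare return dispersion with and without LOCF-filled bars."""
    # Returns on the regular grid: every filled bar contributes a zero return.
    grid_ret = np.log(filled["Close"]).diff().dropna()
    # Returns between consecutive observed bars (irregularly spaced in time).
    observed = filled.loc[~filled["is_filled"], "Close"]
    obs_ret = np.log(observed).diff().dropna()
    return {
        "filled_share": float(filled["is_filled"].mean()),
        "grid_vol": float(grid_ret.std()),
        "observed_vol": float(obs_ret.std()),
    }
```

The two figures are not measured on the same clock, since observed-bar returns can span several minutes, so the comparison only indicates how strongly zero returns from filled bars pull the per-bar estimate down.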

Coverage