Everything you need to know about the data: sources, cleaning, variables, and quality.
Raw, Clean, and Filled — what each version contains and who should use it.
Every column, data type, and description. All 27 computed academic variables.
REST endpoints, authentication, rate limits, response formats, and examples.
Python, R, and Stata examples. Jupyter notebooks for common research tasks.
Every data update, versioned and documented. What changed and when.
Living errata log. Every discovered issue documented with date and resolution.
Primary source (pre-March 2022): PiTrading, which provides historical consolidated-tape data derived from the CTA (NYSE-listed) and UTP (Nasdaq-listed) feeds. These are the same feeds that underlie CRSP and TAQ. The data arrives as one-minute OHLCV bars, adjusted for stock splits and dividends.
Secondary source (post-March 2022): IEX Exchange HIST, which provides full-depth pcap files from the IEX exchange. I parse these into one-minute OHLCV bars using the same aggregation logic. IEX represents approximately 2–3% of consolidated volume, so these bars capture IEX-only activity, not the full tape.
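The aggregation from trade-level records into one-minute bars can be sketched as below. This is a minimal illustration in pandas, not the actual parser: the column names `price` and `shares`, the function name, and the choice to label each bar by the start of its minute are all assumptions, and decoding the pcap files is out of scope here.

```python
import pandas as pd

def trades_to_minute_bars(trades: pd.DataFrame) -> pd.DataFrame:
    """Aggregate trade prints into one-minute OHLCV bars.

    Assumes `trades` has a DatetimeIndex and two hypothetical columns:
    `price` (trade price) and `shares` (trade size).
    """
    # Each bar covers [minute, minute + 1) and is labeled by its start time.
    grouped = trades.resample("1min", label="left", closed="left")
    bars = grouped["price"].ohlc()            # columns: open, high, low, close
    bars["volume"] = grouped["shares"].sum()
    # Minutes with no trades have NaN prices; emit no bar for them.
    return bars.dropna(subset=["open"])
```

Minutes without any IEX trade produce no bar in this sketch, which is one reason IEX-only bars tend to be sparser than consolidated-tape bars.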
The splice boundary (March 2022) is documented, and both sources are flagged in the source column of every file.
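Version 2 (Clean) applies the following pipeline, one step at a time in the order shown. A bar that fails any of the filters in steps 1–8 is removed; step 9 is a continuity check at the source boundary rather than a per-bar filter.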
| Step | Rule | Rationale |
|---|---|---|
| 1 | Remove bars outside 09:30–16:00 ET | Pre/post-market bars have different microstructure properties |
| 2 | Remove bars with non-positive prices (Open, High, Low, or Close ≤ 0) | Clearly erroneous |
| 3 | Remove bars where High < Low | OHLC violation — impossible in valid trade data |
| 4 | Remove bars where Open or Close falls outside [Low, High] | OHLC violation — Open and Close must be within the bar's range |
| 5 | Remove duplicate timestamps (keep first occurrence) | Source data occasionally contains exact duplicates |
| 6 | Remove bars with zero volume | No trade occurred — price is stale or indicative |
| 7 | Remove bars where the absolute log return (relative to the previous close) exceeds 25% | Extreme outlier filter: returns exceeding 25% in one minute are almost certainly errors |
| 8 | Brownlees-Gallo filter: remove bars where the absolute deviation of Close from the window median exceeds 3 × MAD (median absolute deviation), with both computed over a 50-bar centered window | Adaptive outlier detection that accounts for local price level |
| 9 | Splice-boundary adjustment: verify continuity at the PiTrading/IEX transition (March 2022) | Ensures no artificial price jump at the source-change boundary |
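Steps 1 through 7 of the table can be sketched in pandas as follows. This is an illustration under stated assumptions, not the production pipeline: the column names (`open`, `high`, `low`, `close`, `volume`) and the function name are hypothetical, the index is assumed to be a sorted DatetimeIndex in Eastern Time for a single symbol, the 16:00 bar is assumed to be excluded, and the previous close in step 7 is assumed to be the previous surviving bar on the same day.

```python
import numpy as np
import pandas as pd

def apply_rule_filters(bars: pd.DataFrame, max_abs_log_return: float = 0.25) -> pd.DataFrame:
    """Apply steps 1-7 in order and return the surviving bars."""
    # Step 1: keep the regular session (assumed here to be 09:30 through 15:59).
    out = bars.between_time("09:30", "15:59")

    # Step 2: all four prices must be strictly positive.
    out = out[(out[["open", "high", "low", "close"]] > 0).all(axis=1)]

    # Step 3: High must not be below Low.
    out = out[out["high"] >= out["low"]]

    # Step 4: Open and Close must lie inside [Low, High].
    lo, hi = out["low"], out["high"]
    inside = (out["open"] >= lo) & (out["open"] <= hi) & (out["close"] >= lo) & (out["close"] <= hi)
    out = out[inside]

    # Step 5: drop duplicate timestamps, keeping the first occurrence.
    out = out[~out.index.duplicated(keep="first")]

    # Step 6: drop zero-volume bars.
    out = out[out["volume"] != 0]

    # Step 7: drop bars whose absolute log return exceeds the threshold.
    # Returns are computed within each day (an assumption), so the first bar
    # of a day has no return and is always kept by this step.
    log_ret = np.log(out["close"]).groupby(out.index.normalize()).diff()
    out = out[~(log_ret.abs() > max_abs_log_return)]
    return out
```

Because step 7 runs in a single pass here, a bar that merely returns to the normal price level right after a spike would also be flagged if the spike itself survived the earlier steps. Whether the real pipeline recomputes returns after each removal is not stated.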
Bars removed: 388,559 (0.025% of total). The cleaning is conservative by design.
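The step 8 rule, as worded in the table, can be sketched with a rolling median and a rolling median absolute deviation. The function name, the parameter names, and the handling of the shorter windows at the start and end of the series (`min_periods=1`) are assumptions; the source does not describe edge handling.

```python
import numpy as np
import pandas as pd

def mad_outlier_mask(close: pd.Series, window: int = 50, k: float = 3.0) -> pd.Series:
    """Return True where |close - rolling median| > k * rolling MAD.

    Both statistics use a centered window; edge windows are simply shorter.
    """
    roll = close.rolling(window, center=True, min_periods=1)
    median = roll.median()
    # MAD of each window: median of absolute deviations from that window's median.
    mad = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    return (close - median).abs() > k * mad

# Usage: a gently oscillating series with one injected spike at position 50.
prices = pd.Series([100 + 0.1 * (i % 5) for i in range(100)])
prices.iloc[50] = 120.0
mask = mad_outlier_mask(prices)
cleaned = prices[~mask]
```

One caveat with this literal form: when prices are flat across a window the MAD is zero, so any nonzero deviation is flagged. Whether the dataset's implementation adds a minimum tolerance for that case is not stated here.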
The regular trading day has 390 one-minute bars (09:30–15:59 ET). When a bar is missing in the Clean version, Version 3 fills it using Last Observation Carried Forward (LOCF): the previous bar's Close price is propagated to Open, High, Low, and Close, and Volume is set to zero. Every filled bar is flagged with is_filled = True.
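The fill rule described above can be sketched for a single trading day as follows. The column names and the function name are hypothetical, and the index is assumed to be a unique, timezone-naive DatetimeIndex in Eastern Time. Bars missing before the first observed bar of the day have no earlier bar to carry forward within the day; this sketch leaves them empty and unflagged, since their treatment is not specified here.

```python
import pandas as pd

def locf_fill_day(clean: pd.DataFrame, day: str) -> pd.DataFrame:
    """Reindex one day of clean bars to the 390-minute grid and LOCF-fill gaps."""
    grid = pd.date_range(f"{day} 09:30", f"{day} 15:59", freq="1min")
    out = clean.reindex(grid)

    missing = out["close"].isna()
    prev_close = out["close"].ffill()
    # Only fill where an earlier bar exists to carry forward.
    filled = missing & prev_close.notna()

    # The previous Close becomes Open, High, Low, and Close of the filled bar.
    for col in ["open", "high", "low", "close"]:
        out[col] = out[col].where(~filled, prev_close)
    out["volume"] = out["volume"].where(~filled, 0)
    out["is_filled"] = filled
    return out
```

Filtering on `is_filled` afterwards recovers the Clean bars for that day, so the flag lets the regular grid and the observed-only view coexist in one file.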
LOCF introduces known biases (stale prices suppress volatility, compress spreads, and dampen autocorrelation). These biases are documented and quantified in the accompanying paper. Researchers who need a regular grid should use Version 3 with awareness of these effects.
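One way to see the volatility effect is a toy comparison of per-bar return dispersion before and after filling. The numbers below are made up for illustration and are not drawn from the dataset; the accompanying paper remains the reference for the measured size of these biases.

```python
import numpy as np

def per_bar_volatility(close: np.ndarray) -> float:
    """Standard deviation of one-bar log returns."""
    return float(np.std(np.diff(np.log(close))))

# Four observed bars with one missing minute between each pair (hypothetical values).
observed = np.array([100.0, 101.0, 100.0, 102.0])
# The same day after LOCF: every gap becomes a bar repeating the previous close.
locf = np.array([100.0, 100.0, 101.0, 101.0, 100.0, 100.0, 102.0])

vol_observed = per_bar_volatility(observed)
vol_locf = per_bar_volatility(locf)
# Share of bar-to-bar returns that are exactly zero after filling.
zero_return_share = float(np.mean(np.diff(locf) == 0))
```

In this toy case half of the filled series' returns are exactly zero, which pulls the per-bar standard deviation below that of the observed bars even though the total price path is unchanged.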