Data Dictionary

Complete column definitions and computed academic variable specifications.

Parquet file schema

Each file contains one-minute OHLCV bars for a single U.S. equity or ETF.

| Column | Type | Description |
| --- | --- | --- |
| datetime | datetime64 | Bar timestamp (Eastern Time), format: YYYY-MM-DD HH:MM:SS |
| Date | string | Trading date, format: MM/DD/YYYY |
| Time | string | Bar time, format: HHMM (24-hour) |
| Open | float64 | Opening price of the 1-minute bar (split/dividend adjusted) |
| High | float64 | Highest price during the 1-minute bar |
| Low | float64 | Lowest price during the 1-minute bar |
| Close | float64 | Closing price of the 1-minute bar |
| Volume | int64 | Number of shares traded during the 1-minute bar |
| source | string | Data source: "pitrading" (pre-2022) or "alpaca" (post-2022, IEX via Alpaca) |
| is_filled | bool | True if this bar was gap-filled (Version 3 only) |
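The three time columns carry the same information in different formats, so one simple sanity check is to rebuild the timestamp from `Date` and `Time` and compare it with `datetime`. The sketch below uses only the Python standard library; the helper name `bar_timestamp` and the sample row are hypothetical, and the timestamps are treated as naive Eastern Time values.

```python
from datetime import datetime

def bar_timestamp(date_str: str, time_str: str) -> datetime:
    """Combine a Date (MM/DD/YYYY) and a Time (HHMM) string into one naive datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H%M")

# Hypothetical row values, for illustration only.
row = {"datetime": "2021-03-15 09:31:00", "Date": "03/15/2021", "Time": "0931"}

expected = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
is_consistent = bar_timestamp(row["Date"], row["Time"]) == expected
```

A row for which `is_consistent` is false would point to a formatting problem in one of the three columns.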

Computed academic variables (27 measures)

Computed daily for each ticker in each cleaning version.

Volatility (5 measures)

| # | Variable | Formula / Reference |
| --- | --- | --- |
| 1 | Realized variance (5-min) | $RV = \sum_i r_{t,i}^2$ using 5-minute sampled returns |
| 2 | Realized variance (1-min) | $RV = \sum_i r_{t,i}^2$ using all 1-minute returns |
| 3 | Bipower variation | $BV = \frac{\pi}{2} \sum_i \lvert r_{t,i} \rvert \, \lvert r_{t,i-1} \rvert$ (Barndorff-Nielsen and Shephard 2004) |
| 4 | Parkinson range volatility | $\sigma^2 = \frac{1}{4 \ln 2} \left( \ln \frac{H}{L} \right)^2$ (Parkinson 1980) |
| 5 | Yang-Zhang volatility | OHLC-based estimator (Yang and Zhang 2000) |
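As a rough illustration of measures 1 to 4, the formulas can be written in a few lines of plain Python. This is a sketch under stated assumptions, not the pipeline's code: the function names are hypothetical, returns are taken as log differences of consecutive closes, and the 5-minute series is assumed to be every fifth 1-minute close.

```python
import math

def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def realized_variance(returns):
    """RV: sum of squared intraday returns."""
    return sum(r * r for r in returns)

def bipower_variation(returns):
    """BV: (pi/2) times the sum of products of adjacent absolute returns."""
    return (math.pi / 2) * sum(abs(a) * abs(b) for a, b in zip(returns, returns[1:]))

def parkinson_variance(high, low):
    """Parkinson range variance from a single high/low pair."""
    return math.log(high / low) ** 2 / (4 * math.log(2))

# Made-up 1-minute closes, for illustration only.
closes = [100.0, 100.1, 99.9, 100.2, 100.0, 100.3]
rv_1min = realized_variance(log_returns(closes))
# Assumption: 5-minute sampling keeps every fifth 1-minute close.
rv_5min = realized_variance(log_returns(closes[::5]))
```

Sampling at 5 minutes uses fewer observations than sampling at 1 minute, so the two realized variances generally differ on the same day.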

Spreads (2 measures)

| # | Variable | Formula / Reference |
| --- | --- | --- |
| 6 | Roll implied spread | $S = 2 \sqrt{-\mathrm{Cov}(r_t, r_{t-1})}$, in basis points (Roll 1984) |
| 7 | Corwin-Schultz spread | High-low spread estimator (Corwin and Schultz 2012) |
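The Roll estimator is only defined when the first-order autocovariance of returns is negative. A minimal sketch follows, assuming log returns as input, a sample covariance with an n − 1 denominator, and `None` as the result when the covariance is non-negative; the source does not say how that case is handled, and the function name is hypothetical.

```python
import math

def roll_spread_bps(returns):
    """Roll (1984) implied spread in basis points, or None when undefined."""
    n = len(returns) - 1              # number of (r_t, r_{t-1}) pairs
    if n < 2:
        return None
    x, y = returns[1:], returns[:-1]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    if cov >= 0:
        return None                   # assumption: no estimate for non-negative covariance
    return 2 * math.sqrt(-cov) * 1e4  # fraction of price -> basis points

# Alternating returns mimic bid-ask bounce and give a negative autocovariance.
bounce = [0.001, -0.001, 0.001, -0.001, 0.001]
spread = roll_spread_bps(bounce)
```

Some implementations set the spread to zero instead of returning `None`; which convention applies here is not stated in this dictionary.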

Autocorrelation (3 measures)

| # | Variable | Formula / Reference |
| --- | --- | --- |
| 8 | AC(1) | First-order autocorrelation of 1-minute log returns |
| 9 | VR(5) | Variance ratio: Var(5-min returns) / [5 × Var(1-min returns)] |
| 10 | VR(10) | Variance ratio: Var(10-min returns) / [10 × Var(1-min returns)] |
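To make measures 8 to 10 concrete, the sketch below computes AC(1) and a variance ratio from a list of 1-minute log returns. It assumes non-overlapping k-minute returns formed by summing k consecutive 1-minute log returns, and sample variances with an n − 1 denominator; neither choice is specified in the table, and the function names are hypothetical.

```python
from statistics import mean, variance

def autocorr1(returns):
    """First-order sample autocorrelation of a return series."""
    m = mean(returns)
    num = sum((a - m) * (b - m) for a, b in zip(returns[1:], returns[:-1]))
    den = sum((r - m) ** 2 for r in returns)
    return num / den

def variance_ratio(returns, k):
    """VR(k) = Var(k-minute returns) / (k * Var(1-minute returns))."""
    # Assumption: non-overlapping k-minute returns, each the sum of k log returns.
    agg = [sum(returns[i:i + k]) for i in range(0, len(returns) - k + 1, k)]
    return variance(agg) / (k * variance(returns))

# Made-up mean-reverting returns, for illustration only.
r = [0.001, -0.001] * 10
ac1 = autocorr1(r)
vr5 = variance_ratio(r, 5)
```

Under a random walk the ratio is close to 1; values below 1 indicate negative serial correlation, as in this alternating example.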

Jump detection (3 measures)

| # | Variable | Formula / Reference |
| --- | --- | --- |
| 11 | BNS z-statistic | $z = (RV - BV) / \sqrt{\theta \times \max(1,\, QV / BV^2)}$ (Barndorff-Nielsen and Shephard 2006) |
| 12 | BNS jump (1%) | Indicator: 1 if $z > 2.326$ |
| 13 | BNS jump (5%) | Indicator: 1 if $z > 1.645$ |
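The two cutoffs match the one-sided 1% and 5% critical values of the standard normal distribution, which can be checked with the standard library. The sketch below also shows the indicator logic; the names `critical_value` and `jump_flags` are hypothetical, and the z-statistic itself is taken as given.

```python
from statistics import NormalDist

def critical_value(alpha):
    """One-sided standard normal critical value at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha)

def jump_flags(z):
    """Jump indicators at the 1% and 5% levels, using the tabulated cutoffs."""
    return {"jump_1pct": int(z > 2.326), "jump_5pct": int(z > 1.645)}

flags = jump_flags(2.0)  # significant at 5% but not at 1%
```

The test is one-sided because jumps can only push RV above BV in expectation, so only large positive z values count as evidence of a jump.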

Liquidity (4 measures)

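| # | Variable | Formula / Reference |
| --- | --- | --- |
| 14 | Amihud illiquidity | $\lvert r_{\text{daily}} \rvert / \text{dollar volume}$ (Amihud 2002) |
| 15 | Daily dollar volume | $\sum_i \text{Close}_i \times \text{Volume}_i$ |
| 16 | Daily share volume | $\sum_i \text{Volume}_i$ |
| 17 | Number of trades | Count of observed (non-filled) bars |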
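A short sketch of how the four liquidity measures relate to the bar columns. The bar tuples and function name are hypothetical, the daily return is passed in as an input because its exact definition is not given in the table, and returning `None` for a zero-volume day is an assumption.

```python
def liquidity_measures(bars, daily_return):
    """bars: list of (close, volume, is_filled) tuples for one ticker-day."""
    dollar_volume = sum(close * volume for close, volume, _ in bars)
    share_volume = sum(volume for _, volume, _ in bars)
    n_trades = sum(1 for _, _, is_filled in bars if not is_filled)
    # Assumption: Amihud is left undefined (None) when nothing traded.
    amihud = abs(daily_return) / dollar_volume if dollar_volume > 0 else None
    return {
        "amihud": amihud,
        "dollar_volume": dollar_volume,
        "share_volume": share_volume,
        "n_trades": n_trades,
    }

# Made-up bars: the middle one is gap-filled and carries no volume.
bars = [(10.0, 100, False), (10.5, 0, True), (11.0, 200, False)]
liq = liquidity_measures(bars, daily_return=-0.02)
```

Note that measure 17 counts bars with trades, so it is a per-minute activity count and not a count of individual executions.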

Data quality (6 measures)

| # | Variable | Formula / Reference |
| --- | --- | --- |
| 18 | Gap rate | Fraction of the 390-bar daily grid with no trade |
| 19 | Observed bars | Number of bars with actual trades |
| 20 | Filled bars | Number of bars filled by LOCF (Version 3) |
| 21 | Longest gap | Maximum consecutive missing bars in the day |
| 22 | is_filled count | Count of filled bars in the day |
| 23 | Max bars since last trade | Largest gap between consecutive observed bars |
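The gap measures can be illustrated on the 390-bar grid directly. The sketch below assumes each observed bar is identified by its minute offset from the first bar of the session (0 to 389); that indexing and the function name are hypothetical, and only measures 18, 19, and 21 are shown.

```python
GRID_SIZE = 390  # one-minute bars in the daily grid

def gap_stats(observed_minutes):
    """Gap rate, observed-bar count, and longest run of missing bars."""
    observed = {m for m in observed_minutes if 0 <= m < GRID_SIZE}
    longest = run = 0
    for minute in range(GRID_SIZE):
        if minute in observed:
            run = 0
        else:
            run += 1
            longest = max(longest, run)
    n_obs = len(observed)
    return {
        "observed_bars": n_obs,
        "gap_rate": (GRID_SIZE - n_obs) / GRID_SIZE,
        "longest_gap": longest,
    }

# Made-up thinly traded day: trades in only four minutes.
stats = gap_stats([0, 1, 5, 389])
```

Here the missing minutes 6 through 388 form the longest gap, 383 bars.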

Returns (4 measures)

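| # | Variable | Formula / Reference |
| --- | --- | --- |
| 24 | Open-to-close return | $\ln(\text{Close}_{\text{last}} / \text{Open}_{\text{first}})$ |
| 25 | Overnight return | $\ln(\text{Open}_{\text{today}} / \text{Close}_{\text{yesterday}})$ |
| 26 | Daily high-low range | $\ln(\text{High}_{\max} / \text{Low}_{\min})$ |
| 27 | Intraday return std | Standard deviation of 1-minute log returns |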
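The four return measures follow directly from one day's bars plus the previous day's last close. A minimal sketch, assuming per-day lists of the OHLC columns in time order and a sample standard deviation (n − 1 denominator); the function name and the price values are hypothetical.

```python
import math
from statistics import stdev

def daily_return_measures(opens, highs, lows, closes, prev_close):
    """Measures 24-27 for one ticker-day; prev_close is yesterday's last close."""
    one_min = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return {
        "open_to_close": math.log(closes[-1] / opens[0]),
        "overnight": math.log(opens[0] / prev_close),
        "high_low_range": math.log(max(highs) / min(lows)),
        # Assumption: sample standard deviation of close-to-close 1-minute returns.
        "intraday_return_std": stdev(one_min),
    }

# Made-up three-bar day, for illustration only.
measures = daily_return_measures(
    opens=[100.0, 101.0, 102.0],
    highs=[101.0, 102.0, 103.0],
    lows=[99.0, 100.0, 101.0],
    closes=[101.0, 102.0, 101.5],
    prev_close=99.0,
)
```

Because all returns are in logs, the overnight and open-to-close returns add up to the close-to-close return for the day.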