KNN Bootstrap Generator (Lall and Sharma 1996)
| Type | Nonparametric |
| Resolution | Monthly / Annual |
| Sites | Univariate / Multisite |
| Class | KNNBootstrapGenerator |
Overview
The K-Nearest Neighbor (KNN) bootstrap generates synthetic streamflow by conditionally resampling from the historical record. At each timestep, the current state (the flow value or values) defines a neighborhood of the K most similar historical states, and the next value is drawn from the successors of those neighbors with kernel-weighted probabilities. Because every generated value is a historical observation, this nonparametric approach closely reproduces the empirical marginal distribution and can capture nonlinear dependence structures that parametric models may miss.
For multisite applications, all sites are resampled jointly using the same selected neighbor index, preserving spatial correlation by construction.
Algorithm
Preprocessing
- Validate input as univariate or multisite DataFrame with DatetimeIndex.
- Construct state vectors: for each timestep t, define the feature vector used for neighbor search. Default: the flow value(s) at time t.
- Build successor pairs: for each historical timestep t, store (feature_t, Q_{t+1}) so that neighbors of the current state yield candidate next values. For monthly data, successors are partitioned by calendar month to preserve seasonality.
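The successor-pair construction can be illustrated with a minimal, standard-library sketch. The function name `build_successor_pairs`, the plain-list inputs, and the choice to key each pair by the calendar month of the state (rather than of the successor) are assumptions made for illustration; the actual implementation works on a DataFrame and may partition differently.

```python
from collections import defaultdict


def build_successor_pairs(months, flows):
    """Pair each historical state with its successor, grouped by calendar month.

    months: calendar month (1-12) of each timestep, in time order.
    flows:  flow values aligned with `months`.
    Returns {month_of_state: [(feature_t, successor_t_plus_1), ...]}.
    """
    pairs = defaultdict(list)
    # The final timestep has no successor, so stop one step early.
    for t in range(len(flows) - 1):
        # Assumption: partition by the month of the conditioning state.
        pairs[months[t]].append((flows[t], flows[t + 1]))
    return dict(pairs)


# Two years of made-up monthly flows (illustrative values only).
months = [m for _ in range(2) for m in range(1, 13)]
flows = [float(10 * m + year) for year in range(2) for m in range(1, 13)]
pairs_by_month = build_successor_pairs(months, flows)
```

Note that the last December in the record contributes no pair, so December has one fewer candidate than the other months; with short records this is one reason analogs repeat.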
Fitting
- Determine K (number of neighbors). Default heuristic:

  $$K = \lceil \sqrt{n} \rceil$$

  where n is the number of historical timesteps. K can also be set manually via `n_neighbors`.
- Fit a KNN model using `sklearn.neighbors.NearestNeighbors` on the historical feature vectors.
- Compute Lall-Sharma kernel weights for neighbor selection:

  $$K(i) = \frac{1/i}{\sum_{j=1}^{K} 1/j}$$

  where i is the rank of the neighbor (i=1 is closest). This harmonic kernel gives the closest neighbor twice the weight of the second-closest and three times the weight of the third.
- Store the fitted KNN model, the historical feature-successor pairs, and the kernel weights.
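As a small worked example of the fitting step, the default K heuristic and the harmonic kernel weights can be computed as follows. This is a sketch with hypothetical helper names (`default_k`, `lall_sharma_weights`), not the package's own functions; it only assumes the ceil(sqrt(n)) default and the 1/i weighting described above.

```python
import math


def default_k(n):
    """Default number of neighbors: ceil(sqrt(n)) for n historical timesteps."""
    return math.ceil(math.sqrt(n))


def lall_sharma_weights(k):
    """Selection probabilities for neighbors ranked 1 (closest) to k."""
    raw = [1.0 / i for i in range(1, k + 1)]
    total = sum(raw)
    # Normalize so the weights form a probability distribution.
    return [w / total for w in raw]


k = default_k(600)               # e.g. 50 years of monthly data
weights = lall_sharma_weights(3)  # proportional to 1, 1/2, 1/3
```

For K = 3 the unnormalized weights 1, 1/2, 1/3 sum to 11/6, so the selection probabilities are 6/11, 3/11, and 2/11.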
Generation
- Initialize by randomly selecting a historical timestep as the starting state.
- For each subsequent timestep:
  a. Query the KNN model for the K nearest neighbors of the current state.
  b. Select one neighbor with probability K(i) (Lall-Sharma kernel).
  c. The generated value is the successor of the selected neighbor in the historical record:

     $$Q_{syn}(t+1) = Q_{obs}(\tau + 1)$$

     where τ is the historical timestep of the selected neighbor.
  d. Update the current state to Q_syn(t+1) for the next iteration.
- Multisite: run the neighbor search on the feature columns (all sites by default, giving a multivariate distance; a single index site if `feature_cols` names one column), then take the successor vector across all sites jointly.
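The generation loop can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the package's implementation: the function `knn_bootstrap` is hypothetical, a brute-force sort on Euclidean distance stands in for `sklearn.neighbors.NearestNeighbors`, and the monthly partitioning of successors is omitted for brevity. States are tuples with one entry per site, so the same code covers the univariate and multisite cases.

```python
import math
import random


def knn_bootstrap(history, n_steps, k, rng):
    """Generate n_steps synthetic states by KNN bootstrap resampling.

    history: list of state vectors (tuples, one entry per site), in time order.
    rng:     a random.Random instance, for reproducibility.
    """
    # Only timesteps that have a successor can serve as neighbors.
    candidates = list(range(len(history) - 1))
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and len(history) - 1")

    # Lall-Sharma kernel: weight of the i-th closest neighbor is proportional to 1/i.
    raw = [1.0 / i for i in range(1, k + 1)]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Initialize from a randomly selected historical timestep.
    current = history[rng.randrange(len(history))]
    synthetic = [current]
    for _ in range(n_steps - 1):
        # Brute-force neighbor search (stand-in for a fitted KNN model).
        ranked = sorted(candidates, key=lambda t: math.dist(history[t], current))
        neighbors = ranked[:k]
        # Pick one neighbor by rank-based probability, then take its successor.
        chosen = rng.choices(neighbors, weights=weights, k=1)[0]
        current = history[chosen + 1]  # whole vector: all sites move together
        synthetic.append(current)
    return synthetic


# Made-up two-site record of 36 timesteps (illustrative values only).
history = [(float(t % 12 + 1), 2.0 * (t % 12 + 1) + t // 12) for t in range(36)]
synthetic = knn_bootstrap(history, n_steps=24, k=6, rng=random.Random(42))
```

Because each generated state is an entire historical row, the cross-site relationship within every timestep is copied from the record rather than modeled, which is what preserves spatial correlation by construction.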
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `Q_obs` | `pd.DataFrame` | - | Observed streamflow with DatetimeIndex, sites as columns |
| `n_neighbors` | `Optional[int]` | `None` | K; if None, uses ceil(sqrt(n)) |
| `feature_cols` | `Optional[List[str]]` | `None` | Columns to use as features for KNN search. If None, uses all site columns |
| `index_site` | `Optional[str]` | `None` | Reserved for future use (multisite index site selection) |
| `block_size` | `int` | `1` | Reserved for future use (block resampling) |
| `name` | `Optional[str]` | `None` | Optional name identifier for this generator instance |
| `debug` | `bool` | `False` | Enable debug logging |
Properties Preserved
- Empirical marginal distribution (resampled values are historical observations)
- Nonlinear dependence structure (via conditional resampling)
- Lag-1 autocorrelation (approximately, via nearest-neighbor conditioning)
- Spatial cross-correlations (via joint resampling in multisite mode)
Not preserved:
- Values outside the historical range (bootstrap limitation)
- Long-range persistence beyond the conditioning lag
- Trends or non-stationarity
Limitations
- Cannot generate values outside the range of the historical record
- Sensitive to the choice of K: too small causes repetitive sequences, too large destroys temporal structure
- Curse of dimensionality for high-dimensional feature spaces (many sites)
- Successor-based resampling can create discontinuities at December-January boundaries if not handled explicitly
- Requires at least 20 years for monthly data to avoid excessive repetition of analogs
References
Primary: Lall, U., and Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679-693. https://doi.org/10.1029/95WR02966
See also:
- Rajagopalan, B., and Lall, U. (1999). A k-nearest-neighbor simulator for daily precipitation and other weather variables. Water Resources Research, 35(10), 3089-3101. https://doi.org/10.1029/1999WR900028
- Lall, U. (1995). Recent advances in nonparametric function estimation: Hydrologic applications. Reviews of Geophysics, 33(S2), 1093-1102.
Implementation: src/synhydro/methods/generation/nonparametric/knn_bootstrap.py
Tests: tests/test_knn_bootstrap_generator.py