Core Data Structures
Ensemble
Manage ensemble timeseries data with dual representations.
The Ensemble class stores synthetic timeseries data in two complementary formats:
- By Realization: {realization_id: DataFrame[sites × time]}
  - Keys are realization numbers (int)
  - Values are DataFrames with sites as columns
- By Site: {site_name: DataFrame[realizations × time]}
  - Keys are site names (str)
  - Values are DataFrames with realizations as columns
Both representations are maintained automatically and provide efficient access for different analysis workflows.
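The relationship between the two layouts can be sketched with plain pandas. This is an illustration only, not the library's implementation: the helper name `realization_to_site` and the toy data are assumptions, and each DataFrame is assumed to be indexed by time.

```python
import pandas as pd


def realization_to_site(by_realization):
    """Pivot {realization_id: DataFrame(time x sites)} into
    {site_name: DataFrame(time x realizations)}."""
    sites = next(iter(by_realization.values())).columns
    return {
        site: pd.DataFrame({rid: df[site] for rid, df in by_realization.items()})
        for site in sites
    }


# Toy ensemble: 3 realizations, 2 sites, 4 monthly time steps.
index = pd.date_range("2000-01-01", periods=4, freq="MS")
by_realization = {
    rid: pd.DataFrame(
        [[rid * 100 + t * 10 + s for s in range(2)] for t in range(4)],
        index=index,
        columns=["site_A", "site_B"],
    )
    for rid in range(3)
}

by_site = realization_to_site(by_realization)
print(by_site["site_A"])  # one column per realization
```

Both dictionaries hold the same values; only the grouping key changes, which is why one layout can be derived from the other.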
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Dict[Union[int, str], DataFrame]` | Ensemble data in either format. Structure is automatically detected. | required |
| `metadata` | `EnsembleMetadata` | Metadata about the ensemble. If None, creates default metadata. | `None` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `data_by_realization` | `Dict[int, DataFrame]` | Data organized by realization number. |
| `data_by_site` | `Dict[str, DataFrame]` | Data organized by site name. |
| `realization_ids` | `List[int]` | List of all realization IDs. |
| `site_names` | `List[str]` | List of all site names. |
| `metadata` | `EnsembleMetadata` | Ensemble metadata and provenance information. |
Examples:
Create ensemble from generator output:
>>> from synhydro import ThomasFieringGenerator
>>> gen = ThomasFieringGenerator(Q_hist)
>>> gen.fit()
>>> ensemble = gen.generate(n_realizations=100, n_years=50)
Save and load ensemble:
>>> ensemble.to_hdf5('synthetic_flows.h5')
>>> ensemble_loaded = Ensemble.from_hdf5('synthetic_flows.h5')
Access data by site or realization:
>>> site_data = ensemble.data_by_site['site_A'] # All realizations for site A
>>> real_data = ensemble.data_by_realization[0] # All sites for realization 0
Compute statistics:
>>> stats = ensemble.summary(by='site')
>>> percentiles = ensemble.percentile([10, 50, 90], by='site')
Initialize Ensemble with data and optional metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Dict[Union[int, str], DataFrame]` | Ensemble data dictionary. Structure is auto-detected. | required |
| `metadata` | `EnsembleMetadata` | Metadata about the ensemble. | `None` |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If data is not a dictionary. |
| `ValueError` | If data structure cannot be determined. |
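One plausible way to auto-detect the structure is to inspect the key types, since realization keys are documented as `int` and site keys as `str`. The sketch below is a guess at that logic, not the library's code; the function name `detect_structure` and the exact error messages are hypothetical.

```python
def detect_structure(data):
    """Return 'realization' or 'site' based on the key types of `data`."""
    if not isinstance(data, dict):
        raise TypeError(f"data must be a dict, got {type(data).__name__}")
    if not data:
        raise ValueError("cannot determine structure of an empty dict")
    keys = list(data)
    # bool is a subclass of int, so exclude it explicitly.
    if all(isinstance(k, int) and not isinstance(k, bool) for k in keys):
        return "realization"
    if all(isinstance(k, str) for k in keys):
        return "site"
    raise ValueError("keys must be all int (realizations) or all str (sites)")


print(detect_structure({0: None, 1: None}))   # int keys
print(detect_structure({"site_A": None}))     # str keys
```

Mixed or empty key sets fall through to `ValueError`, which mirrors the documented "structure cannot be determined" case.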
frequency (property)
Get the time frequency/resolution of the ensemble data.
Returns:
| Type | Description |
|---|---|
| `Optional[str]` | Time frequency (e.g., 'D', 'MS', 'YS') from metadata. |
sites (property)
Get list of site names (alias for site_names).
Returns:
| Type | Description |
|---|---|
| `List[str]` | List of site names in the ensemble. |
data_by_site (property, writable)
Data organized by site name (lazily computed from realization data).
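A lazily computed, writable property is commonly built from a cached private attribute plus a setter. The toy class below shows that pattern under stated assumptions: `LazyEnsemble` is hypothetical, nested dicts of lists stand in for DataFrames, and the setter does not propagate changes back to the realization data (the real class may behave differently).

```python
class LazyEnsemble:
    """Toy container: the by-site view is built on first access and cached."""

    def __init__(self, by_realization):
        self.data_by_realization = by_realization
        self._by_site = None  # not computed yet

    @property
    def data_by_site(self):
        if self._by_site is None:
            self._by_site = self._pivot()
        return self._by_site

    @data_by_site.setter
    def data_by_site(self, value):
        # A writable property lets callers replace the cached view.
        self._by_site = value

    def _pivot(self):
        by_site = {}
        for rid, sites in self.data_by_realization.items():
            for site, series in sites.items():
                by_site.setdefault(site, {})[rid] = series
        return by_site


ens = LazyEnsemble({0: {"site_A": [1.0, 2.0]}, 1: {"site_A": [3.0, 4.0]}})
print(ens.data_by_site)  # pivot happens here, on first access
```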
from_hdf5 (classmethod)
from_hdf5(filename: str, realization_subset: Optional[List[int]] = None, stored_by_node: bool = True) -> Ensemble
Load ensemble from HDF5 file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filename` | `str` | Path to HDF5 file. | required |
| `realization_subset` | `List[int]` | Load only specified realizations. If None, loads all. | `None` |
| `stored_by_node` | `bool` | If True, data is stored with sites as top-level groups. | `True` |
Returns:
| Type | Description |
|---|---|
| `Ensemble` | Loaded ensemble object. |
Examples:
to_hdf5
Save ensemble to HDF5 file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filename` | `str` | Path to output HDF5 file. | required |
| `compression` | `str` | Compression algorithm ('gzip', 'lzf', None). Default is 'gzip'. | `'gzip'` |
| `stored_by_node` | `bool` | If True, store data with sites as top-level groups. | `True` |
Examples:
summary
Compute statistical summary across realizations or sites.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `by` | `{'site', 'realization'}` | Compute statistics by site or by realization. | `'site'` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Summary statistics (mean, std, min, max) for each site or realization. |
Examples:
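A minimal sketch of the kind of per-site summary described above, written with plain pandas rather than the library itself. The helper `summarize_by_site` is hypothetical, and pooling all realizations and time steps into one sample per site is an assumption about how the statistics are aggregated.

```python
import pandas as pd


def summarize_by_site(by_site):
    """One row per site: statistics pooled over all realizations and time steps."""
    rows = {}
    for site, df in by_site.items():
        values = df.to_numpy().ravel()
        rows[site] = {
            "mean": values.mean(),
            "std": values.std(ddof=1),  # sample standard deviation
            "min": values.min(),
            "max": values.max(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy by-site data: 2 realizations (columns) over 3 daily time steps (rows).
index = pd.date_range("2000-01-01", periods=3, freq="D")
by_site = {
    "site_A": pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [3.0, 4.0, 5.0]}, index=index),
}
stats = summarize_by_site(by_site)
print(stats)
```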
percentile
Compute percentiles across realizations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `q` | `float` or `List[float]` | Percentile(s) to compute (0-100). | required |
| `by` | `{'site', 'realization'}` | Compute percentiles by site or realization. | `'site'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, DataFrame]` | Dictionary mapping site/realization to DataFrame of percentiles over time. |
Examples:
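The per-site case can be approximated with `DataFrame.quantile`, taking percentiles across the realization columns at each time step. This is a sketch under assumptions, not the library's implementation: `percentiles_by_site` is a hypothetical helper, and the output layout (time as rows, percentiles as columns) is a guess based on the return description.

```python
import pandas as pd


def percentiles_by_site(by_site, q):
    """For each site, compute percentiles across realizations at every time step."""
    qs = [q] if isinstance(q, (int, float)) else list(q)
    out = {}
    for site, df in by_site.items():
        # df has time as rows and realizations as columns;
        # quantile() expects fractions in [0, 1], not percentages.
        result = df.quantile([p / 100 for p in qs], axis=1).T
        result.columns = qs  # label columns with the requested percentiles
        out[site] = result
    return out


index = pd.date_range("2000-01-01", periods=2, freq="D")
by_site = {
    "site_A": pd.DataFrame(
        {0: [1.0, 10.0], 1: [2.0, 20.0], 2: [3.0, 30.0]}, index=index
    ),
}
pct = percentiles_by_site(by_site, [10, 50, 90])
print(pct["site_A"])
```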
subset
subset(sites: Optional[List[str]] = None, realizations: Optional[List[int]] = None, start_date: Optional[str] = None, end_date: Optional[str] = None) -> Ensemble
Create subset of ensemble by sites, realizations, or time period.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sites` | `List[str]` | Site names to include. | `None` |
| `realizations` | `List[int]` | Realization IDs to include. | `None` |
| `start_date` | `str` | Start date (ISO format or pandas-parseable). | `None` |
| `end_date` | `str` | End date (ISO format or pandas-parseable). | `None` |
Returns:
| Type | Description |
|---|---|
| `Ensemble` | New ensemble containing only the subset. |
Examples:
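The three filters (sites, realizations, time window) map naturally onto pandas column selection, dictionary key selection, and label-based date slicing. The sketch below works on a plain by-realization dictionary; `subset_by_realization` is a hypothetical helper, and whether the real method treats `end_date` as inclusive is an assumption.

```python
import pandas as pd


def subset_by_realization(by_realization, sites=None, realizations=None,
                          start_date=None, end_date=None):
    """Filter {realization_id: DataFrame(time x sites)}; None means 'keep all'."""
    keep = list(by_realization) if realizations is None else realizations
    out = {}
    for rid in keep:
        df = by_realization[rid]
        if sites is not None:
            df = df[sites]
        # Label-based slicing on a DatetimeIndex includes both endpoints.
        out[rid] = df.loc[start_date:end_date]
    return out


index = pd.date_range("2000-01-01", periods=6, freq="MS")
by_realization = {
    rid: pd.DataFrame({"site_A": range(6), "site_B": range(6, 12)}, index=index)
    for rid in range(3)
}
small = subset_by_realization(
    by_realization,
    sites=["site_A"],
    realizations=[0, 2],
    start_date="2000-02-01",
    end_date="2000-04-01",
)
print(small[0])
```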
EnsembleMetadata
EnsembleMetadata (dataclass)
EnsembleMetadata(generator_class: Optional[str] = None, generator_params: Optional[Dict[str, Any]] = None, creation_timestamp: str = (lambda: datetime.now().isoformat())(), n_realizations: int = 0, n_sites: int = 0, time_resolution: Optional[str] = None, time_period: Optional[Tuple[str, str]] = None, description: Optional[str] = None, custom_attrs: Optional[Dict[str, Any]] = dict())
Store metadata about an ensemble.
Attributes:
| Name | Type | Description |
|---|---|---|
| `generator_class` | `str`, optional | Name of the generator class that created this ensemble. |
| `generator_params` | `Dict`, optional | Parameters used to configure the generator. |
| `creation_timestamp` | `str` | ISO format timestamp of when ensemble was created. |
| `n_realizations` | `int` | Number of realizations in the ensemble. |
| `n_sites` | `int` | Number of sites/locations in the ensemble. |
| `time_resolution` | `str`, optional | Time resolution of data ('daily', 'monthly', etc.). |
| `time_period` | `Tuple[str, str]`, optional | Start and end dates of time series (ISO format strings). |
| `description` | `str`, optional | User-provided description of the ensemble. |
| `custom_attrs` | `Dict`, optional | Additional user-defined metadata attributes. |
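The `(lambda: ...)()` and `dict()` defaults in the rendered signature look like the way documentation tools display `dataclasses.field(default_factory=...)`; that reading is an inference, not something the page states. Under that assumption, a stand-in with the same field names might look as follows. `EnsembleMetadataSketch` is a hypothetical class for illustration, not the library's definition.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


@dataclass
class EnsembleMetadataSketch:
    """Illustrative stand-in mirroring the documented field names and defaults."""

    generator_class: Optional[str] = None
    generator_params: Optional[Dict[str, Any]] = None
    # default_factory runs at instantiation, so each instance gets its own timestamp.
    creation_timestamp: str = field(
        default_factory=lambda: datetime.now().isoformat()
    )
    n_realizations: int = 0
    n_sites: int = 0
    time_resolution: Optional[str] = None
    time_period: Optional[Tuple[str, str]] = None
    description: Optional[str] = None
    # A fresh dict per instance avoids sharing one mutable default.
    custom_attrs: Optional[Dict[str, Any]] = field(default_factory=dict)


meta = EnsembleMetadataSketch(
    generator_class="ThomasFieringGenerator",
    n_realizations=100,
    n_sites=3,
    time_resolution="monthly",
    time_period=("2000-01-01", "2049-12-01"),
)
meta.custom_attrs["basin"] = "example"  # illustrative user-defined attribute
print(meta)
```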
Parameter Containers
FittedParams (dataclass)
FittedParams(means_: Optional[Union[Series, DataFrame]] = None, stds_: Optional[Union[Series, DataFrame]] = None, correlations_: Optional[Union[ndarray, Dict[str, ndarray]]] = None, distributions_: Optional[Dict[str, Any]] = None, transformations_: Optional[Dict[str, Any]] = None, fitted_models_: Optional[Dict[str, Any]] = None, n_parameters_: int = 0, sample_size_: int = 0, n_sites_: int = 0, training_period_: Optional[Tuple[str, str]] = None)
Store parameters learned from data during fit().
Following the scikit-learn convention, the names of learned parameters end with an underscore.
GeneratorState (dataclass)
GeneratorState(is_preprocessed: bool = False, is_fitted: bool = False, fit_timestamp: Optional[str] = None)
Track generator preprocessing and fitting state.
GeneratorParams (dataclass)
GeneratorParams(random_seed: Optional[int] = None, verbose: bool = False, debug: bool = False, algorithm_params: Dict[str, Any] = dict(), transformation_params: Dict[str, Any] = dict(), computational_params: Dict[str, Any] = dict())
Store initialization/configuration parameters for generators.
These are user-specified settings that control algorithm behavior, not learned from data.
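The split between configuration, learned parameters, and lifecycle state can be illustrated with a toy generator. Everything in the sketch is hypothetical (`ParamsSketch`, `FittedSketch`, `StateSketch`, `ToyGenerator`); only a few field names are borrowed from the signatures above, and how the real generators wire these containers together is not shown on this page.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class ParamsSketch:
    """User-specified configuration, set at construction time."""
    random_seed: Optional[int] = None
    algorithm_params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FittedSketch:
    """Values learned during fit(); names end with an underscore."""
    means_: Optional[float] = None
    sample_size_: int = 0


@dataclass
class StateSketch:
    """Lifecycle flags updated as the generator is used."""
    is_fitted: bool = False
    fit_timestamp: Optional[str] = None


class ToyGenerator:
    def __init__(self, data, random_seed=None):
        self.data = list(data)
        self.params = ParamsSketch(random_seed=random_seed)
        self.fitted = FittedSketch()
        self.state = StateSketch()

    def fit(self):
        # Learned quantities go into the fitted container, never into params.
        self.fitted.means_ = sum(self.data) / len(self.data)
        self.fitted.sample_size_ = len(self.data)
        self.state.is_fitted = True
        self.state.fit_timestamp = datetime.now().isoformat()
        return self


gen = ToyGenerator([1.0, 2.0, 3.0], random_seed=42).fit()
print(gen.params, gen.fitted, gen.state.is_fitted)
```

Keeping the three containers separate means the configuration can be reused or serialized without carrying along anything that depends on the training data.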