Core Data Structures

Ensemble

Ensemble(data: Dict[Union[int, str], DataFrame], metadata: Optional[EnsembleMetadata] = None)

Manage ensemble timeseries data with dual representations.

The Ensemble class stores synthetic timeseries data in two complementary formats:

  1. By Realization: {realization_id: DataFrame[sites × time]}
     - Keys are realization numbers (int)
     - Values are DataFrames with sites as columns

  2. By Site: {site_name: DataFrame[realizations × time]}
     - Keys are site names (str)
     - Values are DataFrames with realizations as columns

Both representations are maintained automatically and provide efficient access for different analysis workflows.
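
The relationship between the two layouts can be illustrated with plain pandas. This is a minimal sketch rather than the library's implementation: it assumes each realization DataFrame is indexed by time with one column per site, and the helper name `realizations_to_sites` is hypothetical.

```python
import pandas as pd


def realizations_to_sites(by_realization):
    """Pivot {realization_id: DataFrame[time x sites]} into
    {site_name: DataFrame[time x realizations]} (hypothetical helper)."""
    site_names = list(next(iter(by_realization.values())).columns)
    by_site = {}
    for site in site_names:
        # One column per realization, aligned on the shared time index.
        by_site[site] = pd.DataFrame(
            {rid: df[site] for rid, df in by_realization.items()}
        )
    return by_site


index = pd.date_range("2000-01-01", periods=3, freq="D")
by_realization = {
    0: pd.DataFrame({"site_A": [1.0, 2.0, 3.0], "site_B": [4.0, 5.0, 6.0]}, index=index),
    1: pd.DataFrame({"site_A": [1.5, 2.5, 3.5], "site_B": [4.5, 5.5, 6.5]}, index=index),
}
by_site = realizations_to_sites(by_realization)
```

Each entry of `by_site` holds the same values as the input, regrouped so that one site's realizations sit side by side.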

Parameters:

Name Type Description Default
data Dict[Union[int, str], DataFrame]

Ensemble data in either format. Structure is automatically detected.

required
metadata EnsembleMetadata

Metadata about the ensemble. If None, creates default metadata.

None

Attributes:

Name Type Description
data_by_realization Dict[int, DataFrame]

Data organized by realization number.

data_by_site Dict[str, DataFrame]

Data organized by site name.

realization_ids List[int]

List of all realization IDs.

site_names List[str]

List of all site names.

metadata EnsembleMetadata

Ensemble metadata and provenance information.

Examples:

Create ensemble from generator output:

>>> from synhydro import ThomasFieringGenerator
>>> gen = ThomasFieringGenerator(Q_hist)
>>> gen.fit()
>>> ensemble = gen.generate(n_realizations=100, n_years=50)

Save and load ensemble:

>>> ensemble.to_hdf5('synthetic_flows.h5')
>>> ensemble_loaded = Ensemble.from_hdf5('synthetic_flows.h5')

Access data by site or realization:

>>> site_data = ensemble.data_by_site['site_A']  # All realizations for site A
>>> real_data = ensemble.data_by_realization[0]  # All sites for realization 0

Compute statistics:

>>> stats = ensemble.summary(by='site')
>>> percentiles = ensemble.percentile([10, 50, 90], by='site')

Initialize Ensemble with data and optional metadata.

Parameters:

Name Type Description Default
data Dict[Union[int, str], DataFrame]

Ensemble data dictionary. Structure is auto-detected.

required
metadata EnsembleMetadata

Metadata about the ensemble.

None

Raises:

Type Description
TypeError

If data is not a dictionary.

ValueError

If data structure cannot be determined.
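
The detection rule itself is not documented here. One plausible reading, given that realization keys are int and site keys are str, is a check on key types. The sketch below makes that assumption explicit; `detect_structure` is a hypothetical name, not part of the documented API.

```python
def detect_structure(data):
    """Guess the layout of an ensemble dictionary from its key types.

    Assumed rule for illustration only: all-int keys mean "by realization",
    all-str keys mean "by site", anything else is ambiguous.
    """
    if not isinstance(data, dict):
        raise TypeError("data must be a dictionary")
    if not data:
        raise ValueError("cannot determine structure of an empty dictionary")
    keys = list(data)
    # bool is a subclass of int, so exclude it explicitly.
    if all(isinstance(k, int) and not isinstance(k, bool) for k in keys):
        return "realization"
    if all(isinstance(k, str) for k in keys):
        return "site"
    raise ValueError("keys must be all int (realizations) or all str (sites)")
```

Under this rule a dictionary that mixes int and str keys raises ValueError, which matches the documented failure mode.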

frequency property

frequency: Optional[str]

Get the time frequency/resolution of the ensemble data.

Returns:

Type Description
Optional[str]

Time frequency (e.g., 'D', 'MS', 'YS') from metadata.

sites property

sites: List[str]

Get list of site names (alias for site_names).

Returns:

Type Description
List[str]

List of site names in the ensemble.

data_by_site property writable

data_by_site: Dict[str, DataFrame]

Data organized by site name (lazily computed from realization data).

from_hdf5 classmethod

from_hdf5(filename: str, realization_subset: Optional[List[int]] = None, stored_by_node: bool = True) -> Ensemble

Load ensemble from HDF5 file.

Parameters:

Name Type Description Default
filename str

Path to HDF5 file.

required
realization_subset List[int]

Load only specified realizations. If None, loads all.

None
stored_by_node bool

If True, data is stored with sites as top-level groups.

True

Returns:

Type Description
Ensemble

Loaded ensemble object.

Examples:

>>> ensemble = Ensemble.from_hdf5('synthetic_flows.h5')
>>> ensemble = Ensemble.from_hdf5('flows.h5', realization_subset=[0, 1, 2])

to_hdf5

to_hdf5(filename: str, compression: Optional[str] = 'gzip', stored_by_node: bool = True)

Save ensemble to HDF5 file.

Parameters:

Name Type Description Default
filename str

Path to output HDF5 file.

required
compression str

Compression algorithm: 'gzip', 'lzf', or None for no compression.

'gzip'
stored_by_node bool

If True, store data with sites as top-level groups.

True

Examples:

>>> ensemble.to_hdf5('synthetic_flows.h5')
>>> ensemble.to_hdf5('flows.h5', compression='lzf')

summary

summary(by: str = 'site') -> pd.DataFrame

Compute statistical summary across realizations or sites.

Parameters:

Name Type Description Default
by str ('site' or 'realization')

Compute statistics by site or by realization.

'site'

Returns:

Type Description
DataFrame

Summary statistics (mean, std, min, max) for each site or realization.

Examples:

>>> stats = ensemble.summary(by='site')
>>> print(stats)
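
For intuition, the same four statistics can be computed with plain pandas. This sketch assumes a by-site layout with time steps as rows and realization IDs as columns, and it pools all values for a site. Whether `summary` pools values this way, and the exact shape of its result, are assumptions.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=4, freq="D")
# Hypothetical by-site layout: rows are time steps, columns are realization IDs.
data_by_site = {
    "site_A": pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0], 1: [2.0, 3.0, 4.0, 5.0]}, index=index),
    "site_B": pd.DataFrame({0: [10.0, 10.0, 10.0, 10.0], 1: [20.0, 20.0, 20.0, 20.0]}, index=index),
}

# Pool every value for a site (all time steps, all realizations) into one
# Series, then compute the four statistics; the result has one row per site.
stats = pd.DataFrame(
    {
        site: pd.Series(df.to_numpy().ravel()).agg(["mean", "std", "min", "max"])
        for site, df in data_by_site.items()
    }
).T
```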

percentile

percentile(q: Union[float, List[float]], by: str = 'site') -> Dict[str, pd.DataFrame]

Compute percentiles across realizations.

Parameters:

Name Type Description Default
q float or List[float]

Percentile(s) to compute (0-100).

required
by str ('site' or 'realization')

Compute percentiles by site or realization.

'site'

Returns:

Type Description
Dict[str, DataFrame]

Dictionary mapping site/realization to DataFrame of percentiles over time.

Examples:

>>> p = ensemble.percentile([10, 50, 90], by='site')
>>> site_a_percentiles = p['site_A']
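
A percentile table "over time" for one site can be sketched with `DataFrame.quantile`. The layout (time steps as rows, realization IDs as columns) and the column labelling of the result are assumptions for illustration, not the documented return format.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=2, freq="D")
# Hypothetical site table: rows are time steps, columns are realization IDs.
site_a = pd.DataFrame({0: [1.0, 10.0], 1: [2.0, 20.0], 2: [3.0, 30.0]}, index=index)

q = [10, 50, 90]
# DataFrame.quantile expects fractions in [0, 1], so convert from the 0-100 scale.
# axis=1 computes each quantile across realizations, separately for every time step.
pct = site_a.quantile([p / 100 for p in q], axis=1).T
pct.columns = q  # label columns with the requested percentiles
```

Each row of `pct` is a time step and each column a percentile, so plotting the 10 and 90 columns gives an envelope around the median.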

subset

subset(sites: Optional[List[str]] = None, realizations: Optional[List[int]] = None, start_date: Optional[str] = None, end_date: Optional[str] = None) -> Ensemble

Create subset of ensemble by sites, realizations, or time period.

Parameters:

Name Type Description Default
sites List[str]

Site names to include.

None
realizations List[int]

Realization IDs to include.

None
start_date str

Start date (ISO format or pandas-parseable).

None
end_date str

End date (ISO format or pandas-parseable).

None

Returns:

Type Description
Ensemble

New ensemble containing only the subset.

Examples:

>>> subset = ensemble.subset(sites=['site_A', 'site_B'],
...                          start_date='2000-01-01',
...                          end_date='2010-12-31')
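
The three filters correspond to ordinary dictionary and pandas selection. The sketch below applies them to a hand-built by-realization dictionary. It assumes a DatetimeIndex on each DataFrame and that both date bounds are inclusive, which is how pandas label slicing behaves but is not stated for `subset` itself.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=5, freq="D")
# Hypothetical ensemble: three realizations, three sites, five daily steps.
by_realization = {
    rid: pd.DataFrame(
        {
            "site_A": [0, 1, 2, 3, 4],
            "site_B": [5, 6, 7, 8, 9],
            "site_C": [10, 11, 12, 13, 14],
        },
        index=index,
    )
    for rid in (0, 1, 2)
}

sites = ["site_A", "site_B"]
realizations = [0, 2]
start_date, end_date = "2000-01-02", "2000-01-04"

# Label-based slicing on a DatetimeIndex includes both endpoints.
subset = {
    rid: by_realization[rid].loc[start_date:end_date, sites]
    for rid in realizations
}
```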

resample

resample(freq: str) -> Ensemble

Resample time series to different frequency.

Parameters:

Name Type Description Default
freq str

Pandas frequency string ('D', 'W', 'MS', 'YS', etc.).

required

Returns:

Type Description
Ensemble

New ensemble with resampled data.

Examples:

>>> monthly_ensemble = daily_ensemble.resample('MS')
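
The aggregation applied when moving to a coarser frequency is not stated here. As a sketch of the underlying pandas operation, the example below resamples one daily DataFrame to month-start bins and takes the mean. The choice of mean is an assumption; a sum may be more appropriate for volumes.

```python
import pandas as pd

# Two full months of daily data (2000 is a leap year, so February has 29 days).
index = pd.date_range("2000-01-01", "2000-02-29", freq="D")
daily = pd.DataFrame({"site_A": 1.0, "site_B": 2.0}, index=index)

# 'MS' labels each bin with the first day of the month.
# Taking the mean is an assumed aggregation, not the documented behavior.
monthly = daily.resample("MS").mean()
```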

EnsembleMetadata

EnsembleMetadata dataclass

EnsembleMetadata(generator_class: Optional[str] = None, generator_params: Optional[Dict[str, Any]] = None, creation_timestamp: str = (lambda: datetime.now().isoformat())(), n_realizations: int = 0, n_sites: int = 0, time_resolution: Optional[str] = None, time_period: Optional[Tuple[str, str]] = None, description: Optional[str] = None, custom_attrs: Optional[Dict[str, Any]] = dict())

Store metadata about an ensemble.

Attributes:

Name Type Description
generator_class (str, optional)

Name of the generator class that created this ensemble.

generator_params (Dict, optional)

Parameters used to configure the generator.

creation_timestamp str

ISO format timestamp of when ensemble was created.

n_realizations int

Number of realizations in the ensemble.

n_sites int

Number of sites/locations in the ensemble.

time_resolution (str, optional)

Time resolution of data ('daily', 'monthly', etc.).

time_period (Tuple[str, str], optional)

Start and end dates of time series (ISO format strings).

description (str, optional)

User-provided description of the ensemble.

custom_attrs (Dict, optional)

Additional user-defined metadata attributes.

to_dict

to_dict() -> Dict[str, Any]

Convert metadata to dictionary.
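
The defaults shown in the signature as `(lambda: datetime.now().isoformat())()` and `dict()` look like rendered dataclass default factories. The sketch below shows that pattern on a reduced stand-in class. `MetadataSketch` is hypothetical, includes only some of the fields listed above, and uses `dataclasses.asdict` for `to_dict`, which may differ from the real method.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


@dataclass
class MetadataSketch:
    """Illustrative stand-in with a few of the fields listed above."""

    generator_class: Optional[str] = None
    # default_factory runs at instantiation, so each object gets its own timestamp.
    creation_timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    n_realizations: int = 0
    n_sites: int = 0
    time_period: Optional[Tuple[str, str]] = None
    # A factory avoids sharing one mutable dict between instances.
    custom_attrs: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


meta = MetadataSketch(generator_class="ThomasFieringGenerator", n_realizations=100, n_sites=2)
meta_dict = meta.to_dict()
```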


Parameter Containers

FittedParams dataclass

FittedParams(means_: Optional[Union[Series, DataFrame]] = None, stds_: Optional[Union[Series, DataFrame]] = None, correlations_: Optional[Union[ndarray, Dict[str, ndarray]]] = None, distributions_: Optional[Dict[str, Any]] = None, transformations_: Optional[Dict[str, Any]] = None, fitted_models_: Optional[Dict[str, Any]] = None, n_parameters_: int = 0, sample_size_: int = 0, n_sites_: int = 0, training_period_: Optional[Tuple[str, str]] = None)

Store parameters learned from data during fit().

Following the scikit-learn convention, the names of fitted attributes end with a trailing underscore.

to_dict

to_dict() -> Dict[str, Any]

Convert to dictionary, handling numpy/pandas types.
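
"Handling numpy/pandas types" most likely means converting them to built-in Python objects so the result can be serialized. The conversion rules below are assumptions for illustration, and `to_plain` is a hypothetical helper rather than the library's code. Note that this sketch turns tuples into lists.

```python
import json

import numpy as np
import pandas as pd


def to_plain(value):
    """Recursively convert numpy/pandas objects to built-in Python types."""
    if isinstance(value, (pd.Series, pd.DataFrame)):
        return to_plain(value.to_dict())
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):  # numpy scalars such as np.float64
        return value.item()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Hypothetical fitted values using field names from the signature above.
params = {
    "means_": pd.Series({"site_A": 1.5, "site_B": 2.5}),
    "correlations_": np.array([[1.0, 0.3], [0.3, 1.0]]),
    "n_sites_": np.int64(2),
    "training_period_": ("1950-01-01", "2000-12-31"),
}
plain = to_plain(params)
encoded = json.dumps(plain)  # succeeds because only built-in types remain
```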

GeneratorState dataclass

GeneratorState(is_preprocessed: bool = False, is_fitted: bool = False, fit_timestamp: Optional[str] = None)

Track generator preprocessing and fitting state.

GeneratorParams dataclass

GeneratorParams(random_seed: Optional[int] = None, verbose: bool = False, debug: bool = False, algorithm_params: Dict[str, Any] = dict(), transformation_params: Dict[str, Any] = dict(), computational_params: Dict[str, Any] = dict())

Store initialization/configuration parameters for generators.

These are user-specified settings that control algorithm behavior, not learned from data.

to_dict

to_dict() -> Dict[str, Any]

Convert to flat dictionary.
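
What "flat" means is not spelled out. One reading is that the nested parameter groups are merged into the top level. The sketch below implements that reading on a reduced stand-in class. `ParamsSketch` and the keys `order` and `log_transform` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ParamsSketch:
    """Illustrative stand-in with a few of the fields listed above."""

    random_seed: Optional[int] = None
    verbose: bool = False
    algorithm_params: Dict[str, Any] = field(default_factory=dict)
    transformation_params: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Assumed flattening rule: merge nested groups into the top level.
        # Later groups overwrite earlier ones if key names collide.
        flat = {"random_seed": self.random_seed, "verbose": self.verbose}
        flat.update(self.algorithm_params)
        flat.update(self.transformation_params)
        return flat


params = ParamsSketch(
    random_seed=42,
    algorithm_params={"order": 1},
    transformation_params={"log_transform": True},
)
flat_params = params.to_dict()
```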