Multi-Dimensional Data

tsam_xarray handles multi-dimensional DataArrays through two mechanisms:

cluster_dim — dimensions clustered together (shared clustering)
Auto-slicing — remaining dimensions get independent clusterings

This notebook covers stacking, slicing, weights, and selecting what to cluster on.

In [1]:

import plotly.io as pio
import xarray_plotly  # noqa: F401

import tsam_xarray
from tsam_xarray._sample_data import sample_energy_data

pio.renderers.default = "notebook_connected"

da = sample_energy_data(n_days=30)
print(f"Dims: {list(da.dims)}")
print(f"Shape: {dict(da.sizes)}")

Dims: ['time', 'variable', 'region', 'scenario']
Shape: {'time': 720, 'variable': 3, 'region': 3, 'scenario': 2}

Multiple cluster dims

Pass cluster_dim=["variable", "region"] to cluster all variable-region combinations together. They are stacked internally and unstacked in the results.

In [2]:

da_single = da.sel(scenario="low")

result = tsam_xarray.aggregate(
    da_single,
    time_dim="time",
    cluster_dim=["variable", "region"],
    n_clusters=4,
)
print("Result dims:", result.cluster_representatives.dims)
result.cluster_representatives.to_dataframe("value").head(10)

Result dims: ('cluster', 'timestep', 'variable', 'region')

Out[2]:

				value
cluster	timestep	variable	region
0	0	demand	east	0.172796
			north	0.203079
			south	0.168561
		solar	east	0.000283
			north	0.000000
			south	0.000140
		wind	east	0.575650
			north	0.754306
			south	0.407363
	1	demand	east	0.194493
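
As a rough illustration of what stacking means here, the sketch below flattens a (time, variable, region) array into one column per variable-region pair and then restores the original shape. It uses plain NumPy on random data. The array sizes and names are assumptions for illustration, and this is not tsam_xarray's internal code.

```python
import itertools

import numpy as np

variables = ["demand", "solar", "wind"]
regions = ["east", "north", "south"]
n_time = 48  # assumed length, for illustration only

rng = np.random.default_rng(0)
data = rng.random((n_time, len(variables), len(regions)))  # (time, variable, region)

# Stack: merge the variable and region axes into a single column axis.
stacked = data.reshape(n_time, len(variables) * len(regions))

# Column labels in the same (row-major) order as the reshape above.
columns = list(itertools.product(variables, regions))

# Unstack: restore the separate variable and region axes.
unstacked = stacked.reshape(n_time, len(variables), len(regions))

print(stacked.shape)   # one column per variable-region pair
print(columns[4])      # label of the fifth stacked column
```

Because every variable-region pair becomes a column of the same matrix, all of them share one set of cluster assignments.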

In [3]:

result.cluster_representatives.sel(variable="solar").plotly.line(
    line_shape="hv",
    x="timestep",
    color="cluster",
    facet_col="region",
    title="Cluster representatives (solar, by region)",
)

Auto-slicing

Any dimension not in time_dim or cluster_dim is automatically sliced — one independent aggregation per coordinate, with results concatenated into coherent multi-dimensional arrays.

Here, scenario is auto-sliced. Each scenario gets its own clustering. Cluster counts, accuracy metrics, and cluster representatives all have the scenario dimension — no manual looping or concatenation needed.

Without tsam_xarray, you would have to loop over the scenarios yourself:

# Manual approach (what tsam_xarray replaces)
results = {}
for scenario in da.scenario.values:
    da_slice = da.sel(scenario=scenario)
    df = ...  # flatten to DataFrame
    results[scenario] = tsam.aggregate(df, n_clusters=4)
# Then manually concat cluster_counts, accuracy, cluster_representatives...
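
To make the loop-and-concatenate pattern concrete, here is a self-contained sketch that clusters each scenario independently and stacks the per-scenario results along a new leading axis. It uses a small hand-written k-means on random data as a stand-in: the function `kmeans_periods`, the data, and the 24-step period length are all assumptions for illustration, and tsam's actual algorithms and options differ.

```python
import numpy as np


def kmeans_periods(periods, n_clusters, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over the rows of `periods` (n_periods, n_steps)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(periods), n_clusters, replace=False)
    centers = periods[start].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every period to every center.
        dist = ((periods[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = periods[labels == k]
            if len(members):  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=n_clusters)
    return centers, counts


n_days, steps_per_day, n_clusters = 30, 24, 4
rng = np.random.default_rng(1)
scenarios = {
    "low": rng.random(n_days * steps_per_day),
    "high": 2.0 * rng.random(n_days * steps_per_day),
}

reps, counts = [], []
for name, series in scenarios.items():
    periods = series.reshape(n_days, steps_per_day)  # one row per day
    centers, n_members = kmeans_periods(periods, n_clusters)
    reps.append(centers)
    counts.append(n_members)

# Concatenate the independent results along a new "scenario" axis.
representatives = np.stack(reps)   # (scenario, cluster, timestep)
cluster_counts = np.stack(counts)  # (scenario, cluster)

print(representatives.shape, cluster_counts.shape)
```

Each scenario gets its own cluster assignments, so the counts generally differ between scenarios even though every row of `cluster_counts` sums to the number of days.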

In [4]:

result_sliced = tsam_xarray.aggregate(
    da,
    time_dim="time",
    cluster_dim=["variable", "region"],
    n_clusters=4,
)
print("Result dims:", result_sliced.cluster_representatives.dims)
result_sliced.cluster_counts.to_dataframe("count")

Result dims: ('scenario', 'cluster', 'timestep', 'variable', 'region')

Out[4]:

		count
scenario	cluster
low	0	8
	1	13
	2	5
	3	4
high	0	11
	1	13
	2	4
	3	2

In [5]:

result_sliced.accuracy.rmse.to_dataframe("RMSE")

Out[5]:

			RMSE
scenario	variable	region
low	demand	east	0.070919
		north	0.071710
		south	0.069703
	solar	east	0.099097
		north	0.116967
		south	0.093158
	wind	east	0.153082
		north	0.154634
		south	0.156379
high	demand	east	0.066410
		north	0.069501
		south	0.067962
	solar	east	0.067888
		north	0.085266
		south	0.063697
	wind	east	0.157093
		north	0.162105
		south	0.165263

Multiple slice dims

With only variable as cluster_dim, both region and scenario are auto-sliced, giving one aggregation per (region, scenario) combination.

In [6]:

result_multi = tsam_xarray.aggregate(
    da,
    time_dim="time",
    cluster_dim="variable",
    n_clusters=4,
)
print("Result dims:", result_multi.cluster_representatives.dims)
result_multi.cluster_representatives.sel(
    variable="solar",
    scenario="low",
).plotly.line(
    line_shape="hv",
    x="timestep",
    color="cluster",
    facet_col="region",
    title="Cluster representatives (solar, low, per region)",
)

Result dims: ('region', 'scenario', 'cluster', 'timestep', 'variable')
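
The rule for which dimensions get sliced can be written down in a few lines. The helper below is a hypothetical sketch (the name `slice_dims` is not part of tsam_xarray): it applies the rule stated earlier, that every dimension outside time_dim and cluster_dim is sliced, to the sizes of the sample data and counts the resulting independent aggregations.

```python
import math


def slice_dims(sizes, time_dim, cluster_dim):
    """Return the dimensions that are neither the time dim nor a cluster dim."""
    clustered = [cluster_dim] if isinstance(cluster_dim, str) else list(cluster_dim)
    return [d for d in sizes if d != time_dim and d not in clustered]


# Sizes of the sample data used in this notebook.
sizes = {"time": 720, "variable": 3, "region": 3, "scenario": 2}

sliced = slice_dims(sizes, "time", "variable")
n_runs = math.prod(sizes[d] for d in sliced)  # one aggregation per combination

print(sliced, n_runs)
```

With cluster_dim=["variable", "region"] the same helper leaves only scenario to slice, which matches the two independent clusterings in the earlier example.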

Weights

Use a dict to weight certain coordinates higher during clustering. For multiple cluster_dim, use a dict-of-dicts keyed by dimension name.

In [7]:

# Weight solar 2x — broadcasts across region (missing entries default to 1.0)
result_w = tsam_xarray.aggregate(
    da_single,
    time_dim="time",
    cluster_dim=["variable", "region"],
    n_clusters=4,
    weights={"variable": {"solar": 2.0}},
)
result_w.accuracy.rmse.to_dataframe("RMSE")

Out[7]:

		RMSE
variable	region
demand	east	0.075016
	north	0.076493
	south	0.071846
solar	east	0.075933
	north	0.112421
	south	0.073196
wind	east	0.160321
	north	0.159308
	south	0.162374

Weights can span multiple dimensions — they multiply across dims:

In [8]:

# Weight solar in north: solar=3.0 * north=2.0 = 6.0
result_w2 = tsam_xarray.aggregate(
    da_single,
    time_dim="time",
    cluster_dim=["variable", "region"],
    n_clusters=4,
    weights={"variable": {"solar": 3.0}, "region": {"north": 2.0}},
)
result_w2.accuracy.rmse.to_dataframe("RMSE")

Out[8]:

		RMSE
variable	region
demand	east	0.065388
	north	0.073632
	south	0.068182
solar	east	0.082727
	north	0.095599
	south	0.076938
wind	east	0.184076
	north	0.186040
	south	0.182107
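
The broadcasting rule for weights (multiply across dimensions, default to 1.0 when an entry is missing) can be checked by hand. The function below is a hypothetical helper written for this illustration, not a tsam_xarray API. It expands a dict-of-dicts into one weight per variable-region pair.

```python
from itertools import product


def combined_weights(weights, coords):
    """Per-pair weight: product of the per-dimension weights, defaulting to 1.0."""
    dims = list(coords)
    out = {}
    for combo in product(*(coords[d] for d in dims)):
        w = 1.0
        for dim, label in zip(dims, combo):
            w *= weights.get(dim, {}).get(label, 1.0)
        out[combo] = w
    return out


coords = {
    "variable": ["demand", "solar", "wind"],
    "region": ["east", "north", "south"],
}
weights = {"variable": {"solar": 3.0}, "region": {"north": 2.0}}

expanded = combined_weights(weights, coords)
for pair, w in expanded.items():
    print(pair, w)
```

Only the (solar, north) pair receives the full factor of 6.0. The other solar columns get 3.0, the other north columns get 2.0, and everything else stays at 1.0.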

Selecting what to cluster on

Sometimes a variable should be carried through the aggregation without influencing how the clusters are formed — e.g. a price or demand series you want on the representative grid, but that shouldn't drive period selection.

Use cluster_on to restrict the clustering to specific coordinates. Everything else is still aggregated and reconstructed from the resulting clusters, but has no effect on the cluster distances. This differs from a small weights value: cluster_on fully removes the coordinate from the distance metric.

Here we cluster on solar and wind only, while demand is carried along.

In [9]:

# Cluster on solar + wind only; demand is carried along but doesn't
# influence cluster selection.
result_on = tsam_xarray.aggregate(
    da_single,
    time_dim="time",
    cluster_dim=["variable", "region"],
    n_clusters=4,
    cluster_on={"variable": ["solar", "wind"]},
)
# demand is still present in the output, aggregated from the same clusters
print("Variables:", list(result_on.cluster_representatives.coords["variable"].values))
result_on.accuracy.rmse.to_dataframe("RMSE")

Variables: ['demand', 'solar', 'wind']

Out[9]:

		RMSE
variable	region
demand	east	0.070919
	north	0.071710
	south	0.069703
solar	east	0.099097
	north	0.116967
	south	0.093158
wind	east	0.153082
	north	0.154634
	south	0.156379
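
The idea behind cluster_on can be sketched independently of the library: compute cluster labels from a subset of columns, then build representatives for all columns from those labels. The code below is a minimal illustration under stated assumptions. It uses a hand-written k-means and cluster means as representatives, with hypothetical function names and random data, and is not how tsam_xarray implements the feature.

```python
import numpy as np


def kmeans_labels(x, n_clusters, n_iter=20, seed=0):
    """Minimal Lloyd's k-means returning one label per row of `x`."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = x[labels == k].mean(axis=0)
    return labels


def aggregate_with_cluster_on(periods, columns, cluster_on, n_clusters):
    """periods has shape (n_periods, n_timesteps, n_columns)."""
    # Only the selected columns enter the distance computation.
    use = [i for i, name in enumerate(columns) if name in cluster_on]
    features = periods[:, :, use].reshape(len(periods), -1)
    labels = kmeans_labels(features, n_clusters)
    # Representatives cover ALL columns, using the labels found above.
    empty = np.full(periods.shape[1:], np.nan)
    reps = np.stack([
        periods[labels == k].mean(axis=0) if np.any(labels == k) else empty
        for k in range(n_clusters)
    ])
    return labels, reps


columns = ["demand", "solar", "wind"]
rng = np.random.default_rng(0)
periods = rng.random((30, 24, len(columns)))

labels_a, reps_a = aggregate_with_cluster_on(periods, columns, {"solar", "wind"}, 4)

# Rescaling the carried-along column must not change the cluster assignment.
modified = periods.copy()
modified[:, :, 0] *= 10.0
labels_b, reps_b = aggregate_with_cluster_on(modified, columns, {"solar", "wind"}, 4)

print(np.array_equal(labels_a, labels_b))
```

The carried-along column still appears in the representatives, but changing it leaves the labels untouched. A small weight would only shrink its influence on the distances rather than remove it.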