Out-of-core fields: move unstructured interpolators to .isel, and add a time-windowed array cache (WindowedArray)

## Summary

For datasets whose full time series doesn't fit in RAM (the common production case — global 1/12° u,v,w for two time levels alone is ~10–21 GiB), Parcels v4 must open fields lazily with dask. In that regime field sampling is **orders of magnitude slower than it needs to be**, and the **unstructured path can't be made lazy at all** as currently written. This proposes two changes:

1. **Unstructured interpolators: replace `field.data.values[...]` with vectorized `.isel(...)`** — a prerequisite for out-of-core UxGrid fields and for any caching layer.
2. **Add a `WindowedArray` wrapper** that transparently keeps only the time level(s) bracketing the simulation clock resident as NumPy, loading on demand and retiring stale levels — a near-drop-in that removes both the read-amplification and the dask scheduling overhead.

Full measured evidence + reproducers: **https://github.com/FluidNumerics/uxxarray-perf-modeling**

## Motivation (measured)

On an identical structured-grid advection (2000 particles, 480 steps), dask-backed vs `.load()`-ed-into-RAM:

| field backing | wall time |
|---|---|
| in-memory (numpy) | **6.6 s** |
| dask-backed (on-disk) | **2156 s** (≈ **327×**) |

This is **not** disk-bound — the field is cache-resident and the process is CPU-pinned. Two independent causes:

- **Read amplification.** Per kernel step, scattered per-particle `isel` pulls whole chunks; the same time-slab chunks are re-read every step. Measured 40–52× bytes-read vs bytes-used (and small chunks collapse to fio-4 KiB-random throughput, ~34 MB/s).
- **dask per-`compute()` scheduling overhead.** With the field *fully in RAM* (`.persist()`, zero I/O), a single scattered `isel().compute()` is **~200× slower than the numpy equivalent** (0.3 ms → 63 ms). Profiling (cProfile + VizTracer) attributes it to the threaded scheduler's lock/condition machinery plus per-call graph build/tokenize — not array work.

We checked the dask-only escape hatches; none close the gap:
- `scheduler="synchronous"` + larger chunks: ~6–12× faster, but **floors at ~47× numpy** (graph build/tokenize is paid per `.compute()`).
- Batching U/V/W into one `dask.compute()`: reduces call count (RK2-3D: 6→2) but **not wall time** — cost is per-task, not per-call.
- Bigger chunks: fewer tasks (faster in RAM) but read amplification is unchanged on disk — only pays off *inside* a resident window.

The structural fix is to stop sampling lazily per step: read the bracketing time level(s) **once** into NumPy, reuse across all particles and sub-steps, evict when the clock advances. A prototype of exactly this is **6× faster and reads 12× less data** at production scale (20 GB Atlantic), with **byte-identical** trajectories.

## Root cause in the code

- **Structured (`interpolators/_xinterpolators.py`)** samples lazily: `data.isel(selection_dict).data.reshape(...)` (lines 82, 244, 373, 563) then `value.compute() if is_dask_collection(value)` (135, 299–300, 381, 570, 626). Lazy seam exists.
- **Unstructured (`interpolators/_uxinterpolators.py`)** samples via **`field.data.values[ti, zi, fi]`** (lines 20, 41–42, 65, 87–88). `.values` **materializes the entire field**, then NumPy fancy-indexes. For a dask-backed UxGrid field this requires the whole field in memory (or recomputes it per call) → **precludes out-of-core operation**, and leaves no lazy seam to cache behind.
- `.sel` is not used anywhere in the interpolators.

## Proposal 1 — unstructured interpolators → `.isel`

Index *before* materializing, mirroring the structured path:

```python
# now (materializes the whole field):
field.data.values[ti, zi, fi]
field.data.values[ti[:, None], zi[:, None], node_ids]

# proposed (lazy, only the requested points):
field.data.isel(time=ti, n_z=zi, n_face=fi).data
field.data.isel(time=ti[:, None], n_z=zi[:, None], n_node=node_ids).data
```

Benefits independent of caching: UxGrid fields become genuinely out-of-core-capable (no full materialization / no OOM on large meshes), and the structured & unstructured paths share one lazy access idiom. The mesh **connectivity** (`uxgrid.face_node_connectivity`, node coords) stays a separate, static, one-time load — only data variables are touched lazily.

## Proposal 2 — `WindowedArray`

A thin wrapper around `field.data` that exploits the fact that **all particles share the simulation clock**, so any access touches only `{ti, ti+1}`:

```python
class WindowedArray:
    """Wrap a lazy DataArray; .isel() loads only the requested time levels to
    NumPy, retires levels below the current minimum, and gathers on the small
    resident block. All other attributes forward to the wrapped array."""
    # on .isel(time=..., ...):
    #   1. levels = unique(time indices)
    #   2. load missing levels -> NumPy (one bulk read each); evict levels < min(levels)
    #   3. gather on the stacked resident block (vectorized isel in NumPy)
```

Because the result is **NumPy**, the structured interpolator's `value.compute() if is_dask_collection(value)` guard auto-skips `.compute()` — so this removes the re-reads **and** the scheduler tax with no interpolator change. Integration is one line where `Field.data` is set (e.g. in `FieldSet.from_sgrid_conventions` / `from_ugrid_conventions`, guarded by "is dask-backed"):

```python
field.data = WindowedArray(field.data)
```

Verified in the reference repo: identical to dask (`max|Δ| = 0`), each time level loaded **once** (20 loads / 60 steps vs 120 naive gathers), and **≤ 2 levels resident** at any time. Works for XGrid today; works for UxGrid once Proposal 1 lands (the wrapper is grid-shape-agnostic — it windows the leading `time` axis and forwards the rest).

## Why this design

The working set is tiny even when the dataset is enormous: a simulation only ever needs the two time levels bracketing "now." Windowing turns "hundreds of thousands of tiny scattered dask gathers" into "one bulk sequential read per time level + free NumPy indexing," with memory bounded to ~2 levels regardless of series length.

## Scope / open questions

- **Correctness guard:** valid only while requested time indices stay within the resident window. Add a cheap assert (or fall back to a wider load) for kernels that sample arbitrary past/future times. Needs care for **backward-in-time** runs and **multiple `dt`s**.
- **Prefetch** the next level on a background thread to overlap the boundary reload (prototype has this in the standalone sampler).
- **Spatial/depth subsetting** (load only the particles' bbox / depth range) is a natural follow-on for very large meshes.
- Assumes `time` is the leading dim of `field.data` (true for the current `Field` construction).

## Alternatives considered

- **`dask.cache.Cache(...).register()`** — zero-code, caches loaded chunks across `.compute()` calls (removes re-reads) but keeps the per-call scheduler tax; eviction is byte-budget/LRU, not time-aware.
- **Custom xarray `BackendArray` duck-array** — the most robust/idiomatic home for the window logic (xarray's lazy-indexing adapter calls `__getitem__` with windowed indices), but more plumbing. Could be the long-term form of `WindowedArray`.
- **dask-only tuning** (`synchronous` scheduler, larger chunks, batched compute) — worth setting as defaults for the lazy path, but caps at ~47× numpy; doesn't replace windowing.

## Proposed work / acceptance criteria

- [ ] Switch `_uxinterpolators.py` data access from `.values[...]` to vectorized `.isel(...)`; confirm unchanged results on existing UxGrid tests and that a chunked/dask UxGrid field runs without full materialization.
- [ ] Add `WindowedArray` (or a `BackendArray` equivalent) with load-on-demand + retire-stale + optional prefetch and a window-span assertion.
- [ ] Wire `field.data = WindowedArray(field.data)` for dask-backed fields in the `FieldSet` constructors (opt-in flag initially).
- [ ] Benchmark: identical trajectories vs the current path; report speedup + peak memory on a large structured and a large unstructured case.

*Filed by Claude Code on behalf of Joe Schoonover (FluidNumerics). Evidence and reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling*


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Out-of-core fields: move unstructured interpolators to .isel, and add a time-windowed array cache (WindowedArray) #2656

Summary

Motivation (measured)

Root cause in the code

Proposal 1 — unstructured interpolators → `.isel`

Proposal 2 — `WindowedArray`

Why this design

Scope / open questions

Alternatives considered

Proposed work / acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

field backing	wall time
in-memory (numpy)	6.6 s
dask-backed (on-disk)	2156 s (≈ 327×)

Out-of-core fields: move unstructured interpolators to .isel, and add a time-windowed array cache (WindowedArray) #2656

Description

Summary

Motivation (measured)

Root cause in the code

Proposal 1 — unstructured interpolators → .isel

Proposal 2 — WindowedArray

Why this design

Scope / open questions

Alternatives considered

Proposed work / acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Proposal 1 — unstructured interpolators → `.isel`

Proposal 2 — `WindowedArray`