Summary
For datasets whose full time series doesn't fit in RAM (the common production case — global 1/12° u,v,w for two time levels alone is ~10–21 GiB), Parcels v4 must open fields lazily with dask. In that regime field sampling is orders of magnitude slower than it needs to be, and the unstructured path can't be made lazy at all as currently written. This proposes two changes:
- Unstructured interpolators: replace
field.data.values[...] with vectorized .isel(...) — a prerequisite for out-of-core UxGrid fields and for any caching layer.
- Add a
WindowedArray wrapper that transparently keeps only the time level(s) bracketing the simulation clock resident as NumPy, loading on demand and retiring stale levels — a near-drop-in that removes both the read-amplification and the dask scheduling overhead.
Full measured evidence + reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling
Motivation (measured)
On an identical structured-grid advection (2000 particles, 480 steps), dask-backed vs .load()-ed-into-RAM:
| field backing |
wall time |
| in-memory (numpy) |
6.6 s |
| dask-backed (on-disk) |
2156 s (≈ 327×) |
This is not disk-bound — the field is cache-resident and the process is CPU-pinned. Two independent causes:
- Read amplification. Per kernel step, scattered per-particle
isel pulls whole chunks; the same time-slab chunks are re-read every step. Measured 40–52× bytes-read vs bytes-used (and small chunks collapse to fio-4 KiB-random throughput, ~34 MB/s).
- dask per-
compute() scheduling overhead. With the field fully in RAM (.persist(), zero I/O), a single scattered isel().compute() is ~200× slower than the numpy equivalent (0.3 ms → 63 ms). Profiling (cProfile + VizTracer) attributes it to the threaded scheduler's lock/condition machinery plus per-call graph build/tokenize — not array work.
We checked the dask-only escape hatches; none close the gap:
scheduler="synchronous" + larger chunks: ~6–12× faster, but floors at ~47× numpy (graph build/tokenize is paid per .compute()).
- Batching U/V/W into one
dask.compute(): reduces call count (RK2-3D: 6→2) but not wall time — cost is per-task, not per-call.
- Bigger chunks: fewer tasks (faster in RAM) but read amplification is unchanged on disk — only pays off inside a resident window.
The structural fix is to stop sampling lazily per step: read the bracketing time level(s) once into NumPy, reuse across all particles and sub-steps, evict when the clock advances. A prototype of exactly this is 6× faster and reads 12× less data at production scale (20 GB Atlantic), with byte-identical trajectories.
Root cause in the code
- Structured (
interpolators/_xinterpolators.py) samples lazily: data.isel(selection_dict).data.reshape(...) (lines 82, 244, 373, 563) then value.compute() if is_dask_collection(value) (135, 299–300, 381, 570, 626). Lazy seam exists.
- Unstructured (
interpolators/_uxinterpolators.py) samples via field.data.values[ti, zi, fi] (lines 20, 41–42, 65, 87–88). .values materializes the entire field, then NumPy fancy-indexes. For a dask-backed UxGrid field this requires the whole field in memory (or recomputes it per call) → precludes out-of-core operation, and leaves no lazy seam to cache behind.
.sel is not used anywhere in the interpolators.
Proposal 1 — unstructured interpolators → .isel
Index before materializing, mirroring the structured path:
# now (materializes the whole field):
field.data.values[ti, zi, fi]
field.data.values[ti[:, None], zi[:, None], node_ids]
# proposed (lazy, only the requested points):
field.data.isel(time=ti, n_z=zi, n_face=fi).data
field.data.isel(time=ti[:, None], n_z=zi[:, None], n_node=node_ids).data
Benefits independent of caching: UxGrid fields become genuinely out-of-core-capable (no full materialization / no OOM on large meshes), and the structured & unstructured paths share one lazy access idiom. The mesh connectivity (uxgrid.face_node_connectivity, node coords) stays a separate, static, one-time load — only data variables are touched lazily.
Proposal 2 — WindowedArray
A thin wrapper around field.data that exploits the fact that all particles share the simulation clock, so any access touches only {ti, ti+1}:
class WindowedArray:
"""Wrap a lazy DataArray; .isel() loads only the requested time levels to
NumPy, retires levels below the current minimum, and gathers on the small
resident block. All other attributes forward to the wrapped array."""
# on .isel(time=..., ...):
# 1. levels = unique(time indices)
# 2. load missing levels -> NumPy (one bulk read each); evict levels < min(levels)
# 3. gather on the stacked resident block (vectorized isel in NumPy)
Because the result is NumPy, the structured interpolator's value.compute() if is_dask_collection(value) guard auto-skips .compute() — so this removes the re-reads and the scheduler tax with no interpolator change. Integration is one line where Field.data is set (e.g. in FieldSet.from_sgrid_conventions / from_ugrid_conventions, guarded by "is dask-backed"):
field.data = WindowedArray(field.data)
Verified in the reference repo: identical to dask (max|Δ| = 0), each time level loaded once (20 loads / 60 steps vs 120 naive gathers), and ≤ 2 levels resident at any time. Works for XGrid today; works for UxGrid once Proposal 1 lands (the wrapper is grid-shape-agnostic — it windows the leading time axis and forwards the rest).
Why this design
The working set is tiny even when the dataset is enormous: a simulation only ever needs the two time levels bracketing "now." Windowing turns "hundreds of thousands of tiny scattered dask gathers" into "one bulk sequential read per time level + free NumPy indexing," with memory bounded to ~2 levels regardless of series length.
Scope / open questions
- Correctness guard: valid only while requested time indices stay within the resident window. Add a cheap assert (or fall back to a wider load) for kernels that sample arbitrary past/future times. Needs care for backward-in-time runs and multiple
dts.
- Prefetch the next level on a background thread to overlap the boundary reload (prototype has this in the standalone sampler).
- Spatial/depth subsetting (load only the particles' bbox / depth range) is a natural follow-on for very large meshes.
- Assumes
time is the leading dim of field.data (true for the current Field construction).
Alternatives considered
dask.cache.Cache(...).register() — zero-code, caches loaded chunks across .compute() calls (removes re-reads) but keeps the per-call scheduler tax; eviction is byte-budget/LRU, not time-aware.
- Custom xarray
BackendArray duck-array — the most robust/idiomatic home for the window logic (xarray's lazy-indexing adapter calls __getitem__ with windowed indices), but more plumbing. Could be the long-term form of WindowedArray.
- dask-only tuning (
synchronous scheduler, larger chunks, batched compute) — worth setting as defaults for the lazy path, but caps at ~47× numpy; doesn't replace windowing.
Proposed work / acceptance criteria
Filed by Claude Code on behalf of Joe Schoonover (FluidNumerics). Evidence and reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling
Summary
For datasets whose full time series doesn't fit in RAM (the common production case — global 1/12° u,v,w for two time levels alone is ~10–21 GiB), Parcels v4 must open fields lazily with dask. In that regime field sampling is orders of magnitude slower than it needs to be, and the unstructured path can't be made lazy at all as currently written. This proposes two changes:
field.data.values[...]with vectorized.isel(...)— a prerequisite for out-of-core UxGrid fields and for any caching layer.WindowedArraywrapper that transparently keeps only the time level(s) bracketing the simulation clock resident as NumPy, loading on demand and retiring stale levels — a near-drop-in that removes both the read-amplification and the dask scheduling overhead.Full measured evidence + reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling
Motivation (measured)
On an identical structured-grid advection (2000 particles, 480 steps), dask-backed vs
.load()-ed-into-RAM:This is not disk-bound — the field is cache-resident and the process is CPU-pinned. Two independent causes:
iselpulls whole chunks; the same time-slab chunks are re-read every step. Measured 40–52× bytes-read vs bytes-used (and small chunks collapse to fio-4 KiB-random throughput, ~34 MB/s).compute()scheduling overhead. With the field fully in RAM (.persist(), zero I/O), a single scatteredisel().compute()is ~200× slower than the numpy equivalent (0.3 ms → 63 ms). Profiling (cProfile + VizTracer) attributes it to the threaded scheduler's lock/condition machinery plus per-call graph build/tokenize — not array work.We checked the dask-only escape hatches; none close the gap:
scheduler="synchronous"+ larger chunks: ~6–12× faster, but floors at ~47× numpy (graph build/tokenize is paid per.compute()).dask.compute(): reduces call count (RK2-3D: 6→2) but not wall time — cost is per-task, not per-call.The structural fix is to stop sampling lazily per step: read the bracketing time level(s) once into NumPy, reuse across all particles and sub-steps, evict when the clock advances. A prototype of exactly this is 6× faster and reads 12× less data at production scale (20 GB Atlantic), with byte-identical trajectories.
Root cause in the code
interpolators/_xinterpolators.py) samples lazily:data.isel(selection_dict).data.reshape(...)(lines 82, 244, 373, 563) thenvalue.compute() if is_dask_collection(value)(135, 299–300, 381, 570, 626). Lazy seam exists.interpolators/_uxinterpolators.py) samples viafield.data.values[ti, zi, fi](lines 20, 41–42, 65, 87–88)..valuesmaterializes the entire field, then NumPy fancy-indexes. For a dask-backed UxGrid field this requires the whole field in memory (or recomputes it per call) → precludes out-of-core operation, and leaves no lazy seam to cache behind..selis not used anywhere in the interpolators.Proposal 1 — unstructured interpolators →
.iselIndex before materializing, mirroring the structured path:
Benefits independent of caching: UxGrid fields become genuinely out-of-core-capable (no full materialization / no OOM on large meshes), and the structured & unstructured paths share one lazy access idiom. The mesh connectivity (
uxgrid.face_node_connectivity, node coords) stays a separate, static, one-time load — only data variables are touched lazily.Proposal 2 —
WindowedArrayA thin wrapper around
field.datathat exploits the fact that all particles share the simulation clock, so any access touches only{ti, ti+1}:Because the result is NumPy, the structured interpolator's
value.compute() if is_dask_collection(value)guard auto-skips.compute()— so this removes the re-reads and the scheduler tax with no interpolator change. Integration is one line whereField.datais set (e.g. inFieldSet.from_sgrid_conventions/from_ugrid_conventions, guarded by "is dask-backed"):Verified in the reference repo: identical to dask (
max|Δ| = 0), each time level loaded once (20 loads / 60 steps vs 120 naive gathers), and ≤ 2 levels resident at any time. Works for XGrid today; works for UxGrid once Proposal 1 lands (the wrapper is grid-shape-agnostic — it windows the leadingtimeaxis and forwards the rest).Why this design
The working set is tiny even when the dataset is enormous: a simulation only ever needs the two time levels bracketing "now." Windowing turns "hundreds of thousands of tiny scattered dask gathers" into "one bulk sequential read per time level + free NumPy indexing," with memory bounded to ~2 levels regardless of series length.
Scope / open questions
dts.timeis the leading dim offield.data(true for the currentFieldconstruction).Alternatives considered
dask.cache.Cache(...).register()— zero-code, caches loaded chunks across.compute()calls (removes re-reads) but keeps the per-call scheduler tax; eviction is byte-budget/LRU, not time-aware.BackendArrayduck-array — the most robust/idiomatic home for the window logic (xarray's lazy-indexing adapter calls__getitem__with windowed indices), but more plumbing. Could be the long-term form ofWindowedArray.synchronousscheduler, larger chunks, batched compute) — worth setting as defaults for the lazy path, but caps at ~47× numpy; doesn't replace windowing.Proposed work / acceptance criteria
_uxinterpolators.pydata access from.values[...]to vectorized.isel(...); confirm unchanged results on existing UxGrid tests and that a chunked/dask UxGrid field runs without full materialization.WindowedArray(or aBackendArrayequivalent) with load-on-demand + retire-stale + optional prefetch and a window-span assertion.field.data = WindowedArray(field.data)for dask-backed fields in theFieldSetconstructors (opt-in flag initially).Filed by Claude Code on behalf of Joe Schoonover (FluidNumerics). Evidence and reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling