Code accompanying [paper citation]. Analyzes the spatial organization of cells in tissue images to distinguish between tissue conditions using spatial statistics and machine learning.
Dependencies are pinned in conda-lock.yml (a multi-platform lock for
linux-64, osx-64, and osx-arm64, generated from environment.yml with
conda-lock). Create the environment with an exact, reproducible install:

    conda-lock install --name SCSAPreduced conda-lock.yml

This works with conda, mamba, or micromamba as the backend. If you don't have
conda-lock, install it with pip install conda-lock or
micromamba install -c conda-forge conda-lock.
To solve the environment fresh (unpinned) instead of using the lock, or to
regenerate the lock after editing environment.yml:

    micromamba env create -f environment.yml  # fresh solve from the spec
    conda-lock lock -f environment.yml -p linux-64 -p osx-64 -p osx-arm64  # regenerate the lock

The tissue datasets analyzed in the paper are not publicly available and
are not distributed with this repository. To make the pipeline runnable and to
document the expected input format, src/simulate_data.py generates a small
synthetic dataset with the same structure the pipeline consumes — it is an
illustrative stand-in, not real data.
The pipeline reads a single CSV in the following unified schema (one row per
cell), and src/preprocessing.py groups it into per-image tables:

| Column | Description |
|---|---|
| cell_id | Integer cell identifier |
| x, y | Float cell-centre coordinates |
| cell_type | Cell-type label (blank → undefined) |
| image_id | Image identifier; rows are grouped by this |
| pathology | Condition / disease label (the classification target) |
| patient | Patient identifier (may be blank) |

To use your own data, convert it to this schema and point raw_csv (or
raw_data_dir) in the config at it.
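
Before pointing the config at a converted file, it can help to check that the file matches the schema. The following is a minimal sketch using only the standard library. The column names come from the table above, but the helper `validate_rows`, the literal label `"undefined"` for blank cell types, and the three example rows are illustrative assumptions, not part of this repository or of the study data.

```python
import csv
import io

# Column names are taken from the schema table; everything else is hypothetical.
REQUIRED_COLUMNS = ["cell_id", "x", "y", "cell_type", "image_id", "pathology", "patient"]


def validate_rows(handle):
    """Parse a unified-schema CSV and return its rows grouped by image_id."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    images = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            cell = {
                "cell_id": int(row["cell_id"]),
                "x": float(row["x"]),
                "y": float(row["y"]),
                # Assumption: a blank cell_type is stored as the string "undefined".
                "cell_type": row["cell_type"].strip() or "undefined",
                "pathology": row["pathology"].strip(),
                # The patient column may be blank; represent that as None.
                "patient": row["patient"].strip() or None,
            }
        except ValueError as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
        images.setdefault(row["image_id"], []).append(cell)
    return images


# Made-up rows, for illustration only.
example = io.StringIO(
    "cell_id,x,y,cell_type,image_id,pathology,patient\n"
    "1,10.5,20.0,tumour,img_A,condition_1,p01\n"
    "2,11.0,22.5,,img_A,condition_1,p01\n"
    "3,5.0,7.5,immune_1,img_B,condition_2,\n"
)
images = validate_rows(example)
print({image_id: len(cells) for image_id, cells in images.items()})
```

The grouping step only mirrors the idea of per-image tables; the actual loading logic lives in src/preprocessing.py and may differ.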
The recommended starting point is the tutorial notebook notebooks/pipeline.ipynb,
which walks through the full analysis end-to-end — preprocessing, features, KS
tests, random-forest classification, and the HSIC independence test — with
plots inline.
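
For readers new to the KS stage, here is a minimal standard-library sketch of the two ideas involved: the two-sample KS statistic and Benjamini-Hochberg (BH) adjustment of a set of p-values. The function names `ks_statistic` and `bh_adjust` are hypothetical and are not the repository's API. The sketch does not compute KS p-values (in practice a library routine such as `scipy.stats.ks_2samp` does that) and does not cover the BY variant.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Adjusting across features matters here because one test is run per feature, so unadjusted p-values would overstate the number of features that differ between conditions.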
To reproduce all result tables headlessly (e.g. on a cluster), run the CLI
orchestrator, which executes the same five stages (preprocess → features → KS
tests → random forest → HSIC) and writes its outputs to results/:

    python -m src.pipeline --config configs/default.yaml --output-dir results/

Since the study CSVs are not distributed (see Data), you can run the pipeline end-to-end on synthetic data instead. Generate a small dataset (two conditions, one tumour + three immune cell types) and run the demo config:

    python -m src.simulate_data data/raw/simulated_data.csv
    python -m src.pipeline --config configs/demo.yaml --output-dir results/demo

Or import individual stages directly:

    from src.preprocessing import load_dataset, build_system
    from src.features import compute_local_density, compute_katic_order
    from src.stats import ks_test_per_feature, adjust_pvalues
    from src.ml import cross_validate_random_forest
    from src.hsic import hsic_test, hsic_sweep

| File | Stage |
|---|---|
| src/preprocessing.py | Load centroids, deduplicate, encode cell types, build simulation boxes |
| src/features.py | Local density and k-atic order parameters (psi_k) per cell and per cell-type combination |
| src/stats.py | Univariate two-sample tests (KS) with BH/BY correction and KDE-grid plots |
| src/ml.py | Cross-validated random-forest classification with bootstrap confidence intervals and feature-group selection |
| src/hsic.py | HSIC test for independence between the psi (k-atic order) and local-density feature blocks, with permutation p-values and a cutoff × split sweep |
| src/pipeline.py | End-to-end orchestrator; also the CLI entry point |
| notebooks/pipeline.ipynb | Tutorial notebook walking through the full analysis |
| src/simulate_data.py | Generate a small synthetic dataset in the unified schema for demos/testing |

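
To give a sense of what an HSIC independence test with permutation p-values involves, here is a small standard-library sketch. It uses the biased HSIC estimator with Gaussian kernels and a fixed bandwidth. The estimator choice, the bandwidth, the function names, and the toy data are all assumptions for illustration; the implementation in src/hsic.py may differ, and the cutoff × split sweep is not shown.

```python
import math
import random


def gaussian_gram(points, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over a list of feature vectors."""
    def kernel(u, v):
        sq = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return [[kernel(u, v) for v in points] for u in points]


def center(gram):
    """Double-centre a symmetric Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic_statistic(centered_k, gram_l, perm=None):
    """Biased HSIC: (1/n^2) * sum_ij (HKH)_ij * L_ij, optionally with L permuted."""
    n = len(gram_l)
    idx = perm if perm is not None else list(range(n))
    total = sum(centered_k[i][j] * gram_l[idx[i]][idx[j]]
                for i in range(n) for j in range(n))
    return total / n ** 2


def hsic_permutation_test(xs, ys, n_perm=200, sigma=1.0, seed=0):
    """Return (observed HSIC, permutation p-value) for paired samples xs, ys."""
    rng = random.Random(seed)
    kc = center(gaussian_gram(xs, sigma))
    gl = gaussian_gram(ys, sigma)
    observed = hsic_statistic(kc, gl)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)  # break the pairing between the two blocks
        if hsic_statistic(kc, gl, perm) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


# Toy data: ys depends on xs, so a small p-value is expected.
rng = random.Random(1)
xs = [(rng.uniform(-2.0, 2.0),) for _ in range(40)]
ys = [(x[0] + 0.1 * rng.gauss(0.0, 1.0),) for x in xs]
stat, p_value = hsic_permutation_test(xs, ys)
print(stat, p_value)
```

Permuting one block's sample order destroys any dependence while keeping each block's marginal distribution, which is why the permuted statistics serve as a null reference.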
Edit configs/default.yaml to set the dataset path, bond cutoff, k-range,
and cell-type definitions.
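
To illustrate what the bond cutoff and k-range control, here is a minimal sketch of a k-atic order parameter and a local density for a single cell. It assumes the common definition psi_k(j) = (1/N_j) Σ_l exp(i·k·θ_jl) over neighbours l within the cutoff, reports the magnitude, and takes local density as neighbours per unit disc area. These definitions and the function names are assumptions; the ones used in src/features.py may differ (for example in boundary handling or per cell-type combinations).

```python
import cmath
import math


def neighbours(points, j, cutoff):
    """Indices of all other points within the bond cutoff of point j."""
    xj, yj = points[j]
    return [l for l, (x, y) in enumerate(points)
            if l != j and math.hypot(x - xj, y - yj) <= cutoff]


def katic_order(points, j, k, cutoff):
    """|psi_k| for cell j: magnitude of the mean of exp(i*k*theta) over its bonds."""
    xj, yj = points[j]
    bonds = neighbours(points, j, cutoff)
    if not bonds:
        return 0.0  # assumption: cells with no neighbours get zero order
    total = sum(
        cmath.exp(1j * k * math.atan2(points[l][1] - yj, points[l][0] - xj))
        for l in bonds
    )
    return abs(total / len(bonds))


def local_density(points, j, cutoff):
    """Neighbours within the cutoff, divided by the area of the cutoff disc."""
    return len(neighbours(points, j, cutoff)) / (math.pi * cutoff ** 2)


# One central cell surrounded by six neighbours on a regular hexagon of radius 1.
hexagon = [(0.0, 0.0)] + [
    (math.cos(i * math.pi / 3), math.sin(i * math.pi / 3)) for i in range(6)
]
for k in (4, 6):
    print(k, katic_order(hexagon, 0, k, cutoff=1.5))
print(local_density(hexagon, 0, cutoff=1.5))
```

For the hexagonal arrangement the k = 6 value is close to 1 and the k = 4 value is close to 0, which is the sense in which a range of k values probes different local symmetries.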
Copyright The Regents of the University of Michigan.
Released under the BSD 3-Clause License — see LICENSE.