MatBio - Analysis Pipeline

Code accompanying [paper citation]. Analyzes the spatial organization of cells in tissue images to distinguish between tissue conditions using spatial statistics and machine learning.

Dependencies

Dependencies are pinned in conda-lock.yml (a multi-platform lock for linux-64, osx-64, and osx-arm64, generated from environment.yml with conda-lock). Create the environment with an exact, reproducible install:

conda-lock install --name SCSAPreduced conda-lock.yml

This works with conda, mamba, or micromamba as the backend. If you don't have conda-lock, install it with pip install conda-lock or micromamba install -c conda-forge conda-lock.

To solve the environment fresh (unpinned) instead of using the lock, or to regenerate the lock after editing environment.yml:

micromamba env create -f environment.yml   # fresh solve from the spec
conda-lock lock -f environment.yml -p linux-64 -p osx-64 -p osx-arm64  # regenerate the lock

Data

The tissue datasets analyzed in the paper are not publicly available and are not distributed with this repository. To make the pipeline runnable and to document the expected input format, src/simulate_data.py generates a small synthetic dataset with the same structure the pipeline consumes — it is an illustrative stand-in, not real data.

The pipeline reads a single CSV in the following unified schema (one row per cell), and src/preprocessing.py groups it into per-image tables:

Column	Description
`cell_id`	Integer cell identifier
`x`, `y`	Float cell-centre coordinates
`cell_type`	Cell-type label (blank → `undefined`)
`image_id`	Image identifier; rows are grouped by this
`pathology`	Condition / disease label (the classification target)
`patient`	Patient identifier (may be blank)

To use your own data, convert it to this schema and point raw_csv (or raw_data_dir) in the config at it.
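
As a rough illustration of the schema above, the sketch below reads a CSV with those seven columns and groups the rows by image_id using only the standard library. It is not the code in src/preprocessing.py. The function name read_cells, the comma delimiter, and every value in the example rows are assumptions made up for illustration.

```python
import csv
import io

# Column names taken from the schema table; the order is an assumption.
REQUIRED_COLUMNS = ["cell_id", "x", "y", "cell_type", "image_id", "pathology", "patient"]


def read_cells(handle):
    """Hypothetical helper: parse unified-schema rows and group them by image_id."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    by_image = {}
    for row in reader:
        cell = {
            "cell_id": int(row["cell_id"]),
            "x": float(row["x"]),
            "y": float(row["y"]),
            # A blank cell type is mapped to "undefined", as the schema describes.
            "cell_type": row["cell_type"].strip() or "undefined",
            "pathology": row["pathology"],
            # The patient identifier may be blank.
            "patient": row["patient"].strip() or None,
        }
        by_image.setdefault(row["image_id"], []).append(cell)
    return by_image


# Made-up example rows, not real or simulated study data.
example = io.StringIO(
    "cell_id,x,y,cell_type,image_id,pathology,patient\n"
    "1,10.5,20.0,tumour,img_A,condition_1,P01\n"
    "2,11.0,22.5,,img_A,condition_1,P01\n"
    "3,5.0,7.25,immune_1,img_B,condition_2,\n"
)
tables = read_cells(example)
print({image: len(cells) for image, cells in tables.items()})
```

A check like this can be useful before pointing raw_csv at a converted file, since a missing column is then reported up front rather than partway through a run.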

Usage

The recommended starting point is the tutorial notebook notebooks/pipeline.ipynb, which walks through the full analysis end-to-end — preprocessing, features, KS tests, random-forest classification, and the HSIC independence test — with plots inline.

To reproduce all result tables headlessly (e.g. on a cluster), run the CLI orchestrator, which executes the same five stages (preprocess → features → KS tests → random forest → HSIC) and writes its outputs to results/:

python -m src.pipeline --config configs/default.yaml --output-dir results/
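
For readers unfamiliar with the KS stage, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic and a Benjamini-Hochberg adjustment in plain Python. It is not the implementation in src/stats.py. The function names ks_statistic and benjamini_hochberg and the example p-values are assumptions for illustration only.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, one per hypothetical feature.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

The adjustment matters because one test is run per feature. Without it, the chance of at least one spurious "significant" feature grows with the number of features tested.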

Try it on synthetic data

Since the study CSVs are not distributed (see Data), you can run the pipeline end-to-end on synthetic data instead. Generate a small dataset (two conditions, one tumour + three immune cell types) and run the demo config:

python -m src.simulate_data data/raw/simulated_data.csv
python -m src.pipeline --config configs/demo.yaml --output-dir results/demo

Or import individual stages directly:

from src.preprocessing import load_dataset, build_system
from src.features import compute_local_density, compute_katic_order
from src.stats import ks_test_per_feature, adjust_pvalues
from src.ml import cross_validate_random_forest
from src.hsic import hsic_test, hsic_sweep

Code structure

File	Stage
`src/preprocessing.py`	Load centroids, deduplicate, encode cell types, build simulation boxes
`src/features.py`	Local density and k-atic order parameters (psi_k) per cell and per cell-type combination
`src/stats.py`	Univariate two-sample tests (KS) with BH/BY correction and KDE-grid plots
`src/ml.py`	Cross-validated random-forest classification with bootstrap confidence intervals and feature-group selection
`src/hsic.py`	HSIC test for independence between the psi (k-atic order) and local-density feature blocks, with permutation p-values and a cutoff × split sweep
`src/pipeline.py`	End-to-end orchestrator; also the CLI entry point
`notebooks/pipeline.ipynb`	Tutorial notebook walking through the full analysis
`src/simulate_data.py`	Generate a small synthetic dataset in the unified schema for demos/testing
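
To give a feel for the two per-cell features named in the table, here is a minimal sketch of a local density and a k-atic order parameter computed from neighbours within a cutoff radius. It assumes the common textbook definition psi_k = |(1/N) sum_j exp(i k theta_j)|, where theta_j is the angle of the bond to neighbour j. The definitions in src/features.py may differ (for example in how neighbours are chosen or how cell-type combinations are handled), and all function names here are hypothetical.

```python
import cmath
import math


def neighbours_within(points, i, cutoff):
    """Indices of all points within `cutoff` of point i (excluding i itself)."""
    xi, yi = points[i]
    return [
        j for j, (x, y) in enumerate(points)
        if j != i and math.hypot(x - xi, y - yi) <= cutoff
    ]


def local_density(points, i, cutoff):
    """Neighbour count divided by the area of the cutoff disc (an assumed definition)."""
    return len(neighbours_within(points, i, cutoff)) / (math.pi * cutoff ** 2)


def katic_order(points, i, k, cutoff):
    """Modulus of the mean of exp(i*k*theta) over the bonds to neighbours of point i."""
    nbrs = neighbours_within(points, i, cutoff)
    if not nbrs:
        return 0.0  # assumption: isolated cells get zero order
    xi, yi = points[i]
    total = sum(
        cmath.exp(1j * k * math.atan2(points[j][1] - yi, points[j][0] - xi))
        for j in nbrs
    )
    return abs(total) / len(nbrs)


# A centre cell surrounded by six neighbours on a regular hexagon of radius 1.
hexagon = [(0.0, 0.0)] + [
    (math.cos(n * math.pi / 3), math.sin(n * math.pi / 3)) for n in range(6)
]
psi6 = katic_order(hexagon, 0, k=6, cutoff=1.1)  # close to 1 for perfect sixfold order
psi4 = katic_order(hexagon, 0, k=4, cutoff=1.1)  # close to 0: no fourfold order
density = local_density(hexagon, 0, cutoff=1.1)
print(psi6, psi4, density)
```

The hexagon example shows why a range of k values is informative: the same neighbourhood scores high for the symmetry it has and low for the ones it lacks.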

Configuration

Edit configs/default.yaml to set the dataset path, bond cutoff, k-range, and cell-type definitions.

License

Copyright The Regents of the University of Michigan.

Released under the BSD 3-Clause License — see LICENSE.
