feat(models): port Granite 4 Vision (window-QFormer + Granite-4 hybrid) by inureyes · Pull Request #649 · lablup/mlxcel

inureyes · 2026-07-03T08:49:48Z

Ports IBM's Granite 4 Vision (granite4_vision), a document VLM with multi-depth visual injection.

Architecture

A SigLIP tower (sigmoid-approximated GELU, approx="fast") feeds 8 window-QFormer projectors: 4 deepstack (mean-pool downsampler) and 4 spatial (strided-offset downsampler), each a Blip2 QFormer with self and cross attention per 4x4/8x8 window.
The 8 packed streams are injected into the residual stream at 8 depths (text layers 0, 3, 6, 9, 12, 15, 18, 21) of a granitemoehybrid text backbone during prefill. Image slots are first zeroed in inputs_embeds; each stream is then scatter-added at the image-token positions right before its target layer (this mirrors the qwen3-vl DeepStack shape).
Reuses the shared AnyRes tiling and the four Granite scalar multipliers.
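
The injection schedule described above can be illustrated with a minimal sketch. It assumes plain nested `Vec<f32>` rows in place of MLX arrays, and `text_layer` is a hypothetical identity placeholder for a real transformer layer; none of these names come from the mlxcel source.

```rust
/// Hidden states as [seq_len][hidden] rows; a toy stand-in for an MLX array.
type Hidden = Vec<Vec<f32>>;

/// Zero the embedding rows at the image-token positions (done once, before layer 0).
fn zero_image_slots(mut h: Hidden, image_pos: &[usize]) -> Hidden {
    for &p in image_pos {
        h[p].iter_mut().for_each(|v| *v = 0.0);
    }
    h
}

/// Add one packed visual stream into the residual stream.
/// `stream[i]` is the feature row for the i-th image token.
fn scatter_add(mut h: Hidden, image_pos: &[usize], stream: &[Vec<f32>]) -> Hidden {
    assert_eq!(image_pos.len(), stream.len());
    for (&p, feat) in image_pos.iter().zip(stream) {
        for (dst, src) in h[p].iter_mut().zip(feat) {
            *dst += *src;
        }
    }
    h
}

/// Hypothetical text layer: identity here, so only the injections change `h`.
fn text_layer(h: Hidden, _layer: usize) -> Hidden {
    h
}

/// Prefill loop: each `(target_layer, stream)` pair is added right before
/// its target layer runs.
fn prefill(
    embeds: Hidden,
    image_pos: &[usize],
    streams: &[(usize, Vec<Vec<f32>>)],
    num_layers: usize,
) -> Hidden {
    let mut h = zero_image_slots(embeds, image_pos);
    for layer in 0..num_layers {
        if let Some((_, s)) = streams.iter().find(|(t, _)| *t == layer) {
            h = scatter_add(h, image_pos, s);
        }
        h = text_layer(h, layer);
    }
    h
}
```

In this sketch the text-token rows pass through unchanged, while each image-token row accumulates one contribution per target depth instead of receiving a single merged embedding at the input.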

Supporting changes

granitemoehybrid: a fused shared-MLP FeedForward variant so the all-attention text backbone (this checkpoint has no Mamba/MoE layers) loads directly from the quantized fused shared_mlp.input_linear.
Chat-template engine: normalize multi-line image-marker string literals, and add the Python str.index/str.find methods (the Granite template calls .index() to place the image token).
inject_at_positions: capture the functional slice_update return. A dropped return left every image slot zero and silently ran the model text-only; a co-located unit test now guards that the features actually reach the hidden state.
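
The dropped-return failure mode is easy to reproduce in isolation. The sketch below is a toy model, not the mlxcel code: `slice_update` here is a hand-written functional update over a `Vec<f32>`, and `inject_buggy` / `inject_fixed` are hypothetical names chosen for illustration.

```rust
/// Functional update: returns a NEW buffer and leaves `src` untouched.
/// The attribute makes the compiler warn when a caller ignores the result.
#[must_use = "slice_update returns a new buffer; the input is not modified"]
fn slice_update(src: &[f32], start: usize, update: &[f32]) -> Vec<f32> {
    let mut out = src.to_vec();
    out[start..start + update.len()].copy_from_slice(update);
    out
}

/// Buggy shape: the updated buffer is discarded, so `hidden` keeps its zeros.
fn inject_buggy(hidden: Vec<f32>, positions: &[usize], feats: &[f32]) -> Vec<f32> {
    for (&p, &f) in positions.iter().zip(feats) {
        let _ = slice_update(&hidden, p, &[f]); // result dropped
    }
    hidden
}

/// Fixed shape: the returned buffer is captured on every iteration.
fn inject_fixed(mut hidden: Vec<f32>, positions: &[usize], feats: &[f32]) -> Vec<f32> {
    for (&p, &f) in positions.iter().zip(feats) {
        hidden = slice_update(&hidden, p, &[f]);
    }
    hidden
}
```

A guard in the spirit of the co-located unit test only needs to assert that the output differs from the zeroed input at the image positions, which the buggy variant fails.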

Validation

mlx-community/granite-4.0-3b-vision-4bit on M1 Ultra:

Table OCR: What is the price of Bread in this table? -> The price of Bread in the table is $2.00. (matches the mlx-vlm reference).
Photo: Describe the animals in this image. -> The image features two cats, both with striped fur, resting on a pink surface.

docs/supported-models.md and src/models/detection.rs updated. cargo fmt clean; clippy -p mlxcel --lib --tests -- -D warnings clean for the changed code.

Closes #541

Granite 4 Vision (granite4_vision) is IBM's document VLM with multi-depth visual injection. A SigLIP tower feeds eight window-QFormer projectors (four deepstack mean-pool downsamplers and four spatial strided-offset downsamplers, each a Blip2 QFormer with self and cross attention per 4x4/8x8 window) whose packed outputs are added into the residual stream at eight depths of a granitemoehybrid text backbone during prefill, rather than merged once into the input embeddings. Reuses the shared AnyRes tiling and the four Granite scalar multipliers. The granitemoehybrid dense path gains a fused shared-MLP variant so the all-attention text backbone loads directly from the quantized fused shared_mlp.input_linear. The chat-template engine normalizes multi-line image-marker string literals and gains the Python str.index/str.find methods the Granite template uses to place the image token. The injection helper captures the functional slice_update result (a dropped return left every image slot zero and silently made the model text-only); a unit test guards the write. Validated on mlx-community/granite-4.0-3b-vision-4bit: reads a table image ("The price of Bread in the table is $2.00.") and describes a photograph, matching the mlx-vlm reference. Closes #541
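
For the str.index/str.find addition mentioned above, the Python semantics a template engine has to reproduce can be sketched as follows. The function names `py_find` and `py_index` are hypothetical, and this is only an illustration of the Python behaviour (find returns -1 on a miss, index raises ValueError), not the engine's actual implementation.

```rust
/// Python-style `str.find`: index of the first occurrence, or -1 if absent.
fn py_find(haystack: &str, needle: &str) -> i64 {
    match haystack.find(needle) {
        // Rust's `str::find` yields a byte offset; Python reports a
        // code-point index, so count the characters before the match.
        Some(byte_off) => haystack[..byte_off].chars().count() as i64,
        None => -1,
    }
}

/// Python-style `str.index`: like `find`, but a miss is an error
/// (Python raises ValueError; this sketch returns `Err`).
fn py_index(haystack: &str, needle: &str) -> Result<i64, String> {
    match py_find(haystack, needle) {
        -1 => Err("substring not found".to_string()),
        i => Ok(i),
    }
}
```

The byte-offset to character-index conversion matters only when non-ASCII text precedes the image marker, but without it a template that slices on the returned index would cut at the wrong place.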

inureyes added the area:models and type:enhancement labels Jul 3, 2026

inureyes merged commit 472408c into main Jul 3, 2026
5 checks passed

inureyes deleted the wip/issue-541-granite4-vision branch July 3, 2026 08:52

inureyes mentioned this pull request Jul 3, 2026

fix(vision): captured slice_update in DeepStack/section-assembly injection loops (qwen3_vl, qwen3_vl_moe, qwen2_vl, paddleocr_vl) #650

Open

inureyes merged 1 commit into main from wip/issue-541-granite4-vision
