feat(models): port Granite 4 Vision (window-QFormer + Granite-4 hybrid)#649

Merged: inureyes merged 1 commit into main from wip/issue-541-granite4-vision on Jul 3, 2026

inureyes (Member) commented Jul 3, 2026

Ports IBM's Granite 4 Vision (granite4_vision), a document VLM with multi-depth visual injection.

Architecture

  • A SigLIP tower (sigmoid-approximated GELU, i.e. GELU(approx="fast")) feeds 8 window-QFormer projectors: 4 deepstack (mean-pool downsampler) + 4 spatial (strided-offset downsampler), each a Blip2 QFormer with self- and cross-attention per 4x4/8x8 window.
  • The 8 packed streams are injected into the residual stream at 8 depths (text layers 0, 3, 6, 9, 12, 15, 18, 21) of a granitemoehybrid text backbone during prefill: image slots are zeroed in inputs_embeds, then each stream is scatter-added at the image-token positions right before its target layer (mirrors the qwen3-vl DeepStack shape).
  • Reuses the shared AnyRes tiling and the four Granite scalar multipliers.
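
The injection schedule above can be sketched in plain Rust. This is a minimal illustration, not the code from this PR: the hidden state is a `Vec<Vec<f32>>` of shape `[seq][hidden]`, and `zero_image_slots`, `scatter_add`, and `prefill` are hypothetical names. Only the eight layer indices come from the description.

```rust
/// Text-layer indices that receive a visual stream (from the PR description).
const INJECTION_LAYERS: [usize; 8] = [0, 3, 6, 9, 12, 15, 18, 21];

/// Zero the embedding rows at the image-token positions.
fn zero_image_slots(hidden: &mut [Vec<f32>], image_positions: &[usize]) {
    for &pos in image_positions {
        hidden[pos].iter_mut().for_each(|v| *v = 0.0);
    }
}

/// Add one packed stream (one feature row per image token) at the image-token positions.
fn scatter_add(hidden: &mut [Vec<f32>], image_positions: &[usize], stream: &[Vec<f32>]) {
    assert_eq!(image_positions.len(), stream.len());
    for (&pos, feat) in image_positions.iter().zip(stream) {
        for (h, f) in hidden[pos].iter_mut().zip(feat) {
            *h += *f;
        }
    }
}

/// Hypothetical prefill loop: stream k is added right before layer INJECTION_LAYERS[k].
/// `layer` stands in for a transformer block and mutates the hidden state in place.
fn prefill<F>(
    mut hidden: Vec<Vec<f32>>,
    image_positions: &[usize],
    streams: &[Vec<Vec<f32>>],
    num_layers: usize,
    mut layer: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(usize, &mut Vec<Vec<f32>>),
{
    zero_image_slots(&mut hidden, image_positions);
    for layer_idx in 0..num_layers {
        if let Some(k) = INJECTION_LAYERS.iter().position(|&l| l == layer_idx) {
            scatter_add(&mut hidden, image_positions, &streams[k]);
        }
        layer(layer_idx, &mut hidden);
    }
    hidden
}
```

With identity layers, an image slot ends up holding the sum of all eight streams, while text positions keep their original embeddings.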

Supporting changes

  • granitemoehybrid: a fused shared-MLP FeedForward variant so the all-attention text backbone (this checkpoint has no Mamba/MoE layers) loads directly from the quantized fused shared_mlp.input_linear.
  • Chat-template engine: normalize multi-line image-marker string literals, and add the Python str.index/str.find methods (the Granite template calls .index() to place the image token).
  • inject_at_positions: capture the functional slice_update return. A dropped return left every image slot zero and silently ran the model text-only; a co-located unit test now guards that the features actually reach the hidden state.
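
The dropped-return bug in the last bullet is easy to reproduce with any functional update. The sketch below uses a hypothetical stand-in `slice_update` over a flat `f32` buffer. It is not the MLX binding or the real `inject_at_positions`, whose signatures are not shown in this PR.

```rust
/// Functional update: returns a new buffer and leaves `src` untouched.
/// Hypothetical stand-in, not the real slice_update API.
#[must_use]
fn slice_update(src: &[f32], start: usize, update: &[f32]) -> Vec<f32> {
    let mut out = src.to_vec();
    out[start..start + update.len()].copy_from_slice(update);
    out
}

/// Buggy shape: the updated buffer is discarded, so the zeroed slots stay zero.
fn inject_buggy(hidden: Vec<f32>, start: usize, features: &[f32]) -> Vec<f32> {
    let _ = slice_update(&hidden, start, features); // result dropped
    hidden
}

/// Fixed shape: the functional result is captured and returned.
fn inject_fixed(hidden: Vec<f32>, start: usize, features: &[f32]) -> Vec<f32> {
    slice_update(&hidden, start, features)
}
```

A guard test in the spirit of the one described only needs to check that the injected positions are non-zero afterwards. Marking the functional helper `#[must_use]` is one way to have the compiler flag a silently dropped result.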

Validation

mlx-community/granite-4.0-3b-vision-4bit on M1 Ultra:

  • Table OCR: "What is the price of Bread in this table?" -> "The price of Bread in the table is $2.00." (matches the mlx-vlm reference).
  • Photo: "Describe the animals in this image." -> "The image features two cats, both with striped fur, resting on a pink surface."

docs/supported-models.md and src/models/detection.rs updated. cargo fmt clean; clippy -p mlxcel --lib --tests -- -D warnings clean for the changed code.

Closes #541

Granite 4 Vision (granite4_vision) is IBM's document VLM with multi-depth visual injection. A SigLIP tower feeds eight window-QFormer projectors (four deepstack mean-pool downsamplers and four spatial strided-offset downsamplers, each a Blip2 QFormer with self and cross attention per 4x4/8x8 window) whose packed outputs are added into the residual stream at eight depths of a granitemoehybrid text backbone during prefill, rather than merged once into the input embeddings. Reuses the shared AnyRes tiling and the four Granite scalar multipliers.
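
As a rough illustration of the mean-pool downsampling mentioned above, the sketch below averages a row-major grid over non-overlapping windows. It assumes scalar features and plain averaging. The commit message does not spell out the projector's exact pooling, and `mean_pool_windows` is a hypothetical name.

```rust
/// Mean-pool a row-major `height x width` grid of scalars over non-overlapping
/// `window x window` tiles. Returns the pooled grid, also row-major.
fn mean_pool_windows(grid: &[f32], height: usize, width: usize, window: usize) -> Vec<f32> {
    assert_eq!(grid.len(), height * width);
    assert!(window > 0 && height % window == 0 && width % window == 0);
    let (out_h, out_w) = (height / window, width / window);
    let mut pooled = Vec::with_capacity(out_h * out_w);
    for wy in 0..out_h {
        for wx in 0..out_w {
            // Sum every element of the tile whose top-left corner is (wy*window, wx*window).
            let mut sum = 0.0f32;
            for dy in 0..window {
                for dx in 0..window {
                    sum += grid[(wy * window + dy) * width + wx * window + dx];
                }
            }
            pooled.push(sum / (window * window) as f32);
        }
    }
    pooled
}
```

A real projector would pool feature vectors rather than scalars, but the indexing is the same per channel.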

The granitemoehybrid dense path gains a fused shared-MLP variant so the all-attention text backbone loads directly from the quantized fused shared_mlp.input_linear. The chat-template engine normalizes multi-line image-marker string literals and gains the Python str.index/str.find methods the Granite template uses to place the image token. The injection helper captures the functional slice_update result (a dropped return left every image slot zero and silently made the model text-only); a unit test guards the write.
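
A fused shared MLP of this kind can be sketched as follows. The sketch assumes the fused `input_linear` output is laid out as `[gate | up]`, split in half and combined as `silu(gate) * up` before `output_linear`. That layout and activation are assumptions for illustration, the weights are plain unquantized `f32` matrices, and all function names are hypothetical.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense layer without bias; `weight` is stored as [out][in].
fn linear(weight: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    weight
        .iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

/// Hypothetical fused shared-MLP forward pass for a single token.
/// `input_linear` has 2 * intermediate rows: the first half is the gate
/// projection, the second half the up projection (assumed layout).
fn fused_shared_mlp(
    input_linear: &[Vec<f32>],
    output_linear: &[Vec<f32>],
    x: &[f32],
) -> Vec<f32> {
    let fused = linear(input_linear, x);
    let (gate, up) = fused.split_at(fused.len() / 2);
    let hidden: Vec<f32> = gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect();
    linear(output_linear, &hidden)
}
```

Keeping the gate and up projections in one matrix means a fused weight can be loaded as a single tensor and split only at the activation.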

Validated on mlx-community/granite-4.0-3b-vision-4bit: reads a table image ("The price of Bread in the table is $2.00.") and describes a photograph, matching the mlx-vlm reference.

Closes #541
@inureyes inureyes added the area:models (Model architectures, weights, loading, metadata) and type:enhancement (New features, capabilities, or significant additions) labels Jul 3, 2026
@inureyes inureyes merged commit 472408c into main Jul 3, 2026
5 checks passed
@inureyes inureyes deleted the wip/issue-541-granite4-vision branch July 3, 2026 08:52
Development

Successfully merging this pull request may close these issues.

feat(models): port Granite 4 Vision (ViT + Granite-4 hybrid text)
