feat(models): port Granite 4 Vision (window-QFormer + Granite-4 hybrid) #649
Merged
Granite 4 Vision (granite4_vision) is IBM's document VLM with multi-depth visual injection. A SigLIP tower feeds eight window-QFormer projectors (four deepstack mean-pool downsamplers and four spatial strided-offset downsamplers, each a Blip2 QFormer with self and cross attention per 4x4/8x8 window) whose packed outputs are added into the residual stream at eight depths of a granitemoehybrid text backbone during prefill, rather than merged once into the input embeddings. Reuses the shared AnyRes tiling and the four Granite scalar multipliers.
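For reviewers, a minimal sketch of the multi-depth injection described above. It assumes plain `Vec<Vec<f32>>` hidden states in place of the real array type, and the names `zero_image_slots`, `add_stream`, and `prefill` are hypothetical, chosen for illustration rather than taken from the port. It shows only the control flow: image slots start at zero, and each packed stream is added at the image-token positions right before its target layer.

```rust
/// Hidden states as [seq_len][hidden_dim]; a stand-in for the real tensor type (assumption).
type Hidden = Vec<Vec<f32>>;

/// Zero the embedding rows at the image-token positions, once, before the first layer.
fn zero_image_slots(hidden: &mut Hidden, image_positions: &[usize]) {
    for &pos in image_positions {
        for v in hidden[pos].iter_mut() {
            *v = 0.0;
        }
    }
}

/// Add one packed visual stream into the residual stream.
/// `features[i]` is added to the row at `image_positions[i]`.
fn add_stream(hidden: &mut Hidden, image_positions: &[usize], features: &[Vec<f32>]) {
    assert_eq!(image_positions.len(), features.len());
    for (&pos, feat) in image_positions.iter().zip(features) {
        for (h, f) in hidden[pos].iter_mut().zip(feat) {
            *h += *f;
        }
    }
}

/// Prefill pass: before each layer, inject every stream that targets that layer.
/// `streams` pairs a target layer index with its packed features;
/// `layer` stands in for a decoder layer (hypothetical signature).
fn prefill(
    mut hidden: Hidden,
    num_layers: usize,
    image_positions: &[usize],
    streams: &[(usize, Vec<Vec<f32>>)],
    layer: impl Fn(usize, Hidden) -> Hidden,
) -> Hidden {
    zero_image_slots(&mut hidden, image_positions);
    for idx in 0..num_layers {
        for (target, features) in streams {
            if *target == idx {
                add_stream(&mut hidden, image_positions, features);
            }
        }
        hidden = layer(idx, hidden);
    }
    hidden
}
```

Because the streams accumulate in the residual stream, a text-only prompt (no image positions, no streams) takes the same path with the injection loop doing nothing.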
The granitemoehybrid dense path gains a fused shared-MLP variant so the all-attention text backbone loads directly from the quantized fused shared_mlp.input_linear. The chat-template engine normalizes multi-line image-marker string literals and gains the Python str.index/str.find methods the Granite template uses to place the image token. The injection helper captures the functional slice_update result (a dropped return left every image slot zero and silently made the model text-only); a unit test guards the write.
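The dropped-return bug is easy to reproduce in miniature. The sketch below is an illustration under stated assumptions, not the port's code: `slice_update` here is a hypothetical stand-in over `&[f32]` that only mimics the functional shape (return a new buffer, leave the input untouched), and both `inject_*` functions are invented names. It overwrites a contiguous range for brevity, whereas the real helper works on image-token positions.

```rust
/// Functional update: returns a NEW buffer with `src` written at `start`.
/// `dst` itself is not modified. Hypothetical stand-in, not the real array API.
#[must_use = "slice_update returns the updated buffer; the input is not modified"]
fn slice_update(dst: &[f32], src: &[f32], start: usize) -> Vec<f32> {
    let mut out = dst.to_vec();
    out[start..start + src.len()].copy_from_slice(src);
    out
}

/// Buggy shape: the result is discarded, so the zeroed image slots stay zero.
fn inject_dropping_result(hidden: Vec<f32>, features: &[f32], start: usize) -> Vec<f32> {
    let _ = slice_update(&hidden, features, start); // updated buffer thrown away
    hidden
}

/// Fixed shape: return (or rebind to) the buffer that the update produced.
fn inject_capturing_result(hidden: Vec<f32>, features: &[f32], start: usize) -> Vec<f32> {
    slice_update(&hidden, features, start)
}
```

A guard test in this style only needs to assert that the written positions are non-zero after injection. Marking the update `#[must_use]` also turns a bare dropped call into a compiler warning, though an explicit `let _ =` still silences it.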
Validated on mlx-community/granite-4.0-3b-vision-4bit: reads a table image ("The price of Bread in the table is $2.00.") and describes a photograph, matching the mlx-vlm reference.
Closes #541
Ports IBM's Granite 4 Vision (`granite4_vision`), a document VLM with multi-depth visual injection.

## Architecture
- A SigLIP tower (`GELU(approx="fast")`) feeds 8 window-QFormer projectors: 4 deepstack (mean-pool downsampler) + 4 spatial (strided-offset), each a Blip2 QFormer with self + cross attention per `4x4`/`8x8` window.
- The packed outputs are added into the residual stream at 8 depths of the `granitemoehybrid` text backbone during prefill: image slots are zeroed in `inputs_embeds`, then each stream is scatter-added at the image-token positions right before its target layer (mirrors the qwen3-vl DeepStack shape).

## Supporting changes
- `granitemoehybrid`: a fused shared-MLP `FeedForward` variant so the all-attention text backbone (this checkpoint has no Mamba/MoE layers) loads directly from the quantized fused `shared_mlp.input_linear`.
- Chat-template engine: normalizes multi-line image-marker string literals and gains the Python `str.index`/`str.find` methods (the Granite template calls `.index()` to place the image token).
- `inject_at_positions`: capture the functional `slice_update` return. A dropped return left every image slot zero and silently ran the model text-only; a co-located unit test now guards that the features actually reach the hidden state.

## Validation
- `mlx-community/granite-4.0-3b-vision-4bit` on M1 Ultra:
  - `What is the price of Bread in this table?` -> `The price of Bread in the table is $2.00.` (matches the mlx-vlm reference).
  - `Describe the animals in this image.` -> `The image features two cats, both with striped fur, resting on a pink surface.`
- `docs/supported-models.md` and `src/models/detection.rs` updated.
- `cargo fmt` clean; `clippy -p mlxcel --lib --tests -- -D warnings` clean for the changed code.

Closes #541