feat(models): port GLM-OCR (ViT + GLM-4 text OCR) #646
Merged
Add GLM-OCR (`glm_ocr`, `GlmOcrForConditionalGeneration`), a document-OCR sibling of the in-tree GLM-4V stack: a 24-block ViT feeds a 16-layer GLM-4 text decoder driven by full-width even/odd MRoPE.

Vision tower (`src/vision/encoders/glm_ocr.rs`), a variant of the glm4v encoder: it applies per-head q/k RMSNorm before the 2D vision rotary, its block norms take `rms_norm_eps` (1e-5), and it drops the glm4v-only learned position embedding and post-conv norm. Patch-embed and downsample detect both the channels-last export layout and the raw channels-second checkpoint layout.

After the patch embedding, the tower reorders patches from the processor's raster order into spatial-merge-window order so that the rotary, the consecutive-4 downsample, and the text-side merged-token grid all agree. glm4v leaves this latent misalignment in place, which is tolerable for VQA but makes OCR misread the scrambled patches.

Text backbone reuses `Glm4vTextModel` unchanged. The loader lifts the nested `rope_parameters` block (`mrope_section` [16, 24, 24], `partial_rotary_factor` 1.0) into the fields the config expects and drops the next-n prediction (MTP) layer at index `num_hidden_layers`. The processor reuses `Qwen2VLProcessor` with the OCR pixel bounds (min 12544, max 9633792).

Wired end to end: detection -> loader -> `LoadedModel::GlmOcr` -> `VlmRuntimeRef::Qwen` -> server/CLI, plus metadata and TP dispatch rows.

Validated on `mlx-community/GLM-OCR-4bit` through the CLI: Text Recognition reads a 4-quadrant image in correct spatial order and Table Recognition emits correct HTML. Co-located unit tests cover conv-layout detection, the raster->merge-window permutation, `rope_parameters` lifting, and the MTP drop; `tests/glm_ocr_parity.rs` adds config-detection tests plus checkpoint-gated forward tests.

Closes #547
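For readers unfamiliar with the raster -> merge-window reorder described above, here is a minimal sketch of the index permutation. It assumes a patch grid of `grid_h` x `grid_w` patches and a square merge window of side `merge` (2 would give the consecutive-4 groups mentioned above). The function name and signature are hypothetical and are not taken from `glm_ocr.rs`.

```rust
/// Builds the raster -> merge-window permutation for a patch grid.
///
/// `perm[k]` is the raster index (row * grid_w + col) of the patch that sits
/// at position `k` in merge-window order. Windows are visited row by row, and
/// the `merge * merge` patches inside each window are emitted consecutively,
/// so every run of `merge * merge` entries is one spatial window.
///
/// Hypothetical helper for illustration; not the function used in the PR.
fn raster_to_merge_window(grid_h: usize, grid_w: usize, merge: usize) -> Vec<usize> {
    assert!(merge > 0, "merge window must be non-empty");
    assert!(
        grid_h % merge == 0 && grid_w % merge == 0,
        "grid must be divisible by the merge window"
    );

    let mut perm = Vec::with_capacity(grid_h * grid_w);
    for window_row in 0..grid_h / merge {
        for window_col in 0..grid_w / merge {
            // Walk the patches inside this window in raster order.
            for dy in 0..merge {
                for dx in 0..merge {
                    let row = window_row * merge + dy;
                    let col = window_col * merge + dx;
                    perm.push(row * grid_w + col);
                }
            }
        }
    }
    perm
}
```

On a 4 x 4 grid with `merge = 2`, this yields `[0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]`: the first four entries are the top-left 2 x 2 window, so a downsample that consumes four consecutive patches sees spatial neighbours rather than four patches from the same raster row.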
Summary
Adds GLM-OCR (`glm_ocr`, `GlmOcrForConditionalGeneration`) support, a document-OCR sibling of the in-tree GLM-4V stack. A 24-block ViT vision tower feeds a 16-layer GLM-4 text decoder driven by full-width even/odd MRoPE. Closes #547.

Architecture
- Vision tower (`src/vision/encoders/glm_ocr.rs`), a variant of the glm4v encoder:
  - Per-head q/k RMSNorm over `head_dim` (64) before the 2D vision rotary (glm4v has none).
  - Block norms take `rms_norm_eps` (1e-5); no learned position embedding and no post-conv norm (the checkpoint ships neither).
  - `patch_embed` and `downsample` detect both the channels-last export layout and the raw channels-second checkpoint layout.
- Text backbone reuses `Glm4vTextModel` unchanged. The loader lifts the nested `rope_parameters` block (`mrope_section` [16, 24, 24], `partial_rotary_factor` 1.0) into the fields the config deserializes, and drops the next-n prediction (MTP) layer at index `num_hidden_layers`.
- Processor reuses `Qwen2VLProcessor` with the OCR pixel bounds (min 12544, max 9633792).
- Wired end to end: detection -> loader -> `LoadedModel::GlmOcr` -> `VlmRuntimeRef::Qwen` -> server/CLI, plus `model_metadata` and TP dispatch rows.

Validation
Validated on `mlx-community/GLM-OCR-4bit` through the actual CLI runtime:
- Text Recognition reads a 4-quadrant image as `ALPHA 111 BRAVO 222 / CHARLIE 333 DELTA 444` in correct spatial reading order (clean text, no character scrambling), confirming the patch-order fix.
- Table Recognition emits correct HTML.

Automated tests:
- Co-located unit tests cover conv-layout detection, the raster->merge-window permutation, `rope_parameters` lifting, and the MTP-layer drop.
- `tests/glm_ocr_parity.rs`: config-detection tests (run in CI without a checkpoint) plus checkpoint-gated real-model load and text-forward tests.
- `cargo test`, `cargo fmt --all -- --check`, and `cargo clippy --all-targets --features metal,accelerate` are clean for the new code.
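As an illustration of the MTP-layer drop covered by the unit tests, the sketch below removes every tensor that belongs to decoder layer `num_hidden_layers` (one past the last regular layer) from a name -> tensor map. The key prefix `model.language_model.layers.`, the tensor names, and both function names are assumptions made for this example; the actual key layout in the checkpoint and the loader's real API may differ.

```rust
use std::collections::HashMap;

/// True when `key` names a tensor in decoder layer `num_hidden_layers`.
/// Regular layers use indices `0..num_hidden_layers`, so this index is the
/// extra next-n prediction (MTP) layer.
///
/// `layer_prefix` is a hypothetical key prefix, not taken from the PR.
fn is_mtp_key(key: &str, layer_prefix: &str, num_hidden_layers: usize) -> bool {
    key.strip_prefix(layer_prefix)
        // The layer index is the first dot-separated segment after the prefix.
        .and_then(|rest| rest.split('.').next())
        .and_then(|index| index.parse::<usize>().ok())
        == Some(num_hidden_layers)
}

/// Removes all MTP-layer tensors from `weights` in place and returns how
/// many entries were dropped. Keys outside `layer_prefix` (for example the
/// vision tower) are never touched.
fn drop_mtp_layer<T>(
    weights: &mut HashMap<String, T>,
    layer_prefix: &str,
    num_hidden_layers: usize,
) -> usize {
    let before = weights.len();
    weights.retain(|key, _| !is_mtp_key(key, layer_prefix, num_hidden_layers));
    before - weights.len()
}
```

Parsing the index as a whole number, rather than matching a substring, keeps a key such as `...layers.160.` from being mistaken for layer 16, and anchoring on the prefix leaves same-numbered vision blocks alone.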