feat(models): port GLM-OCR (ViT + GLM-4 text OCR)#646

Merged: inureyes merged 1 commit into main from feat/issue-547-glm-ocr on Jul 3, 2026

@inureyes commented Jul 3, 2026

Summary

Adds GLM-OCR (glm_ocr, GlmOcrForConditionalGeneration) support, a document-OCR sibling of the in-tree GLM-4V stack. A 24-block ViT vision tower feeds a 16-layer GLM-4 text decoder driven by full-width even/odd MRoPE. Closes #547.

Architecture

  • Vision tower (src/vision/encoders/glm_ocr.rs), a variant of the glm4v encoder:
    • Per-head q/k RMSNorm over head_dim (64) before the 2D vision rotary (glm4v has none).
    • Block norms take rms_norm_eps (1e-5); no learned position embedding and no post-conv norm (the checkpoint ships neither).
    • patch_embed and downsample detect both the channels-last export layout and the raw channels-second checkpoint layout.
    • Patch-order fix (OCR-critical): after the patch embedding, patches are reordered from the processor's raster order into spatial-merge-window order, so the rotary, the consecutive-4 downsample, and the text-side merged-token grid all agree. glm4v leaves this latent misalignment in place, which is tolerable for coarse VQA but makes OCR misread text because each merged token is built from scrambled patches.
  • Text backbone reuses Glm4vTextModel unchanged. The loader lifts the nested rope_parameters block (mrope_section [16, 24, 24], partial_rotary_factor 1.0) into the fields the config deserializes, and drops the next-n prediction (MTP) layer at index num_hidden_layers.
  • Processor reuses Qwen2VLProcessor with the OCR pixel bounds (min 12544, max 9633792).
  • Wiring: detection -> loader -> LoadedModel::GlmOcr -> VlmRuntimeRef::Qwen -> server/CLI, plus model_metadata and TP dispatch rows.
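
The patch-order fix described above can be illustrated with a small index permutation. This is a minimal sketch, not the code from this PR: the function name `raster_to_merge_window` is hypothetical, the merge size of 2 is inferred from the "consecutive-4" downsample (2 x 2 = 4), and it assumes windows are visited row-major with patches inside each window also row-major.

```rust
/// Builds a permutation `perm` such that `reordered[i] = raster[perm[i]]`.
///
/// `grid_h` and `grid_w` are the patch-grid dimensions and `merge` is the
/// spatial merge size (assumed to be 2 for a consecutive-4 downsample).
/// Returns `None` when the grid does not divide evenly into merge windows.
pub fn raster_to_merge_window(grid_h: usize, grid_w: usize, merge: usize) -> Option<Vec<usize>> {
    if merge == 0 || grid_h % merge != 0 || grid_w % merge != 0 {
        return None;
    }
    let mut perm = Vec::with_capacity(grid_h * grid_w);
    // Visit merge windows in row-major order (assumption).
    for win_r in 0..grid_h / merge {
        for win_c in 0..grid_w / merge {
            // Inside a window, visit patches in row-major order (assumption).
            for dr in 0..merge {
                for dc in 0..merge {
                    let r = win_r * merge + dr;
                    let c = win_c * merge + dc;
                    // Raster index of patch (r, c).
                    perm.push(r * grid_w + c);
                }
            }
        }
    }
    Some(perm)
}
```

For a 4 x 4 grid with `merge = 2`, the first four entries are the raster indices 0, 1, 4, 5, that is, the top-left 2 x 2 window. Every run of four consecutive patches then belongs to one window, which is the property the downsample and the merged-token grid rely on.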

Validation

Validated on mlx-community/GLM-OCR-4bit through the actual CLI runtime:

  • Text Recognition on a spatially-asymmetric 4-quadrant image returns ALPHA 111 BRAVO 222 / CHARLIE 333 DELTA 444 in correct spatial reading order (clean text, no character scrambling), confirming the patch-order fix.
  • Table Recognition emits correct structured HTML matching every cell in row/column order.

Automated tests:

  • Co-located unit tests: conv-layout detection (both patch-embed and downsample layouts), the raster->merge-window permutation, rope_parameters lifting, and the MTP-layer drop.
  • tests/glm_ocr_parity.rs: config-detection tests (run in CI without a checkpoint) plus checkpoint-gated real-model load and text-forward tests.
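
One of the behaviours listed above, the MTP-layer drop, amounts to filtering weight keys by layer index. The sketch below is illustrative only: `is_layer_key` and `drop_mtp_layer` are hypothetical names, and the `layers.<idx>.` key pattern is an assumption about the checkpoint's naming rather than something stated in this PR.

```rust
/// Returns true when `key` addresses decoder layer `layer_idx`, assuming
/// (hypothetically) that layer weights contain a `layers.<idx>.` path segment.
pub fn is_layer_key(key: &str, layer_idx: usize) -> bool {
    let needle = format!("layers.{}.", layer_idx);
    // Match at the start of the key or after a `.` separator, so that
    // a segment such as `sublayers.16.` is not treated as a match.
    key.starts_with(&needle) || key.contains(&format!(".{}", needle))
}

/// Drops every key belonging to the extra layer stored at index
/// `num_hidden_layers`, which is where the next-n prediction (MTP) layer sits.
pub fn drop_mtp_layer(keys: Vec<String>, num_hidden_layers: usize) -> Vec<String> {
    keys.into_iter()
        .filter(|k| !is_layer_key(k, num_hidden_layers))
        .collect()
}
```

With `num_hidden_layers = 16`, a key such as `model.layers.16.mlp.up_proj.weight` would be removed while `model.layers.15.…` and `model.layers.160.…` keys are kept, because the trailing `.` in the pattern prevents prefix collisions.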

cargo test, cargo fmt --all -- --check, and cargo clippy --all-targets --features metal,accelerate are clean for the new code.

Add GLM-OCR (`glm_ocr`, `GlmOcrForConditionalGeneration`), a document-OCR
sibling of the in-tree GLM-4V stack: a 24-block ViT feeds a 16-layer GLM-4
text decoder driven by full-width even/odd MRoPE.

Vision tower (`src/vision/encoders/glm_ocr.rs`), a variant of the glm4v
encoder: per-head q/k RMSNorm before the 2D vision rotary, block norms take
`rms_norm_eps` (1e-5), and it drops the glm4v-only learned position embedding
and post-conv norm. Patch-embed and downsample detect both the channels-last
export layout and the raw channels-second checkpoint layout. The tower
reorders patches from the processor's raster order into spatial-merge-window
order after the patch embedding so the rotary, the consecutive-4 downsample,
and the text-side merged-token grid all agree; glm4v leaves this latent
misalignment in place (fine for VQA, but OCR reads scrambled patches wrong).
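
As a rough illustration of the layout detection mentioned above, a loader can compare the weight shape against the known input-channel count. This is a sketch under stated assumptions, not the PR's implementation: `ConvLayout` and `detect_conv_layout` are hypothetical names, and the heuristic (checking which axis equals `in_channels`) is one plausible approach.

```rust
/// The two conv-weight layouts the loader has to tell apart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConvLayout {
    /// `[out, ..kernel dims.., in]`, the channels-last export layout.
    ChannelsLast,
    /// `[out, in, ..kernel dims..]`, the raw channels-second checkpoint layout.
    ChannelsSecond,
}

/// Guesses the layout of a conv weight from its shape and the expected
/// number of input channels. Returns `None` when the shape is too short,
/// matches neither layout, or is ambiguous (both axes equal `in_channels`).
pub fn detect_conv_layout(shape: &[usize], in_channels: usize) -> Option<ConvLayout> {
    if shape.len() < 3 {
        return None;
    }
    let second_matches = shape[1] == in_channels;
    let last_matches = shape[shape.len() - 1] == in_channels;
    match (second_matches, last_matches) {
        (true, false) => Some(ConvLayout::ChannelsSecond),
        (false, true) => Some(ConvLayout::ChannelsLast),
        // Ambiguous or mismatched: let the caller decide how to fail.
        _ => None,
    }
}
```

With three input channels, an illustrative shape of `[1024, 3, 14, 14]` reads as channels-second and `[1024, 14, 14, 3]` as channels-last. Returning `None` for the ambiguous case (for example a 3 x 3 kernel over 3 channels) keeps a wrong guess from silently transposing weights.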

Text backbone reuses `Glm4vTextModel` unchanged. The loader lifts the nested
`rope_parameters` block (mrope_section [16, 24, 24], partial_rotary_factor
1.0) into the fields the config expects and drops the next-n prediction (MTP)
layer at index `num_hidden_layers`. The processor reuses `Qwen2VLProcessor`
with the OCR pixel bounds (min 12544, max 9633792).
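
To show what the pixel bounds constrain, here is a sketch of the resize rule commonly used by Qwen2-VL style processors: snap each side to a multiple of a factor, then rescale if the area falls outside `[min_pixels, max_pixels]`. Whether the in-tree `Qwen2VLProcessor` follows exactly this rule is an assumption, as is the factor of 28 (patch size 14 times merge size 2); `smart_resize` is an illustrative name.

```rust
/// Picks a target (height, width) whose sides are multiples of `factor`
/// and whose area lies within `[min_pixels, max_pixels]` where possible,
/// while roughly preserving the aspect ratio.
pub fn smart_resize(
    height: u32,
    width: u32,
    factor: u32,
    min_pixels: u64,
    max_pixels: u64,
) -> (u32, u32) {
    let f = factor as f64;
    let (h, w) = (height as f64, width as f64);
    // Snap each side to the nearest multiple of `factor`, at least one factor.
    let mut h_bar = ((h / f).round() * f).max(f);
    let mut w_bar = ((w / f).round() * f).max(f);
    if h_bar * w_bar > max_pixels as f64 {
        // Too large: shrink both sides by a common ratio and round down.
        let beta = (h * w / max_pixels as f64).sqrt();
        h_bar = (h / beta / f).floor() * f;
        w_bar = (w / beta / f).floor() * f;
    } else if h_bar * w_bar < min_pixels as f64 {
        // Too small: grow both sides by a common ratio and round up.
        let beta = (min_pixels as f64 / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() * f;
        w_bar = (w * beta / f).ceil() * f;
    }
    (h_bar as u32, w_bar as u32)
}
```

Under these assumptions, `smart_resize(56, 56, 28, 12_544, 9_633_792)` scales a tiny crop up to 112 x 112, which is exactly the 12544-pixel minimum, while very large scans are scaled down until their area fits under the 9633792-pixel maximum.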

Wired end to end: detection -> loader -> `LoadedModel::GlmOcr` ->
`VlmRuntimeRef::Qwen` -> server/CLI, plus metadata and TP dispatch rows.

Validated on `mlx-community/GLM-OCR-4bit` through the CLI: Text Recognition
reads a 4-quadrant image in correct spatial order and Table Recognition emits
correct HTML. Co-located unit tests cover conv-layout detection, the
raster->merge-window permutation, `rope_parameters` lifting, and the MTP
drop; `tests/glm_ocr_parity.rs` adds config-detection tests plus
checkpoint-gated forward tests.

Closes #547
@inureyes added the type:enhancement and area:models labels Jul 3, 2026
@inureyes merged commit 68c659b into main Jul 3, 2026
5 checks passed
@inureyes deleted the feat/issue-547-glm-ocr branch July 3, 2026 02:41