feat(models): port GLM-OCR (ViT + GLM-4 text OCR) by inureyes · Pull Request #646 · lablup/mlxcel

inureyes · 2026-07-03T02:40:31Z

Summary

Adds GLM-OCR (glm_ocr, GlmOcrForConditionalGeneration) support, a document-OCR sibling of the in-tree GLM-4V stack. A 24-block ViT vision tower feeds a 16-layer GLM-4 text decoder driven by full-width even/odd MRoPE. Closes #547.

Architecture

Vision tower (src/vision/encoders/glm_ocr.rs), a variant of the glm4v encoder:
- Per-head q/k RMSNorm over head_dim (64) before the 2D vision rotary (glm4v has none).
- Block norms take rms_norm_eps (1e-5); no learned position embedding and no post-conv norm (the checkpoint ships neither).
- patch_embed and downsample detect both the channels-last export layout and the raw channels-second checkpoint layout.
- Patch-order fix (OCR-critical): patches are reordered from the processor's raster order into spatial-merge-window order after the patch embedding, so the rotary, the consecutive-4 downsample, and the text-side merged-token grid all agree. glm4v leaves this latent misalignment in place (fine for coarse VQA, but OCR misreads text when patches are scrambled).
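The raster-to-merge-window reordering can be illustrated on plain indices. The following is a minimal sketch, not the PR's code: the function names `raster_to_merge_window` and `reorder` are hypothetical, and it assumes a square merge window of side `merge` (2 would match a consecutive-4 downsample) and a patch grid whose height and width are multiples of `merge`.

```rust
/// Hypothetical helper: builds `perm` such that `reordered[i] = raster[perm[i]]`.
///
/// `grid_h` and `grid_w` are the patch-grid dimensions in patches; `merge` is the
/// assumed side length of the spatial merge window.
fn raster_to_merge_window(grid_h: usize, grid_w: usize, merge: usize) -> Vec<usize> {
    assert!(merge > 0, "merge window must be non-empty");
    assert!(
        grid_h % merge == 0 && grid_w % merge == 0,
        "grid must be a multiple of the merge window"
    );
    let mut perm = Vec::with_capacity(grid_h * grid_w);
    // Walk merge windows in row-major order...
    for wy in 0..grid_h / merge {
        for wx in 0..grid_w / merge {
            // ...and the patches inside each window in row-major order, so every
            // run of `merge * merge` consecutive outputs is one merged token.
            for dy in 0..merge {
                for dx in 0..merge {
                    let row = wy * merge + dy;
                    let col = wx * merge + dx;
                    perm.push(row * grid_w + col);
                }
            }
        }
    }
    perm
}

/// Hypothetical helper: applies the permutation to per-patch data.
fn reorder<T: Clone>(raster: &[T], perm: &[usize]) -> Vec<T> {
    perm.iter().map(|&i| raster[i].clone()).collect()
}
```

For a 4x4 grid with `merge = 2`, `raster_to_merge_window(4, 4, 2)` yields `[0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]`: each group of four consecutive entries is one 2x2 window, which is the grouping a consecutive-4 downsample expects.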
Text backbone reuses Glm4vTextModel unchanged. The loader lifts the nested rope_parameters block (mrope_section [16, 24, 24], partial_rotary_factor 1.0) into the fields the config deserializes, and drops the next-n prediction (MTP) layer at index num_hidden_layers.
Processor reuses Qwen2VLProcessor with the OCR pixel bounds (min 12544, max 9633792).
Wiring: detection -> loader -> LoadedModel::GlmOcr -> VlmRuntimeRef::Qwen -> server/CLI, plus model_metadata and TP dispatch rows.
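The two loader steps described above (lifting `rope_parameters` and dropping the MTP layer) can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the struct and function names are hypothetical, the config is modeled with plain structs rather than the real deserializer, and the weight-name prefix `model.language_model.` is an assumed naming scheme.

```rust
/// Hypothetical model of the nested block found in the checkpoint config.
#[derive(Debug, Clone, PartialEq)]
struct RopeParameters {
    mrope_section: Vec<usize>,
    partial_rotary_factor: f32,
}

/// Hypothetical, simplified view of the raw text config.
#[derive(Debug, Clone, Default, PartialEq)]
struct RawTextConfig {
    num_hidden_layers: usize,
    rope_parameters: Option<RopeParameters>,
    mrope_section: Option<Vec<usize>>,
    partial_rotary_factor: Option<f32>,
}

/// Copies nested rope values into the top-level fields when those are unset.
fn lift_rope_parameters(mut cfg: RawTextConfig) -> RawTextConfig {
    if let Some(rope) = cfg.rope_parameters.take() {
        if cfg.mrope_section.is_none() {
            cfg.mrope_section = Some(rope.mrope_section);
        }
        if cfg.partial_rotary_factor.is_none() {
            cfg.partial_rotary_factor = Some(rope.partial_rotary_factor);
        }
    }
    cfg
}

/// Parses the decoder layer index from a weight name under `text_prefix`.
/// Names outside the text model (e.g. vision blocks) return `None`.
fn text_layer_index(name: &str, text_prefix: &str) -> Option<usize> {
    let rest = name.strip_prefix(text_prefix)?.strip_prefix("layers.")?;
    rest.split('.').next()?.parse().ok()
}

/// Drops every weight that belongs to the layer at index `num_hidden_layers`,
/// i.e. the extra next-n prediction (MTP) layer stored after the real layers.
fn drop_mtp_weights(
    names: Vec<String>,
    text_prefix: &str,
    num_hidden_layers: usize,
) -> Vec<String> {
    names
        .into_iter()
        .filter(|n| text_layer_index(n, text_prefix) != Some(num_hidden_layers))
        .collect()
}
```

Scoping the index check to the text prefix matters here: the vision tower has 24 blocks, so an unscoped match on index 16 could also discard a vision weight.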

Validation

Validated on mlx-community/GLM-OCR-4bit through the actual CLI runtime:

Text Recognition on a spatially-asymmetric 4-quadrant image returns ALPHA 111 BRAVO 222 / CHARLIE 333 DELTA 444 in correct spatial reading order (clean text, no character scrambling), confirming the patch-order fix.
Table Recognition emits correct structured HTML matching every cell in row/column order.

Automated tests:

Co-located unit tests: conv-layout detection (both patch-embed and downsample layouts), the raster->merge-window permutation, rope_parameters lifting, and the MTP-layer drop.
tests/glm_ocr_parity.rs: config-detection tests (run in CI without a checkpoint) plus checkpoint-gated real-model load and text-forward tests.
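The PR does not show how the checkpoint-gated tests decide whether to run. One common pattern, sketched here purely as an assumption, is to read a checkpoint directory from an environment variable and return early when it is missing. The variable name `GLM_OCR_CHECKPOINT` and the helper names are hypothetical.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical environment variable pointing at a local checkpoint directory.
const CHECKPOINT_ENV: &str = "GLM_OCR_CHECKPOINT";

/// Returns the directory only when it contains a `config.json`.
fn checkpoint_dir_from(value: Option<String>) -> Option<PathBuf> {
    let path = PathBuf::from(value?);
    path.join("config.json").is_file().then_some(path)
}

/// Resolves the checkpoint directory from the environment, if any.
fn checkpoint_dir() -> Option<PathBuf> {
    checkpoint_dir_from(env::var(CHECKPOINT_ENV).ok())
}

#[test]
fn loads_real_checkpoint() {
    // Without a checkpoint the test returns early, so CI still passes.
    let Some(dir) = checkpoint_dir() else {
        eprintln!("skipping: set {CHECKPOINT_ENV} to run this test");
        return;
    };
    // A real test would load the model and run a text forward pass here.
    assert!(dir.join("config.json").is_file());
}
```

A test gated this way reports as passed rather than skipped when the checkpoint is absent; marking it `#[ignore]` and running with `cargo test -- --ignored` is an alternative when that distinction matters.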

cargo test, cargo fmt --all -- --check, and cargo clippy --all-targets --features metal,accelerate are clean for the new code.

Add GLM-OCR (`glm_ocr`, `GlmOcrForConditionalGeneration`), a document-OCR sibling of the in-tree GLM-4V stack: a 24-block ViT feeds a 16-layer GLM-4 text decoder driven by full-width even/odd MRoPE.

Vision tower (`src/vision/encoders/glm_ocr.rs`), a variant of the glm4v encoder: per-head q/k RMSNorm before the 2D vision rotary, block norms take `rms_norm_eps` (1e-5), and it drops the glm4v-only learned position embedding and post-conv norm. Patch-embed and downsample detect both the channels-last export layout and the raw channels-second checkpoint layout. The tower reorders patches from the processor's raster order into spatial-merge-window order after the patch embedding so the rotary, the consecutive-4 downsample, and the text-side merged-token grid all agree; glm4v leaves this latent misalignment in place (fine for VQA, but OCR reads scrambled patches wrong).

Text backbone reuses `Glm4vTextModel` unchanged. The loader lifts the nested `rope_parameters` block (mrope_section [16, 24, 24], partial_rotary_factor 1.0) into the fields the config expects and drops the next-n prediction (MTP) layer at index `num_hidden_layers`. The processor reuses `Qwen2VLProcessor` with the OCR pixel bounds (min 12544, max 9633792). Wired end to end: detection -> loader -> `LoadedModel::GlmOcr` -> `VlmRuntimeRef::Qwen` -> server/CLI, plus metadata and TP dispatch rows.

Validated on `mlx-community/GLM-OCR-4bit` through the CLI: Text Recognition reads a 4-quadrant image in correct spatial order and Table Recognition emits correct HTML. Co-located unit tests cover conv-layout detection, the raster->merge-window permutation, `rope_parameters` lifting, and the MTP drop; `tests/glm_ocr_parity.rs` adds config-detection tests plus checkpoint-gated forward tests.

Closes #547

inureyes added the labels type:enhancement (New features, capabilities, or significant additions) and area:models (Model architectures, weights, loading, metadata) Jul 3, 2026

inureyes merged commit 68c659b into main Jul 3, 2026
5 checks passed

inureyes deleted the feat/issue-547-glm-ocr branch July 3, 2026 02:41
