Skip to content

Add page-tree navigation API#22

Merged
pgundlach merged 2 commits into
mainfrom
feature/pageaccessor
Jun 26, 2026
Merged

Add page-tree navigation API#22
pgundlach merged 2 commits into
mainfrom
feature/pageaccessor

Conversation

@pgundlach

Copy link
Copy Markdown
Member

Why

The reader exposes the catalog but has no public way to get at the pages
themselves — page count, geometry, rotation, resources, or content. That was
the one missing layer for anything that needs to walk a document page by page
(importers, layout and inspection tooling). This PR adds a read-only page-tree
navigation API on top of the existing object model and Resolve* helpers.

What

page.go

  • Reader.PageCount() / Pages() / Page(index)0-based indexing
    (idiomatic Go; matches the existing Objects() ordering), backed by a
    cached page-tree walk.
  • A *Page handle with Box, Boxes, Rotation, Resources,
    ContentStreams, and Content. The inheritable attributes
    (boxes, /Rotate, /Resources) are resolved along the /Parent chain per
    PDF 32000-1:2008 §7.7.3.4.
  • Value types: BoxName (the five boundary boxes) and Rect (normalised,
    with Width()/Height()).
  • Stream.RawBytes() to expose raw, undecoded stream bytes alongside the
    existing decoded Content().

Scope & design

  • Read-only, no interpretation. Content streams are delivered raw and/or
    decoded; they are never parsed or graphics-interpreted. ContentStreams
    flattens a single or array /Contents to the underlying streams; Content
    concatenates the decoded bytes with a newline separator.
  • No default substitution. A missing CropBox does not fall back to
    MediaBox — accessors report exactly what the document declares.
  • No new dependencies, no new import of weight — the shipped/WASM
    footprint is essentially unchanged. The addition is 300 lines of library
    code (
    +6–7% of the non-test module), in line with the project's minimal-
    footprint goal.

Defensiveness

Consistent with the rest of the codebase (cf. TestEmbeddedFilesCyclicKidsTerminates):
cyclic /Kids and /Parent chains terminate (seen-set + depth bound), the
/Count value is never trusted for preallocation, leaf detection keys off
/Kids presence so a missing /Type is tolerated, and malformed boxes report
absent rather than panicking.

Tests & docs

  • Two golden fixtures: multi-level attribute inheritance across page-tree
    levels, and /Contents as an array.
  • page_test.go: inheritance and local overrides, rotation normalisation
    (including out-of-range and wrong-type), malformed/short boxes, index
    bounds, and cyclic /Kids / /Parent termination.
  • New examples/pageinfo (with tests) demonstrating the API end to end.
  • README and doc.go "In scope" lists updated.

All of go build, go vet, gofmt -l, and go test -race ./... are clean.

Expose the page tree as a public, read-only navigation layer on top of the
existing object model. Until now the API surfaced the catalog but offered no
direct page access — this fills that gap so the reader can back page-importer
and layout tooling.

New in page.go:
  - Reader.PageCount / Pages / Page(index) — 0-based, backed by a cached,
    cycle- and depth-guarded page-tree walk (no /Count prealloc; leaf
    detection keys off /Kids, so a missing /Type is tolerated).
  - *Page handle with Box, Boxes, Rotation, Resources, ContentStreams and
    Content. Inheritable attributes (boxes, /Rotate, /Resources) are
    resolved along the /Parent chain per PDF 32000-1:2008 §7.7.3.4, with the
    same seen-set + depth guard used elsewhere in the codebase.
  - Value types: BoxName (the five boundary boxes), Rect (normalised, with
    Width/Height).

Content is delivered, never interpreted: ContentStreams flattens single or
array /Contents to the underlying streams, Content concatenates the decoded
bytes. Stream.RawBytes is added so callers can also get raw, undecoded bytes.
No spec-default box substitution is applied — accessors report what the
document actually declares.

Tests: two golden fixtures (multi-level attribute inheritance; /Contents as
an array) plus page_test.go covering inheritance and overrides, rotation
normalisation, malformed boxes, index bounds, and cyclic /Kids and /Parent
termination. New examples/pageinfo demonstrates the API end to end.

Docs: README and doc.go "In scope" lists updated.
@pgundlach pgundlach requested a review from fank June 26, 2026 09:49

@fank fank left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, defensive, well-tested additive change. Three correctness/spec items and two cleanups below; line refs are against the PR head.

Correctness / spec

1. Box/Boxes inherit BleedBox/TrimBox/ArtBox, but only MediaBox, CropBox, Resources and Rotate are inheritable (page.go:183, page.go:29, page.go:199)

Box routes every name through inherited() (page.go:184), and boxNames (page.go:29) drives Boxes over all five. Per PDF 32000-1:2008 Table 30, only MediaBox/CropBox/Resources/Rotate carry the (Inheritable) marker; §14.11.2 defines BleedBox/TrimBox/ArtBox as page-object entries that default to the CropBox value, never to an ancestor's. So an intermediate /Pages node carrying /TrimBox (legal, non-standard) makes every descendant leaf report that box from Box(TrimBox)/Boxes() — a box the page does not actually have, producing a wrong clip/trim region for imposition callers. Suggest gating inherited() to the four inheritable keys and reading Bleed/Trim/Art from p.dict only (keeping the existing no-CropBox-fallback stance).

2. ContentStreams silently drops array entries that resolve to Null or a non-stream; Content then returns truncated bytes with no error (page.go:261-269)

For an array /Contents, an entry whose object is missing from the xref resolves to Null{}, nil — not an error — so the err != nil skip at page.go:263 is bypassed and the s.(*Stream) assertion at page.go:266 simply fails, dropping the entry. Content (page.go:292) joins whatever survived and reports success. Example: /Contents [5 0 R 6 0 R] with object 6 absent yields only stream 5's bytes; drawing state opened in 5 and consumed in 6's portion is silently lost. The single-stream and well-formed-array paths are tested; this truncation path is not. Consider returning an error (or a sentinel) rather than a partial stream reported as success.

3. collectPages emits an intermediate /Pages node as a phantom leaf when /Kids is absent or non-array (page.go:135-138)

Leaf detection keys off node.Array("Kids") returning ok=false (page.go:135). Array returns ok=false not only when /Kids is absent but also when it is an indirect ref that fails to resolve or resolves to a non-array — in those cases the /Pages node itself is appended as a Page (page.go:137). For a malformed/empty document whose root /Pages has no resolvable /Kids, PageCount() returns 1 and Page(0).Dict() is the /Pages root rather than a real page. (The empty /Kids [] case is fine — it yields 0 pages.) A /Type-aware guard, or rejecting a node with /Count but no /Kids array, would avoid the phantom page.

Cleanup

4. collectPages duplicates walkNameTree's cycle-guarded descent (page.go:131 vs reader.go:498; maxPageTreeDepth page.go:12 vs maxNameTreeDepth reader.go:472)

Both share the same shape — (node *Dict, seen map[Reference]struct{}, depth int, out *[]T), the node == nil || depth > maxXxxDepth guard, and resolve-kid-then-recurse — and both depth bounds are a literal 1000. This is now the second hand-rolled copy of that guard (a third lives in examples/structtree). A shared walkKids(node, seen, depth, visit) would keep the cycle/depth handling in one tested place; today a fix to one can drift from the other.

5. rawStreamBytes copy-pastes decodeStream's bounds check and error string (filter.go:22-29 vs filter.go:13-16)

The s.rawOffset < 0 || s.rawOffset+s.rawLength > int64(len(r.buf)) check and the identical "...bytes out of range" message are duplicated verbatim; only the defensive copy is new. A small rawStreamSlice(s) ([]byte, error) returning r.buf[off:off+len] would let decodeStream use the slice directly and rawStreamBytes copy it, leaving one bounds check and one message.

Minor

  • reader.go:54 pagesLoaded vs the existing infoLoad (reader.go:38) — field-naming inconsistency, trivial.
  • Content's '\n' separator (page.go:292) is safe (whitespace cannot merge tokens, §7.8.2) — fine as-is.

Verified clean

  • The four genuinely-inheritable attributes (MediaBox/CropBox/Resources/Rotate) walk /Parent correctly; ContentStreams correctly reads p.dict (Contents is not inheritable).
  • Cycle termination is sound in both collectPages (ref seen-set) and inherited (*Dict seen-set + depth bound at page.go:163), with dedicated tests.
  • Rotation normalisation (page.go:221-228) is correct on 64-bit for 90 / -90 / 450 / 360 / 45 / wrong-type; rectFromArray/numberValue (page.go:298, page.go:322) handle Integer/Real and indirect-ref entries and reject wrong-length arrays.
  • Fixture /Count values are correct (count = leaf descendants: obj 2 and obj 3 each have two leaves below them).

- Box/Boxes: only MediaBox and CropBox inherit along /Parent (PDF
  32000-1:2008 Table 30). BleedBox/TrimBox/ArtBox are page-level only and
  are now read from the page object, so an ancestor /Pages node carrying one
  no longer leaks a wrong clip/trim box into descendant leaves.
- ContentStreams: an array /Contents entry that resolves to null (missing
  object) or a non-stream now returns an error instead of being dropped, so
  Content no longer reports a silently truncated stream as success.
- collectPages: a node with no resolvable /Kids array but /Type /Pages,
  /Count, or an unresolvable /Kids key is treated as a broken/empty branch
  rather than emitted as a phantom leaf page.
- filter: factor the shared bounds check into rawStreamSlice; decodeStream
  and rawStreamBytes now share one check and one error message.

Tests added for each: non-inheritable boxes, broken /Contents array entry,
and the phantom-leaf guard.
@pgundlach

Copy link
Copy Markdown
Member Author

Thanks — thorough review. Addressed in 023992a.

Correctness (all fixed):

  1. Box inheritanceBox/Boxes now only inherit MediaBox and CropBox along /Parent (Table 30); BleedBox/TrimBox/ArtBox are read from the page object only. An ancestor /Pages node carrying a TrimBox no longer leaks a wrong trim region into descendant leaves. Test: TestPageNonInheritableBoxes.
  2. Content truncation — an array /Contents entry resolving to null or a non-stream now returns an error instead of being silently dropped, so Content no longer reports a truncated stream as success. Test: TestPageContentsArrayBrokenEntryErrors.
  3. Phantom leafcollectPages now treats a node with no resolvable /Kids array but /Type /Pages, /Count, or an unresolvable /Kids key as a broken/empty branch rather than a leaf. Test: TestPagesPhantomLeafGuard (covers the unresolvable-ref, no-/Kids, and /Type-only cases).

Cleanup:

  1. rawStreamBytes dedup — done: factored into rawStreamSlice, so decodeStream and rawStreamBytes share one bounds check and one message.

  2. Shared walkKids — left as-is, deliberately. With the new phantom-leaf guard (Validate attacker-controlled xref/ObjStm integers #3), collectPages's leaf-detection logic has diverged from walkNameTree (which instead processes /Names at each node); a shared helper would have to push that page-specific branch back in via a callback, trading the duplication for more coupling. Happy to revisit if you'd still prefer the unification.

MinorpagesLoaded naming: left as-is (reads clearer than the existing infoLoad); shout if you'd rather it match.

go build, go vet, gofmt -l, and go test -race ./... all clean.

@fank fank left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pgundlach pgundlach merged commit 1bbfc44 into main Jun 26, 2026
1 check passed
@pgundlach pgundlach deleted the feature/pageaccessor branch June 26, 2026 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants