Add page-tree navigation API#22
Conversation
Expose the page tree as a public, read-only navigation layer on top of the
existing object model. Until now the API surfaced the catalog but offered no
direct page access — this fills that gap so the reader can back page-importer
and layout tooling.
New in page.go:
- Reader.PageCount / Pages / Page(index) — 0-based, backed by a cached,
cycle- and depth-guarded page-tree walk (no /Count prealloc; leaf
detection keys off /Kids, so a missing /Type is tolerated).
- *Page handle with Box, Boxes, Rotation, Resources, ContentStreams and
Content. Inheritable attributes (boxes, /Rotate, /Resources) are
resolved along the /Parent chain per PDF 32000-1:2008 §7.7.3.4, with the
same seen-set + depth guard used elsewhere in the codebase.
- Value types: BoxName (the five boundary boxes), Rect (normalised, with
Width/Height).
Content is delivered, never interpreted: ContentStreams flattens single or
array /Contents to the underlying streams, Content concatenates the decoded
bytes. Stream.RawBytes is added so callers can also get raw, undecoded bytes.
No spec-default box substitution is applied — accessors report what the
document actually declares.
Tests: two golden fixtures (multi-level attribute inheritance; /Contents as
an array) plus page_test.go covering inheritance and overrides, rotation
normalisation, malformed boxes, index bounds, and cyclic /Kids and /Parent
termination. New examples/pageinfo demonstrates the API end to end.
Docs: README and doc.go "In scope" lists updated.
fank
left a comment
There was a problem hiding this comment.
Solid, defensive, well-tested additive change. Three correctness/spec items and two cleanups below; line refs are against the PR head.
Correctness / spec
1. Box/Boxes inherit BleedBox/TrimBox/ArtBox, but only MediaBox, CropBox, Resources and Rotate are inheritable (page.go:183, page.go:29, page.go:199)
Box routes every name through inherited() (page.go:184), and boxNames (page.go:29) drives Boxes over all five. Per PDF 32000-1:2008 Table 30, only MediaBox/CropBox/Resources/Rotate carry the (Inheritable) marker; §14.11.2 defines BleedBox/TrimBox/ArtBox as page-object entries that default to the CropBox value, never to an ancestor's. So an intermediate /Pages node carrying /TrimBox (legal, non-standard) makes every descendant leaf report that box from Box(TrimBox)/Boxes() — a box the page does not actually have, producing a wrong clip/trim region for imposition callers. Suggest gating inherited() to the four inheritable keys and reading Bleed/Trim/Art from p.dict only (keeping the existing no-CropBox-fallback stance).
2. ContentStreams silently drops array entries that resolve to Null or a non-stream; Content then returns truncated bytes with no error (page.go:261-269)
For an array /Contents, an entry whose object is missing from the xref resolves to Null{}, nil — not an error — so the err != nil skip at page.go:263 is bypassed and the s.(*Stream) assertion at page.go:266 simply fails, dropping the entry. Content (page.go:292) joins whatever survived and reports success. Example: /Contents [5 0 R 6 0 R] with object 6 absent yields only stream 5's bytes; drawing state opened in 5 and consumed in 6's portion is silently lost. The single-stream and well-formed-array paths are tested; this truncation path is not. Consider returning an error (or a sentinel) rather than a partial stream reported as success.
3. collectPages emits an intermediate /Pages node as a phantom leaf when /Kids is absent or non-array (page.go:135-138)
Leaf detection keys off node.Array("Kids") returning ok=false (page.go:135). Array returns ok=false not only when /Kids is absent but also when it is an indirect ref that fails to resolve or resolves to a non-array — in those cases the /Pages node itself is appended as a Page (page.go:137). For a malformed/empty document whose root /Pages has no resolvable /Kids, PageCount() returns 1 and Page(0).Dict() is the /Pages root rather than a real page. (The empty /Kids [] case is fine — it yields 0 pages.) A /Type-aware guard, or rejecting a node with /Count but no /Kids array, would avoid the phantom page.
Cleanup
4. collectPages duplicates walkNameTree's cycle-guarded descent (page.go:131 vs reader.go:498; maxPageTreeDepth page.go:12 vs maxNameTreeDepth reader.go:472)
Both share the same shape — (node *Dict, seen map[Reference]struct{}, depth int, out *[]T), the node == nil || depth > maxXxxDepth guard, and resolve-kid-then-recurse — and both depth bounds are a literal 1000. This is now the second hand-rolled copy of that guard (a third lives in examples/structtree). A shared walkKids(node, seen, depth, visit) would keep the cycle/depth handling in one tested place; today a fix to one can drift from the other.
5. rawStreamBytes copy-pastes decodeStream's bounds check and error string (filter.go:22-29 vs filter.go:13-16)
The s.rawOffset < 0 || s.rawOffset+s.rawLength > int64(len(r.buf)) check and the identical "...bytes out of range" message are duplicated verbatim; only the defensive copy is new. A small rawStreamSlice(s) ([]byte, error) returning r.buf[off:off+len] would let decodeStream use the slice directly and rawStreamBytes copy it, leaving one bounds check and one message.
Minor
reader.go:54pagesLoadedvs the existinginfoLoad(reader.go:38) — field-naming inconsistency, trivial.Content's'\n'separator (page.go:292) is safe (whitespace cannot merge tokens, §7.8.2) — fine as-is.
Verified clean
- The four genuinely-inheritable attributes (MediaBox/CropBox/Resources/Rotate) walk
/Parentcorrectly;ContentStreamscorrectly readsp.dict(Contents is not inheritable). - Cycle termination is sound in both
collectPages(ref seen-set) andinherited(*Dictseen-set + depth bound atpage.go:163), with dedicated tests. Rotationnormalisation (page.go:221-228) is correct on 64-bit for 90 / -90 / 450 / 360 / 45 / wrong-type;rectFromArray/numberValue(page.go:298,page.go:322) handle Integer/Real and indirect-ref entries and reject wrong-length arrays.- Fixture
/Countvalues are correct (count = leaf descendants: obj 2 and obj 3 each have two leaves below them).
- Box/Boxes: only MediaBox and CropBox inherit along /Parent (PDF 32000-1:2008 Table 30). BleedBox/TrimBox/ArtBox are page-level only and are now read from the page object, so an ancestor /Pages node carrying one no longer leaks a wrong clip/trim box into descendant leaves. - ContentStreams: an array /Contents entry that resolves to null (missing object) or a non-stream now returns an error instead of being dropped, so Content no longer reports a silently truncated stream as success. - collectPages: a node with no resolvable /Kids array but /Type /Pages, /Count, or an unresolvable /Kids key is treated as a broken/empty branch rather than emitted as a phantom leaf page. - filter: factor the shared bounds check into rawStreamSlice; decodeStream and rawStreamBytes now share one check and one error message. Tests added for each: non-inheritable boxes, broken /Contents array entry, and the phantom-leaf guard.
|
Thanks — thorough review. Addressed in 023992a. Correctness (all fixed):
Cleanup:
Minor —
|
Why
The reader exposes the catalog but has no public way to get at the pages
themselves — page count, geometry, rotation, resources, or content. That was
the one missing layer for anything that needs to walk a document page by page
(importers, layout and inspection tooling). This PR adds a read-only page-tree
navigation API on top of the existing object model and
Resolve*helpers.What
page.goReader.PageCount()/Pages()/Page(index)— 0-based indexing(idiomatic Go; matches the existing
Objects()ordering), backed by acached page-tree walk.
*Pagehandle withBox,Boxes,Rotation,Resources,ContentStreams, andContent. The inheritable attributes(boxes,
/Rotate,/Resources) are resolved along the/Parentchain perPDF 32000-1:2008 §7.7.3.4.
BoxName(the five boundary boxes) andRect(normalised,with
Width()/Height()).Stream.RawBytes()to expose raw, undecoded stream bytes alongside theexisting decoded
Content().Scope & design
decoded; they are never parsed or graphics-interpreted.
ContentStreamsflattens a single or array
/Contentsto the underlying streams;Contentconcatenates the decoded bytes with a newline separator.
MediaBox — accessors report exactly what the document declares.
footprint is essentially unchanged. The addition is
300 lines of library+6–7% of the non-test module), in line with the project's minimal-code (
footprint goal.
Defensiveness
Consistent with the rest of the codebase (cf.
TestEmbeddedFilesCyclicKidsTerminates):cyclic
/Kidsand/Parentchains terminate (seen-set + depth bound), the/Countvalue is never trusted for preallocation, leaf detection keys off/Kidspresence so a missing/Typeis tolerated, and malformed boxes reportabsent rather than panicking.
Tests & docs
levels, and
/Contentsas an array.page_test.go: inheritance and local overrides, rotation normalisation(including out-of-range and wrong-type), malformed/short boxes, index
bounds, and cyclic
/Kids//Parenttermination.examples/pageinfo(with tests) demonstrating the API end to end.doc.go"In scope" lists updated.All of
go build,go vet,gofmt -l, andgo test -race ./...are clean.