Skip to content

issue #7427 : Dimension Lookup/Update regression and bug fix#7429

Open
mattcasters wants to merge 1 commit into
apache:mainfrom
mattcasters:issue-7408-regression
Open

issue #7427 : Dimension Lookup/Update regression and bug fix#7429
mattcasters wants to merge 1 commit into
apache:mainfrom
mattcasters:issue-7408-regression

Conversation

@mattcasters

Copy link
Copy Markdown
Contributor

Summary

Issue #7427 addresses regressions and related failures in the Dimension Lookup / Update transform after the validation work landed in #7408 (commit 5ca8eb1eb6, PR #7408). That change added extensive field and metadata validation but inadvertently broke a previously valid configuration: lookup-only mode with an empty version field.

This branch also improves preload-cache behaviour: it replaces the cryptic Comparison method violates its general contract error with actionable validation messages, adds an option to skip zero-length validity rows during preload, fixes a preload date-type mismatch, and adds integration tests under integration-tests/database/.


Root cause analysis

1. Version field regression (primary regression from #7408)

Commit 5ca8eb1eb6 introduced validateFieldsAndMetadata() in DimensionLookup.java. When the version field is empty, it always threw:

Please specify a fieldname to store the version of the dimension entry in.

That check is appropriate for update mode (SCD insert/update algorithm needs a version column), but lookup-only mode (update = N) has never required a version field. Lookup-only pipelines with <version/> empty therefore started failing at init.

The same unconditional requirement existed in DimensionLookupMeta.checkReturns() used by the transform verification UI.

Fix: Gate the version-field requirement behind isUpdate() in both places:

  • DimensionLookup.validateFieldsAndMetadata()
  • DimensionLookupMeta.checkReturns()

2. Preload cache sort comparator error (pre-existing, surfaced by testing)

DimensionCache.compare() is designed for binary-search lookup, not for Collections.sort(). When two rows share the same natural key and overlapping [date_from, date_to) ranges, the comparator can return 0 for non-equal rows, violating the Comparator contract and producing:

Comparison method violates its general contract

This is especially common when upstream flattening produces zero-length validity rows (date_from == date_to).

Fix: Before sortRows(), call new DimensionCache.validateRowsForSort() which checks:

  • Invalid ranges (from >= to, including zero-length)
  • Overlapping ranges per natural key

Errors are reported with natural key and date-range context (capped at 3 messages). HopTransformException is rethrown directly from preloadCache() so messages are not wrapped in a generic HopException.

3. Preload lookup date type mismatch (found during integration testing)

When comparing the lookup date against preloaded rows, java.util.Date (pipeline execution date) was placed directly into a row whose metadata expects java.sql.Timestamp (PostgreSQL preload). This caused:

ClassCastException: java.util.Date cannot be cast to java.sql.Timestamp

Fix: Convert the lookup date through the preload row's date value meta before cache lookup:

lookupRow[data.preloadFromDateIndex] =
    preloadFromDateMeta.convertData(valueDateMeta, valueDate);

Code changes

Transform plugin (plugins/transforms/dimensionlookup)

File Change
DimensionLookup.java Version validation gated on isUpdate(); preload validation call; zero-length filter; date conversion fix; HopTransformException passthrough from preload; improved error logging (stack trace on init failure)
DimensionLookupMeta.java New ignoreZeroLengthValidity property (ignore_zero_length_validity, default false); version check gated on isUpdate() in checkReturns()
DimensionCache.java validateRowsForSort(), isZeroLengthValidity(), excludeZeroLengthValidityRows(), overlap/invalid-range helpers
DimensionLookupDialog.java Checkbox "Ignore zero-length validity records" below "Pre-load the cache"; enabled only when cache + preload + lookup-only
messages_en_US.properties Cache validation errors, ignore-option labels/tooltips, log messages
DimensionCacheTest.java Unit tests for validation (adjacent ranges, overlaps, invalid ranges, null open-ended overlaps) and zero-length detection/filtering

Integration tests (integration-tests/database/)

Wired into main-0012-dimension-lookup-update.hwf after existing 0012-5:

Test Purpose
0012-6-dimension-lookup-no-version-field SQL setup → unit test. Lookup-only, empty version, preload_cache=Y, cache_size=0. Asserts regression fix.
0012-dimension-lookup-zero-length-validity SQL setup → pipeline (expected failure) → Success via failure hop. Asserts preload validation catches bad dimension data.

Supporting files (new):

  • 0012-dimension-lookup-no-version.hpl — includes downstream Verify (Dummy) transform for unit-test row capture
  • 0012-dimension-lookup-zero-length-validity.hpl
  • SQL scripts, golden CSV/dataset metadata, unit-test JSON

Unit-test design note: The Hop unit-test framework captures input rows to the golden-data transform (rowReadEvent), not transform output. Golden data must therefore be attached to a downstream transform (e.g. Dummy Verify) whose input row meta already contains the expected lookup results. This follows the same pattern as 0023-db-procedure.


New option: ignore zero-length validity

Property: ignore_zero_length_validity (default false)

When enabled with preload cache in lookup-only mode, rows where date_from == date_to are excluded before cache validation and sorting. Useful when upstream flattening produces point-in-time records that cannot participate in range-based lookup but should not block the pipeline.

Log messages (detailed/basic) report how many rows were skipped; an empty cache after filtering logs a basic warning.


Testing

Unit tests

./mvnw test -pl plugins/transforms/dimensionlookup

All DimensionCacheTest cases pass, including new validation and zero-length filtering tests.

Integration tests

integration-tests/scripts/run-tests-docker.sh PROJECT_NAME=database

Verified on a clean Docker run (container exit code 0):

  • 0012-6: unit test passed — keys 1/2 resolved to dimension_id 1/2, values alpha/beta
  • 0012-zero-length-validity: pipeline failed with explicit validation message; workflow succeeded on failure hop:
    • Natural key [K1] has an invalid date range [2020/03/01, 2020/03/01) (date-from must be before date-to)
    • Overlapping validity periods reported for the same key
  • main-0012-dimension-lookup-update: all sub-workflows (0012-1 through zero-length-validity) result=[true]

Build note for reviewers: A full reactor build (-am) may fail on unrelated modules (e.g. corrupted hop-action-ftp checkstyle XML). Building the plugin and client directly works:

./mvnw install -pl plugins/transforms/dimensionlookup -DskipTests
./mvnw clean install -pl assemblies/client -DskipTests -Dcheckstyle.skip=true

The Docker test script unzips assemblies/client/target/hop-client-*.zip before building the image; ensure the client assembly includes the rebuilt dimensionlookup JAR.


What reviewers should focus on

  1. Regression fix correctness — Version field is still required and validated in update mode; lookup-only with empty version must not fail validation or UI checks.
  2. Validation vs. comparatorvalidateRowsForSort() uses a sort-safe comparator (natural key, then date_from); the existing compare() for binary search is unchanged. Confirm this separation is intentional and sufficient.
  3. Error message quality — Failures during preload should surface HopTransformException messages directly to the user, not generic wrappers.
  4. Ignore option semantics — Filtering happens before validation; document that ignored rows are invisible to lookup (potential data-gap risk if misused).
  5. Date conversion — Preload path only; non-preload lookup paths unchanged.
  6. Integration test pattern — Verify transform placement for golden-data capture; failure-hop workflow for expected pipeline errors.

Out of scope / known limitations

  • The underlying DimensionCache.compare() contract issue for Collections.sort() is not fixed by changing the comparator; invalid/overlapping data is rejected upfront instead.
  • No third integration test for the ignore zero-length option end-to-end (option is covered by unit tests; could be added later).
  • Unrelated pre-existing integration failure: 0029-sql-file-output (cannot create output folder) — not introduced by this work.

File change summary

 integration-tests/database/main-0012-dimension-lookup-update.hwf          |  66 +++
 plugins/transforms/dimensionlookup/.../DimensionCache.java                | 208 +++++++
 plugins/transforms/dimensionlookup/.../DimensionLookup.java             |  36 +-
 plugins/transforms/dimensionlookup/.../DimensionLookupDialog.java        |  33 +-
 plugins/transforms/dimensionlookup/.../DimensionLookupMeta.java          |  20 +-
 plugins/transforms/dimensionlookup/.../messages_en_US.properties        |  10 +-
 plugins/transforms/dimensionlookup/.../DimensionCacheTest.java          | 158 ++++++
 + 8 new integration-test files (pipelines, workflows, SQL, datasets, metadata)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant