fix: tinybird datasources (CM-1219)#4205

Open
joanagmaia wants to merge 3 commits into main from fix/tinybird-datasources

@joanagmaia joanagmaia commented Jun 12, 2026

Contributor

This pull request updates several Tinybird datasource schemas and descriptions to improve data consistency, add new fields for analytics, and clarify documentation. The changes include making more fields nullable, converting some numeric fields to strings for compatibility, adding new metrics related to repositories and packages, and improving documentation for clarity and accuracy.

Schema and field type updates:

  • Converted several fields from Float32 to String for compatibility in advisories.datasource, packageRepos.datasource, and repos.datasource (e.g., cvss, confidence, scorecardScore). [1] [2] [3]
  • Made many fields nullable in ossPackages.datasource (formerly packages.datasource), such as namespace, registryUrl, status, description, and various count and score fields, to better handle missing data.
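The description does not say what the compatibility problem is. One plausible reason, stated here as an assumption rather than something the PR confirms, is that decimal scores arriving as JSON strings lose their exact value when forced into Float32. A minimal Python sketch of that effect, using only the standard library (the sample score is hypothetical):

```python
import struct
from decimal import Decimal


def roundtrip_float32(value: float) -> float:
    """Store a Python float as IEEE 754 single precision and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical score as it might arrive in a CDC payload (assumed, not from the PR).
raw_score = "9.8"

as_float32 = roundtrip_float32(float(raw_score))
as_decimal = Decimal(raw_score)

print(as_float32 == float(raw_score))  # False: Float32 cannot hold 9.8 exactly
print(as_decimal == Decimal("9.8"))    # True: the string form is lossless
```

Keeping the column as String defers the choice of numeric type to the consumer, at the cost of requiring an explicit cast before any numeric comparison.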

New fields and metrics:

  • Added new fields to ossPackages.datasource for analytics and ranking: downloadsLast30d, centralityScore, and rankInEcosystem. [1] [2]
  • Added new repository security and branch protection fields to repos.datasource, including securityPolicyEnabled, securityFileEnabled, snapshotAt, branchProtectionEnabled, branchProtectionAllowsForcePush, branchProtectionRequiredReviews, and branchProtectionRequiresStatusChecks. [1] [2]

Documentation improvements:

  • Updated and clarified the descriptions for ossPackages.datasource and repos.datasource, including more accurate explanations for fields like createdAt and new fields added for security and ranking analytics. [1] [2] [3]

File renaming:

  • Renamed packages.datasource to ossPackages.datasource to better reflect its contents and usage.

These changes improve the flexibility, accuracy, and analytical capabilities of the data pipelines for open source package and repository metadata.


Note

Medium Risk
Schema type and nullability changes plus repos.createdAt semantics can break downstream Tinybird pipes or consumers expecting floats, defaults, or the old packages datasource name until they are updated.

Overview
Aligns Tinybird ingestion schemas with packages-db replication: several score fields move from Float32 to String (cvss, packageRepos.confidence, repos.scorecardScore, ossPackages.impact), and many optional package metadata columns become Nullable instead of empty-string/zero defaults.

packages.datasource is renamed to ossPackages.datasource, with new ranking/analytics columns (downloadsLast30d, centralityScore, rankInEcosystem). repos.datasource gains snapshot/security and branch-protection fields, and createdAt is redefined as a non-null row-insert timestamp (replacing the prior nullable “GitHub creation date” semantics in the schema docs).

Datasource DESCRIPTION blocks are updated to match these naming and semantics changes.

Reviewed by Cursor Bugbot for commit 3358b98. Bugbot is set up for automated code reviews on this repo.

Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


@@ -1,5 +1,5 @@
  DESCRIPTION >
- - `packages` contains Tier-2 package metadata for tracked OSS packages across multiple ecosystems.
+ - `ossPackages` contains Tier-2 package metadata for tracked OSS packages across multiple ecosystems.


Tinybird sink topic mismatch

High Severity

Renaming the Tinybird datasource to ossPackages while Sequin still publishes the Postgres table packages breaks ingestion when the Kafka Connect sink posts events with name={{topic}}: payloads keep targeting packages, not the new datasource name.
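The mismatch can be illustrated with a small Python sketch of the name resolution a sink would need. This is not the project's connector configuration: the rename map, the base URL, and the `events_url` helper are hypothetical, and the actual fix would more likely live in the Kafka Connect sink or Sequin settings.

```python
from urllib.parse import urlencode

# Hypothetical rename map: source topic/table name -> Tinybird datasource name.
TOPIC_TO_DATASOURCE = {"packages": "ossPackages"}

# Assumed base URL; the real host depends on the Tinybird region in use.
EVENTS_API = "https://api.tinybird.co/v0/events"


def events_url(topic: str) -> str:
    """Build the Events API URL for a topic, applying any datasource rename."""
    datasource = TOPIC_TO_DATASOURCE.get(topic, topic)
    query = urlencode({"name": datasource})
    return f"{EVENTS_API}?{query}"


print(events_url("packages"))  # name resolves to ossPackages
print(events_url("repos"))     # unmapped topics pass through unchanged
```

Without a mapping of this kind, a sink that sets `name={{topic}}` keeps writing to `packages`, which no longer matches the renamed datasource.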


  `disabled` Nullable(UInt8) `json:$.record.disabled`,
  `isFork` Nullable(UInt8) `json:$.record.is_fork`,
- `createdAt` Nullable(DateTime64(3)) `json:$.record.created_at`,
+ `createdAt` DateTime64(3) `json:$.record.created_at`,


Non-null repo createdAt risk

Medium Severity

createdAt is now required in repos.datasource, but repos.created_at in packages-db is still nullable (it has no NOT NULL constraint), so CDC rows with a null created_at can fail ingestion or mis-partition on toYear(createdAt).
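One way to surface the risk before it reaches Tinybird is to check payloads for a missing created_at. The Python sketch below assumes events shaped like the `$.record.*` JSON paths in the schema; the `split_by_created_at` helper and the sample payloads (including the `id` field) are hypothetical and not part of this PR.

```python
from typing import Any, Iterable


def split_by_created_at(
    events: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate CDC events into ingestible rows and rows missing created_at."""
    ok, missing = [], []
    for event in events:
        record = event.get("record") or {}
        if record.get("created_at") is None:
            missing.append(event)
        else:
            ok.append(event)
    return ok, missing


# Hypothetical payloads shaped like the `$.record.*` JSON paths in the schema.
sample = [
    {"record": {"id": 1, "created_at": "2026-06-12T10:00:00.000Z"}},
    {"record": {"id": 2, "created_at": None}},
]
ok, missing = split_by_created_at(sample)
print(len(ok), len(missing))  # 1 1
```

A non-empty `missing` list would indicate that the source column needs a NOT NULL constraint or that the datasource column should stay Nullable.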


Copilot AI left a comment


Pull request overview

This PR updates Tinybird datasource definitions under services/libs/tinybird/datasources/ to better align with packages-db replication semantics, extend repo/package analytics coverage, and adjust certain field types/nullability for ingestion compatibility.

Changes:

  • Extend repos datasource with new security/snapshot/branch-protection fields and adjust createdAt/scorecardScore types.
  • Update ossPackages datasource schema to make many fields nullable and add ranking/analytics fields (downloadsLast30d, centralityScore, rankInEcosystem).
  • Convert selected numeric fields to String in advisories (cvss) and packageRepos (confidence) for compatibility.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| services/libs/tinybird/datasources/repos.datasource | Adds new repo security/snapshot/branch-protection columns; adjusts some field types and docs. |
| services/libs/tinybird/datasources/packageRepos.datasource | Changes confidence field type to String for ingestion compatibility. |
| services/libs/tinybird/datasources/ossPackages.datasource | Makes many package metadata fields nullable and adds ranking/analytics fields. |
| services/libs/tinybird/datasources/advisories.datasource | Changes cvss field type to String for ingestion compatibility. |

Comments suppressed due to low confidence (2)

services/libs/tinybird/datasources/ossPackages.datasource:46

  • Several fields were changed from String ... DEFAULT '' to Nullable(String) (e.g., namespace/registryUrl/status/description), but the DESCRIPTION section still documents them as empty-string defaults. That documentation is now inaccurate; consumers will see NULLs and may need to coalesce() accordingly.

services/libs/tinybird/datasources/ossPackages.datasource:32

  • These metrics are described as defaulting to 0/'0' when absent, but the schema defines them as Nullable with no DEFAULT, so they will be NULL when not populated. Also, impact/centralityScore are stored as strings per the schema in this PR, which is worth reflecting in the docs to avoid surprising downstream numeric comparisons.
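Both suppressed comments affect downstream consumers in the same way: values may now be NULL, and scores stored as strings do not compare numerically. A minimal Python sketch of defensive handling on the consumer side follows; the `parse_score` and `display_text` helpers are hypothetical names, not part of the repository.

```python
from typing import Optional


def parse_score(value: Optional[str]) -> Optional[float]:
    """Convert a string-typed score to float, keeping NULL as None."""
    if value is None or value.strip() == "":
        return None
    return float(value)


def display_text(value: Optional[str], fallback: str = "") -> str:
    """Coalesce a Nullable(String) field to a fallback for display."""
    return fallback if value is None else value


# Lexicographic comparison of string scores gives the wrong order.
print("10.0" > "9.8")                            # False
print(parse_score("10.0") > parse_score("9.8"))  # True
print(parse_score(None))                         # None
print(display_text(None, "unknown"))             # unknown
```

The same idea applies inside Tinybird pipes, where a cast and a coalesce would be written in SQL rather than in application code.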


services/libs/tinybird/datasources/repos.datasource

  - `disabled` is 1 if the repository is disabled, 0 if not, NULL until the GitHub enricher runs.
  - `isFork` is 1 if this repository is a fork of another, 0 if not, NULL until the GitHub enricher runs.
- - `createdAt` is the repository creation date on GitHub/GitLab — a domain timestamp, not a row-insert timestamp.
+ - `createdAt` is the row-insert timestamp — set once on first insert via DEFAULT NOW(), never updated.

services/libs/tinybird/datasources/packageRepos.datasource

  `repoId` UInt64 `json:$.record.repo_id`,
  `source` String `json:$.record.source`,
- `confidence` Float32 `json:$.record.confidence`,
+ `confidence` String `json:$.record.confidence`,

services/libs/tinybird/datasources/advisories.datasource

  `aliases` Array(String) `json:$.record.aliases[:]` DEFAULT [],
  `severity` String `json:$.record.severity` DEFAULT '',
- `cvss` Float32 `json:$.record.cvss` DEFAULT 0,
+ `cvss` String `json:$.record.cvss` DEFAULT '0',


2 participants