fix: tinybird datasources (CM-1219) #4205
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 3358b98.
@@ -1,5 +1,5 @@
 DESCRIPTION >
-    - `packages` contains Tier-2 package metadata for tracked OSS packages across multiple ecosystems.
+    - `ossPackages` contains Tier-2 package metadata for tracked OSS packages across multiple ecosystems.
Tinybird sink topic mismatch
High Severity
Renaming the Tinybird datasource to `ossPackages` breaks ingestion while Sequin still publishes the Postgres table as `packages`. The Kafka Connect sink posts events with `name={{topic}}`, so payloads keep targeting `packages` rather than the new datasource name.
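The mismatch can be sketched as follows, assuming the sink builds a Tinybird Events API URL by substituting the Kafka topic into the `name` query parameter. The `events_url` helper, the base URL host, and the `TOPIC_TO_DATASOURCE` override table are hypothetical illustrations and are not part of this PR; the real fix may belong in the sink configuration or in Sequin instead.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Assumed base URL; the actual host depends on the Tinybird region in use.
EVENTS_API = "https://api.tinybird.co/v0/events"

# Hypothetical override table: Kafka topic name -> Tinybird datasource name.
TOPIC_TO_DATASOURCE: Dict[str, str] = {"packages": "ossPackages"}


def events_url(topic: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Build the URL a sink would POST to for a given topic.

    Without overrides this mirrors name={{topic}}: the topic is used verbatim.
    """
    datasource = (overrides or {}).get(topic, topic)
    return f"{EVENTS_API}?{urlencode({'name': datasource})}"
```

With no override, `events_url("packages")` still targets `packages`, which is the failure Bugbot describes; passing the override table redirects the same topic to `ossPackages` and leaves other topics untouched.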
     `disabled` Nullable(UInt8) `json:$.record.disabled`,
     `isFork` Nullable(UInt8) `json:$.record.is_fork`,
-    `createdAt` Nullable(DateTime64(3)) `json:$.record.created_at`,
+    `createdAt` DateTime64(3) `json:$.record.created_at`,
Non-null repo createdAt risk
Medium Severity
`createdAt` is now required (non-nullable) in `repos.datasource`, but `repos.created_at` in packages-db is still nullable (no `NOT NULL` constraint). CDC rows with a null `created_at` can therefore fail ingestion or be mis-partitioned by `toYear(createdAt)`.
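One way to gauge the exposure before merging is to scan a sample of CDC events for null timestamps. The sketch below assumes events shaped like the `json:$.record.*` paths in the schema (a top-level `record` object); the function name `split_by_created_at` is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]


def split_by_created_at(events: Iterable[Event]) -> Tuple[List[Event], List[Event]]:
    """Separate events that carry record.created_at from those where it is null or absent."""
    present: List[Event] = []
    missing: List[Event] = []
    for event in events:
        record = event.get("record") or {}
        if record.get("created_at") is None:
            # These rows would hit the non-nullable createdAt column.
            missing.append(event)
        else:
            present.append(event)
    return present, missing
```

A non-empty `missing` list means the source column would need a backfill and a `NOT NULL` constraint, or the Tinybird column would need to stay nullable.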
Pull request overview
This PR updates Tinybird datasource definitions under services/libs/tinybird/datasources/ to better align with packages-db replication semantics, extend repo/package analytics coverage, and adjust certain field types/nullability for ingestion compatibility.
Changes:
- Extend `repos` datasource with new security/snapshot/branch-protection fields and adjust `createdAt`/`scorecardScore` types.
- Update `ossPackages` datasource schema to make many fields nullable and add ranking/analytics fields (`downloadsLast30d`, `centralityScore`, `rankInEcosystem`).
- Convert selected numeric fields to `String` in `advisories` (`cvss`) and `packageRepos` (`confidence`) for compatibility.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| services/libs/tinybird/datasources/repos.datasource | Adds new repo security/snapshot/branch-protection columns; adjusts some field types and docs. |
| services/libs/tinybird/datasources/packageRepos.datasource | Changes confidence field type to String for ingestion compatibility. |
| services/libs/tinybird/datasources/ossPackages.datasource | Makes many package metadata fields nullable and adds ranking/analytics fields. |
| services/libs/tinybird/datasources/advisories.datasource | Changes cvss field type to String for ingestion compatibility. |
Comments suppressed due to low confidence (2)
services/libs/tinybird/datasources/ossPackages.datasource:46
- Several fields were changed from `String ... DEFAULT ''` to `Nullable(String)` (e.g., namespace/registryUrl/status/description), but the DESCRIPTION section still documents them as empty-string defaults. That documentation is now inaccurate; consumers will see NULLs and may need to `coalesce()` accordingly.
services/libs/tinybird/datasources/ossPackages.datasource:32
- These metrics are described as defaulting to 0/'0' when absent, but the schema defines them as Nullable with no DEFAULT, so they will be NULL when not populated. Also, `impact`/`centralityScore` are stored as strings per the schema in this PR, which is worth reflecting in the docs to avoid surprising downstream numeric comparisons.
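The numeric-comparison concern can be illustrated on the consumer side. This is a minimal sketch, assuming a downstream client receives score fields such as `cvss` or `centralityScore` as strings or NULL; the helper name `score_to_float` is hypothetical.

```python
from typing import Optional


def score_to_float(value: Optional[str], default: Optional[float] = None) -> Optional[float]:
    """Parse a string-typed score; return `default` for NULL or unparseable input."""
    if value is None:
        return default
    try:
        return float(value)
    except ValueError:
        return default


# String comparison is lexicographic, so it disagrees with numeric order.
assert ("9.8" < "10.0") is False
assert score_to_float("9.8") < score_to_float("10.0")
```

Passing an explicit `default` plays the role of `coalesce()` for callers that previously relied on the `0`/`''` defaults.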
services/libs/tinybird/datasources/repos.datasource
     - `disabled` is 1 if the repository is disabled, 0 if not, NULL until the GitHub enricher runs.
     - `isFork` is 1 if this repository is a fork of another, 0 if not, NULL until the GitHub enricher runs.
-    - `createdAt` is the repository creation date on GitHub/GitLab — a domain timestamp, not a row-insert timestamp.
+    - `createdAt` is the row-insert timestamp — set once on first insert via DEFAULT NOW(), never updated.
services/libs/tinybird/datasources/packageRepos.datasource
     `repoId` UInt64 `json:$.record.repo_id`,
     `source` String `json:$.record.source`,
-    `confidence` Float32 `json:$.record.confidence`,
+    `confidence` String `json:$.record.confidence`,
services/libs/tinybird/datasources/advisories.datasource
     `aliases` Array(String) `json:$.record.aliases[:]` DEFAULT [],
     `severity` String `json:$.record.severity` DEFAULT '',
-    `cvss` Float32 `json:$.record.cvss` DEFAULT 0,
+    `cvss` String `json:$.record.cvss` DEFAULT '0',


This pull request updates several Tinybird datasource schemas and descriptions to improve data consistency, add new fields for analytics, and clarify documentation. The changes make more fields nullable, convert some numeric fields to strings for compatibility, add new repository and package metrics, and correct the field documentation.
Schema and field type updates:
- Changed fields from `Float32` to `String` for compatibility in `advisories.datasource`, `packageRepos.datasource`, and `repos.datasource` (e.g., `cvss`, `confidence`, `scorecardScore`).
- Made many fields nullable in `ossPackages.datasource` (formerly `packages.datasource`), such as `namespace`, `registryUrl`, `status`, `description`, and various count and score fields, to better handle missing data.
New fields and metrics:
- Added fields to `ossPackages.datasource` for analytics and ranking: `downloadsLast30d`, `centralityScore`, and `rankInEcosystem`.
- Added fields to `repos.datasource`, including `securityPolicyEnabled`, `securityFileEnabled`, `snapshotAt`, `branchProtectionEnabled`, `branchProtectionAllowsForcePush`, `branchProtectionRequiredReviews`, and `branchProtectionRequiresStatusChecks`.
Documentation improvements:
- Updated the documentation in `ossPackages.datasource` and `repos.datasource`, including more accurate explanations for fields like `createdAt` and the new fields added for security and ranking analytics.
File renaming:
- Renamed `packages.datasource` to `ossPackages.datasource` to better reflect its contents and usage.
These changes improve the flexibility, accuracy, and analytical capabilities of the data pipelines for open source package and repository metadata.
Note
Medium Risk
Schema type and nullability changes plus repos.createdAt semantics can break downstream Tinybird pipes or consumers expecting floats, defaults, or the old packages datasource name until they are updated.
Overview
Aligns Tinybird ingestion schemas with packages-db replication: several score fields move from Float32 to String (`cvss`, `packageRepos.confidence`, `repos.scorecardScore`, `ossPackages.impact`), and many optional package metadata columns become Nullable instead of empty-string/zero defaults.
`packages.datasource` is renamed to `ossPackages.datasource`, with new ranking/analytics columns (`downloadsLast30d`, `centralityScore`, `rankInEcosystem`).
`repos.datasource` gains snapshot/security and branch-protection fields, and `createdAt` is redefined as a non-null row-insert timestamp (replacing the prior nullable "GitHub creation date" semantics in the schema docs).
Datasource DESCRIPTION blocks are updated to match these naming and semantics changes.
Reviewed by Cursor Bugbot for commit 3358b98. Bugbot is set up for automated code reviews on this repo.