Add multi GPU CI job for libcu++ #9435

Open
pciolkosz wants to merge 1 commit into NVIDIA:main from pciolkosz:multi_gpu_CI_jobs_for_libcudacxx

@pciolkosz

Contributor

Multi-GPU CI jobs would let us test how cccl-rt interacts with multiple GPUs. This PR adds a 2-GPU job to matrix.yaml.

To support the 2-GPU runners, the name-to-label translation also had to be updated.

@pciolkosz pciolkosz requested a review from a team as a code owner June 12, 2026 21:43
@pciolkosz pciolkosz requested a review from jrhemstad June 12, 2026 21:43
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cea86613-d6e5-4b94-a594-b4fc3b232e0f

📥 Commits

Reviewing files that changed from the base of the PR and between 9e1588a and 5b44eda.

📒 Files selected for processing (3)
  • .github/actions/workflow-build/build-workflow.py
  • .github/actions/workflow-run-job-linux/action.yml
  • ci/matrix.yaml

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR adds a 2-GPU CI job to ci/matrix.yaml for the libcudacxx project to test interactions with multiple GPUs. It also updates the GPU configuration handling in build and run workflow files to properly support multi-GPU runners.

Changes

.github/actions/workflow-build/build-workflow.py

  • Enhanced get_gpu function to validate that each GPU definition includes required fields (name, runner, sm), raising an exception when missing
  • Updated generate_dispatch_job_name to use gpu["name"] (with ", " prefix) for GPU job labels instead of constructing uppercase strings from raw GPU tags
  • Modified generate_dispatch_job_runner to use gpu["runner"] directly (plus -testing suffix when applicable) instead of using gpu["id"] with a hardcoded -latest-1 suffix
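The three changes above can be sketched roughly as follows. This is an illustration only, not the PR's code: `get_gpu` is named in the summary, but its signature here is assumed, and `gpu_label_suffix` and `runner_for_gpu` are hypothetical helpers standing in for the GPU-related parts of `generate_dispatch_job_name` and `generate_dispatch_job_runner`. The `GPUS` table, its display names, runner labels, and `sm` values are placeholders.

```python
# Hypothetical GPU table with the `name`, `runner`, and `sm` fields described
# in the summary. All values are placeholders, not the contents of matrix.yaml.
GPUS = {
    "rtxpro6000": {"name": "RTX PRO 6000", "runner": "example-gpu-rtxpro6000-1", "sm": "120"},
    "h100_2gpu": {"name": "2x H100", "runner": "example-gpu-h100-2", "sm": "90"},
}

REQUIRED_GPU_FIELDS = ("name", "runner", "sm")


def get_gpu(gpu_id, gpus=GPUS):
    """Look up a GPU definition and fail early if a required field is missing."""
    if gpu_id not in gpus:
        raise Exception(f"Unknown gpu '{gpu_id}'")
    gpu = dict(gpus[gpu_id], id=gpu_id)
    missing = [field for field in REQUIRED_GPU_FIELDS if field not in gpu]
    if missing:
        raise Exception(f"gpu '{gpu_id}' is missing required field(s): {', '.join(missing)}")
    return gpu


def gpu_label_suffix(gpu):
    """Job-name fragment: the display name with a ', ' prefix."""
    return f", {gpu['name']}"


def runner_for_gpu(gpu, testing=False):
    """Runner label taken directly from the definition, plus '-testing' if requested."""
    return gpu["runner"] + ("-testing" if testing else "")
```

With this shape, a multi-GPU runner needs only a new table entry with its own `runner` label, rather than a label derived from the GPU id and a fixed suffix.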

.github/actions/workflow-run-job-linux/action.yml

  • Updated Docker GPU access logic on *-gpu-* runners to conditionally handle multi-GPU setups
  • For multi-GPU runners (GPU count > 1), requests all GPUs via --gpus all
  • For single-GPU runners, continues to use --gpus "device=${NVIDIA_VISIBLE_DEVICES:-}"
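The selection rule above can be expressed as a small function. The real logic lives in shell inside action.yml; the Python version below is only a sketch for readability. The function name `docker_gpu_args` is hypothetical, and the assumption that the GPU count is the trailing number of the runner label (optionally followed by `-testing`) is a guess at the label format, not something the summary states.

```python
import os
import re


def docker_gpu_args(runner_label, visible_devices=None):
    """Return the `docker run` GPU arguments for a runner label.

    Assumption: the GPU count is the trailing number in the label,
    e.g. "...-gpu-h100-latest-2" means two GPUs.
    """
    if "-gpu-" not in runner_label:
        return []  # not a GPU runner: request no GPU access

    match = re.search(r"-(\d+)(?:-testing)?$", runner_label)
    gpu_count = int(match.group(1)) if match else 1

    if gpu_count > 1:
        # Multi-GPU runner: expose every GPU to the container.
        return ["--gpus", "all"]

    # Single-GPU runner: keep the prior behavior, which falls back to an
    # empty device list when NVIDIA_VISIBLE_DEVICES is unset.
    if visible_devices is None:
        visible_devices = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    return ["--gpus", f"device={visible_devices}"]
```

For example, `docker_gpu_args("linux-amd64-gpu-h100-latest-2")` yields `["--gpus", "all"]` under the assumed label format, while a label ending in `-1` yields the `device=` form.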

ci/matrix.yaml

  • Added new multi-GPU CI job for libcudacxx: a test job targeting gpu: h100_2gpu with sm: gpu
  • Restructured GPU configuration section to include display name and specific runner label for each GPU entry
  • Added h100_2gpu GPU model with associated runner label
  • Updated rtxpro6000 entry to include runner label

Walkthrough

The PR updates GPU configuration and handling across CI workflow generation and execution. GPU definitions in the matrix gain structured name and runner fields; build script validation enforces these required fields and uses them for job naming and runner selection; runtime logic detects multi-GPU runners and conditionally selects Docker GPU device access; a new h100_2gpu test entry exercises the updated configuration.

Changes

GPU configuration and multi-GPU execution

| Layer / File(s) | Summary |
|---|---|
| GPU configuration schema with name and runner fields (ci/matrix.yaml) | GPU definitions now include name and runner fields alongside sm; new h100_2gpu model and updated rtxpro6000 runner labels are added. |
| GPU validation and build-time job naming/runner generation (build-workflow.py) | get_gpu validates that GPU definitions include required name, runner, and sm fields; generate_dispatch_job_name appends gpu["name"] to job display names; generate_dispatch_job_runner uses gpu["runner"] instead of gpu["id"]-latest-1. |
| Multi-GPU Docker device selection at runtime (action.yml) | GPU device selection logic now detects multi-GPU runner labels via regex and requests --gpus all for counts > 1, otherwise uses prior device=${NVIDIA_VISIBLE_DEVICES:-} behavior. |
| Multi-GPU test matrix entry for h100_2gpu (ci/matrix.yaml) | New pull_request matrix job entry for libcudacxx targeting h100_2gpu GPU configuration. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Contributor

😬 CI Workflow Results

🟥 Finished in 1h 44m: Pass: 99%/505 | Total: 4d 12h | Max: 54m 27s | Hits: 98%/655116

See results here.

Labels

None yet

Projects

Status: In Review

1 participant