Log SFT metrics during training #719

Open
Kovbo wants to merge 3 commits into main from fix/sft-metrics-all-steps

Kovbo (Collaborator) commented Jun 5, 2026

Before:

  • SFT training yielded optimizer metrics for each batch/gradient step, but TrainableModel.train_sft() only collected them, averaged them at the end, and logged one final train/* row. That matched “one SFT run = one checkpoint,” but it meant W&B had no live SFT loss curve.
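The old behavior described above can be sketched roughly as follows. This is a minimal illustration, assuming per-step metrics arrive as plain dicts of floats; `aggregate_sft_metrics` and the metric keys are hypothetical names, not the actual body of `TrainableModel.train_sft()`.

```python
from collections import defaultdict


def aggregate_sft_metrics(step_metrics):
    """Average each metric key across all gradient steps (hypothetical helper)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for metrics in step_metrics:
        for key, value in metrics.items():
            totals[key] += value
            counts[key] += 1
    # Only this single averaged row is ever logged, under the train/ prefix.
    return {f"train/{key}": totals[key] / counts[key] for key in totals}


# Two gradient steps collapse into one row, so no loss curve is visible.
steps = [{"loss": 2.0, "grad_norm": 1.0}, {"loss": 1.0, "grad_norm": 3.0}]
final_row = aggregate_sft_metrics(steps)
```

Because only `final_row` is emitted, nothing reaches W&B until every step has finished.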

After:

  • SFT logs every optimizer step under sft/*, and still logs one final aggregate train/* row.
  • The Serverless backend owns W&B logging, so metrics keep being recorded after the user starts training and closes their laptop.
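A rough sketch of the new logging shape, under stated assumptions: `log_sft_run` and the `log_row` callback are hypothetical stand-ins for whichever side owns logging (client or server), gradient steps are assumed to be numbered from 1, and every step is assumed to report the same metric keys.

```python
def log_sft_run(step_metrics, log_row):
    """Log one sft/* row per gradient step, then one aggregate train/* row."""
    sums = {}
    n_steps = 0
    for gradient_step, metrics in enumerate(step_metrics, start=1):
        # Per-step row in the detailed SFT namespace.
        row = {f"sft/{key}": value for key, value in metrics.items()}
        row["sft/gradient_step"] = gradient_step
        log_row(row)
        # Keep running sums so the final aggregate needs no second pass.
        for key, value in metrics.items():
            sums[key] = sums.get(key, 0.0) + value
        n_steps = gradient_step
    if n_steps:
        # Single aggregate row, matching the one checkpoint the run produces.
        log_row({f"train/{key}": total / n_steps for key, total in sums.items()})


rows = []
log_sft_run([{"loss": 2.0}, {"loss": 1.0}], rows.append)
```

Passing the logger in as a callback keeps the same loop usable whether the rows are written locally or by a remote backend.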

Summary

  • Log SFT optimizer metrics for every gradient step without creating per-step checkpoints. This adds a separate sft/* W&B namespace for detailed SFT metrics, with sft/gradient_step as the x-axis. SFT jobs still produce one checkpoint and one aggregate train/* row.
  • Keep local backend logging client-owned, while letting remote SFT backends own server-side metric logging. This is intentionally different from RL: for Serverless SFT, the client may disconnect after starting training, so server-side logging is required if we want metrics to continue being recorded.
  • Forward the minimal SFT metric logging config through serverless training jobs.
  • Define W&B routing for the fixed sft/* namespace.
  • Update dev SFT scripts to exercise Megatron/Qwen SFT.
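The W&B routing for the sft/* namespace could look something like the sketch below. It relies on W&B's `define_metric(name, step_metric=...)` call on a run object; the wrapper `define_sft_metric_routing` and the `RecordingRun` stand-in are hypothetical and exist only so the example runs without a W&B account. The PR's actual routing code may differ.

```python
def define_sft_metric_routing(run):
    """Plot every sft/* metric against sft/gradient_step (hypothetical wrapper).

    `run` is any object exposing W&B's define_metric(name, step_metric=...),
    such as the run returned by wandb.init().
    """
    # Register the custom x-axis first, then point the namespace at it.
    run.define_metric("sft/gradient_step")
    run.define_metric("sft/*", step_metric="sft/gradient_step")


class RecordingRun:
    """Stand-in for a W&B run that just records define_metric calls."""

    def __init__(self):
        self.calls = []

    def define_metric(self, name, step_metric=None):
        self.calls.append((name, step_metric))


run = RecordingRun()
define_sft_metric_routing(run)
```

Using a dedicated step metric keeps the per-step SFT rows from advancing the axis used by the aggregate train/* row.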

Testing

  • uv run ruff check src/art/types.py src/art/model.py src/art/utils/sft.py tests/unit/test_frontend_logging.py tests/unit/test_serverless_pipeline_trainer_compat.py
  • uv run ty check tests/unit/test_frontend_logging.py tests/unit/test_serverless_pipeline_trainer_compat.py --output-format concise
  • uv run pytest tests/unit/test_frontend_logging.py::TestTrainSFTMetricsAggregation::test_train_sft_logs_every_gradient_step tests/unit/test_frontend_logging.py::TestTrainSFTMetricsAggregation::test_train_sft_remote_logging_does_not_write_local_history tests/unit/test_serverless_pipeline_trainer_compat.py::test_serverless_train_sft_forwards_metric_logging_config -q

Kovbo force-pushed the fix/sft-metrics-all-steps branch from 895895f to 2eb7d68 on June 5, 2026 02:06
Kovbo force-pushed the fix/sft-metrics-all-steps branch from 2eb7d68 to 5fb6b46 on June 5, 2026 02:23