Use NumPy cosine distance in Parakeet speaker matching #7701

Open
tianmind-studio wants to merge 1 commit into BasedHardware:main from tianmind-studio:codex/parakeet-cosine-distance-numpy

@tianmind-studio tianmind-studio commented Jun 8, 2026

Contributor

Summary

  • replace SciPy cdist(..., metric="cosine") calls in Parakeet speaker matching with a shared NumPy cosine-distance helper
  • avoid requiring SciPy for the Parakeet stream/prerecorded speaker matching path
  • keep zero-vector behavior safe by returning a non-match distance of 1.0 instead of dividing by zero
  • add focused tests for cosine-distance values, shared helper reuse, and the existing stream session flow

Current status

  • Rebased on main at 1a5824403b68ce47c3b0909577cadc1242ba0d3f
  • Head refreshed to 0e1497e9580d27912d6d3c7efe0c4481d8f1d88b
  • GitHub currently reports the PR as mergeable
  • Existing Greptile feedback remains addressed by the shared backend/parakeet/speaker_math.py helper and test_stream_handler_uses_shared_cosine_distance

Verification

  • D:\codex-omi-work\.venvs\omi-backend-vad-refresh\Scripts\python.exe -m pytest backend\tests\unit\test_parakeet_stream_session.py -q --tb=short
    • 17 passed
  • D:\codex-omi-work\.venvs\omi-backend-vad-refresh\Scripts\python.exe -m py_compile backend\parakeet\speaker_math.py backend\parakeet\stream_handler.py backend\parakeet\transcribe.py backend\tests\unit\test_parakeet_stream_session.py
  • D:\codex-omi-work\.venvs\omi-backend-vad-refresh\Scripts\python.exe -m black --line-length 120 --skip-string-normalization --check backend\parakeet\speaker_math.py backend\parakeet\stream_handler.py backend\parakeet\transcribe.py backend\tests\unit\test_parakeet_stream_session.py
    • 4 files would be left unchanged
  • PYTHONUTF8=1 D:\codex-omi-work\.venvs\omi-backend-vad-refresh\Scripts\python.exe backend\scripts\scan_async_blockers.py
    • exit 0; existing async-blocker backlog reported, no new Parakeet blocker
  • git diff --check origin/main...HEAD
  • scripts/pre-commit via Git for Windows sh.exe with the backend Windows venv and local Dart SDK on PATH

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR replaces scipy.spatial.distance.cdist with a small inline NumPy helper (_cosine_distance) in both Parakeet speaker-matching files, eliminating the SciPy dependency on that code path and preserving identical runtime behaviour. The helper correctly flattens the (1, N) embedding arrays produced by both the built-in pyannote model and the HTTP embedding endpoint before computing the cosine distance, matching what cdist(...)[0, 0] returned.

  • Adds _cosine_distance to stream_handler.py and transcribe.py, updating the single call-site in each; zero-vector inputs safely return 1.0.
  • Refactors the VAD mock in the test from patching the raw _vad attribute to patching _run_vad, and adds focused unit tests for the new helper covering identical, orthogonal, and zero-vector inputs.
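
The behaviour summarized above can be illustrated with a minimal sketch. It is not the code from this PR: the name `cosine_distance` is hypothetical, and the body only follows what the summary states (flatten `(1, N)` inputs, return 1.0 for zero vectors, otherwise compute one minus the cosine similarity, clamped to the valid range).

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two embeddings shaped (N,) or (1, N).

    Hypothetical stand-in for the PR's helper, written from its description.
    """
    # Flatten so a (1, N) embedding behaves like a plain N-vector.
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)

    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom <= 0.0:
        # Zero-vector guard: report a non-match instead of dividing by zero.
        return 1.0

    dist = 1.0 - float(np.dot(a, b)) / denom
    # Clamp away floating-point drift outside [0.0, 2.0].
    return float(min(max(dist, 0.0), 2.0))


orthogonal = cosine_distance([[1.0, 0.0]], [[0.0, 1.0]])  # 1.0
zero_vec = cosine_distance([[0.0, 0.0]], [[1.0, 0.0]])    # 1.0 via the guard
```

For a pair of `(1, N)` arrays, SciPy's `cdist` with `metric="cosine"` returns a `(1, 1)` matrix, which is why the original call sites indexed `[0, 0]`; a scalar-returning helper removes that step.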

Confidence Score: 4/5

Safe to merge — the NumPy helper is a drop-in replacement for the cdist call and handles the actual embedding shapes correctly.

The change removes an external dependency cleanly and the core cosine-distance logic is correct. The only concern is that the helper function is copy-pasted identically into two files rather than shared from one place, which creates a small maintenance surface. No data-correctness or runtime issues were found.

Both stream_handler.py and transcribe.py carry the same duplicated helper; consolidating it into one location would reduce drift risk going forward.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/parakeet/stream_handler.py | Replaces cdist import and call with a new _cosine_distance helper that uses NumPy; behaviour is preserved because reshape(-1) correctly flattens the (1, N) embeddings that both _get_embedding_builtin and _get_embedding_http return. The function is an exact duplicate of the one added in transcribe.py. |
| backend/parakeet/transcribe.py | Removes the scipy.spatial.distance.cdist import and adds an identical _cosine_distance helper; the call-site in _transcribe_v2_with_diarization is updated to match. Logic is correct for the (1, N)-shaped embeddings used in this file. |
| backend/tests/unit/test_parakeet_stream_session.py | Adds TestCosineDistance with tests for identical vectors (distance 0), orthogonal vectors (distance 1), and a zero vector (distance 1); also refactors the VAD mock from raw _vad attribute patching to _run_vad method patching, making the test more resilient to internal implementation changes. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Speaker embedding\n(shape: 1xN)"] --> B["_cosine_distance(emb, centroid)"]
    C["Stored centroid\n(shape: 1xN)"] --> B
    B --> D["reshape(-1) → flat N-dim vectors"]
    D --> E["denom = norm(a) * norm(b)"]
    E --> F{denom <= 0?}
    F -- yes --> G["return 1.0\n(zero-vector guard)"]
    F -- no --> H["distance = 1 - dot(a,b)/denom"]
    H --> I["clamp to [0.0, 2.0]"]
    I --> J{dist < threshold?}
    J -- yes --> K["Update centroid\n(running average)"]
    J -- no --> L["Register new speaker"]
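
The last two branches of the flowchart (update the centroid or register a new speaker) could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `assign_speaker`, the `threshold=0.5` default, the nearest-centroid selection, and the count-weighted form of the running average are not taken from the PR.

```python
import numpy as np


def cosine_distance(a, b):
    """Compact cosine distance with a zero-vector guard (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom <= 0.0:
        return 1.0
    return float(np.clip(1.0 - np.dot(a, b) / denom, 0.0, 2.0))


def assign_speaker(embedding, centroids, counts, threshold=0.5):
    """Return the index of the matched or newly registered speaker.

    centroids: list of 1-D arrays, one per known speaker (mutated in place).
    counts: number of embeddings averaged into each centroid (mutated in place).
    threshold: assumed value, not the one used by Parakeet.
    """
    emb = np.asarray(embedding, dtype=np.float64).reshape(-1)

    if centroids:
        dists = [cosine_distance(emb, c) for c in centroids]
        best = int(np.argmin(dists))
        if dists[best] < threshold:
            # Match: fold the new embedding into the running average.
            n = counts[best]
            centroids[best] = (centroids[best] * n + emb) / (n + 1)
            counts[best] = n + 1
            return best

    # No centroid is close enough: register a new speaker.
    centroids.append(emb)
    counts.append(1)
    return len(centroids) - 1
```

Because a zero-vector embedding yields a distance of 1.0, it can only match an existing speaker if the threshold is set above 1.0, so with a typical threshold it falls through to the new-speaker branch.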

Reviews (1): Last reviewed commit: "Use NumPy cosine distance in Parakeet sp..."

Comment thread backend/parakeet/stream_handler.py Outdated
@tianmind-studio tianmind-studio force-pushed the codex/parakeet-cosine-distance-numpy branch from a8f28f6 to cf9ecba Compare June 8, 2026 14:26

@kodjima33 kodjima33 left a comment

Collaborator

backend bug fix (NumPy cosine distance replaces SciPy in Parakeet speaker matching, drops SciPy dep) — approve only per policy

@tianmind-studio tianmind-studio force-pushed the codex/parakeet-cosine-distance-numpy branch from cf9ecba to 0bcf939 Compare June 10, 2026 03:49

@kodjima33 kodjima33 left a comment

Collaborator

Re-approved on new commits — backend (approve-only area).

@kodjima33 kodjima33 left a comment

Collaborator

Backend refactor (NumPy cosine replaces SciPy) — approve only per policy.

@tianmind-studio tianmind-studio force-pushed the codex/parakeet-cosine-distance-numpy branch 3 times, most recently from 42f7f67 to 6e4463d Compare June 16, 2026 23:33
@tianmind-studio tianmind-studio force-pushed the codex/parakeet-cosine-distance-numpy branch from 6e4463d to 0e1497e Compare June 17, 2026 08:15