Issues
Filter: is:issue state:open
Issue creation is restricted in this repository.
All issues below are in lablup/mlxcel and are open.

#650 fix(vision): captured slice_update in DeepStack/section-assembly injection loops (qwen3_vl, qwen3_vl_moe, qwen2_vl, paddleocr_vl)
Labels: area:models, priority:high, status:ready, type:bug

#648 CUDA: long-prompt prefill aborts (exit 134, invalid launch config) when a kernel grid dim exceeds 65535
Labels: area:benchmark, area:inference, priority:high, status:ready, type:bug

#638 perf(speculative): CUDA drafter pairing matrix and continuous-batching integration
Labels: area:inference, priority:medium, status:ready, type:performance

#637 perf(cuda/quant): sm_120/sm_121-tuned quantized GEMM for Blackwell prefill
Labels: area:core, priority:medium, status:ready, type:performance

#636 perf(cuda): single-dtype decode graph: eliminate per-token AsType conversions
Labels: area:core, priority:medium, status:ready, type:performance

#635 perf(cuda/kv): validate and enable quantized KV cache modes on CUDA
Labels: area:core, priority:medium, status:ready, type:performance

#634 perf(cuda/attn): port the native paged-attention decode kernel to CUDA
Labels: area:core, priority:medium, status:ready, type:performance

#633 perf(server): incremental detokenization and per-token streaming overhead reduction
Labels: area:inference, priority:medium, status:ready, type:performance

#632 perf(server): port lookahead async_eval pipelining into BatchScheduler decode
Labels: area:inference, priority:medium, status:ready, type:performance

#631 perf(cuda/ssm): close the hybrid-SSM decode gap on CUDA
Labels: area:core, priority:medium, status:ready, type:performance

#630 fix(cuda/quant): repair nvfp4 execution path on CUDA
Labels: area:core, priority:medium, status:ready, type:bug

#629 perf(cuda/moe): fix MoE prefill collapse on CUDA (gather_qmm / grouped GEMM path)
Labels: area:core, area:models, priority:medium, status:ready, type:performance

Label descriptions:
- area:benchmark: Benchmark harness and performance measurement (bench_*.sh, /update-benchmarks)
- area:core: mlxcel-core: MLX FFI, primitives, KV cache, layers
- area:inference: Generation, sampling, decoding (incl. speculative, DRY)
- area:models: Model architectures, weights, loading, metadata
- priority:high: High priority
- priority:medium: Medium priority
- status:ready: Ready to be worked on
- type:bug: Bug fixes, error corrections, or issue resolutions
- type:performance: Performance improvements