SkillsBench results (89 real-world tasks) across four mainstream agent frameworks. A task counts as passed (Pass@1) when its reward is ≥ 0.5.
| Rank | Framework | Model | Baseline Pass% | SkCC Pass% | Δ Pass Rate | Mean Reward Δ | p-value | Cohen's d |
|---|---|---|---|---|---|---|---|---|
| #1 | Kimi CLI | kimi-k2.5 | 35.1% | 48.7% | +13.6pp | +0.142 | 0.0063** | 0.33 |
| #2 | Claude Code | claude-opus-4-6 | 21.1% | 33.3% | +12.2pp | +0.274 | 0.0103* | 0.60 |
| #3 | Codex CLI | gpt-5.3-codex | 38.5% | 42.3% | +3.8pp | +0.067 | — | — |
| #4 | Gemini CLI | gemini-2.5-pro | 22.2% | 22.2% | 0.0pp | +0.019 | — | — |

Cross-model comparison: the same Kimi-compiled skills evaluated under different models (all using the Kimi compilation backend).

| Model | Framework | Backend | Baseline → SkCC Pass% | p-value | Cohen's d | Verdict |
|---|---|---|---|---|---|---|
| kimi-k2.5 | Kimi CLI | Kimi | 35.1% → 48.7% | 0.0063** | +0.33 | SkCC > Baseline |
| glm-5.1 | OpenHands | Kimi | 48.9% → 50.0% | 0.857 | −0.03 | SkCC ≈ Baseline |
| deepseek-v4-flash | OpenHands | Kimi | 72.7% → 73.9% | 0.2561 | −0.14 | Baseline > SkCC |
The same Kimi-compiled skill format produces divergent effects across models, confirming that compilation gains are model-specific and justifying SkCC's multi-backend architecture.
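The headline columns above can be reproduced from per-task reward vectors. Below is a minimal sketch assuming the reward-threshold pass rule from the table caption and a paired-differences Cohen's d (mean difference over the standard deviation of differences); the exact significance test used for the p-value columns is not specified here, though a paired test such as `scipy.stats.ttest_rel` on the same difference vector would be a natural choice. The reward arrays are hypothetical illustration data, not SkillsBench results.

```python
from statistics import mean, stdev

THRESHOLD = 0.5  # a task counts as passed when its reward is >= 0.5


def pass_rate(rewards):
    """Pass@1 under the reward-threshold rule from the table caption."""
    return sum(r >= THRESHOLD for r in rewards) / len(rewards)


def cohens_d_paired(baseline, treated):
    """Cohen's d on per-task reward differences (paired design, assumed)."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    return mean(diffs) / stdev(diffs)


# Hypothetical per-task rewards for one framework (illustration only).
baseline = [0.2, 0.6, 0.4, 0.8, 0.1, 0.5, 0.3, 0.7]
skcc = [0.4, 0.7, 0.6, 0.8, 0.3, 0.6, 0.5, 0.9]

delta_pp = (pass_rate(skcc) - pass_rate(baseline)) * 100
print(f"baseline pass:      {pass_rate(baseline):.1%}")
print(f"SkCC pass:          {pass_rate(skcc):.1%}")
print(f"delta pass rate:    {delta_pp:+.1f}pp")
print(f"mean reward delta:  {mean(skcc) - mean(baseline):+.3f}")
print(f"Cohen's d (paired): {cohens_d_paired(baseline, skcc):.2f}")
```

With these toy vectors the script reports a +25.0pp pass-rate gain, matching how the Δ Pass Rate and Mean Reward Δ columns are derived from the two pass percentages and reward means.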