Ai Benchmarks for Code

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Kimi K2.7-Code claims 30% fewer thinking tokens and a drop-in API swap path, but independent benchmarks show kernel ...

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark ...

16d

Gomboc AI Publishes First Open Benchmark for AI Code Remediation

15 cloud scenarios. 43 merge-ready fixes. 100% loop closure. 12 minutes and $17 to author once; seconds and zero-cost ...

Morning Overview on MSN

Microsoft’s new MAI-Code model turns plain-English descriptions into working app code

Microsoft released MAI-Code, a model designed to convert plain-English descriptions into functional application code, pushing ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting ...

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people in the organization to examine workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste — of ...

CIO

How CodeRabbit Tackles the AI Code Review Bottleneck

Michael: More code is being generated by AI, and that throughput is putting strain on the review process. AI isn’t always ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Forbes

AI Models Still Struggle With Reasoning — And Here’s Why

Forbes contributors publish independent expert analyses and insights. I write about the economics of AI. What looks like intelligence in AI models may just be memorization. A closer look at benchmarks ...

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...

Wired

AI Agents Are Getting Better at Writing Code—and Hacking It as Well

One of the best bug-hunters in the world is an AI tool called Xbow, just one of many signs of the coming age of cybersecurity automation. The latest artificial intelligence models are not only ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results