GLM-5.1: China’s New Open Source LLM That Outperforms Claude Opus 4.6 on Software Engineering

In a development that could reshape the competitive landscape of open source AI, Z.ai (also known as Zhipu AI) has released GLM-5.1, a 754-billion-parameter Mixture-of-Experts model that outperforms Anthropic’s Claude Opus 4.6 on the prestigious SWE-Bench Pro benchmark for software engineering tasks.

The release, made available under the permissive MIT License on Hugging Face, marks a significant milestone in the evolution of autonomous AI agents. Unlike previous models that typically plateau after a few dozen tool calls, GLM-5.1 can maintain goal alignment over extended execution traces spanning thousands of tool calls—running autonomously for up to eight hours on a single task.

The Staircase Pattern: Breaking Through Performance Plateaus

GLM-5.1’s core technological breakthrough lies in what Z.ai researchers call the “staircase pattern” of optimization. Traditional agentic workflows typically apply familiar techniques for quick initial gains and then stall. GLM-5.1, however, operates through periods of incremental tuning within a fixed strategy, punctuated by structural changes that shift the performance frontier forward.

In one striking demonstration, the model was tasked with optimizing a high-performance vector database. Starting from a Rust skeleton with empty implementation stubs, GLM-5.1 ran through 655 iterations and over 6,000 tool calls. At iteration 90, it autonomously shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, jumping performance from 3,547 queries per second to 6,400. By iteration 240, it introduced a two-stage pipeline involving u8 prescoring and f16 reranking, reaching 13,400 queries per second. The final result: 21,500 queries per second—roughly six times what single-session optimization could achieve.
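The two-stage design the model arrived at is a standard trick in vector search: score every candidate cheaply with low-precision integers, then rerank only a small shortlist at higher precision. A minimal sketch of that idea, assuming simple min-max u8 quantization and inner-product similarity (the function names and parameters here are illustrative, not from GLM-5.1's actual output):

```python
import numpy as np

def build_index(corpus: np.ndarray):
    """Store two copies of a float32 corpus: u8 for cheap prescoring,
    f16 for precise reranking."""
    lo, hi = corpus.min(), corpus.max()
    scale = 255.0 / (hi - lo)
    coarse = ((corpus - lo) * scale).astype(np.uint8)   # stage-1 representation
    fine = corpus.astype(np.float16)                     # stage-2 representation
    return coarse, fine, lo, scale

def search(query, coarse, fine, lo, scale, k=10, shortlist=100):
    # Stage 1: u8 prescoring over the whole corpus (fast, approximate).
    q_u8 = ((query - lo) * scale).astype(np.uint8)
    pre = coarse.astype(np.int32) @ q_u8.astype(np.int32)
    candidates = np.argpartition(-pre, shortlist)[:shortlist]
    # Stage 2: f16 reranking of the shortlist only (slower, precise).
    exact = fine[candidates] @ query.astype(np.float16)
    order = np.argsort(-exact)[:k]
    return candidates[order]
```

The prescoring pass touches every vector but uses only integer arithmetic, while the expensive higher-precision dot products run on perhaps 1% of the corpus; combined with IVF cluster probing, which skips most of the corpus entirely, this is how throughput can jump by integer multiples without new hardware.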

Extended Autonomous Work Time

“Agents could do about 20 steps by the end of last year,” wrote Z.ai leader Lou on X. “GLM-5.1 can do 1,700 right now. Autonomous work time may be the most important curve after scaling laws. GLM-5.1 will be the first point on that curve that the open-source community can verify with their own hands.”

The 202,752-token context window enables the model to maintain coherence over extremely long task sequences, while its 754 billion parameters provide the depth of knowledge needed for complex problem-solving. The model can autonomously run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement—all without human intervention.
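Stripped of the model itself, the benchmark-measure-adjust cycle described above is a simple hill-climbing loop: measure a candidate, mutate it, and keep the change only if the benchmark improves. A toy sketch under that assumption (the loop structure is illustrative; GLM-5.1's actual agent scaffold is not published in this article):

```python
import random

def refine(benchmark, mutate, initial, budget=100):
    """Iterative refinement: propose a change, benchmark it,
    keep it only if the score improves."""
    best, best_score = initial, benchmark(initial)
    for _ in range(budget):
        candidate = mutate(best)          # adjust strategy
        score = benchmark(candidate)      # run the benchmark
        if score > best_score:            # keep only improvements
            best, best_score = candidate, score
    return best, best_score
```

The interesting claim in the article is not the loop itself but its length: sustaining this cycle for thousands of tool calls without losing the goal is where earlier models plateaued.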

KernelBench Success

GLM-5.1 was also tested on KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba. The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer, eventually delivering a 3.6x geometric mean speedup across 50 problems—continuing to make useful progress well past 1,000 tool-use turns.

Commercial Availability

GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools. The model is available for download on Hugging Face under the MIT License, allowing enterprises to download, customize, and use it for commercial purposes without licensing fees.

This release comes as Z.ai, listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, seeks to cement its position as the leading independent developer of large language models in the region. With GLM-5.1, the company has demonstrated that the open source community now has access to a model capable of sustained, autonomous software engineering work that rivals the best proprietary alternatives.
