Chinese 10,000-GPU ("Wanka") Clusters Take on Nvidia

Savings News | May 23, 2025

The development of large language models has accelerated dramatically over the past two years, creating an urgent demand for computing power. The challenge lies in the acute scarcity of high-end GPUs, such as the Nvidia A100, which raises a pressing question: is this situation more of an obstacle or an opportunity? In response, numerous domestic computing power enterprises in China are actively seeking alternative solutions to navigate this predicament.

Among these companies, Moore Threads has emerged as one of the few Chinese enterprises capable of competing functionally with Nvidia's GPUs. The company aims to break through the computing power ceiling with its “cluster-based” solution, offering a pathway to enhance domestic GPU capabilities.

On the eve of the 2024 World Artificial Intelligence Conference, on July 3, Moore Threads announced a significant upgrade to its KUAE (Kuage) Intelligent Computing Cluster Solution. The upgrade expands the cluster's scale from thousands of computing cards to more than ten thousand, meeting the requirements for training trillion-parameter models. This substantial increase in capacity is intended to provide continuous, efficient, stable, and versatile general-purpose computing support for large models.

In the arena of artificial intelligence, the race for computing power has reached new heights. The competitive landscape increasingly resembles an arms race, with major tech giants investing heavily in computing capabilities to meet the demands of the large-model era.

To illustrate this point, Google unveiled its A3 virtual machines on May 10, 2023, integrating 26,000 Nvidia H100 GPUs into its supercomputer, and it has also built TPU v5p clusters of 8,960 self-developed chips each. Meta followed in March 2024 with two new AI training clusters, each containing 24,576 Nvidia H100 Tensor Core GPUs, a significant increase from the previous generation's 16,000 units.

OpenAI's GPT-4, reportedly built as a mixture of 16 expert models with roughly 1.8 trillion parameters, is estimated to have required training on over 25,000 A100 GPUs for 90 to 100 days.
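To put these figures in rough perspective, the implied training compute can be estimated from the card count, run length, per-card throughput, and an assumed utilization rate. The sketch below is a back-of-the-envelope calculation only; the A100 peak throughput and the 30% utilization are illustrative assumptions, not figures from the article.

```python
# Back-of-the-envelope estimate of GPT-4-scale training compute.
# Assumptions (illustrative, not from the article): A100 peak of
# 312 TFLOPS in BF16 and roughly 30% sustained utilization.

num_gpus = 25_000      # A100 cards cited above
days = 95              # midpoint of the 90-100 day range
peak_tflops = 312      # assumed A100 dense BF16 peak, in TFLOPS
utilization = 0.30     # assumed sustained fraction of peak

seconds = days * 24 * 3600
total_flops = num_gpus * peak_tflops * 1e12 * utilization * seconds
print(f"~{total_flops:.1e} FLOPs of training compute")  # roughly 2e25 FLOPs
```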

Such figures underline that clusters exceeding 10,000 cards are now the standard in the field of AI large models.

The pressing question looms: What type of computing power is required in this evolving AI landscape? Understanding the trajectory of large models provides useful insights into this challenge.

Since the Scaling Law came to prominence in 2020, large models have followed a “brute force” growth trajectory. OpenAI's GPT series illustrates the shift: parameter counts surged from billions to trillions, a growth of more than 100 times; the volume of training data escalated from terabytes to over 10 terabytes, itself a more than tenfold increase; and compute demand grew by more than a thousand times. Such large models therefore hinge on having sufficiently vast computing resources to keep pace with this rapid evolution.
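A commonly cited rule of thumb from the scaling-law literature (an approximation, not a formula from this article) ties these three quantities together:

$$ C \approx 6ND, $$

where $C$ is training compute in FLOPs, $N$ is the parameter count, and $D$ is the number of training tokens. Under this approximation, growing $N$ by roughly 100 times and $D$ by roughly 10 times implies about a 1000-fold increase in $C$, which is consistent with the thousandfold growth in compute demand described above.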

It is not merely about expansive scale; AI computing power also demands versatility. Currently, the architecture behind these large models rests primarily on the Transformer framework. While this structure dominates, it cannot encapsulate every solution as it continues to evolve, moving from dense to MoE (mixture-of-experts) configurations and from single-modal to multi-modal systems. Novel architectures such as Mamba, RWKV, and RetNet are also emerging, underscoring that the Transformer framework is not the sole answer.
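For readers unfamiliar with the dense-to-MoE shift mentioned above, the following is a minimal, generic sketch of top-k expert routing written in NumPy; it illustrates the general technique only and is not code from Moore Threads or from any of the named architectures.

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    """Minimal top-k mixture-of-experts routing.

    x:              (batch, d_model) input activations
    expert_weights: list of (d_model, d_model) matrices, one per expert
    gate_weights:   (d_model, num_experts) router matrix
    """
    logits = x @ gate_weights                           # (batch, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over experts
    topk = np.argsort(probs, axis=-1)[:, -k:]           # top-k expert indices

    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        for e in topk[b]:
            # Each selected expert processes the token, weighted by its gate score.
            out[b] += probs[b, e] * (x[b] @ expert_weights[e])
    return out

# Toy usage: 4 tokens, 8-dimensional model, 4 experts, route to the top 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
gate = rng.normal(size=(8, 4))
print(moe_layer(x, experts, gate).shape)  # (4, 8)
```

The practical appeal is that only k of the experts run for each token, so total parameter count can grow much faster than per-token compute.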

Moreover, the integration of AI with other domains such as 3D modeling, high-performance computing (HPC), and scientific computing is accelerating. The evolution of computational paradigms, together with increasingly diverse application requirements, has created strong demand for a general-purpose accelerated computing platform.

As model parameters progress from hundreds of billions to trillions, there is an emergent need for a super training factory: a “large and universal” accelerated computing platform.

Such a facility should drastically shorten training times and enable rapid iteration of model capabilities. According to Zhang Jianzhong, founder and CEO of Moore Threads, “True efficacy can’t be realized unless the structure is sufficiently large, computationally versatile, and ecologically compatible.”

Clusters of over 10,000 cards have become the standard for pre-training large models. For infrastructure providers, the presence or absence of such a cluster can have profound implications for success in the AI marketplace.

However, constructing a cluster of this magnitude is no simple feat.

Building a cluster of over 10,000 cards cannot be reduced to merely stacking thousands of GPUs; it is an intricate systems engineering challenge. At its core lies large-scale networking and interconnection, which ultimately determines the effective computational efficiency of the cluster. A range of studies shows that scaling a cluster linearly does not translate into a linear increase in effective computing power.
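A simple way to see why card count and useful compute diverge is to fold a scale-dependent efficiency factor into the throughput estimate. The per-card peak and the efficiency decay below are illustrative assumptions, not measurements from the article or from Moore Threads.

```python
import math

# Illustrative effective-compute model for a growing GPU cluster.
# Assumptions (not from the article): 1 PFLOPS peak per card and an
# efficiency that drops a fixed amount for every 10x growth beyond
# 1,000 cards, standing in for communication and synchronization overhead.

def effective_pflops(num_cards, peak_per_card=1.0,
                     base_efficiency=0.55, decay_per_10x=0.08):
    """Estimate cluster-wide useful PFLOPS under a simple decay model."""
    scale = max(math.log10(num_cards / 1_000), 0.0)
    efficiency = max(base_efficiency - decay_per_10x * scale, 0.0)
    return num_cards * peak_per_card * efficiency

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} cards -> ~{effective_pflops(n):,.0f} effective PFLOPS")
# Tenfold more cards yields less than tenfold more useful compute.
```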

Furthermore, maintaining high stability and availability, developing rapid fault localization tools, and ensuring strong diagnostic capabilities are also crucial. A super cluster of over 10,000 cards comprises thousands of GPU servers and switches and thousands of optical fibers and modules. Each training task involves millions of components working in unison, and the malfunction of any single element can interrupt training.
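The fragility this implies can be quantified with a standard series-system estimate: if the failure of any one of N independent components interrupts the run, the cluster-level mean time between interruptions shrinks roughly in proportion to N. The component counts and per-component MTBF values below are illustrative assumptions, not data from the article.

```python
# Series-system reliability sketch for a 10,000-card training run.
# Assumption (illustrative): each failure-relevant component fails
# independently at a constant rate given by the MTBF values below.

components = {
    # name: (count, assumed per-component MTBF in hours)
    "gpu":            (10_000, 400_000),
    "server":         (1_250,  300_000),
    "switch":         (500,    500_000),
    "optical_module": (20_000, 800_000),
}

# The cluster-level failure rate is the sum of individual failure rates.
cluster_rate = sum(count / mtbf for count, mtbf in components.values())
cluster_mtbf_hours = 1 / cluster_rate
print(f"Expected time between training interruptions: "
      f"~{cluster_mtbf_hours / 24:.1f} days")
```

Even with generous per-component reliability assumptions, interruptions arrive far more often at this scale than on a single machine, which is why the fault localization and stability tooling described above matters.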

Additionally, as large models continue to evolve and demand new model types and architectures, a 10,000-card cluster must support rapid ecosystem migration, enabling Day 0 adaptation to new models and changing technological demands. It is also critical to consider future requirements for general-purpose computing beyond today's accelerated-computing scenarios.

The path to constructing a 10,000-card cluster resembles a formidable mountain, fraught with challenges, but it is an endeavor that holds merit.

The establishment of such a cluster, while complex, isn’t simply about addressing one company's demand; it seeks to tackle the industry's broader scarcity of computational resources.

After nearly four years of development, Moore Threads has validated many of its approaches at the thousand-card cluster level. The company has now launched the KUAE (Kuage) cluster solution, designed to meet the core demand of the large-model era for “sufficient scale + computational versatility + ecological compatibility,” marking an upgrade of domestic cluster computing capabilities.

The KUAE cluster is built on general-purpose GPUs and delivered as a comprehensive, integrated software and hardware solution comprising the core KUAE computing cluster, the KUAE cluster management platform, and the KUAE model service platform. The system aims to solve the challenges of building and operating scalable GPU computing infrastructure, providing rapid time-to-market for commercial operations.

Moore Threads’ KUAE cluster solution showcases five significant attributes:

First, a single cluster surpasses the threshold of 10,000 cards, yielding total computing power on the order of 10,000 PetaFLOPS (10 ExaFLOPS).

Second, it targets an effective computing efficiency exceeding 60%.

Third, stability is paramount: weekly training efficiency is expected to exceed 99%, sustained by an average failure-free run of more than 15 days, with some systems capable of running stably for over 30 days (a rough consistency check of these targets follows this list).

Fourth, it emphasizes robust computational versatility, designed to accelerate all kinds of large models.

Fifth, it boasts outstanding CUDA compatibility and Instant On ecosystem adaptation, facilitating Day 0 migration of new models.
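One way to read the stability targets together: given a mean failure-free interval of 15 days and an assumed cost per interruption (detection, job restart, and progress lost since the last checkpoint), the weekly effective training time can be estimated as below. The recovery-time figures are illustrative assumptions, not numbers published by Moore Threads.

```python
# Rough consistency check of the stability targets listed above.
# The restart and checkpoint figures are assumptions for illustration.

mtbf_days = 15                  # average failure-free interval claimed above
restart_minutes = 30            # assumed detection + job restart time
checkpoint_interval_min = 60    # assumed checkpoint period (lose half on average)

interruptions_per_week = 7 / mtbf_days
lost_per_interruption = restart_minutes + checkpoint_interval_min / 2
lost_minutes = interruptions_per_week * lost_per_interruption
week_minutes = 7 * 24 * 60
effective = 1 - lost_minutes / week_minutes
print(f"Estimated weekly effective training time: {effective:.2%}")  # ~99.7%
```

Under these assumptions, the 15-day failure-free target and the 99% weekly training efficiency target are mutually consistent, leaving comfortable headroom against the 99% figure.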

“We hope our products give clients a better homegrown option to choose at a time when foreign alternatives may not be viable,” Zhang Jianzhong said. “For current users of large models in China, our strongest asset lies in exceptional ecological compatibility. Migrating developers to our KUAE cluster is almost seamless, requiring negligible code alterations and completing the transition in a matter of hours.”

Realizing the operational capabilities of this large-scale model training factory requires support from a robust network of partners. Organizations ranging from Zhihua AI and the Zhiyuan Institute to startups such as Dicu Technology, along with many others in the domestic modeling sphere, are already running workloads on Moore Threads' KUAE cluster. Notably, Moore Threads is the first domestic GPU company to work with Wukong Chip Cloud on large model training, and KUAE has become the industry's first platform to fully run a homegrown large model.

Launching a 10,000-card cluster is not a solo endeavor; it requires concerted effort across the industry. At the recent conference, Moore Threads signed strategic partnerships with major enterprises such as Qinghai Mobile and Qinghai Unicom for the cluster project. These collaborations further advance the practical deployment of Moore Threads' 10,000-card clusters across various regions.

Thanks to its high compatibility, high stability, extensive scalability, and high compute utilization, the KUAE Intelligent Computing Cluster has garnered recognition from multiple large model enterprises, establishing itself as a key player in China's model training and application landscape. “Just a few years ago, domestic computing power was viewed as a backup,” Zhang Jianzhong observed, “but it has now become the primary choice, offering long-term supply alongside localized services.”

While constructing a 10,000-card cluster poses significant hurdles, Moore Threads has shown unwavering resolve in the effort, knowing that the journey is both challenging and essential. It addresses not just the needs of a single enterprise but the pressing issue of computational power scarcity across the entire sector.

In conclusion, the unveiling of Moore Threads' full-stack KUAE intelligent computing center solution signals a pivotal breakthrough for domestic GPUs, positioning them to address the complexities of training large-scale models with trillions of parameters.
