OpenAI Stargate memory demand reaches 900,000 wafers per month as AI shakes up the traditional memory ecosystem

 8:47am, 15 October 2025

Amid surging global demand for artificial intelligence (AI) computing power and efficiency, the massive "Stargate" project planned by ChatGPT developer OpenAI has become the core catalyst driving structural change in the global memory market. This costly project has created unprecedented demand for memory chips, forcing the supply chain to accelerate innovation and redefining the division of labor and value across high-bandwidth memory (HBM), standard dynamic random access memory (DRAM), and even solid-state drives (SSDs).

The global memory market has reached a critical turning point. According to Reuters, South Korea's two major memory manufacturers, Samsung Electronics and SK Hynix, have officially signed letters of intent (LOI) to supply key memory chips for OpenAI's data centers. The move not only marks the Korean chipmakers' formal entry into the enormous "Stargate" supply chain, but also cements memory's strategic position in the AI era.

Korean companies play a decisive role in the global memory market: the two firms together control about 70% of the global DRAM market. More importantly, in HBM, an indispensable core component for AI servers and data centers, Korean companies hold a commanding lead of nearly 80%. With demand for AI-accelerated computing exploding, HBM has become the go-to component for meeting enormous computing-power requirements.

However, the memory requirements of hyperscale AI projects such as "Stargate" are no longer a matter of simply stacking capacity; they pose a comprehensive challenge across speed, capacity, cost, and efficiency. The particular nature of AI workloads, especially in the model inference stage, is pushing memory architecture from the traditional single memory pool toward an era of finely managed, tiered "AI memory hierarchy".

Tiering of memory requirements for AI workloads

AI models, especially large language models (LLMs), have memory requirements similar to those of the human brain: they need different levels of "memory" to handle immediate, short-term and long-term information. According to market observers, memory demand in the AI era falls into three main tiers by data type. Each tier corresponds to different memory products, which in turn determines the product mix and revenue structure of the future memory market.

First, HBM mainly holds real-time memory data. Because HBM offers extremely high bandwidth and read/write speeds but relatively limited capacity, it handles "extremely hot data" and "real-time conversations", which demand extremely low latency. HBM capacity ranges from roughly 10GB to hundreds of GB, and in AI servers it sits alongside GPU processors to supply core computing power. Second, DRAM serves as short-term memory. It is relatively fast and offers large capacity; in servers, the new high-speed interface protocol CXL is often used to extend system main memory, pooling several TB of DDR main memory into a large-capacity cache. DRAM mainly stores "hot data" and "multi-turn dialogue", with capacities of roughly 100GB up to the TB range.

Finally, SSDs play the role of long-term memory, storing external knowledge. SSDs offer huge capacity, ranging from TB to PB, and are mainly used to hold data such as "historical conversations", the "RAG knowledge base" and "corpora". This hierarchical memory model shows that the future AI data center will be a complex system in which HBM, DRAM and SSD work together, and "Stargate"'s extreme pursuit of scale and efficiency will accelerate the widespread deployment of this tiered architecture.
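
The tiering described above can be summarized in a few lines of code. The sketch below is purely illustrative: the capacity ranges and data categories come from this article, while the routing rule itself is a simplification, not any vendor's actual placement policy.

```python
# A minimal sketch of the three-tier "AI memory hierarchy" described above.
MEMORY_TIERS = {
    "HBM":  {"role": "real-time memory",  "capacity": "~10 GB to hundreds of GB",
             "data": {"extremely hot data", "real-time conversation"}},
    "DRAM": {"role": "short-term memory", "capacity": "~100 GB to several TB (DDR/CXL)",
             "data": {"hot data", "multi-turn dialogue"}},
    "SSD":  {"role": "long-term memory",  "capacity": "TB to PB",
             "data": {"historical conversations", "RAG knowledge base", "corpus"}},
}

def place(data_category: str) -> str:
    """Return the tier whose data set contains the category, defaulting to SSD."""
    for tier, spec in MEMORY_TIERS.items():
        if data_category in spec["data"]:
            return tier
    return "SSD"

print(place("real-time conversation"))  # HBM
print(place("multi-turn dialogue"))     # DRAM
```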

KV cache is the short-term memory of AI models, and solutions for it have become an industry focus

With the explosion of AI demand, one of the key technologies driving the surge in memory capacity requirements is the "attention mechanism" used by large language models during inference and the resulting "KV cache" mechanism. During inference, the model uses an attention mechanism loosely analogous to the human brain: it must retain the keys (Key) and values (Value) derived from the query and its context in order to answer the prompt. Without caching, every time a new token (roughly, a new word) is processed, the model would have to recompute the keys and values of every previously processed token to update the attention weights, an extremely time-consuming process. To remove this bottleneck, LLMs adopt the KV cache. Much like a student taking notes, the KV cache stores previously computed keys and values in memory, eliminating the cost of recalculation at each step and speeding up token processing and generation by orders of magnitude.
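
A minimal sketch of the idea, using NumPy with illustrative dimensions rather than any real model: at each decoding step, only the new token's key and value are computed and appended to the cache, while attention reads the entire cached history.

```python
# Minimal single-head attention decode loop with a KV cache, in NumPy.
# All shapes and the random "hidden states" are illustrative, not a real model.
import numpy as np

d = 64                                        # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scores = K @ q / np.sqrt(d)               # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax
    return weights @ V                        # (d,)

K_cache = np.empty((0, d))                    # keys of every token seen so far
V_cache = np.empty((0, d))                    # values of every token seen so far

for step in range(8):                         # generate 8 tokens
    x = rng.standard_normal(d)                # stand-in for the new token's hidden state
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k])         # only the NEW key/value are computed...
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)         # ...and attention reuses the cached history
```

Without the cache, the keys and values for the whole history would have to be recomputed at every step; with it, each step does a constant amount of new projection work plus one attention pass over the stored history.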

The KV cache is called the "short-term memory of the AI model". It lets the model remember what it has already processed, so that whether the user resumes a discussion or asks a new question, nothing has to be recomputed from scratch. This gives the AI long-form context: it can recall what users have said, reasoned about, and provided at any point, delivering faster and more thoughtful answers in longer, deeper discussions. The challenge posed by KV caching is nonetheless huge: the longer the context, the larger the cache required. Even for medium-sized models, the KV cache can quickly grow to several GB per session, making KV cache solutions a focus for hardware and software vendors alike.
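
As a rough illustration of why the cache grows so quickly, the back-of-the-envelope calculation below assumes a hypothetical 7B-class transformer (32 layers, 32 attention heads of dimension 128, FP16 values); the parameters are assumptions for illustration, not figures from the article.

```python
# Back-of-the-envelope KV cache sizing for an assumed 7B-class model (hypothetical figures).
def kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2, seq_len=8192):
    # Each token stores one key vector and one value vector per head, per layer.
    per_token = 2 * n_layers * n_heads * head_dim * dtype_bytes
    return per_token * seq_len

print(f"{kv_cache_bytes(seq_len=1) / 1024:.0f} KB per token")                 # ~512 KB
print(f"{kv_cache_bytes(seq_len=8192) / 2**30:.1f} GB per 8K-token session")  # ~4.0 GB
```

At that rate, serving many concurrent long-context sessions quickly strains the HBM attached to a single accelerator, which is exactly the pressure the tiering approaches discussed below aim to relieve.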

For long-context inference and top-tier memory configurations, NVIDIA launches Rubin CPX

Facing Stargate-level long-context processing requirements, GPU maker NVIDIA's latest innovation directly targets the challenge of high memory performance and scalability. The newly launched Rubin CPX GPU has a core mission: breaking through the bottleneck AI systems face in "long-context" inference. As it becomes more common for AI models to process millions of tokens (for example, long-document understanding or hour-long video generation), Rubin CPX addresses the limitation with a new design that integrates video decoders, encoders and long-context inference processing on a single chip, delivering unprecedented speed and performance.

NVIDIA founder and CEO Jensen Huang noted that Rubin CPX is the first CUDA GPU designed specifically for massive-context AI. It works alongside the Vera Rubin CPU and Rubin GPU to form the Vera Rubin NVL144 CPX platform, which delivers up to 8 exaflops of AI compute in a single rack, roughly 7.5 times the existing GB300 NVL72 system. To sustain such demanding AI workloads, the platform carries an astonishing memory configuration: 100TB of fast memory and 1.7PB per second of bandwidth, ensuring data can flow at extremely high speed.

At the single-chip level, Rubin CPX is equipped with 128GB of GDDR7 memory, adopts NVFP4 precision, and delivers 30 petaflops of compute, handling large-scale AI inference with very high energy efficiency. Compared with the previous generation, the system's attention acceleration is roughly tripled, allowing AI models to handle longer context sequences without slowing down. NVIDIA's solution thus represents the top end of memory market demand: the pursuit of ultimate speed and capacity, backed by high-end HBM and GDDR7, which underpins the long-term, stable growth of memory makers' HBM business.

As cost optimization and cache management technologies rise, Huawei opens the era of dedicated KV cache management

Although HBM is indispensable for AI computing, its high price has become a major component of data center memory cost. To cope with the ever-expanding KV cache and reduce total cost of ownership (TCO), the industry is actively developing solutions that optimize memory usage through software and architectural innovation, which in turn places new demands on the standard DRAM and SSD markets.

Huawei recently unveiled a new software tool called "Unified Cache Manager" (UCM), which aims to accelerate the training and inference of large language models (LLMs) without relying on HBM. UCM is a KV-cache-centered inference acceleration suite that integrates multiple cache-acceleration algorithms. Its core function is to distribute AI data hierarchically across HBM, standard DRAM and SSD, based on the latency characteristics of each memory type and the latency requirements of different AI applications. By managing the KV cache data generated during inference in tiers, UCM expands the inference context window, achieves a high-throughput, low-latency inference experience, and reduces the inference cost per token.
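
The sketch below illustrates the general idea of tiered KV cache placement: hot blocks stay in HBM, older ones are demoted to DRAM and then SSD, and any block that is touched again is promoted back. It is a toy model of the concept only; the class, the capacities and the LRU-style demotion policy are assumptions and do not reflect Huawei's actual UCM implementation.

```python
# Toy model of tiered KV cache placement (not Huawei's UCM API).
from collections import OrderedDict

class TieredKVCache:
    """Keep the hottest KV blocks in 'hbm', demote older ones to 'dram', then 'ssd'."""

    def __init__(self, hbm_blocks=4, dram_blocks=16):
        self.capacity = {"hbm": hbm_blocks, "dram": dram_blocks}   # ssd is unbounded here
        self.tiers = {"hbm": OrderedDict(), "dram": OrderedDict(), "ssd": OrderedDict()}

    def put(self, block_id, kv_block):
        self.tiers["hbm"][block_id] = kv_block     # new/hot blocks land in the fast tier
        self._demote("hbm", "dram")
        self._demote("dram", "ssd")

    def _demote(self, src, dst):
        # Move least-recently-used blocks down one tier when the source overflows.
        while len(self.tiers[src]) > self.capacity[src]:
            old_id, old_block = self.tiers[src].popitem(last=False)
            self.tiers[dst][old_id] = old_block

    def get(self, block_id):
        for name in ("hbm", "dram", "ssd"):
            if block_id in self.tiers[name]:
                block = self.tiers[name].pop(block_id)
                self.put(block_id, block)          # promote back to the hot tier on access
                return block
        return None
```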

Huawei's UCM emphasizes that even without HBM, the capacity and speed problems of the KV cache can be solved through coordinated software and hardware optimization. That gives DRAM and SSD a solid technical basis for playing a more active role in the AI ecosystem.

Enfabrica attempts to reduce data center memory costs at the hardware architecture level

In addition, Enfabrica, a chip startup backed by NVIDIA, is tackling data centers' high memory costs from the hardware architecture up. It has launched a system pairing EMFASYS software with its ACF-S chips (also known as SuperNICs); the ACF-S is essentially a switching chip that combines Ethernet and PCI-Express/CXL connectivity. Rochan Sankar, founder and CEO of Enfabrica, pointed out that through the company's purpose-built network chip, AI compute chips can be connected directly to devices populated with DDR5 memory.

Although DDR5 is not as fast as HBM, it is much cheaper. Enfabrica's goal is a shared memory pool large enough to hold KV vectors and embeddings while running at the speed of the host's main memory. Foreign media believe that if the KV cache at the core of AI inference can be accelerated this way, it could become the long-awaited "killer application" for Enfabrica and its peers. By using its own software to move data between AI chips and large amounts of low-cost memory, Enfabrica keeps costs under control while preserving data center performance. This kind of architectural innovation highlights the market's strong appetite for leveraging low-cost, high-capacity DDR5/CXL memory to meet KV cache needs.

Overall, the OpenAI "Stargate" project and the broader AI industry's pursuit of long-context, high-efficiency inference are having a profound and complex impact on the global memory market: HBM has become the unshakable strategic core, the role of DRAM is being redefined, SSDs are entering high-performance AI architectures, and architectural innovation is playing a decisive role. The ultra-large-scale AI inference demand represented by Stargate is thoroughly reshaping the memory market's supply-and-demand ecosystem.