Like many in the AI world, I've been trying to distill the key messages from the unfolding DeepSeek epic. After reading my great grandmother’s blog post on the subject, I was inclined to share my thoughts. When my son came home from kindergarten yesterday and showed me the report he’d written, I felt I needed to weigh in now. So here’s what I think the saga says about the AI landscape in general and the superconducting optoelectronic networks being built by Great Sky in particular.
The impact of DeepSeek on the AI industry is due primarily to two factors. First, DeepSeek claims their model was trained with far fewer GPUs than the models of competitors such as OpenAI, Anthropic, and Meta. They claim to have used 2048 H800 GPUs (2.788M H800 GPU hours), compared to at least 10x that for OpenAI’s o1. Second, the claimed total cost to train the model was $6M, again far lower than what their competitors spend. For some, this called into question the value of Nvidia. Its market value dropped $600B (17%) in a single day, and several weeks later it has not recovered much of that. I’ll say more about Nvidia shortly. First, let’s address the veracity of the claims.
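As a quick sanity check on what the claimed figures imply, here is a minimal back-of-the-envelope sketch. It uses only the numbers quoted above and assumes, as an idealization, that all 2048 GPUs ran for the entire job; the function names are illustrative.

```python
# Figures quoted in the text above (DeepSeek's own claims).
GPU_COUNT = 2048             # H800 GPUs
GPU_HOURS = 2.788e6          # total H800 GPU hours
TRAINING_COST_USD = 6e6      # approximate claimed training cost


def wall_clock_days(total_gpu_hours: float, gpus: int) -> float:
    """Training duration in days if every GPU ran for the whole job."""
    return total_gpu_hours / gpus / 24


def cost_per_gpu_hour(cost_usd: float, total_gpu_hours: float) -> float:
    """Implied price of one GPU hour."""
    return cost_usd / total_gpu_hours


days = wall_clock_days(GPU_HOURS, GPU_COUNT)
rate = cost_per_gpu_hour(TRAINING_COST_USD, GPU_HOURS)
print(f"{days:.1f} days, ${rate:.2f} per GPU hour")
# prints "56.7 days, $2.15 per GPU hour"
```

In other words, the claim amounts to roughly two months of training at a rental-style price of about two dollars per GPU hour. That price covers a single training run, not the cost of buying and operating the cluster.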
Both claims of reduced expense and reduced GPU count have been called into question by AI industry leaders. Some have claimed DeepSeek actually has a GPU cluster with 50,000 GPUs, not the roughly 2,000 reported in their paper. The discrepancy might exist because they aren’t legally allowed to have this many GPUs due to the Biden administration’s export controls. These 50,000 are likely to be Nvidia H100 GPUs, not the H800s the paper claimed. H100s are more powerful. Further, it appears the company spent $1.6B on these GPUs as well as approximately another $1B operating the cluster, according to SemiAnalysis. In addition to likely misrepresenting the number of GPUs, the flavor of GPUs, and the expense of arriving at the DeepSeek V3 and R1 models, it is also likely DeepSeek made use of LLMs produced by OpenAI in their training, meaning they are leveraging some of the investment made by OpenAI. It therefore appears unlikely that DeepSeek shows we can get far more performance from AI models with less compute hardware and significantly lower expense. From an energy perspective, the DeepSeek boom may prove a bust as well. As the MIT Technology Review reports, the fact that DeepSeek’s R1 reasoning model is very compute-intensive during inference leads to significantly more energy consumption. The model is using about 87% more energy than Meta’s model based on Llama 3, due primarily to the fact that it is generating longer responses. This reasoning model is using about 35x the energy of a conventional LLM.
It appears DeepSeek built a model that is comparable in power to OpenAI’s latest reasoning model using comparable compute infrastructure and financial resources. Perhaps it is still a victory for China, but it doesn’t appear to be a sea-change moment in how we think about the cost and hardware requirements for leading AI models. As Gary Marcus wrote, “DeepSeek R1 is not smarter than earlier models” and it “doesn’t solve hallucinations or problems with reliability.” Regarding cost, “It is still expensive to operate [during inference]”, especially for reasoning. While Nvidia’s share price took a hit, Marcus correctly identifies that “The biggest threat isn’t to Nvidia, but rather to OpenAI and Anthropic.” This is a significant takeaway, and a major thread woven through Great Sky’s business model: AI hardware companies have significantly larger moats than AI software companies. The DeepSeek moment illustrates this point. It is much easier to catch up to software models than it is to develop hardware capabilities. Any AI software company is only one energetic startup away from losing its edge. From this perspective, the turmoil of recent weeks bolsters my view that Great Sky is in a unique position to capture a massive AI market, and our moats appear as wide as ever.
Whether DeepSeek's model was built with comparable or far fewer resources than US competitors, there are insights Great Sky can glean. The most important insights relate to the innovative technical aspects of their approach to LLMs. Their model includes 671 billion parameters. While this is not the largest model to date (GPT-4 probably has 1.8 trillion parameters), it is still a large model, and as such it will thrive on energy-efficient hardware. The model incorporates a mixture-of-experts architecture, which adds functionality and efficiency, particularly in reasoning. They add multi-head latent attention, breaking the large transformer model into a network of smaller attention heads. Then, by representing different data structures within the model with different precisions, they save memory and reduce data movement. Some numbers are represented with just 1.5 bits (3 values), and the maximum precision of any number in the model is 8 bits, consistent with the trend of moving to lower precision across the AI industry. To gain additional speed at inference time, they perform multi-token prediction, where not just one but several tokens are generated in parallel. While DeepSeek appears to have done a nice job incorporating all these techniques in their model, it’s important to recognize that these techniques have been developed previously, for example in the Switch Transformer paper from Google in 2022. And what writing by Jeff Shainline would be complete without noting that all these techniques ultimately bear very close resemblance to information processing techniques observed in the brain?
How are these techniques similar to concepts from neuroscience, and how can they be implemented with SOENs? First, mixture of experts is a concept with deep roots in neuroscience. Vernon Mountcastle identified columns of neurons in the neocortex in 1978. These clusters of densely interconnected neurons comprise functional elements that specialize in processing certain types of information. One column may be tuned solely to detecting vertical edges, another to the color red, etc. Mountcastle collaborated with Gerald Edelman to elaborate this theory into a detailed model of the function of the neocortex. This bottom-up neuroscientific line of reasoning intersected with the top-down ideas from psychologist Bernard Baars in his 1988 book, “A Cognitive Theory of Consciousness”. Baars lays out a model for cognition in which a very large assembly of specialists competes for the resources of the “global workspace”, which is essentially the conscious train of thought of the mind. Later, Stanislas Dehaene expanded this idea and identified the neocortical columns of Mountcastle as the competing experts that inhibit each other to make their signal more prominent. Jeff Hawkins further developed these ideas in his theory of neocortical function called the “Thousand Brains” model. These cortical columns in the brain are extremely similar to the experts used in modern mixture-of-experts AI models. To quote Ecclesiastes 1:9, “...there is no new thing under the sun.”
Multi-head latent attention is related to this concept. With multi-head latent attention, a large AI model is clustered into many smaller transformers, each capable of focusing attention on different aspects of the sequence being processed. This is “multi-head attention”, with each transformer contributing an attention head. It is “latent attention” because any single transformer may or may not be called upon to contribute to the ongoing information processing at a given time. While transformers are mathematically distinct from common computational models of the brain, recent work has shown that the hippocampal complex taken as a whole accomplishes functions mathematically similar to transformers. But the hippocampal complex is a vast network with around a trillion synapses capable of implementing numerous different transformer-type attention functions at any given time. Crucially, in the brain the neocortex and hippocampus interact constantly, with different columns of the neocortex being excited and employed based on the relevance of their contributed expertise to the information being processed, while the hippocampus is directing the activity of those columns dynamically, much like multi-head attention calls upon distinct attention heads at different times. Taken together, the interaction of the neocortex with hippocampus accomplishes information processing analogous in many ways to a mixture-of-experts model with multi-head latent attention.
Of course, the way the brain performs these operations is significantly more efficient than a GPU (20W versus 20MW or so). Because the brain co-locates memory with processing, every “expert” has the data it needs right there in its synapses, so requirements on data movement are minimized. A major challenge for mixture-of-experts models is determining which experts to prioritize at any given time. With neural systems, this is taken care of with network dynamics. The network forms an associative memory, which means any input efficiently excites only those experts from the network that process information relevant to that input. Inhibition between experts causes those with less relevant information to go silent, leading to sparse activity, improved signal-to-noise, and energy savings. Experts are only active when their expertise is relevant.
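For readers who think in code, the "only the relevant experts stay active" idea has a familiar software counterpart in top-k gating, a pattern commonly used in mixture-of-experts models. The sketch below is a minimal illustration of that pattern. It is not DeepSeek's actual router, and it does not model the network dynamics of a neural system. The relevance scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(relevance, k):
    """Keep the k most relevant experts, silence the rest, renormalize."""
    weights = softmax(relevance)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    winners = ranked[:k]
    kept = sum(weights[i] for i in winners)
    return {i: weights[i] / kept for i in winners}


# Hypothetical relevance of six experts to the current input.
relevance = [0.1, 2.3, -1.0, 1.9, 0.0, -0.5]
routing = top_k_route(relevance, k=2)
print(sorted(routing))  # indices of the experts that remain active
```

Only two of the six experts do any work for this input, which is the sparse activity described above. In a GPU implementation the router is an explicit computation and the chosen expert's weights must be fetched from memory, whereas in an associative network the selection emerges from excitation and inhibition among the experts themselves.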
Another technique used by DeepSeek to improve performance is mixed-precision representation of numbers, with bit depths ranging from 1.5 to 8 bits to represent floating point numbers. Different regions of the brain use a similar trick to reduce data storage requirements. Some synapses toggle between only a few stable states, more analogous to a 1-2 bit number, while other synapses can take hundreds of values, like an 8-bit number. This is a sensible means to ensure that the system is not wasting resources representing information more granularly than is necessary for a given task. Because GPUs have not improved significantly in recent years in terms of speed or energy efficiency, many of the gains achieved by the latest models result from reduced-precision representation of the quantities entering the model.
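To make the memory argument concrete, here is a rough sketch of how storage scales with bit depth. It assumes, purely for illustration, that all 671 billion parameters are stored at a single precision, which no real mixed-precision model does. A three-valued number carries log2(3), or about 1.58 bits, which is the roughly 1.5-bit figure mentioned above.

```python
import math

PARAMS = 671e9  # parameter count quoted earlier in the text


def bits_for_levels(levels: int) -> float:
    """Information content, in bits, of a number with `levels` distinct values."""
    return math.log2(levels)


def footprint_gb(params: float, bits_per_param: float) -> float:
    """Storage in gigabytes if every parameter used the same bit depth."""
    return params * bits_per_param / 8 / 1e9


ternary_bits = bits_for_levels(3)  # three values per number
for label, bits in [("16-bit", 16), ("8-bit", 8), ("3-value", ternary_bits)]:
    print(f"{label:>8}: {bits:5.2f} bits -> {footprint_gb(PARAMS, bits):7.1f} GB")
```

Halving the bit depth halves both the storage and the data that must be moved for every operation, which is why lower precision pays off even when the arithmetic units themselves are no faster.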
Finally, multi-token prediction provides a speed improvement in R1. This technique allows the model to process common groups of tokens concurrently, providing lower latency for sequence generation. Speedup of around a factor of three is typical with this technique. This goes back to multi-head attention. It is this ability to direct attention to multiple parts of a sequence simultaneously that supports prediction of multiple tokens at once. Such an approach provides the biggest advantage in situations that require processing very long sequences, so-called “long-horizon reasoning”. It is precisely these long-horizon tasks that are proving so challenging for transformer models, and the techniques used by DeepSeek have been the most successful in gaining traction in these important application spaces. Ultimately, most of the tasks for which humans would like to use AI involve reasoning over long context windows.
What does this all mean for SOENs? My first response is a bit of bolstered confidence, as I see the latest gains in AI models coming from information-processing mechanisms employed by the brain, even if neuroscience wasn’t the direct muse for the innovations. Over the past 10 years, we have designed SOENs from the device to the network level to implement exactly these mechanisms. At the device level, we’ve demonstrated memory cells ranging from 3 bits to 8 bits, conveniently matching the levels of precision employed by DeepSeek. Crucially, our memory cells are co-located with synaptic processing circuitry, so the memory required by each synapse is sitting right there, and no data needs to be moved for the synapse to perform its computations, providing speed and energy savings over DeepSeek’s GPU implementation. At the mesoscale, we have been describing for years how we can stack wafers to form cortical columns, embodying in artificial hardware the brain’s architecture for efficient implementation of a mixture of experts. Our rich repertoire of neuron classes enables us to straightforwardly implement systems in which an extensive array of cortical columns interacts with a massive hippocampus to achieve complex attention mechanisms analogous to multi-head attention. Such an architecture can extend to the macroscale, with large numbers of columns efficiently exchanging information through fiber-optic interconnects. Following the model of language processing in the brain, such a neocortical-hippocampal system can process language in a hierarchical manner, with certain experts focused on basic sounds and speech fragments, others on words, and onward through sentences and finally complex thoughts with extensive context.
This hierarchical network structure is extremely efficient for processing long text, and leads to an improved version of multi-token prediction in which speech is planned in a high-level outline by some brain regions, while areas lower down the hierarchy shape sentences and finally produce words. This hierarchical comprehension and production of language is likely where the field of AI is going, but the field needs more efficient hardware to accomplish it in large clusters with low latency. With SOENs, we can incorporate all these principles into massive AI supercomputers that can continue to grow in scale with network communication latency limited only by the time it takes light to propagate across the network. By performing this communication at the single-photon level, we provide the ability to operate such networks without Three Mile Island, and by using superconducting electronics we introduce the capacity to place mixed-precision memory at every processing element while supporting computations a hundred times faster than transistors.
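To give a feel for what a latency floor set by light propagation looks like, here is a small sketch of one-way travel time through optical fiber. The group index of 1.47 is a typical textbook value for silica fiber and the link lengths are hypothetical. Neither is a Great Sky specification.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum (exact by definition)


def one_way_latency_ns(distance_m: float, group_index: float = 1.47) -> float:
    """Time in nanoseconds for light to cross `distance_m` of fiber.

    group_index=1.47 is an assumed, typical value for silica fiber.
    """
    return distance_m * group_index / C_VACUUM_M_PER_S * 1e9


# Hypothetical link lengths: within a rack, across a room, across a building.
for meters in (1, 10, 100):
    print(f"{meters:>4} m: {one_way_latency_ns(meters):7.1f} ns")
```

Even a link spanning a 100 m facility costs about half a microsecond each way, so a system whose communication latency is set by propagation alone can keep growing physically without the network becoming the bottleneck.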
Let’s come back to Nvidia’s stock price. Despite the fact that the settling of the DeepSeek dust appears to reveal that massive GPU clusters remain as valuable as ever, Nvidia’s stock is still about 14% below where it was prior to the whale of a tale that sent the AI world into a tailspin. Perhaps this is because there are multiple underlying reasons that Nvidia as a company was slightly overvalued, as argued convincingly in this extensive post by Jeffrey Emanuel. One of these reasons is that Google, in partnership with Broadcom, has shown that it is possible to sidestep Nvidia’s monopoly on picks and shovels. OpenAI and Microsoft are pursuing their own chips, as is Amazon. Apple has been famously successful designing its own silicon. Meanwhile, AMD has been boxed out of most of the GPU market not by inferior silicon, but by buggy software and drivers, and that appears to be changing as well. Note that in none of these cases do these companies actually fabricate the chips. TSMC retains the strength in that regard. These trends indicate that footbridges and ropes are slowly but surely providing passage across Nvidia’s moats. Nvidia is excellent at what it does, but many have shown they can get pretty close to the same level of performance with lower costs by not funding Nvidia’s 75% profit margins. AI hardware is more difficult to replicate than AI software, but it is not impossible, and now several major tech companies have had enough time to spin up significant effort.
The situation for Great Sky and our superconducting optoelectronic networks is different. SOENs are brand new, and at present we are the only company in the world capable of designing, manufacturing, and implementing models on this hardware. If we can grow our team, scale up our foundry process, and bring our models to maturity, we will have both a more coveted castle than Nvidia and much wider moats. DeepSeek's big splash was due to the hope that somehow they could do more with less compute and energy. But they're still tethered to transformers implemented on GPUs and all the constraints entailed by the paradigm. The episode vividly illustrates the global market appetite for very large models that can learn efficiently with reduced time, energy, and expense—exactly the attributes our hardware is offering. Even if the original numbers put forth by DeepSeek are true, our hardware, once mature, can still achieve far superior training time and inference with lower energy and latency. This can be achieved because we co-locate processing with memory, we use light for communication at the single-photon level, and these advances taken together enable us to more efficiently implement architectures approximated by mixture of experts and multi-head latent attention. The lesson of DeepSeek is that Great Sky needs to execute, and we need to do it now.
