Model Maker Moats
April 2026
There is no question language models will create immense value. Even if development halted today, there would be decades of elevated growth as models diffuse through the economy. In addition, there are no signs of halting on the horizon. Coding capabilities made a big leap in the last few months. Data center investment is accelerating. Improvements are published in optimizers1 and architectures.2
Future model capabilities are fundamentally uncertain. Predictions range from overtaking all economic activity to providing more intuitive information interfaces.3 Regardless of how model capabilities develop, their makers will be subject to market forces. The profits captured by an individual company often fall short of the value created. Value capture requires barriers that keep competitors at bay.
What are the barriers to making a model? Models are made from three ingredients: algorithms, compute, and data. The original recipe was simple: pre-train with next token prediction and instruction-tune with a reward model.4 This simple, publicly available recipe launched the most successful consumer application ever.
Since the launch of ChatGPT in 2022, language modelling has exploded into a multi-billion dollar industry. To understand where model companies can capture value, one must analyze the model making process in the context of markets. In this piece, I explore market dynamics in algorithms, data, compute, and the recipe for combining them.
Algorithms
Of the three base ingredients, algorithms have changed the least since ChatGPT launched. Conceptually, we can split algorithms into the architecture, objective, and optimizer. For architecture, there are no indications of a major overhaul.5 There have been tweaks — mixture-of-experts, sparse attention variants, and multi-modality — but nothing that constitutes a step change in capabilities.
In pre-training, neither objectives nor optimizers have changed. Cross-entropy minimization on the next-token prediction objective is the same as in the last public frontier models.6 Advancements in pre-training are substantial, but they come from compute and data, not algorithms.
Post-training has seen more changes since the original PPO algorithm.4 We now have more stable and efficient learning of human preferences through algorithms like DPO7 or SimPO.8 Reasoning-focused algorithms like GRPO9 or RLVR10 have also become part of the training recipe. While I'd be surprised if models differ significantly in architecture or pre-training algorithms, I would not be surprised if frontier labs have proprietary flavors of post-training.
Do innovations in architecture or algorithms constitute a barrier for competitors? In my opinion, the answer is no. There are few barriers preventing proprietary ideas from spreading. The top companies hire from each other all the time. Top people can easily raise money for their own ventures. A single person can keep the ideas in their head. There are no non-competes in California and no legal repercussions for sharing ideas. AI ideas are almost like pharmaceuticals without patent protection: enormous resources go into finding the innovations that work, but the innovations themselves are not difficult to copy.
Data
I conceptually split data into the pre-training base and the post-training topping. For the pre-training base, the task is to acquire en masse and filter out junk. For the post-training topping, the goal is manual curation of high-signal examples to push in particular directions. In reality data is more of a spectrum, but these extremes serve as useful anchors.
While we have mostly exhausted the publicly available base tokens, there are still gains from filtering and synthetic data. Raw internet dumps contain ~100 trillion tokens.12 One can gather even more via sophisticated crawling techniques and indexing of analog text (PDFs, books, etc.). Frontier models today are estimated to train on ~10 trillion tokens. Most of the filtering from ~100T to ~10T is done by heuristics or small classifiers. However, later stages of filtering involve language models themselves. As models improve, pre-training filtering will get better. In addition, an increasing share of pre-training data is generated or re-written by language models themselves.13 This share will also improve in tandem with available model capabilities. Importantly, for both filtering and synthetic pre-training data, it doesn't matter if improved capabilities are only available in closed models.
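The filtering funnel can be sketched as a cascade: cheap heuristics applied to everything, then a more expensive quality score on the survivors. This is a minimal illustration, not a real production pipeline — the thresholds and the scoring stub (standing in for a small classifier or an LM judge) are assumptions.

```python
def heuristic_filter(doc: str) -> bool:
    """Cheap rules applied to every raw document."""
    words = doc.split()
    if len(words) < 5:                       # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive junk
        return False
    return True

def quality_score(doc: str) -> float:
    """Stub for a small quality classifier (or, later, an LM judge)."""
    # Toy proxy: longer average word length ~ more "article-like" text.
    words = doc.split()
    return min(sum(map(len, words)) / len(words) / 8.0, 1.0)

def filter_corpus(docs, threshold=0.5):
    """Heuristics first (cheap, high-volume), then scoring (expensive)."""
    kept = [d for d in docs if heuristic_filter(d)]
    return [d for d in kept if quality_score(d) >= threshold]

corpus = [
    "buy buy buy buy buy buy cheap now",                              # spam
    "ok",                                                             # too short
    "The transformer architecture processes sequences with attention.",
]
print(filter_corpus(corpus))
```

In a real pipeline the later stage would be a trained classifier or model-based judge, which is exactly why filtering quality improves in tandem with model capabilities.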
For manual curation of high-signal examples, language models are also incredibly useful. Instead of hiring high-skill humans to manually solve problems, one can use a model. While many doubt that the frontier can be moved by mimicking in this manner, there is little doubt that inferior models can improve from training on solutions from superior ones.
In fact, it's common to train a model on the outputs of another. If the mimicking model has the same capacity as the teacher, it can in theory learn any information the teacher is prompted to provide. A prominent recent example of this trend is the release of Qwen-Claude-Distilled.14 This model fine-tunes the open-source Qwen model on reasoning traces from Claude to achieve higher performance in tool calling and coding.
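The distillation setup described above amounts to pairing prompts with a stronger model's completions and fine-tuning on the result. A minimal sketch, with `teacher_generate` as a stand-in for calling a frontier model's API (the prompt and the chat-record format are illustrative assumptions):

```python
def teacher_generate(prompt: str) -> str:
    # Stub: a real pipeline would call the stronger teacher model here.
    canned = {
        "Write a function that reverses a string.":
            "def reverse(s):\n    return s[::-1]",
    }
    return canned.get(prompt, "")

def build_distillation_set(prompts):
    """Pair each prompt with the teacher's completion for supervised fine-tuning."""
    return [
        {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher_generate(p)},
        ]}
        for p in prompts
    ]

dataset = build_distillation_set(["Write a function that reverses a string."])
print(len(dataset))
```

The student is then fine-tuned on these records with the standard next-token objective, which is why terms of service prohibiting this are so hard to enforce: the records look like any other fine-tuning data.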
The trillions of tokens used to train a model are hard to replicate and can't be leaked by a single person. It takes dedicated effort by high-context individuals to curate quality datasets. Data is a stronger barrier than algorithms, but far from an insurmountable one. In some sense, frontier labs are making large swaths of their data public by integrating it into the weights of their publicly available models. There are terms of service181920 that prohibit distilling, but without a real way of enforcing them across the globe and with the enormous value on the line, I'd be very surprised if those terms are not violated. When it comes to models, the product is also a piece of the factory. By selling your product, you are making it easier for others to reproduce it.
Compute
Compute is complex. Nvidia's latest Rubin rack has more than 1M distinct components. Decades of research and engineering underlie it. Consequently, it's expensive. Even accounting for the star-athlete salaries going to top researchers, compute constitutes the vast majority of model costs. For this section, we will start by ignoring practical complexities and pretend every actor is able to exchange dollars for compute at identical rates.
Even with identical exchange rates, the large amount of up-front compute required for training creates a structural barrier. A high fixed cost $(FC)$ means the average total cost $(ATC)$ drops precipitously with the quantity of usage $(Q)$. Mathematically, with marginal cost $(MC)$, this is expressed as:

$$ATC = \frac{FC}{Q} + MC$$
A firm that cannot achieve a high enough $Q$ to reach a competitive $ATC$ cannot profitably serve a model. The high fixed cost thus sets a minimum market share for profitable participation. Markets like this naturally settle into a few large players. How many players depends on what market share is required for a $Q$ large enough to make $FC/Q \ll MC$.
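The fixed-cost arithmetic can be made concrete with a small numerical sketch. The figures below (a hypothetical $1B training run and a per-token serving cost) are purely illustrative, not real cost data:

```python
FC = 1_000_000_000   # hypothetical fixed training cost, dollars
MC = 2e-6            # hypothetical marginal serving cost per token, dollars

def average_total_cost(q_tokens: float) -> float:
    """ATC = FC / Q + MC: the fixed cost is amortized over tokens served."""
    return FC / q_tokens + MC

# ATC falls precipitously as usage Q grows.
for q in (1e12, 1e14, 1e16):
    print(f"Q = {q:.0e} tokens -> ATC = ${average_total_cost(q):.2e}/token")

# Minimum Q at which the fixed cost is under 10% of marginal cost,
# i.e. FC / Q < 0.1 * MC  =>  Q > FC / (0.1 * MC).
min_q = FC / (0.1 * MC)
print(f"Q needed for FC/Q << MC: {min_q:.1e} tokens")
```

Under these assumed numbers, a firm serving a thousand times fewer tokens than a rival faces an average cost dominated by the amortized training run, which is the scale barrier in miniature.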
Furthermore, even in a world where all actors exchange dollars for compute at identical rates, different actors can access dollars at different rates. A low-risk actor can access dollars at a lower cost. With $D$ denoting an actor's cost of dollars, this changes the $ATC$ calculation:

$$ATC = D \cdot \left( \frac{FC}{Q} + MC \right)$$
The most effective way of lowering risk is by having a high $Q$. This creates a virtuous cycle where higher $Q$ leads to lower cost of dollars, which allows decreasing prices to further increase $Q$. This difference in cost of dollars creates a no-man's-land. The risk-adjusted return is not enough for capital markets to fund a challenger, but incumbents still enjoy above-average absolute returns.
Relaxing the assumption of an identical dollar cost of compute further heightens the scale economy barrier outlined above. By making capital-intensive long-term bets, large-scale actors can reduce the cost at which they buy compute. Buying land, building datacenters, designing chips, producing power, and creating systems to improve utilization are all ways of reducing the dollar cost of compute. This changes both terms in the $ATC$ calculation:

$$ATC = D \cdot \left( \frac{P \cdot F_{\text{train}}}{Q} + P \cdot F_{\text{infer}} \right)$$

where $P$ is the dollar price per FLOP, $D$ is the cost of dollars, and $F_{\text{train}}$ and $F_{\text{infer}}$ are the FLOPs required for training and for serving a unit of inference.
Even with access to the same algorithms and data, competing with the compute cost structure of incumbents poses a formidable challenge. The only terms of the equation above where challengers can have an edge are the floating point operations (FLOPs).
To get an edge in FLOPs, one needs an edge in algorithms or data. However, compute is the key input for improving both algorithms and data. Machine learning is empirical and compute is the currency of experimentation. More compute means more experimentation, which speeds up innovation. Opportunities to experiment also attract better talent, further increasing the innovation rate. An innovative approach in algorithms or data can be a way to get started on the long slog of bringing down compute costs, but the advantage it provides is hard to sustain over time and far from a certain path to success.
Compute-agnostic edges in data exist in organizations with existing proprietary datasets. Internal data of e.g. an insurance company can be used for training models to process claims. Ideally, such models are only used internally to reduce the risk of distillation. However, this is not a threat to the main markets that model makers are going after. It only means that there will still be corners where players with proprietary data can carve out profitable niches.
General improvements in data, algorithms, and hardware are shifting the cost-performance frontier. There is a world in which $Q$ explodes and $FC$ drops to the point where the model layer fragments. However, this doesn't mean model makers will fail. In this hypothetical, the $ATC$ is still determined by the cost of compute. Model makers can shift to becoming cloud providers and might still be enormously profitable. For context, the major cloud providers — AWS,15 Google Cloud,16 and Azure17 — collectively generated around $100 billion in annualized operating profits last quarter. In this world, the fixed-cost moat will shift from the compute spent on weights to the investments made to access that compute in the first place.
Furthermore, most evidence points to intelligence continuing to improve with compute.2122 The returns might be diminishing, but they will continue. However, many common use cases won't require cutting-edge intelligence. Personal assistants that can call APIs and stitch together schedules will be useful even if they don't win gold at the International Math Olympiad.
Of the three core ingredients, compute forms the largest barrier. With compute, you could hire someone who knows the algorithms and distill high-quality data from frontier models. But even with complete access to algorithms and data, the higher cost of compute would eat you alive. Even in a world where model progress plateaus, large fixed-cost investments in compute are a promising path to persistent differential returns for model makers.
Combining the Ingredients
In this last section, we assume algorithms are public, data can be distilled, and compute costs are not a concern. To motivate this, imagine a nation-state with advanced espionage and billions in budget trying to replicate frontier model performance. With access to all three ingredients, what challenges remain in combining them?
Even with unlimited access to data, it's hard to determine the mix that produces the best model. Model makers with distribution have an understanding of how their model is used. They invest heavily in evaluation and processes to track model performance across various dimensions. General purpose models are incredibly hard to evaluate. What is the quality of recipes produced? Do the travel plans make sense? Is it too sycophantic? Is the legal advice sound? There is also the weighing of these considerations. When an idea improves the math eval by 5% but makes the travel eval 10% worse, should it be included?
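The weighing problem can be illustrated as a weighted aggregate over vertical evals. The verticals and weights below are invented for illustration only; real labs track far more dimensions and the weighting itself is contested:

```python
WEIGHTS = {"math": 0.3, "travel": 0.2, "coding": 0.5}  # assumed priorities

def net_gain(deltas: dict) -> float:
    """Weighted sum of per-vertical eval score changes for a candidate idea."""
    return sum(WEIGHTS[k] * d for k, d in deltas.items())

# The trade-off from the text: +5% on math, -10% on travel.
change = {"math": 0.05, "travel": -0.10, "coding": 0.0}
print(net_gain(change))  # 0.3*0.05 - 0.2*0.10 = -0.005 -> net negative, rejected
```

Under these assumed weights the idea is rejected despite the math gain — and the hard, proprietary part is not this arithmetic but choosing the verticals, building trustworthy evals for them, and agreeing on the weights.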
Beyond evaluation, there is also a time dimension to training. There is still the base pre-training run, but more mid- and post-training steps have been stacked on top. Even with the same data and reward functions, the order in which they are applied and weighted influences the final performance. The general rule is to start with a large base of low-quality data and move towards higher quality over time, but top performance requires precise tuning. Reward functions also have to be carefully balanced to push model behavior in desired directions. It takes an experienced practitioner and comprehensive evaluations to balance data and reward correctly.
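The low-to-high-quality curriculum can be sketched as a schedule over data-bucket sampling weights. The buckets, endpoint weights, and linear schedule here are assumptions for illustration, not a real lab recipe:

```python
def mixture_weights(progress: float) -> dict:
    """Interpolate data-bucket sampling weights over training progress in [0, 1]."""
    start = {"web": 0.8, "books": 0.15, "curated": 0.05}  # broad low-quality base
    end   = {"web": 0.3, "books": 0.3,  "curated": 0.4}   # curated high-quality tail
    return {k: (1 - progress) * start[k] + progress * end[k] for k in start}

for p in (0.0, 0.5, 1.0):
    w = mixture_weights(p)
    print(p, {k: round(v, 3) for k, v in w.items()})
```

Because both endpoint mixtures sum to one, every interpolated mixture does too. The real tuning problem is choosing the schedule shape and endpoints, which is exactly the experience-heavy balancing act described above.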
At large labs, teams work on each model performance vertical. Each team designs evaluations for their vertical and contributes improvements into a shared pool. Only when a contribution creates a net gain across all vertical evaluations is it accepted into the next pre-training iteration. Each team iteratively patches holes in their evaluations to catch all corners of performance. In my opinion, these detail-oriented and deliberate processes pose a significant barrier for an actor attempting to combine ingredients into a model. Processes take time to build up and align the organization around. A single hire cannot keep all the evaluations and rationales behind them in their head. Processes can't be distilled from model weights nor purchased with dollars. Without processes, you wouldn't know what data to collect even if it was falling from the sky.
Conclusions
Algorithms are easy to copy and data can be distilled. The billions pouring into improving these two ingredients are necessary for pushing the frontier, but don't make up a moat. Traditionally, high fixed costs like these create a scale economies barrier. But that relies on future competitors having to pay the same fixed cost to catch up. That is not the case here; the fixed costs are falling as the frontier moves.
Compute does create a scale economy barrier. To compete, a challenger must have a path to producing tokens with higher value or lower cost. Incumbents have cheaper capital and make long-term investments to decrease compute costs. Higher value tokens can come from clever combinations of algorithms and data, but the search for such combinations is also powered by compute. Furthermore, if models commoditize, the fixed-cost investments in compute will still be a reliable path to profits for today's model makers.
To effectively combine the ingredients, one must understand the distribution of use cases, anticipate where it will move as capabilities improve, and have precise evaluations to track performance. This knowledge is hard to copy because of the time, effort, and attention to detail required. Model makers refine these processes every day. Even without compute cost concerns, the complexity of this process creates a large barrier for challengers to climb.
Specific model applications can have barriers independent of the model itself. Consumer chat has a clear path to a Google-like flywheel of users and advertising. This is a network effect moat — people go to a certain chat because it's the index of the world's information. The chat maker can motivate larger investments in traffic and information acquisition as a result, making the service more valuable to both users and advertisers.
Coding assistants don't seem to have strong inherent moats. We have already seen the market swing from Copilot to Cursor to Claude Code. Coding assistants possess no power other than the very power they themselves are eroding. See my other blog post for a deep dive on AI impacts on software application power. In general, AI companies often use the application for data acquisition. This data flywheel can explain why chat and chat verticals are still quite concentrated. Coding capabilities seem less dependent on such data because correctness can be verified via compilation and testing.
Personal assistants are likely to be stickier. I can see high switching costs via wide integrations. Enterprise assistants can be similarly sticky. Google can do many things in parallel, but if I were them I'd really focus on the assistant. Their existing product suite positions them particularly well for making the assistant their breakout AI product.
However things play out, it will be incredibly interesting to follow. What a time to be alive!
Edvin T. Berhane
Sources
- Muon: An Optimizer for Hidden Layers in Transformers. Jordan et al., 2025. arXiv. arxiv.org/abs/2502.16982
- Depth Attention: Attending Across Token Depths. Liao et al., 2025. arXiv. arxiv.org/abs/2502.07864
- AI Futures Model — December 2025 Update. AI Futures Project, 2025. blog.ai-futures.org
- Training Language Models to Follow Instructions with Human Feedback. Ouyang et al., 2022. arXiv. arxiv.org/abs/2203.02155
- CS336: Language Modeling from Scratch — Architecture Lecture. Stanford University, Spring 2025. stanford-cs336.github.io
- Language Models are Few-Shot Learners. Brown et al., 2020. arXiv. arxiv.org/abs/2005.14165
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov et al., 2023. arXiv. arxiv.org/abs/2305.18290
- SimPO: Simple Preference Optimization with a Reference-Free Reward. Meng et al., 2024. arXiv. arxiv.org/abs/2405.14734
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Shao et al., 2024. arXiv. arxiv.org/abs/2402.03300
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI, 2025. arXiv. arxiv.org/abs/2501.12948
- Let's Verify Step by Step. Lightman et al., 2023. arXiv. arxiv.org/abs/2305.20050
- RedPajama-Data-v2: An Open Dataset with 30 Trillion Tokens. Together AI, 2023. together.ai/blog/redpajama-data-v2
- Synthetic Data in AI: Challenges, Applications, and Ethical Implications. Fang et al., 2024. arXiv. arxiv.org/abs/2510.01631
- Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. Jackrong, 2026. Hugging Face. huggingface.co/Jackrong/Qwen3.5-27B-...
- Amazon.com Q4 2025 Earnings Release. Amazon, Inc., 2026. SEC Filing. sec.gov
- Alphabet Q4 2025 Earnings Report. Alphabet, Inc., 2026. q4cdn.com
- Microsoft Q2 FY26 Earnings Summary. Microsoft Corporation, 2026. quartr.com
- Terms of Use. OpenAI, 2024. openai.com/policies/terms-of-use
- Commercial Terms of Service. Anthropic, 2024. anthropic.com/legal/commercial-terms
- Gemini API Terms of Service. Google, 2024. ai.google.dev/gemini-api/terms
- Scaling Laws for Neural Language Models. Kaplan et al., 2020. arXiv. arxiv.org/abs/2001.08361
- Training Compute-Optimal Large Language Models. Hoffmann et al., 2022. arXiv. arxiv.org/abs/2203.15556