SLMs, The Nesting Dolls of Intelligence
How downsizing intelligence is the clear path forward to successful AI applications
LLMs are incredibly adept at generalized question answering. But for businesses to extract value, they need to bring their own context into the LLM with both high accuracy and specificity. Maintaining the existing quality of service when deploying a new AI app is a top concern for businesses. Great strides have been made towards improving context (e.g. automated evaluations, MCP, RAG, etc.), but one fact remains: LLMs are very slow. Once quality is reached, the next differentiator becomes speed. Today’s AI engineers are very lucky to be serving users who accept several seconds of response time for any given question, but, such is the hedonic treadmill, that expectation will not last long. Just as context engineering has prevailed instead of waiting for better foundation models, heavily prompt-optimized SLMs will prevail instead of waiting for faster foundation models.
A Brief Recap of AI Engineering
For those who were working in AI prior to the release of ChatGPT, deploying a uniquely useful pretrained model without fine-tuning on curated data was extremely unlikely. A handful of embedding models, some CNNs, BERT, YOLO, and CLIP were the only regularly used pretrained options that come to mind pre-November 2022. In March 2023, software engineers quickly capitalized on ChatGPT’s accessible APIs to begin building their own GPT-powered applications, kicking off a wave of LLM-first software engineering. Now those same engineers are learning data science basics rebranded as “context engineering”. Simultaneously, data scientists who have spent years grappling with these complexities have had to revisit the way they approach new projects altogether, now needing nothing but a simple prompt to achieve intelligent predictions. The great news is that the slow nature of LLMs can be reconciled by leveraging the middle ground between these fields: SLMs.
LLM-First Software Engineering
LLMs brought the unique capability of solving advanced intelligence problems through prompting alone. This has made the creation of data science applications increasingly accessible to non-data scientists. The naive process for many software engineers has gone in three steps:
Prompt engineer to demo-able
Deploy upon stakeholder approval
Adjust prompts based on user complaints and new feature requests
This closely mirrors how software development without automated testing has been done for years throughout enterprises, startups, and dev shops alike. As time has progressed, more sophisticated patterns have emerged:
Work alongside a subject matter expert to curate an evaluation dataset
Leverage experts + LLM-as-a-Judge for model selection and prompt optimization
Incorporate evaluations into CI/CD as an intelligence regression test
Deploy and collect real conversations to supplement the evaluation dataset
Add a Human-in-the-Loop tool for the LLM to ask a human to respond
Repeat steps 1 - 4
This is a far-improved approach over the naive process, but it still lacks crucial steps learned from decades of real-world experience building traditional data science applications.
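To make step 3 concrete, here is a minimal sketch of what an intelligence regression test in CI could look like, assuming an OpenAI-compatible API. The model names, judge prompt, eval_set.json file, and pass threshold are placeholders to adapt to your own stack, not a prescribed setup.

```python
# A minimal sketch of an LLM-as-a-Judge intelligence regression test for CI.
# Assumes an OpenAI-compatible API; model names, the judge prompt, the
# eval_set.json file, and the pass threshold are placeholders.
import json

from openai import OpenAI

client = OpenAI()

CANDIDATE_MODEL = "gpt-4o-mini"  # model currently behind your application
JUDGE_MODEL = "gpt-4o"           # stronger model used only for grading
PASS_THRESHOLD = 0.85            # minimum average judge score to allow a merge

JUDGE_PROMPT = (
    "You are grading an assistant's answer against a reference answer.\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
    "Reply with only a number between 0 and 1."
)


def generate(question: str) -> str:
    resp = client.chat.completions.create(
        model=CANDIDATE_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def judge(question: str, reference: str, answer: str) -> float:
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())


def test_no_intelligence_regression():
    # eval_set.json: [{"question": ..., "reference": ...}, ...] curated with the SME
    with open("eval_set.json") as f:
        eval_set = json.load(f)
    scores = [judge(ex["question"], ex["reference"], generate(ex["question"]))
              for ex in eval_set]
    assert sum(scores) / len(scores) >= PASS_THRESHOLD
```

Run under pytest in CI, a drop in the average judge score blocks the merge just like a failing unit test would.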
Traditional Data Science Applications
Before deep learning, traditional data science models were not pretrained; they existed only as algorithms to be fit to a dataset provided by the data scientist. The earlier deep learning models that were generally pretrained were incapable of domain-specific, human-like performance through prompting alone. Therefore, the common phrase “a model is only as good as its data” was universally embraced among data scientists until the inception of ChatGPT. Reflecting that data-first mindset, the traditional path to building a data science application was as follows:
Have an application already creating real world data
Collect and curate a dataset (e.g. data engineering, cleaning, and sampling)
Define a north-star metric (e.g. precision, recall, NDCG, etc.)
Experiment with different model options
If pretrained:
Prompt Engineer
Fine Tune
If not:
Train from Scratch
Hyperparameter Optimization (e.g. Grid, Bayesian)
Evaluate against a holdout set on the north-star metric
Shadow deploy the top model and analyze real world results
If shadow performs well, deploy
Retrain or fine-tune on new data on a regular schedule (e.g. daily, weekly, monthly)
Deploy the updated model if it outperforms the existing one
Notably, an important part of experimenting with different models is right-sizing the model. Large models like transformers require GPU acceleration for high-speed inference, whereas smaller models such as XGBoost can run millisecond inference on a single CPU core. Because of the vastly different costs of self-hosting large versus small models, data scientists are pushed towards finding the most pragmatic balance between complexity, cost, and performance.
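As a rough illustration of that right-sizing trade-off, the sketch below times single-row CPU inference for a small gradient-boosted model. XGBoost and scikit-learn stand in for whatever model you are evaluating; exact numbers will vary with hardware and model size, but they land in the millisecond range.

```python
# Rough illustration of right sizing: a small gradient-boosted model can serve
# single-row predictions on one CPU core in roughly millisecond (or faster) time.
import time

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic stand-in for a curated tabular dataset.
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=6, n_jobs=1)  # one CPU core
model.fit(X, y)

row = X[:1]
runs = 1_000
start = time.perf_counter()
for _ in range(runs):
    model.predict(row)
elapsed = time.perf_counter() - start

print(f"average latency per prediction: {1000 * elapsed / runs:.3f} ms")
```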
Solving The Hedonic Treadmill with SLMs
In late 2024, leading AI labs began reporting slowing improvements in their frontier models (Amodei et al. 2024). Since then, researchers have increasingly demonstrated that Small Language Models (SLMs) with appropriate context engineering can outperform LLMs on the same task (Belcak et al. 2025). Notably, both SLMs and LLMs show similarly decreased performance on out-of-distribution (OOD) examples when they are many-shot prompted (Wynter 2025). In the context of bringing intelligent applications to market, these trends paint a clear picture: LLMs are approaching their potential, and marginal intelligence improvements no longer translate into meaningful changes to the user’s experience. In accordance with the hedonic treadmill, as people become comfortable with the intelligence limitations of LLMs, they will begin to desire faster responses, and one truth will remain: a smaller model will always respond faster than a larger model. Considering this, we must borrow the lessons of traditional data science applications and begin applying them to LLM-first software engineering.
Here’s a suggested path forward:
Work alongside a subject matter expert to curate an evaluation dataset
Leverage experts + LLM-as-a-Judge for model selection and light prompt optimization
Incorporate evaluations into CI/CD as an intelligence regression test
Deploy and collect real conversations to supplement the evaluation dataset
Add a Human-in-the-Loop tool for the LLM to ask a human to respond
Repeat steps 1 - 4 with some tweaks:
Do not over-optimize the LLM’s prompt; only adjust it as needed
Continue heavy LLM-as-a-Judge prompt optimization of the SLMs
Detect when an SLM outperforms the LLM
Deploy the optimized SLM as the new default
Add an LLM-in-the-Loop tool so the SLM can fall back to the LLM
By reserving heavy automatic prompt optimization for the SLM and keeping the LLM’s prompt highly generalized, you strike a pragmatic balance between complexity, cost, and performance. For in-distribution data (ID, explained below) you have the fast and cheap responses of the SLM; for out-of-distribution (OOD) data you have the LLM, which has stronger generalization capabilities; and for the extraordinary scenario you have the human. In essence, a Russian nesting doll of intelligence.
In Distribution - Queries and scenarios that closely match the patterns, domains, and types of problems the model was trained or optimized for. These represent the “expected” or typical use cases.
Out of Distribution - Queries that fall outside the model’s training distribution—novel scenarios, edge cases, or uncommon combinations of requirements that the model hasn’t been specifically prepared to handle.
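Putting the pieces together, here is a minimal sketch of the nesting-doll router described above. The embed, call_slm, call_llm, and ask_human callables, the escalation marker, and the OOD heuristic (cosine distance to in-distribution cluster centroids derived from the evaluation dataset) are illustrative assumptions rather than a prescribed implementation.

```python
# A minimal sketch of the nesting-doll router: SLM by default, LLM for
# out-of-distribution or low-confidence queries, human for the extraordinary case.
# The callables and thresholds below are placeholders for your own stack.
import numpy as np

OOD_DISTANCE_THRESHOLD = 0.35     # tuned against the curated evaluation dataset
ESCALATION_MARKER = "[ESCALATE]"  # token the models are prompted to emit when unsure


def is_out_of_distribution(query_vec: np.ndarray, id_centroids: np.ndarray) -> bool:
    """Flag queries far (in cosine distance) from every in-distribution cluster."""
    sims = id_centroids @ query_vec / (
        np.linalg.norm(id_centroids, axis=1) * np.linalg.norm(query_vec))
    return (1.0 - sims.max()) > OOD_DISTANCE_THRESHOLD


def route(query, embed, call_slm, call_llm, ask_human, id_centroids) -> str:
    query_vec = embed(query)

    # Innermost doll: the heavily prompt-optimized SLM handles in-distribution traffic.
    if not is_out_of_distribution(query_vec, id_centroids):
        answer = call_slm(query)
        if ESCALATION_MARKER not in answer:
            return answer

    # Middle doll: the generally prompted LLM handles OOD and low-confidence cases.
    answer = call_llm(query)
    if ESCALATION_MARKER not in answer:
        return answer

    # Outermost doll: a human handles the truly extraordinary scenario.
    return ask_human(query)
```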
Conclusion
If you are not already using this approach, I hope you will consider it as your next step towards long-term wins with your intelligent application. In some ways, we already see similar approaches being embraced by top foundation model providers, such as GPT-5’s model router design (OpenAI 2025). Better yet, there is some very recent research from a fellow Atlanta local which concludes that “this survey firmly positions SLMs as the default, go-to engine for the majority of agent pipelines, reserving larger LLMs as selective fallbacks for only the most challenging cases” (Sharma 2025).
Food For Thought
For those reading who are familiar with traditional data science or NLP, we can take this nesting doll pattern a step further. Frankly, some questions are truly best answered by simple FAQ responses. After deploying an SLM, you could begin a new pattern of progressively clustering highly similar prompt-response pairs into a general FAQ. When such prompts are sent, FAQ responses could then be retrieved by an embedding model using a high-threshold cosine-similarity match, with low-scoring queries falling back to the SLM.
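Here is a small sketch of that FAQ layer, where embed_fn and answer_with_slm are placeholders for your own embedding model and SLM call, and the similarity threshold is something you would tune on real traffic.

```python
# Sketch of the FAQ layer: return a canned answer on a high cosine-similarity
# match, otherwise fall back to the SLM. embed_fn and answer_with_slm are
# placeholders for your own embedding model and SLM call.
import numpy as np

FAQ_SIMILARITY_THRESHOLD = 0.92  # deliberately high to avoid wrong canned answers


def answer(query, faq_questions, faq_answers, embed_fn, answer_with_slm) -> str:
    q_vec = embed_fn(query)                                    # shape: (d,)
    faq_vecs = np.stack([embed_fn(q) for q in faq_questions])  # shape: (n, d); cache these in practice

    sims = faq_vecs @ q_vec / (
        np.linalg.norm(faq_vecs, axis=1) * np.linalg.norm(q_vec))
    best = int(np.argmax(sims))

    if sims[best] >= FAQ_SIMILARITY_THRESHOLD:
        return faq_answers[best]   # cheapest doll: a canned FAQ response
    return answer_with_slm(query)  # otherwise escalate to the SLM
```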
Thank you for your time and I hope you leave some feedback!


Great post, and I agree that focusing on small, specialized LMs (SLMs) is the way forward. Today's LLMs are impressive, but they are heavy and sometimes slow. Small, specialized models are the next frontier to unlock faster responses and cost savings without sacrificing accuracy in context. Letting an AI agent pick the right "size" model for each job is smart engineering. However, we should not overlook that training or fine-tuning these SLMs still carries a significant cost barrier.
Even though SLMs can be cheaper to run in production, the upfront expense of curating high-quality datasets, computing distillation runs, and performing fine-tunes on domain data can easily reach tens or hundreds of thousands of dollars. Many smaller teams or startups cannot yet afford that infrastructure. Until accessible "SLM-as-a-service" platforms or open-weight domain models mature, it will remain a high bar for most.
Thanks for sparking this discussion, Ryan. Love the Russian doll analogy. It is exciting to see the "nesting doll" strategy (small model for most tasks, big model as backup) validated by the latest research.
https://arxiv.org/pdf/2506.02153#:~:text=well,only%20to%20much%20larger%20models