Date: 2025-11-22

Understanding AI engineering today requires familiarity with ten foundational papers that shaped this rapidly evolving field. These works trace the journey from early breakthroughs to recent advancements, providing essential knowledge for anyone working with modern AI systems.

The story begins with neural networks, which date back to the 1940s. For language modelling, the decisive breakthrough came in 2017, when a team largely from Google released “Attention Is All You Need,” introducing the transformer architecture that would revolutionise natural language processing. Previous approaches relied on recurrent networks, which process text one token at a time, and on convolutional networks, which see only a limited window of text at each layer. Both had significant limitations: training was slow, models struggled to connect details that appeared far apart in a document, and recurrent models in particular were difficult to parallelise across GPUs. The transformer introduced self-attention, which lets a model examine all tokens in a sequence simultaneously and learn how they relate to one another. This innovation enabled massively parallel training, improved context handling, and made it practical to scale models up. Nearly every modern language model is based on this fundamental design.
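
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not the paper's full layer: the learned query, key and value projections and the multiple heads are omitted, and the toy `tokens` vectors are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for stability)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position looks at every other."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy "embeddings" for a three-token sequence (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Simplification: the same vectors serve as queries, keys and values.
attended = self_attention(tokens, tokens, tokens)
```

Each output row depends on every input position and no step waits on a previous token, which is what makes the computation easy to parallelise.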

Three years later, in 2020, the field experienced another significant advance with “Language Models are Few-Shot Learners,” commonly known as the GPT-3 paper. Researchers discovered that sufficiently large transformers could perform new tasks using only a few examples provided in the prompt, without requiring task-specific fine-tuning. By training a very large decoder-only transformer and systematically evaluating it across many tasks while varying only the text prompt, the team demonstrated remarkable flexibility. The model could work zero-shot from an instruction alone, one-shot from a single example, or few-shot from several examples. The breakthrough came not from a new architecture, but from demonstrating that scale, combined with prompting, unlocks in-context learning. This reframed system development: practitioners can adapt one general model to specific needs through the prompt, rather than training a separate model for each task.
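
The difference between zero-shot, one-shot and few-shot prompting lies only in the text sent to the model. The sketch below builds such prompts; the `build_prompt` helper and its `Input:`/`Output:` layout are assumptions made for illustration, not a format prescribed by the paper.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt with zero, one or several worked examples."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final item is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

instruction = "Translate English to French."
zero_shot = build_prompt(instruction, [], "bread")
few_shot = build_prompt(
    instruction,
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "bread",
)
```

The model's weights stay the same in both cases; only the prompt changes, which is the sense in which the learning is "in context".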

Scale alone, however, proved insufficient. The 2022 paper “Training Language Models to Follow Instructions with Human Feedback” from OpenAI addressed models that produced unhelpful or toxic responses. The approach involved three stages: supervised instruction tuning on examples of good behaviour, training a reward model on human rankings so that it scores better answers more highly, and using reinforcement learning to adjust the fine-tuned model towards outputs that the reward model favoured. In the paper's human evaluations, outputs from a 1.3-billion-parameter aligned model were preferred to those of the 175-billion-parameter GPT-3, because the smaller model followed directions and respected user intent more effectively. Subsequent advances like Direct Preference Optimisation have refined these techniques, learning directly from ranked preferences without requiring an explicit reward model.
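
Both the reward-model stage and Direct Preference Optimisation rest on the same pairwise idea: the preferred answer should score higher than the rejected one. The sketch below writes out the two per-pair losses in their commonly cited forms; the function names, the example numbers and the default `beta` are assumptions for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: small when the chosen answer scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, with no separate reward model.

    Each margin is how much more likely the policy makes an answer
    than a frozen reference model does, measured in log-probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Reward model already prefers the chosen answer: low loss.
good_ranking = reward_model_loss(2.0, 0.0)
# Reward model prefers the rejected answer: high loss.
bad_ranking = reward_model_loss(0.0, 2.0)
# Policy has shifted probability towards the chosen answer.
improved_policy = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy is identical to the reference model, the DPO loss equals log 2; it falls as the policy favours the chosen answer more strongly than the reference does.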

Putting these models into practice raised further challenges. When a model needs to perform well on a specific task, such as returning responses in a particular format or using domain-specific language from legal or medical texts, fine-tuning becomes essential. The 2021 paper “LoRA: Low-Rank Adaptation of Large Language Models” provided a practical way to fine-tune large models efficiently. Rather than updating all weights, the method keeps the base model frozen and trains small low-rank matrices whose product is added to selected weight matrices. In the paper's GPT-3 experiments, this reduced the number of trainable parameters by a factor of about 10,000 and cut GPU memory requirements to approximately one-third of those for full fine-tuning. LoRA transformed fine-tuning from a resource-intensive research project into something achievable on a single GPU, particularly when combined with quantisation techniques.
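
The low-rank update can be sketched in a few lines. The example below assumes the usual formulation, in which a frozen matrix W of shape d × k is adapted as W + (alpha / r) · B · A with B of shape d × r and A of shape r × k; the tiny dimensions, the random initial values and the helper names are illustrative only.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)              # A has shape (r, k)
    base = matvec(w_frozen, x)      # frozen path
    low = matvec(lora_a, x)         # project down to r dimensions
    delta = matvec(lora_b, low)     # project back up to d dimensions
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

def trainable_params(d, k, rank):
    """Parameters in the adapter pair A (r x k) and B (d x r)."""
    return rank * (d + k)

random.seed(0)
d, k, r = 6, 4, 2
w = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
a = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]
b = [[0.0] * r for _ in range(d)]   # B starts at zero, so the update is zero
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(x, w, a, b, alpha=4)

full = 4096 * 4096                          # one full weight matrix
adapter = trainable_params(4096, 4096, 8)   # its rank-8 adapter
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one. For a single 4096 × 4096 matrix at rank 8, the adapter has 65,536 trainable parameters against 16,777,216 in the full matrix, a 256-fold reduction for that matrix alone.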

Access to information beyond training data remained problematic until the 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” proposed having models retrieve relevant documents before responding. This approach addresses both outdated knowledge and hallucination by connecting models to internal databases or public web sources and letting them cite what they find. Production systems have evolved from simple top-k retrieval to sophisticated multi-step pipelines that iteratively refine queries, aggregate information across sources, and evaluate faithfulness while providing citations. Retrieval quality often matters more than the specific base model, with chunking, indexing, search ranking, and query rewriting strategies proving crucial for success.
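
The simplest form of this pattern, top-k retrieval followed by a grounded prompt, can be sketched as follows. The bag-of-words similarity stands in for a real embedding model, and the `documents` list, identifiers and prompt wording are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents,
                    key=lambda doc: cosine(q, embed(doc["text"])),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query, passages):
    """Put retrieved passages, tagged with ids, ahead of the question."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return ("Answer using only the sources below and cite their ids.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:")

# Hypothetical document store.
documents = [
    {"id": "doc-1", "text": "LoRA adds low-rank adapters to frozen weights"},
    {"id": "doc-2", "text": "Switch Transformers route each token to one expert"},
    {"id": "doc-3", "text": "DistilBERT distils BERT into a smaller student"},
]
query = "how does LoRA use low-rank adapters"
top = retrieve(query, documents, k=1)
prompt = build_grounded_prompt(query, top)
```

In a production pipeline the places where quality is won or lost are the ones this sketch glosses over: how documents are chunked, how they are embedded and indexed, and how the query is rewritten before search.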

These powerful models with data access still required orchestration to become truly useful. “The Rise and Potential of Large Language Model Based Agents,” a 2023 survey, provides a comprehensive framework for understanding agent systems. The survey describes agents as having three components: a brain where the LLM plans and decides actions, perception for reading tool results and environmental information, and action capabilities for executing steps like API calls or file operations. The paper explores various configurations, including single agents, multi-agent teams, and human-agent collaboration, while addressing practical requirements such as clear tool schemas, guardrails preventing runaway loops, and result verification checks.
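
The brain, perception and action split maps naturally onto a small control loop. The sketch below is a hypothetical skeleton rather than anything from the survey: `plan_next_step` stands in for an LLM call, the step dictionaries and the `finish` convention are assumptions, and `max_steps` is one simple guardrail against runaway loops.

```python
def run_agent(plan_next_step, tools, goal, max_steps=5):
    """Loop: plan (brain), execute a tool (action), record the result (perception)."""
    observations = []
    for _ in range(max_steps):      # guardrail: hard cap on iterations
        step = plan_next_step(goal, observations)
        if step["tool"] == "finish":
            return step["argument"], observations
        tool = tools.get(step["tool"])
        if tool is None:
            # Feed the error back so the planner can recover.
            observations.append(f"error: unknown tool {step['tool']}")
            continue
        observations.append(tool(step["argument"]))
    return None, observations       # gave up without finishing

def scripted_planner(goal, observations):
    """Stand-in for an LLM: call `add` once, then finish with its result."""
    if not observations:
        return {"tool": "add", "argument": (2, 3)}
    return {"tool": "finish", "argument": observations[-1]}

tools = {"add": lambda pair: pair[0] + pair[1]}
answer, trace = run_agent(scripted_planner, tools, goal="add 2 and 3")
```

A planner that never emits `finish` simply exhausts `max_steps` and returns `None`, which gives the caller an explicit failure to handle.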

Efficiency considerations have driven several important innovations. The 2021 “Switch Transformers” paper demonstrated scaling to trillion-parameter models using a mixture-of-experts architecture. This approach replaces a dense feed-forward layer with many expert sub-networks, and a router sends each token to the single most relevant expert, so only a fraction of the parameters is used for any one computation. Conditional computation enables larger capacity without incurring the full computational cost on every forward pass, although serving sparse models presents engineering challenges related to load balancing across experts, latency management, and bottleneck avoidance.
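
Top-1 routing can be sketched in a few lines. In this illustration the router is a single linear layer followed by a softmax, the chosen expert's output is scaled by its router probability, and the two toy `experts` and the `router_weights` are invented; load-balancing losses and capacity limits are left out.

```python
import math

def softmax(scores):
    """Turn router logits into probabilities."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(token, router_weights, experts):
    """Route one token to its single highest-probability expert."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Only the chosen expert runs; the others cost nothing for this token.
    expert_output = experts[chosen](token)
    # Scale by the gate value so the router receives a training signal.
    return chosen, [probs[chosen] * v for v in expert_output]

# Two toy experts (assumed): one doubles its input, one adds 1.
experts = [
    lambda t: [2 * v for v in t],
    lambda t: [v + 1 for v in t],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

chosen_a, out_a = switch_layer([3.0, 0.0], router_weights, experts)
chosen_b, out_b = switch_layer([0.0, 3.0], router_weights, experts)
```

The two tokens are sent to different experts, so total capacity grows with the number of experts while the work per token stays that of a single expert.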

Making models smaller for deployment received attention through knowledge distillation, demonstrated in the 2019 DistilBERT paper. By teaching smaller student models to mimic larger teacher models during pre-training, the approach achieved a 40% parameter reduction and a 60% speed improvement while retaining 97% of the language understanding capabilities. This enables deployment to edge devices with tight latency budgets, limited memory, and privacy constraints.
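
The core of distillation is a loss that pulls the student's output distribution towards the teacher's. The sketch below shows only that soft-target term, using temperature-softened distributions and a KL divergence; DistilBERT's full training objective includes further terms, and the temperature and logits here are assumed values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T; a higher T gives a flatter distribution."""
    scaled = [l / temperature for l in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student))
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.0]
close_student = [3.5, 1.0, 0.2]   # roughly agrees with the teacher
far_student = [0.0, 1.0, 4.0]     # ranks the classes the other way round

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge, so minimising it teaches the student the teacher's relative preferences across all classes, not just its top answer.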

Quantisation offers another path to efficiency. The 2022 “LLM.int8()” paper demonstrated that transformer performance can be preserved at multi-billion-parameter scale while storing numbers with fewer bits. The method identifies outlier features with unusually large activations that break naive quantisation, keeping these in higher precision while quantising the majority to int8. This mixed-precision approach roughly halves memory requirements compared with 16-bit weights, making inference feasible on far less hardware than the same models previously required.
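
The effect of a single outlier on naive quantisation, and the benefit of handling it separately, can be shown on one dot product. This is a simplified sketch rather than the paper's implementation: it uses absmax scaling over a whole vector, and the threshold of 6.0 and the example numbers are assumptions for illustration.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using the largest magnitude."""
    largest = max((abs(v) for v in values), default=0.0)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(activations, weights):
    """Naive approach: quantise everything, multiply, then rescale."""
    qa, scale_a = quantize_absmax(activations)
    qw, scale_w = quantize_absmax(weights)
    return sum(a * w for a, w in zip(qa, qw)) * scale_a * scale_w

def mixed_precision_dot(activations, weights, threshold=6.0):
    """Keep outlier dimensions in floating point; quantise the rest."""
    outliers = [i for i, a in enumerate(activations) if abs(a) >= threshold]
    regular = [i for i, a in enumerate(activations) if abs(a) < threshold]
    high_precision = sum(activations[i] * weights[i] for i in outliers)
    low_precision = int8_dot([activations[i] for i in regular],
                             [weights[i] for i in regular])
    return high_precision + low_precision

# One activation (40.0) is far larger than the others.
activations = [0.5, -1.2, 40.0, 0.8]
weights = [0.3, 0.1, -0.2, 0.7]

exact = sum(a * w for a, w in zip(activations, weights))
naive = int8_dot(activations, weights)
mixed = mixed_precision_dot(activations, weights)
```

With the outlier included, the quantisation scale is so coarse that the small activations lose most of their precision. Separating it out leaves a much finer scale for the remaining values, and the mixed result lands closer to the exact one.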

Finally, the 2024 Model Context Protocol from Anthropic, although not published as a paper, represents a significant development in connecting models to external systems. Rather than coding individual integrations for every database, API, or tool, MCP provides a standard schema for exposing capabilities that any compatible client can discover and utilise. This standardisation simplifies the development of agent systems that interact with diverse external resources.
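
The value of a shared schema is that a client can discover tools at run time instead of being coded against each one. The sketch below is a toy in-process server, not the MCP SDK: the request shapes are modelled loosely on the protocol's JSON-RPC style as I understand it (`tools/list`, `tools/call`, an `inputSchema` per tool), the result format is simplified, and `ToolServer` and `word_count` are hypothetical names.

```python
class ToolServer:
    """Toy server exposing tools through a uniform request interface."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, input_schema, handler):
        """Add a tool together with a machine-readable description."""
        self._tools[name] = {
            "description": description,
            "inputSchema": input_schema,
            "handler": handler,
        }

    def handle(self, request):
        """Dispatch one JSON-RPC-style request (simplified, assumed shapes)."""
        method = request["method"]
        if method == "tools/list":
            result = {"tools": [
                {"name": name,
                 "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in self._tools.items()
            ]}
        elif method == "tools/call":
            params = request["params"]
            tool = self._tools[params["name"]]
            # Simplified result; the real protocol's result shape differs.
            result = {"output": tool["handler"](**params["arguments"])}
        else:
            return {"jsonrpc": "2.0", "id": request["id"],
                    "error": {"code": -32601, "message": "Method not found"}}
        return {"jsonrpc": "2.0", "id": request["id"], "result": result}

server = ToolServer()
server.register(
    name="word_count",
    description="Count the words in a piece of text.",
    input_schema={"type": "object",
                  "properties": {"text": {"type": "string"}},
                  "required": ["text"]},
    handler=lambda text: len(text.split()),
)

listing = server.handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = server.handle({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "word_count", "arguments": {"text": "one two three"}},
})
```

A client written against this interface needs no knowledge of `word_count` in advance: it lists the tools, reads each schema, and calls whichever one suits the task.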

These ten works provide a solid foundation for understanding modern AI engineering. Many important areas are not covered here, including scaling laws, infrastructure, and system design, but the list offers a practical starting point for engaging with this dynamic field. Each work addresses specific challenges in making AI systems more capable, efficient, and practical for real-world applications.

