Demystifying AI in Engineering

2026-01-26

When talking about AI in software engineering, I often hear things like:

“I don’t trust it.”
“Does it really save any time?”
“Why use it when I can just do it myself?”
“It’s just a statistical model.”
“It just writes slop.”

These concerns are not all wrong. Under the hood, modern AI really is a statistical model. But that alone does not make it useless.

My goal here is not to sell you on AI or write an academic paper. It is to give you a clear, practical look at how today’s models actually work so you can better judge when they are valuable and when they are not.

To do that, it helps to understand where a lot of this skepticism came from.


Yesterday’s AI

Let’s go back about ten years to the mid-2010s, when machine learning was the buzzword of the moment. We were starting to call things “AI,” but most of what we had were narrow, specialized models.

The thinking at the time was straightforward. Computers are good with numbers, so if we could turn text, images, and other messy data into numbers, we could train models on it. Techniques like vector embeddings, popularized by tools such as word2vec, made this possible by representing things like words or images as numerical vectors that preserved some notion of meaning.
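To make that concrete, here is a toy sketch of the idea in Python. The three-dimensional vectors below are invented for illustration; real word2vec embeddings have hundreds of learned dimensions, but the principle is the same: similar meanings end up geometrically close.

```python
# Toy illustration of vector embeddings: words become points in space,
# and geometric closeness stands in for similarity of meaning.
# These 3-dimensional vectors are made up for the example.
import numpy as np

embeddings = {
    "muffin":  np.array([0.9, 0.8, 0.1]),
    "cupcake": np.array([0.85, 0.75, 0.2]),
    "spanner": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means
    # "pointing the same way", near 0 means "unrelated" in this toy space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["muffin"], embeddings["cupcake"]))  # high
print(cosine_similarity(embeddings["muffin"], embeddings["spanner"]))  # low
```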

From there, we trained models using large labeled datasets. You would show the model examples. This is a muffin. This is not a muffin. Over time, it learned statistical patterns that let it guess whether a new image was likely a muffin or not.
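If you have never seen this style of training up close, a minimal scikit-learn sketch captures the whole workflow. The two numeric features standing in for image pixels are made up; the point is the shape of the process, not the data.

```python
# A minimal supervised-learning sketch. The "images" are just two
# hand-made numeric features per example (stand-ins for pixels), and
# the labels say whether each one is a muffin.
from sklearn.linear_model import LogisticRegression

X = [
    [0.9, 0.1],  # golden-brown, round-ish -> muffin
    [0.8, 0.2],
    [0.1, 0.9],  # definitely not a muffin
    [0.2, 0.8],
]
y = [1, 1, 0, 0]  # 1 = muffin, 0 = not a muffin

model = LogisticRegression()
model.fit(X, y)

# The trained model can only ever answer one narrow question:
# "does this look like a muffin?"
print(model.predict([[0.85, 0.15]]))  # likely [1]
```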

This approach worked, but only within limits.

Most machine learning at the time was supervised, narrow, and task-specific. Models were trained to do one thing well, such as classifying images, tagging text, or detecting spam. They processed one input at a time in a tightly constrained problem space.

As a result, these systems were slow to train, brittle in practice, and difficult to generalize. A model that was great at identifying muffins was useless for detecting cancer or translating text. What we called “big” models were measured in millions, and sometimes tens of millions, of parameters. They were only as good as the specific data they were trained on.

Given that history, it is no wonder many engineers learned to distrust “AI.”


What Changed

So how did we get from that world to the generative models we use today?

Several things changed, but three matter most.

1. Transformers

The most important shift was the introduction of the transformer architecture.

Before transformers, models processed language mostly sequentially. They looked at one word at a time and had limited ability to understand how all parts of a sentence related to each other.

Consider a sentence like "The bank can guarantee deposits will eventually cover future tuition." To understand whether "bank" refers to a financial institution or a riverbank, you need to look at words that appear much later: "deposits," "cover," "tuition." Sequential models struggled with this because by the time they reached "deposits," the context around "bank" had faded or been compressed into a fixed representation.

Transformers solved this through a mechanism called attention. Instead of processing words one at a time, attention allows the model to look at all words simultaneously and learn which ones are relevant to each other. When processing "bank," the model can directly attend to "deposits" and "tuition" regardless of how far apart they are, weighing their relevance to determine meaning.

This doesn't happen just once, but multiple times across multiple layers, with the model learning increasingly sophisticated relationships. Early layers might connect "bank" to "deposits," while deeper layers connect "deposits" to "cover future tuition," building up a rich understanding of the entire sentence.
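If you want to see the mechanism with the math left in, here is a stripped-down, single-head version of scaled dot-product attention in plain NumPy. The query, key, and value vectors are random placeholders rather than learned weights, so the numbers are meaningless; what matters is that every word gets a weight over every other word.

```python
# A single attention step in miniature (one head, toy dimensions).
# Each word gets a query, key, and value vector; a softmax over
# query-key scores decides how much each other word contributes.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of every word to every other word
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
words = ["the", "bank", "can", "guarantee", "deposits"]
d = 8
Q, K, V = (rng.normal(size=(len(words), d)) for _ in range(3))

output, weights = attention(Q, K, V)
# weights[1] shows how much "bank" attends to every word in the sentence,
# including "deposits", no matter how far apart they are.
print(dict(zip(words, weights[1].round(2))))
```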

This single change dramatically expanded what models could understand and generate. It moved machine learning beyond narrow classification tasks and made general-purpose language models possible.

2. Model-Native Context

Context changed as well, and this shift is more significant than it might initially seem.

A decade ago, context was mostly managed by application code. If you wanted to use a model for sentiment analysis on customer reviews, you had to carefully preprocess each review into the exact format the model expected: maybe a few dozen words, stripped of anything extraneous.

The model had no memory of previous reviews, no understanding of the product being discussed, no awareness of the customer's history. Each prediction was isolated. If you wanted to analyze patterns across reviews, you had to build that logic yourself, running hundreds of individual predictions and manually aggregating the results.

The practical limit was often just a few hundred tokens per request. Models were stateless and forgot everything between calls.
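As a rough sketch of what that workflow looked like (the `sentiment_model` object here is hypothetical, and the tokenization is deliberately crude):

```python
# How application code typically drove a narrow model circa 2015.
# `sentiment_model` is a hypothetical pre-trained classifier; the point
# is the shape of the workflow, not any particular library.
MAX_TOKENS = 256  # a typical practical limit per request

def truncate(text, max_tokens=MAX_TOKENS):
    # Crude whitespace "tokenization", just for illustration.
    return " ".join(text.split()[:max_tokens])

def analyze_reviews(reviews, sentiment_model):
    results = []
    for review in reviews:
        cleaned = truncate(review)                        # squeeze each review into the model's window
        results.append(sentiment_model.predict(cleaned))  # one isolated, stateless call
    # Any cross-review insight (trends, product-level patterns) had to be
    # built here, in application code, by aggregating individual outputs.
    return results
```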

Today, context is largely model-native. Models manage it themselves across much larger windows, often hundreds of thousands of tokens.

You can now give a model your entire product documentation, a collection of customer feedback, recent support tickets, and your current feature roadmap all at once. The model can identify that customers are frustrated with checkout because a feature you deprecated last month was solving a workflow problem you didn't realize existed.

The model learns what to pay attention to. When you ask about customer pain points, it dynamically weights relevant context by connecting complaints across different channels, identifying patterns in how different user segments describe the same issue, and deprioritizing one-off complaints or unrelated feedback.

This is why models can now help with tasks like synthesizing user research or generating documentation that accounts for multiple use cases. The limiting factor has shifted from "can the model see enough?" to "can it reason effectively about what it sees?"
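As a sketch of what that looks like in practice, here is the shape of a single large-context request. I am using the OpenAI Python SDK as one example of a chat API; the file names and model name are placeholders, and any large-context model works the same way.

```python
# With large context windows, the aggregation moves into the prompt itself.
# File names and model choice below are placeholders for this sketch.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Everything the model should "know" goes in as one large context.
context = "\n\n".join(
    Path(name).read_text()
    for name in ["product_docs.md", "recent_feedback.txt", "support_tickets.txt", "roadmap.md"]
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any large-context model
    messages=[
        {"role": "system", "content": "You analyze product feedback for an engineering team."},
        {"role": "user", "content": context + "\n\nWhat are the biggest customer pain points "
                                              "in checkout, and what changed recently that might explain them?"},
    ],
)
print(response.choices[0].message.content)
```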

3. Scale and Generality

Once transformers proved they could scale, we started training models on much broader datasets. These included publicly available text, code, documentation, books, and research.

The old approach was to curate datasets for specific tasks. You would gather thousands of labeled spam emails to build a spam filter, or thousands of medical images to detect tumors. Each model was a specialist.

The new approach flipped this. Instead of training different models for different tasks, we trained single models on enormous, diverse datasets and let them learn general patterns across all of it.

This matters because those patterns only emerge at scale. Train a model on a hundred Python scripts and it learns basic syntax. Train it on millions of repositories across dozens of languages and it learns deeper patterns: how architectural decisions lead to certain bugs, how testing strategies differ across ecosystems, how naming conventions signal intent.

This is why you can ask a modern model to write Rust code even if you've never written Rust yourself, or explain a complex algorithm like you're explaining it to a friend. The model has seen enough examples that it can generalize to requests it has never encountered before.

We went from millions of parameters to billions. The payoff is a fundamentally different kind of tool—one that can work across domains rather than being locked into a single task.


Why Today’s AI Feels Different

These changes did not turn statistical models into magic. What they did was make them broadly useful.

Modern models can ingest large portions of a codebase and reason across multiple files. They can synthesize information from documentation, tests, and error output in a way older systems never could.

Yes, they are still predicting the most likely next token. The difference is that they do so with far more context, better representations, and significantly improved performance.
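If "predicting the most likely next token" sounds abstract, here is the core of it with invented numbers: a softmax over scores, one per token in the vocabulary, then pick or sample from the result.

```python
# Next-token prediction, stripped to its core. The five-word vocabulary
# and the raw scores (logits) are invented; a real model produces one
# score per token in a vocabulary of roughly 100k, conditioned on
# everything currently in the context window.
import numpy as np

vocab = ["bug", "feature", "muffin", "deploy", "refactor"]
logits = np.array([2.1, 0.3, -1.5, 1.8, 0.9])

probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax: raw scores -> probability distribution

greedy = vocab[int(np.argmax(probs))]                      # take the single most likely token
sampled = np.random.default_rng(0).choice(vocab, p=probs)  # or sample, for some variety
print(dict(zip(vocab, probs.round(2))), greedy, sampled)
```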

That is why an LLM can often help you track down a nasty bug or draft a reasonable implementation sketch in minutes. Tasks that once required hours of manual searching and context switching can now be accelerated.


Where Models Excel and Where They Struggle

Today’s models are especially strong at pattern recognition and synthesis. They work well across large contexts and are good at generating first drafts of code, tests, or documentation.

They still have limits.

They tend to be opinionated and often nudge you toward common patterns that may not match your architecture. They can also get lost when iterating through complex changes, especially when tests start failing in unexpected ways.

In many ways, they behave like an overeager intern. They are genuinely helpful, surprisingly capable, and occasionally too confident for their own good.


How Engineers Should Approach Them

Chances are you already have access to these tools, whether through Copilot, Claude, Cursor, or something similar.

The key is learning how to work with them, and that comes from deliberate practice rather than occasional tinkering.

Start with low-stakes tasks where you can easily verify the output. The next time you need to write a test for a function you just wrote, try asking the model to generate it. Give it the function and a brief description of what edge cases matter. See what it produces. Check whether the tests actually cover what you care about or if they just look plausible.
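As an illustration of what that exchange might look like, here is a hypothetical function and the kind of pytest file you might get back. The function, the prompt, and the tests are all invented; the exercise is checking whether the tests actually exercise the edge cases you named.

```python
# `parse_version` is a hypothetical function you just wrote.
def parse_version(s: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in s.strip().split("."))
    return major, minor, patch

# Prompt: "Write pytest tests for parse_version. Edge cases I care about:
# surrounding whitespace, too few segments, non-numeric segments."
import pytest

def test_parses_simple_version():
    assert parse_version("1.2.3") == (1, 2, 3)

def test_strips_whitespace():
    assert parse_version(" 1.2.3\n") == (1, 2, 3)

def test_rejects_missing_segments():
    with pytest.raises(ValueError):
        parse_version("1.2")

def test_rejects_non_numeric_segments():
    with pytest.raises(ValueError):
        parse_version("1.two.3")
```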

Pay attention to how you phrase requests. Vague prompts like "make this better" usually produce vague results. Specific prompts like "refactor this function to handle the case where the user list is empty" tend to work better. The model has no context beyond what you give it, so being explicit about constraints, requirements, or concerns makes a difference.
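As a hypothetical before-and-after of what a specific prompt targets:

```python
# "Make this better" might get you renamed variables. "Refactor this to
# handle the case where the user list is empty" targets the actual bug.
# The helper below is invented for illustration.

# Before: crashes with ZeroDivisionError when users is empty.
def average_age(users):
    return sum(u["age"] for u in users) / len(users)

# After: the empty case is now explicit.
def average_age_safe(users):
    if not users:
        return None  # or raise, depending on what callers expect
    return sum(u["age"] for u in users) / len(users)
```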

Notice where the model gets lost. If you’re iterating on a complex change and the suggestions start drifting away from what you actually need, that is a signal. You might need to provide more context, break the problem into smaller steps, or just handle that piece yourself. The model is a tool, not a replacement for judgment.

Experiment with different kinds of tasks. These models are often better at some things than others. Generating boilerplate, drafting documentation, explaining unfamiliar code, and suggesting test cases tend to work well. Architecting systems, making nuanced trade-off decisions, or debugging subtle concurrency issues are hit or miss.

Treat early results as drafts, not solutions. Even when the output looks right, read it carefully. Models are confident even when they are wrong, and they will occasionally generate code that compiles but does the wrong thing or uses patterns that do not fit your codebase.

Like any tool, the value comes from understanding what it is good at, what it struggles with, and how to adapt your workflow to make use of it effectively.


If you're experimenting with these tools, I'm curious what you're finding. What's worked? What hasn't? Where have you been surprised?