Day 1: Evolution of AI
Reflection
Below is my learning journey and quick, on-the-spot reflections in the order they happened. Feel free to sidestep my missteps.
Recommended starting point:
Watch But What Is a Neural Network? (3Blue1Brown) first. It neatly lays the groundwork by explaining neural networks in clear, visual terms. After that, dive straight into Attention Is All You Need if you're up for a short yet demanding read—especially if you're non-technical. If you just want the quick-start knowledge, skip the reflections and jump to the summary.
1. Attention Is All You Need
Ease of reading: 1/5
Starting my bootcamp with this landmark transformer paper was, in hindsight, overly ambitious. Twenty minutes in, I'd cleared only the first two pages and could tell the full read would blow past my four-hour window. I pivoted to ChatGPT o3 and worked through the sections conversationally for about an hour; it did an incredible job explaining the concepts.
2. "The Brief History of Artificial Intelligence"
Ease of reading: 5/5
A brisk, digestible timeline of AI milestones. Light on technical depth but perfect for resetting my brain after the highly technical essay. In retrospect, this should have been Day 1's opener.
3. "Convolutional Neural Networks: A Brief History of Their Evolution"
Ease of reading: 2.5/5
The article dove straight into LeNet-5 and AlexNet jargon. I used ChatGPT to provide understandable context on neural networks and convolutions, and after thirty focused minutes the high-level picture of CNNs and their historical development made a lot more sense.
4. But What Is a Neural Network? (3Blue1Brown)
Ease of watching: 3.5/5
The video's visual take on neural networks was truly insightful; the animations made everything easier to digest. This is a must-watch. It does dip into some math jargon, but a quick ChatGPT conversation (or a little calculus background) makes it completely digestible.
5. Lecture 1: Introduction to CNNs for Visual Recognition (CS231n)
Ease of watching: 4/5
A pretty easy-to-digest video. Like the Brief History article, it provides more context than deep understanding. However, the context here is more substantial since it not only lists various historical paradigms in computer vision but also explains, at a high level, the ins and outs of each. Fei-Fei Li is great. This series is highly recommended.
6. Introduction to Reinforcement Learning (White & Sutton, 2024)
Ease of reading: 2.5/5
Much gentler opening than Attention. I coasted through basic agent-environment interaction and return functions before the intense math notation kicked in. I used ChatGPT to unpack the key terms step by step. The RL hype now feels grounded.
7. "OpenAI's GPT-4o: The Multimodal Revolution That Changes Everything"
Ease of reading: 5/5
After two technical papers, this celebratory blog post read like click-bait. The "wow" moments were backed by shallow surface metrics (lower latency, multimodal UX) without diving into the details of the model.
Summary:
A Non-Technical Founder's Guide to Neural Networks, CNNs, Transformers, Reinforcement Learning, and Multimodal AI
I believe that, to truly grasp the significance of each pivotal development in AI's history, it is important to understand each concept individually while remembering why each one matters to the others. To keep this summary coherent and accessible, I'll use mostly non-technical language (adding technical detail only when necessary) and present the ideas in strict chronological order. I'll also include a glossary of relevant technical terms in plain English, which should help you along the way. Without further ado, let's dive in.
Neural Networks
A neural network is a many-layered chain of tiny calculators called neurons. Each neuron multiplies every incoming number by its own adjustable weight, adds a bias, pushes the sum through a squashing curve such as ReLU (which keeps positives and zeros out negatives), and forwards the result to the next layer. When an input (say, a 28 × 28-pixel image of the digit "5") flows all the way to the last layer, the network produces a guess like "95% sure this is a 5." One scalar loss then measures how far that guess is from the truth. Using ordinary calculus, back-propagation works backward from the loss, giving every weight a tiny correction (gradient) that tells it which direction to nudge. Repeating this forward-then-backward loop across millions of examples sculpts early layers into simple edge detectors and deeper layers into high-level concept detectors (digits, faces, verb tenses, and beyond).
Because the learning recipe never changes, scaling is straightforward: feed in more data, add more layers and weights, and train longer on faster hardware. This simple "just add compute" rule is why a program that once read bank checks with 60 k weights has the same core logic as today's trillion-parameter language models. In fact, since 2010 the computation used to train headline systems has doubled roughly every six months, a pace far steeper than classic Moore's Law hardware gains.
To ground the math, imagine a single-neuron toy network. Two input pixels, x1 = 0.9 (bright) and x2 = 0.1 (dim), flow in through weights w1 = 2 and w2 = -1, with bias b = 0. The neuron adds everything up: z = w1·x1 + w2·x2 + b = 2 · 0.9 + (-1) · 0.1 + 0 = 1.7.
ReLU leaves positive numbers unchanged, so the output is 1.7. That lone neuron has effectively learned: "bright first pixel and dim second pixel = strong activation." Scale this to thousands of neurons per layer and dozens of layers in depth, and the network can discriminate brush strokes, cat ears, or idiomatic phrases—all by adjusting weights exactly like in this tiny example.
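Wiring that single neuron up in a few lines of Python makes the arithmetic easy to replay. The pixel values and weights here are illustrative numbers for the toy example, not anything learned:

```python
def relu(z):
    """Keep positives unchanged, zero out negatives."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, pushed through the squashing curve.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)

# Bright first pixel (0.9), dim second pixel (0.1), weights 2 and -1.
print(neuron([0.9, 0.1], [2.0, -1.0], 0.0))  # ~1.7: strong activation
print(neuron([0.1, 0.9], [2.0, -1.0], 0.0))  # 0.0: the reverse pattern is zeroed out
```

Training is nothing more than repeatedly adjusting those two weights (and the bias) in the direction back-propagation suggests.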
Convolutional Neural Networks (CNNs)
A CNN is a specialized neural network for images that trades millions of one-per-pixel weights for a handful of reusable filters. A single filter is a tiny 3 × 3 grid of numbers (3 × 3 × 3 for an RGB image) called a kernel. The network slides that kernel across the picture—left-to-right, top-to-bottom—and, at every position, multiplies corresponding pixel values by the kernel's weights, then sums the result. The output forms a new image-sized grid called a feature map that "lights up" wherever the filter's pattern (say, a vertical edge) occurs. Because the same nine numbers are reused everywhere with weight sharing, the model gains translation equivariance (it spots the edge no matter where it lies) while slashing parameters from millions to mere dozens. Early filters detect edges; deeper layers stack dozens of filters to detect corners, eyes, or whiskers; the deepest layers combine those into full objects. Pooling layers down-sample these maps, keeping "edge exists" but dropping exact coordinates, which saves memory and adds robustness to small shifts.
A toy example makes the math concrete. Take a 3 × 5 grayscale patch whose left side is bright (1) and right side is dark (0):

1 1 1 0 0
1 1 1 0 0
1 1 1 0 0

and a 3 × 3 vertical-edge kernel:

1 0 -1
1 0 -1
1 0 -1

Place the kernel on the top-left region, where every pixel is bright: each row contributes 1 · 1 + 1 · 0 + 1 · (-1) = 0, so the dot product is 0 and the flat region produces no response. Slide one step right so the kernel straddles the bright-to-dark boundary: each row now contributes 1 · 1 + 1 · 0 + 0 · (-1) = 1, and the dot product jumps to 3.
Repeat this slide-and-dot-product over the whole image to build a feature map whose bright spots mark vertical boundaries. Stack dozens of filters and layers, train their weights with back-prop just like any neural net, and a CNN can tell cats from dogs—or feed its feature maps into systems like AlphaGo and GPT-4o for higher-level tasks.
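That slide-and-dot-product loop is short enough to write out in plain Python. This sketch uses an illustrative bright-to-dark patch and a 3 × 3 vertical-edge kernel (plain lists, no framework):

```python
# A 3x5 patch: bright (1) on the left, dark (0) on the right.
patch = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
# A vertical-edge kernel: responds where bright meets dark.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

def convolve(image, k):
    """Slide kernel k over image, taking a dot product at each position."""
    kh, kw = len(k), len(k[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # Multiply kernel weights by the pixels under them, then sum.
            row.append(sum(
                image[r + i][c + j] * k[i][j]
                for i in range(kh) for j in range(kw)
            ))
        out.append(row)
    return out

print(convolve(patch, kernel))  # [[0, 3, 3]]: bright spots mark the boundary
```

The flat region on the left scores 0, while every position straddling the edge scores 3, which is exactly the "lighting up" behavior of a feature map.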
Self-Attention and the Transformer Architecture
Standard neural nets process inputs all at once, and CNNs process local pixel neighborhoods, but neither handles long-range relationships in language very efficiently. Recurrent networks tried to fix that by reading a sentence word-by-word, yet information had to hop through every intermediate step, making the path from, say, the first word to the last painfully long. The Transformer, introduced in 2017, upended that constraint with a single idea called self-attention.
At each layer, every token (word or sub-word) creates three tiny vectors: a query that asks, "What am I looking for?", a key that says, "What do I represent?", and a value carrying my actual content. The model compares every query with every key using a dot product; a softmax then turns these scores into weights that sum to one. Those weights say, for this token, how much attention to pay to every other token. Finally, the token builds a new representation by taking the weighted average of all the value vectors. Because this happens for all tokens in parallel, a single matrix multiply lets "dog" instantly focus on "runs" five words later or "quick" two words before—no step-by-step relay required.
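The query/key/value recipe fits in a few lines of NumPy. This is a minimal single-head sketch with made-up dimensions and random weights, not a real model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every query vs. every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                             # weighted average of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): one new vector per token
```

Note that nothing in the loop-free code depends on sequence order or length: every token attends to every other token in one matrix multiply, which is why the mechanism parallelizes so well.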
A few engineering twists make the mechanism practical. Multi-head attention splits each embedding into several sub-spaces (heads) so different heads can specialize. One might track subject-verb agreement, another coreference, and another rhythm. Because attention itself is order-agnostic, the model adds positional encodings—sinusoidal waves or learned vectors—so it can still tell "cat chases mouse" from "mouse chases cat." Each attention block is followed by a small two-layer feed-forward network, and the whole attention → feed-forward pair is stacked a dozen to hundreds of times, with residual (skip) connections and layer normalization keeping gradients stable.
Why was this a breakthrough? First, every token now has a direct "one-hop" path to every other token, so the network models long-range dependencies without the fading memory of RNNs. Second, the computations line up perfectly on GPUs and TPUs: the self-attention matrix multiply scales in parallel, so increasing data or model size simply means launching more hardware, not waiting longer per step. That architectural headroom is what let language models balloon from millions of parameters to the hundreds of billions that power today's chatbots.
The same blueprint now underlies multimodal systems. OpenAI's GPT-4o, whose system card describes it as "a single autoregressive model that natively handles text, images, and audio," feeds mixed tokens (sub-words, image patches, and audio frames) through the exact same attention layers, proving the mechanism is modality-agnostic. In practice, that means you can paste a chart, ask a follow-up voice question, and receive a coherent answer from the very same network, with no hand-engineered bridges in between. Self-attention, once invented for machine translation, has quietly become the universal wiring diagram for large-scale reasoning systems.
Reinforcement Learning
Reinforcement Learning (RL) turns a pattern-recognizing neural network into an active decision-maker. Imagine dropping a software "agent" into a world (anything from a Go board to a robotic arm to a web interface). The agent follows a perpetual loop. First, it looks around and captures a snapshot of the current situation, called the state. Next, it chooses an action: place a stone, bend a joint, or click a button. The world reacts, sliding into a new state and handing the agent a single number, the reward, that says how good or bad that move turned out. Finally, the agent updates its internal playbook, or policy, so that moves leading to higher long-term reward become a little more likely the next time a similar situation appears. Running this cycle millions of times lets the agent discover intricate strategies that no human ever explicitly programs.
The concept of "discounting" (Discount Factor) keeps the learning practical. Rewards that arrive many steps in the future are multiplied by a small factor so they count slightly less than immediate payoffs. Set the factor low and the agent behaves short-sightedly, fixating on quick wins; move it closer to one and it becomes patient, willing to endure short-term sacrifice for bigger future gain. This single knob lets designers balance impulsive versus farsighted behavior in everything from video game bots to energy-saving HVAC systems.
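Discounting is simple enough to sketch in one line of Python. The reward list and factor values below are illustrative:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each multiplied by gamma once per step of delay."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

payoff_in_three_steps = [0, 0, 0, 1]  # a single reward, three steps away
print(discounted_return(payoff_in_three_steps, gamma=0.9))  # ~0.729: patient agent
print(discounted_return(payoff_in_three_steps, gamma=0.5))  # 0.125: short-sighted agent
```

The same future reward is worth almost six times more to the patient agent, which is why the discount factor acts as a single knob for impulsive versus farsighted behavior.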
Different RL algorithms focus on different pieces of that loop. Value-based methods, such as Deep Q-Networks, try to predict the long-term score attached to every action in every situation, then simply pick the action with the highest forecast. Policy-gradient methods, like REINFORCE, bypass score-keeping and directly nudge the policy toward actions that worked well. Hybrids such as PPO pair a fast-acting "actor" (which chooses moves) with a "critic" (which judges how good those moves were), giving modern agents both stability and speed.
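The value-based idea can be sketched with a tabular update, a toy stand-in for Deep Q-Networks (which replace the table with a neural network). The states, actions, and numbers here are made up for illustration:

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Nudge the forecast for (state, action) toward: reward now + discounted best future."""
    best_future = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_future - Q[state][action])

# 2 states x 2 actions, all long-term score forecasts start at zero.
Q = [[0.0, 0.0], [0.0, 0.0]]

# The agent took action 1 in state 0, earned reward 1.0, and landed in state 1.
q_update(Q, state=0, action=1, reward=1.0, next_state=1)
print(Q[0][1])  # 0.1: the forecast moved a step toward the reward just seen
```

Repeating this update over millions of experiences is, in miniature, how a value-based agent's forecasts converge on the true long-term scores.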
A famous showcase is AlphaGo. Engineers first trained a neural network to evaluate Go positions by feeding it millions of human games. Then they let twin copies of the agent play each other endlessly, rewarding a win and penalizing a loss. Every match generated a new experience, which in turn refined the policy. After countless rounds of self-play, AlphaGo developed tactics unseen in centuries of human study and, in 2016, defeated the world champion. That victory crystallized RL's promise: with clear feedback and enough practice, an agent can outperform expert-written heuristics in vast, complex decision spaces.
Today the same trial-and-error recipe fine-tunes chatbots, teaches quadruped robots to walk, optimizes data center cooling, and even schedules spacecraft maneuvers. Strip away the engineering jargon, and each headline success hinges on the same simple engine: observe the world, act, harvest a reward, adjust, and repeat.
Multimodality—one Transformer that sees, listens, and speaks
Traditional AI systems treat each sense as a separate problem: an image classifier handles pictures, a speech recognizer handles audio, a language model handles text, and engineers stitch their outputs together with custom code. Multimodal models erase those boundaries. They feed pixels, sound waves, and words into the same Transformer layers, letting one set of attention weights discover how these inputs relate.
OpenAI's GPT-4o is the first production-scale showcase. According to its system card, the model is trained end-to-end on mixed text, images, audio, and even short video clips, so "all inputs and outputs are processed by the same neural network." A photo is broken into small patches, speech into short acoustic frames, and each patch or frame is simply treated as another token alongside ordinary word pieces. Self-attention then works exactly as it does for text: every token—whether it represents a pixel, a phoneme, or a word—scores how much every other token matters and blends their information in one parallel operation. Because the machinery is identical across modalities, the model can reason across them fluently: describe a chart it has just "seen," answer a follow-up voice question, and output either prose, an edited image, or synthesized speech without context-switch delays.
Speed and cost benefits followed. GPT-4o responds to voice in a median 320 milliseconds—human-conversation pace—and "matches GPT-4 Turbo on text while running much faster and 50% cheaper," the system card reports. A Medium deep dive echoes the headline figures: twice the speed, half the price, and a five-times higher rate limit than GPT-4 Turbo, all while outperforming it on vision tasks. Those efficiencies come from training one backbone instead of maintaining separate vision and language encoders.
Practically, the unified token space lets users ask compound questions that would stump duct-taped systems: "Here's a smartphone screenshot—highlight the error, draft the email to engineering, and read it back to me." The model never hands off to a different subsystem; it just runs another forward pass. Safety work rides the same rails: Red-teamers evaluated GPT-4o's text, image, and voice outputs under one policy umbrella rather than three disconnected pipelines, easing alignment and monitoring.
In short, multimodality takes the Transformer's all-to-all attention idea and extends it across senses. It finishes the ladder that began with neural nets (learnable weights), specialized with CNNs (vision), scaled with self-attention (global context), and was put into action by RL; now, the same architecture digests every kind of human signal, opening the door to agents that can seamlessly see, talk, and act.
Glossary
Neuron: a tiny calculator that multiplies its inputs by weights, adds a bias, and applies a squashing curve.
Weight / bias: the adjustable numbers a network tunes during training.
ReLU: a squashing curve that keeps positive numbers unchanged and zeros out negatives.
Loss: a single number measuring how far the network's guess is from the truth.
Back-propagation: the calculus routine that works backward from the loss, telling every weight which direction to nudge.
Kernel (filter): a tiny grid of numbers slid across an image to detect a pattern such as an edge.
Feature map: the grid of responses a filter produces, lighting up wherever its pattern occurs.
Token: a word, sub-word, image patch, or audio frame fed into a Transformer.
Self-attention: the mechanism by which every token scores how much every other token matters and blends their information in parallel.
Positional encoding: extra vectors that tell an order-agnostic model where each token sits in the sequence.
Agent, state, action, reward, policy: reinforcement learning's cast (the decision-maker, its snapshot of the world, its move, the feedback number, and its playbook).
Discount factor: the knob that makes future rewards count slightly less than immediate ones.