Backprop is Main B****
A fundamental algorithm underpinning the "unreasonable effectiveness of deep learning."
If you didn’t catch my first blog post, my name is Andrew Bell and I am a Ph.D. Candidate at NYU researching fairness and explainability in Machine Learning (ML) and Artificial Intelligence (AI)… topics that have very little at best (or nothing at worst) to do with understanding the mechanisms behind Large Language Models (LLMs).
This series of blog posts is written for me (not you), and it documents my learning process as I attempt to understand how the f*** LLMs work, from the ground up. Importantly, you should be aware I am not an expert in LLMs, and I reserve the right to be wrong about anything and everything I write in these posts.
I am learning in public — a process you might find interesting; and if you’re feeling brave, you can even learn along with me. With this disclaimer out of the way, let’s get started on today’s post.
When I’m learning something new, I always seek to understand the fundamentals: I want to understand why things are done the way they are, and also why they aren’t done some other way. This article represents a journey inwards to find the center of LLMs.
If you are reading this, you likely know (or will be interested to know) that LLMs are built using something called the transformer architecture. This is the deep learning architecture based on the multi-head attention mechanism (something we will definitely study), and it underpins modern LLMs like ChatGPT and Llama. In other words, without the transformer architecture, there would be no LLMs as we know them today.
But importantly, this architecture is far from fundamental. In fact, as the word architecture would suggest, it’s a composite of many other tools from the machine learning world. These include embedding layers (which build on the principles we discussed in my last post), attention layers (queries, keys, values), softmax functions, and who the hell knows what else. One thing is for sure: there are giant, giant neural networks involved, i.e. deep learning. I think it’s fair to say that if we want to understand how LLMs function, we need to understand deep learning.
Hold on to your mind, we are going to get existential.
I was recently reading Dr. Simon J.D. Prince’s new textbook Understanding Deep Learning, where I came across this incredible claim:
To summarize, it’s neither obvious that we should be able to fit deep networks nor that they should generalize. A priori, deep learning shouldn’t work. And yet it does.
He notes that, in general, our understanding of why deep learning works so well is still quite limited. There are theories, like the spline theory of deep networks, but many aspects of deep learning remain counterintuitive. For example, there is evidence that, given a fixed amount of data, building bigger networks (networks whose number of parameters may dwarf the size of the data itself) can lead to better generalization on out-of-sample data, not overfitting. What? Also, you can keep training a model even after it reaches zero loss on the training data, and it will continue to reduce loss on the out-of-sample test data (and again, without causing problems for generalizability).
This all reminds me of humanity’s understanding of airplane lift: if you didn’t know, the simple textbook explanations of why airplanes stay in the air are famously incomplete, yet airplanes are remarkably effective at staying up there. And we can build them quite well (except for maybe Boeing).
With deep learning, it’s almost as if there is something complex about the “process of learning” we don’t fully understand — maybe these are the same mechanisms that power the human brain?
Maybe, but also maybe not. While “neural networks” were so-named because they mimic some structure of the human brain (i.e. neurons and activations), the analogy breaks down quickly. As a fundamental example, there is absolutely no evidence at all that the human brain uses a mechanism like backpropagation to learn.
Now, I know what many of you are thinking: what the f*** is backpropagation? We’ve arrived at the fundamentals: it is the algorithm that underpins how neural networks learn.
Backpropagation, or backprop for short, is “simply” a gradient estimation method, and it helps us train neural networks (said more technically: it lets us adjust the parameters of our model so that they minimize a loss). Backprop itself is actually a special case of a more general approach to computing gradients called reverse-mode automatic differentiation.
Note to the reader: I personally went down the rabbit-hole of understanding automatic differentiation and found it worthwhile, and at worst a good review of the Chain Rule for differentiation; if you want to do the same, start with this lecture.
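To make the chain-rule view concrete, here is a minimal sketch of reverse-mode automatic differentiation on scalars. This is my own toy illustration (inspired by micrograd-style implementations); the Value class, its methods, and the example function are names I made up for this post, not code from the lecture or from any real library.

```python
# A toy reverse-mode autodiff sketch: each Value remembers how it was computed,
# so gradients can flow backwards through the computation graph via the chain rule.

class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # local backward rule, filled in by each op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Chain rule: upstream gradient times the local derivative.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order and apply each local rule.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# f(x, y) = x*y + x  -->  df/dx = y + 1, df/dy = x
x, y = Value(2.0), Value(3.0)
f = x * y + x
f.backward()
print(x.grad, y.grad)  # 4.0, 2.0
```

That final backward pass, applied to the computation graph of a neural network’s loss, is essentially all backprop is.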
Honestly, I’m not sure how important it is that you understand how backprop works to meaningfully understand how LLMs work. This is also not the first time I’ve studied it; I’ve learned about backprop in my ML and AI courses at NYU. But this time, I wanted to really understand it, because one thing is for sure: if you don’t understand how backprop works at all, you have no hope of rebuilding LLMs from scratch.
Further, in my opinion, learning about how deep neural networks are actually trained demystifies some of the magic—suddenly something that sounds arbitrarily complex can feel principled and intuitive.
To make sure I understand the mechanisms of backprop, I created a simple notebook you can check out here. Unfortunately, I slightly failed in my objective: I actually implemented gradient descent (the process of using gradients to optimize an objective) rather than backprop itself (an approach for finding those gradients in neural networks). Luckily, in the ML world a lot of people incorrectly but colloquially use the term backprop to refer to the whole training process (gradient estimation + descent) rather than just the gradient-estimation part.
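For the curious, here is roughly what that gradient-descent loop looks like. This is a minimal sketch in plain NumPy under my own assumptions (a toy linear model with a squared-error loss), not the notebook’s actual code:

```python
# A rough sketch of gradient descent: fit y ≈ w*x + b by repeatedly
# stepping against the gradient of a mean-squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)  # "true" w=3, b=1, plus noise

w, b = 0.0, 0.0
lr = 0.1  # learning rate

for step in range(500):
    y_hat = w * x + b
    loss = np.mean((y_hat - y) ** 2)
    # Gradients of the loss, derived by hand for this tiny model;
    # in a deep network, backprop is what computes these for us.
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}, loss = {loss:.4f}")  # should land near w=3, b=1
```

The point of the sketch is the division of labor: something has to hand you the gradients (for a hand-derived toy model, calculus on paper; for a deep network, backprop), and gradient descent is just the loop that uses them.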
Either way, you’ll see in my notebook how effective gradient descent is at fitting models. As I’ve mentioned before, I’m not going to use these blog posts to teach you concepts like backprop or gradient descent; there are much better resources for doing so here. But I will give you my take: in a sense, gradient descent is Side B****, and backprop is Main B****. And deep neural networks are the mother f****n’ pimps.