If you don't understand Word Embeddings you can't understand Language Models
A Large Language Model is only as good as the embeddings upon which it is built... maybe?
This is the very first post on my journey to understanding the inner-workings of Large Language Models (LLMs).
I am a Ph.D. candidate in computer science studying artificial intelligence (AI). Most of my research has been on the fairness and explainability (i.e. how well something can be understood by humans) of predictive AI and ML systems… topics that have very little at best (or nothing at worst) to do with understanding LLMs.
I don’t want to be left behind, so this blog marks the beginning of my study into LLMs.
Normally I use Overleaf or GitHub for keeping notes, but this time I’ve decided to do something different and “learn in public.” Why not learn and maybe become internet famous in the process? Or internet embarrassed, let’s see…
Either way, this is what you can expect from my posts on this blog: they might not be well-written, well-edited, or even coherent… but they are a true representation of my approach to learning and my understanding of how LLMs work at the moment each post is published. You can expect to be introduced to learning resources, code, tangents on things I know about, steps forward, and steps backward. Come learn along.
Importantly, you should be aware I am not an expert in LLMs — I reserve the right to be wrong about anything and everything I write in these posts.
With that out of the way, let’s begin.
I’ve spent the last few months reading academic papers and watching videos on YouTube to build up my general understanding of LLMs (a solid first video is this introduction by Andrej Karpathy, a giant in this space). While watching Serrano.Academy (what I’ve found to be a great resource, bearing in mind I’m not enough of an expert to discern a good resource from a bad one), I heard this interesting claim:
Embeddings are really important in LLMs… I would say that they are the most important part of LLMs, and the reason is that they are the bridge between humans and computers.
I like the tautology of this claim. I did a Google search and found a few random articles that support this idea about the importance of embeddings to LLMs, and in fact, as if I needed a sign from the God of AI themselves, several of my Ph.D. colleagues are having a conversation about improving the quality of embeddings in their work at the exact moment I am writing this. They are talking about joining the embeddings of two different models. Hopefully in 6 months I’ll understand what that means.
Either way, let’s learn what embeddings are, and how to create them.
Word embeddings are numerical representations of words — this is what Serrano.Academy meant by embeddings being the “bridge between humans and computers.” We need numerical representations of words for them to be understandable by computers (or do we? A future research direction), and importantly, used in building AI models like Neural Networks (NNs).
Okay, great, simple enough. If you want to learn more about the basics of word embeddings, I recommend searching the web: I’m not the best person to teach you that topic and I don’t want to spend too many words on the basics.
To me, one of the interesting aspects of embeddings is that they have desirable properties. The property I hear about the most is that similar words should have similar embeddings. For example, if we have the following mappings:
# let e be some embedding function (or mechanism)
e('man') = 0.1
e('apple') = 0.8
Then we would also want:
e('woman') = 0.14
e('orange') = 0.78
In other words, the words man and woman have similar embedding values (i.e. are close in some vector space) because they are “similar” words, and they also have sufficiently different embeddings from apple and orange (which are themselves close to each other). Importantly, the word “similar” is doing a lot of heavy lifting here: what does it mean for two words to be similar to one another? Is similarity based on the definition of words, the way they are used in language, or the concepts we hold in our heads as we use those words?
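To make “close in some vector space” slightly more concrete: one common way to measure closeness is cosine similarity. Here is a minimal sketch with made-up 2-dimensional embeddings (real embeddings are much longer vectors, and the numbers below are invented purely for illustration):

import numpy as np

# hypothetical 2-dimensional embeddings, invented purely for illustration
embeddings = {
    'man':    np.array([0.90, 0.10]),
    'woman':  np.array([0.85, 0.15]),
    'apple':  np.array([0.10, 0.90]),
    'orange': np.array([0.12, 0.85]),
}

def cosine_similarity(u, v):
    # cosine of the angle between u and v: close to 1 means "pointing the same way"
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings['man'], embeddings['woman']))  # high -- similar words
print(cosine_similarity(embeddings['man'], embeddings['apple']))  # much lower -- dissimilar words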
Importantly, word embeddings inherently flirt with the entire field of linguistics. In fact, in their 1999 textbook Foundations of Statistical Natural Language Processing, Manning and Schütze dedicate several sections and chapters to linguistics topics like Linguistic Essentials, Word Sense Disambiguation, and Lexical Acquisition. I would love to read this textbook at some point.
One random thought: I would like to see more written on the desirable properties of word embeddings other than “similar words have similar embeddings.” It probably exists, I just need to find it.
The term word embeddings was proposed by Bengio et al. (2001, 2003), in foundational work that created embeddings using a NN architecture that I plan to understand in this post. I want to quickly flag for myself that word embeddings have come a long way since Bengio et al., and that people now use large, pre-trained mappings of words like word2vec to derive embeddings. Now, word2vec seems to refer to both a technique and several pre-trained models; for example, the original (or one of the original) models was trained by researchers at Google on roughly 100 billion words of Google News articles.
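As a quick, hedged sketch of what using such a pre-trained model looks like in practice, here is how the Google News vectors can be loaded with the gensim library and its downloader (this is just one convenient route I’m aware of, not something I do in this post):

# requires: pip install gensim
import gensim.downloader as api

# downloads (roughly 1.6 GB) and loads the pre-trained Google News word2vec vectors
model = api.load('word2vec-google-news-300')

print(model['king'].shape)                 # (300,) -- each word maps to a 300-dimensional vector
print(model.most_similar('king', topn=3))  # the 3 words with the most similar vectors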
A note to myself: I found this interesting paper I’d like to read which creates dict2vec, a word embedding model built from lexical dictionaries, e.g. Merriam-Webster.
To really understand this topic, I want to create my own word embeddings, mostly from scratch. I’m going to follow the original NN architecture proposed by Bengio et al., which is really made up of three important parts (a toy code sketch follows the list below):
An embedding layer that generates the word embeddings; this is really the input layer, where each word corresponds to a node, and the weights attached to the layer nodes will be used for the word embeddings.
A hidden layer (one or more layers), which introduces non-linearity to the embeddings; this layer also gives us the dimensions of the embedding, i.e. if the hidden layer has 3 nodes then each embedding will have 3 values, e.g.:
# embedding from a NN with a hidden layer of size 3
e('man') = (0.01, 0.98, 0.4)
An output layer, which, as far as I can tell, is only used for learning the embeddings. Sometimes there is a softmax layer here, but I’m not sure of its utility, so I’m going to gloss over it for now and come back to it at a later date.
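As promised above, here is a toy sketch of those three parts in PyTorch. This is my own simplified rendering of the description in the list, not Bengio et al.’s exact model (their context windows, training loop, and full-vocabulary softmax are all glossed over), and the vocabulary size and dimensions below are made up:

import torch
import torch.nn as nn

VOCAB_SIZE = 10  # toy vocabulary size, assumed for illustration
EMBED_DIM = 2    # hidden layer size = number of values in each embedding

class TinyEmbeddingNet(nn.Module):
    # toy rendering of the three parts listed above, not the exact Bengio et al. model
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # 1. embedding/input layer: one node per word; its weight matrix holds the embeddings
        self.embed = nn.Linear(vocab_size, embed_dim, bias=False)
        # 2. hidden non-linearity applied to the embedding values
        self.activation = nn.Tanh()
        # 3. output layer: only used for learning the embeddings (softmax is applied via the loss)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, one_hot_words):
        e = self.embed(one_hot_words)  # (batch, embed_dim) -- the embeddings themselves
        h = self.activation(e)
        return self.out(h)             # (batch, vocab_size) logits

model = TinyEmbeddingNet(VOCAB_SIZE, EMBED_DIM)
one_hot = torch.eye(VOCAB_SIZE)        # one one-hot vector per word in the toy vocabulary
logits = model(one_hot)
print(model.embed.weight.T[0])         # the 2-dimensional embedding of word 0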
To fully understand word embeddings, I created a Python Notebook playground in Google Colab, which can be found here. I added detailed comments to the Python Notebook and it should be easy enough to follow on its own, so I’ll save space here by only recapping the important points: the purpose is to create a 1-dimensional and a 2-dimensional word embedding of a simple document. We also experiment with using Principal Component Analysis (PCA) to create the embedding, which works surprisingly well (but importantly, this is likely only because our document has an underlying linear structure). If you don’t know what PCA is, see here.
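The notebook is the real walkthrough, but here is a rough sketch of the PCA idea, assuming we already have a small word-by-word co-occurrence matrix (the toy counts below are invented, and the notebook may set things up differently):

import numpy as np
from sklearn.decomposition import PCA

# toy co-occurrence counts for a 4-word vocabulary (rows = words), invented for illustration
words = ['man', 'woman', 'apple', 'orange']
cooccurrence = np.array([
    [0, 5, 1, 0],
    [5, 0, 0, 1],
    [1, 0, 0, 6],
    [0, 1, 6, 0],
], dtype=float)

# project each word's co-occurrence row down to 2 dimensions: a 2-D "embedding" per word
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(cooccurrence)

for word, vec in zip(words, embeddings_2d):
    print(f"{word:>7}: {vec.round(3)}")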
I want to end this first blog post the way I’ll end every blog post — by confusing the shit out of myself (and likely you). Apparently (if Stack Overflow is to be trusted), LLMs don’t use word embeddings from pre-trained models… and they don’t even use embeddings in the way we thought about them in this article at all. What the fuck?