2025-11-19

I recently taught a virtual AI workshop in Poland, and something surprised me. When I asked who knew what a token was, only about a third of the attendees raised their hands. Perhaps I’ve been living in an AI bubble, but it struck me that many developers are working with LLMs without truly understanding the fundamentals. So let’s fix that.

Tokens are the currency of LLMs. When you send “hello world” to an LLM, that text gets broken down into its constituent tokens. Send “hello world” to OpenAI, and you’re looking at three tokens, billed at a tiny amount per thousand tokens. The LLM processes your input and produces output tokens, which get billed at a different rate. To calculate your total cost, divide the input tokens by 1,000 and multiply by the input rate. Do the same for the output tokens with the output rate, and then add the two together.
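
As a rough sketch, the arithmetic looks something like this. The rates and the names `Pricing` and `estimateCost` are placeholders for illustration, not any provider’s real prices or API.

```typescript
// Hypothetical rates in USD per 1,000 tokens. Check your provider's pricing page.
interface Pricing {
  inputPer1k: number;
  outputPer1k: number;
}

// Input and output tokens are billed separately, then summed.
function estimateCost(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  const inputCost = (inputTokens / 1000) * pricing.inputPer1k;
  const outputCost = (outputTokens / 1000) * pricing.outputPer1k;
  return inputCost + outputCost;
}

const examplePricing: Pricing = { inputPer1k: 0.001, outputPer1k: 0.005 };

// 2,000 input tokens and 500 output tokens at the made-up rates above.
console.log(estimateCost(2000, 500, examplePricing).toFixed(4));
```

Many providers now quote prices per million tokens instead, in which case you would divide by 1,000,000.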

Here’s where it gets interesting. I ran a simple test using Claude 3.5 Haiku and sent it “hello world.” The response was friendly enough, asking how I was doing and offering to help. That used 20 output tokens, but bizarrely, 11 input tokens. That’s weird considering I only prompted it with two words. When I ran the exact same prompt through Google’s Gemini 2.0 Flash, I got 4 input tokens and 11 output tokens. Same prompt, different models, completely different token counts. What’s going on?

Every LLM has a different token vocabulary. These vocabularies contain all the different words, subwords, and characters that the model is aware of. Each is assigned a number, and that number serves as the token. When you call an LLM, it encodes your text into tokens by splitting it into the largest individual tokens in its vocabulary. “Hello world!” might become four tokens: “hello,” a space, “world,” and an exclamation mark. Each of these gets matched to its corresponding number in the vocabulary.

I tested this using OpenAI’s tokeniser implementation in TypeScript. I fed it a chunk of text about 2,300 characters long, and it produced fewer than 500 tokens. When I tried something simpler like “hello world,” the 11 characters became just 3 tokens. These tokens, which are simply numbers, are what actually gets fed into the LLM for processing.

The full process looks like this: your text goes in and is split into the largest chunks the model recognises. Each chunk is encoded into a token by looking up its number in the vocabulary. The LLM performs its thinking and outputs more tokens. Those tokens are then decoded back into text chunks and joined together. While you think the LLM is dealing with text, it’s actually dealing with numeric representations of text chunks.
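
The encode and decode steps can be sketched with a toy vocabulary. Everything here is made up for illustration: the vocabulary, the IDs, and the greedy longest-match strategy, which is a simplification of what production tokenisers do.

```typescript
// Toy vocabulary. A token's ID is simply its position in this array.
const vocabulary = ["hello", " ", "world", "!", "h", "e", "l", "o", "w", "r", "d"];

const tokenToId = new Map<string, number>();
vocabulary.forEach((token, id) => tokenToId.set(token, id));
const longestToken = Math.max(...vocabulary.map((token) => token.length));

// Encode: at each position, take the longest chunk found in the vocabulary.
function encode(text: string): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = false;
    for (let len = Math.min(longestToken, text.length - pos); len >= 1; len--) {
      const id = tokenToId.get(text.slice(pos, pos + len));
      if (id !== undefined) {
        ids.push(id);
        pos += len;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`No token covers "${text[pos]}"`);
  }
  return ids;
}

// Decode: look each number up again and join the chunks.
function decode(ids: number[]): string {
  return ids.map((id) => vocabulary[id]).join("");
}

const encoded = encode("hello world!");
console.log(encoded);         // [0, 1, 2, 3]
console.log(decode(encoded)); // "hello world!"
```

The model itself only ever sees the array of numbers in the middle.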

To understand why tokens differ between providers, you need to know how vocabularies get built. Imagine you have a corpus of text. In reality, this would be gigabytes or terabytes of data, usually the same data the model itself was trained on. But let’s use a simple example: “the cat sat on the mat.”

You could build an extremely simple tokeniser by extracting unique characters from this text. I coded up a character-level tokeniser in TypeScript that does exactly this. When I ran “cat sat mat” through it, the 11 characters produced 11 tokens. The entire vocabulary consisted of only 10 tokens: a space and nine letters, each assigned a unique number. This means the number of tokens always equals the number of characters, which is terrible. The more tokens you have, the more work the LLM has to do.
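
A minimal sketch of what such a character-level tokeniser could look like. This is not the exact code from the workshop; the function names are made up, and the input is “cat sat mat” (11 characters).

```typescript
// Assign each unique character the next free number, in order of first appearance.
function buildCharVocabulary(corpus: string): Map<string, number> {
  const vocab = new Map<string, number>();
  for (const ch of corpus) {
    if (!vocab.has(ch)) vocab.set(ch, vocab.size);
  }
  return vocab;
}

// One token per character: look each character up in the vocabulary.
function encodeChars(text: string, vocab: Map<string, number>): number[] {
  return Array.from(text, (ch) => {
    const id = vocab.get(ch);
    if (id === undefined) throw new Error(`Unknown character: "${ch}"`);
    return id;
  });
}

const charVocab = buildCharVocabulary("the cat sat on the mat");
const charTokens = encodeChars("cat sat mat", charVocab);

console.log(charVocab.size);    // 10
console.log(charTokens.length); // 11
```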

Vocabulary size really matters. Say you only have about a thousand tokens in your vocabulary. The word “understanding” might produce five tokens: “un,” “der,” “st,” “and,” “ing.” Bump that up to 50,000 tokens, and you might get larger subword chunks, producing just three tokens: “under,” “stand,” “ing.” At 200,000 tokens, you might split it into just two: “under” and “standing.” Two tokens are far more efficient for the LLM to process than five.
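
To make that concrete, here is a small sketch that counts tokens for “understanding” against three hand-picked vocabularies. The vocabularies are invented stand-ins for small, medium, and large ones, and the splits are illustrative rather than what any real tokeniser produces.

```typescript
// Greedy longest-match token count. Single characters always count as one token,
// on the assumption that every vocabulary contains them.
function countTokens(text: string, vocab: Set<string>): number {
  let count = 0;
  let pos = 0;
  while (pos < text.length) {
    let len = text.length - pos;
    while (len > 1 && !vocab.has(text.slice(pos, pos + len))) len--;
    pos += len;
    count++;
  }
  return count;
}

// Each vocabulary extends the previous one with longer chunks.
const smallVocab = new Set(["un", "der", "st", "and", "ing"]);
const mediumVocab = new Set([...smallVocab, "under", "stand"]);
const largeVocab = new Set([...mediumVocab, "standing"]);

console.log(countTokens("understanding", smallVocab));  // 5
console.log(countTokens("understanding", mediumVocab)); // 3
console.log(countTokens("understanding", largeVocab));  // 2
```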

These are the trade-offs different model providers make. You can’t simply scale vocabulary to infinity, either, because the larger your vocabulary, the bigger the model needs to be to accommodate it, and the more memory it requires to execute.
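
One place this cost shows up is the embedding table, which holds one vector per vocabulary entry. A back-of-the-envelope sketch, assuming a hidden size of 4,096 and 2 bytes per parameter (both numbers are my assumptions, not figures from any specific model):

```typescript
// The embedding table has one row per token and one column per hidden dimension.
function embeddingParameters(vocabSize: number, hiddenSize: number): number {
  return vocabSize * hiddenSize;
}

// Memory in megabytes, assuming 2 bytes per parameter by default (16-bit weights).
function embeddingMegabytes(vocabSize: number, hiddenSize: number, bytesPerParameter = 2): number {
  return (embeddingParameters(vocabSize, hiddenSize) * bytesPerParameter) / 1e6;
}

const assumedHiddenSize = 4096; // hypothetical model width

for (const vocabSize of [1_000, 50_000, 200_000]) {
  const megabytes = embeddingMegabytes(vocabSize, assumedHiddenSize).toFixed(1);
  console.log(`${vocabSize} tokens -> ${megabytes} MB`);
}
```

Under these assumptions, going from 50,000 to 200,000 tokens quadruples the size of the table, before counting any matching output layer.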

So how do we get from character-level to something better? We identify groups of characters that commonly occur together. In “the cat sat on the mat,” you’d find that “th” occurs together, “he” occurs together, and “at” appears in “cat,” “sat,” and “mat.” I built a subword-level tokeniser that does this. When I ran “cat sat mat” through it, those 11 characters produced only 8 tokens. The vocabulary expanded to 15 tokens, including subwords like “th,” “at,” and “he,” each of which was assigned its own number.
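
Here is one way such a subword tokeniser might be sketched. It is not the workshop code: I’m assuming that any character pair seen at least twice in the corpus gets added to the vocabulary, a threshold chosen because it happens to give 15 entries and 8 tokens for “cat sat mat.” The function names are made up.

```typescript
// Start with single characters, then add every adjacent pair that occurs
// at least minPairCount times (the threshold is an assumption for this sketch).
function buildSubwordVocabulary(corpus: string, minPairCount = 2): Map<string, number> {
  const vocab = new Map<string, number>();
  for (const ch of corpus) {
    if (!vocab.has(ch)) vocab.set(ch, vocab.size);
  }
  const pairCounts = new Map<string, number>();
  for (let i = 0; i < corpus.length - 1; i++) {
    const pair = corpus.slice(i, i + 2);
    pairCounts.set(pair, (pairCounts.get(pair) ?? 0) + 1);
  }
  for (const [pair, count] of pairCounts) {
    if (count >= minPairCount) vocab.set(pair, vocab.size);
  }
  return vocab;
}

// Prefer a two-character token when one exists, otherwise fall back to one character.
function encodeSubwords(text: string, vocab: Map<string, number>): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    const pairId = pos + 1 < text.length ? vocab.get(text.slice(pos, pos + 2)) : undefined;
    if (pairId !== undefined) {
      ids.push(pairId);
      pos += 2;
      continue;
    }
    const charId = vocab.get(text[pos]);
    if (charId === undefined) throw new Error(`Unknown character: "${text[pos]}"`);
    ids.push(charId);
    pos += 1;
  }
  return ids;
}

const subwordVocab = buildSubwordVocabulary("the cat sat on the mat");
const subwordTokens = encodeSubwords("cat sat mat", subwordVocab);

console.log(subwordVocab.size);    // 15
console.log(subwordTokens.length); // 8
```

With this threshold the five added pairs are “th,” “he,” “e ” (with a trailing space), “at,” and “t ” (with a trailing space).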

Real tokenisers go deeper, identifying larger and larger groups by finding groups of groups. They end up with sets of numbers, each assigned to an individual token.
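
The “groups of groups” idea can be sketched as a merge loop in the style of byte-pair encoding: merge the most frequent adjacent pair, then count again with the merged chunk treated as a single symbol. This is a simplified sketch with a made-up function name. Real tokenisers typically work on bytes, pre-split the text, and run many thousands of merges.

```typescript
// Repeatedly merge the most frequent adjacent pair of symbols.
// Ties go to the pair seen first. Stops early if no pair occurs more than once.
function trainMerges(corpus: string, numMerges: number): string[] {
  let symbols = corpus.split("");
  const merges: string[] = [];
  const separator = "\u0000"; // assumed never to appear in the corpus

  for (let step = 0; step < numMerges; step++) {
    const counts = new Map<string, number>();
    for (let i = 0; i < symbols.length - 1; i++) {
      const key = symbols[i] + separator + symbols[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }

    let best: string | undefined;
    let bestCount = 1;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        best = key;
        bestCount = count;
      }
    }
    if (best === undefined) break;

    // Rewrite the symbol list with the winning pair fused into one symbol.
    const [left, right] = best.split(separator);
    const next: string[] = [];
    for (let i = 0; i < symbols.length; i++) {
      if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
        next.push(left + right);
        i++;
      } else {
        next.push(symbols[i]);
      }
    }
    symbols = next;
    merges.push(left + right);
  }
  return merges;
}

console.log(trainMerges("the cat sat on the mat", 3)); // ["at", "th", "the"]
```

Notice that the third merge, “the,” is built from the earlier merge “th” plus “e.”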

Here’s something fascinating: when a tokeniser encounters an unusual word, it struggles. I tested this with “frabjous” from Lewis Carroll’s poem “Jabberwocky.” It’s a made-up word, so it rarely appears in the training data. OpenAI’s tokeniser split it into four separate tokens. Unusual words break up into more tokens than frequently occurring words. This also means that if you’re querying the LLM in a language that’s not well represented in the training data, your text will break up into more tokens. This is true for natural languages as well as programming languages. It typically takes fewer tokens to send 20 lines of JavaScript than 20 lines of Haskell, giving commonly used languages yet another advantage in the AI era.

Let me sum this up. Tokens are the currency of LLMs, and you’re charged by the token. Different model providers tokenise the same input differently and so produce different token counts. Encoding means turning text into tokens by splitting it into the largest chunks possible and converting them to numbers from the vocabulary. Decoding is simpler: you just take the numbers, find the matching chunks in your vocabulary, and join them together. The entire LLM process involves encoding your text into tokens, the LLM working on those tokens and producing output tokens, and then decoding those back into text. Tokens are essentially what the LLM uses for its thinking.

