
Four Concepts to Understand Large Language Models

Token, context window, temperature, hallucination — these four concepts determine what you should and shouldn’t expect from an LLM. For most people who use ChatGPT, Claude, or Gemini every day, artificial intelligence is nothing more than these tools. When you ask “What is an LLM?” the answer you get is usually “it’s AI.” This is like someone who drives a car answering “What is an engine?” with “the thing that makes the car go.” Technically correct, practically useless — because when the LLM’s output isn’t what you expected, it doesn’t help you analyze where the error came from.

LLM — Large Language Model — is not artificial intelligence itself, but a subfield of artificial intelligence. It’s a model trained on massive amounts of text data, specialized in generating and understanding language. Understanding these four concepts enables you to get better results using the same tool.

1. Token: The LLM’s Unit of Reading

LLMs don’t work with words — they work with tokens. A token is sometimes a word, sometimes a syllable, and sometimes just a single character. “Hello world” is two words and, in most tokenizers, two tokens; add punctuation (“Hello, world!”) and the count jumps to four. Agglutinative languages like Turkish consume more tokens; “kullanılamayacaklarından” (a single Turkish word meaning “because of those that cannot be used”) is one word but many tokens.

Let’s examine this difference through Nasreddin Hodja’s “What If It Works?” joke:

Token analysis of a Nasreddin Hodja joke in Turkish — 233 characters, 89 tokens
English translation of the same joke — 250 characters, 67 tokens

The same story, similar character counts (233 vs. 250), but the Turkish version uses 89 tokens while the English version uses 67 tokens. In this example, using Turkish costs roughly 33% more tokens to say the same thing — while the ratio varies from model to model and tokenizer to tokenizer, the difference is always noticeable. (The curious can test their own texts with the OpenAI Tokenizer.)

One reason is Turkish’s agglutinative structure: words like “oturmuş” (was sitting), “elindeki” (in his hand), “sormuş” (asked) get split into multiple tokens, while their English equivalents like “sitting” and “asked” are typically single tokens. But the reason isn’t just the language’s structure — there’s a more fundamental cause: the vast majority of these models’ training data consists of English text. The tokenizer — the algorithm that splits text into tokens — is optimized to represent English words and patterns far more efficiently. Languages like Turkish, which are less represented in the training data, get split into more tokens for this reason as well.
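The way a subword tokenizer breaks an agglutinative word apart can be sketched with a toy greedy longest-match tokenizer. This is an illustration only: the vocabulary below is invented for this example, and real tokenizers (BPE and friends) learn tens of thousands of subword pieces from training data rather than using a hand-written list.

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is invented
# for this illustration; real tokenizers learn their subwords from data.
VOCAB = {
    "sitting", "asked",                      # common English words: one token
    "kullan", "ıl", "ama", "yacak", "ların", "dan",  # Turkish subword pieces
}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right.
    A single character is the fallback when nothing in VOCAB matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            piece = word[i:j]
            if piece in VOCAB or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("sitting"))
# → ['sitting']  (one token)
print(tokenize("kullanılamayacaklarından"))
# → ['kullan', 'ıl', 'ama', 'yacak', 'ların', 'dan']  (six tokens)
```

Because the vocabulary was shaped mostly by English text, the English word survives whole while the Turkish word shatters into pieces — the same effect the joke comparison above shows at paragraph scale.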

Why it matters: Pricing, speed, and context — all are measured in tokens. When you write a long prompt, or more precisely, when you write a prompt that contains more tokens even if it’s the same length, you’re consuming more resources. The person who conveys the same task with fewer tokens gets faster results at a lower cost.
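The cost difference is simple arithmetic over token counts. The per-token rate below is made up for illustration (check your provider’s current pricing); the token counts are the ones from the joke example above.

```python
# Illustrative pricing only: the rate is hypothetical, not any provider's.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # e.g. $0.003 per 1,000 input tokens

def prompt_cost(tokens: int) -> float:
    """Cost of a prompt at the hypothetical per-token rate."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

turkish_tokens, english_tokens = 89, 67  # same joke, two languages
print(f"Turkish: ${prompt_cost(turkish_tokens):.6f}")
print(f"English: ${prompt_cost(english_tokens):.6f}")
print(f"Token overhead: {turkish_tokens / english_tokens - 1:.0%}")
# → Token overhead: 33%
```

At toy prices the absolute difference is negligible, but the same ratio applies to every request, every day, at any scale — and to latency as well, since generation time also grows with token count.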

2. Context Window: Short-Term Memory

The context window is the total number of tokens an LLM can see at once — including both what you write and what the model generates. What happens when this window fills up varies depending on the tool you’re using:

  1. Old information is silently dropped. Messages from the beginning of the conversation are removed from the context. The model no longer “sees” them and therefore cannot remember them.
  2. You’re asked to start a new chat. Some interfaces directly warn “Context is full, please start a new chat” when the limit is reached.
  3. The context is summarized and compressed. More advanced systems extract a summary of the current conversation, compressing the context and allowing you to continue — but details are inevitably lost in the process.
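The first strategy, silently dropping the oldest messages, can be sketched in a few lines. This is a simplified model: real systems count tokens with the model’s own tokenizer, while here word count stands in as an approximation.

```python
def trim_context(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the oldest messages until the conversation fits the window.
    Word count approximates token count for illustration."""
    def count(msg: str) -> int:
        return len(msg.split())

    trimmed = list(messages)
    while trimmed and sum(count(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # the oldest message is silently dropped
    return trimmed

history = ["My name is Ada.", "I like tea.", "Explain closures in Python."]
print(trim_context(history, max_tokens=8))
# → ['I like tea.', 'Explain closures in Python.']
```

Note what happened: the user’s name was in the first message, so after trimming, the model literally cannot see it anymore — which is exactly the “forgetting” described below.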

Context window size varies from tool to tool: some models offer 200K tokens, while others go up to 1M or even 2M tokens (2M tokens is roughly 1,500,000 words — a 6,000-page book). Moreover, the same model’s API and chat interface may offer different limits. This sounds enormous, but a medium-sized code project can easily fill this window.

Why it matters: When the context window fills up, the LLM starts “forgetting” what you said at the beginning of the conversation. When answers become inconsistent in a long chat, when it ignores instructions you gave earlier — the reason is usually this. The moment you say “I just told you that, why don’t you remember?” you’ve probably hit the context window limit.

But here’s what’s really interesting: even being within the context window doesn’t guarantee that all data will be remembered. The information is still inside the window, but the model may not be able to find it. Research shows that models are very proficient with information at the very beginning and very end of the conversation, but tend to make errors with content in the middle.

There are even users who test this in practice: at the start of a conversation, they provide a random piece of information like “I drink my coffee every morning at 9:26 AM.” As the conversation grows longer, they periodically ask “What time do I drink my coffee?” The moment the model can no longer recall this information, they know the context is no longer reliable and start a new chat. It’s like searching for a needle in a haystack — and this test is actually called exactly that: Needle in a Haystack. This test, which measures whether a model can find a single piece of information hidden within a long context, proves that context window size alone is not sufficient.
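The test harness side of this probe is easy to sketch: bury one “needle” sentence at a chosen position inside meaningless filler, then ask the model about it. The function below only builds the prompt; sending it to a model and checking whether the answer contains “9:26” is the part that varies by API, so it is left as a comment.

```python
def build_haystack(needle: str, n_filler: int, position: float) -> str:
    """Bury one 'needle' sentence in meaningless filler at a relative
    position (0.0 = very start, 1.0 = very end)."""
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    idx = int(position * len(filler))
    filler.insert(idx, needle)
    return " ".join(filler)

needle = "I drink my coffee every morning at 9:26 AM."
prompt = build_haystack(needle, n_filler=1000, position=0.5)
# Send `prompt` + "What time do I drink my coffee?" to the model, then check
# whether the answer contains "9:26". Repeating this for positions 0.0, 0.5,
# and 1.0 typically shows the mid-context weakness described above.
```

Sweeping `position` across the window is exactly how the “lost in the middle” effect is measured: recall is usually strong at 0.0 and 1.0 and weakest somewhere in between.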

You might be thinking, “But ChatGPT remembers me — it refers to things I said last week?” Some products offer persistent layers like Memory or Project Memory beyond the context window. These are information fragments carried between conversations — your name, preferences, recurring instructions, and so on. But this memory isn’t magic either: the stored information is added to the context window as part of the context when generating a response. So ultimately, it’s still subject to the same window limits.
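Mechanically, that last point can be shown in a few lines: stored memory is just text prepended to the prompt, so it spends window tokens like everything else. This is a simplified sketch of the idea, not any product’s actual implementation.

```python
def build_prompt(memory: list[str], conversation: list[str], question: str) -> str:
    """Persistent 'memory' is not separate magic: it is injected into the
    context on every request, so it consumes context-window tokens too."""
    parts = ["Facts about the user:"] + memory + conversation + [question]
    return "\n".join(parts)

memory = ["Name: Ada", "Prefers concise answers"]
print(build_prompt(memory, ["User: Hi", "Assistant: Hello!"], "User: Who am I?"))
```

The more facts the memory layer accumulates, the fewer tokens remain for the conversation itself — persistence trades window space for continuity.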

3. Temperature: The Creativity Dial

Temperature determines the level of randomness in an LLM’s responses. Low temperature (close to 0) produces more predictable, consistent answers. High temperature produces more creative but riskier outputs. The upper limit varies by model: OpenAI and Google models work in a 0–2 range, while Claude stays within 0–1.

Let’s try the same prompt at different temperatures — “Describe in one sentence the relationship between the arrival of spring and the release of a new version of a program”:

Temperature 0: “Just as spring brings new vitality and freshness to nature, a new version of a program offers users a fresh start where bugs are fixed, features are improved, and the experience is renewed.”

Temperature 1: “Both bring innovations, improvements, and renewal after a long waiting period; however, users should still be prepared to encounter a few bugs along the way. 🌸💻”

The first is safe and predictable; the second is creative enough to use emojis.

When writing code, low temperature is generally preferred. This ensures the model produces more consistent and predictable outputs. However, slightly higher values can be useful when you want to explore alternative solutions. When brainstorming, you want high temperature — your chances of encountering surprising ideas increase. Although most systems now manage this setting in the background, knowing how it works helps us understand why we sometimes get “robotic” and sometimes “outlandish” responses.
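Under the hood, temperature rescales the model’s raw scores (logits) before they are turned into a probability distribution for sampling. The sketch below uses three hypothetical candidate-token scores; note that as the temperature approaches 0, sampling approaches always picking the top token (which is why dividing by exactly 0 is never done in practice).

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature before softmax: low temperature
    sharpens the distribution, high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.2 nearly all probability mass piles onto the top-scoring token (the “safe” answer); at T=2.0 the distribution flattens and lower-ranked tokens get real chances — which is where both the surprising ideas and the outlandish ones come from.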

4. Hallucination: Why Does It Say Wrong Things So Confidently?

An LLM is not an encyclopedia. It doesn’t memorize information and recall it. At each token generation, it answers the question “What is the most likely next token?” This is a probability calculation, not information retrieval.

This is exactly where hallucination comes from. The model can produce an answer that “looks correct” statistically but doesn’t actually exist in reality. It gives the name of a nonexistent library, cites a fabricated source — and does so in an extremely convincing tone. Because its job is to produce convincing text, not to provide accurate information.

Why it matters: Instead of trusting LLM output as “correct,” you should approach it as “plausible.” You should always verify critical information.

Where Does This Knowledge Take You?

Knowing these concepts takes you from being an “AI user” to being an “informed AI user.” When you know about the context window limit, you understand when to reset the conversation. When you know about hallucination, you don’t blindly accept the output. When you know about temperature, you adjust the tool according to the type of task.

An LLM is not a thinking machine — it’s a prediction machine. But if you know how it predicts, you can make its predictions work in your favor.

So how are these limitations being overcome? RAG, Tool Use, Agentic AI — we’ll cover the approaches the industry is building on top of these concepts in the next article.

Which of these four concepts do you feel the absence of most in your daily usage? The forgetting of context, or hallucinations?
