2_notes (3)
Overview
This lecture delves into the intricacies of Large Language Models (LLMs), which are sophisticated
artificial intelligence systems that excel in processing and generating human language. LLMs are
transforming how we interact with technology, enabling more natural conversations and more effective
communication with machines.
Human Language
Human language is fundamentally a tool that facilitates communication and is a cornerstone of human
progress and innovation. It is not only a means of expressing ideas but also a meta-tool that enables the
creation, sharing, and refinement of various other tools. The complexity of human language arises from
its numerous irregularities, exceptions, idioms, and evolving meanings, which can pose challenges for
both learners and AI systems. Additionally, language is inherently ambiguous, as its meaning can shift
based on context, tone, and cultural nuances. This richness allows for creativity in expression, leading
to the emergence of new words and slang as society evolves. Furthermore, language has layers that
encapsulate emotional, cultural, and social norms, making it a dynamic and multifaceted system. As AI
models attempt to mimic human language, they must navigate these complexities to generate relevant
and coherent text.
Types of LLMs
Several prominent LLMs have emerged in recent years, each with unique features and capabilities.
Notable examples include DeepSeek-R1 by DeepSeek, GPT-4o by OpenAI, Qwen2.5 by Alibaba, and
Llama3.3 by Meta. Each of these models employs advanced techniques and architectures to enhance
performance and adaptability.
LLMs can be grouped by how they are accessed: downloadable models (Deepseek-R1, Qwen2.5,
Llama3.3) and API-based models (GPT-4o). Downloadable models can be installed and run on your
own computer, giving you more control and customization options, but they require a powerful machine
to operate. On the other hand, API-based models are hosted on the servers of the provider, meaning you
can use them over the internet through an application programming interface (API). This method is
easier to integrate into apps and doesn’t require you to manage any hardware, but it usually comes with
usage fees and can be slower because every request depends on an internet connection.
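To make the API-based access pattern concrete, here is a minimal sketch of building a request for a hosted model. The endpoint URL and field names loosely follow the chat-completions convention many providers use, but they are placeholders for illustration, not any specific vendor’s exact API.

```python
import json

# Illustrative only: the URL and payload fields below are placeholders
# modeled on the common chat-completions shape, not a real provider's API.
API_URL = "https://api.example.com/v1/chat/completions"

def build_request(prompt: str, model: str = "gpt-4o") -> str:
    """Serialize a minimal chat request; an HTTP client would POST this to API_URL."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return json.dumps(payload)
```

An application would send this JSON over HTTPS and read the generated text out of the response, so no local hardware beyond an internet connection is needed.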
Prediction Mechanisms
There are two primary mechanisms for predicting the next word in a text sequence: deterministic (not
commonly used) and stochastic (commonly used) prediction. Deterministic next word prediction means
that given the same input, the model will always predict the same next word, which can be useful in
controlled environments. In contrast, stochastic next word prediction introduces variability by utilizing
probabilities to determine the likelihood of various possible continuations. For instance, when presented
with the phrase "To be or not to ___," the model might predict "be" with a 75% probability, while other
options like "do" or "say" have lower probabilities. This stochastic nature allows LLMs to generate
diverse and contextually rich text, making their outputs more dynamic and engaging. Adjusting the
probabilities can control the randomness of the predictions, which can be particularly useful in
applications where creativity is desired.
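The stochastic prediction described above can be sketched in a few lines, using the "To be or not to ___" example. The 75% probability for "be" comes from the text; the probabilities for "do" and "say" are assumed filler values, and a real model would assign probabilities over its entire vocabulary.

```python
import random

def sample_next_word(rng: random.Random) -> str:
    """Draw one continuation for 'To be or not to ___' from a toy distribution."""
    candidates = ["be", "do", "say"]
    # 0.75 for "be" is from the lecture; the other two values are assumed.
    probabilities = [0.75, 0.15, 0.10]
    return rng.choices(candidates, weights=probabilities, k=1)[0]

rng = random.Random(0)  # fixing the seed makes the randomness reproducible
word = sample_next_word(rng)
```

Each call may return a different word, which is exactly what makes the model's output diverse rather than fixed.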
Training Data
The quality and diversity of training data are critical to the success of LLMs. High-quality training data
must be large in volume and diverse in content, encompassing a wide range of topics and language uses
to ensure the model learns effectively. Most LLM providers do not disclose the exact datasets used, but
they typically consist of vast amounts of text harvested from the internet, including books, articles, and
websites. This data must be carefully curated to avoid biases and maintain relevance. For example,
Hugging Face’s FineWeb dataset comprises 3 billion web pages from 39 million domains,
demonstrating the scale required for effective training. The removal of personally identifiable
information (PII) and irrelevant content is also essential to ensure ethical use and compliance with data
privacy standards.
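One curation step mentioned above, removing PII, can be illustrated with a toy filter. The regular expression below only catches simple email addresses; real pipelines such as the one behind FineWeb use far more elaborate rules, so treat this as a simplified sketch.

```python
import re

# Toy PII scrubber: replaces email addresses with a placeholder token.
# Real curation pipelines handle many more PII categories (names,
# phone numbers, addresses) with much more robust detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_pii(text: str) -> str:
    """Replace anything that looks like an email address with [EMAIL]."""
    return EMAIL_RE.sub("[EMAIL]", text)
```

Running every harvested document through filters like this, before training, is part of keeping the dataset compliant with privacy standards.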
Important Terms
• Tokens are the fundamental units of text that the model processes, typically representing words
or subwords. Tokenization is the process of breaking text down into these units, which enables
the model to analyze and generate text more efficiently. Effective tokenization helps the model
manage vocabulary size and handle rare or complex words by breaking them into manageable
pieces, allowing for better performance across diverse language inputs.
• The transformer model is a crucial architecture that powers many LLMs, utilizing attention
mechanisms to determine which parts of the input data are most relevant during processing.
This attention-based approach allows LLMs to consider the relationships between words in a
sentence, improving their ability to generate coherent and contextually appropriate text.
Attention works by computing a score for each word in relation to others, determining which
words should be emphasized when making predictions. This is particularly beneficial for
capturing long-range dependencies in language, where the meaning of a word can be influenced
by others that are far apart in the text. For example, in the sentence “The cat that chased the
mouse was very quick,” the attention mechanism helps the model understand that “cat” and
“quick” are related despite being separated by several words.
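Subword tokenization can be made concrete with a toy greedy longest-match tokenizer over a hand-picked vocabulary. Real tokenizers (e.g., byte-pair encoding) learn their vocabularies from data, so this is only a sketch of the idea that rare words get split into known pieces.

```python
def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        j = len(word)
        # shrink the candidate piece until it appears in the vocabulary
        while j > i and word[i:j] not in vocab:
            j -= 1
        if j == i:  # no piece matches: the word cannot be tokenized
            return ["[UNK]"]
        tokens.append(word[i:j])
        i = j
    return tokens

# Toy vocabulary chosen by hand for this example
vocab = {"un", "break", "able", "the", "cat"}
pieces = tokenize("unbreakable", vocab)  # a rare word split into known subwords
```

Even though "unbreakable" is not in the vocabulary, the model can still represent it as three familiar pieces.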
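The attention computation described in the bullet above can also be sketched with toy vectors. The 2-dimensional "word vectors" here are made-up numbers, not real embeddings; the point is the mechanism of scoring, normalizing, and weighting.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention over toy vectors.

    Scores each key against the query, normalizes the scores into
    weights, and returns the weighted sum of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # how strongly each word is emphasized
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output
```

A key that points in the same direction as the query receives the highest weight, which is how "cat" can be emphasized when predicting words related to "quick" even across a long gap.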
Customizations of LLMs
Customizing LLMs can significantly enhance their performance and adapt them to specific tasks. Two
common customization options include adjusting the temperature and the maximum length of generated
text. The temperature controls the randomness of predictions; a lower temperature (e.g., 0.2) results in
more deterministic and focused outputs, while a higher temperature (e.g., 1.0) yields more diverse and
creative responses. This allows users to strike a balance between coherence and creativity based on their
needs.
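The effect of temperature can be shown by rescaling a model’s raw scores (logits) before converting them to probabilities. The three logits below are made-up numbers for three candidate words; only the rescaling step reflects how temperature actually works.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                            # made-up candidate scores
focused = softmax_with_temperature(logits, 0.2)     # low T: nearly deterministic
creative = softmax_with_temperature(logits, 1.0)    # high T: flatter, more diverse
```

At temperature 0.2 almost all probability mass lands on the top candidate, while at 1.0 the lower-scoring candidates keep a realistic chance of being sampled.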
The maximum length parameter dictates how many tokens the model will generate in response to a
prompt. Setting an appropriate maximum length is crucial, as it affects the completeness and relevance
of the output. Too short a length may truncate valuable information, while too long a length can lead to
irrelevant or overly verbose responses. By fine-tuning these parameters, users can tailor LLM outputs
to better suit specific applications, whether for casual conversation, technical writing, or creative
storytelling.
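A toy generation loop shows how the maximum-length parameter bounds the output. The next_token argument is a stand-in for a real model’s sampler; here any callable that returns a token will do.

```python
def generate(prompt_tokens: list[str], next_token, max_length: int,
             stop_token: str = "<eos>") -> list[str]:
    """Generate tokens until the model emits a stop token or hits max_length."""
    output = list(prompt_tokens)
    while len(output) - len(prompt_tokens) < max_length:
        token = next_token(output)
        if token == stop_token:  # the model decided the response is complete
            break
        output.append(token)
    return output[len(prompt_tokens):]  # return only the newly generated part
```

If max_length is too small the loop cuts the response off mid-thought; if it is very large, generation only stops when the stand-in model emits the stop token, which is how overly verbose outputs arise.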