Language models are powerful tools that can generate natural language texts for various purposes, such as summarization, translation, dialogue, and more. However, training a large and effective language model requires a lot of data and computational resources, which are often scarce or expensive.
That’s why a new project called TinyLlama has caught the attention of many researchers and enthusiasts in the field of natural language processing (NLP). TinyLlama is a 1.1 billion parameter language model pre-trained on 3 trillion tokens, several hundred times the amount of text in the entire English Wikipedia.
What is TinyLlama?
TinyLlama is a project led by Zhang Peiyuan, a research assistant at the Singapore University of Technology and Design (SUTD). The project aims to pre-train a 1.1 billion parameter language model built on the Llama 2 architecture on 3 trillion tokens within a span of 90 days, using only 16 A100-40G GPUs.
Llama is a family of transformer-based language models introduced by Meta AI (Touvron et al.) in 2023. Like GPT-3, one of the most popular and powerful language models in the world, Llama is a decoder-only transformer. However, Llama has some advantages over GPT-3, such as:
- Llama uses a smaller vocabulary size (32K) than GPT-3 (50K), which reduces the memory footprint and improves the efficiency of the model.
- Llama’s training recipe is informed by the Chinchilla scaling laws (Hoffmann et al., 2022), which describe how to balance model size against training data for a fixed compute budget.
- The Chinchilla result is that, for compute-optimal training, parameters and training tokens should grow in roughly equal proportion, which works out to roughly 20 training tokens per parameter. Llama-style models deliberately train on far more tokens than that, trading extra training compute for better quality at a smaller, cheaper-to-run size (see the quick calculation after this list).
- The original Llama models were shown to outperform GPT-3 on most standard benchmarks despite being far smaller; for example, the 13B-parameter Llama matches or beats the 175B-parameter GPT-3 on most common-sense reasoning and question-answering tasks.
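To make the scaling-law point concrete, here is a rough back-of-the-envelope calculation in Python. This is only a sketch: the 20 tokens-per-parameter figure is the commonly quoted rule of thumb from the Chinchilla paper, not an exact constant.

```python
# Rough Chinchilla rule of thumb: ~20 training tokens per parameter
# is compute-optimal (Hoffmann et al., 2022).
params = 1.1e9              # TinyLlama's parameter count
tokens_per_param = 20       # approximate compute-optimal ratio

chinchilla_tokens = params * tokens_per_param   # ~2.2e10 (22 billion tokens)
tinyllama_tokens = 3e12                         # TinyLlama's 3 trillion token budget

print(f"Chinchilla-optimal tokens: {chinchilla_tokens:.1e}")
print(f"TinyLlama training tokens: {tinyllama_tokens:.1e}")
print(f"Over-training factor:      {tinyllama_tokens / chinchilla_tokens:.0f}x")  # ~136x
```

In other words, TinyLlama trains on roughly two orders of magnitude more data than the Chinchilla-optimal point for its size. That is exactly the bet the project is making: spend extra training compute now in exchange for a small model that is cheap to run later.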
TinyLlama borrows the architecture and tokenizer of Llama 2, Meta’s second-generation model family (released in sizes from 7B to 70B parameters and trained on about 2 trillion tokens), and scales it down to 1.1 billion parameters. The unusual part is the data budget: TinyLlama is pre-trained on roughly 3 trillion tokens, far more data per parameter than any Llama model has seen. TinyLlama is therefore not one of the largest models ever trained; it is deliberately small, and the open question is how far that much data can push a model of this size.
Why is TinyLlama Important?
- Efficient Training: TinyLlama challenges norms by training on just 16 GPUs in 90 days, showing that a capable model can be trained end to end with relatively modest resources.
- Data Matters: TinyLlama is a test of how much model quality can be bought with data alone; pre-training a 1.1B model on 3 trillion tokens goes far beyond what models of this size normally see.
- Versatile Applications: TinyLlama’s broad pre-training opens doors to more accurate, coherent, and diverse text generation across domains and tasks.
How Was TinyLlama Created?
In this section, we’ll delve into the development of TinyLlama, covering its foundational elements, including the architecture and tokenizer used for Llama. We’ll explore the data sources and the meticulous preprocessing methods applied to ensure data quality.
The Llama Architecture and Tokenizer
As mentioned earlier, TinyLlama adopts exactly the same architecture and tokenizer as Llama 2, scaled down to 1.1 billion parameters. The model is a decoder-only transformer with 22 layers, a hidden size of 2048 (shared by the input and output embeddings), an intermediate feed-forward size of 5632, and 32 attention heads using grouped-query attention with 4 key-value heads.
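In code, that configuration can be sketched with Hugging Face’s `LlamaConfig`. This is an approximation based on the published TinyLlama checkpoints; exact field values may differ slightly, and instantiating the model allocates several gigabytes of memory.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Approximate TinyLlama 1.1B configuration (Llama 2 architecture, scaled down).
config = LlamaConfig(
    vocab_size=32000,             # Llama 2 BPE vocabulary
    hidden_size=2048,             # embedding / hidden dimension
    intermediate_size=5632,       # SwiGLU feed-forward dimension
    num_hidden_layers=22,         # transformer blocks
    num_attention_heads=32,       # query heads
    num_key_value_heads=4,        # grouped-query attention
    max_position_embeddings=2048, # context length
)

model = LlamaForCausalLM(config)  # randomly initialized weights
print(sum(p.numel() for p in model.parameters()) / 1e9, "billion parameters")  # ~1.1
```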
The tokenizer of Llama 2 is a byte pair encoding (BPE) tokenizer that uses a vocabulary size of 32K. BPE is a subword segmentation algorithm that splits words into smaller units based on their frequency and co-occurrence in the data. BPE allows the model to handle rare or unknown words better than character-level or word-level tokenizers.
Because TinyLlama uses the same architecture and tokenizer as Llama 2, it can be plugged into the large ecosystem of open-source code and tools built around Llama models. For example, TinyLlama works with the Hugging Face Transformers library, which provides a high-level API for building and using transformer-based models in Python.
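As a minimal sketch of what that compatibility buys you, the snippet below loads a TinyLlama checkpoint through the standard Transformers API. The repository name is a placeholder; substitute whichever intermediate or final TinyLlama checkpoint you actually want to use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; swap in the actual TinyLlama repository
# you want to load from the Hugging Face Hub.
repo = "TinyLlama/TinyLlama-1.1B-intermediate-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# The 32K BPE tokenizer splits rare words into frequent subword pieces.
print(tokenizer.tokenize("electroencephalography"))

# Standard autoregressive generation, exactly as with any other Llama model.
inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```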
The Data Sources and Preprocessing
The training data for TinyLlama comes mainly from two sources: natural-language text from SlimPajama, a cleaned and deduplicated version of the RedPajama corpus (which itself draws on web crawls, books, Wikipedia, and other public text), and source code from Starcoderdata.
Together these corpora contain roughly 950 billion unique tokens, so reaching the 3 trillion token budget means passing over the data for roughly three epochs. Before training, the data also goes through several preprocessing steps to improve its quality. These steps include:
- Deduplication: removing duplicate or near-duplicate texts from the data sources using a hashing technique (a minimal sketch of the idea follows this list).
- Filtering: removing texts that contain offensive or sensitive content, such as hate speech, pornography, personal information, etc., using a classifier model.
- Sampling: selecting a subset of texts from the data sources based on some criteria, such as diversity, relevance, novelty, etc., using a ranking model.
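As a toy illustration of the deduplication step referenced above (not the project’s actual pipeline, which performs near-duplicate detection at a much larger scale; the function name here is ours):

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by hashing a normalized form of each document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello world.", "A different document."]
print(dedupe(docs))  # the second document differs only in casing and is dropped
```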
After these steps, the corpus spans a wide range of domains and genres, such as news, blogs, social media, code, fiction, and non-fiction. The data is then split into training and validation sets with a ratio of 99:1.
For more tips and tutorials on Meta’s code-focused Llama models, see our related post, “Llama Code: How Meta AI LLM Can Help You Write Better Code”.
The Hardware and Optimization Techniques
The hardware used for training TinyLlama is 16 A100-40G GPUs, with a total of 640 GB of GPU memory. The GPUs are connected by NVLink and NVSwitch, which enable high-speed data transfer and communication among them, and are hosted on a cloud platform that provides access to storage and networking resources. To make full use of this hardware, the training run combines a number of standard optimization techniques (a simplified training-loop sketch follows the list):
- Data parallelism: distributing the data across multiple GPUs and synchronizing the gradients after each batch using all-reduce operations.
- Model parallelism: splitting the model across multiple GPUs and exchanging the activations after each layer using pipeline parallelism or tensor slicing.
- Mixed precision: using half-precision (FP16) arithmetic for most of the computations and full-precision (FP32) arithmetic for some critical parts, such as gradient updates or loss calculations.
- Gradient accumulation: accumulating the gradients over several batches before updating the parameters to reduce the communication overhead and memory consumption.
- Gradient clipping: clipping the gradients to a maximum norm to prevent exploding gradients or numerical instability.
- Learning rate schedule: using a cosine annealing schedule with warmup and cooldown phases to adjust the learning rate during the training.
- Weight decay: applying a regularization term to the parameters to prevent overfitting or co-adaptation.
- Dropout: randomly dropping out some units or connections in the model to introduce noise and diversity in the training.
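The sketch below shows, in simplified PyTorch, how several of these techniques combine in a single training loop: mixed precision with loss scaling, gradient accumulation, gradient clipping, weight decay, and a cosine learning-rate schedule with warmup. The tiny stand-in model and random batches exist only so the snippet runs on its own; the real project trains the 1.1B Llama model on the token stream described earlier, with data and model parallelism handled by PyTorch’s distributed modules (not shown here).

```python
import math
import torch
from torch.cuda.amp import GradScaler, autocast

# Tiny stand-in model and synthetic batches so the sketch is self-contained;
# the real run uses the 1.1B-parameter Llama model and the pre-training corpus.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Embedding(32000, 256), torch.nn.Flatten(), torch.nn.Linear(256 * 128, 32000)
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)  # weight decay
scaler = GradScaler(enabled=(device == "cuda"))   # FP16 loss scaling (mixed precision)
accum_steps, max_norm = 8, 1.0                    # gradient accumulation / clipping
warmup_steps, total_steps = 100, 10_000           # cosine schedule with warmup

def lr_lambda(step):
    if step < warmup_steps:                       # linear warmup phase
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay ("cooldown")

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(1, 1001):
    tokens = torch.randint(0, 32000, (4, 128), device=device)  # fake token batch
    labels = torch.randint(0, 32000, (4,), device=device)
    with autocast(enabled=(device == "cuda"), dtype=torch.float16):  # FP16 compute
        loss = loss_fn(model(tokens), labels) / accum_steps
    scaler.scale(loss).backward()
    if step % accum_steps == 0:                   # update only every `accum_steps` batches
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # gradient clipping
        scaler.step(optimizer)                    # FP32 master-weight update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```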
How Does TinyLlama Perform?
In this section, we’ll delve into TinyLlama’s performance. We’ll explore its training journey, the outcomes achieved during training, the evaluation measures and comparisons against industry benchmarks, and finally, how it can be applied across various real-world scenarios and use cases.
The Training Progress and Results
The training of TinyLlama started on September 1, 2023, and is expected to end on November 30, 2023. As of October 8, 2023, TinyLlama has completed about 30% of the training, which corresponds to about 900 billion tokens. The training results so far show that TinyLlama is making steady progress and improvement.
The loss value, which measures the discrepancy between the model’s predictions and the actual tokens, has decreased from 3.5 to 2.8. The perplexity, which is the exponential of the loss and measures how well the model fits the data, has dropped accordingly from 33.1 to 16.4. Prediction accuracy has risen from 46.7% to 54.2%.
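The loss and perplexity figures are consistent with each other, since perplexity is simply the exponential of the average cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss), matching the reported figures.
print(round(math.exp(3.5), 1))  # 33.1 -> perplexity at the earlier checkpoint
print(round(math.exp(2.8), 1))  # 16.4 -> perplexity at the current checkpoint
```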
These results indicate that TinyLlama is learning from the data and becoming more fluent and confident in generating natural language texts. However, these results are only based on the validation set, which is a small subset of the data set. The true performance of TinyLlama can only be assessed by testing it on external data sets and tasks.
The Evaluation Metrics and Benchmarks
To evaluate the performance of TinyLlama, several metrics and benchmarks are used to compare it with other language models, such as GPT-3, Llama 2, or BERT. These metrics encompass various aspects of language understanding and generation, ensuring a comprehensive evaluation of TinyLlama’s capabilities. Key benchmarks include:
- GLUE: a collection of nine natural language understanding tasks, such as sentiment analysis, natural language inference, question answering, etc.
- SuperGLUE: an extension of GLUE with eight more challenging natural language understanding tasks, such as coreference resolution, textual entailment, commonsense reasoning, etc.
- SQuAD: a question answering task that requires the model to answer questions based on a given passage of text.
- CNN/Daily Mail: a text summarization task that requires the model to generate a summary of a news article.
- LAMBADA: a text completion task that requires the model to predict the last word of a sentence given its context.
- WikiText-103: a language modeling task that requires the model to predict the next word or token given a sequence of tokens.
- Zero-shot learning: a generalization task that requires the model to perform a new task without any fine-tuning or adaptation.
These metrics and benchmarks measure different aspects of the model’s capabilities, such as comprehension, generation, reasoning, memory, etc. They also cover different domains and genres of natural language texts, such as news, fiction, web texts, etc.
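When results are ready, one common way to run this kind of benchmark suite is EleutherAI’s lm-evaluation-harness. Below is a rough sketch assuming a recent version of that library (installed as `lm-eval`) and a placeholder checkpoint name; task names and API details may differ between versions.

```python
import lm_eval  # EleutherAI lm-evaluation-harness

# Placeholder repository name; substitute the TinyLlama checkpoint to evaluate.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TinyLlama/TinyLlama-1.1B-intermediate-checkpoint",
    tasks=["lambada_openai", "hellaswag", "arc_easy"],
    num_fewshot=0,   # zero-shot evaluation
    batch_size=8,
)
print(results["results"])
```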
Final evaluation results for TinyLlama are not yet available, as training is still ongoing. Given its size, TinyLlama should not be expected to beat far larger models such as the 175B-parameter GPT-3 or the 7B-70B Llama 2 family across the board. The more interesting comparison is against other models in the 1-2B parameter range, where TinyLlama’s unusually large 3 trillion token data budget gives it a real chance to set a new bar.
Conclusion
In this article, we have introduced TinyLlama, a compact 1.1 billion parameter language model pre-trained on an unusually large 3 trillion tokens. We have discussed what it is, how it was created, how it is trained and evaluated, and what it could be used for. We have also compared TinyLlama with other language models and highlighted its advantages and challenges.