Meta AI’s LLaMA (Large Language Model Meta AI) is a large language model released in February 2023. Models were trained at sizes ranging from 7 billion to 65 billion parameters. LLaMA’s developers reported that the 13-billion-parameter model outperformed the much larger GPT-3 (175 billion parameters) on most NLP benchmarks, and that the largest model was competitive with state-of-the-art models such as PaLM and Chinchilla.
LLaMA is a transformer-based language model, meaning it is built on the transformer neural network architecture originally introduced for machine translation. Transformers can capture long-range dependencies between words, which makes them well suited to natural language understanding and generation tasks.
Tips: Fill out this form to get weights for the LLaMA models.
After obtaining the weights, they must be converted to the Hugging Face Transformers format using the conversion script. The script can be invoked with the following (example) command:
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
Following conversion, the model and tokenizer can be loaded by using:
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
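Once loaded, the model can be used to generate text. The snippet below is a minimal sketch, assuming the converted checkpoint lives at /output/path as above; the prompt and generation settings are illustrative:

# Minimal generation sketch; prompt and settings are illustrative.
prompt = "The LLaMA model is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))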
It should be noted that running the script requires enough CPU RAM to host the whole model in float16 precision (even though the largest versions come in several checkpoints, each of which holds only a portion of the model’s weights, they all need to be loaded in RAM). Thus, about 130 GB of RAM is required for the 65B model.
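To keep memory usage down, the weights can also be loaded directly in half precision. The following is a minimal sketch, assuming PyTorch is installed and /output/path is the converted checkpoint directory:

import torch
from transformers import LlamaForCausalLM

# Loading directly in float16 avoids first materializing the weights in float32;
# low_cpu_mem_usage loads the sharded checkpoint files incrementally.
model = LlamaForCausalLM.from_pretrained(
    "/output/path",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)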
The LLaMA tokenizer is a SentencePiece BPE model. When decoding a sequence, if the first token is the start of a word (e.g. “Banana”), the tokenizer does not prepend a prefix space to the string.
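As a quick illustration of this behavior, the sketch below encodes and decodes a short string with an already loaded tokenizer; the exact token ids depend on the vocabulary of the checkpoint:

# Illustrative only: the exact token ids depend on the checkpoint's vocabulary.
text = "Banana split"
ids = tokenizer.encode(text)          # includes a BOS token by default
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))  # "Banana split", no extra leading space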
LLaMA Model
class transformers.LlamaModel(config: LlamaConfig)
LLaMA Model Parameters
config (LlamaConfig) — Model configuration class containing all of the model’s parameters. Initializing with a config file loads only the configuration, not the weights associated with the model. To load the model weights, use the from_pretrained() method.
The bare LLaMA model, which outputs raw hidden states with no specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all of its models (such as downloading or saving, resizing input embeddings, pruning heads, and so on).
This model is also a subclass of PyTorch torch.nn.Module. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all basic usage and behavior questions.
A transformer decoder consisting of config.num_hidden_layers layers, each of which is a LlamaDecoderLayer.
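For example, the model can be built either from a configuration (randomly initialized weights) or from pretrained weights. The sketch below uses illustrative configuration values that do not match any released LLaMA checkpoint:

from transformers import LlamaConfig, LlamaModel

# Initializing from a config gives a randomly initialized model; sizes are illustrative.
config = LlamaConfig(
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=1024,
)
model = LlamaModel(config)

# Loading pretrained weights instead:
# model = LlamaModel.from_pretrained("/output/path")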
forward(input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None)
LLaMA forward() Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default if provided.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for unmasked tokens
- 0 for masked tokens.
If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).
If you want to change the padding behavior, read modeling_opt._prepare_decoder_attention_mask and make the necessary changes.
- 1 denotes that the head is not masked.
- 0 denotes that the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple containing 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention and cross-attention blocks) that can be utilized to speed up sequential decoding (see past_key_values input).
If past_key_values is used, the user can optionally input only the last decoder_input_ids of shape (batch_size, 1) (those whose past key value states were not given to this model) instead of all decoder_input_ids of shape (batch_size, sequence_length); see the sketch after this parameter list.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Instead of giving input_ids, you can also pass an embedded representation directly. This is handy if you want greater control over how input_ids indices are converted into associated vectors than the model’s internal embedding lookup matrix provides.
use_cache (bool, optional) — If True, the key value states from the past are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return all attention layers’ attention tensors. For more information, see attentions under returned tensors.
output_hidden_states (bool, optional) — Whether or not to return all layers’ hidden states. For further information, see hidden_states under returned tensors.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple.
The LlamaModel forward method overrides the __call__ special method.
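The sketch below shows how some of these arguments fit together, assuming a LlamaModel and tokenizer are already loaded: a first pass over a full prompt that returns the cache and hidden states, then a second pass that feeds only one token together with past_key_values. The prompt is illustrative.

# First pass over the full prompt; request the cache and hidden states.
inputs = tokenizer("Hello, LLaMA", return_tensors="pt")
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    use_cache=True,
    output_hidden_states=True,
    return_dict=True,
)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)

# Second pass: with past_key_values, only the newest token needs to be passed.
# Here the last prompt token is reused just to show the call shape.
next_token = inputs["input_ids"][:, -1:]
outputs = model(
    input_ids=next_token,
    past_key_values=outputs.past_key_values,
    use_cache=True,
)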
Conclusion
Finally, the LLaMA model represents a game-changing breakthrough in language-driven machine learning. It enables machines to understand, generate, and communicate in human-like ways by leveraging the power of language. LLaMA has enormous potential to transform different industries and shape the future of artificial intelligence technologies. Please feel free to share your thoughts and feedback in the comment section below.