Large Language Model (LLM) Interview Questions and Answers
1. What is Tokenization?
Tokenization is the process of breaking down text into smaller units, called tokens, which can be individual words, subwords, characters, or even meaningful phrases.
- Why is it important?
- Language models process text as sequences of numbers, where each token corresponds to an index in a vocabulary.
- It improves computational efficiency by converting text into a format understandable by models.
- Handles rare words by splitting them into subwords, e.g., “unbelievable” → [“un”, “believable”].
- Ensures flexibility in multilingual tasks and reduces the size of vocabulary.
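Below is a minimal sketch of greedy longest-match subword tokenization using a hypothetical toy vocabulary; real tokenizers (BPE, WordPiece) learn their vocabularies from data, so the splits shown are purely illustrative.
```python
# Toy vocabulary (an assumption for illustration only).
VOCAB = {"un", "believable", "able", "the", "sky", "is", "blue"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest known vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1                      # shrink the candidate piece
        if end == start:                  # no known piece: fall back to [UNK]
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believable']
```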
2. What is LoRA and QLoRA?
LoRA (Low-Rank Adaptation):
- A fine-tuning technique that adapts pretrained language models using a small number of additional parameters.
- Instead of updating all the model parameters, LoRA introduces low-rank matrices to reduce the memory and computational cost of fine-tuning.
- Benefits: Efficient adaptation without retraining the entire model.
QLoRA (Quantized LoRA):
- A variant of LoRA that quantizes model weights (e.g., using 4-bit quantization).
- Significantly reduces memory footprint and computational requirements.
- Ideal for running large models on hardware with limited resources.
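A minimal sketch of the LoRA idea in PyTorch (assuming PyTorch is available; this is illustrative, not the reference implementation): the pretrained weight is frozen and only the low-rank matrices A and B are trained.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pretrained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (B A) x * scaling; the update B A has rank <= r
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(2, 768))   # only 2 * 768 * 8 extra trainable parameters
```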
3. What is Beam Search, and how does it differ from Greedy Decoding?
Beam Search:
- A decoding algorithm used in text generation that maintains multiple candidate sequences (beams) at each step.
- It explores multiple paths and selects the sequence with the highest cumulative probability.
- Often produces more coherent and contextually relevant output because it is less likely to commit to locally optimal tokens.
Greedy Decoding:
- Selects the token with the highest probability at each step without considering future possibilities.
- Faster but can lead to suboptimal outputs as it may miss better sequences due to its lack of foresight.
Key Difference:
Greedy decoding commits to the single best token at each step, while beam search keeps several candidate sequences in parallel, which usually yields better overall outputs (see the sketch below).
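A minimal sketch contrasting the two strategies over a hypothetical `next_token_logprobs(sequence)` function that returns `{token: log_probability}` for the next step (the toy distribution at the end is an assumption for illustration).
```python
def greedy_decode(next_token_logprobs, start, steps):
    seq = list(start)
    for _ in range(steps):
        probs = next_token_logprobs(seq)
        seq.append(max(probs, key=probs.get))    # pick the single best token
    return seq

def beam_search(next_token_logprobs, start, steps, beam_width=3):
    beams = [(list(start), 0.0)]                 # (sequence, cumulative log prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((seq + [token], score + logp))
        # keep only the top `beam_width` sequences by cumulative log probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])[0]

# toy next-token model over a tiny vocabulary (assumption)
toy = lambda seq: {"a": -0.5, "b": -1.0, "c": -2.0}
print(greedy_decode(toy, ["<s>"], 3))
print(beam_search(toy, ["<s>"], 3))
```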
4. Explain the Concept of Temperature in LLM Text Generation.
Temperature is a parameter that controls the randomness in language model outputs by adjusting the probability distribution of tokens.
Low temperature (<1):
- Makes outputs more deterministic and focused.
- Model prioritizes high-probability tokens, reducing diversity.
High temperature (>1):
- Adds variability by making token probabilities more even.
- Useful for creative or diverse outputs.
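A minimal sketch of temperature scaling applied to raw logits before sampling (the logit values are illustrative).
```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())           # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # sharper: mass concentrates on the top token
print(softmax_with_temperature(logits, 1.5))  # flatter: probabilities become more even
```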
5. What is Masked Language Modeling (MLM)?
- MLM involves randomly masking some tokens in the input text and tasking the model to predict them based on the surrounding context.
- Popularized by BERT (Bidirectional Encoder Representations from Transformers).
- Objective: Learn bidirectional dependencies in language by leveraging context from both directions.
- Example:
Input: “The [MASK] is blue.”
Prediction: “The sky is blue.”
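A minimal sketch of how MLM training inputs can be constructed: a fraction of tokens is replaced with a [MASK] symbol and the model is trained to recover the originals (this simplifies BERT's actual 80/10/10 replacement scheme).
```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)          # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)         # no loss computed at this position
    return masked, labels

print(mask_tokens(["the", "sky", "is", "blue"]))
```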
6. What are Sequence-to-Sequence Models?
Seq2Seq models transform an input sequence into a corresponding output sequence, often used in:
- Machine Translation: English → French.
- Text Summarization: Long text → Summary.
- Question Answering: Context → Answer.
- Components:
Encoder: Processes input and generates a representation.
Decoder: Converts the representation into the desired output.
Examples: Transformer models like T5, and earlier RNN-based encoder-decoder models with attention.
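A brief sketch of a seq2seq model in use, assuming the Hugging Face `transformers` library and the public `t5-small` checkpoint are available (the prompt prefix follows T5's convention).
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Encoder reads the prefixed input; decoder generates the output sequence.
inputs = tokenizer("translate English to French: The book is on the table.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```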
7. How do Autoregressive and Masked Models Differ?
Autoregressive Models (e.g., GPT):
- Generate text one token at a time, using previously generated tokens as context.
- Example: Predicting the next word in a sequence.
- Strength: Good for text generation.
Masked Models (e.g., BERT):
- Predict masked tokens within a sequence using bidirectional context.
- Strength: Effective for understanding tasks like classification or QA.
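A minimal sketch of the underlying attention-mask difference (assuming PyTorch): autoregressive models use a causal, lower-triangular mask so each token attends only to earlier positions, while masked models attend bidirectionally over the whole sequence.
```python
import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # GPT-style: attend to the past only
bidirectional_mask = torch.ones(seq_len, seq_len)        # BERT-style: attend to everything
print(causal_mask)
```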
8. What Role do Embeddings Play in LLMs?
Embeddings are vector representations of tokens that encode:
- Semantic information (meaning).
- Syntactic information (structure).
- They transform discrete tokens into continuous numerical values for model input.
- Types:
Word embeddings: Represent whole words.
Subword embeddings: Handle unknown/rare words.
- Examples: Word2Vec, GloVe, and learned embeddings in transformers.
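A minimal sketch of a learned embedding table in PyTorch: each token id maps to a trainable dense vector (the vocabulary and dimension sizes are illustrative).
```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 256
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[12, 845, 3]])        # a batch with one 3-token sequence
vectors = embedding(token_ids)                  # shape: (1, 3, 256)
print(vectors.shape)
```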
9. What is Next Sentence Prediction (NSP)?
NSP is a pretraining objective used to teach models to understand relationships between sentences.
- The model predicts whether a given sentence B naturally follows sentence A.
- Example (BERT):
- Input:
Sentence A: “I love books.”
Sentence B: “They expand my knowledge.”
Model Output: True.
10. What is the Difference Between Top-k and Nucleus Sampling?
Top-k Sampling:
- Restricts token choices to the top k tokens with the highest probabilities.
- Adds randomness but keeps outputs focused.
Nucleus Sampling (Top-p):
- Dynamically selects tokens using a cumulative probability threshold p.
- More adaptive as it adjusts to the probability distribution of each step.
Key Difference:
Top-k limits choices to a fixed number, while nucleus sampling is more flexible, selecting tokens based on cumulative probability.
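A minimal sketch of both filters over a toy probability distribution (the probability values are illustrative); in practice the filtered distribution is then sampled from.
```python
import numpy as np

def top_k_filter(probs, k):
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]                     # indices of the k most likely tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                  # renormalize

def top_p_filter(probs, p):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                   # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1       # smallest set with mass >= p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, k=2))   # only the two most likely tokens remain
print(top_p_filter(probs, p=0.8)) # tokens kept until cumulative mass reaches 0.8
```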
11. How Does Prompt Engineering Influence LLM Outputs?
Prompt engineering is the process of designing clear, specific, and goal-oriented input prompts to steer large language models (LLMs) toward desired outcomes. It is crucial in maximizing LLM performance, particularly in scenarios like:
- Zero-shot learning: Where the model performs tasks without prior examples, relying on prompt clarity to infer intent.
- Few-shot learning: Where minimal examples are provided in the prompt to guide the model.
Effective prompt engineering:
- Provides context to reduce ambiguity.
- Uses structured instructions to emphasize the task’s requirements.
- Exploits strategies like Chain-of-Thought (CoT) prompting to elicit logical reasoning.
For example, rephrasing “Summarize this text” to “Provide a 3-sentence summary focusing on the main theme” enhances output precision.
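A minimal sketch of assembling a structured few-shot prompt with an explicit instruction and a Chain-of-Thought cue; the reviews and labels are hypothetical.
```python
examples = [
    ("The film was a waste of two hours.", "negative"),
    ("An absolute delight from start to finish.", "positive"),
]

def build_prompt(review: str) -> str:
    shots = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return (
        "Classify the sentiment of each review as positive or negative.\n"
        "Think step by step before giving the final label.\n\n"
        f"{shots}\n\nReview: {review}\nSentiment:"
    )

print(build_prompt("The plot dragged, but the acting was superb."))
```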
12. How Can Catastrophic Forgetting Be Mitigated in LLMs?
Catastrophic forgetting occurs when LLMs lose previously learned knowledge while training on new tasks. Mitigation strategies include:
- Rehearsal Methods: Combining new and old data during retraining to reinforce previous knowledge.
- Elastic Weight Consolidation (EWC): Assigns importance weights to model parameters, penalizing changes to parameters critical for prior tasks.
- Modular Approaches: Introduces separate modules or adapters for new tasks, preserving the core model’s existing knowledge.
These techniques ensure the model retains past learning while adapting to new requirements.
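A minimal sketch of the EWC penalty in PyTorch (assuming PyTorch is available): changes to parameters that were important for the old task, as measured by a Fisher-information estimate, are penalized. The `old_params` and `fisher` dictionaries are assumptions, keyed by parameter name.
```python
import torch

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """lam * sum_i F_i * (theta_i - theta_old_i)^2 over all parameters."""
    loss = 0.0
    for name, param in model.named_parameters():
        loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam * loss

# Tiny demonstration: with unchanged parameters the penalty is zero.
model = torch.nn.Linear(4, 2)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder estimate
print(ewc_penalty(model, old_params, fisher))
# During training on the new task: total_loss = task_loss + ewc_penalty(...)
```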
13. What is Model Distillation, and How is it Applied to LLMs?
Model distillation involves transferring knowledge from a large, computationally intensive “teacher” model to a smaller “student” model. The student learns by mimicking the teacher’s:
- Soft predictions (probability distributions over classes).
- Intermediate representations (if available).
In LLMs, distillation reduces computational costs for deployment without a significant loss of accuracy. For example, a large model such as GPT-3 can be distilled into a smaller student model for faster inference.
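A minimal sketch of a standard distillation loss in PyTorch: the student matches the teacher's temperature-softened distribution (soft targets) and, weighted by alpha, the ground-truth labels (hard targets). The logits and labels below are random placeholders.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # conventional T^2 scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)               # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```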
14. How Do LLMs Handle Out-of-Vocabulary (OOV) Words?
LLMs address OOV words through subword tokenization techniques such as:
- Byte-Pair Encoding (BPE): Breaks words into smaller units like prefixes, suffixes, or character pairs.
- WordPiece: Similar to BPE, but merges are chosen to maximize the likelihood of the training data rather than raw pair frequency.
- Unigram Language Model: Selects the most probable subword sequence.
This ensures even unseen words can be represented as a combination of known subwords.
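A minimal sketch of one BPE training step: count adjacent symbol pairs across a tiny corpus and merge the most frequent one (the corpus is illustrative; real training repeats this for thousands of merges).
```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each represented as a list of symbols."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)
print(pair, merge_pair(corpus, pair))
```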
15. How Does Transformer Architecture Overcome Seq2Seq Model Challenges?
The Transformer architecture revolutionized sequence-to-sequence tasks by replacing recurrent mechanisms with:
- Self-attention: Processes all tokens in parallel, capturing global dependencies efficiently.
- Positional encoding: Adds token order information.
- Scalability: Handles long sequences without the vanishing gradient issues inherent in RNNs.
Transformers are faster and better suited for tasks requiring context-aware predictions over long input sequences.
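A minimal sketch of scaled dot-product self-attention, the mechanism that lets Transformers process all tokens in parallel (shapes and random weights are illustrative; multi-head attention is covered in question 20).
```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    Q, K, V = x @ w_q, x @ w_k, x @ w_v
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5   # scaled dot products
    weights = F.softmax(scores, dim=-1)                     # attention over all tokens
    return weights @ V

seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)   # (5, 16)
```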
16. What is Overfitting, and How Can It Be Prevented?
Overfitting occurs when a model memorizes training data patterns instead of generalizing to unseen data. Preventive measures include:
- Regularization: Penalizes large weights (e.g., L2 regularization).
- Dropout: Randomly disables neurons during training to promote robustness.
- Data augmentation: Expands training data diversity.
- Early stopping: Halts training when validation performance stagnates.
- Simpler models: Reducing model capacity limits the ability to memorize noise in the training data.
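A minimal sketch combining three of the measures above in PyTorch: dropout in the model, L2 regularization via weight decay in the optimizer, and early stopping on a hypothetical validation-loss curve.
```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(64, 2))
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # L2 penalty

val_losses = [0.9, 0.7, 0.65, 0.66, 0.67, 0.70]   # hypothetical validation curve
best_val, patience, bad_epochs = float("inf"), 2, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # stop once validation stops improving
            print(f"early stopping at epoch {epoch}")
            break
```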
17. What are Generative and Discriminative Models?
Generative Models: These learn the underlying probability distribution of the data to generate new, similar data samples. They model both the input features x and their corresponding labels y, enabling them to answer questions like “Given x, what is the probability of y?” and “What is a likely x for y?”.
- Applications: Text generation (GPT), image synthesis (GANs), speech generation.
- Example: GPT generates human-like text by predicting the next word based on previous words.
Discriminative Models: These focus on learning the decision boundary between different classes. They directly model P(y|x), the probability of a label y given input x, without learning the data distribution.
- Applications: Classification tasks (e.g., spam detection, sentiment analysis).
- Example: BERT classifies text into categories like positive or negative sentiment.
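A minimal sketch of the contrast using scikit-learn (assumed available): Gaussian Naive Bayes is a generative classifier that models class-conditional densities, while logistic regression is a discriminative classifier that models P(y|x) directly.
```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

generative = GaussianNB().fit(X, y)              # learns class-conditional densities
discriminative = LogisticRegression().fit(X, y)  # learns the decision boundary

print(generative.predict_proba(X[:1]), discriminative.predict_proba(X[:1]))
```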
18. How is GPT-4 Different from GPT-3?
- Multimodal Inputs: GPT-4 can process both text and images, making it versatile in tasks like visual question answering.
- Larger Context Window: GPT-4 can handle significantly longer input sequences than GPT-3, improving coherence in lengthy discussions or documents.
- Accuracy: Enhanced language understanding, logical reasoning, and factual correctness due to refined training and increased parameters.
- Multilingual Capabilities: Improved handling of a broader range of languages, making GPT-4 more accessible globally.
19. What are Positional Encodings in LLMs?
Transformers lack an inherent sense of sequence because they process input tokens in parallel. Positional encodings address this by introducing sequence information:
- How it Works: Positional encodings use mathematical functions (e.g., sine and cosine) to assign a unique encoding to each token based on its position in the sequence.
- Why it’s Important: It helps the model distinguish “The cat chased the dog” from “The dog chased the cat.”
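A minimal sketch of the sinusoidal positional encodings from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, with wavelengths that grow geometrically across dimensions.
```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                          # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd dimensions
    return pe

print(positional_encoding(seq_len=4, d_model=8).round(2))
```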
20. What is Multi-Head Attention?
Multi-head attention allows the model to focus on different aspects of the input sequence simultaneously:
- How it Works: Splits the attention mechanism into multiple “heads,” each learning a different representation of the input.
- Benefits:
- Captures diverse relationships, such as syntax and semantics.
- Enhances the model’s ability to understand complex dependencies.
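A brief sketch using PyTorch's built-in nn.MultiheadAttention (sizes are illustrative): the input is projected into several heads, each attends to the sequence independently, and their outputs are concatenated and re-projected.
```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 64, 8, 10
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)              # (batch, sequence, embedding)
output, weights = attention(x, x, x)              # self-attention: query = key = value
print(output.shape, weights.shape)                # (1, 10, 64), (1, 10, 10)
```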