Basic NLP Interview Questions and Answers
Q: What is the difference between a document, a corpus, and vocabulary in NLP?
- Document:
A document refers to a single piece of text, which can be as small as a sentence or paragraph, or as large as an entire article. In datasets, this often corresponds to one row in a textual dataset.
Example: A single news article in a collection of news data.
- Corpus:
A corpus is a collection of documents. It represents the entire dataset of text being analyzed or used for training an NLP model.
Example: A database of 1,000 articles collected for sentiment analysis.
- Vocabulary:
The vocabulary is the set of all unique words present in the corpus. It is used to build features for NLP models and often excludes stop words and rarely occurring terms for efficiency.
Example: For the phrase “I like apples and oranges,” the vocabulary might be {‘like’, ‘apples’, ‘oranges’} once stop words such as “I” and “and” are excluded.
Q: Explain tokenization and its types.
Tokenization:
Tokenization is the process of splitting text into smaller, manageable units called tokens. These tokens can represent sentences, words, or sub-word units, depending on the level of tokenization.
Types of Tokenization:
- Sentence Tokenization:
Splits a text into individual sentences.
Example:
Text: “I love NLP. It’s amazing.”
Tokens: [“I love NLP.”, “It’s amazing.”]
- Word Tokenization:
Splits a sentence into words or terms.
Example:
Sentence: “I love NLP.”
Tokens: [“I”, “love”, “NLP”]
- Sub-word Tokenization:
Breaks down words into smaller units like n-grams or Byte Pair Encoding (BPE) tokens. Useful for handling rare or unseen words.
Example:
Word: “unhappiness” → Sub-word Tokens: [“un”, “happi”, “ness”]
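A minimal sketch of sentence and word tokenization, assuming the NLTK library and its Punkt tokenizer data are installed (sub-word tokenizers such as BPE/WordPiece are illustrated later in the OOV section):

```python
# Sentence and word tokenization with NLTK (assumes `pip install nltk`).
import nltk
nltk.download("punkt", quiet=True)  # Punkt tokenizer models ("punkt_tab" on newer NLTK releases)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "I love NLP. It's amazing."
print(sent_tokenize(text))            # ['I love NLP.', "It's amazing."]
print(word_tokenize("I love NLP."))   # ['I', 'love', 'NLP', '.']
```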
Q: What are stop words? Why are they removed in NLP?
Stop Words
Stop words are commonly occurring words in a language, such as “is,” “the,” “and,” and “on,” which generally do not provide significant information for NLP tasks.
Why Remove Stop Words?
- Reduces noise: These words can overshadow meaningful patterns in text analysis.
- Reduces dimensionality: Removing them decreases the size of the vocabulary and simplifies computations.
- Example:
Text: “The cat is on the mat.”
After removing stop words: “cat mat”
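A minimal sketch of stop-word removal, assuming NLTK and its built-in English stop-word list:

```python
# Removing stop words with NLTK.
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The cat is on the mat.".lower())
print([t for t in tokens if t.isalpha() and t not in stop_words])  # ['cat', 'mat']
```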
Q: What is the difference between stemming and lemmatization?
Stemming:
Stemming involves chopping off prefixes or suffixes to reduce words to their root form. It’s a rule-based and fast process but may produce non-meaningful root words.
Example:
Words: “running,” “runs” → Stem: “run” (a purely rule-based stemmer cannot map the irregular form “ran” to “run”)
Lemmatization:
Lemmatization converts a word to its base or dictionary form (lemma) using vocabulary and grammatical rules. It is more accurate but computationally intensive.
Example:
Words: “running,” “ran” → Lemma: “run”
- Key Difference: Stemming is a heuristic process (less accurate), while lemmatization uses linguistic rules (more accurate).
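A short sketch contrasting the two, assuming NLTK’s PorterStemmer and WordNetLemmatizer (one rule-based stemmer and one dictionary-based lemmatizer among several available):

```python
# Stemming vs. lemmatization with NLTK.
import nltk
nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "runner", "ran"]:
    # Stems may be truncated non-words; the lemmatizer needs a POS hint ("v" = verb) to resolve forms like "ran".
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos="v"))
```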
Q: What is POS tagging? Why is it important?
POS (Part of Speech) Tagging:
Assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence.
- Example:
Sentence: “The quick brown fox jumps over the lazy dog.”
Tags: [“The” (Determiner), “quick” (Adjective), “fox” (Noun), “jumps” (Verb), …]
Importance of POS Tagging:
- Helps in understanding sentence structure and syntax.
- Aids in downstream NLP tasks like named entity recognition, syntactic parsing, and machine translation.
- Provides context for ambiguous words.
Example: “Book” can be a noun or verb. POS tagging disambiguates it based on context.
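A minimal POS-tagging sketch using NLTK’s pretrained perceptron tagger (tag names follow the Penn Treebank convention):

```python
# POS tagging with NLTK.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # "averaged_perceptron_tagger_eng" on newer NLTK releases

from nltk import pos_tag, word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```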
Q: Explain the differences between Bag of Words (BOW) and TF-IDF.
1. Bag of Words (BOW)
Definition:
Bag of Words is a simple representation of text where a document is converted into a vector of word frequencies. It disregards grammar, word order, and semantics but focuses solely on the frequency of each word in the document.
How it works:
- Create a vocabulary of all unique words in the corpus.
- For each document, count the frequency of each word from the vocabulary.
- Represent the document as a vector of word counts.
Advantages:
- Simple and computationally efficient for smaller datasets.
- Effective for tasks where word occurrence is more important than context (e.g., spam detection).
Limitations:
- Ignores semantics and word order, meaning it treats “I love NLP” and “NLP love I” as identical.
- Large vocabularies result in sparse vectors and higher memory usage.
2. TF-IDF (Term Frequency-Inverse Document Frequency)
Definition:
TF-IDF represents a document as a vector of weighted word scores. Each word’s weight combines how often it appears in the document (term frequency) with how rare it is across the corpus (inverse document frequency), so very common words are down-weighted and distinctive words are emphasized.
How it works:
- TF(t, d): frequency of term t in document d.
- IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
- TF-IDF(t, d) = TF(t, d) × IDF(t); each document becomes a vector of these weights.
Advantages:
- Highlights informative, discriminative words (e.g., “election” in a news corpus) while suppressing ubiquitous ones (e.g., “the”).
Limitations:
- Like BOW, it ignores word order and semantics and still produces sparse, high-dimensional vectors.
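A minimal comparison of the two representations, assuming scikit-learn (the toy corpus below is only for illustration):

```python
# BOW vs. TF-IDF vectors for the same corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are pets"]

bow = CountVectorizer()
tfidf = TfidfVectorizer()

print(bow.fit_transform(corpus).toarray())     # raw word counts per document
print(tfidf.fit_transform(corpus).toarray())   # weighted scores: words shared by many documents get lower weight
print(bow.get_feature_names_out())             # learned vocabulary (column order of both matrices)
```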
Q: What are word embeddings? How do Word2Vec and GloVe differ?
1. Word Embeddings
Definition:
Word embeddings are dense vector representations of words where similar words have similar vector representations. They capture semantic relationships (e.g., “king” — “man” + “woman” ≈ “queen”).
Why Use Embeddings?
- Unlike BOW and TF-IDF, embeddings capture the meaning of words based on their context in a corpus.
- They are compact (dense vectors) and encode relationships between words.
2. Word2Vec
Method:
Word2Vec generates embeddings by predicting a word based on its context or vice versa, using two architectures:
CBOW (Continuous Bag of Words): Predicts the target word from surrounding context words.
Example: Given “The __ is barking,” predict “dog.”
Skip-Gram: Predicts surrounding words from a target word.
Example: Given “dog,” predict words like “The,” “is,” and “barking.”
Characteristics:
- Focuses on local context windows.
- Learns word embeddings using neural networks.
3. GloVe (Global Vectors for Word Representation)
Method:
GloVe generates embeddings by factorizing a co-occurrence matrix of word pairs. It learns embeddings based on how often words co-occur in the entire corpus.
Characteristics:
- Captures both local (context window) and global (entire corpus) information.
- Optimizes word relationships explicitly using co-occurrence statistics.
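A small sketch of training Word2Vec with gensim; the corpus is a toy example, and GloVe vectors are typically downloaded pretrained rather than trained from scratch:

```python
# Training a tiny Word2Vec model with gensim (gensim 4.x API; real embeddings need large corpora).
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "is", "barking"],
    ["the", "cat", "is", "sleeping"],
    ["dogs", "and", "cats", "are", "animals"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: Skip-Gram, sg=0: CBOW
print(model.wv["dog"][:5])           # first 5 dimensions of the dense vector for "dog"
print(model.wv.most_similar("dog"))  # nearest neighbours (not meaningful on such a tiny corpus)
```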
Q: What advantages does BERT provide over Word2Vec and GloVe?
1. BERT (Bidirectional Encoder Representations from Transformers)
- Definition:
BERT is a transformer-based model that generates contextual word embeddings by analyzing text bidirectionally (both left-to-right and right-to-left).
Advantages of BERT:
- Contextual Understanding:
Unlike Word2Vec and GloVe, which produce static embeddings (same representation for a word in all contexts), BERT generates dynamic embeddings based on the surrounding text.
Example: The word “bank” in “river bank” vs. “financial bank” has different embeddings with BERT.
- Bidirectional Context:
BERT attends to both the preceding and the following words when building a word’s representation, rather than reading the text in a single direction as unidirectional language models do.
- Pretrained Tasks:
- Masked Language Modeling (MLM): Predicts missing words in a sentence.
Example: “The ___ is barking” → Predict “dog.”
- Next Sentence Prediction (NSP): Predicts whether two sentences are sequential.
- Supports Complex NLP Tasks:
- Text classification, named entity recognition, question answering, and more.
- Unlike Word2Vec and GloVe, BERT can directly handle sentence-level and document-level tasks.
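A sketch of the “bank” example, assuming the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint:

```python
# Contextual embeddings: the same word gets different vectors in different sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's hidden state for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, 768)
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

v_river = bank_vector("He sat on the river bank.")
v_money = bank_vector("He deposited money in the bank.")
print(torch.cosine_similarity(v_river, v_money, dim=0))  # clearly below 1.0: same word, different embeddings
```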
Q: What is Named Entity Recognition (NER)?
Definition:
Named Entity Recognition (NER) is an NLP technique used to locate and classify named entities in a body of text into predefined categories. These categories might include:
- Persons: Names of individuals (e.g., “Elon Musk”).
- Organizations: Companies, institutions, etc. (e.g., “Apple” as a company).
- Locations: Geographical names (e.g., “New York”).
- Dates: Specific time references (e.g., “January 1, 2023”).
- Others: Monetary values, percentages, products, and more.
How it Works:
- Tokenization: The text is broken into tokens (words or phrases).
- Feature Extraction: Contextual and linguistic features like word position, capitalization, or parts of speech are used.
- Model Prediction: Machine learning models (like Conditional Random Fields, LSTMs, or Transformer-based models like BERT) are trained to recognize and classify entities.
Example:
Text: “Apple Inc. was founded by Steve Jobs in Cupertino, California.”
NER Output:
- “Apple Inc.” → Organization
- “Steve Jobs” → Person
- “Cupertino, California” → Location
Applications:
- Information Retrieval: Extracting key entities from documents or web pages.
- Customer Feedback Analysis: Identifying brands, products, or services mentioned.
- Medical NLP: Recognizing symptoms, drugs, and diseases in medical records.
- Question Answering Systems: Locating entities relevant to user queries.
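A minimal NER sketch for the example sentence above, assuming spaCy and its small English model (en_core_web_sm) are installed:

```python
# Named Entity Recognition with spaCy.
# Setup (assumed): pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. "Apple Inc." -> ORG, "Steve Jobs" -> PERSON, "Cupertino" -> GPE, "California" -> GPE
```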
Q: Describe how you would convert text to numerical data for machine learning.
Why Convert Text to Numbers?
Machine learning models work with numerical data. Text data must be transformed into numerical representations to serve as input for these models.
Techniques for Text-to-Numerical Conversion:
1. Bag of Words (BOW):
Description: Represents text as a vector of word counts or frequencies. Each document is represented by the frequency of each word in a fixed vocabulary.
How it Works:
- Create a vocabulary of unique words across the corpus.
- Count occurrences of each word in each document.
Example:
- Corpus: [“I love NLP”, “I love AI”]
- Vocabulary: [“I”, “love”, “NLP”, “AI”]
- BOW Representation:
Document 1: [1, 1, 1, 0]
Document 2: [1, 1, 0, 1] (a runnable sketch reproducing this corpus appears after this list of techniques)
2. TF-IDF (Term Frequency-Inverse Document Frequency):
Description: Weights words based on their importance. Common words (like “the”) are down-weighted, and rare but significant words are emphasized.
Formula:
- TF: Frequency of a word in a document.
- IDF: Logarithm of (total number of documents ÷ number of documents containing the word).
- Example: Words that appear in every document (e.g., “the”) have low IDF, while rare words (e.g., “quantum”) have high IDF.
3. Word Embeddings (Word2Vec, GloVe):
Description: Dense vector representations of words that capture semantic meanings. Words with similar meanings have similar vectors.
How it Works:
Word2Vec: Predicts a word from its context (CBOW) or predicts context from a word (Skip-Gram).
GloVe: Learns embeddings using word co-occurrence statistics.
Advantages: Captures relationships like synonyms and analogies.
4. Transformer-Based Embeddings (BERT):
Description: Generates contextual word representations based on the surrounding text. Unlike Word2Vec, embeddings vary depending on context.
Example:
- The word “bank” in:
- “He went to the river bank” → Context: Nature
- “He deposited money in the bank” → Context: Finance
- Use Cases: Advanced NLP tasks like sentiment analysis, question answering, and named entity recognition.
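A sketch reproducing the hand-built BOW vectors from technique 1, assuming scikit-learn (the two non-default settings keep the single-character token “I” and the original casing):

```python
# Bag of Words for the toy corpus ["I love NLP", "I love AI"] with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "I love AI"]

# Default settings lowercase the text and drop one-character tokens, so we override them to match the example.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['AI' 'I' 'NLP' 'love'] -- columns are sorted, so order differs
print(X.toarray())                         # [[0 1 1 1]
                                           #  [1 1 0 1]]
```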
Q: What is the purpose of text preprocessing in NLP?
Definition:
Text preprocessing refers to a series of steps applied to raw text to prepare it for analysis and machine learning models. Raw text often contains noise, inconsistencies, and irregularities that make it unsuitable for direct processing.
Purpose:
- Cleaning Data:
Raw text data may include irrelevant information like HTML tags, special characters, and noise, which need to be removed for meaningful analysis.
- Standardization:
Preprocessing ensures uniformity in data, like converting all text to lowercase for case-insensitive analysis.
- Reducing Dimensionality:
By removing stop words or rare terms, preprocessing reduces the size of the vocabulary, making models more efficient.
- Improving Model Performance:
Clean, consistent data allows models to focus on meaningful patterns, improving predictive accuracy.
Common Steps in Text Preprocessing:
- Lowercasing:
Converts all text to lowercase to avoid treating “Apple” and “apple” as different words.
Example:
Input: “Hello World!” → Output: “hello world!”
- Removing Punctuation, Special Characters, and Stop Words:
Eliminates irrelevant elements to reduce noise.
Example:
Text: “He’s a good boy, isn’t he?” → Output (after removing punctuation and stop words): “good boy”
- Expanding Contractions:
Converts words like “don’t” to “do not” for consistency during tokenization and analysis.
Example: “I’ll” → “I will.”
- Correcting Text (Spelling Correction):
Fixes typos and spelling errors using libraries like pyspellchecker or SymSpell.
Example: “Thsi is a tst” → “This is a test.”
- Tokenization:
Splits text into individual units (words, sentences, or sub-words).
Example:
Text: “I love NLP.” → Tokens: [“I”, “love”, “NLP”]
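A compact preprocessing sketch combining several of these steps, assuming NLTK for stop words and tokenization (contraction expansion and spell correction are left out for brevity):

```python
# Lowercasing, HTML/punctuation stripping, tokenization, and stop-word removal.
import re
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    text = text.lower()                    # standardization
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, special characters
    return [t for t in word_tokenize(text) if t not in STOP_WORDS]

print(preprocess("<p>He's a good boy, isn't he?</p>"))  # e.g. ['good', 'boy']
```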
Q: How do you handle out-of-vocabulary (OOV) words?
Definition:
Out-of-vocabulary (OOV) words are terms that are not part of the model’s training vocabulary. This can happen with rare words, typos, or newly coined terms.
Why is Handling OOV Words Important?
- Ensures models can still make predictions even when encountering unseen words.
- Prevents breakdowns in tasks like text classification or translation.
Techniques to Handle OOV Words:
Using a Special Token (<UNK>):
- Replace OOV words with a placeholder token <UNK> during both training and inference.
- Helps the model handle unexpected input gracefully.
Example:
Text: “The zephyr was gentle.”
Vocabulary: {“the”, “was”, “gentle”}
Result: “The <UNK> was gentle.”
Sub-word Tokenization:
- Breaks words into smaller units to handle unseen terms.
- Techniques:
- Byte Pair Encoding (BPE): Frequently used in GPT models.
Example: “unseen” → [“un”, “seen”]
- WordPiece: Used in BERT.
Example: “running” → [“run”, “##ning”] (## indicates sub-word).
Character-level Embeddings:
- Represents words as sequences of characters, enabling the model to handle rare or misspelled words by learning patterns at the character level.
- Example:
Word: “unbelievable” → [“u”, “n”, “b”, “e”, “l”, “i”, …]
Contextual Models:
- Models like BERT and GPT-3 can infer the meaning of OOV words based on surrounding context, even if the word itself is unseen.
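A quick look at how a WordPiece tokenizer copes with rare or unseen words, assuming the Hugging Face transformers package and the bert-base-uncased vocabulary (actual splits depend on the learned vocabulary, so they may differ from the hand-written examples above):

```python
# Sub-word tokenization: unseen words are broken into known pieces instead of becoming <UNK>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["running", "unhappiness", "zephyrlike"]:
    print(word, "->", tokenizer.tokenize(word))  # "##" marks a word-internal piece
```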
Q: What are the challenges with sparse matrices in Bag of Words (BOW)?
Bag of Words (BOW) creates a matrix where each document is represented by a vector, and each dimension corresponds to a word in the vocabulary. While this is simple and intuitive, it has significant challenges:
1. High Memory Consumption:
- Why: The vocabulary size grows with the corpus, and most words do not appear in every document. This leads to sparse matrices with many zero entries.
- Impact: These matrices consume more memory as the number of documents and unique words increases. Example: A vocabulary of 50,000 words for a corpus of 1,000 documents creates a 1,000 × 50,000 document-term matrix that is largely empty but still occupies significant memory.
2. Poor Scalability for Large Corpora:
- Why: The BOW approach scales poorly with larger datasets, as adding more documents increases both the vocabulary size and matrix dimensions.
- Impact: The computational resources required for storage and processing grow rapidly with corpus size, making it impractical for very large datasets.
3. Computational Inefficiency Due to Sparsity:
- Why: Sparse matrices introduce inefficiencies in mathematical operations like matrix multiplication, which are common in machine learning models.
- Impact: Operating on the dense form wastes time multiplying zeros, while specialized sparse formats avoid this at the cost of extra indexing overhead and limited support in some algorithms. Example: Training a model on a BOW matrix usually relies on sparse matrix libraries (such as SciPy’s CSR format) to stay tractable.
4. Loss of Semantic Information:
- Why: BOW ignores the order of words and the relationships between them.
- Impact: This leads to a loss of contextual meaning, limiting the model’s ability to capture nuances in language. Example: “The cat chased the dog” and “The dog chased the cat” are treated identically in BOW.
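A small sketch showing both issues at once, assuming scikit-learn (which returns a SciPy sparse matrix precisely because most entries are zero):

```python
# Sparsity and loss of word order in a BOW matrix.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat chased the dog", "the dog chased the cat", "a bird sang"]
X = CountVectorizer().fit_transform(corpus)   # SciPy CSR sparse matrix

print(X.shape)                                # (documents, vocabulary size)
print(X.nnz, "non-zero entries out of", X.shape[0] * X.shape[1])
print(X.toarray())                            # rows 1 and 2 are identical: word order is lost
```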
Q: When would you prefer TF-IDF over BOW?
TF-IDF is a more sophisticated technique than BOW, and it is preferred in certain scenarios where word relevance and importance matter:
1. When Important Words Need Emphasis:
- Why: Unlike BOW, TF-IDF assigns weights to words based on their frequency in a document and rarity across the corpus. This helps emphasize meaningful words.
- Use Case: Tasks like text classification or keyword extraction where identifying key terms is critical. Example: In a news dataset, TF-IDF ensures that unique words like “election” or “policy” are weighted more heavily than generic terms like “the” or “is.”
2. When Common Words Need to Be Down-Weighted:
- Why: Words that appear frequently across documents (e.g., “the,” “and”) contribute less useful information. TF-IDF reduces their impact by down-weighting them.
- Use Case: Search engines use TF-IDF to prioritize documents based on the relevance of search terms. Example: A search for “AI advancements” will highlight documents that contain these terms prominently rather than documents with a high count of generic words.
3. When Dimensionality Needs to Be Reduced:
- Why: TF-IDF naturally reduces the impact of non-informative terms, which indirectly reduces dimensionality by emphasizing important words.
- Use Case: Efficient storage and faster processing for medium-sized datasets. Example: In customer reviews analysis, TF-IDF helps focus on product-related terms rather than general expressions.
Q: What are the main differences between GloVe and BERT embeddings?
Both GloVe and BERT generate vector representations of words, but their approaches and capabilities differ significantly:
- Training signal: GloVe factorizes a global word co-occurrence matrix; BERT is a transformer trained with masked language modeling (and next sentence prediction).
- Context: GloVe produces one static vector per word, identical in every sentence; BERT produces contextual vectors that change with the surrounding text.
- Granularity: GloVe operates on whole words and cannot represent unseen words; BERT uses WordPiece sub-words, so rare words are still covered.
- Usage: GloVe vectors are typically used as fixed input features; BERT is usually fine-tuned end-to-end for the downstream task.
Q: How would you build a text classification pipeline in NLP?
A text classification pipeline involves several steps, each contributing to the overall process of transforming raw text data into meaningful predictions. Below is a detailed explanation:
1. Text Preprocessing
Raw text data is often noisy and inconsistent, making preprocessing a crucial step.
- Cleaning:
Remove irrelevant elements like special characters, numbers, and HTML tags.
Example:
Input: “<p>Hello, World! 123</p>” → Output: “hello world”
- Tokenization:
Split the text into individual words or tokens.
Example:
Sentence: “Text classification is fun!” → Tokens: [“Text”, “classification”, “is”, “fun”]
- Stopword Removal:
Remove common words like “is” and “the” that don’t add value to the analysis.
Example:
Tokens: [“Text”, “classification”, “is”, “fun”] → [“Text”, “classification”, “fun”]
- Lemmatization/Stemming:
Reduce words to their base or root form.
Example:
“running”, “ran” → “run”
2. Feature Extraction
Text data must be transformed into numerical format for model training.
TF-IDF:
Represents text as weighted vectors based on term frequency and inverse document frequency.
Example:
Document: “I love NLP and AI.”
Vector: {“I”: 0.1, “love”: 0.3, “NLP”: 0.5, …}
Word Embeddings:
Dense vector representations of words that capture their semantic meanings.
Techniques: Word2Vec, GloVe, or contextual embeddings like BERT.
Custom Features:
Include domain-specific features such as word length, sentiment scores, or specific keywords.
3. Model Building
Choose and train a machine learning or deep learning model.
Traditional Machine Learning Models:
- Support Vector Machines (SVM): Good for small datasets with linear decision boundaries.
- Naive Bayes: Simple and effective for text classification, especially with TF-IDF features.
Neural Networks:
- Recurrent Neural Networks (RNNs): Good for sequential data.
- Convolutional Neural Networks (CNNs): Effective for capturing patterns in text.
- Transformer Models (e.g., BERT): Best for context-rich and complex text classification tasks.
4. Evaluation
Evaluate the model using appropriate metrics.
Metrics:
- Accuracy: Proportion of correctly classified instances.
- Precision: Ratio of true positives to predicted positives.
- Recall: Ratio of true positives to actual positives.
- F1-Score: Harmonic mean of precision and recall.
Cross-Validation:
Use k-fold cross-validation to ensure the model performs well across unseen data.
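A compact end-to-end sketch of such a pipeline with scikit-learn, using TF-IDF features and Naive Bayes on a made-up toy dataset (a real task would substitute a proper labelled corpus and possibly a stronger model):

```python
# Text classification pipeline: vectorization, model, and cross-validated evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here", "cheap meds online",
    "meeting at 10 tomorrow", "please review the attached report", "lunch with the team today",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam (toy labels)

pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
print(cross_val_score(pipeline, texts, labels, cv=3, scoring="f1").mean())  # k-fold evaluation

pipeline.fit(texts, labels)
print(pipeline.predict(["free prize offer", "see you at the meeting"]))  # e.g. [1 0]
```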
Q: Explain topic modeling in NLP.
Definition:
Topic modeling is an unsupervised learning technique used to discover latent topics within a collection of documents. Each document is represented as a mixture of topics, and each topic is a distribution over words.
Techniques:
1. Latent Dirichlet Allocation (LDA):
How it Works:
LDA assumes that each document is a mixture of topics, and each topic is a mixture of words. It uses probabilistic modeling to assign topics to documents.
- Words are generated based on topic probabilities.
- Topics are distributed probabilistically across documents.
Example:
Documents:
- “AI is transforming healthcare.”
- “Doctors use AI for diagnosis.”
Topics (output):
- Topic 1: [“AI”, “machine”, “technology”]
- Topic 2: [“healthcare”, “doctor”, “treatment”]
2. Non-Negative Matrix Factorization (NMF):
How it Works:
Factorizes the document-word matrix into two smaller matrices:
- Document-topic matrix.
- Topic-word matrix.
Difference from LDA:
NMF is deterministic, whereas LDA is probabilistic.
Applications:
- Content recommendation systems (e.g., grouping similar articles).
- Customer feedback analysis (e.g., identifying common themes in reviews).
- Trend analysis in research papers or news articles.
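A minimal LDA sketch with scikit-learn on a toy corpus (real topic models need many more documents to produce coherent topics):

```python
# Topic modeling with Latent Dirichlet Allocation.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "AI is transforming healthcare",
    "Doctors use AI for diagnosis",
    "Machine learning models improve medical treatment",
    "Hospitals adopt new technology for patient care",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):              # topic-word weight matrix
    top_words = [words[j] for j in topic.argsort()[-3:][::-1]]
    print(f"Topic {i}: {top_words}")                     # 3 highest-weighted words per topic
```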