Pandas Interview Questions and Answers

7 min read · Jan 2, 2025

1. What is Pandas in Python, and why is it used?

Question: What is the primary purpose of the Pandas library in Python, and what are its core data structures?

Answer:
Pandas is an open-source Python library used for data manipulation and analysis. It provides easy-to-use data structures and tools for handling structured data, such as tabular data in spreadsheets or SQL tables.

Core Data Structures:

Series: A one-dimensional labeled array capable of holding any data type.
DataFrame: A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet or SQL table.

Pandas is widely used for tasks like data cleaning, transformation, aggregation, and merging.

2. Explain the difference between a Pandas Series and a DataFrame.

Question: How do Series and DataFrames differ in Pandas?

Series:

One-dimensional labeled array.
Homogeneous data: the whole Series has a single dtype (mixed values fall back to the generic object dtype).
Example: A single column from a DataFrame.

DataFrame:

Two-dimensional data structure.
Heterogeneous data (each column can have a different type).
Example: A spreadsheet with labeled rows and columns.
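
To make the contrast concrete, here is a minimal sketch that builds one of each. The names, values, and index labels are made up for illustration.

```python
import pandas as pd

# A Series: one dimension, one dtype, labeled by an index.
ages = pd.Series([25, 32, 47], index=["alice", "bob", "carol"], name="age")

# A DataFrame: two dimensions; each column is a Series with its own dtype.
people = pd.DataFrame(
    {"age": [25, 32, 47], "city": ["Oslo", "Lima", "Pune"]},
    index=["alice", "bob", "carol"],
)

print(ages.ndim, people.ndim)   # number of dimensions of each structure
print(people.dtypes)            # one dtype per column
print(type(people["age"]))      # selecting a single column returns a Series
```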

3. How do you handle missing data in Pandas?

Question: What methods are available in Pandas to deal with missing data?

Identify Missing Data: Use .isnull() to check for missing values.
Drop Missing Values: Use .dropna() to remove rows or columns with missing data.
Fill Missing Values: Use .fillna() to replace missing data with a specified value, such as a constant or the column's mean or median.
Interpolate: Use .interpolate() for linear or other types of interpolation.
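
The four methods above can be sketched on a small example. The DataFrame and its column names (temp, city) are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 28.0],
    "city": ["A", "B", None, "D", "E"],
})

# Identify: count missing values in each column.
missing_per_column = df.isnull().sum()

# Drop: keep only the rows that have no missing values at all.
dropped = df.dropna()

# Fill: replace missing temperatures with the column mean.
filled = df["temp"].fillna(df["temp"].mean())

# Interpolate: estimate missing temperatures linearly from their neighbors.
interpolated = df["temp"].interpolate()

print(missing_per_column)
print(interpolated.tolist())
```

Which strategy is appropriate depends on the data: dropping rows discards information, while filling or interpolating adds estimated values.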

4. How can you merge or concatenate DataFrames in Pandas?

Question: Describe how to combine multiple DataFrames in Pandas.

Answer:

Concatenation: Use pd.concat() to combine DataFrames either along rows (axis=0) or columns (axis=1).
Merging: Use pd.merge() to combine DataFrames based on common columns or indices.
Join: Use .join() for merging on indices.
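
A minimal sketch of all three approaches follows. The customers and orders tables and their columns are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [50, 20, 75]})

# Concatenate: stack frames that share the same columns along the rows.
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Di"]})
all_customers = pd.concat([customers, more_customers], axis=0, ignore_index=True)

# Merge: SQL-style join on a shared column.
merged = pd.merge(customers, orders, on="customer_id", how="inner")

# Join: align two frames on their indices.
totals = orders.groupby("customer_id")[["amount"]].sum()
joined = customers.set_index("customer_id").join(totals, how="left")

print(merged)
print(joined)
```

The how argument ("inner", "left", "right", "outer") controls which keys survive, much like the join types in SQL.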

5. How do you filter rows in a DataFrame?

Question: What techniques are available in Pandas to filter data based on conditions?
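
Common techniques include boolean masks, .isin(), and .query(). The sketch below shows each on a made-up DataFrame; the column names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy", "Di"],
    "age": [23, 35, 41, 29],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# Boolean mask: keep rows where the condition is True.
over_30 = df[df["age"] > 30]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
oslo_over_30 = df[(df["age"] > 30) & (df["city"] == "Oslo")]

# Membership test against a list of allowed values.
in_cities = df[df["city"].isin(["Lima", "Pune"])]

# The same compound filter written as a query string.
queried = df.query("age > 30 and city == 'Oslo'")

print(over_30)
print(queried)
```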

6. Explain the `groupby` operation in Pandas.

Question: What is the purpose of the groupby function in Pandas?

Answer:
The groupby() method splits a DataFrame into groups based on the values in one or more columns, then applies a function to each group, such as sum(), mean(), or a custom aggregation, and combines the results.
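
As a rough illustration, the sketch below groups a hypothetical sales table by region. The column names and the derived names total_units and avg_price are invented for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "S", "N", "S", "N"],
    "units": [10, 5, 7, 3, 1],
    "price": [2.0, 4.0, 2.0, 4.0, 3.0],
})

# One built-in aggregate on one column.
units_per_region = sales.groupby("region")["units"].sum()

# Several named aggregates at once.
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    avg_price=("price", "mean"),
)

# A custom function: the spread between the largest and smallest order.
unit_range = sales.groupby("region")["units"].agg(lambda s: s.max() - s.min())

print(units_per_region)
print(summary)
```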

7. What are some common methods to summarize data in Pandas?

Question: How can you generate summary statistics for a DataFrame?

Answer:

Use .describe() to get statistics like count, mean, std, min, max, etc.
Use .info() to see the structure of the DataFrame: column names, dtypes, non-null counts, and memory usage.
Use .value_counts() to count occurrences of unique values in a Series.
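
A short sketch of the three calls, using a made-up table of scores and grades:

```python
import pandas as pd

df = pd.DataFrame({"score": [70, 85, 85, 90], "grade": ["C", "B", "B", "A"]})

# Summary statistics; numeric columns only by default.
stats = df.describe()

# Structure overview; prints directly and returns None.
df.info()

# Frequency of each unique value in a single column.
grade_counts = df["grade"].value_counts()

print(stats)
print(grade_counts)
```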

8. How do you sort data in Pandas?

Question: How can you sort rows or columns in Pandas?

Answer:

Use .sort_values() to sort rows by the values in one or more columns.
Use .sort_index() to sort by index labels (rows by default, columns with axis=1).
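
Both methods return a new, sorted object by default. A minimal sketch on an invented DataFrame with a shuffled index:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Cy", "Ann", "Ben"], "age": [41, 23, 35]},
    index=[2, 0, 1],
)

# Sort rows by a column, largest first.
by_age_desc = df.sort_values("age", ascending=False)

# Sort rows by their index labels.
by_index = df.sort_index()

# Sort the columns themselves by label.
by_column_label = df.sort_index(axis=1)

print(by_age_desc)
print(by_index)
```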

9. What are common file formats Pandas can read from or write to?

Question: Which file formats are supported by Pandas for reading and writing data?

Answer:

Reading:

CSV: pd.read_csv()
Excel: pd.read_excel()
SQL: pd.read_sql()
JSON: pd.read_json()

Writing:

CSV: .to_csv()
Excel: .to_excel()
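
The sketch below round-trips a tiny table through CSV and JSON. It uses in-memory buffers instead of files so that it runs anywhere; with real data you would pass a file path instead. The sample data is made up.

```python
import io
import pandas as pd

csv_text = "name,age\nAnn,23\nBen,35\n"

# Read CSV from a file-like object (a path such as "data.csv" also works).
df = pd.read_csv(io.StringIO(csv_text))

# Write CSV back out, without the index column.
out = io.StringIO()
df.to_csv(out, index=False)
round_trip = out.getvalue()

# JSON works the same way.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(round_trip)
print(from_json)
```

Note that pd.read_excel() and .to_excel() rely on an optional engine package such as openpyxl being installed.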

10. What is the purpose of the `.apply()` method in Pandas?

Question: How does the .apply() method enhance functionality in Pandas?

Answer:
The .apply() method runs a function you supply over your data. On a Series it is called once per element; on a DataFrame it is called once per column (axis=0, the default) or once per row (axis=1).
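
A minimal sketch of the three cases, using a made-up DataFrame of widths and heights:

```python
import pandas as pd

df = pd.DataFrame({"width": [2, 3, 4], "height": [5, 6, 7]})

# Series.apply: the function receives one element at a time.
doubled = df["width"].apply(lambda w: w * 2)

# DataFrame.apply with axis=0 (default): the function receives each column.
col_range = df.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row.
area = df.apply(lambda row: row["width"] * row["height"], axis=1)

print(doubled.tolist())
print(area.tolist())
```

For simple arithmetic like this, a direct expression such as df["width"] * df["height"] is usually faster; .apply() is most useful when the logic cannot be expressed with built-in operations.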

11. How can you select specific rows and columns in Pandas?

Question: What are the ways to index and slice data in Pandas?
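
Common options are square brackets for columns, slices for rows, and .loc or .iloc for selecting rows and columns together. A minimal sketch on an invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy", "Di"],
    "age": [23, 35, 41, 29],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# One column name returns a Series.
one_column = df["age"]

# A list of column names returns a DataFrame.
two_columns = df[["name", "city"]]

# A slice inside square brackets selects rows by position.
first_two_rows = df[0:2]

# .loc selects rows (here by condition) and columns (by label) in one step.
subset = df.loc[df["age"] > 25, ["name", "age"]]

print(two_columns)
print(subset)
```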

12. Explain the difference between `.iloc` and `.loc`.

Question: How does .iloc differ from .loc in Pandas?

Answer:

.iloc: Uses integer-based indexing (row/column positions).
.loc: Uses label-based indexing (row/column labels).
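
The difference is easiest to see when the index labels are not simply 0, 1, 2. The sketch below uses a made-up DataFrame with string labels; note that .iloc slices exclude the end position, while .loc slices include the end label.

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 92, 75, 60]}, index=["d", "a", "c", "b"])

# .iloc: first row, first column, by position.
first_by_position = df.iloc[0, 0]

# .loc: the row labeled "a", the column labeled "score".
a_by_label = df.loc["a", "score"]

# Positional slice: rows at positions 0 and 1 (end excluded).
iloc_slice = df.iloc[0:2]

# Label slice: every row from "d" through "c" (end included).
loc_slice = df.loc["d":"c"]

print(iloc_slice.index.tolist())
print(loc_slice.index.tolist())
```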

13. How can you modify column names in a DataFrame?

Question: What are ways to rename columns in Pandas?
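
Typical approaches are .rename() with a mapping, assigning a new list to .columns, and transforming the existing labels with string methods. A minimal sketch, with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann"], "AGE": [23]})

# Rename selected columns with a mapping; returns a new DataFrame.
renamed = df.rename(columns={"First Name": "first_name", "AGE": "age"})

# Replace every column name at once (the list length must match).
replaced = df.copy()
replaced.columns = ["name", "years"]

# Normalize all names: lowercase, spaces to underscores.
normalized = df.copy()
normalized.columns = normalized.columns.str.lower().str.replace(" ", "_")

print(renamed.columns.tolist())
print(normalized.columns.tolist())
```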

14. What is a vectorized operation in Pandas, and why is it important?

Question: Why are vectorized operations faster than loops in Pandas?

Answer:

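Vectorized operations act on an entire array or Series in a single call instead of processing one element at a time in Python.
They run in optimized, compiled code (largely NumPy routines written in C), which avoids the per-element overhead of the Python interpreter and makes them much faster than explicit loops.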
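
As a rough illustration, the sketch below applies the same 20% markup to a made-up price Series twice, once with a Python loop and once with a vectorized expression, and times both. Exact timings depend on the machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1, 100_001, dtype="float64"))

# Loop: Python handles every element individually.
looped = pd.Series([p * 1.2 for p in prices])

# Vectorized: one expression over the whole Series.
vectorized = prices * 1.2

# Both approaches produce the same values.
print(np.allclose(looped, vectorized))

# Time each approach over a few repetitions.
loop_time = timeit.timeit(lambda: [p * 1.2 for p in prices], number=3)
vec_time = timeit.timeit(lambda: prices * 1.2, number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```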