Interview Questions on Descriptive Statistics
1. Define mean, median, and mode. How do they differ?
- Mean: The mean is the arithmetic average of a set of values. You calculate it by adding all the numbers in a dataset and then dividing by the number of values. The mean is sensitive to every value in the dataset, which means it can be affected by outliers or extreme values. For example, if you have a dataset of incomes, and one person earns significantly more than the others, the mean will increase accordingly.
- Median: The median is the middle value in an ordered dataset. If the dataset has an odd number of observations, the median is the middle number. If there is an even number of observations, the median is the average of the two middle numbers. The median is less sensitive to outliers and skewed data than the mean, making it a better measure of central tendency when the data is not symmetrically distributed.
- Mode: The mode is the value that appears most frequently in a dataset. There can be more than one mode if multiple values have the same highest frequency, or there may be no mode if all values are unique. The mode is particularly useful for categorical data, where the mean and median cannot be calculated.
- Differences: The key difference between these measures is how they respond to the data:
- The mean takes into account all values but is influenced by outliers.
- The median represents the central point and is less affected by extreme values, making it robust against outliers.
- The mode identifies the most common value and is especially useful in categorical data analysis.
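The three measures can be compared side by side with Python's standard-library `statistics` module; the dataset below is illustrative, with a deliberate outlier:

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 100]  # 100 is a deliberate outlier

mean = statistics.mean(data)      # sum of values / count; pulled upward by 100
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean)    # ≈ 18.57, far above most of the data
print(median)  # 5
print(mode)    # 3
```

Note how the single extreme value drags the mean well above all but one observation, while the median and mode stay with the bulk of the data.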
2. What is the difference between population variance and sample variance?
- Population Variance: This is the measure of how data points in an entire population are spread out from the mean. It’s calculated by taking the average of the squared differences between each data point and the population mean. Population variance is denoted as σ².
- Sample Variance: When you calculate variance from a sample rather than the entire population, it’s called sample variance. The calculation is similar to population variance, but instead of dividing by the number of data points, you divide by n − 1 (where n is the sample size). This adjustment is known as Bessel’s correction and is used to correct the bias in the estimation of the population variance. Sample variance is denoted as s².
- Differences: The main difference lies in how they are calculated:
- Population variance divides by the total number of data points (N), assuming you have data from the entire population.
- Sample variance divides by n − 1, accounting for the fact that the sample might not capture all the variability present in the entire population.
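The `statistics` module exposes the two formulas as separate functions, which makes the N versus n − 1 distinction concrete (the sample values here are illustrative):

```python
import statistics

sample = [4, 8, 6, 5, 3]  # mean = 5.2; sum of squared deviations = 14.8

pop_var = statistics.pvariance(sample)  # divides by N = 5      -> 2.96
smp_var = statistics.variance(sample)   # divides by n - 1 = 4  -> 3.7 (Bessel's correction)
```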
3. How do outliers impact mean and median?
- Mean: Outliers can have a significant impact on the mean because it considers all values in the dataset. An outlier, being an extreme value, can skew the mean towards itself, leading to a misleading representation of the dataset’s central tendency. For example, in a dataset of household incomes, if most incomes are around $50,000 but one household earns $1,000,000, the mean income will be much higher than the majority of the data suggests.
- Median: The median is less sensitive to outliers because it only considers the middle value(s) in a sorted dataset. No matter how extreme an outlier is, it does not affect the median unless the outlier is at the median position itself. Therefore, the median often provides a better measure of central tendency for skewed data.
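The income example can be reproduced in a few lines; the figures are hypothetical:

```python
import statistics

incomes = [48_000, 50_000, 52_000, 51_000, 49_000]
with_outlier = incomes + [1_000_000]

print(statistics.mean(incomes), statistics.median(incomes))            # 50000 50000
print(statistics.mean(with_outlier))   # ≈ 208333: quadruples because of one household
print(statistics.median(with_outlier)) # 50500.0: barely moves
```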
4. Explain the concept of a quartile and interquartile range.
- Quartiles: Quartiles divide a ranked dataset into four equal parts. The three quartiles are:
- Q1 (First Quartile): The value below which 25% of the data falls.
- Q2 (Second Quartile or Median): The value below which 50% of the data falls.
- Q3 (Third Quartile): The value below which 75% of the data falls.
Quartiles are useful in understanding the distribution of data and identifying the spread of the middle 50% of values.
- Interquartile Range (IQR): The IQR is the range between the first quartile (Q1) and the third quartile (Q3), calculated as IQR = Q3 − Q1. The IQR measures the spread of the middle 50% of the data, providing a robust measure of variability that is not influenced by outliers. It’s often used in conjunction with boxplots to identify outliers and assess the distribution’s spread.
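`statistics.quantiles` (Python 3.8+) returns the three quartile cut points directly; `method='inclusive'` treats the data as a complete population rather than a sample:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

print(q1, q2, q3)  # 3.0 5.0 7.0
print(iqr)         # 4.0: spread of the middle 50% of the data
```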
5. What is skewness and kurtosis?
- Skewness: Skewness measures the asymmetry of a distribution.
- Positive Skew: If the distribution has a long right tail, the distribution is positively skewed, meaning most values are concentrated on the left, with fewer larger values to the right.
- Negative Skew: If the distribution has a long left tail, it is negatively skewed, meaning most values are concentrated on the right, with fewer smaller values to the left.
Skewness indicates whether the data is leaning towards higher or lower values, helping in understanding the distribution’s shape.
- Kurtosis: Kurtosis measures the “tailedness” or peakedness of the distribution.
- High Kurtosis (Leptokurtic): Indicates a distribution with heavy tails and a sharp peak, suggesting the presence of outliers.
- Low Kurtosis (Platykurtic): Indicates a distribution with light tails and a flatter peak, suggesting fewer outliers.
Kurtosis is useful in understanding the likelihood of extreme outcomes in a dataset. In finance, for example, high kurtosis can indicate higher risk due to the potential for extreme returns.
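Neither measure is in the standard library, but both are simple moment calculations. The functions below are a plain-Python sketch of the population formulas (in practice SciPy's `skew` and `kurtosis` are the usual choices):

```python
import statistics

def skewness(data):
    # Third standardized moment (population form): 0 for symmetric data
    n, mu, sd = len(data), statistics.fmean(data), statistics.pstdev(data)
    return sum((x - mu) ** 3 for x in data) / (n * sd ** 3)

def excess_kurtosis(data):
    # Fourth standardized moment minus 3: 0 for a normal distribution
    n, mu, sd = len(data), statistics.fmean(data), statistics.pstdev(data)
    return sum((x - mu) ** 4 for x in data) / (n * sd ** 4) - 3

print(skewness([1, 2, 2, 3, 3, 3, 10]))  # positive: long right tail
print(skewness([1, 2, 3, 4, 5]))         # 0.0: symmetric data
```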
6. How can you measure the spread of data?
- Range: The difference between the maximum and minimum values in a dataset. It’s the simplest measure of spread but doesn’t provide information about the distribution between these extremes.
- Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1). The IQR measures the spread of the middle 50% of the data and is robust against outliers.
- Variance: The average of the squared differences from the mean. Variance gives a measure of how far each number in the dataset is from the mean, but it’s in squared units, making interpretation less intuitive.
- Standard Deviation: The square root of variance, representing the average distance from the mean in the original units of the data. Standard deviation is widely used because it’s more interpretable than variance.
- Mean Absolute Deviation (MAD): The average of the absolute deviations from the mean, offering another measure of variability. It is less sensitive to outliers compared to variance and standard deviation.
These measures help in understanding how spread out or clustered the data is around the central tendency.
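All of these spread measures can be computed with the standard library; the dataset is a small illustrative one:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.fmean(data)                        # 5.0

data_range = max(data) - min(data)                 # 7: only uses the two extremes
variance = statistics.pvariance(data)              # 4.0: in squared units
std_dev = statistics.pstdev(data)                  # 2.0: back in the original units
mad = statistics.fmean(abs(x - mu) for x in data)  # 1.5: mean absolute deviation
```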
7. What is a boxplot and what information does it convey?
- Boxplot: A boxplot, or box-and-whisker plot, is a graphical representation of a dataset that displays the distribution’s five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
- Box: The central box represents the interquartile range (IQR), containing the middle 50% of the data.
- Whiskers: Lines extending from the box to the most extreme data points that still lie within 1.5 times the IQR of Q1 and Q3.
- Outliers: Data points outside the whiskers are plotted as individual points, indicating potential outliers.
Boxplots are useful for comparing distributions, identifying skewness, and detecting outliers. They give a quick visual summary of the data’s central tendency, spread, and variability.
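The whisker and outlier logic of a boxplot can be reproduced numerically with the 1.5 × IQR rule; the data are illustrative:

```python
import statistics

data = [10, 12, 13, 14, 15, 16, 18, 45]  # 45 looks suspicious
q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # whiskers stop at the last points inside the fences
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [45]
```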
8. Describe a situation where the median is more appropriate than the mean.
- When to Use the Median: The median is more appropriate in situations where the data is skewed or contains outliers. For example:
- Income Data: In a dataset of household incomes, where most households earn around $50,000 but a few earn over $1,000,000, the mean would be higher than the majority of incomes, giving a distorted view of the typical income. The median, being less influenced by these extreme values, would provide a better representation of the central tendency.
- Housing Prices: In real estate, median home prices are often reported instead of the mean because the median is less affected by a few very expensive or very cheap homes.
9. What is a Z-score?
- Z-score: A Z-score indicates how many standard deviations a data point is from the mean of the dataset. It’s calculated as:
Z = (X − μ) / σ
where X is the data point, μ is the mean, and σ is the standard deviation.
A Z-score of 0 means the data point is exactly at the mean. Positive Z-scores indicate values above the mean, while negative Z-scores indicate values below the mean.
Z-scores are useful for standardizing data, allowing comparisons between different datasets or different variables within the same dataset. They are also used to identify outliers (typically, Z-scores beyond ±2 or ±3).
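The formula translates directly into code; the exam scores are hypothetical:

```python
import statistics

scores = [60, 70, 75, 80, 90]
mu = statistics.fmean(scores)      # 75.0
sigma = statistics.pstdev(scores)  # 10.0

def z_score(x):
    return (x - mu) / sigma

print(z_score(90))  # 1.5: 1.5 standard deviations above the mean
print(z_score(60))  # -1.5: below the mean
print(z_score(75))  # 0.0: exactly at the mean
```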
10. Explain the 68–95–99.7 rule.
- 68–95–99.7 Rule: Also known as the empirical rule, this rule applies to normal distributions:
- 68% of data falls within 1 standard deviation of the mean.
- 95% falls within 2 standard deviations.
- 99.7% falls within 3 standard deviations.
This rule helps in understanding the distribution of data and making predictions about where most data points are likely to fall. For instance, if you know a dataset is normally distributed, you can expect that nearly all data points will be within three standard deviations of the mean.
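The rule can be checked empirically by sampling from a normal distribution; the mean, standard deviation, and sample size below are arbitrary choices for the simulation:

```python
import random
import statistics

random.seed(0)
data = [random.gauss(100, 15) for _ in range(100_000)]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

def share_within(k):
    # Fraction of the sample within k standard deviations of the mean
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)

print(share_within(1), share_within(2), share_within(3))  # ≈ 0.68 0.95 0.997
```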
11. How do you handle missing data?
- Handling Missing Data: There are several strategies:
- Deletion: Removing rows or columns with missing values. This is simple but can lead to loss of valuable information, especially if many data points are missing.
- Imputation: Filling in missing data with a substitute value, such as the mean, median, or mode. More sophisticated methods include using regression models, k-nearest neighbors, or machine learning models to predict the missing values.
- Flagging: Creating a new variable that flags the presence of missing data, which can be useful for models that may interpret missingness as informative.
- Ignoring: Some algorithms can handle missing data directly, so no imputation is necessary.
The choice of method depends on the extent of missing data, the reasons for its absence, and the impact on the analysis.
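Mean imputation plus a missingness flag can be sketched in plain Python (with pandas, `fillna` and `isna` are the usual tools); the values are illustrative:

```python
import statistics

raw = [23.0, None, 19.5, 31.2, None, 27.4]

observed = [x for x in raw if x is not None]
fill = statistics.fmean(observed)                  # mean of the observed values
imputed = [fill if x is None else x for x in raw]  # imputation
was_missing = [x is None for x in raw]             # flagging missingness
```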
12. How would you treat outliers in your dataset?
- Treating Outliers:
- Trimming/Removing: Excluding outliers from the dataset if they are deemed to be erroneous or not representative of the population.
- Transformation: Applying transformations, such as logarithmic or square root, to reduce the impact of outliers. This can make the data more normally distributed.
- Capping/Winsorizing: Limiting extreme values to a maximum threshold. For example, replacing values beyond the 95th percentile with the 95th percentile value.
- Investigation: Analyzing the cause of outliers to determine whether they should be kept, corrected, or removed. Outliers might represent important phenomena, so they shouldn’t always be discarded.
The approach depends on the context and the goal of the analysis.
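Capping at a chosen percentile (Winsorizing) is a one-liner once the threshold is computed; the 90th-percentile cap here is an arbitrary illustrative choice:

```python
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 150]  # 150 is an extreme value
# n=10 yields deciles; the last cut point is the 90th percentile
cap = statistics.quantiles(data, n=10, method='inclusive')[-1]
winsorized = [min(x, cap) for x in data]

print(cap)              # 33.0
print(max(winsorized))  # 33.0: the extreme value is capped, the rest untouched
```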
13. What is a percentile?
- Percentile: A percentile indicates the value below which a certain percentage of data points fall. For example:
- The 25th percentile (Q1) is the value below which 25% of the data lies.
- The 50th percentile (median) is the value below which 50% of the data lies.
- The 90th percentile is the value below which 90% of the data lies.
Percentiles are used to understand the relative standing of a data point within the distribution. In educational testing, for example, a student’s score in the 85th percentile means they scored better than 85% of the other test-takers.
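The percentile rank of a single observation can be computed by counting; the test scores below are hypothetical:

```python
def percentile_rank(data, value):
    """Percentage of values strictly below `value`."""
    return 100 * sum(x < value for x in data) / len(data)

scores = [55, 60, 65, 70, 72, 75, 80, 85, 90, 95]
print(percentile_rank(scores, 85))  # 70.0: better than 70% of the scores
```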
14. Describe the difference between covariance and correlation.
- Covariance: Covariance measures the direction of the linear relationship between two variables. If both variables tend to increase or decrease together, the covariance is positive. If one increases while the other decreases, the covariance is negative. However, the magnitude of covariance is not standardized, making it difficult to interpret the strength of the relationship.
- Correlation: Correlation standardizes the covariance by dividing by the product of the standard deviations of the variables, resulting in a value between -1 and 1. This makes it easier to interpret:
- A correlation of 1 means a perfect positive linear relationship.
- A correlation of -1 means a perfect negative linear relationship.
- A correlation of 0 indicates no linear relationship.
Correlation is widely used to assess the strength and direction of relationships between variables, especially when comparing variables with different units or scales.
15. How do you normalize data?
- Normalization: Normalization rescales data to a common scale, typically 0 to 1, without distorting differences in the range of values. It’s done using the formula:
X_normalized = (X − X_min) / (X_max − X_min)
where X_min and X_max are the minimum and maximum values in the dataset.
Normalization is useful when the data has varying scales, and you want to ensure that all variables contribute equally to the analysis, such as in machine learning algorithms like K-nearest neighbors (KNN) or neural networks.
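Min-max normalization in plain Python (scikit-learn's `MinMaxScaler` is the common library route); the data are illustrative:

```python
data = [50, 20, 80, 100, 60]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

print(normalized)  # [0.375, 0.0, 0.75, 1.0, 0.5]
```

The minimum always maps to 0 and the maximum to 1; every other value lands proportionally in between.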
16. What are the benefits of using standard deviation over range?
- Standard Deviation vs. Range:
- Range: Measures the difference between the maximum and minimum values in the dataset. While simple, it only considers two values and doesn’t reflect the distribution of the rest of the data.
- Standard Deviation: Measures the average distance of each data point from the mean. It considers all data points and provides a more accurate and informative measure of spread. Unlike the range, which depends only on the two extreme values, standard deviation reflects the overall dispersion of the data, though it is still inflated by outliers.
Standard deviation is preferred because it provides more information about the variability and is less sensitive to outliers than the range.
17. Define the five-number summary in statistics.
- Five-Number Summary: The five-number summary provides a quick overview of the distribution of a dataset:
- Minimum: The smallest data point in the dataset.
- Q1 (First Quartile): The 25th percentile, below which 25% of the data falls.
- Median (Q2): The 50th percentile, the middle value of the dataset.
- Q3 (Third Quartile): The 75th percentile, below which 75% of the data falls.
- Maximum: The largest data point in the dataset.
The five-number summary is often visualized using a boxplot and is useful for identifying the center, spread, and skewness of the data, as well as detecting outliers.
18. How would you compare the distributions of two different datasets?
- Comparing Distributions:
- Boxplots: Visual comparison of the spread, central tendency, and outliers of two datasets. Boxplots provide a clear representation of the range, quartiles, and median.
- Histograms: Visual comparison of the shape, spread, and skewness of distributions. Histograms show the frequency of data points in each bin, helping to identify patterns like normality, skewness, or bimodality.
- Descriptive Statistics: Comparing measures like the mean, median, standard deviation, and IQR. These statistics provide numerical insights into the central tendency and variability of each dataset.
Comparing distributions helps in understanding differences in the datasets, such as variability, central tendency, and presence of outliers, and is crucial in statistical analysis.
19. Describe the relationship between variance and standard deviation.
- Variance and Standard Deviation:
- Variance: Represents the average of the squared differences from the mean, providing a measure of how spread out the data points are. However, it’s in squared units, which can make interpretation difficult.
- Standard Deviation: The square root of the variance, bringing the measure back to the original units of the data. It represents the average distance of each data point from the mean.
The relationship is that standard deviation is simply the square root of variance. Both are measures of variability, but standard deviation is more interpretable because it’s in the same units as the data, making it easier to understand the spread in context.
20. How is a mode useful when analyzing categorical data?
- Mode in Categorical Data: The mode is the most frequent category or value in a dataset, making it particularly useful when dealing with categorical (nominal or ordinal) data. For example:
- Survey Responses: In a survey about preferred programming languages, the mode would identify the most popular choice.
- Customer Feedback: If you categorize customer feedback into themes, the mode would tell you which theme occurs most frequently.
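`statistics.mode` works directly on strings, and `collections.Counter` gives the full frequency table; the survey responses below are hypothetical:

```python
import statistics
from collections import Counter

responses = ["Python", "SQL", "Python", "R", "Python", "SQL"]

print(statistics.mode(responses))         # 'Python': the most frequent category
print(Counter(responses).most_common(2))  # [('Python', 3), ('SQL', 2)]
```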