Statistics is one of the most powerful tools in the arsenal of a data scientist. Raw data may hint at a few of the insights you need to make a decision, but statistics arms you with the information to understand the data as a whole and its true nature. This in turn helps you draw concrete conclusions from the data rather than merely making estimates.
While interviewing as a data scientist, you will be tested on statistics and a multitude of associated topics. But fear not: here is a comprehensive list of questions that commonly occur in data science interviews on statistics. Use them to your advantage, and find the gaps in your knowledge so you can prepare well ahead of the interview.
1. How is statistics related to data science?
Data science includes mathematical statistics along with computer science and its applications. It turns vast amounts of data into knowledge by using statistics, visualization, applied mathematics, and computer science, which makes statistics one of the main parts of data science. Statistics is the branch of mathematics dealing with the collection, organization, presentation, analysis, and interpretation of data.
2. Describe a few methods or techniques used in statistics for analyzing the data.
To sort through big data, the following are a few of the important techniques used; a short code sketch of a few of them appears after the list.
· Mean – The sum of all data points in a dataset divided by the number of data points. It is useful for getting a rapid snapshot of your data or an idea of the overall trend.
· Standard Deviation – The standard deviation denotes the spread of the data around the average (mean). A high standard deviation indicates a greater spread about the mean, whereas a low standard deviation indicates that the values align closely with the mean.
· Regression – Regression finds a relationship between dependent and independent variables, plotted on a scatter plot. Analysis also indicates the strength of the relationship between the model and the data.
· Sample Size Determination – To learn about a large data set or population, measuring a representative sample is usually good enough; this technique determines how large that sample needs to be.
· Hypothesis Testing – Having set a hypothesis about a data set or population, hypothesis testing determines whether the premise is actually true; the t-test is a common example. In statistics, the result of a hypothesis test is significant if it is unlikely to have arisen by random chance alone.
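As a rough illustration, here is a minimal Python sketch of computing a mean, a standard deviation, and a hypothesis test, assuming NumPy and SciPy are installed; the sample values are made up.

    import numpy as np
    from scipy import stats

    data = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2])  # hypothetical measurements

    mean = data.mean()       # a quick snapshot of the overall level of the data
    std = data.std(ddof=1)   # sample standard deviation: spread about the mean

    # One-sample t-test of the hypothesis that the population mean is 12.0
    t_stat, p_value = stats.ttest_1samp(data, popmean=12.0)
    print(mean, std, t_stat, p_value)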
3. What are the different branches of statistics?
Descriptive statistics and inferential statistics are the two main branches of statistics. Descriptive statistics mainly involves the collection and presentation of data. Inferential statistics deals with drawing the right conclusions from the analysis performed using descriptive statistics.
4. Is standard deviation robust to outliers?
No. A low standard deviation indicates a small spread about the mean, while a high standard deviation means the data is very widely distributed. Extreme data points increase the standard deviation because they lie far from the mean, so outliers do affect the value of the standard deviation: it is not robust to them.
5. What do you mean by linear regression?
Linear regression is a method that relates two variables with a simple model in order to make predictions from the data. The effect of a single predictor variable X on a single dependent variable Y is modeled, typically as a straight line Y = a + bX.
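A minimal sketch of fitting such a model in Python with NumPy; the data points are made up.

    import numpy as np

    # Hypothetical observations: X is the single predictor, Y the dependent variable
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    b, a = np.polyfit(X, Y, deg=1)  # least-squares fit of Y = a + bX
    Y_pred = a + b * X              # predictions from the fitted line
    print(a, b)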
6. What are interpolation and extrapolation?
Estimating the value of a point between two known data points within a set of discrete data points is called interpolation. Determining the value of a data point that lies outside the range of the existing data points, using predictive analysis, is called extrapolation.
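A small sketch of both with NumPy; the data points are invented.

    import numpy as np

    x_known = np.array([0.0, 1.0, 2.0, 3.0])
    y_known = np.array([0.0, 2.0, 4.0, 6.0])

    # Interpolation: estimate y at a point inside the known range
    y_interp = np.interp(1.5, x_known, y_known)   # -> 3.0

    # Extrapolation: predict outside the range, here with a fitted line,
    # since np.interp only clamps to the boundary values
    slope, intercept = np.polyfit(x_known, y_known, deg=1)
    y_extrap = slope * 5.0 + intercept            # -> 10.0
    print(y_interp, y_extrap)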
7. What is the difference between Cluster and Systematic Sampling?
Cluster sampling is used when simple random sampling cannot be applied and the target population is too widely spread to study directly. It involves a sample in which each sampling unit is a group (cluster) of elements. Systematic sampling is a technique in which elements are selected from an ordered sampling frame. In systematic sampling, the list is traversed in a circular manner, so once the end is reached you start from the top again.
8. What does P-value signify about the statistical data?
The p-value denotes the significance of the results after a hypothesis test. It lies between 0 and 1 and helps in drawing conclusions (a small example follows the list):
• A p-value > 0.05 indicates weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
• A p-value <= 0.05 indicates strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
• A p-value close to 0.05 is considered marginal, meaning the decision could go either way.
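For instance, a two-sample t-test with SciPy, run on simulated groups, shows how the p-value drives the conclusion.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # simulated measurements
    group_b = rng.normal(loc=11.0, scale=2.0, size=30)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    if p_value <= 0.05:
        print(f"p = {p_value:.3f}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.3f}: fail to reject the null hypothesis")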
9. What are the assumptions required for linear regression?
Linear regression has five key assumptions:
· Linear relationship – Linear regression requires the relationship between the independent and dependent variables to be linear.
· Multivariate normality – All variables should be multivariate normal; normality can be checked with a goodness-of-fit test.
· No or little multicollinearity – Multicollinearity occurs when the independent variables are too highly correlated with each other (a quick check is sketched after this list).
· No auto-correlation – Autocorrelation occurs when the residuals are not independent of each other.
· Homoscedasticity – The residuals have equal variance across the regression line, i.e. at every level of the predictors.
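As promised above, here is a minimal multicollinearity check using variance inflation factors, assuming statsmodels is installed and using simulated predictors.

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # deliberately correlated with x1
    X = np.column_stack([np.ones(100), x1, x2])      # include an intercept column

    # A VIF above roughly 5-10 usually signals problematic multicollinearity
    for i, name in enumerate(["const", "x1", "x2"]):
        print(name, variance_inflation_factor(X, i))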
10. What is a statistical interaction?
In statistics, an interaction may arise when considering the relationship among three or more variables, and describes a situation in which the effect of one causal variable on an outcome depends on the state of a second causal variable (that is, when effects of the two causes are not additive).
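A small simulation can make this concrete; the sketch below uses statsmodels' formula API, and all coefficients are made up for illustration.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    # The effect of x1 on y depends on x2, so the two effects are not additive
    df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + 3 * df["x1"] * df["x2"] + rng.normal(size=200)

    model = smf.ols("y ~ x1 * x2", data=df).fit()  # x1 * x2 expands to x1 + x2 + x1:x2
    print(model.params)  # the x1:x2 coefficient estimates the interaction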
11. What is selection bias?
Selection bias happens when individuals are not selected randomly, i.e. there is no evident randomization in the groups or data to be analysed. It basically means that the given sample does not accurately represent the population under analysis. Selection bias includes time-interval, attribute, data, and sampling bias.
12. Give me an example of a data set with a non-Gaussian (non-normal) distribution.
Many examples of non-normal distributions can be given. Bacterial growth naturally follows an exponential distribution and is a good example of a non-Gaussian distribution.
(Figure: a bacterial growth curve. Source: https://en.wikipedia.org/wiki/Bacterial_growth#/media/File:Bacterial_growth.png)
13. What is the central limit theorem?
The Central Limit Theorem (CLT) states that, given a sufficiently large sample size from a population with a finite level of spread (variance), the mean of a sample drawn from that population will be approximately equal to the population mean. More specifically, as sample sizes get bigger, the distribution of means from repeated sampling approaches a normal curve, regardless of the shape of the original population.
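A quick simulation makes this concrete; here is a sketch with NumPy, starting from a deliberately non-normal population.

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal

    # Means of many repeated samples cluster around the population mean,
    # and their distribution approaches a normal curve as n grows
    sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]
    print(population.mean(), np.mean(sample_means), np.std(sample_means))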
14. Given a dataset, how does Euclidean distance work in three dimensions?
Euclidean distance is the straight-line distance between two points in Euclidean space. In three-dimensional Euclidean space, the distance between two data points p(p1, p2, p3) and q(q1, q2, q3) is
d(p, q) = √((q1 − p1)² + (q2 − p2)² + (q3 − p3)²)
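A direct translation of the formula into Python; the example points are arbitrary.

    import math

    def euclidean_distance(p, q):
        # Straight-line distance between two 3-D points
        return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

    print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # -> 5.0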
15. What are the differences between overfitting and underfitting?
In overfitting, a statistical model ends up describing random error instead of the actual relationship. Overfitting occurs when a model is too complex, such as having an excessive number of parameters relative to the number of observations. An overfitted model predicts poorly because it overreacts to minor noise in the input data.
Underfitting happens when a model cannot capture the actual trend of the data; it happens, for example, when a straight-line model is fitted to non-linear data. Such a model also shows poor predictive performance. A small demonstration of both follows.
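Here is a minimal demonstration with NumPy, fitting polynomials of increasing degree to noisy non-linear data; the degrees and noise level are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)  # noisy non-linear data
    x_test = np.linspace(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (1, 3, 9):  # underfit, reasonable fit, overfit
        coeffs = np.polyfit(x, y, degree)
        test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(test_error, 3))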
16. How can one avoid overfitting when making a statistical model?
One should identify the key variables and think carefully about the form of the relationship that is likely to be specified. The plan should then be to collect a sample large enough to support all the predictors, interactions, and polynomial terms the response variable might need; validation techniques such as cross-validation (question 21 below) also help detect overfitting.
17. What is sampling in statistics? How many sampling methods are there?
In statistics, a sample is a subset of data collected from a statistical population by a defined procedure. The elements contained in the sample are known as sample points.
The main sampling methods are listed below; a short sketch of two of them follows the list.
· Cluster Sampling: The population is divided into groups or clusters, and entire clusters are sampled.
· Simple Random: Every member of the population has an equal chance of being selected.
· Stratified: The data is divided into groups or strata, and a sample is drawn from each.
· Systematic: We pick every k-th member of the data.
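A minimal sketch of two of these methods with NumPy, using a made-up population of 1,000 units.

    import numpy as np

    rng = np.random.default_rng(0)
    population = np.arange(1000)

    # Simple random sampling: every member has an equal chance of selection
    simple = rng.choice(population, size=50, replace=False)

    # Systematic sampling: a random start, then every k-th member
    k = len(population) // 50
    start = rng.integers(k)
    systematic = population[start::k]
    print(simple[:5], systematic[:5])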
18. What is the difference between type I vs type II error?
A type I error is falsely concluding that something is present when in fact it is not (a false positive), whereas a type II error is falsely concluding that something is absent when in fact it exists (a false negative).
19. What is the Binomial Probability Formula?
The binomial distribution gives the probabilities of the possible numbers of successes in N trials of independent events that individually have a probability π of occurring. The formula for the binomial distribution is:
P(x) = [N! / (x! (N − x)!)] π^x (1 − π)^(N − x)
where N is the number of trials, P(x) is the probability of x successes out of N trials, and π is the probability of success on a given trial.
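This can be checked directly in Python with the standard library; the coin-flip numbers below are just an illustration.

    from math import comb

    def binomial_pmf(x, N, pi):
        # Probability of exactly x successes in N independent trials
        return comb(N, x) * pi**x * (1 - pi)**(N - x)

    print(binomial_pmf(3, 10, 0.5))  # P(exactly 3 heads in 10 fair coin flips) ≈ 0.117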
20. What is correlation and covariance in statistics?
Correlation is a technique for measuring and estimating the quantitative relationship between two variables. It measures the strength of the relationship on a standardized, unit-free scale that runs from −1 to +1.
Covariance is a measure of the degree to which two random variables vary together. It describes the relation between two random variables wherein a change in one variable is accompanied by a corresponding change in the other; unlike correlation, its magnitude depends on the units of the variables.
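Both quantities are one call away in NumPy; the data below are arbitrary.

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])
    y = np.array([1.5, 3.7, 6.1, 7.9])

    cov = np.cov(x, y)[0, 1]        # how the two variables vary together (unit-dependent)
    corr = np.corrcoef(x, y)[0, 1]  # covariance rescaled to the range [-1, 1]
    print(cov, corr)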
21. What is cross-validation?
It is a technique used for model validation, i.e. for finding out how the results of a statistical analysis will generalize to an independent data set. It is mainly used in scenarios where the aim is prediction and one wants to estimate how accurately a model will perform in practice. The aim of cross-validation is to set aside a portion of the data for testing the model during the training phase, in order to limit problems like overfitting and to gauge how the model will generalize.
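A minimal sketch of k-fold cross-validation, assuming scikit-learn is available and using synthetic data.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

    # 5-fold CV: train on four folds, score on the held-out fold, repeat
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(scores.mean())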
22. What is heteroscedasticity? How can we solve it?
Heteroscedasticity is the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. We can address it by rebuilding the model with new predictors or by applying variable transformations such as the Box-Cox transformation.
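As a sketch of the second remedy, here is SciPy's Box-Cox transformation applied to made-up right-skewed data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strictly positive, right-skewed

    # boxcox returns the transformed data and the lambda that makes
    # the result as close to normally distributed as possible
    transformed, fitted_lambda = stats.boxcox(skewed)
    print(fitted_lambda)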
23. What do you understand by statistical power? How is it calculated?
Statistical power is the probability that a study will detect an effect when there is a real effect to be detected. If statistical power is high, the chance of making a Type II error (concluding there is no effect when there actually is one) goes down. The power of any such test is governed by four main parameters (a calculation sketch follows the list):
· the effect size
· the alpha significance criterion (α)
· the sample size (N)
· statistical power, or the implied Type II error rate β (power = 1 − β)
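For example, power analysis for a two-sample t-test can be done with statsmodels; the effect size and targets below are illustrative.

    from statsmodels.stats.power import TTestIndPower

    # Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
    # with alpha = 0.05 and 80% power
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(round(n_per_group))  # roughly 64 per group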
24. What is the Poisson distribution?
The Poisson distribution is used to find the number of events that may happen in a fixed interval of time, for instance how many emails may arrive within a particular time window or how many people may join a queue. If λ is the average number of events per interval, its probability mass function is:
P(X = x) = (λ^x e^(−λ)) / x!, for x = 0, 1, 2, …
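A minimal check of this formula with SciPy, using a made-up rate of 4 emails per hour.

    from scipy.stats import poisson

    lam = 4.0  # hypothetical average of 4 emails per hour

    print(poisson.pmf(2, mu=lam))  # probability of exactly 2 emails in an hour
    print(poisson.cdf(6, mu=lam))  # probability of at most 6 emails in an hour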
25. What is your favourite statistical software? State three positive and negative aspects of it.
Minitab is a general-purpose statistical package designed for easy interactive use. It is well suited for teaching applications but can also be adapted easily for analyzing research data.
Advantages:
· Smart Data Import: Easily corrects for case mismatches, properly displays missing data, removes extra spaces, and makes column lengths equal when data is imported from Excel and other file types.
· Automatic Graph Updating: Graphs and control charts are updated automatically when you add or edit data.
· Seamless Data Manipulation:Format columns to instantly identify and subset the most frequent values, outliers, out-of-spec measurements, and more.
Disadvantages:
· Range of Functions: The range of statistical analyses that Minitab can handle is not as broad as in other packages such as SPSS and SAS. This means that for applied research fields with specialized techniques, such as economics, Minitab is not the best choice.
· Ease of Use: Although Minitab is generally considered easy to use and operates through an intuitive interface, it has some drawbacks in this area. Like the SPSS data view, the worksheet window in Minitab uses a fixed structure that is more difficult to manipulate than spreadsheet programs like Microsoft Excel.
· Weak Mathematics Features: Minitab is a data analysis package, and so a weaker choice for purely mathematical uses, with limited capability for numerical analysis.