1. In the babynames dataset, we computed the number of distinct names per capita, as well as the total number of distinct names, for each year. In plain English, what questions could be answered with each of those measures? What are the arguments for using each measure?

2. Explain the effect of putting the x-axis of a scatterplot on a log scale. Give specific examples of how curves plotted in regular Cartesian coordinates are transformed when the x-axis and y-axis are changed to a log scale.

3. Explain the use of “training sets” for building models for predicting, for example, house prices.

4. How do you obtain predictions from a linear regression model?

5. How can you measure how well a linear regression model is working? Be precise.

6. State the criterion according to which lm() selects the best coefficients.

7. Why might transforming inputs (e.g., using the log function) lead to better predictions?

8. Explain the difference between categorical variables and continuous variables. Explain why some variables might be plausibly thought of as either categorical or continuous.

9. In the case of categorical variables, state the criterion for the optimal coefficients that lm() uses.

10. Write down lm()’s cost function when using several categorical variables.

11. When doing prediction using logistic regression, how do we obtain a probability? How do we obtain guesses (about 0 or 1)?

12. What is the false positive rate? In the context of detecting disease, why are false positive rates important?

13. What is the false negative rate? In the context of detecting disease, why are false negative rates important?

14. What is the positive predictive value? In the context of detecting disease, why is the positive predictive value important?

15. Why is it important to use a test set (rather than simply use the training set for everything)?

16. What is overfitting? Explain overfitting by analogy with “teaching to the test”.

17. Explain the difference between stat = "identity" and stat = "count" when using geom_bar.

18. What are two ways to display two curves on the same graph at the same time using ggplot? Explain with reference to the kinds of data frames with which those two ways are compatible.

19. Suppose the prediction for lifeExp is made using $$\text{lifeExp}\approx 12.26+5.34\log(\text{gdpPercap})+a_{\text{asia}}I_{\text{asia}}+\dots$$ Do the algebra to explain why an increase by 1 in log(gdpPercap) corresponds to an increase by a factor of approximately 2.7 in gdpPercap.

20. If we observe an association between A and B, list the reasons the relationship might not be causal. Tell “stories” about the potential reasons that longer time spent on studying is associated with higher GPAs.

21. How do you compute the fair odds of an event?

22. Suppose the fair odds against candidate A winning are 2:5. Explain what this means by describing the terms of a fair bet. Compute the probability that candidate A will win.

23. Explain how to obtain all the numbers on slide 7 from the R output.

24. For all of the following, define the criterion, and explain whether a violation of the criterion would constitute disparate impact, disparate treatment, both, or neither: demographic parity; accuracy parity; true positive parity; predictive value parity; fairness through unawareness.

25. What are the two ways we discussed of achieving accuracy parity?

26. Explain the importance of the distinction between reoffending and being re-arrested in the context of model fairness.

27. Why might sample size disparity contribute to unfairness?

28. Suppose we are trying to estimate the bias (a.k.a. weight) of a coin by tossing it many times. Sketch what the graph of estimated bias versus the number of tosses looks like. Explain the intuition for why it looks the way it does.

29. What is the pmf for the number of Heads we would get if we were tossing a fair coin twice?

30. What is the cumulative distribution function of the number of Heads we would get if we were tossing a fair coin twice?

31. Define the p-value. Why is a p-value only defined in reference to a particular null hypothesis?

32. Give an example of a p-value calculation where we assume the Binomial distribution. Explain how in the case of a Binomial model, the Gaussian approximation can be used to compute the p-value.

33. Explain the idea behind using fake data simulation to approximate p-values.

34. What is a null hypothesis?

35. Without using rt, generate datapoints from a t-distribution with 10 degrees of freedom.

36. What does a low p-value mean? What does a high p-value mean?

37. Suppose we have a sample of 5 heights of Princeton students x = c(7.1, 6.2, 5.5, 5.4, 6.0). Assuming that the heights of Princeton students are normally distributed, how can we test the hypothesis that the average height of a Princeton student is 5.9?

38. What is the null hypothesis in the Darwin’s Finches example? If the null hypothesis is true, what quantity would be t-distributed? How can we use pt to compute the p-value in the Darwin’s Finches example?

39. Sketch a boxplot. Explain all the components of the boxplot.

40. Define Type I and Type II errors.

41. For a scenario of your choice where the null hypothesis is true, sketch a graph of the probability of a Type I error versus sample size. Explain the choices you made.

42. What is the point of pre-registering null hypotheses?

43. What is the problem with the “file drawer effect”? How would publishing negative results help the scientific community?

44. What is p-hacking?

45. Explain what happened in the “Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon” paper.

46. Explain why a Type II error would occur very often if you toss a coin 5 times and have a null hypothesis about the probability of Heads. Support your argument using outputs of pnorm.

47. What are Type M and Type S errors?

48. What are some appropriate uses of Q-tips?
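
As a starting point for questions 37 and 38, here is a minimal sketch of a one-sample t-test computed by hand with pt, using the data from question 37. It is an illustration, not the course's official solution. The variable names (mu0, t_stat, p_value) are introduced here for illustration, and a two-sided alternative is assumed.

```r
# Data and hypothesized mean from question 37
x <- c(7.1, 6.2, 5.5, 5.4, 6.0)
mu0 <- 5.9  # null hypothesis: the population mean height is 5.9

n <- length(x)

# Under the null hypothesis (and normality), this statistic is
# t-distributed with n - 1 degrees of freedom
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # approximately 0.46

# Two-sided p-value (assumption: we care about deviations in both directions)
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Cross-check against the built-in test in base R
builtin <- t.test(x, mu = mu0)

print(t_stat)
print(p_value)
print(builtin$p.value)
```

The last two printed values should agree, since t.test performs a two-sided test by default.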