---
title: "Precept 11"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

Run the following to load a dataset that records various data about mammals, including brain weight. The brain weight is given in grams, the body weight in kilograms, and the gestation period in days.

```{r}
brains <- read.csv("http://guerzhoy.princeton.edu/201s20/brains.csv")
```

### Problem 1: Linear Regression

#### Part 1(a)

Suppose you want to use linear regression to investigate the relationship between brain weight and body weight. Find a way to transform the variables that would allow you to do that. (Hint: try taking the log of *both* variables. See Tuesday's lecture, where we explored the relationship between GDP per capita and life expectancy.) Use a scatterplot to assess whether the relationship is linear.

#### Solution

A plot where we take the log of both variables works nicely.

```{r}
# Log-log scatterplot with a fitted least-squares line
ggplot(brains, mapping = aes(x = log(Body), y = log(Brain))) +
  geom_point() +
  geom_smooth(method = "lm")
```

#### Part 1(b)

Produce the diagnostic plots. Display and investigate outliers, if any. (See Tuesday's lecture on the relationship between GDP per capita and life expectancy.)

#### Solution

Let's now produce the diagnostic plots:

```{r}
library(ggfortify)  # provides autoplot() for lm objects
fit <- lm(log(Brain) ~ log(Body), data = brains)
autoplot(fit)
```

Let's look at the outliers in more detail:

```{r}
brains[c(58, 25, 48),]
```

Interestingly, dolphins and hippos are closely related phylogenetically (but the residuals have different signs, so there is no big insight here). Removing lemurs (and hippos) as large outliers might make sense, but would be tough to justify. Here is how to remove data points:

```{r}
# Negative indexing drops row 48
brains.no.hippos <- brains[-48,]
```

There are not too many outliers, and the Q-Q plot is approximately linear, so we can run the regression.

#### Part 1(c)

Run the regression. What conclusions can you draw?
#### Solution

```{r}
summary(fit)
```

There is a positive association between body weight and brain weight: the p-value for the coefficient of `log(Body)` is very small, so we can reject the hypothesis that the coefficient is zero.

### Problem 2: Failing to meet assumptions

Suppose we want to know whether the size of the litter is related to body weight. Produce diagnostic plots for any variable transformations you can think of. Do not expect the linear regression model assumptions to hold.

#### Solution

```{r}
fit <- lm(Litter ~ log(Body), data = brains)
autoplot(fit)
```

### Problem 3: Litter size as categorical

Treat rounded litter size as categorical (you will need to convert litter size to categorical). Plot the appropriate diagnostics. Are the model assumptions satisfied?

#### Solution

```{r}
library(tidyverse)
# One boxplot of log(Body) per rounded litter size
ggplot(brains) +
  geom_boxplot(mapping = aes(y = log(Body), x = as.factor(round(Litter))))
```

The distributions within the groups are generally not symmetric, and the variance is not constant across groups, so the model assumptions are not satisfied.

### Problem 4: Litter size as ever more categorical

Create a new variable: litter size is greater than 5. Check the model assumptions. Now, use `lm` to test the hypothesis that the body weight is related to the litter size being greater than 5. What conclusions can you draw?

#### Solution

```{r}
# L5 is TRUE when the litter size exceeds 5
brains <- brains %>% mutate(L5 = Litter > 5)
fit <- lm(log(Body) ~ L5, data = brains)
ggplot(brains) +
  geom_boxplot(mapping = aes(x = L5, y = log(Body)))
```

Seems OK!

```{r}
summary(fit)
```

We can reject (barely) the hypothesis that there is no relationship between having a litter greater than 5 and body weight.

### Problem 5: the F-test

Create another new variable with the categories: litter size up to 2, litter size up to 7, litter size over 7. Produce appropriate diagnostic plots, and use an F-test to compute a p-value. What is the null hypothesis? What is the conclusion?
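#### Solution

```{r}
# Start with every row labelled "Small", then overwrite the larger litters
brains[, "Litter.size"] <- "Small"
brains$Litter.size[brains$Litter > 2] <- "Medium"
brains$Litter.size[brains$Litter > 7] <- "Large"
ggplot(brains) +
  geom_boxplot(mapping = aes(x = Litter.size, y = log(Body)))
```

```{r}
summary(lm(log(Body) ~ Litter.size, data = brains))
```

The null hypothesis is that the mean of `log(Body)` is the same in all three litter-size categories. The p-value of the F-test is reported on the last line of the summary output. We can reject the hypothesis that there is no relationship between body weight and litter size.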
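
The F-test can also be read from an ANOVA table instead of the last line of `summary()`. The chunk below is a minimal sketch, assuming the `brains` data frame and its `Litter.size` column have been created by the chunks above. `overall_f_pvalue` is a hypothetical helper name introduced here for illustration; it is not part of the precept.

```{r}
# Hypothetical helper (not part of the original precept): p-value of the
# overall F-test for a fitted linear model, computed from summary().
overall_f_pvalue <- function(fit) {
  f <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
}

# Runs only if the earlier chunks have created `brains` with `Litter.size`.
if (exists("brains")) {
  fit.litter <- lm(log(Body) ~ Litter.size, data = brains)
  print(anova(fit.litter))             # ANOVA table: F statistic and Pr(>F)
  print(overall_f_pvalue(fit.litter))  # the same p-value, from summary()
}
```

With a single categorical predictor, the `Pr(>F)` entry in the `anova()` table and the value returned by the helper should agree, since both describe the same overall F-test.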