---
title: "Precept 5 Solutions"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## Problem 1: Extracting the Last Name with `strsplit` and `sapply`
### Problem 1(a)
Make a function that takes in a data frame in the same format as Titanic, and returns the percentage (i.e., a number between 0 and 100) of people who did not survive
#### Solution
```{r}
precent.n.survived <- function(titanic){
100*(1-mean(titanic$Survived))
}
titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")
precent.n.survived(titanic)
```
*Learning goal*: use the fact that the mean of a vector of logicals is the proportion of `TRUE`s there. (And functions.)
### Problem 1(b)
You can use `strsplit` to split character strings into words. For example, the following splits a character string into words assuming that the words are separated by a space
```{r}
words <- strsplit("Go Tigers", " ")[[1]]
words
words[1]
words[2]
```
Write a function that takes in the name of a person (as it appears in the Titanic dataset) and returns the persons last name
#### Solution
```{r}
get.last.name <- function(full.name){
words <- strsplit(as.character(full.name), " ")[[1]]
words[length(words)]
}
```
*Learning goal*: understand the specifications for a new function, and use that function to achieve a task
### Problem 1(c)
Add a column to the `titanic` dataset that contains just the last name for each row. You should use `sapply` and your function from Problem 1(b)
#### Solution
```{r}
titanic <- titanic %>% mutate(Last.Name = sapply(X = Name, FUN = get.last.name))
```
*Learning goal*: use `sapply` to compute a new column of a dataframe by applying a function to each element of an existing column. That is: combine using `sapply` to compute a function of each element of a vector with extracting and adding columns from/to data frames.
## Problem 2: Predicting using the last name
### Problem 2(a)
Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier? Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV).
The definitions are as follows:
$$FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}$$
$$FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}$$
$$PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}$$
Can you come up with a theory that would explain why you can predict survival using the last name?
#### Solution
We'll build the model using the entire dataset.
```{r}
fit <- glm(Survived ~ Last.Name, family = binomial, data = titanic)
pred <- predict(fit, newdata = titanic, type = "response") > .5
mean(pred == titanic$Survived)
```
Most of what's going on is that the last name identifies just one person in the dataset, for whom the answer is known. So it's not surprising that the predictions are good.
*Learning goal*: understand that when predicting using a categorical variable, we are predicting the average for each category, and apply this fact to a new situation. Be able to run logistic regression in R.
```{r}
FPR <- sum(pred == T & titanic$Survived == 0)/sum(titanic$Survived==0)
FNR <- sum(pred == F & titanic$Survived == 1)/sum(titanic$Survived==1)
PPV <- sum(pred == T & titanic$Survived == 1)/sum(pred == T)
```
*Learning goal*: operate on logical vectors with boolean operators. Translate the definitions of FPR/FNR/PPV/... into code.
### Problem 2(b)
One idea is to try to predict using the last name, but for different people than the ones in the training set.
```{r}
names.test <- titanic %>% group_by(Last.Name) %>%
mutate(num = n()) %>%
filter(num >= 2) %>%
summarize(Name = Name[1]) %>%
arrange(Last.Name)
names.train <- titanic %>% group_by(Last.Name) %>%
mutate(num = n()) %>%
filter(num >= 2) %>%
summarize(Name = Name[2]) %>%
arrange(Last.Name)
test.set <- titanic %>% filter(Name %in% names.test$Name)
train.set <- titanic %>% filter(Name %in% names.train$Name)
fit <- glm(Survived ~ Last.Name, family = binomial, data = train.set)
mean((predict(fit, newdata = train.set, type = "response") > 0.5) == train.set$Survived)
mean((predict(fit, newdata = test.set, type = "response") > 0.5) == test.set$Survived)
```
Note that there actually is a correlation between the survival of people with the same name, since they are usually related.
*Learning goal*: demonstrate an intuitive understanding of overfitting.
### Problem 3
### Problem 3: `ggplot`
Use `ggplot` to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?
#### Solution
```{r message = F, warning=F}
ggplot(data = titanic, mapping = aes(x = Pclass, y = Fare)) +
geom_smooth(mapping = aes(color = Sex), method = "loess")
```
*Learning goal*: display the relationship between three variables on one plot using `ggplot`.
We see that the average fare for male passengers was smaller than for female passengers, in first class. If you look further, you'll find that the fare is actually recorded *per family*.
### Problem 4: `ggplot`
Use `ggplot` to visualize the relationship between the age of a passenger and their probability of survival, based on a model the uses the passenger's age, sex, class, and fare, as well as based on a model that uses just the passenger's age. Superimpose the two plots.
#### Solution
```{r}
fit.big <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic)
fit.small <- glm(Survived ~ Age, family = binomial, data = titanic)
titanic[, "pred.big"] <- predict(fit.big, newdata = titanic, type = "response")
titanic[, "pred.small"] <- predict(fit.small, newdata = titanic, type = "response")
ggplot(data = titanic, mapping = aes(x = Age)) +
geom_smooth(mapping = aes(y = pred.big), method = "loess", color = "blue") +
geom_smooth(mapping = aes(y = pred.small), method = "loess", color = "red")
```
*Learning goal*: use `ggplot` to display the predictions of a model; display several curves on one plot using `ggplot`.
### Problem 5: Insurance rates
Write a function that will compute the total profit (or loss) if the if the insurance agent uses a logistic regression model that takes that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the titanic (note: this is "cheating," since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss). The policy the insurance agent uses the following procedure:
* If the person's probability of survival is less than `p`, turn them away.
* If the person's probability of survival is greater than or equal to `p`, sell them insurance for `premium`.
* If the person does not end up surviving, pay `benefit` to the estate.
Find reasonable `p`, `premium`, and `benefit` which would yield a profit in the case of the Titanic. You should do that by just trying to call your function manually using different values.
(Again, note: this is an exercise, and not a realistic example. Most liners did not sink; no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)
#### Solution
* If the person’s probability of survival is less than p, turn them away.
* If the person’s probability of survival is greater than or equal to p, sell them insurance for premium.
* If the person does not end up surviving, pay benefit to the estate.
```{r}
total.profit <- function(fit, p, premium, benefit, dat){
dat[, "pred"] <- predict(fit, newdata=titanic, type = "response")
n.accepted <- sum(dat$pred > p)
n.paid.benefit <- sum( (dat$pred > p) & (dat$Survived == 0) )
total.revenue <- n.accepted * premium
total.expediture <- n.paid.benefit * benefit
total.revenue - total.expediture
}
p <- 0.90
premium <- 1600
benefit <- 50000
fit <- glm(Survived ~ Age + Sex + Pclass, family=binomial, data = titanic)
total.profit(fit, p, premium, benefit, titanic)
```
*Learning goal*: use techniques similar to the ones used for computing FPR/PPV/etc. to compute a quantity of interest. Understand the use of quantities like FPR etc. to compute the cost of a classifier.