In this assignment, you will be exploring data related to COVID-19. Specifically, you will be exploring the SIR model, and connecting it to actual data about COVID-19.

General guidelines

You will submit an Rmd file and the pdf file that was knit from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

When you are asked to compute something, use code (rather than compute things outside of R and then input the answer), and include the code that you used in the report. (you may deviate from this if the computation is really trivial.)

The SIR Model

The SIR (Susceptible-Infectious-Recovered) model is a simple epdimiological model. Despite (and because) of its simplicity, the SIR model has been used to model the spread of SARS-Cov-2.

In the SIR model, the population is divided into three subsets:

  • Susceptible (S people): people who may be infected
  • Infected (I people): people who are currently infected
  • Recovered (R people): people who recovered and can no longer be infected

The spread of the disease is characterized by the parameters \(\beta\) and \(\gamma\).

  • If the infection rate in the population is \(\rho\), the fraction of the susceptible population that will be infected in the next day is \(\beta\times \rho\).
  • The fraction of infected people who recover every day is \(\gamma\).

Here is an R implementation of a run of the model. We start with a population of 100,000, with 1000 infected people. By the end of 100 days, we see that no one is infected anymore.

S <- 99000
I <- 1000
R <- 0
N <- S + I + R
pop <- c(0, S, I, R)
beta <- 1.5
gamma <- 0.1

# Repeat 100 times
for(t in 1:100){
  inf.prop <- I/N 
  new.infected <- min(S, floor(beta * S * inf.prop))
  new.recovered <- min(I, ceiling(gamma * I))
  I <- I + new.infected - new.recovered
  S <- S - new.infected
  R <- R + new.recovered
  # Add another row tp pop, with the current day t, and the current values
  # of S, I, and R
  pop <- rbind(pop, c(t, S, I, R))
colnames(pop) <- c("day", "S", "I", "R")

Grading scheme:

  • 10 points for a professional and well-formatted report. Points will be taken off if the report is not well-formatted and is difficult to read.

  • 5 points for submitting the Rmd file on Gradescope and for submitting the pdf file on Gradescope. Pages on Gradescope should be assigned to questions.

Part 1 (10 pts): Investigating the SIR model

Experiment with different values of \(\beta\) and \(\gamma\). Make plots demonstrating the effect of increasing and decreasing \(\beta\) and \(\gamma\).

Part 2 (5 pts): Basic Reproduction Ratio

The basic reproduction ratio is defined as \(R_0 = \frac{\beta}{\gamma}\). Investigate how the curves change for a given \(R_0\), but different values of \(\beta\) and \(\gamma\). The basic reproduction ratio should roughly correspond to the expected number of new infections from a single infected individual, when nearly the entire population is susceptible.

Demonstrate this empirically by trying different \(R_0\)s, and also the same \(R_0\) that was produced with different (\(\beta, \gamma\)) combinations. (For example \(R_0 = \frac{0.5}{0.25} = \frac{0.1}{0.05}\)).

Part 3 (10 pts): Herd Immunity

Herd immunity means that if an infection is introduced into the population (i.e., one individual is infected), the disease will not spread to most of the susceptible population.

Find a setting of the model (i.e., values for \(\beta\) and \(\gamma\)) such that a population that consists of 50% recovered people has herd immunity.

Document the process by which you found those values.

Make sure that you are showing when herd immunity obtains when about 50% of the population is Recovered. That is, it’s not acceptable to just set the \(\beta\) and \(\gamma\) to 0. Check that with the \(\beta\) and \(\gamma\) that you set, the recovered population goes up to at least 40% if you start from 0 Recovered.

Part 4 (15 pts): Reading in Current Data

Read in the Johns Hopkins COVID-19 data here:

confirmed <- read.csv("")

You now need to use pivot_longer in order to convert the data to a usable format.

The function pivot_longer requires you to specify the name of the columns from which you want to take the data that will be put into the Confirmed column. One way to do it is to get the names of all the columns:

(I will display only the first 20)

##  [1] "Province.State" "Country.Region" "Lat"            "Long"          
##  [5] "X1.22.20"       "X1.23.20"       "X1.24.20"       "X1.25.20"      
##  [9] "X1.26.20"       "X1.27.20"       "X1.28.20"       "X1.29.20"      
## [13] "X1.30.20"       "X1.31.20"       "X2.1.20"        "X2.2.20"       
## [17] "X2.3.20"        "X2.4.20"        "X2.5.20"        "X2.6.20"

You can select the names of the columns from which you want to take the data using indices other than 1:20.

You will want to compute the number of new cases every day as well. Here is a trick you can use:

v <- c(1, 5, 10, 12, 15, 30)
c(v, 0) - c(0, v)
## [1]   1   4   5   2   3  15 -30

Similarly to Test 1, you will need a “days since Jan. 22” column. You may use the following function with sapply

days.since.jan.22 <- function(date.str){
  date.str <- substr(date.str, 2, str_length(date.str))
  comps <- strsplit(date.str, "[.]")[[1]]
  month <- as.numeric(comps[1])
  day <- as.numeric(comps[2])

  if (month == 1){
    day - 22
  }else if(month == 2){
    9 + day
  }else if(month == 3){
    9 + 29 + day
  }else if(month == 4){
    9 + 29 + 31 + day
  }else if(month == 5){
    9 + 29 + 31 + 30 + day 

Part 5 (30 pts): The SIR model and Country Data

Write a function that displays, on the same graph, the number of new cases in a given country, and the number of new cases according to the SIR model, with parameters \(\beta\) and \(\gamma\). The function should take in the population of the country.

For the US, Italy, China, and South Korea, attempt to find values of \(\beta\) and \(\gamma\) that predicts the reported number of confirmed cases as well as possible. Use the sum of the squared differences between the log(cofirmed) and log(predicted confirmed).

Explain how you have done that. The easiest way to do this is to display the curves for various \(\beta\)s and \(\gamma\)s and in order to match the shape to those of the US, Italy, China, and South Korea, experimentally. (Note that enacting social distancing policies changes the \(\beta\), so you will want to fit the curves separately pre-social distancing and post-social distancing).

A different approach (which you do not have to take) is to automatically look through different combinations of values of \(\beta\) and \(\gamma\), similarly to, for example, Precept 4. This is more challenging than the approach suggested above.

(Note: since the death rate is small (as is the birth rate etc.), you can assume that the population consists only of the S, I, and R components, and that the total population does not change).

Discuss the range of the predictions that one might make based on what the plausible values of \(\beta\) and \(\gamma\) are.

For full credit, it is enough to estimate the \(\beta\) and \(\gamma\) after social distancing measures were taken.