SML201 Project 1 – Predictive Modelling and Fairness

General guidelines

You will submit an Rmd file and the pdf file that was knit from it. Your code should be general enough that if the dataset contents were changed (with the same column names, of course), the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

You should use ggplot for plotting and tidyverse/dplyr (where needed) for wrangling the data.

When you are asked to compute something, use code, and include the code that you used in the report.

Auditing the COMPAS score

ProPublica obtained the public records on over 10,000 criminal defendants in Broward County, Florida. They also computed a variable that indicates whether each person was arrested within two years of being assessed. The data is available here. You can read the data in using compas <- read.csv("")

The COMPAS scores you will be analyzing are the “decile scores” (the column decile_score). We are interested in predicting the recidivism of individuals. We’ll be looking at the is_recid column in the dataset.


Correct submission (5%)

To earn this 5%, you must:

  • Submit the Rmd file and the pdf file on Gradescope correctly

  • Have one partner submit the files, and then add the other partner to the team on Gradescope. After this initial step, both partners can re-submit a new version as many times as needed. The latest submission will be graded.

  • When submitting the pdf file, mark out all the parts on Gradescope correctly. Gradescope will prompt you to do that when you submit

Report neatness/readability (5%)

Your pdf report must be neat and readable. Some tips:

  • Do not include multiple pages of printed out numbers (as for example would happen if you display an entire data frame in the report)

  • Briefly (usually a sentence or two is enough) say what output you are presenting. It should be possible to read your report without having to refer back to the project handout

  • If you are writing English text, it should usually be outside of the R chunks. An exception is when you are explaining a particular line of code, in which case using comments (i.e. # this is a comment) in the R code is appropriate. Note that it is not a requirement that every line of code is commented, and usually comments are not needed.

  • Your English text should be properly formatted – make use of headers, write in paragraphs, etc. Your R code should be properly indented, similarly to what we do in lecture.

Part 1: Comparing the scores of black and white defendants (10%)

Make two histograms: one with the decile scores for white defendants, and one with the decile scores for black defendants. Make sure to include appropriate captions and labels.
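As a rough illustration of the kind of plot described above, here is a minimal sketch on made-up data. The race column name and its two values are assumptions (check the actual column names in the dataset), and the scores are randomly generated, so the shape of the histograms means nothing.

```r
library(ggplot2)

# Toy stand-in for the real data. The `race` column and its values are
# assumptions; the scores are random and carry no meaning.
set.seed(1)
toy <- data.frame(
  race = rep(c("African-American", "Caucasian"), each = 50),
  decile_score = sample(1:10, 100, replace = TRUE)
)

# One histogram per group, stacked vertically so the x-axes line up.
# binwidth = 1 with boundary = 0.5 centres each bar on an integer score.
p <- ggplot(toy, aes(x = decile_score)) +
  geom_histogram(binwidth = 1, boundary = 0.5, colour = "white") +
  facet_wrap(~ race, ncol = 1) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Decile scores by group (toy data)",
    x = "COMPAS decile score",
    y = "Number of defendants"
  )
p
```

Faceting is only one option; two separate plots built from filtered data frames would satisfy the same requirement.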

Part 2: Initial evaluation of the COMPAS scores (15%)

Suppose that defendants with scores that are greater than or equal to 5 are considered to be “high-risk,” and other defendants are considered to be “low-risk.”

Compute the false positive rate, the false negative rate, and the correct classification rate for the entire population, for the population of white defendants separately, and for the population of black defendants separately. State the tentative conclusions that you can draw about the fairness of the COMPAS scores.

To obtain the context for the potential informativeness of the scores, compute the overall recidivism rate in the dataset. Comment on the difference between the overall recidivism rate and the correct classification rate using the score.
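To make the three rates concrete, here is a minimal sketch on an eight-row toy data frame. The race column name and the group labels "A" and "B" are assumptions made for illustration; only decile_score and is_recid come from the handout. The sketch treats is_recid == 1 as "re-arrested" and score >= 5 as "high-risk".

```r
library(dplyr)

# Toy data, made up for illustration. `race` and its values are assumptions.
toy <- data.frame(
  race         = c("A", "A", "A", "A", "B", "B", "B", "B"),
  decile_score = c(2, 7, 5, 1, 9, 4, 6, 3),
  is_recid     = c(0, 1, 0, 0, 1, 1, 0, 0)
)

rates_by_group <- toy %>%
  mutate(high_risk = decile_score >= 5) %>%
  group_by(race) %>%
  summarise(
    # FPR: share of non-recidivists labelled high-risk
    FPR = sum(high_risk & is_recid == 0) / sum(is_recid == 0),
    # FNR: share of recidivists labelled low-risk
    FNR = sum(!high_risk & is_recid == 1) / sum(is_recid == 1),
    # CCR: share of all defendants whose label matches the outcome
    CCR = mean(high_risk == (is_recid == 1))
  )

# Overall recidivism rate, for context
overall_rate <- mean(toy$is_recid)
```

Dropping the group_by() step gives the same three rates for the entire population.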

Part 3: Altering the threshold (15%)

You can obtain a prediction by thresholding the COMPAS score. For example, if the threshold is 5, you would predict “will reoffend” for score >= 5, and “will not reoffend” otherwise.

For the possible thresholds [0.5, 1, 1.5, 2, 2.5, 3, ..., 9.5], compute the FPR, FNR, and correct classification rate for the entire population, for white defendants, and for black defendants. Plot the results using ggplot. For each demographic, you should produce one plot with three curves, with the thresholds being on the x-axis. Make sure that you display the appropriate labels and captions.
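One possible way to organise the threshold sweep is sketched below on toy data. It builds a "long" data frame with one row per threshold and metric, which is the shape ggplot expects for drawing several curves on one plot. The toy values are made up; for a per-demographic plot you would first filter the data to that group.

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(
  decile_score = c(2, 7, 5, 1, 9, 4, 6, 3),
  is_recid     = c(0, 1, 0, 0, 1, 1, 0, 0)
)

thresholds <- seq(0.5, 9.5, by = 0.5)

# One small data frame per threshold, stacked into a long table
sweep <- do.call(rbind, lapply(thresholds, function(t) {
  high_risk <- toy$decile_score >= t
  recid <- toy$is_recid == 1
  data.frame(
    threshold = t,
    metric = c("FPR", "FNR", "CCR"),
    value = c(
      mean(high_risk[!recid]),   # FPR
      mean(!high_risk[recid]),   # FNR
      mean(high_risk == recid)   # CCR
    )
  )
}))

# Three curves on one plot, thresholds on the x-axis
p <- ggplot(sweep, aes(x = threshold, y = value, colour = metric)) +
  geom_line() +
  labs(
    title = "Error rates by threshold (toy data)",
    x = "Threshold on decile score",
    y = "Rate",
    colour = "Metric"
  )
p
```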

Part 4: Trying to reproduce the score (15%)

Fit a logistic regression model that predicts the probability of recidivism using the age and the number of priors of the defendant.

State the interpretations of the coefficients of the model.

For the threshold of 0.5 on the probability, obtain the FPR, FNR, and correct classification rates for the model for the entire population, for black defendants, and for white defendants, on both the training and the validation sets.

For a 30-year-old, what is the effect on the predicted probability of a re-arrest of one more prior offense?
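The mechanics of this part can be sketched on simulated data as follows. The column names age and priors_count, the 70/30 split, and the simulated coefficients are all assumptions made for illustration; the real dataset and your own split will give different numbers. Note that in a logistic model the effect of one more prior on the predicted probability depends on the starting number of priors, so the sketch compares 2 priors with 3 as an arbitrary example.

```r
# Simulated stand-in for the real data. `age` and `priors_count` are assumed
# column names; the generating coefficients are arbitrary.
set.seed(201)
n <- 500
toy <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2)
)
toy$is_recid <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$age + 0.2 * toy$priors_count))

# Assumed 70/30 training/validation split
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

fit <- glm(is_recid ~ age + priors_count, data = train, family = binomial)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
odds_ratios <- exp(coef(fit))

# Predicted probability for a 30-year-old with 2 vs. 3 priors
new_people <- data.frame(age = 30, priors_count = c(2, 3))
p <- predict(fit, newdata = new_people, type = "response")
effect_of_one_more_prior <- diff(p)

# Classify the validation set at the 0.5 probability threshold
valid_pred <- predict(fit, newdata = valid, type = "response") >= 0.5
valid_ccr <- mean(valid_pred == (valid$is_recid == 1))
```

The same FPR, FNR, and CCR calculations used for the decile scores apply to valid_pred, and likewise to predictions on the training set.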

Part 5: Adding variables (10%)

There is more data available. For at least 4 more input variables, try including each in the model, and use the validation set to decide whether the variable should be included in the model. Your report should include documentation of the process you used, and the final model you decided on using the validation set.
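One way to structure the comparison is a small helper that fits a model on the training set and scores it on the validation set, sketched below on simulated data. The helper name validation_ccr, the candidate column extra, the column names age and priors_count, and the 70/30 split are all hypothetical. Using the validation CCR as the selection criterion is also just one reasonable choice.

```r
# Simulated stand-in for the real data; all column names are assumptions,
# and `extra` is a hypothetical candidate variable.
set.seed(5)
n <- 600
toy <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2),
  extra = rnorm(n)
)
toy$is_recid <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$age + 0.2 * toy$priors_count))

train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Hypothetical helper: fit on the training set, report CCR on the validation set
validation_ccr <- function(formula) {
  fit <- glm(formula, data = train, family = binomial)
  pred <- predict(fit, newdata = valid, type = "response") >= 0.5
  mean(pred == (valid$is_recid == 1))
}

ccr_base  <- validation_ccr(is_recid ~ age + priors_count)
ccr_extra <- validation_ccr(is_recid ~ age + priors_count + extra)

# Keep `extra` only if it improves validation performance
keep_extra <- ccr_extra > ccr_base
```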

Part 6: Adjusting thresholds (15%)

One appealing definition of a fair model is that the model has the same probability of labelling a defendant low-risk regardless of demographics if the defendant will not end up being re-arrested. Build such a model by finding a combination of thresholds (which can vary by demographics) that produces such a result. Try to keep the CCR as high as possible while satisfying the fairness constraint.

Report the FNR, FPR, and the correct classification rate of the system on the validation set, for the whole population.

For this part, you may manually try different thresholds (while documenting your process) rather than writing code to do that for you.

(You may fit two models separately – one for white defendants and one for black defendants – or use a single model and simply find two thresholds.)
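The fairness criterion above amounts to equal false positive rates across groups: among defendants who are not re-arrested, the share labelled high-risk should match. A minimal sketch of a helper for trying out one pair of thresholds is shown below. The helper name evaluate_thresholds, the group labels "A" and "B", and the simulated probabilities are all hypothetical; in your report the probabilities would come from your fitted model on the validation set.

```r
# Simulated stand-in: predicted probabilities that differ by group
set.seed(6)
n <- 400
toy <- data.frame(group = rep(c("A", "B"), each = n / 2))
toy$prob <- ifelse(toy$group == "A", rbeta(n, 2, 3), rbeta(n, 3, 2))
toy$is_recid <- rbinom(n, 1, toy$prob)

# Hypothetical helper: `thresholds` is a named vector, one entry per group
evaluate_thresholds <- function(data, thresholds) {
  # Look up each row's threshold by its group name
  high_risk <- data$prob >= thresholds[as.character(data$group)]
  recid <- data$is_recid == 1
  list(
    # FPR within each group: high-risk share among non-recidivists
    FPR_by_group = tapply(high_risk[!recid], data$group[!recid], mean),
    # CCR for the whole population
    CCR = mean(high_risk == recid)
  )
}

# Try one pair of thresholds; adjust until the two FPRs are close
res <- evaluate_thresholds(toy, c(A = 0.5, B = 0.6))
res$FPR_by_group
res$CCR
```

Calling the helper repeatedly with different pairs, and recording each attempt, is one way to document a manual search.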

Part 7: Data visualization (10%)

Pick two variables in the dataset, and produce a piece of data visualization that shows the relationship between the two variables and the COMPAS risk scores.

Explain what trends you are observing and briefly justify your choices regarding how you made the visualization.