SML201: Introduction to Data Science

Spring 2020

Course team

Course description

Introduction to Data Science provides a practical introduction to the burgeoning field of data science. The course introduces students to the essential tools for conducting data-driven research, including the fundamentals of programming techniques and the essentials of statistics. Students will work with real-world datasets from various domains; write computer code to manipulate, explore, and analyze data; use basic techniques from statistics and machine learning to analyze data; learn to draw conclusions using sound statistical reasoning; and produce scientific reports. No prior knowledge of programming or statistics is required.

Course assignments

Problem sets
Problem set 1: Vectors and Data Frames (3%). Due: Feb. 24, 9 p.m. (originally Feb. 21). Solutions.
Problem set 2: Wrangling Data Frames and Linear Regression (3%). Due: March 5, 9 p.m. (originally March 2). Solutions.
Projects (topics are tentative)
Project 1: Auditing the COMPAS score (11%). Due: April 6, 9 p.m. (originally March 30).
Project 2: Cancer and Gene Expression Data (11%). Due: April 20, 9 p.m. (originally April 13).
Project 3: Risk Prediction for ICU Patients (12%). Due: May 11, 9 p.m.

Every student has a total of 6 grace days to use throughout the term (except on Project 3) to avoid the lateness penalty. The penalty is 10% per 24 hours late, with lateness rounded up to a whole number of days. You cannot use more than three grace days at a time.
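The arithmetic of the lateness policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's official calculation: the function name `late_penalty` is hypothetical, grace days are assumed to be applied automatically before any penalty, and the penalty is expressed as percentage points. The policy does not say whether the 10% is taken from the earned score or from the maximum score, so the sketch returns only the penalty figure.

```python
import math

PENALTY_PERCENT_PER_DAY = 10  # 10% per 24 hours, from the policy
MAX_GRACE_AT_A_TIME = 3       # no more than three grace days at a time
TOTAL_GRACE_DAYS = 6          # total grace days per term


def late_penalty(hours_late, grace_days_remaining=TOTAL_GRACE_DAYS):
    """Return (penalty_percent, grace_days_used) for one submission.

    Assumption: available grace days are spent first, and only the
    remaining late days are penalized.
    """
    if hours_late <= 0:
        return 0, 0
    # Lateness is rounded up to a whole number of days.
    days_late = math.ceil(hours_late / 24)
    grace_used = min(days_late, MAX_GRACE_AT_A_TIME, grace_days_remaining)
    penalized_days = days_late - grace_used
    penalty = min(100, PENALTY_PERCENT_PER_DAY * penalized_days)
    return penalty, grace_used
```

For example, under these assumptions a submission 25 hours late counts as two days late and is fully covered by two grace days, while one that is 100 hours late counts as five days late, uses three grace days, and is penalized for the remaining two days.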

In-class tests
Term test 1: Thursday March 12. Reference sheet. Study guide. 2019 paper + solutions
Term test 2: Tuesday April 28

Precept Assignments
Week of Feb. 3: Precept 1: Intro, functions (Rmd source). Solutions (Rmd source)
Week of Feb. 10: Precept 2: Vectors and Data Frames (Rmd source). Solutions. Q4 video solutions.
Week of Feb. 17: Precept 3: Wrangling Data Frames (Rmd source). Solutions. Video solutions: Q1, Q2, Q3, Q4
Week of Feb. 24: Precept 4: Linear Regression and sapply (Rmd source). Solutions (Rmd source). Video solutions: Q1, Q2, Q3.
Week of Mar. 2: Precept 5: Logistic regression and ggplot. Solutions (Rmd source). Video solutions: Q1, Q2, Q2b-end, Q3, Q4, Q5.
Week of Mar. 9: No new assignment, but course staff available to answer questions during precept time.
Week of Mar. 23: Precept 6: Overfitting, a preview of tidy data. Solutions for Q1 (Rmd source).
Week of Mar. 30: Precept 7: R Markdown, tidy data, and fairness criteria


Class meetings

The morning section meets on Zoom (originally McCormick Hall 101) on Tues 11:00am-12:20pm and Thurs 11:00am-12:20pm Eastern Time.

The afternoon section meets on Zoom (originally Robertson Hall 001) on Tues 3:00pm-4:20pm and Thurs 3:00pm-4:20pm Eastern Time.

Precept meetings

See here for precept logistics/assignments and here for links

Instructor office hours
Mondays and Fridays 2:30-3:30, or email for an appointment. See the link below for Zoom links.
Preceptor office hours
Preceptors will be available during scheduled office hours or by appointment.

Course information

35%: Projects
6%: Problem Sets
32%: Tests
5%: iClicker quizzes
22%: In-precept assignments
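As a rough illustration of how the category weights above combine, here is a minimal Python sketch. It assumes each category score is the percentage (0 to 100) earned within that category; how individual items are combined inside a category is not specified here, and the names `WEIGHTS` and `course_grade` are hypothetical.

```python
# Category weights in percent, as listed above; they sum to 100.
WEIGHTS = {
    "projects": 35,
    "problem_sets": 6,
    "tests": 32,
    "iclicker": 5,
    "precept": 22,
}


def course_grade(scores):
    """Weighted average of category scores (each on a 0-100 scale)."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "projects": 90,
    "problem_sets": 100,
    "tests": 80,
    "iclicker": 100,
    "precept": 95,
}
overall = course_grade(example)  # 89.0 for these made-up scores
```

The example scores are invented purely to show the calculation.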



Please install R and RStudio as soon as possible.


Statistical Thinking for the 21st Century by Russell A. Poldrack (Free e-book available from the book website)
Data Visualization: A practical introduction by Kieran Healy (Free e-book draft available from the book website)
R for Data Science by Garrett Grolemund and Hadley Wickham. (Free e-book available from the book website)
SML201 students will have access to online DataCamp courses for free, courtesy of DataCamp. See the course Piazza for details on how to sign up.

An inclusive environment

We strive to build and maintain an inclusive environment in class, one that allows every student to reach their full potential. Please do not hesitate to contact me and/or your preceptor if you need special accommodations or have any concerns.

Design credit: CS229, Jan 2019.