SML310: Research Projects in Data Science

Fall 2019

Course staff


Course description   This seminar course supports students as they work on a data science project with a dataset that they select. The course introduces several core techniques in data science through lectures and mini-projects. Students will select a dataset of interest to them and produce an analysis or a data product, along with a project report, combining domain knowledge and technical expertise.

Course assignments

Mini-Projects
Mini-Project 1: Statistical Inference and Hierarchical Models (7%) due Oct. 13 at 11PM
Mini-Project 2: Python and Data Representations (6%) due Oct. 21 at 11PM
Mini-Project 3: NLP (7%) due Nov. 25 at 11PM
Mini-Project 4: Image data and PyTorch (7%) due Dec. 9 at 11PM

Assignments are only accepted up to 72 hours (3 days) after the deadline.

Course project
Initial project proposal (2%) due Oct. 2 at 11PM
Revised project proposal (10%) due Nov. 12 at 11PM
Project proposal presentation (10%), mid-November
Course project (40%), due on the Dean's Date at 11PM
Seminar participation
Piazza project proposal feedback (1%)

Logistics

Lectures
Mon 3:00pm-4:20pm, Wed 3:00pm-4:20pm, CSML 103
Precepts
Wed 7:30pm-8:20pm (P01) or Thurs 3:00pm-4:20pm (P02), CSML 103
Contact Information
Please ask questions on Piazza if they are relevant to everyone.
Office Hours
Monday and Wednesday 1pm-2pm in CSML 202. You can also email for an appointment or drop by to see if I'm in. Feel free to chat with me after lecture.

Course information

Expected Data Science Background
The course strives to accommodate students with a variety of backgrounds in data science. Some prior experience with programming and statistical analysis is expected, and extra support for learning Python will be provided at the beginning of the course.

Resources

Software

We will be using the Python NumPy/SciPy stack in this course. Python 2 and Python 3 are both acceptable.

The most convenient Python distribution to use is Anaconda. If you are using an IDE and download Anaconda, be sure to have your IDE use the Anaconda Python.
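One way to check which Python your IDE is actually running is to print the interpreter path from inside the IDE. The snippet below is a minimal sketch, not an official course tool: the function name describe_interpreter is made up for illustration, and the check for "conda" in the path is only a heuristic that assumes a default Anaconda install location.

```python
import sys


def describe_interpreter():
    """Return the path and version string of the running Python interpreter."""
    return sys.executable, sys.version


if __name__ == "__main__":
    path, version = describe_interpreter()
    print("Interpreter path: {}".format(path))
    print("Version: {}".format(version))
    # Heuristic only (an assumption, not a guarantee): default Anaconda installs
    # usually have "conda" somewhere in the interpreter path.
    print("Looks like Anaconda: {}".format("conda" in (path or "").lower()))
```

If the printed path does not point into your Anaconda installation, change the interpreter setting in your IDE and run the snippet again.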

I recommend the Pyzo IDE available here. Jupyter Notebooks are favored by some people, though I recommend developing using an IDE.

We will be using PyTorch and Stan/RStan towards the end of the course.

Cloud computing

If your project requires a substantial amount of compute power, I recommend signing up for AWS Educate to obtain $100 in free credits for AWS. Instructions for running RStudio Server on AWS Educate are here. GCP and Microsoft Azure also offer free credits for students.

Reading

Textbooks
Data analysis using regression and multilevel/hierarchical models by Andrew Gelman, Jennifer Hill. (Free e-book from the PU library)
Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (free pdf online from the author)
Pattern Recognition and Machine Learning by Christopher M. Bishop is a very detailed and thorough book on the foundations of machine learning. A good textbook to buy to have as a reference (free pdf from the author)
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman is also an excellent reference book, available on the web for free at the link.
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is a more accessible version of The Elements of Statistical Learning.
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is an advanced textbook with good coverage of deep learning and a brief introduction to machine learning.
Learning Deep Architectures for AI by Yoshua Bengio is in some ways better than Goodfellow et al., in my opinion.
Introduction to Deep Learning by Eugene Charniak is a from-the-ground-up introductory textbook. (If you'd like to buy the print book from The MIT Press directly, they provided a discount code you can use to get 30% off the price: MTSR20). Uses TensorFlow rather than PyTorch. Not as authoritative or comprehensive or free as Goodfellow et al., but an easier read.

Python Scientific Lecture Notes by Valentin Haenel, Emmanuelle Gouillart, and Gaël Varoquaux (eds) contains material on NumPy and working with image data in SciPy. (Free on the web.)
Online courses
Geoffrey Hinton's Coursera course contains great explanations of the intuition behind neural networks.
The CS229 Lecture Notes by Andrew Ng are a concise introduction to machine learning.
Andrew Ng's Coursera course contains excellent explanations of basic topics (note: registration is free).
Pedro Domingos's CSE446 at UW (slides available here) is a somewhat more theoretically flavoured machine learning course. Highly recommended.
CS231n: Convolutional Neural Networks for Visual Recognition at Stanford (archived 2015 version) is an amazing advanced course, taught by Fei-Fei Li and Andrej Karpathy. The course website contains a wealth of materials.
CS224d: Deep Learning for Natural Language Processing at Stanford, taught by Richard Socher. Like CS231n, but for NLP rather than vision. More details on RNNs are given here.
Python exercises
Online beginner exercises in Python are available at CodingBat
Papers
A list of papers that use mostly observational data is here

An inclusive environment

We strive to build and maintain an inclusive environment in class, one that allows every student to reach their full potential. Please do not hesitate to contact me and/or your preceptor if you need special accommodations or have any concerns.

Design credit: CS229, Jan 2019.