Course Description

CS328 introduces students to the statistical and algorithmic ideas that underlie the field of data science. By the end of the course, students will be familiar with algorithms for extracting patterns from medium to large data sets. They should also be comfortable with the various statistical model-fitting techniques and understand how to argue about statistical significance. Students will also be exposed to practical tools (e.g. numpy/scipy/matplotlib/sklearn if using Python, or analogous tools in other languages). This course will be valuable to anyone interested in further study or work in data analytics or machine learning.

Instructor: Anirban Dasgupta, Office: AB 6/407c. Please email for an appointment.

Teaching Assistants: Shrutimoy Das, Sachin Yadav

Lecture Schedule

There are roughly 40 lecture hours in the calendar. The following is a tentative order in which the topics will be covered.

Foundations

  • Data representation, distance measures

  • Central limit theorem

  • Random variables and tail inequalities, hashing, balls and bins.

  • Practical example of hashing – MinHash

Clustering and low-rank approximations

  • k-means, k-center, Lloyd's algorithm, k-means++

  • Clustering in graphs – expansion, conductance, modularity.

  • Spectral algorithms for expansion and conductance.

  • Louvain algorithm for modularity.

  • Learning mixture models – Gaussians.

  • SVD and its applications, other matrix factorizations.

Dealing with massive data

  • Efficient data summaries – Bloom filters, bit arrays.

  • Streaming model: samples and sketches – reservoir sampling, counting distinct elements, heavy-hitter data structures (Misra-Gries, Count-Min, Count-Sketch)

Random Walks

  • Random walks and convergence, connection with eigenvalues

  • PageRank, HITS and their interpretations

  • Gibbs sampling and Markov chains

Drawing inference from the data

  • Sampling, estimation, confidence intervals, bootstrapping

  • Hypothesis testing and its variants – multiple hypothesis testing, Bayes factor

  • Linear regression and its generalizations, model evaluation, goodness of fit tests