Syllabus

This new class on Data Engineering will cover the principles and practices of managing data at scale, with a focus on use cases in data analysis and machine learning. We will cover the entire life cycle of data management and science, ranging from data preparation to exploration, visualization and analysis, to machine learning and collaboration.

The class will balance foundational concerns with exposure to practical languages, tools, and real-world considerations. We will study the foundations of prevalent data models in use today, including relations, tensors, and dataframes, and the mappings between them. We will study SQL as a means to query and manipulate data at scale, including performance concerns like views and indexes, query processing and optimization, and transactions, all from a user perspective. We will study the foundations and realities of data preparation, including hands-on work with real-world data using standard Python and SQL frameworks. We will examine data exploration modalities for non-programmers, including the fundamentals behind spreadsheet systems and interactive visual analytics packages. We will look at approaches for managing the machine learning lifecycle: data preparation, model selection and training, and model serving and monitoring. Time permitting, we will look at technologies for moving, sharing, and caching data, including event streaming systems, key-value/document stores, log analytics, and search engines.

For this year alone, if you have already taken CS (W)186 (or are planning to take it concurrently) or INFO 290 (human-in-the-loop data management), you will not be permitted to take this class.

Prerequisites

COMPSCI C100/DATA C100/STAT C100 or COMPSCI 189 or INFO 251 or DATA 144/INFO 254 or equivalent upper-division course in data science. COMPSCI 61A or COMPSCI 88 or INFO 206B or equivalent courses in programming. This class will not assume deep experience with databases or big data solutions.

Enrollment

The class is currently full with a long waitlist. Sadly, due to teaching budget cuts, we won’t be able to expand the class or take concurrent enrollment students this semester, and we will let the waitlist play out on its own.

Communications

Please make sure you are enrolled in Piazza and Gradescope. Piazza is our primary channel for communication and announcements, and you are responsible for checking it frequently. We plan on using bCourses only for lecture webcasts. All assignments are submitted on Gradescope.

Grading

64% of your grade will come from projects, and 36% of your grade will come from multivitamins. Within each category, every assignment is weighted equally. There will be no exams in the course this semester. Please note that we will be lenient in grading because of the experimental nature of the course.

Projects

Throughout the semester, we will release 5 programming assignments via Piazza and the website. The 5th project will be extra credit for undergraduate students enrolled in CS 194-35 and required for graduate students enrolled in Info 290T-2. For CS 194-35 students, the first four projects are each worth 16% of your grade, and you can earn up to 8% extra credit from the 5th project. For Info 290T-2, each of the five projects is worth 12.8% of your grade.

Multivitamins

Multivitamins are short written assignments designed to keep you on schedule and check your understanding of the basics from lecture. They will mostly consist of multiple choice questions covering material that is not covered in the projects. If you are struggling with any of the questions on the multivitamin, you are encouraged to come to office hours for help. We will have 5 multivitamins throughout the semester. Each multivitamin will be worth 7.2% of your grade.
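The weights above can be checked with a little arithmetic. The following is a minimal sketch, not the staff's grading script: the function name, the track labels as strings, and the 0-to-1 score scale are made up for illustration, and it assumes that partial credit scales linearly and that extra credit is simply added on top of the total.

```python
# Weights as stated in the Grading, Projects, and Multivitamins sections.
CS_PROJECT_WEIGHT = 16.0        # each of the first four projects, CS 194-35
CS_EXTRA_CREDIT_WEIGHT = 8.0    # maximum extra credit from the 5th project
INFO_PROJECT_WEIGHT = 12.8      # each of the five projects, Info 290T-2
MULTIVITAMIN_WEIGHT = 7.2       # each of the five multivitamins


def course_grade(project_scores, multivitamin_scores, track):
    """Return a course percentage from per-assignment scores in [0, 1].

    project_scores and multivitamin_scores each hold five scores.
    track is "CS 194-35" or "Info 290T-2" (hypothetical labels).
    """
    if len(project_scores) != 5 or len(multivitamin_scores) != 5:
        raise ValueError("expected five projects and five multivitamins")

    if track == "CS 194-35":
        # Four required projects plus the 5th as extra credit (assumed additive).
        projects = sum(CS_PROJECT_WEIGHT * s for s in project_scores[:4])
        projects += CS_EXTRA_CREDIT_WEIGHT * project_scores[4]
    elif track == "Info 290T-2":
        # All five projects are required and weighted equally.
        projects = sum(INFO_PROJECT_WEIGHT * s for s in project_scores)
    else:
        raise ValueError(f"unknown track: {track}")

    multivitamins = sum(MULTIVITAMIN_WEIGHT * s for s in multivitamin_scores)
    return projects + multivitamins
```

Under these assumptions, a CS 194-35 student with full marks on everything, including the 5th project, ends up at 64 + 8 + 36 = 108 percent, while an Info 290T-2 student with full marks ends up at 100 percent.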

Office Hours

Office hours are a great place to go for help with multivitamins, projects, or any other content-related questions. You can find a list of office hours under the Staff tab on this page. The course calendar under the Calendar tab also shows the office hours for the week. We will be using an online office hours queue for all office hours besides professor office hours.

Late Policy

You will get 4 slip days for projects and 4 slip days for multivitamins. Note that these are separate, so you will not receive extra project slip days if you do not use all of your multivitamin slip days. Likewise, using a slip day for a project will not use up one of your multivitamin slip days. Slip days are applied automatically in whatever way maximizes your score. Once you have used all of your slip days for an assignment category, you’ll be docked 25% of your score on an assignment for each additional day it is late. This applies to both projects and multivitamins. Note that submission times are rounded up to the next day. That is, 2 minutes late = 1 day late.
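The late policy can be sketched in code as follows. This is an illustration only, not the actual grading script: the function names are hypothetical, and it assumes the 25% penalty is linear per late day (floored at zero) and that "optimizing your score" means maximizing the sum of equally weighted scores within one category.

```python
import math
from itertools import product

SLIP_DAYS_PER_CATEGORY = 4   # separate budgets for projects and multivitamins
PENALTY_PER_DAY = 0.25       # fraction of the score docked per extra late day
MINUTES_PER_DAY = 24 * 60


def days_late(minutes_late):
    """Round lateness up to whole days: 2 minutes late counts as 1 day."""
    if minutes_late <= 0:
        return 0
    return math.ceil(minutes_late / MINUTES_PER_DAY)


def penalized_score(raw_score, late_days, slip_days_used):
    """Score after docking 25% per late day not covered by slip days."""
    uncovered = max(0, late_days - slip_days_used)
    return raw_score * max(0.0, 1.0 - PENALTY_PER_DAY * uncovered)


def best_slip_allocation(assignments, budget=SLIP_DAYS_PER_CATEGORY):
    """Try every way to spend slip days within one category.

    assignments is a list of (raw_score, late_days) pairs.
    Returns (best_total_score, slip_days_per_assignment).
    """
    # Never spend more slip days on an assignment than it is late.
    choices = [range(min(late, budget) + 1) for _, late in assignments]
    best_total, best_alloc = -1.0, None
    for alloc in product(*choices):
        if sum(alloc) > budget:
            continue
        total = sum(
            penalized_score(score, late, used)
            for (score, late), used in zip(assignments, alloc)
        )
        if total > best_total:
            best_total, best_alloc = total, alloc
    return best_total, best_alloc
```

Brute force is used here because a category has only a handful of assignments, so enumerating every allocation is cheap and avoids subtle cases where a greedy choice would be wrong once a score has already been docked to zero.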

Collaboration Policy

We do not allow collaboration on assignments; we expect you to complete all assignments individually. However, you are free (and encouraged!) to discuss concepts from lecture. We will follow the EECS departmental policy on academic honesty, so be sure you are familiar with it. And hey, don’t cheat. Not cool.

Extensions

For administrative and logistics issues, deadline extension requests, alternate exam requests, DSP accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Piazza. If you need an extension, please include your reason for the request and any relevant documentation. If you are a DSP student and your accommodation letter allows for extensions on assignments, you will be given 2 extra days per assignment deadline on top of your slip day allocation. If you require any extensions beyond that, please make a private post on Piazza. For issues that you do not feel comfortable posting on Piazza, feel free to email both Allen and Mantej (emails on the staff page) or the professors. However, we recommend posting on Piazza if possible to ensure a quicker response.