CS 104: Data Engineering for Emerging Markets
| Course Code | CS 104 |
| Course Name | Data Engineering for Emerging Markets |
| Department | Computer Science |
| Semester Offered | Even (Term 2 - India) |
| Tuition Hours | 30 hours |
| Course Level | Foundational to Intermediate |
| Pre-requisite | CS 101: Introduction to Computational Thinking |
| Co-requisite | - |
| Course Objective | Most real-world data is messy, incomplete, multilingual, and unreliable. Especially in emerging markets, systems must work despite missing data, inconsistent formats, and unstable infrastructure. This course teaches students how to collect, clean, store, and move data reliably so that downstream systems like machine learning models can actually function. The emphasis is on building practical data pipelines, not theoretical systems. A strong focus is placed on SQL as a thinking tool, not just a query language. Students will learn to reason about data, transform it efficiently, and extract meaningful insights from it. By the end of the course, students will be able to design and deploy data systems that work in imperfect conditions, directly supporting their Term 2 goal of building production-ready AI solutions for real businesses. |
| Course Philosophy | This course emphasizes |
| Course Learning Outcomes | Upon successful completion of this course, students will be able to: |
| Course Author | Sagar Udasi, MSc Statistics and Data Science with Computational Finance from The University of Edinburgh. Contact: sagar.l.udasi@gmail.com |
| Course Organiser | TBD. Details will be updated before course commencement. |

| No. | Lecture Title | Concepts Covered | Lecture Objective |
|---|---|---|---|
| 01 | Why Your Model Is Only As Good As Your Data | Role of data engineering in ML systems | Shift focus from models to data pipelines. |
| 02 | What Does Real Data Look Like? | Messy data, missing values, inconsistencies | Prepare students for real-world datasets. |
| 03 | Databases: Where Data Actually Lives | Relational databases, tables, schemas | Introduce structured data storage. |
| 04 | SQL Is Not Just A Query Language | SQL fundamentals, SELECT, filtering | Build foundation for data reasoning. |
| 05 | Joining Worlds Together | Joins, relationships, normalization | Teach how different datasets connect. |
| 06 | Thinking In SQL | Aggregations, grouping, nested queries | Develop SQL as a problem-solving tool. |
| 07 | When Queries Get Serious | Window functions, advanced SQL | Enable complex data transformations. |
| 08 | Designing Databases That Don’t Break | Schema design, constraints | Build reliable and scalable data systems. |
| 09 | Cleaning Data Is The Real Work | Data cleaning techniques, preprocessing | Prepare data for real use cases. |
| 10 | Automating Data Movement | ETL pipelines, workflows | Build repeatable data systems. |
| 11 | APIs As Data Sources | Fetching and storing API data | Integrate external systems into pipelines. |
| 12 | Handling Multilingual Data | Encoding, text normalization | Work with diverse real-world datasets. |
| 13 | Sparse and Incomplete Data | Imputation, handling missing data | Deal with imperfect datasets effectively. |
| 14 | When The Internet Is Not Reliable | Offline-first systems, retries | Build systems for unstable environments. |
| 15 | Storing Data Efficiently | Indexing, query optimization | Improve performance of data systems. |
| 16 | Debugging Data Pipelines | Logging, monitoring, failure handling | Ensure reliability in production systems. |
| 17 | From Data To Dashboard | Basic analytics, reporting | Turn data into actionable insights. |
| 18 | Case Study: Data Pipeline For A Real Business | End-to-end pipeline design | Directly connect to Term 2 enterprise projects. |
| 19 | Integrating With ML Systems | Data pipelines for ML workflows | Prepare data for modeling and deployment. |
| 20 | Demo Day: Does Your Data Actually Work? | Presentations, system validation | Students showcase working pipelines under constraints. |

| Component | Weightage |
|---|---|
| SQL Assignments (5 total) | 35% |
| Data Cleaning Project | 20% |
| Final Project: End-to-End Data Pipeline | 30% |
| Viva + Query Design Evaluation | 15% |

| Type | Resource | Provider |
|---|---|---|
| Lecture | Databases Course | Stanford (Jennifer Widom) |
| Lecture | SQL for Data Science | UC Davis (Coursera) |
| Reading | Designing Data-Intensive Applications | Martin Kleppmann |
| Reading | SQL Cookbook | Anthony Molinaro |
| Practice | SQLBolt | sqlbolt.com |
| Practice | Mode Analytics SQL Tutorial | mode.com |