CS 104: Data Engineering for Emerging Markets

Course Code CS 104
Course Name Data Engineering for Emerging Markets
Department Computer Science
Semester Offered Even (Term 2 - India)
Tuition Hours 30 hours
Course Level Foundational to Intermediate
Pre-requisite CS 101: Introduction to Computational Thinking
Co-requisite -
Course Objective
Most real-world data is messy, incomplete, multilingual, and unreliable. In emerging markets especially, systems must keep working despite missing data, inconsistent formats, and unstable infrastructure.

This course teaches students how to collect, clean, store, and move data reliably so that downstream systems like machine learning models can actually function. The emphasis is on building practical data pipelines, not theoretical systems.

A strong focus is placed on SQL as a thinking tool, not just a query language. Students will learn to reason about data, transform it efficiently, and extract meaningful insights from it.

By the end of the course, students will be able to design and deploy data systems that work in imperfect conditions, directly supporting their Term 2 goal of building production-ready AI solutions for real businesses.
Course Philosophy
This course emphasizes:
  • Data first, models later
  • SQL as a core skill, not an afterthought
  • Robustness over ideal assumptions
Students will spend more time working with bad data than with clean datasets, because that is what reality looks like. The goal is to make them comfortable operating under uncertainty and building systems that still work.
Course Learning Outcomes
Upon successful completion of this course, students will be able to:
  • Write complex SQL queries to extract, join, and transform data efficiently.
  • Design relational database schemas suited for real-world applications.
  • Clean and preprocess messy datasets, including missing and inconsistent values.
  • Build ETL pipelines to move data from raw sources to usable formats.
  • Work with APIs as data sources, integrating external systems into pipelines.
  • Handle multilingual and sparse data, common in emerging markets.
  • Design systems that tolerate unreliable infrastructure, including intermittent connectivity.
  • Support machine learning workflows with reliable and well-structured data pipelines.
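Several of the outcomes above (cleaning inconsistent values, loading into a relational table, and summarizing with SQL) can be illustrated with a minimal sketch. It is not course material: the `RAW_ORDERS` records, the `orders` table, and the `clean` and `run_pipeline` helpers are hypothetical names chosen for illustration, and an in-memory SQLite database stands in for whatever store a real pipeline would use.

```python
import sqlite3

# Hypothetical raw records: inconsistent casing, stray whitespace, missing amounts.
RAW_ORDERS = [
    {"city": " Mumbai ", "amount": "120.50"},
    {"city": "mumbai", "amount": ""},
    {"city": "PUNE", "amount": "80"},
    {"city": "Pune", "amount": None},
    {"city": "", "amount": "15"},
]


def clean(record):
    """Normalize the city name and parse the amount; return None if unusable."""
    city = (record.get("city") or "").strip().title()
    if not city:
        return None  # drop rows that have no usable city
    raw_amount = record.get("amount")
    # Keep missing amounts as NULL instead of inventing a value.
    amount = float(raw_amount) if raw_amount not in (None, "") else None
    return (city, amount)


def run_pipeline(raw_records):
    """Extract, transform, load into in-memory SQLite, then summarize with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (city TEXT NOT NULL, amount REAL)")
    rows = [r for r in map(clean, raw_records) if r is not None]
    conn.executemany("INSERT INTO orders (city, amount) VALUES (?, ?)", rows)
    summary = conn.execute(
        """
        SELECT city,
               COUNT(*)                 AS n_orders,
               COUNT(amount)            AS n_with_amount,
               COALESCE(SUM(amount), 0) AS total
        FROM orders
        GROUP BY city
        ORDER BY city
        """
    ).fetchall()
    conn.close()
    return summary


if __name__ == "__main__":
    for row in run_pipeline(RAW_ORDERS):
        print(row)
```

Reporting `COUNT(*)` next to `COUNT(amount)` keeps the number of missing values visible in the summary, which is one way of treating SQL as a tool for reasoning about data quality and not only for retrieval.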
Course Author Sagar Udasi
MSc Statistics and Data Science with Computational Finance from The University of Edinburgh.
Contact: sagar.l.udasi@gmail.com
Course Organiser TBD
Details will be updated before course commencement.
No. | Lecture Title | Concepts Covered | Lecture Objective
01 | Why Your Model Is Only As Good As Your Data | Role of data engineering in ML systems | Shift focus from models to data pipelines.
02 | What Does Real Data Look Like? | Messy data, missing values, inconsistencies | Prepare students for real-world datasets.
03 | Databases: Where Data Actually Lives | Relational databases, tables, schemas | Introduce structured data storage.
04 | SQL Is Not Just A Query Language | SQL fundamentals, SELECT, filtering | Build foundation for data reasoning.
05 | Joining Worlds Together | Joins, relationships, normalization | Teach how different datasets connect.
06 | Thinking In SQL | Aggregations, grouping, nested queries | Develop SQL as a problem-solving tool.
07 | When Queries Get Serious | Window functions, advanced SQL | Enable complex data transformations.
08 | Designing Databases That Don’t Break | Schema design, constraints | Build reliable and scalable data systems.
09 | Cleaning Data Is The Real Work | Data cleaning techniques, preprocessing | Prepare data for real use cases.
10 | Automating Data Movement | ETL pipelines, workflows | Build repeatable data systems.
11 | APIs As Data Sources | Fetching and storing API data | Integrate external systems into pipelines.
12 | Handling Multilingual Data | Encoding, text normalization | Work with diverse real-world datasets.
13 | Sparse and Incomplete Data | Imputation, handling missing data | Deal with imperfect datasets effectively.
14 | When The Internet Is Not Reliable | Offline-first systems, retries | Build systems for unstable environments.
15 | Storing Data Efficiently | Indexing, query optimization | Improve performance of data systems.
16 | Debugging Data Pipelines | Logging, monitoring, failure handling | Ensure reliability in production systems.
17 | From Data To Dashboard | Basic analytics, reporting | Turn data into actionable insights.
18 | Case Study: Data Pipeline For A Real Business | End-to-end pipeline design | Directly connect to Term 2 enterprise projects.
19 | Integrating With ML Systems | Data pipelines for ML workflows | Prepare data for modeling and deployment.
20 | Demo Day: Does Your Data Actually Work? | Presentations, system validation | Students showcase working pipelines under constraints.
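As a rough illustration of the "retries" concept listed for Lecture 14, the sketch below retries a failing fetch with exponential backoff. It is an assumption about how such a helper might look, not the course's own implementation: `fetch_with_retries` and `make_flaky_source` are hypothetical names, and the flaky source only simulates a dropped connection by raising `ConnectionError`.

```python
import time


def fetch_with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay * 2 ** (attempt - 1) seconds after each failed attempt,
    and re-raises the last ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_source(failures_before_success):
    """Hypothetical stand-in for an unreliable API: fails N times, then returns data."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated network drop")
        return {"status": "ok", "calls": state["calls"]}

    return fetch


if __name__ == "__main__":
    # Two simulated failures, then success on the third call.
    print(fetch_with_retries(make_flaky_source(2), base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would likely also add jitter and a cap on the delay.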
Component | Weightage
SQL Assignments (5 total) | 35%
Data Cleaning Project | 20%
Final Project: End-to-End Data Pipeline | 30%
Viva + Query Design Evaluation | 15%
Type | Resource | Provider
Lecture | Databases Course | Stanford (Jennifer Widom)
Lecture | SQL for Data Science | UC Davis (Coursera)
Reading | Designing Data-Intensive Applications | Martin Kleppmann
Reading | SQL Cookbook | Anthony Molinaro
Practice | SQLBolt | sqlbolt.com
Practice | Mode Analytics SQL Tutorial | mode.com