CS 104: Data Engineering for Emerging Markets

Course Overview | Lecture Plan | Assessment Plan | Self-Study Resources


Course Code	CS 104
Course Name	Data Engineering for Emerging Markets
Department	Computer Science
Semester Offered	Even (Term 2 - India)
Tuition Hours	30 hours
Course Level	Foundational to Intermediate
Pre-requisite	CS 101: Introduction to Computational Thinking
Co-requisite	-
Course Objective	Most real-world data is messy, incomplete, multilingual, and unreliable. Especially in emerging markets, systems must work despite missing data, inconsistent formats, and unstable infrastructure. This course teaches students how to collect, clean, store, and move data reliably so that downstream systems like machine learning models can actually function. The emphasis is on building practical data pipelines, not theoretical systems. A strong focus is placed on SQL as a thinking tool, not just a query language. Students will learn to reason about data, transform it efficiently, and extract meaningful insights from it. By the end of the course, students will be able to design and deploy data systems that work in imperfect conditions, directly supporting their Term 2 goal of building production-ready AI solutions for real businesses.
Course Philosophy	This course emphasizes: (1) data first, models later; (2) SQL as a core skill, not an afterthought; (3) robustness over ideal assumptions. Students will spend more time working with bad data than clean datasets, because that is what reality looks like. The goal is to make them comfortable operating in uncertainty and building systems that still work.
Course Learning Outcomes	Upon successful completion of this course, students will be able to: Write complex SQL queries to extract, join, and transform data efficiently. Design relational database schemas suited for real-world applications. Clean and preprocess messy datasets, including missing and inconsistent values. Build ETL pipelines to move data from raw sources to usable formats. Work with APIs as data sources, integrating external systems into pipelines. Handle multilingual and sparse data, common in emerging markets. Design systems that tolerate unreliable infrastructure, including intermittent connectivity. Support machine learning workflows with reliable and well-structured data pipelines.
Course Author	Sagar Udasi, MSc Statistics and Data Science with Computational Finance, The University of Edinburgh. Contact: sagar.l.udasi@gmail.com
Course Organiser	TBD. Details will be updated before course commencement.

No.	Lecture Title	Concepts Covered	Lecture Objective
01	Why Your Model Is Only As Good As Your Data	Role of data engineering in ML systems	Shift focus from models to data pipelines.
02	What Does Real Data Look Like?	Messy data, missing values, inconsistencies	Prepare students for real-world datasets.
03	Databases: Where Data Actually Lives	Relational databases, tables, schemas	Introduce structured data storage.
04	SQL Is Not Just A Query Language	SQL fundamentals, SELECT, filtering	Build foundation for data reasoning.
05	Joining Worlds Together	Joins, relationships, normalization	Teach how different datasets connect.
06	Thinking In SQL	Aggregations, grouping, nested queries	Develop SQL as a problem-solving tool.
07	When Queries Get Serious	Window functions, advanced SQL	Enable complex data transformations.
08	Designing Databases That Don’t Break	Schema design, constraints	Build reliable and scalable data systems.
09	Cleaning Data Is The Real Work	Data cleaning techniques, preprocessing	Prepare data for real use cases.
10	Automating Data Movement	ETL pipelines, workflows	Build repeatable data systems.
11	APIs As Data Sources	Fetching and storing API data	Integrate external systems into pipelines.
12	Handling Multilingual Data	Encoding, text normalization	Work with diverse real-world datasets.
13	Sparse and Incomplete Data	Imputation, handling missing data	Deal with imperfect datasets effectively.
14	When The Internet Is Not Reliable	Offline-first systems, retries	Build systems for unstable environments.
15	Storing Data Efficiently	Indexing, query optimization	Improve performance of data systems.
16	Debugging Data Pipelines	Logging, monitoring, failure handling	Ensure reliability in production systems.
17	From Data To Dashboard	Basic analytics, reporting	Turn data into actionable insights.
18	Case Study: Data Pipeline For A Real Business	End-to-end pipeline design	Directly connect to Term 2 enterprise projects.
19	Integrating With ML Systems	Data pipelines for ML workflows	Prepare data for modeling and deployment.
20	Demo Day: Does Your Data Actually Work?	Presentations, system validation	Students showcase working pipelines under constraints.
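
Illustrative sketch (not official course material): the lecture plan above pairs SQL with messy data, so a short example may help show what "SQL as a thinking tool" can look like in practice. It normalizes inconsistent city names and uses a window function to keep only the most recent record per customer. The table name raw_orders, its columns, and the sample rows are all hypothetical. It assumes Python's built-in sqlite3 module is linked against SQLite 3.25 or later, which window functions require.

```python
import sqlite3

# Hypothetical raw rows: (customer_id, city, amount, updated_at).
RAW_ROWS = [
    (1, " Pune ", 250.0, "2024-01-05"),
    (1, "pune", 300.0, "2024-02-01"),    # later update for the same customer
    (2, "MUMBAI", None, "2024-01-10"),   # missing amount
    (3, "Indore", 120.0, "2024-01-12"),
]


def latest_clean_rows(rows):
    """Load rows into an in-memory SQLite table and return one cleaned row per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders ("
        "customer_id INTEGER, city TEXT, amount REAL, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)

    query = """
        WITH ranked AS (
            SELECT
                customer_id,
                LOWER(TRIM(city)) AS city,   -- normalize whitespace and casing
                amount,                      -- missing values stay NULL, not guessed
                ROW_NUMBER() OVER (
                    PARTITION BY customer_id
                    ORDER BY updated_at DESC -- ISO dates sort correctly as text
                ) AS rn
            FROM raw_orders
        )
        SELECT customer_id, city, amount
        FROM ranked
        WHERE rn = 1
        ORDER BY customer_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_clean_rows(RAW_ROWS):
        print(row)
```

The missing amount is deliberately left as NULL rather than filled in. Whether to impute, drop, or flag such a value is a design decision that depends on the downstream use.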

Component	Weightage
SQL Assignments (5 total)	35%
Data Cleaning Project	20%
Final Project: End-to-End Data Pipeline	30%
Viva + Query Design Evaluation	15%

Type	Resource	Provider
Lecture	Databases Course	Stanford (Jennifer Widom)
Lecture	SQL for Data Science	UC Davis (Coursera)
Reading	Designing Data-Intensive Applications	Martin Kleppmann
Reading	SQL Cookbook	Anthony Molinaro
Practice	SQLBolt	sqlbolt.com
Practice	Mode Analytics SQL Tutorial	mode.com