Data Engineering

Overview

Docker + PostgreSQL

Using Docker to run a PostgreSQL database locally.
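
For example, the container can be started with `docker run` and then checked from Python; the image tag, credentials, and port mapping below are illustrative, not taken from this repository:

```python
# Start PostgreSQL first, e.g.:
#   docker run -d --name pg-local \
#     -e POSTGRES_USER=root -e POSTGRES_PASSWORD=root -e POSTGRES_DB=ny_taxi \
#     -p 5432:5432 postgres:13
import sqlalchemy  # plus psycopg2-binary as the driver

# The connection string mirrors the env vars passed to docker run above.
engine = sqlalchemy.create_engine("postgresql://root:root@localhost:5432/ny_taxi")

with engine.connect() as conn:
    # A trivial round-trip query to confirm the container is reachable.
    print(conn.execute(sqlalchemy.text("SELECT version();")).scalar())
```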

Terraform + Google Cloud Platform

Using Terraform to provision and manage Google Cloud Platform (GCP) resources.
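
A sketch of that workflow, assuming a single illustrative GCS bucket resource; the Python wrapper simply writes a `main.tf` and shells out to the standard Terraform CLI lifecycle:

```python
import pathlib
import subprocess

# Illustrative config: the project ID and bucket name are placeholders.
MAIN_TF = """
provider "google" {
  project = "my-gcp-project"
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name     = "my-gcp-project-data-lake"
  location = "US"
}
"""

pathlib.Path("main.tf").write_text(MAIN_TF)

# init downloads the provider, plan previews changes, apply creates them.
for args in (["init"], ["plan"], ["apply", "-auto-approve"]):
    subprocess.run(["terraform", *args], check=True)
```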

Extract Data From Various Sources

  • Extract the NYC Green Taxi trip data for the last quarter of 2020 (October through December) from DataTalksClub
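
A sketch of the extraction step; the URL pattern below assumes the CSV releases published in the DataTalksClub nyc-tlc-data repository:

```python
import pandas as pd

# Q4 2020 = October, November, December.
BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green"

frames = []
for month in ("2020-10", "2020-11", "2020-12"):
    # pandas reads gzipped CSVs straight from a URL.
    frames.append(pd.read_csv(f"{BASE}/green_tripdata_{month}.csv.gz"))

df = pd.concat(frames, ignore_index=True)
print(df.shape)
```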

Transform Data For Consistency

  • Remove rows where the passenger count or the trip distance is zero.
  • Create a new column lpep_pickup_date by converting lpep_pickup_datetime to a date.
  • Rename CamelCase columns to snake_case, e.g. VendorID to vendor_id.
  • Assert that (checked in the sketch after this list):
    1. vendor_id contains only values already present in the column
    2. passenger_count is greater than 0
    3. trip_distance is greater than 0
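
A pandas sketch of the transformations and assertions above, operating on the `df` produced by the extraction step (column names follow the NYC green taxi schema):

```python
import re
import pandas as pd

# Drop rows with no passengers or a zero trip distance.
df = df[(df["passenger_count"] > 0) & (df["trip_distance"] > 0)].copy()

# Derive a date column from the pickup timestamp.
df["lpep_pickup_date"] = pd.to_datetime(df["lpep_pickup_datetime"]).dt.date

# CamelCase -> snake_case, e.g. VendorID -> vendor_id (simple heuristic).
def to_snake_case(name: str) -> str:
    name = re.sub(r"ID$", "_id", name)  # VendorID -> Vendor_id
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()

df.columns = [to_snake_case(c) for c in df.columns]

# Assertions: pin the current set of vendor codes, re-check the filters.
known_vendors = set(df["vendor_id"].dropna().unique())
assert df["vendor_id"].isin(known_vendors).all()
assert (df["passenger_count"] > 0).all()
assert (df["trip_distance"] > 0).all()
```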

Load Data Into the Warehouse

  • Export the cleaned data to PostgreSQL running in the local Docker container
  • Export the cleaned data to Google BigQuery
  • Export the cleaned data to Google Cloud Storage, partitioned by month
  • Create a data warehouse in Google BigQuery
  • Create an external table over the Google Cloud Storage files
  • Create non-partitioned and partitioned tables from the external table
  • Query data from the non-partitioned and partitioned tables (see the sketch after this list)
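
A condensed sketch of these loading steps; the bucket, dataset, and table names are placeholders, and `df` is the cleaned frame from the transform step:

```python
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

# 1. Export to PostgreSQL running in the local Docker container.
engine = sqlalchemy.create_engine("postgresql://root:root@localhost:5432/ny_taxi")
df.to_sql("green_taxi_q4_2020", engine, if_exists="replace", index=False)

# 2. Export to Google Cloud Storage partitioned by month
#    (needs gcsfs + pyarrow; one directory per partition value).
df["month"] = pd.to_datetime(df["lpep_pickup_datetime"]).dt.month
df.to_parquet("gs://my-bucket/green_taxi/", partition_cols=["month"])

# 3. Create the warehouse dataset and load the frame into BigQuery.
client = bigquery.Client()
client.create_dataset("my_dataset", exists_ok=True)
client.load_table_from_dataframe(df, "my_dataset.green_taxi").result()

# 4. External table over the GCS files, then a partitioned table
#    materialized from it; queries on the partitioned table can
#    prune partitions instead of scanning everything.
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE my_dataset.green_taxi_ext
    OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/green_taxi/*'])
""").result()
client.query("""
    CREATE OR REPLACE TABLE my_dataset.green_taxi_partitioned
    PARTITION BY lpep_pickup_date AS
    SELECT * FROM my_dataset.green_taxi_ext
""").result()
```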

Build an Ingestion Pipeline

  • Create a generator to extract data from a data source
  • Create a pipeline to ingest the data into an in-memory database (e.g. DuckDB) or a data warehouse (e.g. BigQuery)
  • Replace or merge data in the in-memory database or data warehouse
  • Query the data in the in-memory database or data warehouse (see the sketch after this list)
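
One way to sketch such a pipeline is with dlt and DuckDB as the destination; the resource, keys, and sample rows below are hypothetical, and swapping the destination string for "bigquery" would target the warehouse instead:

```python
import dlt

# A generator resource; primary_key lets dlt merge (upsert) on re-runs,
# while write_disposition="replace" would overwrite instead.
@dlt.resource(name="trips", primary_key="trip_id", write_disposition="merge")
def extract_trips():
    # Stand-in for a real source (API, files, database, ...).
    yield {"trip_id": 1, "distance": 2.5}
    yield {"trip_id": 2, "distance": 4.1}

pipeline = dlt.pipeline(pipeline_name="taxi", destination="duckdb",
                        dataset_name="raw")
print(pipeline.run(extract_trips()))

# Query the loaded table back out through the same pipeline.
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT count(*) FROM trips"))
```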

Analytics Engineering with dbt

  • Create a dbt project with BigQuery as both source and target.
  • Build staging models from the BigQuery tables.
  • Compose a fact table from the staging models and load it into BigQuery (scripted in the sketch after this list).
  • Create a dashboard in Looker Studio (formerly Data Studio) on top of the fact table.
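
The models themselves are SQL files in the dbt project, but runs can be scripted from Python with dbtRunner (available in dbt-core 1.5+); the fct_trips model name is hypothetical:

```python
from dbt.cli.main import dbtRunner

# Programmatic equivalent of `dbt build --select +fct_trips`: builds the
# fact model plus the staging models it depends on, running their tests.
result = dbtRunner().invoke(["build", "--select", "+fct_trips"])
print(result.success)
```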

Batch Processing with Spark

  • Set up a GCP VM instance
  • Install Anaconda, Java, and Spark; set up PySpark
  • Use Jupyter Notebook to run batch-processing jobs with PySpark, as in the sketch below
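
A minimal PySpark batch job in that setup (paths are placeholders; reading gs:// from Spark requires the GCS connector on the VM):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("green-taxi-batch").getOrCreate()

# Read the month-partitioned parquet written in the loading step.
df = spark.read.parquet("gs://my-bucket/green_taxi/")

# Simple batch aggregation: daily trip counts and average distance.
daily = (df.groupBy("lpep_pickup_date")
           .agg(F.count("*").alias("trips"),
                F.round(F.avg("trip_distance"), 2).alias("avg_distance"))
           .orderBy("lpep_pickup_date"))

daily.show(5)
daily.write.mode("overwrite").parquet("gs://my-bucket/reports/daily/")
```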
