Data Engineering

Overview

Docker + PostgreSQL

Using Docker to run a PostgreSQL database locally.
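
For example, the container can be started with `docker run` and then checked from Python; the image tag, credentials, and port mapping below are illustrative, not taken from this repository:

```python
# Start PostgreSQL first, e.g.:
#   docker run -d --name pg-local \
#     -e POSTGRES_USER=root -e POSTGRES_PASSWORD=root -e POSTGRES_DB=ny_taxi \
#     -p 5432:5432 postgres:13
import sqlalchemy  # plus psycopg2-binary as the driver

# The connection string mirrors the env vars passed to docker run above.
engine = sqlalchemy.create_engine("postgresql://root:root@localhost:5432/ny_taxi")

with engine.connect() as conn:
    # A trivial round-trip query to confirm the container is reachable.
    print(conn.execute(sqlalchemy.text("SELECT version();")).scalar())
```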

Terraform + Google Cloud Platform

Using Terraform to provision and manage Google Cloud Platform (GCP) resources.
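
A sketch of that workflow, assuming a single illustrative GCS bucket resource; the Python wrapper simply writes a `main.tf` and shells out to the standard Terraform CLI lifecycle:

```python
import pathlib
import subprocess

# Illustrative config: the project ID and bucket name are placeholders.
MAIN_TF = """
provider "google" {
  project = "my-gcp-project"
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name     = "my-gcp-project-data-lake"
  location = "US"
}
"""

pathlib.Path("main.tf").write_text(MAIN_TF)

# init downloads the provider, plan previews changes, apply creates them.
for args in (["init"], ["plan"], ["apply", "-auto-approve"]):
    subprocess.run(["terraform", *args], check=True)
```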

Extract Data From Various Sources

  • Extract the NYC Green Taxi trip data for the last quarter of 2020 (October through December) from DataTalksClub
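
A sketch of the extraction step; the URL pattern below assumes the CSV releases published in the DataTalksClub nyc-tlc-data repository:

```python
import pandas as pd

# Q4 2020 = October, November, December.
BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green"

frames = []
for month in ("2020-10", "2020-11", "2020-12"):
    # pandas reads gzipped CSVs straight from a URL.
    frames.append(pd.read_csv(f"{BASE}/green_tripdata_{month}.csv.gz"))

df = pd.concat(frames, ignore_index=True)
print(df.shape)
```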

Transform Data For Consistency

  • Remove rows where the passenger count or the trip distance is zero.
  • Create a new column lpep_pickup_date by converting lpep_pickup_datetime to a date.
  • Rename CamelCase columns to snake_case, e.g. VendorID to vendor_id.
  • Assert that (checked in the sketch after this list):
    1. vendor_id contains only values already present in the column
    2. passenger_count is greater than 0
    3. trip_distance is greater than 0
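
A pandas sketch of the transformations and assertions above, operating on the `df` produced by the extraction step (column names follow the NYC green taxi schema):

```python
import re
import pandas as pd

# Drop rows with no passengers or a zero trip distance.
df = df[(df["passenger_count"] > 0) & (df["trip_distance"] > 0)].copy()

# Derive a date column from the pickup timestamp.
df["lpep_pickup_date"] = pd.to_datetime(df["lpep_pickup_datetime"]).dt.date

# CamelCase -> snake_case, e.g. VendorID -> vendor_id (simple heuristic).
def to_snake_case(name: str) -> str:
    name = re.sub(r"ID$", "_id", name)  # VendorID -> Vendor_id
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()

df.columns = [to_snake_case(c) for c in df.columns]

# Assertions: pin the current set of vendor codes, re-check the filters.
known_vendors = set(df["vendor_id"].dropna().unique())
assert df["vendor_id"].isin(known_vendors).all()
assert (df["passenger_count"] > 0).all()
assert (df["trip_distance"] > 0).all()
```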

Load Data Into the Warehouse

  • Export the cleaned data to PostgreSQL running in the local Docker container
  • Export the cleaned data to Google BigQuery
  • Export the cleaned data to Google Cloud Storage, partitioned by month
  • Create a data warehouse in Google BigQuery
  • Create an external table over the Google Cloud Storage files
  • Create non-partitioned and partitioned tables from the external table
  • Query data from the non-partitioned and partitioned tables (see the sketch after this list)
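
A condensed sketch of these loading steps; the bucket, dataset, and table names are placeholders, and `df` is the cleaned frame from the transform step:

```python
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

# 1. Export to PostgreSQL running in the local Docker container.
engine = sqlalchemy.create_engine("postgresql://root:root@localhost:5432/ny_taxi")
df.to_sql("green_taxi_q4_2020", engine, if_exists="replace", index=False)

# 2. Export to Google Cloud Storage partitioned by month
#    (needs gcsfs + pyarrow; one directory per partition value).
df["month"] = pd.to_datetime(df["lpep_pickup_datetime"]).dt.month
df.to_parquet("gs://my-bucket/green_taxi/", partition_cols=["month"])

# 3. Create the warehouse dataset and load the frame into BigQuery.
client = bigquery.Client()
client.create_dataset("my_dataset", exists_ok=True)
client.load_table_from_dataframe(df, "my_dataset.green_taxi").result()

# 4. External table over the GCS files, then a partitioned table
#    materialized from it; queries on the partitioned table can
#    prune partitions instead of scanning everything.
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE my_dataset.green_taxi_ext
    OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/green_taxi/*'])
""").result()
client.query("""
    CREATE OR REPLACE TABLE my_dataset.green_taxi_partitioned
    PARTITION BY lpep_pickup_date AS
    SELECT * FROM my_dataset.green_taxi_ext
""").result()
```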

Build an Ingestion Pipeline

  • Create a generator to extract data from a data source
  • Create a pipeline to ingest the data into an in-memory database (e.g. DuckDB) or a data warehouse (e.g. BigQuery)
  • Replace or merge data in the in-memory database or data warehouse
  • Query the data in the in-memory database or data warehouse (see the sketch after this list)
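
One way to sketch such a pipeline is with dlt and DuckDB as the destination; the resource, keys, and sample rows below are hypothetical, and swapping the destination string for "bigquery" would target the warehouse instead:

```python
import dlt

# A generator resource; primary_key lets dlt merge (upsert) on re-runs,
# while write_disposition="replace" would overwrite instead.
@dlt.resource(name="trips", primary_key="trip_id", write_disposition="merge")
def extract_trips():
    # Stand-in for a real source (API, files, database, ...).
    yield {"trip_id": 1, "distance": 2.5}
    yield {"trip_id": 2, "distance": 4.1}

pipeline = dlt.pipeline(pipeline_name="taxi", destination="duckdb",
                        dataset_name="raw")
print(pipeline.run(extract_trips()))

# Query the loaded table back out through the same pipeline.
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT count(*) FROM trips"))
```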

Analytics Engineering with dbt

  • Create a dbt project with BigQuery as both source and target.
  • Build staging models from the BigQuery tables.
  • Compose a fact table from the staging models and load it into BigQuery (scripted in the sketch after this list).
  • Create a dashboard in Looker Studio (formerly Data Studio) on top of the fact table.
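
The models themselves are SQL files in the dbt project, but runs can be scripted from Python with dbtRunner (available in dbt-core 1.5+); the fct_trips model name is hypothetical:

```python
from dbt.cli.main import dbtRunner

# Programmatic equivalent of `dbt build --select +fct_trips`: builds the
# fact model plus the staging models it depends on, running their tests.
result = dbtRunner().invoke(["build", "--select", "+fct_trips"])
print(result.success)
```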

Batch Processing with Spark

  • Set up a GCP VM instance
  • Install Anaconda, Java, and Spark; set up PySpark
  • Use Jupyter Notebook to run batch-processing jobs with PySpark, as in the sketch below
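
A minimal PySpark batch job in that setup (paths are placeholders; reading gs:// from Spark requires the GCS connector on the VM):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("green-taxi-batch").getOrCreate()

# Read the month-partitioned parquet written in the loading step.
df = spark.read.parquet("gs://my-bucket/green_taxi/")

# Simple batch aggregation: daily trip counts and average distance.
daily = (df.groupBy("lpep_pickup_date")
           .agg(F.count("*").alias("trips"),
                F.round(F.avg("trip_distance"), 2).alias("avg_distance"))
           .orderBy("lpep_pickup_date"))

daily.show(5)
daily.write.mode("overwrite").parquet("gs://my-bucket/reports/daily/")
```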
