Using Docker to run a PostgreSQL database locally.
Using Terraform to provision and manage Google Cloud Platform (GCP) resources.
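As a quick sanity check of that local setup, here is a minimal sketch of writing a table into the Dockerized PostgreSQL instance; the credentials, port, and database name (root/root, ny_taxi on localhost:5432) are assumptions for illustration, not fixed values from this project.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details for the local Postgres container:
# user=root, password=root, database=ny_taxi, published on localhost:5432
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Write a tiny sample table and read it back to confirm the container is reachable
sample = pd.DataFrame({"id": [1, 2, 3], "note": ["a", "b", "c"]})
sample.to_sql("connection_check", engine, if_exists="replace", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n_rows FROM connection_check", engine))
```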
- Extract the NYC Green Taxi data for the last quarter of 2020 from DataTalksClub (sketched in pandas below)
- Remove rows where the passenger count is 0 or the trip distance is 0.
- Create a new column lpep_pickup_date by converting lpep_pickup_datetime to a date.
- Rename columns from camel case to snake case, e.g. VendorID to vendor_id
- Assertions:
  - vendor_id is one of the values already present in the column
  - passenger_count is greater than 0
  - trip_distance is greater than 0
- Export the cleaned data to PostgreSQL in our Docker container
- Export the cleaned data to Google BigQuery
- Export the cleaned data to Google Cloud Storage, partitioned by month
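A minimal pandas sketch of the cleaning and assertion steps above, using a single month as an example; the download URL, the renamed columns, and the set of valid vendor IDs are assumptions.

```python
import pandas as pd

# Assumed URL for one month of the Q4 2020 Green Taxi data (DataTalksClub release)
url = ("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
       "green/green_tripdata_2020-10.csv.gz")
df = pd.read_csv(url, parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"])

# Drop rows with zero passengers or zero trip distance
df = df[(df["passenger_count"] > 0) & (df["trip_distance"] > 0)]

# Derive lpep_pickup_date from the pickup timestamp
df["lpep_pickup_date"] = df["lpep_pickup_datetime"].dt.date

# Camel case -> snake case (only a couple of columns shown)
df = df.rename(columns={"VendorID": "vendor_id", "RatecodeID": "ratecode_id"})

# Assertions mirroring the checks above
assert df["vendor_id"].isin([1, 2]).all()  # assumed set of valid vendor IDs
assert (df["passenger_count"] > 0).all()
assert (df["trip_distance"] > 0).all()
```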
- Create a Data Warehouse in Google BigQuery
- Create External Table from Google Cloud Storage
- Create Non-partitioned and Partitioned Tables from the External Table (see the BigQuery sketch below)
- Query Data from Non-partitioned and Partitioned Tables
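A sketch of the external and partitioned table DDL, submitted through the BigQuery Python client; the project, dataset, and bucket names are placeholders, and Parquet files in GCS are assumed.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up application-default credentials

# Placeholder project/dataset/bucket identifiers
external_ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my-project.trips_data.external_green_taxi`
OPTIONS (format = 'PARQUET',
         uris = ['gs://my-bucket/green_taxi/*.parquet'])
"""

partitioned_ddl = """
CREATE OR REPLACE TABLE `my-project.trips_data.green_taxi_partitioned`
PARTITION BY DATE(lpep_pickup_datetime) AS
SELECT * FROM `my-project.trips_data.external_green_taxi`
"""

for ddl in (external_ddl, partitioned_ddl):
    client.query(ddl).result()  # .result() waits for the job to finish

# Example query against the partitioned table
rows = client.query(
    "SELECT COUNT(*) AS trips FROM `my-project.trips_data.green_taxi_partitioned` "
    "WHERE DATE(lpep_pickup_datetime) BETWEEN '2020-10-01' AND '2020-10-31'"
).result()
print(list(rows))
```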
- Create a generator to extract data from a data source
- Create a pipeline to ingest data into an in-memory database (e.g. DuckDB) or a data warehouse (e.g. BigQuery)
- Replace or merge data in the in-memory database or data warehouse
- Query the data in the in-memory database or data warehouse
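A minimal dlt sketch of this generator-to-DuckDB pattern; the resource name, primary key, and sample records are illustrative, and switching write_disposition between "replace" and "merge" covers the replace/merge step.

```python
import dlt

@dlt.resource(name="rides", write_disposition="merge", primary_key="ride_id")
def rides():
    # Generator standing in for a real paginated source (API, files, ...)
    for page in range(3):
        yield [{"ride_id": page * 10 + i, "distance": 1.5 * i} for i in range(10)]

# Swap destination="bigquery" to load into a data warehouse instead of DuckDB
pipeline = dlt.pipeline(pipeline_name="taxi", destination="duckdb", dataset_name="raw")
load_info = pipeline.run(rides())  # merge upserts on ride_id; "replace" would overwrite
print(load_info)

# Query the loaded table through the pipeline's SQL client
with pipeline.sql_client() as client:
    with client.execute_query("SELECT COUNT(*) AS n FROM rides") as cursor:
        print(cursor.fetchall())
```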
- Create a dbt project with BigQuery as both source and target.
- Build staging models from the BigQuery tables.
- Compose a fact table from staging models and load it into BigQuery.
- Create a dashboard in Google Data Studio (now Looker Studio) with the fact table.
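One way to run the project programmatically from Python rather than the CLI, assuming dbt-core 1.5+ and a BigQuery profile already configured; building everything at once is just one option, and `--select` can narrow the run.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

# Programmatic equivalent of `dbt build` (dbt-core >= 1.5)
dbt = dbtRunner()

# Builds the staging models and the fact table defined in the project;
# e.g. ["build", "--select", "staging"] would limit the run to staging models
result: dbtRunnerResult = dbt.invoke(["build"])

for node_result in result.result:
    print(node_result.node.name, node_result.status)
```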
- Set up a GCP VM instance
- Install Anaconda, Java, and Spark, and set up PySpark
- Use a Jupyter notebook to run batch processing tasks with PySpark.
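A small PySpark batch job in the spirit of that setup; the input path, columns, and aggregation are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("green-taxi-batch").getOrCreate()

# Placeholder path to the Q4 2020 Green Taxi files
df = spark.read.parquet("data/green_tripdata_2020-*.parquet")

# Monthly trip counts and revenue as a simple batch aggregation
monthly = (
    df.withColumn("month", F.date_trunc("month", F.col("lpep_pickup_datetime")))
      .groupBy("month")
      .agg(F.count("*").alias("trips"),
           F.round(F.sum("total_amount"), 2).alias("revenue"))
      .orderBy("month")
)

monthly.write.mode("overwrite").parquet("output/monthly_summary")
spark.stop()
```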