A complete pipeline that pulls Magic: The Gathering card data from the Scryfall API, orchestrated with Prefect and transformed with dbt.


VincenzoGalante/magic-the-gathering


TL;DR: This project analyzes MTG card data. Follow the steps under "How to make it work?" to set it up.

What is this about?

If you have never heard of Magic before, here is a very quick intro: Magic: The Gathering (Magic or MTG), created in 1993, is the first collectible card game of its kind. It is played by two or more players. The exact rules differ a bit by format, but in general players battle each other with their card decks, trying to defeat their opponents by casting spells, artifacts, and creatures. Curious? Learn how to play...

With this project I aim to build an end-to-end orchestrated data pipeline. I call the Scryfall API to get a complete export of the latest card information and save the data to Google Cloud Storage. Afterwards, I filter the data down to the needed columns, enforce data types, and push it to BigQuery. There, I use dbt to derive new columns and to make sure that only the most up-to-date dataset is used for visualization in the dashboard. A rough sketch of the extraction flow is shown below.
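
The sketch below is a minimal illustration, not the exact code in prefect/api_to_gcs_to_bq.py: it assumes Prefect 2.x, the google-cloud-storage client, and Scryfall's public bulk-data endpoint; BUCKET_NAME and the task split are placeholders.

import requests
from prefect import flow, task
from google.cloud import storage

BUCKET_NAME = "your-mtg-bucket"  # placeholder; the real bucket comes from the Terraform setup

@task
def fetch_bulk_card_data() -> bytes:
    # Scryfall publishes daily bulk exports; "default_cards" contains every card object
    meta = requests.get("https://api.scryfall.com/bulk-data/default_cards", timeout=30).json()
    return requests.get(meta["download_uri"], timeout=300).content

@task
def upload_to_gcs(data: bytes, blob_name: str) -> None:
    # Relies on the application-default credentials configured in step 1 below
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(blob_name).upload_from_string(data, content_type="application/json")

@flow
def api_to_gcs():
    upload_to_gcs(fetch_bulk_card_data(), "raw/default_cards.json")

if __name__ == "__main__":
    api_to_gcs()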

What questions am I trying to answer?

As this project looks at MTG from a meta perspective, we dive into the following:

  1. How many cards are being released over time?
  2. What is the color distribution?
  3. Which set has the highest count of cards?
  4. Who are the most common artists?
  5. What are the most expensive cards?

What technologies are being used?

  • Cloud: Google Cloud
  • Infrastructure: Terraform
  • Orchestration: Prefect
  • Data lake: Google Cloud Storage
  • Data transformation: dbt
  • Data warehouse: BigQuery
  • Data visualization: Google Looker Studio

Dashboard example

Click here to see my Looker dashboard.

What is the structure of the production table?

| Column | Description |
| --- | --- |
| primary_key | Unique surrogate key built from the card_id and released_at data points |
| card_id | Card ID in the database; IDs can repeat due to reprintings |
| name | The name of this card |
| released_at | The date this card was first released |
| color_identity | This card's color identity |
| color_category | Derived from color_identity: Black, Blue, White, Green, Red, Colorless, or Mixed |
| set_name | This card's full set name |
| artist | The name of the illustrator of this card face |
| price | Price of this card in US dollars |
| data_update | Timestamp of when the data was updated in the database |
  • Here is the dbt lineage graph
  • Partitioned on the released_at column (serving questions 1 and 3), assuming that in most cases cards with the same release date belong to the same set
  • Clustered on the color_category column (serving question 2), assuming that within one set the number of distinct colors is lower than the number of unique prices and artists; a sketch of the corresponding dbt config follows below
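
As an illustration, the partitioning, clustering, and surrogate key could be declared in a dbt model roughly like this (a sketch using dbt-bigquery config syntax and the dbt_utils package; the staging model name stg_cards is hypothetical, and the actual models live under dbt/):

{{
    config(
        materialized='table',
        partition_by={'field': 'released_at', 'data_type': 'date'},
        cluster_by=['color_category']
    )
}}

select
    -- hash card_id and released_at into the primary_key described above
    {{ dbt_utils.generate_surrogate_key(['card_id', 'released_at']) }} as primary_key,
    card_id,
    name,
    released_at,
    color_identity,
    color_category,
    set_name,
    artist,
    price,
    current_timestamp() as data_update
from {{ ref('stg_cards') }}  -- hypothetical staging model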

How to make it work?

  1. Set up your Google Cloud environment
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
  2. Install all required dependencies into your environment
pip install -r requirements.txt
  3. Set up your infrastructure
  • Assuming you are on Linux AMD64, run the following commands to install Terraform. If you are on a different OS, choose the correct version here and swap in the matching download link and zip file name.
sudo apt-get install unzip
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip
  • To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands. A sketch of what the configuration provisions follows after the commands.
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
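
For orientation, the configuration behind these commands provisions roughly the following resources (a sketch, not the repo's actual terraform/main.tf; the bucket naming scheme and the EU location are assumptions):

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

variable "project" {
  description = "Your GCP project ID, passed in via -var"
  type        = string
}

provider "google" {
  project = var.project
}

# Data lake bucket that receives the raw Scryfall exports
resource "google_storage_bucket" "data_lake" {
  name     = "${var.project}-mtg-data-lake"  # placeholder naming scheme
  location = "EU"                            # assumed location
}

# Warehouse dataset that the pipeline and dbt write into
resource "google_bigquery_dataset" "mtg_card_data_dbt" {
  dataset_id = "mtg_card_data_dbt"
  location   = "EU"  # assumed location
}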
  4. Set up your orchestration
  • If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here
  • Create the Prefect blocks via the Cloud UI, or adjust the variables in /prefect/prefect_blocks.py and run
python magic-the-gathering/prefect/prefect_blocks.py
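
Registering blocks with the prefect-gcp collection looks roughly like this (a sketch; the block names and the exact blocks created in prefect/prefect_blocks.py are assumptions):

from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Store the service-account credentials as a reusable block
credentials = GcpCredentials(service_account_file="path/to/credentials.json")
credentials.save("mtg-gcp-credentials", overwrite=True)

# Point a bucket block at the data lake bucket created by Terraform
GcsBucket(
    bucket="your-mtg-bucket",  # placeholder bucket name
    gcp_credentials=credentials,
).save("mtg-gcs-bucket", overwrite=True)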
  • Adjust the keyfile location in dbt/profiles.yml to the path of your Google Cloud credentials JSON; the profile follows dbt's standard BigQuery layout, roughly as shown below
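
(A sketch of a service-account BigQuery profile; the profile name, project, and thread count are assumptions, while the dataset name matches step 5 below. Keep whatever dbt/profiles.yml already defines and only swap the keyfile path.)

magic_the_gathering:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/your/credentials.json  # adjust this line
      project: your-gcp-project-id
      dataset: mtg_card_data_dbt
      threads: 4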
  • To execute the flow, run the following commands in two separate terminals
prefect agent start -q 'default'
python prefect/api_to_gcs_to_bq.py
  5. Data deep dive
  • The data will be available in BigQuery at mtg_card_data_dbt.dbt_mtg_latest_data
  • Query the data in place or build a dashboard on top of it
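
For example, question 1 (cards released over time) can be answered directly in BigQuery (replace <your-gcp-project-id> with your project ID):

SELECT
  EXTRACT(YEAR FROM released_at) AS release_year,
  COUNT(*) AS cards_released
FROM `<your-gcp-project-id>.mtg_card_data_dbt.dbt_mtg_latest_data`
GROUP BY release_year
ORDER BY release_year;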

Potential next steps

With a growing database, I would be able to further explore the following:

  • What is the color distribution over time?
  • What is the price development of specific cards / colors / sets over time?