Imagine you work at a newspaper that is willing to use technology to gain an advantage over the competition.
Your News Boss wants you to use tech and your recently gained data engineering skills to accomplish this goal. He/she wants you to first gather some inside information on other newspaper companies, editorials, and anything else you can find. The idea after this is to create an AI model that can write news, opinion, and even gossip articles.
- First, you will need to dig into the whole news domain, a.k.a. your competition.
- Second, you need a lot of data spanning multiple years to get a good general idea of what is relevant for this subject.
- Third, your News Boss wants a dashboard with all this information, covering at least 5 years.
The next image shows the technologies used and the project diagram.
In order to have the dataset available in a more granular manner, I downloaded the dataset from Hugging Face and created granular files, and then continued with the process (Dataset From Component One).
For reference, you can check the Granularization of the dataset inside the Notebook folder.
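As a rough sketch of what that granularization step does (the source file name, chunk size, and output naming below are placeholders, not the exact values used in the notebook):

```python
import pandas as pd
from pathlib import Path

# Assumption: the raw dump from Hugging Face sits in a single large CSV,
# and we want smaller chunks that can later be served as individual files.
SOURCE_FILE = Path("data/all-the-news.csv")   # hypothetical path
OUTPUT_DIR = Path("data/granular")
CHUNK_SIZE = 50_000

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Stream the large CSV in chunks so it never has to fit in memory at once,
# writing each chunk out as its own smaller file.
for i, chunk in enumerate(pd.read_csv(SOURCE_FILE, chunksize=CHUNK_SIZE)):
    chunk.to_csv(OUTPUT_DIR / f"all_news_part_{i:03d}.csv", index=False)
```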
The dataset was first uploaded to the following repo. It is used as an API endpoint so that Mage can start the ingestion process into the Google Cloud Platform ecosystem.
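As an illustration of how a Mage data-loader block can pull one of those granular files over HTTP (the URL is a placeholder for a raw file in the dataset repo; the real blocks live in the submodule):

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_granular_file(*args, **kwargs):
    # Placeholder URL: in practice this points at one of the granular
    # files in the dataset repo, served as a raw file over HTTP.
    url = 'https://raw.githubusercontent.com/<user>/<dataset-repo>/main/all_news_part_000.csv'
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```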
Data ingestion was done with Mage, creating two pipelines as explained in this repository, which is linked here as a submodule.
The resulting bucket in the cloud should look something like the next picture.
As shown in the picture, there are two tables: all_news_data_authors and all_news_data_articles. They come from two separate pipelines in the Mage instance that handle different processes and create the two tables that are sent to GCS.
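A minimal sketch of the export side, assuming pandas with pyarrow and gcsfs installed; the bucket and object names are placeholders, and the actual exporter blocks are defined in the Mage submodule:

```python
import pandas as pd

# Placeholder bucket name: the real bucket comes from the Mage pipeline config.
BUCKET = "all-news-project-bucket"


def export_tables(authors_df: pd.DataFrame, articles_df: pd.DataFrame) -> None:
    # Each pipeline writes its own table to GCS. Writing to a gs:// path
    # requires the gcsfs package and credentials with access to the bucket.
    authors_df.to_parquet(f"gs://{BUCKET}/all_news_data_authors.parquet", index=False)
    articles_df.to_parquet(f"gs://{BUCKET}/all_news_data_articles.parquet", index=False)
```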
From Google Cloud Storage to Google BigQuery, just a couple of queries were run in the BQ interface to create all the necessary tables and take advantage of dbt. Use the queries.sql file to create all the tables. The tables are created with partitioning and clustering to enhance the flow of the pipeline.
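A hedged example of the kind of statement queries.sql runs, executed here through the BigQuery Python client; the project, dataset, table, and column names are assumptions and should be replaced with the real ones:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Assumed names throughout; also assumes publish_date is a TIMESTAMP column.
ddl = """
CREATE OR REPLACE TABLE `all_news_project.all_news.articles_partitioned`
PARTITION BY DATE(publish_date)
CLUSTER BY publication AS
SELECT *
FROM `all_news_project.all_news.all_news_data_articles_external`;
"""

client.query(ddl).result()  # wait for the DDL statement to finish
```

Partitioning by date and clustering by publication keeps the downstream dbt and dashboard queries scanning only the date ranges and publications they actually need.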
After creating the tables, BigQuery should look similar to the next picture.
The transformation step was done with dbt, where tables from BigQuery were joined to create a fact table.
The dbt models include two sources because of the quantity of data in the initial dataset. It was impossible to process all the articles and article titles in a single run and hold everything in BigQuery, which is why I split the data into separate tables: one with author information and the other with aggregated article information.
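Conceptually, the fact model joins the two sources on their shared key. A rough sketch of the equivalent query, run through the BigQuery client with assumed table and column names (the real logic lives in the dbt models):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed join key and column names; the dbt model defines the real join.
fact_query = """
SELECT
    art.article_id,
    art.publication,
    art.publish_date,
    art.word_count,
    aut.author
FROM `all_news_project.all_news.articles_partitioned` AS art
JOIN `all_news_project.all_news.authors_partitioned` AS aut
    ON art.article_id = aut.article_id
"""

# Preview the joined result locally (requires pandas and db-dtypes).
fact_df = client.query(fact_query).to_dataframe()
```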
Finally, to show all your work to your News Boss, you set up a Looker dashboard with the information needed. Some of it includes (a sketch of the aggregation behind the time-series panel follows this list):
- A table of the authors who write the most.
- How many words each publication has written over the years.
- The percentage that each publication covers within the date ranges.
- A time series of the number of articles written by each publication.
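For instance, the time-series panel can be backed by an aggregation like the one below; the fact-table and column names are assumptions, and in practice Looker Studio builds this directly on top of the fact table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed fact-table and column names: counts articles per publication per
# month, roughly what the time-series chart in the dashboard displays.
timeseries_query = """
SELECT
    publication,
    DATE_TRUNC(DATE(publish_date), MONTH) AS month,
    COUNT(*) AS articles_written
FROM `all_news_project.all_news.fact_articles`
GROUP BY publication, month
ORDER BY month, publication
"""

for row in client.query(timeseries_query).result():
    print(row.publication, row.month, row.articles_written)
```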
The Dashboard presented to your News Boss
By the end of the project you present the dashboard to your News Boss, who seems very pleased to have it available. He/she thinks you could develop an ML model after analyzing the data. (Now you think you could learn MLOps from #DataTalksClub or even the Machine Learning course.)
- You used the same stack as in the course with a different dataset.
- You were able to learn IaC with a practical example, and probably many more questions and interests arise.
- You were able to work with over 2 million rows of data and see that there is plenty of room for cleaning and enhancing the whole data pipeline.
- You can reuse the data pipeline for other pipelines and interests around the dataset.
- You have a more granular presentation of the dataset, shared in All component dataset.
- Clone the repository:
git clone --recurse-submodules https://github.com/Cerpint4xt/data-engineering-all-news-project
- Create the project in GCP, e.g. all_news_project
- Install Docker, Python, and Terraform (if you want to deploy the Terraform template for Mage)
- Create your service accounts for the Docker or Terraform deployment (as in the YouTube videos)
- Follow the instructions in the submodule repository to run the pipeline in Mage
- Set up dbt Cloud with the repository and run the commands (as in the YouTube videos)
- Set up Looker to start playing with the data.