Prepare for migration of report queries #30

Draft · wants to merge 19 commits into base: main

@max-ostapenko (Contributor) commented Nov 16, 2024

We want to replace the legacy reports scripts: https://github.com/HTTPArchive/bigquery/tree/master/sql
This implementation follows the previous discussion.

TODO list:

  • trigger a tag that kicks off report preparation on crawl_complete
  • run aggregated table updates with fresh reports data (in the reports dataset)
  • create reports configuration file with one timeseries and one histogram
  • trigger GCS upload whenever data is updated in BQ

Supported features:

  • Run monthly histogram SQL when the crawl is finished
  • [?] Run longer-term time series SQL when the crawl is finished
  • Be able to run the time series in an incremental fashion
  • [?] Handle different lenses (Top X, WordPress, Drupal, Magento)
  • [?] Handle CrUX reports (monthly histograms and time series) having to run later.
  • Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN
  • Be able to run only the missing reports (histograms) or missing dates (time series)
  • Be able to force rerun (to override any existing reports).
  • Be able to run a subset of reports.
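
The "missing dates" and "force rerun" requirements above can be sketched as a small date-selection step. This is only an illustration under assumptions: the helper names `monthlyDates` and `datesToRun` are hypothetical, and it assumes one report date per month on the first day of the month, which the PR does not state.

```javascript
// Hypothetical helper: list first-of-month dates between two bounds (inclusive).
// Assumes both inputs are 'YYYY-MM-01' strings.
function monthlyDates(start, end) {
  const dates = [];
  const cursor = new Date(`${start}T00:00:00Z`);
  const last = new Date(`${end}T00:00:00Z`);
  while (cursor <= last) {
    dates.push(cursor.toISOString().slice(0, 10));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return dates;
}

// Hypothetical helper: pick the dates that still need a time series run.
// With force = true every expected date is returned, overriding existing reports.
function datesToRun(expected, existing, force = false) {
  if (force) return [...expected];
  const done = new Set(existing);
  return expected.filter(date => !done.has(date));
}

const expected = monthlyDates('2024-08-01', '2024-11-01');
console.log(datesToRun(expected, ['2024-08-01', '2024-09-01']));
```

The same filter would apply to histograms by comparing report names instead of dates.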

@max-ostapenko changed the title from "Preparing data for reports" to "Prepare for migration of report queries" on Nov 16, 2024
Comment on lines +110 to +111
bytesTotal: {
  name: 'Total Kilobytes',

I found the reports config file. It seems like a good idea to keep all the configs in one place (more transparent for future contributors).

I copied it over here to experiment with and added the queries.
I couldn't add the queries unless the format supports multiline strings, so I saved it as JS for now.
However, it also needs to be readable from Python. Should we use YAML instead?
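
One possible way to keep multiline queries in JS while staying readable from Python is to serialize the config to JSON. The sketch below is an assumption, not what the PR does: the `type` field and the query text are placeholders, and only `bytesTotal` and its `name` come from the snippet above.

```javascript
// Hypothetical excerpt of a reports config kept in JS.
// Template literals allow multiline SQL; the query here is a placeholder.
const config = {
  bytesTotal: {
    name: 'Total Kilobytes',
    type: 'timeseries', // assumed field, not confirmed by the PR
    query: `
      SELECT
        date,
        client
      FROM example_table -- placeholder table name
    `
  }
};

// JSON encodes line breaks as \n escapes, so the multiline query
// survives serialization and can be parsed by any JSON reader.
const serialized = JSON.stringify(config, null, 2);
const restored = JSON.parse(serialized);
console.log(restored.bytesTotal.query === config.bytesTotal.query);
```

On the Python side, `json.load` would return the query as an ordinary multiline string. YAML block scalars would be an alternative if the file should be edited by hand.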

Comment on lines 10 to 14
publish(sql.type, {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports']
}).query(ctx => constants.fillTemplate(sql.query, params))

In the reports dataset we could store intermediate aggregated data. It's easier to check for data issues in BQ than in GCS.
A Cloud Function will then pick up the fresh rows and save them to GCS.

I think a table per chart type would be enough, e.g. httparchive.reports.timeseries:

  • date (partition)
  • metric (cluster)
  • timestamp
  • client (cluster)
  • p10
  • p25
  • p50
  • p75
  • p90
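
The proposed table layout could be expressed as a Dataform table config along these lines. This is a sketch under assumptions: the column descriptions are guesses based on the column names, and the object name `timeseriesTable` is hypothetical. The `bigquery.partitionBy` and `bigquery.clusterBy` options are the Dataform settings for partitioning and clustering.

```javascript
// Sketch of a Dataform config for httparchive.reports.timeseries.
// Column descriptions are assumptions; the comment above only names the columns.
const timeseriesTable = {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports'],
  bigquery: {
    partitionBy: 'date',            // date (partition)
    clusterBy: ['metric', 'client'] // metric and client (cluster)
  },
  columns: {
    date: 'Crawl date',
    metric: 'Metric identifier, e.g. bytesTotal',
    timestamp: 'Timestamp of the data point',
    client: 'Client the data was collected for',
    p10: '10th percentile',
    p25: '25th percentile',
    p50: '50th percentile',
    p75: '75th percentile',
    p90: '90th percentile'
  }
};

// In a Dataform definitions file this object would be passed as the second
// argument of publish('timeseries', timeseriesTable), followed by .query(...).
console.log(Object.keys(timeseriesTable.columns).join(', '));
```

Clustering on metric and client should keep per-chart reads cheap, since a chart typically filters on both.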

Comment on lines 2 to 5
const params = {
  date: constants.currentMonth,
  rankFilter: constants.devRankFilter
}

Query parameters.
I found only date so far.
Please list all the required parameters and add queries to test them with.
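
To make the parameter question concrete, here is a hypothetical stand-in for `constants.fillTemplate`. The real helper is not shown in this thread and may work differently; the `{{name}}` placeholder syntax, the `requiredParams` helper, and the sample values are all assumptions.

```javascript
// Assumed placeholder syntax: {{name}}. The real fillTemplate may use another.
const PLACEHOLDER = /\{\{\s*(\w+)\s*\}\}/g;

// Hypothetical stand-in for constants.fillTemplate: substitutes every
// placeholder and fails loudly when a parameter is missing.
function fillTemplate(template, params) {
  return template.replace(PLACEHOLDER, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(params, key)) {
      throw new Error(`Missing query parameter: ${key}`);
    }
    return String(params[key]);
  });
}

// Hypothetical helper: list the distinct parameters a query template needs.
function requiredParams(template) {
  const names = new Set();
  for (const match of template.matchAll(PLACEHOLDER)) {
    names.add(match[1]);
  }
  return [...names];
}

// Placeholder query and values, for illustration only.
const template = "SELECT * FROM example_table WHERE date = '{{date}}' {{rankFilter}}";
const params = { date: '2024-11-01', rankFilter: 'AND rank <= 10000' };
console.log(requiredParams(template));
console.log(fillTemplate(template, params));
```

Running something like `requiredParams` over every query in the config would produce the list of parameters requested above.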

@max-ostapenko (Contributor Author) commented:

@tunetheweb here is a demo version that needs to be discussed.
Once we confirm that it covers all the requirements and agree on the feasibility of the 3 topics in the comments above, I'll finalise the GCS upload part.

I have no idea yet what to do with lenses and 2 more requests (see the description).
