
Data missing due to exceeding quota #36

Open · dimbleby opened this issue Aug 17, 2024 · 9 comments

@dimbleby commented Aug 17, 2024

Looking at the history, it seems as though the intention is to publish updated data on the first of each month.

The last update was 1st July.

@hugovk (Owner) commented Aug 18, 2024

Strange, still hitting the limit two weeks later:

google.api_core.exceptions.Forbidden: 403 Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

That was for 7,000 projects.

@hugovk (Owner) commented Aug 30, 2024

Google BigQuery gives 1 TiB of free query quota per month.

Here's some data on past runs. The gaps indicate when we switched to fetching data for more projects.

[image: chart of bytes billed per run]

Growth at 5k projects seemed fairly linear, so when switching to 8k I made a rough calculation that the ~0.5 TB of remaining headroom would last about 40 months. Instead, growth has accelerated since the turn of 2024, and a run for 8k projects now exceeds 1 TB.

@hugovk (Owner) commented Aug 30, 2024

I've started the 90-day, $300 (€277.46) free trial to get things back on track and fetched the July data:

pypinfo -v --json --indent 0 --limit 8000 -sd 2024-07-02 -ed 2024-07-31 "" project > top-pypi-packages-30-days.json
jq -c . < top-pypi-packages-30-days.json > top-pypi-packages-30-days.min.json
jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv

./deploy.sh

https://github.com/hugovk/top-pypi-packages/releases/tag/2024.08

This is indeed over 1 TB, at 1.023 TB:

[image: screenshot showing 1.023 TB processed]

At some point whilst on the free trial, I'll run some experiments to pick a lower number of projects that stays under 1 TB, but it's concerning that usage grew by ~0.5 TB in about half a year.

Let's keep this open until that's done, and to check that the September update goes out as planned on Sunday.

@hugovk (Owner) commented Oct 13, 2024

Here are a couple of pypinfo commands, limited to just the top 10 projects and one day, otherwise identical except for the --all flag.

--all means "Show downloads by all installers, not only pip".

By default, it only fetches downloads from pip, and that's what has been used in this repo (since day one? Or perhaps since that flag was introduced? I'd have to check. At any rate, the default has been used here for a long time).

Only pip

❯ pypinfo --limit 10 --days 1 "" project
Served from cache: False
Data processed: 58.21 GiB
Data billed: 58.21 GiB
Estimated cost: $0.29
project download_count
boto3 37,251,744
aiobotocore 16,252,824
urllib3 16,243,278
botocore 15,687,125
requests 13,271,314
s3fs 12,865,055
s3transfer 12,014,278
fsspec 11,982,305
charset-normalizer 11,684,740
certifi 11,639,584
Total 158,892,247

All installers

pypinfo --all --limit 10 --days 1 "" project
Served from cache: False
Data processed: 46.63 GiB
Data billed: 46.63 GiB
Estimated cost: $0.23
project download_count
boto3 39,495,624
botocore 17,281,187
urllib3 17,225,121
aiobotocore 16,430,826
requests 14,287,965
s3fs 12,958,516
charset-normalizer 12,781,405
certifi 12,647,098
setuptools 12,608,120
idna 12,510,335
Total 168,226,197

So the default pip-only query processes and bills about 25% more data, and costs about 25% more in dollars.
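A quick back-of-envelope check of those numbers (a sketch, assuming the $5/TiB on-demand rate that pypinfo appears to use for its estimate, rounded up to the next cent):

pip_only_gib = 58.21
all_installers_gib = 46.63

# Assumed: BigQuery's $5/TiB on-demand rate for the cost estimates.
print(f"extra data for pip-only: {pip_only_gib / all_installers_gib - 1:.1%}")  # 24.8%
print(f"pip-only cost:       ${pip_only_gib / 1024 * 5:.2f}")        # $0.28 (pypinfo rounds up to $0.29)
print(f"all-installers cost: ${all_installers_gib / 1024 * 5:.2f}")  # $0.23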

Unsurprisingly, the actual download counts are higher for all installers. The ranking has changed a bit, but I expect we're still getting more-or-less the same packages in the top thousands of results.

Queries

Here are the same commands, but run with --test, which "simply prints the query":

pypinfo --test --limit 10 --days 1 "" project
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
  AND details.installer.name = "pip"
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10

pypinfo --test --all --limit 10 --days 1 "" project
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10

These queries are identical, except the default has an extra AND details.installer.name = "pip" condition.

It seems reasonable that the extra filtering would cost more: BigQuery storage is columnar, and a query is billed for the columns it scans, so referencing details.installer.name presumably means reading an extra column's worth of bytes.
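One quota-free way to check this is a BigQuery dry run, which reports the bytes a query would process without executing or billing it. A minimal sketch using the google-cloud-bigquery client (assumes credentials are already configured; drop the installer condition for the --all version):

from google.cloud import bigquery

# Dry runs are free: they report the bytes that would be processed
# without actually executing (or billing) the query.
client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT file.project AS project, COUNT(*) AS download_count
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY)
                    AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
  AND details.installer.name = "pip"
GROUP BY project
ORDER BY download_count DESC
LIMIT 10
"""

job = client.query(query, job_config=config)
print(f"Would process {job.total_bytes_processed:,} bytes")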

Installers

Let's look at the installers:

pypinfo --all --limit 100 --days 1 "" installer
Served from cache: False
Data processed: 29.49 GiB
Data billed: 29.49 GiB
Estimated cost: $0.15
installer_name download_count
pip 1,121,198,711
uv 117,194,833
requests 29,828,272
poetry 23,009,454
None 8,916,745
bandersnatch 6,171,555
setuptools 1,362,797
Bazel 1,280,271
Browser 1,096,328
Nexus 593,230
Homebrew 510,247
Artifactory 69,063
pdm 62,904
OS 13,108
devpi 9,530
conda 2,272
pex 194
Total 1,311,319,514

pip is still by far the most popular, and unsurprisingly uv is up there too, with about 10% of pip's downloads.

The others are at 10% of uv or less. Many of them are mirroring services, which we wanted to exclude before.
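For scale, computing the shares from the table above:

# Shares from the installer table above.
pip, uv, total = 1_121_198_711, 117_194_833, 1_311_319_514
print(f"uv as a share of pip:  {uv / pip:.1%}")     # 10.5%
print(f"pip as a share of all: {pip / total:.1%}")  # 85.5%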

Given uv's importance, my expectation that it will continue to take a bigger share of the pie, and especially the extra cost of filtering for pip alone, I think we should switch to fetching data for all installers. The others don't account for much of the pie anyway.

Next time, I'll also compare different limits: should we keep 8,000, go back to 5,000, or pick some other number?

hugovk changed the title from "no August release" to "Data missing due to exceeding quota" on Nov 10, 2024
@hugovk (Owner) commented Nov 10, 2024

Next, let's see how the number of projects affects the cost. (This data was fetched yesterday, 2024-11-09.)

Days = 1, limit = 1000...8000

Let's begin by checking just one day for different limits: 1000, 2000, 3000, 4000, 5000, 6000, 7000 and 8000. We'll use --all, so we're not filtering for just pip.

Here's the basic query:

❯ pypinfo --test --all -v --json --indent 0 --days 1 --limit 8000 "" project
Credentials location set to "/Users/hugo/Dropbox/bin/data/pypinfo-ea8f23d28f2b.json".
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 8000

Let's use a helper script like this:

import os

days = 1

# One pypinfo run per limit, writing the JSON output to a file;
# skip files that already exist so re-runs don't re-query.
for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
    outfile = f"days-{days}-limit-{limit}-all.json"
    if os.path.exists(outfile):
        print(f"{outfile} exists, skipping")
        continue
    cmd = (
        f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
        f"project > {outfile}"
    )
    print(cmd)
    os.system(cmd)

[image: results for each limit]

Result: interestingly, the cost is the same for all limits (1000-8000): $0.31.

Days = 1, limit = 1000...8000, pip

Let's repeat without --all, so it filters for only pip:

[image: results for each limit, pip only]

Result: the cost is again the same for all limits, but higher: $0.39.

Days = 30, limit = 8000...1000

Next, back to --all, repeating for 30 days and starting with 8000 first, in case we were only paying for incremental changes:

days = 30

for limit in (8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000):
    ...

[image: results for each limit, 30 days]

Result: again, the cost is the same regardless of limit: $4.89.

Days = 1, 30, limit = 1

Let's also check limit = 1.

Result: same cost as above, $0.31 for 1 day and $4.89 for 30 days.

Days = 1...30, limit = 1

Now we know the cost is the same regardless of the limit (at least up to 8000), let's see how the cost varies over the number of days.

import os

# Limit fixed at 1; vary the number of days instead.
for days in range(1, 31):
    for limit in (1,):
        outfile = f"days-{days}-limit-{limit}-all.json"
        if os.path.exists(outfile):
            print(f"{outfile} exists, skipping")
            continue

        cmd = (
            f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
            f"project > {outfile}"
        )
        print(cmd)
        os.system(cmd)

Extract the results from the JSON files into a TSV:

import json

print("days\testimated_cost\tbytes_billed\tbytes_processed")

for days in range(1, 31):
    for limit in (1,):
        outfile = f"days-{days}-limit-{limit}-all.json"

        with open(outfile) as f:
            data = json.load(f)

        out = [
            days,
            data["query"]["estimated_cost"],
            # Single quotes inside the f-strings keep this working on
            # Python < 3.12, which rejects reusing the outer quote character.
            f"{data['query']['bytes_billed']:,}",
            f"{data['query']['bytes_processed']:,}",
        ]
        print("\t".join(map(str, out)))

Gives:

days	estimated_cost	bytes_billed	bytes_processed
1	0.31	67,423,436,800	67,423,031,967
2	0.49	106,777,542,656	106,776,936,726
3	0.66	144,589,193,216	144,588,849,934
4	0.84	183,362,387,968	183,362,056,051
5	0.94	205,844,905,984	205,844,089,492
6	1.05	230,423,527,424	230,423,239,221
7	1.20	263,483,031,552	263,482,009,301
8	1.37	301,120,618,496	301,119,942,981
9	1.56	340,849,065,984	340,848,563,030
10	1.73	379,594,997,760	379,594,187,471
11	1.91	419,779,575,808	419,778,888,347
12	2.02	442,795,819,008	442,795,580,469
13	2.11	463,927,771,136	463,927,460,363
14	2.28	500,829,257,728	500,828,317,940
15	2.47	541,425,926,144	541,424,988,775
16	2.64	580,075,388,928	580,075,257,785
17	2.83	621,216,268,288	621,215,221,560
18	3.01	660,446,642,176	660,445,832,190
19	3.13	686,769,045,504	686,768,435,724
20	3.24	710,389,268,480	710,388,303,159
21	3.41	747,798,265,856	747,798,009,119
22	3.59	789,304,049,664	789,303,120,956
23	3.78	830,582,292,480	830,581,546,594
24	3.97	871,580,565,504	871,579,537,799
25	4.14	908,331,057,152	908,330,312,456
26	4.25	933,782,093,824	933,781,258,405
27	4.35	956,512,075,776	956,511,698,396
28	4.52	992,670,121,984	992,669,344,697
29	4.70	1,032,543,272,960	1,032,542,495,766
30	4.89	1,073,809,981,440	1,073,809,418,894
[image: chart of estimated cost vs. number of days]

Result: it's the number of days, not the limit, that affects the cost.

Google: "The first 1 TiB of query data processed per month is free." 1 TiB = 1,099,511,627,776 bytes, and all of the values in the above table are below this limit. Note, this is with --all.

The last run in this repo, without --all, was billed at 1,376,442,646,528 bytes, over the limit:

"last_update": "2024-11-01 12:27:36",
"query": {
"bytes_billed": 1376442646528,
"bytes_processed": 1376442377548,
"cached": false,
"estimated_cost": "6.26"

So we can add --all and we'll be okay... for now. But as we can see from #36 (comment), as time goes on, the line goes up.

We're at 1,073,809,981,440 bytes, and the limit is not far off at 1,099,511,627,776.

In the future, to stay under 1 TiB we'd need to reduce the number of days by some unclear amount.
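To put a rough number on that (a crude linear estimate from the 30-day figure in the table above; per-day usage isn't actually uniform):

# How much headroom is left under the 1 TiB free quota with --all?
TIB = 1024**4                        # 1,099,511,627,776 bytes
billed_30_days = 1_073_809_981_440   # from the table above
headroom = TIB - billed_30_days
avg_per_day = billed_30_days / 30    # crude average
print(f"headroom: {headroom:,} bytes, about {headroom / avg_per_day:.1f} days' worth")
# headroom: 25,701,646,336 bytes, about 0.7 days' worth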

Well, let's add --all for now: #39.

Here's the data collected from these tests: data.zip

@hugovk (Owner) commented Nov 10, 2024

Repeating with the limit increasing by powers of ten, up to 1,000,000, which effectively fetches data for all 531,022 projects on PyPI.
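Fetched with a variant of the earlier helper script along these lines (a sketch assuming the same pattern as before, with limits matching the table below; not necessarily the exact code used):

import os

days = 1

# Same pattern as the earlier helper, but stepping the limit by powers
# of ten (8000 included for comparison with the earlier runs).
for limit in (1, 10, 100, 1000, 8000, 10000, 100000, 1000000):
    outfile = f"days-{days}-limit-{limit}-all.json"
    if os.path.exists(outfile):
        print(f"{outfile} exists, skipping")
        continue
    cmd = (
        f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
        f"project > {outfile}"
    )
    print(cmd)
    os.system(cmd)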

limit    project count  estimated_cost  bytes_billed    bytes_processed
1        1              0.20            43,447,746,560  43,447,720,943
10       10             0.20            43,447,746,560  43,447,720,943
100      100            0.20            43,447,746,560  43,447,720,943
1000     1,000          0.20            43,447,746,560  43,447,720,943
8000     8,000          0.20            43,447,746,560  43,447,720,943
10000    10,000         0.20            43,447,746,560  43,447,720,943
100000   100,000        0.20            43,447,746,560  43,447,720,943
1000000  531,022        0.20            43,447,746,560  43,447,720,943
[image: results for each limit]

Result: again, the same cost for 1 project or for all 531,022. This makes sense if we're billed for the bytes scanned: LIMIT only trims the result set after the full scan and aggregation.

Next thought, perhaps the query can be improved:

SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 1000

Data fetched today, 2024-11-10: data2.zip

@hugovk (Owner) commented Nov 10, 2024

Omitting the LIMIT is the same as fetching everything, effectively the same as LIMIT 1000000, and costs the same ($0.20):

SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC

@hugovk (Owner) commented Nov 24, 2024

I've switched to fetching data for all installers: #39

And I've written this up at https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o
