# Data missing due to exceeding quota #36
Strange, still hitting the limit two weeks later:
That was for 7,000 projects.
Google BigQuery gives 1 TB of free quota per month. Here's some data on past runs. The gaps indicate switches to fetching data for more projects. The 5k growth seemed fairly linear, so when switching to 8k I made a rough calculation that the ~0.5 TB of remaining quota would last about 40 months. Instead, since the turn of 2024 the growth has accelerated, and the 8k run is now over 1 TB.
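Here's that back-of-envelope calculation as a sketch. The 1 TB quota is real, but the usage and growth figures are rough assumptions reconstructed from the description above, not measurements:

```python
# Rough headroom estimate for the BigQuery free tier.
TB = 10**12

free_quota = 1 * TB        # free query data processed per month
usage_8k = 0.5 * TB        # assumed: rough monthly cost of the 8k-project run
growth_per_month = 12.5e9  # assumed: linear growth in bytes per month

months = (free_quota - usage_8k) / growth_per_month
print(f"Headroom lasts about {months:.0f} months")  # -> about 40
```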
I've started the 90-day, $300 (€277.46) free trial to get things back on track and fetched the July data:

```
pypinfo -v --json --indent 0 --limit 8000 -sd 2024-07-02 -ed 2024-07-31 "" project > top-pypi-packages-30-days.json
jq -c . < top-pypi-packages-30-days.json > top-pypi-packages-30-days.min.json
jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv
./deploy.sh
```

https://github.com/hugovk/top-pypi-packages/releases/tag/2024.08

This is indeed over 1 TB; it's 1.023 TB. At some point whilst on the free trial, I'll do some experiments to pick a lower number of projects that's still under 1 TB, but it's concerning that usage grew by ~0.5 TB in about half a year. Let's keep this open until that is done, and to check that the September update goes out as planned on Sunday.
Here's a couple of pypinfo commands, limiting to just the top 10 and 1 day, otherwise identical except for the `--all` flag.
By default, it only fetches downloads from pip, and that's what has been used in this repo (since day 1? Or perhaps since that flag was introduced? Would have to check. At least, the default has been used here for a long time).

### Only pip
### All installers

```
❯ pypinfo --all --limit 10 --days 1 "" project
Served from cache: False
Data processed: 46.63 GiB
Data billed: 46.63 GiB
Estimated cost: $0.23
```
So we can see the default pip-only query costs an extra 25% in data processed and data billed, and an extra 25% in dollars. Unsurprisingly, the actual download counts are higher for all installers. The ranking has changed a bit, but I expect we're still getting more-or-less the same packages in the top thousands of results.

### Queries

Here's the same commands, but run with `--test`:

```
❯ pypinfo --test --limit 10 --days 1 "" project
```

```sql
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
  AND details.installer.name = "pip"
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10
```

```
❯ pypinfo --test --all --limit 10 --days 1 "" project
```

```sql
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 10
```

These queries are the same, except the default has an extra `AND details.installer.name = "pip"` condition in the `WHERE` clause. It sounds reasonable that it would cost more to do the extra filtering work: presumably BigQuery's storage is columnar and it bills by the columns a query reads, so referencing `details.installer.name` means scanning an extra column's worth of data.
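As an aside, BigQuery supports dry runs, which report how many bytes a query would scan without actually running (or billing) it. Here's a minimal sketch of comparing the two queries that way, using the `google-cloud-bigquery` client; this is my illustration rather than part of this repo's tooling, and it assumes credentials are already configured in the environment:

```python
from google.cloud import bigquery

client = bigquery.Client()
# dry_run=True estimates bytes scanned without executing the query.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

QUERY = """
SELECT file.project AS project, COUNT(*) AS download_count
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY)
                    AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
{pip_filter}
GROUP BY project
ORDER BY download_count DESC
LIMIT 10
"""

for label, pip_filter in [
    ("all installers", ""),
    ("pip only", 'AND details.installer.name = "pip"'),
]:
    job = client.query(QUERY.format(pip_filter=pip_filter), job_config=config)
    print(f"{label}: {job.total_bytes_processed:,} bytes")
```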
### Installers

Let's look at the installers:

```
❯ pypinfo --all --limit 100 --days 1 "" installer
Served from cache: False
Data processed: 29.49 GiB
Data billed: 29.49 GiB
Estimated cost: $0.15
```
pip is still by far the most popular, and unsurprisingly uv is up there too, with about 10% of pip's downloads. The others are at 10% of uv or less, and a lot of them are mirroring services that we wanted to exclude before. Given uv's importance, my expectation that it will continue to take a bigger share of the pie, and especially the extra cost of filtering to just pip, I think we should switch to fetching data for all installers. The other installers don't account for much of the pie anyway.

Next time, I'll also do some comparison between different totals: should we keep 8,000, go back to 5,000, or some other number?
Next, let's see how the number of projects affects the cost. (This data was fetched yesterday, 2024-11-09.)

### Days = 1, limit = 1000...8000

Let's begin by checking just one day for different limits: 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000. We'll use `--all`, as decided above. Here's the basic query:

```
❯ pypinfo --test --all -v --json --indent 0 --days 1 --limit 8000 "" project
Credentials location set to "/Users/hugo/Dropbox/bin/data/pypinfo-ea8f23d28f2b.json".
```

```sql
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 8000
```

Let's use a helper script like this:

```python
import os

days = 1
for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
    outfile = f"days-{days}-limit-{limit}-all.json"
    if os.path.exists(outfile):
        print(f"{outfile} exists, skipping")
        continue
    cmd = (
        f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
        f"project > {outfile}"
    )
    print(cmd)
    os.system(cmd)
```

Result: interestingly, the cost is the same for all limits (1000-8000): $0.31. In hindsight this makes sense: `LIMIT` only restricts the rows returned after the full scan and aggregation, so it doesn't reduce the bytes processed.

### Days = 1, limit = 1000...8000, pip

Let's repeat, but removing `--all`.

Result: the cost is again the same for all limits, but it increased: $0.39.

### Days = 30, limit = 8000...1000

Next, back to days = 30:

```python
for limit in (8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000):
    ...
```

Result: again, the cost is the same regardless of limit: $4.89.

### Days = 1, 30, limit = 1

Let's also check for limit = 1.

Result: the same cost as above, $0.31 for 1 day and $4.89 for 30 days.

### Days = 1...30, limit = 1

Now we know the cost is the same regardless of the limit (at least up to 8000), so let's see how the cost varies with the number of days:

```python
import os

# for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
for days in range(1, 31):
    for limit in (1,):
        outfile = f"days-{days}-limit-{limit}-all.json"
        if os.path.exists(outfile):
            print(f"{outfile} exists, skipping")
            continue
        cmd = (
            f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
            f"project > {outfile}"
        )
        print(cmd)
        os.system(cmd)
```

Fetch the results from the JSONs into a tab-separated table:

```python
import json

print("days\testimated_cost\tbytes_billed\tbytes_processed")
for days in range(1, 31):
for limit in (1,):
outfile = f"days-{days}-limit-{limit}-all.json"
with open(outfile) as f:
data = json.load(f)
out = [
days,
data["query"]["estimated_cost"],
f"{data["query"]["bytes_billed"]:,}",
f"{data["query"]["bytes_processed"]:,}",
]
print("\t".join(map(str, out))) Gives: days estimated_cost bytes_billed bytes_processed
1 0.31 67,423,436,800 67,423,031,967
2 0.49 106,777,542,656 106,776,936,726
3 0.66 144,589,193,216 144,588,849,934
4 0.84 183,362,387,968 183,362,056,051
5 0.94 205,844,905,984 205,844,089,492
6 1.05 230,423,527,424 230,423,239,221
7 1.20 263,483,031,552 263,482,009,301
8 1.37 301,120,618,496 301,119,942,981
9 1.56 340,849,065,984 340,848,563,030
10 1.73 379,594,997,760 379,594,187,471
11 1.91 419,779,575,808 419,778,888,347
12 2.02 442,795,819,008 442,795,580,469
13 2.11 463,927,771,136 463,927,460,363
14 2.28 500,829,257,728 500,828,317,940
15 2.47 541,425,926,144 541,424,988,775
16 2.64 580,075,388,928 580,075,257,785
17 2.83 621,216,268,288 621,215,221,560
18 3.01 660,446,642,176 660,445,832,190
19 3.13 686,769,045,504 686,768,435,724
20 3.24 710,389,268,480 710,388,303,159
21 3.41 747,798,265,856 747,798,009,119
22 3.59 789,304,049,664 789,303,120,956
23 3.78 830,582,292,480 830,581,546,594
24 3.97 871,580,565,504 871,579,537,799
25 4.14 908,331,057,152 908,330,312,456
26 4.25 933,782,093,824 933,781,258,405
27 4.35 956,512,075,776 956,511,698,396
28 4.52 992,670,121,984 992,669,344,697
29 4.70 1,032,543,272,960 1,032,542,495,766
30 4.89 1,073,809,981,440 1,073,809,418,894
```

Result: it's the number of days that affects the cost.

Google: "The first 1 TiB of query data processed per month is free." 1 TiB = 1,099,511,627,776 bytes, and all of the values in the table above are below this limit.

Note, this is with `--all`. The last run in this repo, without `--all`:

top-pypi-packages/top-pypi-packages-30-days.json Lines 2 to 7 in 81c3f01
So we can remove the pip-only filter. We're at 1,073,809,981,440 bytes, and the limit is not far off at 1,099,511,627,776. In the future, to stay under 1 TiB, we'd need to reduce the number of days by some unclear amount. Well, let's add `--all` for now.

Here's the data collected from these tests: data.zip
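To put a rough number on "some unclear amount", here's a quick projection. This is my sketch, assuming bytes scanned grow roughly linearly with the number of days, using the 30-day figure from the table above:

```python
TIB = 2**40  # 1,099,511,627,776 bytes: the monthly free tier

bytes_30_days = 1_073_809_981_440  # measured above for days = 30
bytes_per_day = bytes_30_days / 30

max_days = TIB / bytes_per_day
print(f"About {max_days:.1f} days fit under the 1 TiB free tier")  # ~30.7
```

So on this (naive) model there's only about 2% headroom: less than one extra day's worth of data.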
Next, repeating with the limit increasing by powers of ten, up to 1,000,000, to get data for all 531,022 projects on PyPI.
Result: again, the same cost for 1 project as for all 531,022 projects.

Next thought: perhaps the query can be improved:

```sql
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
LIMIT 1000
```

Data fetched today, 2024-11-10: data2.zip
Omitting the `LIMIT`:

```sql
SELECT
  file.project as project,
  COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
```
I've switched to fetching data for all installers: #39. And I've written this up at https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o
Looking at the history, it seems as though the intention is to publish updated data on the first of each month. The last update was 1st July.