back to legacy lot ids & add more original scrapers #4

Open
defgsus opened this issue Dec 30, 2021 · 8 comments
Labels
compatibility (Compatibility with old ParkAPI implementation)

Comments

@defgsus
Collaborator

defgsus commented Dec 30, 2021

All lots need to have the same ID as was generated by the geojson wrapper in the original ParkAPI (as discussed in issue #1).

In essence that means:

  • start with the legacy city name (e.g. frankfurt instead of frankfurt-am-main)
  • no spaces
  • use utils/strings/name_to_legacy_id to convert the name strings
  • new scrapers can use utils/strings/name_to_id, which allows - separators

branch: feature/legacy-ids
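As an illustration of the two helpers named above, here is a minimal sketch of how the conversion might look. Only the function names come from utils/strings; the bodies (umlaut transliteration, the exact character set) are assumptions, not the project's actual code:

```python
import re


def name_to_legacy_id(city: str, name: str) -> str:
    """Sketch: mimic the original ParkAPI geojson wrapper. Lowercase,
    transliterate umlauts, strip everything else (so no spaces, and
    no "(", ")" or "&" either)."""
    s = f"{city}{name}".lower()
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(a, b)
    return re.sub(r"[^a-z0-9]", "", s)


def name_to_id(city: str, name: str) -> str:
    """Sketch: for new scrapers, same idea but '-' separators are kept."""
    s = f"{city}-{name}".lower()
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(a, b)
    s = re.sub(r"[^a-z0-9\-]+", "-", s)
    return re.sub(r"-+", "-", s).strip("-")


assert name_to_legacy_id("frankfurt", "Hauptwache (P1)") == "frankfurthauptwachep1"
assert name_to_id("frankfurt", "Hauptwache (P1)") == "frankfurt-hauptwache-p1"
```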

defgsus added a commit that referenced this issue Dec 30, 2021
defgsus added a commit that referenced this issue Dec 30, 2021
+ move all original scrapers to `original/` path
+ move all new scrapers to `new/`
defgsus added a commit that referenced this issue Dec 30, 2021
@defgsus defgsus added the compatibility label Dec 30, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
@defgsus defgsus changed the title from "back to legacy lot ids" to "back to legacy lot ids & add more original scrapers" Dec 31, 2021
@defgsus
Collaborator Author

defgsus commented Dec 31, 2021

Dear @jklmnn,
adding the original scrapers is some work; it can take more than an hour for one city. However, it's progressing. I'm testing everything properly, replacing http with https, and for the meta-infos I usually merge the scraped data with the original geojson files. E.g. in the Freiburg scraper (in get_lot_infos) the original ParkAPI geojson is downloaded from github and combined with the geojson from the Freiburg server. (Once the new geojson file is written, the get_lot_infos method is not called anymore and the code becomes obsolete, not needed unless maybe a new lot appears within the pool. Though once the geojson file is edited by hand, this becomes more complicated...)
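A minimal sketch of that merge idea, assuming both sources are plain geojson FeatureCollections; the URLs and the function name are illustrative, not the project's code:

```python
import requests

# Both URLs are shown for illustration only.
ORIGINAL_GEOJSON_URL = (
    "https://raw.githubusercontent.com/offenesdresden/ParkAPI"
    "/master/park_api/cities/Freiburg.geojson"
)
CITY_GEOJSON_URL = "https://example.freiburg.de/lots.geojson"  # placeholder


def merged_lot_infos() -> dict:
    """Merge lot meta-info from the original ParkAPI geojson with the
    geojson served by the city; city data wins where both define a field."""
    original = requests.get(ORIGINAL_GEOJSON_URL).json()
    scraped = requests.get(CITY_GEOJSON_URL).json()

    lots: dict = {}
    for feature in original["features"]:
        props = dict(feature.get("properties") or {})
        name = props.get("name")
        if name:
            props["coordinates"] = feature["geometry"]["coordinates"]
            lots[name] = props
    for feature in scraped["features"]:
        props = feature.get("properties") or {}
        name = props.get("name")
        if name:
            # override/extend the original entry with the city's data
            lots.setdefault(name, {}).update(props)
    return lots
```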

I also update addresses if the website supplies more complete ones, such as an added zip code, and I add public or source urls for each lot where available.

I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

For Dresden I scraped the geo coordinates from the website when available and used the ParkAPI geojson if no coords are listed. The website coordinates have more digits, so I thought this might be a good thing. But I guess it's possible that you and other contributors have picked more useful coordinates by hand, so this needs to be reviewed (and not only for Dresden).

Anyway, I'm doing my best (to the best of my knowledge) to integrate the original scrapers and upgrade the meta-info where possible.

I also wrote to the Frankfurt opendata people about their outage (the feed stopped working on 2021/12/17).

Boy, I'm really looking forward to getting this project into production!

Best regards and a happy new year

defgsus added a commit that referenced this issue Dec 31, 2021
The original lot_ids consisted only of digits.
Now they are "hamburg-1234". For consistency we might go
back to digits only for this particular lot. We'll see.
defgsus added a commit that referenced this issue Jan 1, 2022
+ add Scraper.ALLOW_SSL_FAILURE variable to allow, e.g., expired certificates
(parken.heidelberg.de cert expired 2021/12/31)
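A minimal sketch of how such a flag might be honored, assuming the scraper fetches pages with the requests library; only the ALLOW_SSL_FAILURE name comes from the commit above, the rest is an assumption:

```python
import requests
import urllib3


class Scraper:
    # From the commit above: allows, e.g., expired certificates.
    ALLOW_SSL_FAILURE = False

    def request(self, url: str) -> requests.Response:
        if self.ALLOW_SSL_FAILURE:
            # silence the InsecureRequestWarning that verify=False triggers
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        return requests.get(url, verify=not self.ALLOW_SSL_FAILURE)


class HeidelbergScraper(Scraper):
    # hypothetical subclass: parken.heidelberg.de cert expired 2021/12/31
    ALLOW_SSL_FAILURE = True
```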
defgsus added a commit that referenced this issue Jan 2, 2022
defgsus added a commit that referenced this issue Jan 2, 2022
… to scrape from the site

so the original geojson was just ported and finito
defgsus added a commit that referenced this issue Jan 2, 2022
defgsus added a commit that referenced this issue Jan 2, 2022
The sub-pages for each parking lot actually contain the
lot timestamp and num_free/capacity values even when the lot is
closed. It takes a couple of extra requests, but I think it's worth it.

Attention: original lot ids contained characters "(", ")" and "&".
Changed the name_to_legacy_id() function to remove these characters as
they are potentially bad for filenames.
defgsus added a commit that referenced this issue Jan 2, 2022
are "koeln-x-y123" (taken from the Köln feature identifier)

We might change this back to the original legacy IDs but it was
a bit more difficult here as the lot names in the live data and
the names in the original ParkAPI geojson are quite different at times.
defgsus added a commit that referenced this issue Jan 2, 2022
… for live capacity

and full addresses.

Attention: The lot "Byk Gulden Str." is currently out of order, does not have a linked page and
is not in the geojson file!
defgsus added a commit that referenced this issue Jan 2, 2022
@jklmnn
Collaborator

jklmnn commented Jan 3, 2022

Great work!

> I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

This is generally a good idea. However, I can't say for sure whether we can keep this when it goes into production. It might cause problems with legacy clients.

defgsus added a commit that referenced this issue Jan 3, 2022
@defgsus
Collaborator Author

defgsus commented Jan 3, 2022

> However, I can't say for sure whether we can keep this when it goes into production. It might cause problems with legacy clients.

Yes, replacing Nones with zeros in the v1 API should be no problem. In the dumps, snapshots with None can probably just be skipped.
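A minimal sketch of that v1 compatibility mapping; the field names free and total are assumptions about the v1 schema, not confirmed by this thread:

```python
def to_v1_lot(lot: dict) -> dict:
    """Sketch: map internal None values back to the zeros that
    legacy v1 clients expect."""
    return {
        **lot,
        "free": lot["free"] if lot.get("free") is not None else 0,
        "total": lot["total"] if lot.get("total") is not None else 0,
    }
```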

There are incompatibilities with some lot_ids, though, and other tricky stuff ;) I'll implement the remaining scrapers and then do a scripted comparison with api.parkendd.de.

Then we'll certainly have some things to discuss and compromises to find.

@defgsus
Collaborator Author

defgsus commented Jan 5, 2022

The Frankfurt case: https://www.offenedaten.frankfurt.de/blog/aktualisierungverkehrsdaten

From the email: "... Sobald vom Hersteller ein entsprechender Sicherheitspatch eingespielt wurde, ..." ("... As soon as the vendor has installed a corresponding security patch, ...")

hehe

@defgsus
Collaborator Author

defgsus commented Jan 5, 2022

Okaaayyyhhh, here is the first attempt to compare api.parkendd.de against ParkAPI2/api/v1. I've gotten used to calling the former ParkAPI1 (or pa1) and the latter ParkAPI2.

https://github.com/defgsus/ParkAPI2/wiki/v1-api-comparison

I've only compared the 'city' metadata, not the lots; it's complex enough already. You can have a look if you like. I'm still preparing a more readable document with the specific compatibility issues.

One thing is certain: using names as IDs will remain problematic. They do change occasionally.

@jklmnn
Collaborator

jklmnn commented Jan 17, 2022

Sorry for the late reply. The problem with the lot IDs is that not all sources have real IDs, so we need to keep some kind of fallback. In the end, if there is no unique persistent ID and the data source decides to change the name, there isn't really anything we can do.
We could use the location in some form, though. This is based on the assumption that a parking lot can't easily relocate itself, and if it does, we can safely assume that it is a different one. This would also be useful if someone wants to use this data for analysis, since a different location might have implications for the traffic around the parking lot.
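Purely as an illustration of this idea, a location-derived ID could look like the following sketch; the rounding precision and the format are assumptions, not anything agreed on in this thread:

```python
def location_id(city: str, lat: float, lon: float) -> str:
    """Sketch: derive a stable lot ID from coordinates, assuming a lot
    does not relocate. Rounding to 4 decimals (roughly 11 m) absorbs
    tiny coordinate corrections while still separating nearby lots."""
    return f"{city}-{lat:.4f}-{lon:.4f}".replace(".", "")


assert location_id("dresden", 51.05089, 13.73832) == "dresden-510509-137383"
```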

@defgsus
Collaborator Author

defgsus commented Jan 17, 2022

Yes, it's complicated with those IDs. I'm really just being picky because of later
statistics use. Your location idea sounds quite good in this regard.

For daily use it's probably no problem if a lot name changes, apart from the fact
that it is no longer associated with its former .geojson entry, which,
in ParkAPI2, would exclude it from the v1 API because it has no location and
therefore no associated city.

With the right measures and follow-up maintenance this can be somewhat managed.

When porting your scrapers I found permanent IDs on some websites, but with
the current data structure it's not possible to switch to those IDs for
identification while keeping the original IDs (from the lot names) for
compatibility.

I found so many little compatibility challenges during the port that it
felt like real work. Well, at least I spent a couple of real working hours ;)

In the midst of it I started writing the following overview. There are things I
wanted to add later, but I simply forgot them.

General changes to scrapers

(no specific order, numbers are just for communication)

  1. Added lot_type "bus", which should be excluded from the API by default; it's just for statistics.
  2. All http urls are changed to https. Even scraped links to individual lot pages are adjusted if needed.
  3. Removed a Pool's public urls that just point to www.<city>.de. Where possible, they were replaced by something like www.<city>.de/parken/.
    General url logic: if a Pool's public_url is scraped, source_url is left empty.
  4. Added public_url to all lots that have an individual public webpage.
  5. City names are queried from nominatim reverse search using each lot's coordinates (see the sketch after this list). The coordinates of a city in api.parkendd.de/ are the centers of the city polygon as returned by nominatim. The original values from the ParkAPI1 geojson files are ignored because there is no particular pool -> city mapping.
  6. A new lot property is live_capacity. It simply means: if there is a capacity number on the website, it will be scraped with every snapshot. If not, the static capacity from the .geojson file is used, and live_capacity should be False to signal that the capacity number is static and might not reflect the true capacity at any point in time.
  7. Some cities have fewer lots now, judging by the API comparison. I need to find out for each lot what is going on there... The problem is that the missing lots might not be in the new .geojson files (if these have been re-rendered by scraping the page).
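As referenced in point 5, a sketch of such a nominatim reverse lookup. The endpoint and parameters follow the public Nominatim API; the function itself is illustrative, not the project's code:

```python
import requests


def city_for_coordinates(lat: float, lon: float) -> str:
    """Sketch: reverse-geocode a lot's coordinates to a city name.
    zoom=10 asks Nominatim for city-level detail; a User-Agent is
    required by the Nominatim usage policy."""
    response = requests.get(
        "https://nominatim.openstreetmap.org/reverse",
        params={"lat": lat, "lon": lon, "format": "jsonv2", "zoom": 10},
        headers={"User-Agent": "ParkAPI2-example"},
    )
    response.raise_for_status()
    address = response.json().get("address", {})
    # depending on the place, the name may sit under different keys
    return address.get("city") or address.get("town") or address.get("village", "")
```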

Individual scraper changes

  • Hamburg: The original lot_ids were plain numbers. Added a hamburg- prefix. (We can switch back to the original IDs, though, if needed.)
  • Karlsruhe: Now scraping one webpage for every lot (K04, S07, ...). That way we can read the true update timestamp and live capacity.
  • Köln: There were so many lot name changes that the scraper is using the IDs now. They look like koeln-d-p001, koeln-ph03, ... (see here)
  • Magdeburg: Removed the lots "Endstelle Diesdorf", "Milchhof" and "Lange Lake" (from the second table here) because they do not list the number of free spaces (only the capacity).
  • Münster: Explicitly flagged the "Busparkplatz" as lot_type bus, which will remove it from default API responses.
  • Nürnberg: "Tiefgarage Findelgasse" changed to "Parkhaus Findelgasse", so the lot_id changed as well. Should check whether that is really what happened and maybe relocate the entrance.
  • Wiesbaden: One web request per lot. They changed the website, and the only way I found is to request individual geoportal urls.

That's it for now.

Please let me know what you think, and let us progress, slowly...

@jklmnn
Collaborator

jklmnn commented Jul 7, 2022

I just checked the available cities after our current outage, and the only city I can see missing is Hanau. So after we add this I'd say we can close this issue.
