back to legacy lot ids & add more original scrapers #4

Open
defgsus opened this issue Dec 30, 2021 · 8 comments
Labels
compatibility (Compatibility with old ParkAPI implementation)

Comments

@defgsus
Collaborator

defgsus commented Dec 30, 2021

All lots need to have the same ID as was generated by the geojson wrapper in the original ParkAPI (as discussed in issue #1).

In essence that means:

  • start with the legacy city name (e.g. frankfurt instead of frankfurt-am-main)
  • no spaces
  • use utils/strings/name_to_legacy_id to convert the name strings
  • new scrapers can use utils/strings/name_to_id, which allows - separators

branch: feature/legacy-ids
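As an illustration of the two helpers named above, here is a minimal sketch of how the conversion might look. Only the function names come from utils/strings; the bodies (umlaut transliteration, the exact character set) are assumptions, not the project's actual code:

```python
import re


def name_to_legacy_id(city: str, name: str) -> str:
    """Sketch: mimic the original ParkAPI geojson wrapper. Lowercase,
    transliterate umlauts, strip everything else (so no spaces, and
    no "(", ")" or "&" either)."""
    s = f"{city}{name}".lower()
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(a, b)
    return re.sub(r"[^a-z0-9]", "", s)


def name_to_id(city: str, name: str) -> str:
    """Sketch: for new scrapers, same idea but '-' separators are kept."""
    s = f"{city}-{name}".lower()
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(a, b)
    s = re.sub(r"[^a-z0-9\-]+", "-", s)
    return re.sub(r"-+", "-", s).strip("-")


assert name_to_legacy_id("frankfurt", "Hauptwache (P1)") == "frankfurthauptwachep1"
assert name_to_id("frankfurt", "Hauptwache (P1)") == "frankfurt-hauptwache-p1"
```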

defgsus added a commit that referenced this issue Dec 30, 2021
defgsus added a commit that referenced this issue Dec 30, 2021
+ move all original scrapers to `original/` path
+ move all new scrapers to `new/`
defgsus added a commit that referenced this issue Dec 30, 2021
@defgsus defgsus added the compatibility label Dec 30, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
defgsus added a commit that referenced this issue Dec 31, 2021
@defgsus defgsus changed the title from "back to legacy lot ids" to "back to legacy lot ids & add more original scrapers" Dec 31, 2021
@defgsus
Collaborator Author

defgsus commented Dec 31, 2021

Dear @jklmnn,
adding the original scrapers is some work; it can take more than an hour for one city. However, it's progressing. I'm testing everything properly, replacing http with https, and for the meta-infos I usually merge the scraped data with the original geojson files. E.g. in the Freiburg scraper (in get_lot_infos) the original ParkAPI geojson is downloaded from github and combined with the geojson from the Freiburg server. (Once the new geojson file is written, the get_lot_infos method is not called anymore and the code becomes obsolete, not needed unless maybe a new lot appears within the pool. Though once the geojson file is edited by hand, this becomes more complicated...)
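A minimal sketch of that merge idea, assuming both sources are plain geojson FeatureCollections; the URLs and the function name are illustrative, not the project's code:

```python
import requests

# Both URLs are shown for illustration only.
ORIGINAL_GEOJSON_URL = (
    "https://raw.githubusercontent.com/offenesdresden/ParkAPI"
    "/master/park_api/cities/Freiburg.geojson"
)
CITY_GEOJSON_URL = "https://example.freiburg.de/lots.geojson"  # placeholder


def merged_lot_infos() -> dict:
    """Merge lot meta-info from the original ParkAPI geojson with the
    geojson served by the city; city data wins where both define a field."""
    original = requests.get(ORIGINAL_GEOJSON_URL).json()
    scraped = requests.get(CITY_GEOJSON_URL).json()

    lots: dict = {}
    for feature in original["features"]:
        props = dict(feature.get("properties") or {})
        name = props.get("name")
        if name:
            props["coordinates"] = feature["geometry"]["coordinates"]
            lots[name] = props
    for feature in scraped["features"]:
        props = feature.get("properties") or {}
        name = props.get("name")
        if name:
            # override/extend the original entry with the city's data
            lots.setdefault(name, {}).update(props)
    return lots
```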

I also update addresses if the website supplies more complete ones, such as an added zip code, and I add public or source urls for each lot where available.

I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

For Dresden I scraped the geo coordinates from the website when available and used the ParkAPI geojson if no coords are listed. The website coordinates have more digits, so I thought this might be a good thing. But I guess it's possible that you and other contributors have picked more useful coordinates by hand, so this needs to be reviewed (and not only for Dresden).

Anyway, I'm doing my best (to the best of my knowledge) to integrate the original scrapers and upgrade the meta-info where possible.

I also wrote to the Frankfurt opendata people about their outage (the feed stopped working on 2021/12/17).

Boy, I'm really looking forward to getting this project into production!

Best regards and a happy new year

defgsus added a commit that referenced this issue Dec 31, 2021
The original lot_ids consisted only of digits.
Now they are "hamburg-1234". For consistency we might go
back to digits only for this particular lot. We'll see.
defgsus added a commit that referenced this issue Jan 1, 2022
+ add Scraper.ALLOW_SSL_FAILURE variable to allow, e.g., expired certificates
(parken.heidelberg.de cert expired 2021/12/31)
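A minimal sketch of how such a flag might be honored, assuming the scraper fetches pages with the requests library; only the ALLOW_SSL_FAILURE name comes from the commit above, the rest is an assumption:

```python
import requests
import urllib3


class Scraper:
    # From the commit above: allows, e.g., expired certificates.
    ALLOW_SSL_FAILURE = False

    def request(self, url: str) -> requests.Response:
        if self.ALLOW_SSL_FAILURE:
            # silence the InsecureRequestWarning that verify=False triggers
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        return requests.get(url, verify=not self.ALLOW_SSL_FAILURE)


class HeidelbergScraper(Scraper):
    # hypothetical subclass: parken.heidelberg.de cert expired 2021/12/31
    ALLOW_SSL_FAILURE = True
```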
defgsus added a commit that referenced this issue Jan 2, 2022
defgsus added a commit that referenced this issue Jan 2, 2022
… to scrape from the site

so the original geojson was just ported and finito
defgsus added a commit that referenced this issue Jan 2, 2022
defgsus added a commit that referenced this issue Jan 2, 2022
The sub-pages for each parking lot actually contain the
lot timestamp and num_free/capacity values even when the lot is
closed. It takes a couple of extra requests, but I think it's worth it.

Attention: original lot ids contained characters "(", ")" and "&".
Changed the name_to_legacy_id() function to remove these characters as
they are potentially bad for filenames.
defgsus added a commit that referenced this issue Jan 2, 2022
are "koeln-x-y123" (taken from the Köln feature identifier)

We might change this back to the original legacy IDs but it was
a bit more difficult here as the lot names in the live data and
the names in the original ParkAPI geojson are quite different at times.
defgsus added a commit that referenced this issue Jan 2, 2022
… for live capacity

and full addresses.

Attention: The lot "Byk Gulden Str." is currently out of order, does not have a linked page and
is not in the geojson file!
defgsus added a commit that referenced this issue Jan 2, 2022
@jklmnn
Collaborator

jklmnn commented Jan 3, 2022

Great work!

> I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

This is generally a good idea. However, I can't say for sure whether we can keep this when it goes into production. It might cause problems with legacy clients.

defgsus added a commit that referenced this issue Jan 3, 2022
@defgsus
Collaborator Author

defgsus commented Jan 3, 2022

> However, I can't say for sure whether we can keep this when it goes into production. It might cause problems with legacy clients.

Yes, replacing Nones with zeros in the v1 API should be no problem. In the dumps, snapshots with None can probably just be skipped.
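A minimal sketch of that v1 compatibility mapping; the field names free and total are assumptions about the v1 schema, not confirmed by this thread:

```python
def to_v1_lot(lot: dict) -> dict:
    """Sketch: map internal None values back to the zeros that
    legacy v1 clients expect."""
    return {
        **lot,
        "free": lot["free"] if lot.get("free") is not None else 0,
        "total": lot["total"] if lot.get("total") is not None else 0,
    }
```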

There are incompatibilities with some lot_ids, though, and other tricky stuff ;) I'll implement the remaining scrapers and then do a scripted comparison with api.parkendd.de.

Then we'll certainly have some things to discuss and compromises to find.

@defgsus
Collaborator Author

defgsus commented Jan 5, 2022

The Frankfurt case: https://www.offenedaten.frankfurt.de/blog/aktualisierungverkehrsdaten

From the email: "... Sobald vom Hersteller ein entsprechender Sicherheitspatch eingespielt wurde, ..." ("... As soon as the vendor has installed a corresponding security patch, ...")

hehe

@defgsus
Collaborator Author

defgsus commented Jan 5, 2022

Okaaayyyhhh, here is the first attempt to compare api.parkendd.de against ParkAPI2/api/v1. I've gotten used to calling the former ParkAPI1 (or pa1) and the latter ParkAPI2.

https://github.com/defgsus/ParkAPI2/wiki/v1-api-comparison

I've only compared the 'city' metadata, not the lots; it's complex enough already. You can have a look if you like. I'm still preparing a more readable document with the specific compatibility issues.

One thing is certain: using names as IDs will remain problematic. They do change occasionally.

@jklmnn
Collaborator

jklmnn commented Jan 17, 2022

Sorry for the late reply. The problem with the lot IDs is that not all sources have real IDs, so we need to keep some kind of fallback. In the end, if there is no unique persistent ID and the data source decides to change the name, there isn't really anything we can do.
We could use the location in some form, though. This is based on the assumption that a parking lot can't easily relocate itself, and if it does, we can safely assume that it is a different one. This would also be useful if someone wants to use this data for analysis, since a different location might have implications for the traffic around the parking lot.
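Purely as an illustration of this idea, a location-derived ID could look like the following sketch; the rounding precision and the format are assumptions, not anything agreed on in this thread:

```python
def location_id(city: str, lat: float, lon: float) -> str:
    """Sketch: derive a stable lot ID from coordinates, assuming a lot
    does not relocate. Rounding to 4 decimals (roughly 11 m) absorbs
    tiny coordinate corrections while still separating nearby lots."""
    return f"{city}-{lat:.4f}-{lon:.4f}".replace(".", "")


assert location_id("dresden", 51.05089, 13.73832) == "dresden-510509-137383"
```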

@defgsus
Collaborator Author

defgsus commented Jan 17, 2022

Yes, it's complicated with those IDs. I'm really just being picky because of later
statistics use. Your location idea sounds quite good in this regard.

For daily use it's probably no problem if a lot name changes, apart from the fact
that it is no longer associated with its former .geojson entry, which,
in ParkAPI2, would exclude it from the v1 API because it has no location and
therefore no associated city.

With the right measures and follow-up maintenance this can be somewhat managed.

When porting your scrapers I found permanent IDs on some websites, but with
the current data structure it's not possible to switch to those IDs for
identification while keeping the original IDs (from the lot names) for
compatibility.

I found so many little compatibility challenges during the port that it
felt like real work. Well, at least I spent a couple of real working hours ;)

In the midst of it I started writing the following overview. There are things I
wanted to add later, but I simply forgot them.

General changes to scrapers

(no specific order, numbers are just for communication)

  1. Added lot_type "bus", which should be excluded from the API by default; it's just for statistics.
  2. All http urls are changed to https. Even scraped links to individual lot pages are adjusted if needed.
  3. Removed a Pool's public urls that just point to www.<city>.de. Where possible, they were replaced by something like www.<city>.de/parken/.
    General url logic: if a Pool's public_url is scraped, source_url is left empty.
  4. Added public_url to all lots that have an individual public webpage.
  5. City names are queried from nominatim reverse search using each lot's coordinates (see the sketch after this list). The coordinates of a city in api.parkendd.de/ are the centers of the city polygon as returned by nominatim. The original values from the ParkAPI1 geojson files are ignored because there is no particular pool -> city mapping.
  6. A new lot property is live_capacity. It simply means: if there is a capacity number on the website, it will be scraped with every snapshot. If not, the static capacity from the .geojson file is used, and live_capacity should be False to signal that the capacity number is static and might not reflect the true capacity at any point in time.
  7. Some cities have fewer lots now, judging by the API comparison. I need to find out for each lot what is going on there... The problem is that the missing lots might not be in the new .geojson files (if these have been re-rendered by scraping the page).
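As referenced in point 5, a sketch of such a nominatim reverse lookup. The endpoint and parameters follow the public Nominatim API; the function itself is illustrative, not the project's code:

```python
import requests


def city_for_coordinates(lat: float, lon: float) -> str:
    """Sketch: reverse-geocode a lot's coordinates to a city name.
    zoom=10 asks Nominatim for city-level detail; a User-Agent is
    required by the Nominatim usage policy."""
    response = requests.get(
        "https://nominatim.openstreetmap.org/reverse",
        params={"lat": lat, "lon": lon, "format": "jsonv2", "zoom": 10},
        headers={"User-Agent": "ParkAPI2-example"},
    )
    response.raise_for_status()
    address = response.json().get("address", {})
    # depending on the place, the name may sit under different keys
    return address.get("city") or address.get("town") or address.get("village", "")
```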

Individual scraper changes

  • Hamburg: The original lot_ids were plain numbers. Added a hamburg- prefix. (We can switch back to the original IDs, though, if needed.)
  • Karlsruhe: Now scraping one webpage for every lot (K04, S07, ...). That way we can read the true update timestamp and live capacity.
  • Köln: There were so many lot name changes that the scraper is using the IDs now. They look like koeln-d-p001, koeln-ph03, ... (see here)
  • Magdeburg: Removed the lots "Endstelle Diesdorf", "Milchhof" and "Lange Lake" (from the second table here) because they do not list the number of free spaces (only the capacity).
  • Münster: Explicitly flagged the "Busparkplatz" as lot_type bus, which will remove it from default API responses.
  • Nürnberg: "Tiefgarage Findelgasse" changed to "Parkhaus Findelgasse", so the lot_id changed as well. Should check whether that is really what happened and maybe relocate the entrance.
  • Wiesbaden: One web request per lot. They changed the website, and the only way I found is to request individual geoportal urls.

That's it for now.

Please let me know what you think, and let us progress, slowly...

@jklmnn
Collaborator

jklmnn commented Jul 7, 2022

I just checked the available cities after our current outage, and the only city I can see missing is Hanau. So after we add this I'd say we can close this issue.
