Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup and documentation #3

Closed
jklmnn opened this issue Dec 20, 2021 · 12 comments
Closed

Setup and documentation #3

jklmnn opened this issue Dec 20, 2021 · 12 comments

Comments

@jklmnn
Copy link
Collaborator

jklmnn commented Dec 20, 2021

I finally found the time to take a deeper look. I tried to run it on Debian 11 without using Docker (as this is what we run our infrastructure on). These are my findings so far:

  • libgdal-dev is needed to run, we maybe should add that to the documentation if Docker is not used (together with libpq-dev)
  • We also need postgresql-13-postgis-3-scripts (or the respective package for other postgresql versions)
  • I don't like the use of default users (okay) and passwords (definitely not okay). Not only may it lead people to use them. It also creates the danger of accidentally forgetting to set/reconfigure any of the default values which may cause security problems. We should not provide default passwords and enforce the explicit setting of any value required for authentication.
  • Making the database user a super user just to enable the extension is overkill. I think we should manually enable the extension when the database is created and the database user should not be a super user (neither in production nor testing).
  • I tried to scrape but Frankfurt is currently broken (if you just converted our old modules that's no surprise, it has been broken for a while now). I noticed that this seems to break the whole scraping process. This should not happen. Sometimes there are regularly problems with some data sources, this should not impact the scraping of any other sources.
  • The jsonschema version referenced in web/scrapers/builtin/requirements.txt doesn't seem to exist. According to PyPI, only 3.1.1 and 3.2.0 exist, but not 3.2.1. Is there anything that prevents us from using the latest version (4.3.2)? I tried that and for the things I tested, it worked.
@defgsus
Copy link
Collaborator

defgsus commented Dec 21, 2021

Hi jk, glad you eventually jumped in ;)

  • Yes, the gdal and other utilities are needed. The README links to the install docs on the django page but i'll add the mentioned libraries explicitly.

  • I'll remove default user/password settings and actually will rewrite the configuration using python-decouple. They support little .env files and overwriting via command-line params.

  • Yeah, i don't like the postgres superuser thing either. For the real database, whether in production or local dev, it's no problem to enable the postgis extension once and the leave it. The only purpose for the superuser was to ease the unittest run. It's possible to run ./manage.py test --keepdb though. I'll check that and update the REAMDE.

  • I copied the Frankfurt code and made some adjustments, mainly the source url. I think they are rotating the resource uuid or something. Possibly the scraper needs to get current url from their webpage.

    The scraping of other pools should not be affected, though. When running ./manage.py pa_scrape scrape the stacktrace is just printed to the console but the other scrapers work. You can verify with ./manage.py pa_stats -t 2m to see the number of errors and successful data collections within the last 2 minutes (-t 2m ).

    Btw. the ffh pool does also currently fail and i think it can be removed completely as they only show cities which are already covered. And without any metadata.

  • jsonschema could sure be the newest version, don't remember how i came to the version number in the requirements. Actually now that i see that my venv uses 4.2.1 it's quite obviously a typo.

  • One more thing. If docker is not used in our production i might as well remove the Dockerfile and docs for now. I'm not happy with the ubuntu image and there should also be a setup to dockercompose with the database container. It takes some work and is not immediately relevant.

defgsus added a commit that referenced this issue Dec 21, 2021
- use python-decouple for environment settings
- add .env.example file to get started
- remove default db user and password mentionings
- update README documentation
@defgsus
Copy link
Collaborator

defgsus commented Dec 21, 2021

Like the commits say:

  • decouple is now managing the environment settings and there's a little example file with helpful comments.
  • no default user and password settings (i will remember this advice)
  • no postgres superuser required (because developer, not django, is responsible for creating the test database)

The setup requires to copy and adjust a few lines of postgres statements. Maybe we should supply a setup file similar to your ./bin/parkapi-setupdb

I also removed the Docker part because i'm in no mood right now to test and verify the changes

@defgsus
Copy link
Collaborator

defgsus commented Dec 22, 2021

The Frankfurt data source simply does not deliver data right now (https://offenedaten.frankfurt.de/dataset/parkdaten-dynamisch).

Please check if everything regarding this issue is working so far and then close it if you like. Next time i'll put everything on a branch.

Wünsche frohe Festtage und so :D

🌌

defgsus added a commit that referenced this issue Dec 30, 2021
defgsus added a commit that referenced this issue Dec 30, 2021
defgsus added a commit that referenced this issue Dec 30, 2021
defgsus added a commit that referenced this issue Dec 30, 2021
@defgsus
Copy link
Collaborator

defgsus commented Dec 30, 2021

Haha, have put the wrong number on the commits, they are all supposed to relate to issue #4.

@jklmnn
Copy link
Collaborator Author

jklmnn commented Feb 24, 2022

So, after a long time, I tried to run the server again ;)
I'm able to run the server now. I like the solution to use environment variables instead of default credentials. Yet I had to figure may way through the error messages of missing parameters so it might make sense to have them documented somewhere. E.g. I didn't really know what DJANGO_SECRET_KEY is used for.

Unfortunately I wasn't able to run the scraper itself. After installing all dependencies from the requirements.txt files it failed with the following error:

./manage.py pa_scrape scrape
module 'builtin' scraping all pools


ERROR in module builtin:
 Traceback (most recent call last):
  File "/.../ParkAPI2/web/scrapers/builtin/scraper.py", line 187, in <module>
    main(**parse_args())
  File "/.../ParkAPI2/web/scrapers/builtin/scraper.py", line 135, in main
    scrapers = get_scrapers(pool_filter=pools)
  File "/.../ParkAPI2/web/scrapers/builtin/scraper.py", line 77, in get_scrapers
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/.../ParkAPI2/web/scrapers/builtin/new/bahn.py", line 12, in <module>
    from decouple import config
ModuleNotFoundError: No module named 'decouple'

I tried installing decouple from pip but then got the following error:

Traceback (most recent call last):
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/core/management/__init__.py", line 237, in fetch_command
    app_name = commands[subcommand]
KeyError: 'pa_scrape'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../ParkAPI2/web/./manage.py", line 22, in <module>
    main()
  File "/.../ParkAPI2/web/./manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/core/management/__init__.py", line 244, in fetch_command
    settings.INSTALLED_APPS
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/conf/__init__.py", line 82, in __getattr__
    self._setup(name)
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/conf/__init__.py", line 69, in _setup
    self._wrapped = Settings(settings_module)
  File "/.../ParkAPI2/venv/lib/python3.9/site-packages/django/conf/__init__.py", line 170, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/.../ParkAPI2/web/park_api/settings.py", line 14, in <module>
    from decouple import config
ImportError: cannot import name 'config' from 'decouple' (/.../ParkAPI2/venv/lib/python3.9/site-packages/decouple/__init__.py)

Do you have any idea what I'm missing?

@defgsus
Copy link
Collaborator

defgsus commented Feb 25, 2022

Okay..., what might be the problem here?
The requirements.txt lists python-decouple which is the correct package. Also all versions are pinned so when installing the env with pip install -r requirements.txt it should be more or less the same on all systems.

For the .env parameters: The .env.example file contains links to the django documentation to explain what the django-specific settings are.

Just from outside i'd argue you did not follow the steps in the README exactly (:wink:)

I did a clean install just to test it. Apart from the wrong version number for jsonschema (which we discussed earlier) all seems fine. i fixed the requirements.txt file.

Some scrapers have problems though (heilbronn, hanau, hamburg, and frankfurt still!).

@jklmnn
Copy link
Collaborator Author

jklmnn commented Feb 27, 2022

Okay, I just checked what is installed and the error is really weird. I actually have python-decouple installed and I can import it in a python shell. I even added the import statement to manage.py and that doesn't generate an error. It seems that specifically the import in ParkAPI2/web/scrapers/builtin/new/bahn.py generates this error however it's not clear to me why. I also checked the interpreter and it's using the correct one (I'm working in a virtual env). Which python version do you use?

@defgsus
Copy link
Collaborator

defgsus commented Feb 28, 2022

Weird indeed! I'm using python 3.8 locally. The only thing i can imagine is that pa_scrape is calling the wrong python executable. There is a new option to show the executed commands:

If calling ./manage.py pa_scrape scrape -p bahn -v 2 it displays:

running .../ParkAPI2/env/bin/python3.8 scraper.py scrape --pools bahn in directory .../ParkAPI2/web/scrapers/builtin

which is the correct enved python in my case. It's determined in park_api/management/commands/pa_scrape.py:159 using sys.executable.

My env was created with virtualenv -p python3.8 env (it's an older ubuntu where i have to state python3 explicitly).

Generally, without calling source env/bin/activate, i'm still able to use the environment (e.g, from a cronjob) via
ParkAPI2/web$ ../env/bin/python manage.py pa_scrape scrape -p bahn -v 2 and it works.

Can you check that at least the raw scraper is working (without the subprocess inside pa_scrape.py), e.g. with:
ParkAPI2/web/scrapers/builtin$ python scraper.py scrape -p bahn

@jklmnn
Copy link
Collaborator Author

jklmnn commented Feb 28, 2022

Okay, I just tested your command and it's indeed calling /usr/bin/python3.9. I'll take a look why that happens. I just tested which python python3 python3.9 and all of them point to my virtual env.

@defgsus
Copy link
Collaborator

defgsus commented Mar 1, 2022

I've started a github actions test (PR #8) and it can reproduce the problem.

The wrong executable thing has been reported a while ago

Seemed to be version specific at the time but i tested python 3.6 through 3.10 on github and they all behave equally awful.

(of course, the test passes nicely when not using a virtualenv)

UPDTE:
https://bugzilla.mozilla.org/show_bug.cgi?id=1436777 kind of holds the answer - simply don't trust sys.executable. I'll try to work around it.

defgsus added a commit that referenced this issue Mar 1, 2022
defgsus added a commit that referenced this issue Mar 1, 2022
* start ghactions tests

(this is very likely not going to work)

* fix typo

* change the file

* install required debian packages

* fix apt-get command

* remove this strange package

* add python 3.8, 3.9, 3.10 to test matrix

:thumbsup: github, this works!

* move tests into virtual env

(because that's how it's usually run on a server)

* pa_scrape list command fails with exit-code 1 when list could not be retrieved

* just test python 3.6

* use sys.prefix before trying sys.executable to find venv python executable (#3)

* try "which python" command before attempting to use the useless sys.prefix/executable (#3)

* test against python 3.8 - 3.10
@defgsus
Copy link
Collaborator

defgsus commented Mar 1, 2022

Actually fell back to calling "which python" via subprocess. Very strange! sys.prefix did not work either.

Anyhow. The master branch is ready to be tried again on your server. And we have a git workflow now!

(gh actions is f*cking cool. i hope the free access will remain for a long time)

@jklmnn
Copy link
Collaborator Author

jklmnn commented May 19, 2023

After setting the project up twice now for testing by only using the README the only issue I encountered is fixed in #15. For the time being I don't see any deal breakers so I think we can close this issue.

@jklmnn jklmnn closed this as completed May 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants