In this document, you'll find information and instructions about my solution to the data challenge.
This is the structure of the project.
.
├── .dockerignore
├── .env.example
├── .python-version
├── .sqlfluff
├── .sqlfluffignore
├── Dockerfile
├── Makefile
├── README.md
├── data-challenge_sfrpt
│ ├── README.md
│ ├── crimes.sql
│ ├── loader.py
│ ├── queries.sql
│ ├── sf_crime_reports.jsonl
│ ├── sf_police_data_dictionary.pdf
│ └── view.sql
├── docker-compose.yaml
├── mypy.ini
├── poetry.lock
├── pyproject.toml
└── scripts
└── postgres_init.sh
I've placed the queries directly inside the data-challenge_sfrpt
directory.
This solution is containerized, so you'll need to install docker and docker-compose.
Also, it's recommended to have a desktop SQL client like DBeaver.
Let's dive into the setup process.
Open a shell in your machine, and navigate to this directory. Then run:
make generate-dotenv
This will generate the .env
file. Please, go ahead and open it! It contains all the necessary environment variables. If you want to modify some values, just take into account that this may break some things.
Run:
make build
This will build all the required images.
Run:
make up
This will create a PostgreSQL database, the crimes
table, and the load-data
service that populates the table with the provided information.
Open DBeaver, and set up the connection to the database. If you didn't modified the .env
file, you can use these credentials:
- User:
latch
- Password:
latch
- Host:
localhost
- Port:
5440
- DB:
sf_crime_reports
Then, please open the queries.sql
and view.sql
files and run queries in DBeaver to verify the results.
If you don't have DBeaver, you can run the queries from PostgreSQL's terminal with psql. To do this, please run:
make execute-sql
Then you can run the queries from the terminal.
Run:
make down
Run:
make help
I've used poetry to manage the project's dependencies. If you want to install it in your local machine, please run:
make install-poetry
And then run:
make install-project
Then you'll have all the dependencies installed, and a virtual environment created in this very directory. This is useful, for example, if you're using VS Code and want to explore the code. Also, you might want to use pyenv to install Python 3.10.12.
All the code in this project has been linted and formatted with these tools:
You'll see that I've used pandas and sqlalchemy to easily handle the data population in the crimes
table, as there were no restrictions to use it.
You'll notice that I've created some indexes in the crimes
table. These indexes are useful for improving query performance, and the only reason I've created them is to show my understanding of how indexes work.