db_engines_ranking_table_crawling

Crawl data from DB-Engines and automatically merge new changes into my manually labeled datasets wherever possible.

1. Crawling ranking table

Crawl the DBMS ranking data from DB-Engines with the beautifulsoup package. The result is saved as ranking_crawling_202211_raw.csv.
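As a rough illustration of this step, the sketch below parses a ranking table with beautifulsoup and serializes it as CSV. It runs on an inline HTML sample rather than the live site: the table id, the column layout, and the sample rows are assumptions made up for this example, and the helper names parse_ranking_table and rows_to_csv are hypothetical, not taken from this repository.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded ranking page; the real DB-Engines
# markup is not described in this README and will differ.
SAMPLE_HTML = """
<table id="ranking">
  <tr><th>Rank</th><th>DBMS</th><th>Score</th></tr>
  <tr><td>1</td><td><a href="/en/system/ExampleDB">ExampleDB</a></td><td>100.5</td></tr>
  <tr><td>2</td><td><a href="/en/system/SampleStore">SampleStore</a></td><td>42.0</td></tr>
</table>
"""


def parse_ranking_table(html):
    """Return one dict per data row, including the per-DBMS detail link."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="ranking")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        rank_td, dbms_td, score_td = tr.find_all("td")
        link = dbms_td.find("a")
        rows.append({
            "Rank": int(rank_td.get_text(strip=True)),
            "DBMS": link.get_text(strip=True),
            "DBMS_insitelink": link["href"],
            "Score": float(score_td.get_text(strip=True)),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; write this text to a file to persist it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


ranking_rows = parse_ranking_table(SAMPLE_HTML)
ranking_csv = rows_to_csv(ranking_rows)
```

Keeping the link to each DBMS page next to its name is what lets the next step look up the detailed information for every ranked system.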

2. Crawling DBMS information

Crawl the detailed information for each DBMS from DB-Engines by following the DBMS_insitelink values collected in the step "Crawling ranking table".

3. Join ranking_table and dbms_info on 'DBMS'

Join ranking_table and dbms_info on the 'DBMS' column of ranking_table and the 'Name' column of dbms_info. After the join, the key column is named 'DBMS'. By default, use_cols_ranking_table = None keeps all fields of ranking_table, and use_cols_dbms_infos = ["Developer", "Name", "Description", "Initial release", "Current release", "License", "Cloud-based only"] keeps only a subset of dbms_info.
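The join can be sketched with pandas as follows. This is a minimal illustration, not the repository's code: the function name join_ranking_and_info is hypothetical, the left join (how="left") is an assumption, and the two frames hold placeholder values rather than real DB-Engines data.

```python
import pandas as pd

# Toy frames; values are placeholders, not real DB-Engines data.
ranking_table = pd.DataFrame({
    "DBMS": ["ExampleDB", "SampleStore"],
    "Score": [100.5, 42.0],
})
dbms_info = pd.DataFrame({
    "Name": ["ExampleDB", "SampleStore"],
    "Developer": ["Example Corp", "Sample Org"],
    "License": ["Open Source", "commercial"],
    "Website": ["example.invalid", "sample.invalid"],
})


def join_ranking_and_info(ranking_table, dbms_info,
                          use_cols_ranking_table=None,
                          use_cols_dbms_infos=None):
    """Join on ranking_table['DBMS'] == dbms_info['Name'], keyed as 'DBMS'."""
    # None means "use every column", mirroring the defaults described above.
    left = (ranking_table if use_cols_ranking_table is None
            else ranking_table[use_cols_ranking_table])
    right = (dbms_info if use_cols_dbms_infos is None
             else dbms_info[use_cols_dbms_infos])
    # Renaming 'Name' to 'DBMS' first leaves a single key column after the merge.
    right = right.rename(columns={"Name": "DBMS"})
    # Assumption: a left join, so every ranked DBMS is kept even without info.
    return left.merge(right, on="DBMS", how="left")


ranking_table_dbms_info = join_ranking_and_info(
    ranking_table, dbms_info,
    use_cols_dbms_infos=["Developer", "Name", "License"],
)
```

Note that the column list passed for dbms_info has to include "Name", since it is the join key on that side.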

4. Recalc ranking_table_dbms_info

The table produced by joining ranking_table and dbms_info is called ranking_table_dbms_info. Some of its fields need to be re-calculated into other data formats. By default, recalc_cols = ["Initial release", "Current release", "License", "Cloud-based only"], and a corresponding function must be implemented in class RecalcFuncPool() for each re-calculated field.
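One way such a function pool could be organized is sketched below. It reuses the class name RecalcFuncPool from the text, but everything else is an assumption for illustration: the rule that maps a column name to a method name, the two conversions shown, and the helper names method_name_for and recalc. The repository's actual class may look quite different.

```python
import re


class RecalcFuncPool:
    """Hypothetical sketch: one static method per re-calculated column."""

    @staticmethod
    def initial_release(value):
        # Keep the first four-digit year, e.g. "1995, beta" -> 1995.
        match = re.search(r"\d{4}", value or "")
        return int(match.group()) if match else None

    @staticmethod
    def cloud_based_only(value):
        # Map a yes/no style string to a boolean; anything else becomes None.
        text = (value or "").strip().lower()
        return {"yes": True, "no": False}.get(text)


def method_name_for(column):
    """'Cloud-based only' -> 'cloud_based_only' (naming rule is an assumption)."""
    return re.sub(r"\W+", "_", column.strip().lower())


def recalc(records, recalc_cols, pool=RecalcFuncPool):
    """Apply the matching pool function to each listed column of each record."""
    out = []
    for record in records:
        updated = dict(record)
        for col in recalc_cols:
            func = getattr(pool, method_name_for(col), None)
            if func is None:
                # Fail loudly when a column has no conversion implemented.
                raise NotImplementedError(f"no recalc function for {col!r}")
            updated[col] = func(record.get(col))
        out.append(updated)
    return out


# Placeholder record, not real DB-Engines data.
records = [{"DBMS": "ExampleDB", "Initial release": "1995, beta", "Cloud-based only": "no"}]
recalculated = recalc(records, ["Initial release", "Cloud-based only"])
```

Looking functions up by column name means that adding a column to recalc_cols without writing its function raises an error instead of silently leaving the raw value in place.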

5. Reuse existing tagging info

Reuse the existing tagging information from the manually labeled DB_EngRank_tophalf_githubprj_summary.csv. For each record, keep the manually labeled items and update the scores and ranks; insert any records that are new. Results are saved as ranking_crawling_202211_automerged.csv, along with a by-product, df_category_labels_updated.csv, which shows the mapping between the values of 'category_label' and 'Multi_model_info'.
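The merge behaviour described here can be approximated with a pandas left join from the fresh crawl onto the labeled data. The sketch below is only an illustration under assumptions: the function name automerge is hypothetical, the frames hold placeholder values, 'category_label' stands in for all manually labeled columns, and records that no longer appear in the new crawl are dropped, which may not match what the repository does.

```python
import pandas as pd

# Previously labeled data: 'category_label' is the manual column to preserve.
labeled = pd.DataFrame({
    "DBMS": ["ExampleDB", "OldStore"],
    "Rank": [2, 5],
    "Score": [90.0, 10.0],
    "category_label": ["relational", "key-value"],
})
# Freshly crawled data with new scores and ranks (placeholder values).
crawled = pd.DataFrame({
    "DBMS": ["ExampleDB", "NewGraph"],
    "Rank": [1, 3],
    "Score": [100.5, 55.0],
})


def automerge(labeled, crawled, key="DBMS", manual_cols=("category_label",)):
    """Carry manual labels over to the new crawl; new records get empty labels."""
    manual = labeled[[key, *manual_cols]]
    # Left join from the crawl: crawled ranks and scores win, labels are
    # carried over, and DBMS that are new in the crawl get NaN labels.
    return crawled.merge(manual, on=key, how="left")


merged = automerge(labeled, crawled)
```

Rows whose label comes back empty are the newly inserted records that still need manual tagging.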

Introduction to the crawling task

Target: A survey of the open-source database software ecosystem based on github_log
