DOC: How to read data into pandas / dask / xarray #145

westurner · 2019-09-12T09:47:32Z

Is there a good reference, or a ckanapi function, for reading datasets from a CKAN instance into pandas, dask, and/or xarray?

Pandas

pandas.read_json("https://url.to/dataset.json")
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#json
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#api
- API: Fix CKAN API Interface sinhrks/pyopendata#3
- pandas-datareader, pandaSDMX, fredapi, quandl (Toronto)
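
For a CKAN DataStore resource, the action API returns a JSON envelope rather than a flat table, so pointing pandas.read_json straight at a datastore_search URL typically does not yield one row per record. Below is a minimal sketch of one way to unwrap it, assuming the usual {"success": ..., "result": {"records": [...]}} response shape. The helper names datastore_url, extract_records and to_dataframe are made up for illustration and are not part of ckanapi or pandas.

```python
import json
from urllib.parse import urlencode


def datastore_url(base, resource_id, limit=100, offset=0):
    """Build a datastore_search URL; `base` is the root URL of the CKAN site."""
    query = urlencode({"resource_id": resource_id, "limit": limit, "offset": offset})
    return f"{base.rstrip('/')}/api/3/action/datastore_search?{query}"


def extract_records(payload):
    """Return the list of row dicts from a datastore_search response."""
    if not payload.get("success"):
        raise ValueError(f"CKAN request failed: {payload.get('error')}")
    return payload["result"]["records"]


def to_dataframe(records):
    """Convert row dicts to a DataFrame.

    pandas is imported here so the two helpers above stay dependency-free.
    """
    import pandas as pd

    return pd.DataFrame.from_records(records)


# Hand-written sample standing in for a real API response
sample = json.loads(
    '{"success": true, "result": {"records": '
    '[{"_id": 1, "x": "a"}, {"_id": 2, "x": "b"}]}}'
)
rows = extract_records(sample)
```

Against a live instance, the same helpers would be chained as to_dataframe(extract_records(requests.get(url).json())), with url built by datastore_url.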

Dask

dask.dataframe.read_json("https://url.to/dataset.json")
https://docs.dask.org/en/latest/remote-data-services.html
- https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching , arrow
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_json
https://docs.dask.org/en/latest/dataframe.html#common-uses-and-anti-uses
https://docs.dask.org/en/latest/bag.html
https://examples.dask.org/bag.html
https://examples.dask.org/applications/json-data-on-the-web.html
https://ml.dask.org/ (dask + {scikit-learn, TensorFlow, XGBoost})

xarray

Caching

https://github.com/reclosedev/requests-cache (sqlite)
- https://pandas-datareader.readthedocs.io/en/latest/cache.html
https://github.com/ionrock/cachecontrol (dict, file, redis, sqlite)
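
The libraries above cache HTTP responses so that repeated reads of the same dataset URL do not hit the server again. The sketch below illustrates that idea with the standard library only. It is not the requests-cache or CacheControl API; SqliteCache, fake_fetch and the one-hour expiry are hypothetical names and values chosen for the example.

```python
import sqlite3
import time


class SqliteCache:
    """Tiny URL -> response-text cache backed by sqlite (illustrative only)."""

    def __init__(self, path=":memory:", expire_after=3600):
        self.db = sqlite3.connect(path)
        self.expire_after = expire_after  # seconds before an entry is stale
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, body TEXT, fetched REAL)"
        )

    def get(self, url, fetch):
        """Return the cached body, calling fetch(url) on a miss or stale entry."""
        row = self.db.execute(
            "SELECT body, fetched FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < self.expire_after:
            return row[0]
        body = fetch(url)
        self.db.execute(
            "INSERT OR REPLACE INTO cache (url, body, fetched) VALUES (?, ?, ?)",
            (url, body, time.time()),
        )
        self.db.commit()
        return body


calls = []


def fake_fetch(url):
    """Stand-in for a network call; records each invocation."""
    calls.append(url)
    return '{"success": true}'


cache = SqliteCache()
first = cache.get("https://url.to/dataset.json", fake_fetch)
second = cache.get("https://url.to/dataset.json", fake_fetch)  # served from sqlite
```

In real use, fetch would wrap something like requests.get(url).text, and the cached text could then be parsed and handed to pandas.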


kumaranil02 · 2021-12-03T11:14:40Z

I am trying to pull data from https://opendata.nhsbsa.net/dataset/english-prescribing-dataset-epd-with-snomed-code/resource/374ee7ac-fd8e-4c3f-b7a9-6ea27cc16d63.

The website provides an API for downloading large datasets. The dataset I am pulling has 17 million records.

API call to pull the data:

'https://opendata.nhsbsa.net/api/3/action/datastore_search?offset=0&resource_id=EPD_SNOMED_202109&limit=5000'

Below is the code I am running.

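import pandas as pd
import requests

url = 'https://opendata.nhsbsa.net/api/3/action/datastore_search'
limit = 70000  # rows requested per call; the server may return fewer
offset = 0     # offsets are zero-based, so the first page starts at 0, not 1

while True:
    # Sorting on _id (assumed to be the DataStore row id) keeps page order stable
    params = {'resource_id': 'EPD_SNOMED_202109', 'offset': offset,
              'limit': limit, 'sort': '_id'}
    r = requests.get(url, params=params).json()
    records = r['result']['records']
    if not records:
        break  # no more rows
    df = pd.DataFrame(records)
    # First page overwrites any old file and writes the header; later pages append
    first = offset == 0
    df.to_csv('data_pull.csv', mode='w' if first else 'a', header=first, index=False)
    offset += len(records)  # advance by what was actually returned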

The code above takes more than 3 hours and also returns duplicate rows. There are no duplicates in the source data.

Please provide a better answer to the question below:

https://stackoverflow.com/questions/70209859/web-scaping-in-python-for-large-data-set-from-api

Suggestions for a better library or process would be appreciated.
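
One possible way to cut the wall-clock time for a paged pull like this is to request several pages concurrently and de-duplicate on the row id afterwards. The sketch below shows only the paging logic, using the standard library. It assumes every row carries an _id field (as CKAN DataStore rows normally do) and that the server tolerates a few parallel requests. fetch_page is a hypothetical callable that would wrap a datastore_search request with a fixed sort order; here it is replaced by a fake so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def page_offsets(total, page_size):
    """Zero-based offsets covering `total` rows in steps of `page_size`."""
    return list(range(0, total, page_size))


def fetch_all(fetch_page, total, page_size, workers=4):
    """Fetch every page concurrently; return rows de-duplicated on `_id`."""
    offsets = page_offsets(total, page_size)
    seen, rows = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields pages in the same order as `offsets`
        pages = pool.map(lambda off: fetch_page(off, page_size), offsets)
        for page in pages:
            for row in page:
                if row["_id"] not in seen:
                    seen.add(row["_id"])
                    rows.append(row)
    return rows


def fake_fetch_page(offset, limit):
    """Stand-in for a real request: serves rows with _id 1..10."""
    return [{"_id": i} for i in range(offset + 1, min(offset + limit, 10) + 1)]


rows = fetch_all(fake_fetch_page, total=10, page_size=4)
```

For 17 million rows it would be more memory-friendly to write each page to disk as it arrives rather than collect everything in a list; the list is kept here only to keep the sketch short.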

westurner mentioned this issue Sep 12, 2019

API: Fix CKAN API Interface sinhrks/pyopendata#3

