Files w/ quoted values that have commas throw exception #38

Open
greghall76 opened this issue Aug 24, 2023 · 0 comments

Describe the bug
The file contains a quoted number with embedded commas, e.g. "2,126,000,000".
This throws off the index alignment between the types extracted from the header and those extracted from the data rows.

  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
    schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
    return [p.get() for p in results]
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
    return [p.get() for p in results]
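A minimal sketch of the suspected cause, using the first data row from the example file. It assumes the library splits each line on the separator without honoring quotes; that is a guess based on the symptom, not something confirmed from the library source. The helper names `naive_fields` and `quoted_fields` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Header and first data row copied from the example file in this report.
header = ('"id","country","year","sex","age","suicides_no","population",'
          '"country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"')
row = ('0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,'
       '"2,156,624,900",796,"Generation X"')


def naive_fields(line, sep=","):
    """Split on every separator, ignoring quotes (the assumed failure mode)."""
    return line.split(sep)


def quoted_fields(line, sep=","):
    """Split with the standard csv module, which keeps quoted commas together."""
    return next(csv.reader(io.StringIO(line), delimiter=sep, quotechar='"'))


naive_header_count = len(naive_fields(header))
naive_row_count = len(naive_fields(row))
parsed_row = quoted_fields(row)

print("naive split:", naive_header_count, "header fields vs", naive_row_count, "row fields")
print("csv.reader: ", len(quoted_fields(header)), "header fields vs", len(parsed_row), "row fields")
```

With a quote-unaware split, the row yields more fields than the header, which would explain the index misalignment; `csv.reader` yields the same number of fields for both.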

To Reproduce
Steps to reproduce the behavior:

  1. See example below...
    "id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
    0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
    1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
    2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
    3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
    4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
    5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"

  2. See code below...
    from multiprocessing import freeze_support, Process
    from csv_schema_inference import csv_schema_inference

    def main():
        # If the inferred data type is INTEGER and FLOAT is also present in the results, the result will be FLOAT.
        conditions = {"INTEGER": "FLOAT"}
        pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"

        csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
        aprox_schema = csv_infer.run_inference(pathfile)
        csv_infer.pretty(aprox_schema)

    if __name__ == '__main__':
        freeze_support()
        Process(target=main).start()
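Until quoted commas are handled, one possible workaround is to rewrite the file with a separator that does not occur in the data and pass that separator through the `sep` argument shown above. The sketch below uses only the standard `csv` module; `rewrite_with_separator` is a hypothetical helper, and whether the library then infers the schema correctly from the rewritten file is an untested assumption.

```python
import csv


def rewrite_with_separator(src_path, dst_path, new_sep="\t"):
    """Copy a comma-separated file to dst_path using new_sep; return the row count.

    Quoted commas in the source are parsed by csv.reader, so they end up
    inside a single field of the output instead of splitting it.
    """
    rows = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=",", quotechar='"')
        writer = csv.writer(dst, delimiter=new_sep, quoting=csv.QUOTE_MINIMAL,
                            lineterminator="\n")
        for record in reader:
            # Refuse to continue if the new separator already appears in the data,
            # since that would reintroduce the same ambiguity.
            if any(new_sep in field for field in record):
                raise ValueError(f"separator {new_sep!r} occurs in row {rows}")
            writer.writerow(record)
            rows += 1
    return rows
```

The rewritten file could then be passed to `run_inference` with `sep="\t"` in place of `sep=","`.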

Expected behavior
Schema inference should complete and produce a schema instead of raising an exception,
e.g.
0
    name: Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
    type: STRING
    nullable: False
....

Desktop (please complete the following information):

  • OS: Ubuntu 22.04 and Python 3.10.12