-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling text fragments in the first few rows of a CSV file #29
Comments
Hi @RahulSinghYYC, thanks for your question. This depends a bit on whether you're reading the file as a list of list or as a dataframe. If you're using the Hope this helps, please let me know whether this answers your question. |
Hi @GjjvdBurg , Thank you for your reply but we get CSV files with multiple lines of text fragments so there is no way of telling how many rows we can skip, it's very dynamic, I am hoping to implement where it can be done by auto-detection. |
Hi @RahulSinghYYC, CleverCSV doesn't currently have support for detecting the table area automatically. I know there is some research on this problem (see, e.g. hypoparsr and Pytheas), but there are no ready-to-use Python solutions that I'm aware of. This is certainly a feature that would not be out of place in CleverCSV, so I suggest we leave this issue open so I/we can hopefully come back to it in the future. Thanks for letting me know there's interest in this :) |
@RahulSinghYYC For a preliminary workaround, you might want to consider reading the file and trying to determine the first line of the table. This could e.g. be a search for multiple occurrences of the column delimiter (I could envision using regex here) or to look for values that are certainly part of our table (like header names etc) to return the first line number of data. This number can then be used with the methods @GjjvdBurg suggested. But of course, of of this highly depends on your csv data. Hope I could help! |
Thanks for offering a suggestion @lcnittl, very nice of you to help! 👍 Just to offer another work-around: one of the main approaches that CleverCSV takes in detecting the dialect is looking at what we call "patterns" of row lengths. You can probably assume that the text before (and after) the table is not formatted with the same number of delimiters as the table itself. If this is the case, then you can use the row patterns to guess at the start and end of the table. Take, for example, this dataset of CO2 measurements, which has text before and after the table. We could extract the row patterns as follows: >>> from clevercsv import Sniffer
>>> from clevercsv.detect_pattern import make_abstraction
>>> from urllib.request import urlopen
>>>
>>> # get the data
>>> req = urlopen('https://cdiac.ess-dive.lbl.gov/ftp/trends/co2/maunaloa.co2')
>>> data = req.read()
>>>
>>> # detect the dialect
>>> dialect = Sniffer().sniff(data)
>>>
>>> # Compute the row patterns
>>> A = make_abstraction(data, dialect)
>>> row_patterns = A.split('R') This gives (abbreviating) |
Thank you @GjjvdBurg and @lcnittl for your advice and recommendation , I have been banging my head on this issue for a quite long time with no success , and its a big challenge for us to solve as we get lots of csv files with text fragments on top and bottom from various systems that we have no control over pattern. I will try your @GjjvdBurg recommendation and see how far I can get. Thanks |
Hi
I am trying to solve an issue in CSV files where there are explanations and text fragments in the first few rows of a table. How can I handle this issue using CleverCSV.
Any help would be greatly appreciated.
Thanks
The text was updated successfully, but these errors were encountered: