Handling text fragments in the first few rows of a CSV file #29

RahulSinghYYC · 2020-11-05T18:00:42Z

Hi
I am trying to solve an issue with CSV files that have explanations and text fragments in the first few rows, before the table starts. How can I handle this using CleverCSV?

Any help would be greatly appreciated.

Thanks

GjjvdBurg · 2020-11-05T18:11:18Z

Hi @RahulSinghYYC, thanks for your question. This depends a bit on whether you're reading the file as a list of lists or as a dataframe. If you're using the read_table function, then you can simply remove the unnecessary rows from the returned list. If you're using the read_dataframe function, then you can pass the skiprows argument to it, as this will be passed on to pandas.read_csv.

Hope this helps, please let me know whether this answers your question.

RahulSinghYYC · 2020-11-05T19:20:37Z

Hi @GjjvdBurg, thank you for your reply. We get CSV files with a varying number of lines of text fragments, so there is no way of telling in advance how many rows to skip. I am hoping to implement something that detects this automatically.
Any help is greatly appreciated.
Thanks

GjjvdBurg · 2020-11-06T16:23:49Z

Hi @RahulSinghYYC, CleverCSV doesn't currently have support for detecting the table area automatically. I know there is some research on this problem (see, e.g. hypoparsr and Pytheas), but there are no ready-to-use Python solutions that I'm aware of.

This is certainly a feature that would not be out of place in CleverCSV, so I suggest we leave this issue open so I/we can hopefully come back to it in the future. Thanks for letting me know there's interest in this :)

lcnittl · 2020-11-09T22:26:02Z

@RahulSinghYYC For a preliminary workaround, you might want to consider reading the file and trying to determine the first line of the table yourself. This could, e.g., be a search for multiple occurrences of the column delimiter (I could envision using a regex here), or a search for values that are certainly part of your table (like header names), returning the line number of the first table row. This number can then be used with the methods @GjjvdBurg suggested. But of course, all of this highly depends on your CSV data.
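A minimal sketch of the idea described above, using only the standard library. The function name `find_table_start`, its parameters, and the sample lines are hypothetical and made up for illustration; they are not part of CleverCSV. The sketch assumes the delimiter is already known and that explanatory lines contain fewer delimiters than table rows, which may not hold for prose that contains commas.

```python
import re


def find_table_start(lines, delimiter=",", min_delims=2, header_names=()):
    """Return the index of the first line that looks like a table row, or None.

    If header_names is given, the first line containing all of those names
    is returned. Otherwise the first line with at least min_delims
    occurrences of the delimiter is returned.
    """
    delim_re = re.compile(re.escape(delimiter))
    for i, line in enumerate(lines):
        if header_names:
            if all(name in line for name in header_names):
                return i
        elif len(delim_re.findall(line)) >= min_delims:
            return i
    return None


# Hypothetical file content: three lines of text, then the table.
sample = [
    "Monthly report generated by system X",
    "Contact: someone for questions",
    "",
    "year,month,value",
    "2020,1,3.5",
    "2020,2,3.7",
]

start = find_table_start(sample)
by_header = find_table_start(sample, header_names=("year", "value"))
```

The returned index could then be used as the skiprows value, or to slice the list of rows, as suggested earlier in the thread.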

Hope I could help!

GjjvdBurg · 2020-11-09T23:11:15Z

Thanks for offering a suggestion @lcnittl, very nice of you to help! 👍

Just to offer another work-around: one of the main approaches that CleverCSV takes in detecting the dialect is looking at what we call "patterns" of row lengths. You can probably assume that the text before (and after) the table is not formatted with the same number of delimiters as the table itself. If this is the case, then you can use the row patterns to guess at the start and end of the table.

Take, for example, this dataset of CO2 measurements, which has text before and after the table. We could extract the row patterns as follows:

>>> from clevercsv import Sniffer
>>> from clevercsv.detect_pattern import make_abstraction
>>> from urllib.request import urlopen
>>>
>>> # get the data
>>> req = urlopen('https://cdiac.ess-dive.lbl.gov/ftp/trends/co2/maunaloa.co2')
>>> data = req.read().decode('utf-8')  # decode: the sniffer expects text, not bytes
>>>
>>> # detect the dialect
>>> dialect = Sniffer().sniff(data)
>>> 
>>> # Compute the row patterns
>>> A = make_abstraction(data, dialect)
>>> row_patterns = A.split('R')

This gives (abbreviating) row_patterns = ['C'] * 13 + ['CDCDCDCDCDCDCDCDCDCDCDCDCDC'] * 53 + ['C'] * 3, so we can be quite sure that we can skip the first 13 rows and remove the last 3. Note that it's not in general guaranteed that the longest/most frequent row pattern is that of the table, but this may be the case in your scenario.
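To turn the row patterns into row numbers, one option is to look for the longest run of consecutive identical patterns that contain at least one delimiter. The sketch below is only an illustration of that heuristic: the helper name `guess_table_span` is hypothetical and not part of CleverCSV, and it assumes that 'D' marks a delimiter in the pattern string and that the table is the longest such run, which (as noted above) is not guaranteed in general.

```python
from itertools import groupby


def guess_table_span(row_patterns):
    """Return (start, end) of the longest run of identical multi-cell patterns.

    start is inclusive and end is exclusive. Patterns without a 'D'
    (assumed to mean "delimiter") are treated as text lines and ignored.
    Returns None if no pattern contains a delimiter.
    """
    best = None
    pos = 0
    for pattern, group in groupby(row_patterns):
        length = sum(1 for _ in group)
        if "D" in pattern and (best is None or length > best[1] - best[0]):
            best = (pos, pos + length)
        pos += length
    return best


# The abbreviated patterns reported above for the CO2 dataset.
row_patterns = ["C"] * 13 + ["CDCDCDCDCDCDCDCDCDCDCDCDCDC"] * 53 + ["C"] * 3

start, end = guess_table_span(row_patterns)
rows_to_skip = start                        # leading text rows
rows_to_drop = len(row_patterns) - end      # trailing text rows
```

For the patterns above this yields 13 rows to skip and 3 rows to drop at the end. Assuming each pattern corresponds to one row of the parsed file, the same indices could be used to slice the output of read_table, or as the skiprows value for read_dataframe.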

RahulSinghYYC · 2020-12-02T22:41:59Z

Thank you @GjjvdBurg and @lcnittl for your advice and recommendations. I have been banging my head against this issue for quite a long time with no success. It's a big challenge for us, as we get lots of CSV files with text fragments at the top and bottom from various systems whose format we have no control over.

@GjjvdBurg, I will try your recommendation and see how far I can get.

Thanks

RahulSinghYYC changed the title ~~Handling Extra Text before header starts~~ Handling text fragments in the first few rows of a CSV file Nov 5, 2020
