I have a CSV file and I want to check whether the first row contains only strings (i.e. a header). I'm trying to avoid extras like pandas. I'm thinking of an if statement along the lines of "if row[0] is a string, print this is a CSV", but I don't really know how to do that. Any suggestions?
- That really depends on how you define a 'header'. – TurtleIzzy, Oct 22, 2016 at 14:49
- Thanks for your suggestions everyone, I think I've found a way to do it. – plshelp, Oct 22, 2016 at 15:32
- @plshelp can you share how you do it? – Inês Martins, Dec 11, 2017 at 13:00
- stackoverflow.com/users/4787949/in%c3%aas-martins – SunnyAk, Jan 5, 2020 at 3:49
8 Answers
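Python has a built-in csv module that could help, e.g.:
    import csv

    # Open in text mode; newline='' is what the csv module expects in Python 3.
    with open('example.csv', newline='') as csvfile:
        sniffer = csv.Sniffer()
        # Guess from a sample of the file whether the first row is a header.
        has_header = sniffer.has_header(csvfile.read(2048))
        # Rewind so later reads start from the top of the file.
        csvfile.seek(0)
        # ...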
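As a variation on the snippet above, the sample can be built from the first few lines rather than a fixed number of characters. This is only a minimal sketch: the function name `has_header_from_lines` and the `n_lines` and `encoding` parameters are made up for illustration, and treating an empty or unsniffable file as "no header" is an assumption, not something `csv.Sniffer` decides for you.

```python
import csv
from itertools import islice


def has_header_from_lines(path, n_lines=5, encoding="utf-8"):
    """Guess whether the CSV at `path` starts with a header row.

    Hypothetical helper: samples the first `n_lines` lines instead of
    a fixed number of characters.
    """
    with open(path, newline="", encoding=encoding) as csvfile:
        # islice stops early if the file has fewer than n_lines lines.
        sample = "".join(islice(csvfile, n_lines))
    if not sample.strip():
        # Assumption: an empty file is treated as having no header.
        return False
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # The sniffer could not work out a delimiter for this sample.
        return False
```

Note that reading by line can cut a quoted field that spans several lines, so for files like that a larger character sample may be the safer choice.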
Comments
- 2048 but not any other number?
- 2048 is an entirely arbitrary number. It just needs to be big enough to read in at least two or three CSV rows. You could instead read in a few lines to a string and pass that to has_header.
Here is a function I use with pandas to analyze whether header should be set to 'infer' or None:
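    import pandas as pd

    def identify_header(path, n=5, th=0.9):
        # Read the same small sample twice: once with row 0 as a header, once as data.
        df1 = pd.read_csv(path, header='infer', nrows=n)
        df2 = pd.read_csv(path, header=None, nrows=n)
        # Fraction of columns whose inferred dtype is the same in both reads.
        sim = (df1.dtypes.values == df2.dtypes.values).mean()
        return 'infer' if sim < th else None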
Based on a small sample, the function checks the similarity of dtypes with and without a header row. If the dtypes match for a certain percentage of columns, it is assumed that there is no header present. I found a threshold of 0.9 to work well for my use cases. This function is also fairly fast as it only reads a small sample of the csv file.
I'd do something like this:
is_header = not any(cell.isdigit() for cell in csv_table[0])
Given a CSV table csv_table, grab the top (zeroth) row. Iterate through the cells and check if they contain any pure digit strings. If so, it's not a header. Negate that with a not in front of the whole expression.
Results:
In [1]: not any(cell.isdigit() for cell in ['2','1'])
Out[1]: False
In [2]: not any(cell.isdigit() for cell in ['2','gravy'])
Out[2]: False
In [3]: not any(cell.isdigit() for cell in ['gravy','gravy'])
Out[3]: True
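One limitation of `isdigit()` is that it returns False for strings such as '3.14' or '-5', so a first row containing those would still be taken for a header. A possible variation, sketched here with hypothetical helper names (`looks_numeric`, `first_row_is_header`), tries `float()` instead. Treating an empty row as "not a header" is an assumption.

```python
def looks_numeric(cell):
    """Return True if `cell` parses as a float (e.g. '2', '-5', '3.14')."""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def first_row_is_header(row):
    """Treat the row as a header only if it is non-empty and no cell is numeric."""
    return bool(row) and not any(looks_numeric(cell) for cell in row)
```

Be aware that `float()` also accepts strings like 'nan' and 'inf', so a column actually named "nan" would be counted as numeric by this check.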
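For files that are not necessarily in '.csv' format, this function, taken from the answer to "built-in function in Python to check Header in a Text file", is very useful. It treats the file as having a header when its first character cannot be the start of a number:
    def check_header(filename):
        with open(filename) as f:
            # Only the first character of the file is inspected.
            first = f.read(1)
        # A leading digit, '.' or '-' suggests the file starts with numeric data.
        return first not in '.-0123456789'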
Answer by: https://stackoverflow.com/users/908494/abarnert
Post link: https://stackoverflow.com/a/15671103/7763184
I faced exactly the same problem, a wrong result from sniffer.has_header, and even wrote a very simple checker that worked in my case:
has_header = ''.join(next(some_csv_reader)).isalpha()
I knew it wasn't perfect, but it seemed to work: join the cells of the first row and check whether the result is alphabetic. Then I put it in my method and it failed, and then I saw the "light".
The trouble was not with has_header but with my own code. I also wanted to detect the delimiter before parsing the actual .csv, and all the sniffing has a "cost": each read of a sample advances the file position.
So for has_header to work as it should, make sure you have reset the file position before using it.
In my case, my method is:
    def _get_data(self, filename):
        with open(filename, 'rt', newline='') as csvfile:
            # Sniff the delimiter; this read moves the file position forward.
            dialect = csv.Sniffer().sniff(csvfile.read(2048))
            training_data = csv.reader(csvfile, delimiter=dialect.delimiter)
            # Rewind so has_header sees the file from the start.
            csvfile.seek(0)
            has_header = csv.Sniffer().has_header(csvfile.read(2048))
            #has_header = ''.join(next(training_data)).isalpha()
            # Rewind again before the rows are actually read.
            csvfile.seek(0)
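Another way to avoid the repeated reads is to take the sample once and hand the same string to both sniff and has_header, so only one seek(0) is needed. The following is a rough, self-contained sketch of that idea, not the method above: the name `read_csv_rows`, the `sample_size` parameter, and the choice to return a `(header, rows)` tuple are all assumptions made for illustration.

```python
import csv


def read_csv_rows(filename, sample_size=2048):
    """Return (header, rows); header is None when no header row is detected."""
    with open(filename, newline="") as csvfile:
        # Read the sample once and reuse it for both checks.
        sample = csvfile.read(sample_size)
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample)
        has_header = sniffer.has_header(sample)
        # Rewind once, right before the real parse.
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        header = next(reader) if has_header else None
        rows = list(reader)
    return header, rows
```

Both sniff and has_header raise csv.Error when the delimiter cannot be determined (for example on an empty file), so callers may want to wrap the call in a try/except.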
An updated version of ChrisD's answer with fallback for empty files:
    with open(filename, "r") as f:
        try:
            has_headings = csv.Sniffer().has_header(f.read(1024))
        except csv.Error:
            # The file seems to be empty
            has_headings = False
https://docs.python.org/3/library/csv.html#csv.Sniffer.has_header