15

I have a CSV file and I want to check whether the first row contains only strings (i.e. a header). I'm trying to avoid extras such as pandas. I'm thinking of using an if statement along the lines of "if row[0] is a string, print 'this is a CSV'", but I don't really know how to do that. Any suggestions?

4
  • 1
    That really depends on how you define a 'header' Commented Oct 22, 2016 at 14:49
  • 1
    Thanks for your suggestions everyone, I think I've found a way to do it. Commented Oct 22, 2016 at 15:32
  • @plshelp can you share how you do it? Commented Dec 11, 2017 at 13:00
  • stackoverflow.com/users/4787949/in%c3%aas-martins - Commented Jan 5, 2020 at 3:49

8 Answers

11

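Python has a built-in csv module that can help. For example:

import csv
# Python 3: open in text mode with newline='' (Python 2 used 'rb')
with open('example.csv', newline='') as csvfile:
    sniffer = csv.Sniffer()
    has_header = sniffer.has_header(csvfile.read(2048))
    csvfile.seek(0)  # rewind so later reads start at the first row
    # ...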
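As a comment on this answer notes, the sample only needs to cover a few rows, so it can be built from whole lines instead of a fixed byte count. The following is a minimal sketch of that idea, not a definitive implementation; the function name `sniff_header` and the `max_lines` parameter are hypothetical, and treating a `csv.Error` as "no header" is an assumption made for illustration.

```python
import csv
import io


def sniff_header(csvfile, max_lines=20):
    """Guess whether an open text-mode CSV file starts with a header row.

    Hypothetical helper: reads up to `max_lines` lines as the sample,
    then restores the original file position.
    """
    start = csvfile.tell()
    # readline() returns '' at end of file, so short files are handled too.
    sample = "".join(csvfile.readline() for _ in range(max_lines))
    csvfile.seek(start)  # leave the file where we found it
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # Assumption: if the sniffer cannot work out a dialect
        # (for example, the file is empty), report "no header".
        return False
```

With a real file, this would be called as `sniff_header(open('example.csv', newline=''))`. The result is still a heuristic guess, so it is worth checking against files you know.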

5 Comments

Thanks, it worked well for me. Can you please explain why you passed 2048 rather than any other number?
@AzharKhan 2048 is an entirely arbitrary number. It just needs to be big enough to read in at least two or three CSV rows. You could instead read in a few lines to a string and pass that to has_header.
sniffer.has_header always returns True... I tested several CSV files... :/
It works, but for a large file it takes too much time.
In Python 3, you might need to change 'rb' to newline='' (Python 3 is much stricter about bytes vs. strings, and changing to plain 'r' may assume a newline delimiter).
4

Here is a function I use with pandas to decide whether header should be set to 'infer' or None:

import pandas as pd

def identify_header(path, n=5, th=0.9):
    # Read the same small sample twice: once with a header, once without.
    df1 = pd.read_csv(path, header='infer', nrows=n)
    df2 = pd.read_csv(path, header=None, nrows=n)
    # Fraction of columns whose dtype is the same in both reads.
    sim = (df1.dtypes.values == df2.dtypes.values).mean()
    return 'infer' if sim < th else None

Based on a small sample, the function checks the similarity of dtypes with and without a header row. If the dtypes match for a certain percentage of columns, it is assumed that there is no header present. I found a threshold of 0.9 to work well for my use cases. This function is also fairly fast as it only reads a small sample of the csv file.

2 Comments

If the CSV files are big, this could be a problem.
@FoggyMindedGreenhorn Why? We don't read the entire file here.
3

I'd do something like this:

is_header = not any(cell.isdigit() for cell in csv_table[0])

Given a CSV table csv_table, grab the top (zeroth) row. Iterate through the cells and check if they contain any pure digit strings. If so, it's not a header. Negate that with a not in front of the whole expression.

Results:

In [1]: not any(cell.isdigit() for cell in ['2','1'])
Out[1]: False

In [2]: not any(cell.isdigit() for cell in ['2','gravy'])
Out[2]: False

In [3]: not any(cell.isdigit() for cell in ['gravy','gravy'])
Out[3]: True
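Note that `str.isdigit()` only matches unsigned integers, so cells such as '-5' or '3.14' would not count as numeric. Here is a minimal sketch of the same idea applied to a real file with the csv module, using `float()` as the numeric test. The names `is_number` and `first_row_is_header` are hypothetical, and the choice of `float()` is an assumption about what should count as a number.

```python
import csv
import io


def is_number(cell):
    """Return True if the cell parses as a float, e.g. '2', '-5', '3.14'.

    Caveat: float() also accepts strings such as 'nan' and 'inf'.
    """
    try:
        float(cell)
    except ValueError:
        return False
    return True


def first_row_is_header(csvfile):
    """Treat the first row as a header only if none of its cells is numeric."""
    first_row = next(csv.reader(csvfile), None)
    if not first_row:  # empty file or blank first line
        return False
    return not any(is_number(cell) for cell in first_row)
```

Like the one-liner above, this looks only at the first row, so a file whose first data row is all text would still be reported as having a header.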

Comments

2

For files that are not necessarily in '.csv' format, this is very useful:

built-in function in Python to check Header in a Text file

def check_header(filename):
    with open(filename) as f:
        first = f.read(1)
        return first not in '.-0123456789'

Answer by: https://stackoverflow.com/users/908494/abarnert

Post link: https://stackoverflow.com/a/15671103/7763184

Comments

0

I faced exactly the same problem with sniffer.has_header returning the wrong result, and even wrote a very simple checker that worked in my case:

    has_header = ''.join(next(some_csv_reader)).isalpha()

I knew it wasn't perfect, but it seemed to work: it was a simple drop-in replacement that checks whether the joined first row is alphabetic. Then I put it into my method and it failed. That is when I saw the light.
The trouble was not with has_header but with my own code. I also wanted to detect the delimiter before parsing the actual .csv file, and all that sniffing has a cost, because every read advances the position in the file.
So for has_header to work as it should, make sure you have reset the file position before calling it. In my case the method is:

def _get_data(self, filename):
    sniffer = csv.Sniffer()
    with open(filename, 'rt', newline='') as csvfile:
        # Detect the dialect; read() moves the file position forward.
        dialect = sniffer.sniff(csvfile.read(2048))
        training_data = csv.reader(csvfile, delimiter=dialect.delimiter)
        csvfile.seek(0)  # reset before sniffing for a header
        has_header = sniffer.has_header(csvfile.read(2048))
        # has_header = ''.join(next(training_data)).isalpha()
        csvfile.seek(0)  # reset again before parsing the rows
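One way to avoid repeated rewinding is to read a single sample, use it for both dialect and header detection, and rewind once before parsing. This is a rough sketch under that assumption rather than the method above; the name `read_csv_rows` and the `sample_size` parameter are hypothetical.

```python
import csv
import io


def read_csv_rows(csvfile, sample_size=2048):
    """Return (header, rows) from an open text-mode CSV file.

    `header` is None when the sniffer decides there is no header row.
    """
    sample = csvfile.read(sample_size)  # advances the file position
    csvfile.seek(0)                     # rewind so parsing starts at row one
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)     # raises csv.Error if undetectable
    has_header = sniffer.has_header(sample)
    reader = csv.reader(csvfile, dialect)
    header = next(reader) if has_header else None
    return header, list(reader)
```

Because both sniffing calls work on the in-memory sample, the file position is touched only by the initial read and the single seek(0).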

Comments

0

I think the best way to check this is simply to read the first line of the file and compare it with the header string you expect, without using any library.
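A minimal sketch of that approach, assuming you already know the column names you expect. The function name `starts_with_header` is hypothetical, and the case-insensitive, whitespace-trimmed comparison is an assumption added for illustration. It uses the csv module only to split the first line correctly when cells are quoted.

```python
import csv
import io


def starts_with_header(csvfile, expected_columns):
    """Return True if the first row matches the expected column names."""
    first_row = next(csv.reader(csvfile), [])  # [] for an empty file
    normalized = [cell.strip().lower() for cell in first_row]
    return normalized == [name.lower() for name in expected_columns]
```

For example, `starts_with_header(open('example.csv', newline=''), ['name', 'age'])` would check a file against those two hypothetical column names.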

Comments

0

Simply use try and except:

import pandas as pd
try:
    data = pd.read_csv('file.csv', encoding='ISO-8859-1')
    print('csv file has header')
except Exception:
    print('csv file has no header')

Comments

0

An updated version of ChrisD's answer with a fallback for empty files:

import csv

with open(filename, "r", newline="") as f:
    try:
        has_headings = csv.Sniffer().has_header(f.read(1024))
    except csv.Error:
        # The file seems to be empty
        has_headings = False

https://docs.python.org/3/library/csv.html#csv.Sniffer.has_header

Comments
