Right now I am trying to read in data which is provided in a messy to read-in format. Here is an example
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The excel files are in a folder one path lower than the file that is supposed to read them (i.e. folder 1 contains set of code below, and folder 2 contains the excel files, folder 1 contains folder 2)
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
for i in range(len(sets)):
df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1]
df_temp.column.value[0] = [newHeader]
for j in range(len(dataLabels)):
df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')
df.append(df_temp)
return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
read_csvhas askiprowsparameter that can be used to skip the first rows, acommentthat can be set to#to skip the first rows. If nothing works, you could usedf.iterrows()to find the index of the first "good" row then usegood_df = df.iloc[(i + 1):].reset_index(drop=True)to get everything after that. You'll have to do some column-renaming and type-fixing in that case