Editing CSV using Python

Question

I have a big csv file that I want to edit. The editing here means deleting the columns that have only one value. So far, I wrote this (since I'm a newbie in Python I'm stuck and not sure if that is the right solution for the problem):

import csv
import collections
import numpy as np 


number_of_rows = 2432 
interseting_cols = [] 

col_values = collections.defaultdict(list)
col_values_named = collections.defaultdict(list)
new_row = collections.defaultdict(list)
inputFile = open('input.csv', 'r',newline='');
outputFile= open('output.csv','w')

reader = csv.reader(inputFile)
writer = csv.writer(outputFile)
#skip field names
next(reader)
for row in reader:
    for col, value in enumerate(row):
        col_values[col].append(value)
        #each column is now saved col_values ( without the headers )


for  i in range(len(col_values)):
    if len(set(col_values[i][:(number_of_rows-1)])) != 1:
        interseting_cols.append(i)# saved the index of the columns with valid values 

inputFile.seek(0)

# reading the file again now with headers
for row in reader:
    for col, value in enumerate(row):
        col_values_named[col].append(value)# saving the columns now with header 


# generating a new CSV file, only with interessting columns :
for i in range(number_of_rows):
    print("i value ",i)
    for j in range(len(interseting_cols)): # I'm not sure about this PART !!!!
            new_row.append(col_values_named[interseting_cols[j]])
            writer.writerow(new_row)

Any idea how to do the last loop? Or is there is a better way to solve this?

UPDATE say the file looks like

---------------------------------------------------
            |A|B   |C   |D  |F   |G|H   |I|J  |K   |       
--------------------------------------------------- 
1           |1|NULL|444 |201|0.01|A|NULL|4|9.5|NULL|     
--------------------------------------------------- 
2           |2|NULL|NULL|201|0   |A|NULL|4|9.5|NULL|
--------------------------------------------------- 
3           |4|NULL|444 |201|0   |A|NULL|4|9.5|NULL|
--------------------------------------------------- 
4           |1|NULL|444 |201|0   |A|NULL|4|9.5|NULL|

in this case the result should only include only three columns A,C and F

Could you edit the question to include a small sample from your CSV file, and also how you want it to appear afterwards. — Martin Evans
– Martin Evans, Commented Mar 22, 2017 at 12:54
As a tip, there exists a libary called Pandas, which is extremely useful in reading, manipulating und writing data — Christian Sauer
– Christian Sauer, Commented Mar 22, 2017 at 13:03

Naveen Honest Raj K · Accepted Answer · 2017-03-22 13:04:13Z

2

Using pandas library, you can reduce all your extra work by its own inbuilt functions. Here is a small implementation of the requirement you posted above. If you are a beginner and in need of little more clear explanations, ping me in comment and am ready to give a little more information. By the way, start playing around with pandas.

import pandas as pd 

df = pd.read_csv('input.csv')

for columns in df:
    if len(df[columns].unique()) == 1:
        df.drop(columns, 1, inplace=True)

df.to_csv('output.csv', index=None)

answered Mar 22, 2017 at 13:04

Naveen Honest Raj K

3621 gold badge2 silver badges8 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Engine Over a year ago

thanks for your answer, but how can I ignore the headers in the 1st rows ?

Naveen Honest Raj K Over a year ago

While writing to csv?? just do df.to_csv('output.csv', index=None, header=False)

Engine Over a year ago

no while reading because if the check includes the headers , it will never delete any column

Naveen Honest Raj K Over a year ago

Sorry for the late reply bro.! while reading, just do this.. df = pd.read_csv('input.csv', header = None )

Engine Over a year ago

thanks for replying I've solved it, but using the hardway no panda ;-)

|

nigel222 · Accepted Answer · 2017-03-22 13:47:31Z

1

Unless the spreadsheet is truly enormous, just read the whole thing in and then find what you want!

Untested code:

headers = reader.next()
sheet = []
for row in reader:
    sheet.append(row)

# assuming all rows are the same length ...
for colno,header in enumerate(headers):
    col = [ row[colno] for row in sheet ]
    distinct = set( col)
    if len(distinct) > 1:
        # col contains at least two distinct values, so
        # do something with it and its header and/or column number
        writer.writerow(  [header, colno] + col )

answered Mar 22, 2017 at 13:47

nigel222

8,3221 gold badge19 silver badges31 bronze badges

Collectives™ on Stack Overflow

Editing CSV using Python

2 Answers 2

6 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related