2

I have a big csv file that I want to edit. The editing here means deleting the columns that have only one value. So far, I wrote this (since I'm a newbie in Python I'm stuck and not sure if that is the right solution for the problem):

import csv
import collections
import numpy as np 


number_of_rows = 2432 
interseting_cols = [] 

col_values = collections.defaultdict(list)
col_values_named = collections.defaultdict(list)
new_row = collections.defaultdict(list)
inputFile = open('input.csv', 'r',newline='');
outputFile= open('output.csv','w')

reader = csv.reader(inputFile)
writer = csv.writer(outputFile)
#skip field names
next(reader)
for row in reader:
    for col, value in enumerate(row):
        col_values[col].append(value)
        #each column is now saved col_values ( without the headers )


for  i in range(len(col_values)):
    if len(set(col_values[i][:(number_of_rows-1)])) != 1:
        interseting_cols.append(i)# saved the index of the columns with valid values 

inputFile.seek(0)

# reading the file again now with headers
for row in reader:
    for col, value in enumerate(row):
        col_values_named[col].append(value)# saving the columns now with header 


# generating a new CSV file, only with interessting columns :
for i in range(number_of_rows):
    print("i value ",i)
    for j in range(len(interseting_cols)): # I'm not sure about this PART !!!!
            new_row.append(col_values_named[interseting_cols[j]])
            writer.writerow(new_row)

Any idea how to do the last loop? Or is there is a better way to solve this?

UPDATE say the file looks like

---------------------------------------------------
            |A|B   |C   |D  |F   |G|H   |I|J  |K   |       
--------------------------------------------------- 
1           |1|NULL|444 |201|0.01|A|NULL|4|9.5|NULL|     
--------------------------------------------------- 
2           |2|NULL|NULL|201|0   |A|NULL|4|9.5|NULL|
--------------------------------------------------- 
3           |4|NULL|444 |201|0   |A|NULL|4|9.5|NULL|
--------------------------------------------------- 
4           |1|NULL|444 |201|0   |A|NULL|4|9.5|NULL|

in this case the result should only include only three columns A,C and F

2
  • 1
    Could you edit the question to include a small sample from your CSV file, and also how you want it to appear afterwards. Commented Mar 22, 2017 at 12:54
  • 1
    As a tip, there exists a libary called Pandas, which is extremely useful in reading, manipulating und writing data Commented Mar 22, 2017 at 13:03

2 Answers 2

2

Using pandas library, you can reduce all your extra work by its own inbuilt functions. Here is a small implementation of the requirement you posted above. If you are a beginner and in need of little more clear explanations, ping me in comment and am ready to give a little more information. By the way, start playing around with pandas.

import pandas as pd 

df = pd.read_csv('input.csv')

for columns in df:
    if len(df[columns].unique()) == 1:
        df.drop(columns, 1, inplace=True)

df.to_csv('output.csv', index=None)
Sign up to request clarification or add additional context in comments.

6 Comments

thanks for your answer, but how can I ignore the headers in the 1st rows ?
While writing to csv?? just do df.to_csv('output.csv', index=None, header=False)
no while reading because if the check includes the headers , it will never delete any column
Sorry for the late reply bro.! while reading, just do this.. df = pd.read_csv('input.csv', header = None )
thanks for replying I've solved it, but using the hardway no panda ;-)
|
1

Unless the spreadsheet is truly enormous, just read the whole thing in and then find what you want!

Untested code:

headers = reader.next()
sheet = []
for row in reader:
    sheet.append(row)

# assuming all rows are the same length ...
for colno,header in enumerate(headers):
    col = [ row[colno] for row in sheet ]
    distinct = set( col)
    if len(distinct) > 1:
        # col contains at least two distinct values, so
        # do something with it and its header and/or column number
        writer.writerow(  [header, colno] + col )

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.