How to store CSV field values in an array in Python

Question

Suppose I have two CSV files:

file1.csv

event_id, polarity
   1124,   0.3763
  36794,   0.638
 dhejjd,   0.3627

file2.csv

event_id, tallies
   61824,   0.3
   36794,   0.8
   dhejjd,   0.9
   dthdnb,   0.66

I want to access the polarity and tallies for each event_id. How can I read these files into two arrays so that, for each event_id, I can get the polarity and tallies and then perform my calculations with these two values? I tried the code below, but it raised an error:

 for event_id, polarity in file1reader:
 ValueError: need more than 1 value to unpack

My code:

import csv

file1reader = csv.reader(open("file1.csv"), delimiter=",")
file2reader = csv.reader(open("file2.csv"), delimiter=",")

header1 = file1reader.next() #header
header2 = file2reader.next() #header

for event_id, polarity in file1reader:
    #if event_id and polarity in file1reader:
      for event_id, tallies in file2reader:
        #if event_id in file2reader:
          if file1reader.event_id == file2reader.event_id:
            print event_id, polarity, tallies   
            break   
file1reader.close()
file2reader.close()

What did not work out? Be more specific. Are you getting any error?
– Mihai Caracostea, Commented May 29, 2015 at 9:09
Yes. for event_id, polarity in file1reader: ValueError: need more than 1 value to unpack
– MEH, Commented May 29, 2015 at 9:22
@MEH, can you post the actual format of your CSV files? What you have in your question is a mess.
– Padraic Cunningham, Commented May 29, 2015 at 9:46

lanenok · Accepted Answer · 2015-05-31 09:03:49Z

3

Use pandas data frames instead of numpy arrays

import pandas as pd
df = pd.read_csv("file1.csv", index_col="event_id", skipinitialspace=True)
df2 = pd.read_csv("file2.csv", index_col="event_id", skipinitialspace=True)
df = df.merge(df2, how='outer', left_index=True, right_index=True)

P.S. Corrected the code so that it runs. The 'outer' join means that if only 'polarity' or 'tallies' exist for a given 'event_id', then missing values are coded as NaNs. The output is

          polarity  tallies
event_id                   
1124        0.3763      NaN
36794       0.6380     0.80
61824          NaN     0.30
dhejjd      0.3627     0.90
dthdnb         NaN     0.66

If you need only rows where both are present, use how='inner'

P.P.S To work with this data frame further you can, for example, replace NaNs with some value, let us say 0:

df.fillna(0, inplace=True)

You can select elements by label

df.loc["dhejjd","polarity"]
df.loc[:,"tallies"]

or by integer position

df.iloc[0:3,:]

If you have never used pandas, it takes some time to learn and get used to, but it is worth every second.

edited May 31, 2015 at 9:03

answered May 29, 2015 at 9:14


4 Comments

The6thSense Over a year ago

No one will and is doing that if our answer is really bad then how can we find what mistake we are doing

Padraic Cunningham Over a year ago

Not mine but this would not even run so I am not overly surprised it was downvoted, you also don't link to pandas which is not builtin or explain what it is doing

lanenok Over a year ago

@Padraic Cunningham Thank you for the constructive criticism. I corrected and extended my answer accordingly

The6thSense Over a year ago

@lanenok it is a fellow stack overflowers feeling

Kasravnd · Accepted Answer · 2015-05-29 09:11:08Z

You don't need to loop over both csv reader objects. You can first use itertools.chain to concatenate the two readers, then use a dictionary (with its setdefault method) to store each event_id as a key and the list of values found for it (polarity and tallies) as the value.

import csv
from itertools import chain
d={}
with open('a1.txt', 'rb') as csvfile1,open('ex.txt', 'rb') as csvfile2:
     spamreader1 = csv.reader(csvfile1, delimiter=',')
     spamreader2 = csv.reader(csvfile2, delimiter=',')
     spamreader1.next()
     spamreader2.next()
     sp=chain(spamreader1,spamreader2)
     for i,j in sp:
            d.setdefault(i,[]).append(j)
     print d

result :

{'36794': ['0.638', '0.8'], 
 '61824': ['0.3'], 
 '1124': ['0.3763'], 
 'dthdnb': ['0.66'], 
 'dhejjd': ['0.3627', '0.9']}
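The code above is Python 2. A rough Python 3 sketch of the same chain-and-setdefault idea follows. The helper name `group_values` is hypothetical, the sample data is copied from the question and fed through `io.StringIO` instead of real files, and `skipinitialspace=True` is an added assumption so that the padded fields in the question's files produce clean keys.

```python
import csv
import io
from itertools import chain

# Sample data copied from the question; in practice these would be open files.
FILE1 = """event_id, polarity
   1124,   0.3763
  36794,   0.638
 dhejjd,   0.3627
"""
FILE2 = """event_id, tallies
   61824,   0.3
   36794,   0.8
   dhejjd,   0.9
   dthdnb,   0.66
"""


def group_values(*sources):
    """Collect every value seen for each event_id across all sources.

    `group_values` is a hypothetical helper, not part of the csv module.
    """
    readers = []
    for source in sources:
        reader = csv.reader(source, skipinitialspace=True)
        next(reader, None)  # skip the header row
        readers.append(reader)

    grouped = {}
    for row in chain(*readers):
        if len(row) != 2:
            continue  # ignore blank or malformed rows
        event_id, value = row
        grouped.setdefault(event_id, []).append(float(value))
    return grouped


grouped = group_values(io.StringIO(FILE1), io.StringIO(FILE2))
print(grouped)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`, which is the mode the csv documentation recommends for Python 3.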

NDevox · Accepted Answer · 2015-05-29 09:22:04Z

0

When you loop through file2 the first time, the reader hits StopIteration at the end of the file and stays there, so every later pass of the outer loop gets no rows from it. To read it multiple times you would have to open it multiple times, but that entire process is wasteful. Assuming you can fit all of the data into memory, you could just read the data into dicts:

import csv

file1 = {}

file2 = {}

with open('file1.csv', 'r') as input1:

    reader = csv.reader(input1)
    reader.next()

    for row in reader:
        file1[row[0]] = row[1]

with open('file2.csv', 'r') as input2:

    reader = csv.reader(input2)
    reader.next()

    for row in reader:
        file2[row[0]] = row[1]


# And now we can directly compare without looping through file 2 every time

for key in file1:
    # try/except is more pythonic.
    try:
        print key, file1[key], file2[key]
    except KeyError:
        pass

This saves processing time as you don't have to loop so much and stops you from having to open and close the file every time you go to the next iteration of file1.
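As a hedged Python 3 variant of the same two-dict approach, the matching step can also be written with a set intersection of the key views instead of try/except. The helper `load_column` is a hypothetical name, the inline sample strings stand in for the real files, and the values are converted to float on the assumption that they are always numeric.

```python
import csv
import io


def load_column(source):
    """Map event_id -> float value for a two-column CSV source (hypothetical helper)."""
    reader = csv.reader(source, skipinitialspace=True)
    next(reader, None)  # skip the header row
    return {row[0]: float(row[1]) for row in reader if len(row) == 2}


# Inline stand-ins for file1.csv and file2.csv from the question.
polarity = load_column(io.StringIO(
    "event_id, polarity\n1124, 0.3763\n36794, 0.638\ndhejjd, 0.3627\n"))
tallies = load_column(io.StringIO(
    "event_id, tallies\n61824, 0.3\n36794, 0.8\ndhejjd, 0.9\ndthdnb, 0.66\n"))

# Dict key views support set operations, so the shared ids are one expression.
common = sorted(polarity.keys() & tallies.keys())
for event_id in common:
    print(event_id, polarity[event_id], tallies[event_id])
```

Whether the intersection or the try/except form reads better is a matter of taste; both do one dictionary lookup per key rather than rescanning the second file.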

Note: I originally used DictReader in this example, but that was based on the assumption that you had multiple columns, which I believe was wrong. In this case you can just use list indexing.

If you do have multiple columns with the same names but in varying order, you can use DictReader instead. The code is then as follows:

import csv

file1 = {}

file2 = {}

with open('file1.csv', 'r') as input1:

    reader = csv.DictReader(input1)
    # Don't use next so we can use the headers as keys

    for row in reader:
        file1[row['event_id']] = row['polarity']

with open('file2.csv', 'r') as input2:

    reader = csv.DictReader(input2)
    # Don't use next so we can use the headers as keys

    for row in reader:
        file2[row['event_id']] = row['tallies']


# And now we can directly compare without looping through file 2 every time

for key in file1:
    # try/except is more pythonic.
    try:
        print key, file1[key], file2[key]
    except KeyError:
        pass

edited May 29, 2015 at 9:22

answered May 29, 2015 at 9:13


8 Comments

MEH Over a year ago

i am getting:Traceback (most recent call last): File "combined5.py", line 14, in <module> file1[row['event_id']] = row['polarity'] KeyError: 'polarity'

NDevox Over a year ago

Check the column header names match the keys. Note - you don't really need the dictreader to do this, you can use list indexing which I've changed the answer to match.

NDevox Over a year ago

look out for capitalisation, extra spaces etc. they all count. You can print row for an example of what the actual keys are.

MEH Over a year ago

my actual keys are event_id and polarity. same goes for event_id and tallies.

NDevox Over a year ago

and if you switch file1[row['event_id']] = row['polarity'] to print row.keys() what do you get for the keys printed out? The error suggests that polarity isn't quite whats in the header row.
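One possible cause of this KeyError, assuming the header line really is `event_id, polarity` with a space after the comma as shown in the question: DictReader keeps that space, so the key is `' polarity'` and not `'polarity'`. The sketch below uses an inline sample string instead of the real file and shows the effect of `skipinitialspace=True`.

```python
import csv
import io

# One header line and one data row, formatted as in the question.
data = "event_id, polarity\n   1124,   0.3763\n"

# Default settings: the space after the comma becomes part of the field name.
plain = csv.DictReader(io.StringIO(data))
print(plain.fieldnames)

# skipinitialspace=True drops spaces that follow a delimiter,
# in the header as well as in the data rows.
trimmed = csv.DictReader(io.StringIO(data), skipinitialspace=True)
print(trimmed.fieldnames)

row = next(trimmed)
print(row["event_id"], row["polarity"])
```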


Padraic Cunningham · Accepted Answer · 2015-05-29 09:59:47Z

You can group them using a dict:

import csv
from collections import defaultdict

with open("file1.csv") as f1, open("file2.csv") as f2:
    d = defaultdict(list)
    next(f1), next(f2)  # skip the header line of each file
    r1 = csv.reader(f1, skipinitialspace=True)
    r2 = csv.reader(f2, skipinitialspace=True)
    for row in r1:
        d[row[0]].append(float(row[1]))
    for row in r2:
        d[row[0]].append(float(row[1]))

defaultdict(<type 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})

from operator import mul
for k, v in filter(lambda x: len(x[1])== 2, d.items()):
    print(mul(*v))
0.5104
0.32643

If you actually have multiple spaces in your file, then the csv module is not going to work, which is probably the case based on your ValueError.

If your file is a mess this will work:

with open("file1.csv") as f1, open("file2.csv") as f2:
    d = defaultdict(list)
    next(f1), next(f2)
    for row in f1:
        eve, pol = row.replace(" ","").split(",")
        d[eve].append(float(pol))
    for row in f2:
        eve, tal = row.replace(" ","").split(",")
        d[eve].append(float(tal))

Input:

file1.csv

event_id, polarity
   1124,   0.3763
  36794,   0.638
 dhejjd,   0.3627

file2.csv

event_id, tallies
   61824,   0.3
   36794,   0.8
   dhejjd,   0.9
   dthdnb,   0.66

Output:

defaultdict(<type 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})
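The unpacking ValueError from the question is raised whenever a row does not contain exactly two fields, for example a blank line or a line with no comma. A minimal Python 3 sketch of a more defensive loop follows; the messy input is invented for illustration, and collecting the rejected rows in a `skipped` list is just one possible way to handle them.

```python
import csv
import io

# Hypothetical messy input: line 3 is blank and line 4 has no comma.
messy = (
    "event_id, polarity\n"
    "   1124,   0.3763\n"
    "\n"
    "  36794   0.638\n"
    " dhejjd,   0.3627\n"
)

good = {}
skipped = []  # (line number, raw row) pairs that could not be unpacked

reader = csv.reader(io.StringIO(messy), skipinitialspace=True)
next(reader, None)  # skip the header row
for row in reader:
    if len(row) != 2:
        # A blank line yields [], a line without a comma yields one field.
        skipped.append((reader.line_num, row))
        continue
    event_id, polarity = row
    good[event_id] = float(polarity)

print(good)
print(skipped)
```

Printing the skipped rows on the real file should show which lines trigger the error.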

martineau · Accepted Answer · 2015-05-29 11:00:56Z

I'd suggest storing the data from the two files in a dictionary of dictionaries, which can easily be created using collections.defaultdict.

import csv
from collections import defaultdict
import json  # just for pretty printing resulting data structure

event_data = defaultdict(dict)

filename1 = "file1.csv"
filename2 = "file2.csv"

with open(filename1, "rb") as file1:
    file1reader = csv.reader(file1, delimiter=",", skipinitialspace=True)
    next(file1reader)  # skip over header
    for event_id, polarity in file1reader:
        event_data[event_id]['polarity'] = float(polarity)

with open(filename2, "rb") as file2:
    file2reader = csv.reader(file2, delimiter=",", skipinitialspace=True)
    next(file2reader)  # skip over header
    for event_id, tallies in file2reader:
        event_data[event_id]['tallies'] = float(tallies)

print 'event_data:', json.dumps(event_data, indent=4)
print

# print as table
for event_id in sorted(event_data):
    print 'event_id: {!r:<8} polarity: {:<8} tallies: {:<8}'.format(
        event_id,
        event_data[event_id].get('polarity', None),
        event_data[event_id].get('tallies', None))

Output:

event_data: {
    "36794": {
        "polarity": 0.638, 
        "tallies": 0.8
    }, 
    "61824": {
        "tallies": 0.3
    }, 
    "1124": {
        "polarity": 0.3763
    }, 
    "dthdnb": {
        "tallies": 0.66
    }, 
    "dhejjd": {
        "polarity": 0.3627, 
        "tallies": 0.9
    }
}

event_id: '1124'   polarity: 0.3763   tallies: None    
event_id: '36794'  polarity: 0.638    tallies: 0.8     
event_id: '61824'  polarity: None     tallies: 0.3     
event_id: 'dhejjd' polarity: 0.3627   tallies: 0.9     
event_id: 'dthdnb' polarity: None     tallies: 0.66
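Once the data is in this shape, the calculation the question asks about can be limited to events that have both fields. The Python 3 sketch below assumes the calculation is a simple product, which is only a placeholder since the question does not say what it is. The `event_data` literal is copied from the output above, and the name `products` is hypothetical.

```python
# Nested structure as produced above (values copied from the printed output).
event_data = {
    "1124": {"polarity": 0.3763},
    "36794": {"polarity": 0.638, "tallies": 0.8},
    "61824": {"tallies": 0.3},
    "dhejjd": {"polarity": 0.3627, "tallies": 0.9},
    "dthdnb": {"tallies": 0.66},
}

# Placeholder calculation: multiply the two values where both are present.
products = {
    event_id: fields["polarity"] * fields["tallies"]
    for event_id, fields in event_data.items()
    if "polarity" in fields and "tallies" in fields
}

for event_id in sorted(products):
    print(event_id, products[event_id])
```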
