Python "input data"

Question

I have a *.data file that contains data in this format:

2.5,10,U1
3,4.5,U1
3,9,U1
3.5,5.5,U1
3.5,8,U1
4,7.5,U1
4.5,3.5,U1
4.5,4.5,U1
4.5,6,U1
5,5,U1
5,7,U1
7,6.5,U1
3.5,9.5,U2
3.5,10.5,U2
4.5,8,U2
4.5,10.5,U2
5,9,U2
5.5,5.5,U2
5.5,7.5,U2

In this data there are 2 classes, U1 and U2, and each row holds 2 values followed by its class. (I have different kinds of data; this is just an example with only 2 classes.) I need to read this data and separate it by class, in this case into U1 and U2. After that, I need to put 2/3 of the rows of every class into one variable (learning_set) and the remaining 1/3 into another (test_set).

I started with this code:

data = open('set.data', 'rt')                             
data_list=[]                                                   
border=2./3                                                  
data_list = [line.strip().split(',') for line in data]

learning_set=data_list[:int(round(len(data_list)*border))]
test_set=data_list[int(round(len(data_list)*border)):]

But this takes 2/3 and 1/3 of the whole data set, not of every class.

Many thanks for your help.

If you're generating this file from Python to begin with, maybe you should consider the pickle module, especially if you've already done this processing beforehand. There will be a little more overhead though, and the file will no longer be entirely human-readable.
– WirthLuce, Commented May 15, 2011 at 16:13
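
As a rough illustration of the pickle suggestion in the comment above, here is a minimal sketch that saves already-parsed rows and loads them back. The two sample rows, the temporary directory and the file name set.pickle are assumptions made for this example only; nothing in the question says the file is produced by Python.

```python
import os
import pickle
import tempfile

# Rows as they would look after parsing (two rows of the sample data).
data_list = [['2.5', '10', 'U1'], ['3.5', '9.5', 'U2']]

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'set.pickle')

# Pickle files are binary, so open them in 'wb' / 'rb' mode.
with open(path, 'wb') as fh:
    pickle.dump(data_list, fh)

with open(path, 'rb') as fh:
    restored = pickle.load(fh)
```

The restored object is an equal copy of the original list, but the file on disk is no longer readable as text, which is the trade-off the comment mentions.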

Howard · Accepted Answer · 2011-05-15 16:25:25Z

2

You can filter your list after reading into two distinct subsets:

data_list_1 = [(x,y,c) for (x,y,c) in data_list if c=='U1']
data_list_2 = [(x,y,c) for (x,y,c) in data_list if c=='U2']

Afterwards you can construct the learning and test sets as before, but on the filtered lists, e.g.

learning_set = data_list_1[:int(round(len(data_list_1)*border))] + data_list_2[:int(round(len(data_list_2)*border))]

and the same for test_set.
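
For completeness, a minimal sketch of both halves of the split when the two class names are known. The inline data_list is a stand-in for rows already read from the file (a subset of the sample data), and the names cut_1 and cut_2 are introduced here only for illustration.

```python
border = 2.0 / 3

# Stand-in for the rows read from the file (a subset of the sample data).
data_list = [
    ['2.5', '10', 'U1'], ['3', '4.5', 'U1'], ['3', '9', 'U1'],
    ['3.5', '5.5', 'U1'], ['3.5', '8', 'U1'], ['4', '7.5', 'U1'],
    ['3.5', '9.5', 'U2'], ['3.5', '10.5', 'U2'], ['4.5', '8', 'U2'],
]

data_list_1 = [row for row in data_list if row[-1] == 'U1']
data_list_2 = [row for row in data_list if row[-1] == 'U2']

# Compute each cut-off once so both halves use the same index.
cut_1 = int(round(len(data_list_1) * border))
cut_2 = int(round(len(data_list_2) * border))

learning_set = data_list_1[:cut_1] + data_list_2[:cut_2]
test_set = data_list_1[cut_1:] + data_list_2[cut_2:]
```

Reusing the same cut-off for both slices guarantees that every row lands in exactly one of the two sets.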

Update: If you don't know the classes in advance, you can use the following code to first detect all classes and then loop over them.

classes = set([t[-1] for t in data_list])

learning_set = []
test_set = []

for cl in classes:
    data_list_filtered = [t for t in data_list if t[-1]==cl]

    learning_set += data_list_filtered[:int(round(len(data_list_filtered)*border))]
    test_set += data_list_filtered[int(round(len(data_list_filtered)*border)):]

edited May 15, 2011 at 16:25

answered May 15, 2011 at 16:07

Howard


4 Comments

thaking Over a year ago

The problem is that I don't know the class names, or even how many classes I have; I only know that the class name is the last element in every row, after the last comma.

thaking Over a year ago

Hmm, I received an error: classes = set([c for (x,y,c) in data_list]) gives ValueError: too many values to unpack. This happened when I used other input data, which has 3 classes...

Howard Over a year ago

@thaking no, it happens if there is a line with more than three values

thaking Over a year ago

Well, I have data of the form { x,y,z,f,g....., class }. How can I modify your code for this?
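
One possible way to handle rows of any width is to rely only on the class label being the last field, as the t[-1] indexing in the updated answer does. The sketch below assumes Python 3 (for the starred unpacking); the sample rows and the names rows, by_class, features and label are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows with four feature columns and a trailing class label.
rows = [
    ['1.0', '2.0', '3.0', '4.0', 'A'],
    ['1.5', '2.5', '3.5', '4.5', 'B'],
    ['2.0', '3.0', '4.0', '5.0', 'A'],
    ['2.5', '3.5', '4.5', '5.5', 'A'],
]

by_class = defaultdict(list)
for row in rows:
    # Everything except the last field is a feature; the last is the class.
    *features, label = row
    by_class[label].append([float(value) for value in features])
```

Because nothing here names a fixed number of columns, the "too many values to unpack" error from a three-element tuple pattern cannot occur.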

senderle · Accepted Answer · 2011-05-15 16:36:47Z

2

Ah, you want itertools.groupby:

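import itertools

# groupby only merges adjacent rows, so sort first; copy each group into a list right away.
rows = sorted(data_list, key=lambda x: x[-1])
class_dict = {k: list(g) for k, g in itertools.groupby(rows, key=lambda x: x[-1])}
class_names = list(class_dict)
class_lists = list(class_dict.values())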

Then just slice each list in class_lists appropriately and extend learning_set and test_set with the results.

Here's a full solution:

data_list = [line.strip().split(',') for line in data]
data_list.sort(key=lambda x: x[-1])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

learning_set, test_set = [], []
for key, group in itertools.groupby(data_list, key=lambda x: x[-1]):
    l, t = bisect_list(list(group), 0.66)
    learning_set.extend(l)
    test_set.extend(t)
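
Note that truncating with a fraction of 0.66 does not always give the same cut-off as rounding with 2/3, which the question's own code uses. A small sketch of the difference, using the class sizes from the question's sample data (12 rows of U1, 7 rows of U2); the helper names cut_truncate and cut_round are made up for this comparison.

```python
def cut_truncate(n, fraction=0.66):
    # Cut-off used by bisect_list(): multiply, then truncate toward zero.
    return int(fraction * n)

def cut_round(n, fraction=2.0 / 3):
    # Cut-off used in the question: multiply, then round to nearest.
    return int(round(n * fraction))

# Class sizes in the question's sample data.
sizes = {'U1': 12, 'U2': 7}

truncated = {cls: cut_truncate(n) for cls, n in sizes.items()}
rounded = {cls: cut_round(n) for cls, n in sizes.items()}
```

Here truncation puts 7 and 4 rows into the learning sets, while rounding puts 8 and 5, so the choice matters if an exact 2/3 share is required.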

edited May 15, 2011 at 16:36

answered May 15, 2011 at 16:15

senderle


1 Comment

Rob Cowie Over a year ago

Good answer, though note that the input to groupby() must be sorted on the grouping key.
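
A short sketch of why the sort matters: groupby() starts a new group every time the key changes, so an unsorted input can yield the same class more than once. The three rows below are hypothetical.

```python
from itertools import groupby

# Hypothetical rows where the U1 entries are not adjacent.
rows = [['1', 'U1'], ['2', 'U2'], ['3', 'U1']]

def last_field(row):
    return row[-1]

# Without sorting, U1 shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=last_field)]

# After sorting on the same key, each class appears exactly once.
sorted_rows = sorted(rows, key=last_field)
sorted_keys = [key for key, _ in groupby(sorted_rows, key=last_field)]
```

With the unsorted input the second U1 group would be split separately, so the 2/3 share would be taken from each fragment instead of from the whole class.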

Rob Cowie · Accepted Answer · 2011-05-15 16:44:21Z

For what it's worth (and because I've typed it out already), I'd accomplish this with something like...

from itertools import groupby
from operator import attrgetter
from collections import namedtuple

row_container = namedtuple('row', 'val1,val2,klass')

def process_row(row):
    """Return a named tuple"""
    return row_container(float(row[0]), float(row[1]), row[2])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]


data = open('test.csv', 'rt')

## Parse & process each line
data = (row.strip().split(',') for row in data)
data = (process_row(row) for row in data)

## Sort & group the data by class
sorted_data = sorted(data, key=attrgetter('klass'))
grouped_data = groupby(sorted_data, attrgetter('klass'))

## For each class, create learning and test sets
final_data = {}
for klass, class_rows in grouped_data:
    learning_set, test_set = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning_set, test=test_set)

The method of operation is similar to the other answers already provided, but it uses namedtuple. bisect_list() is lifted from @senderle.

MRAB · Accepted Answer · 2011-05-15 16:11:12Z

1

I would use a defaultdict to collect the entries into separate lists.

from collections import defaultdict

data = open(r'C:\Documents and Settings\Administrator\Desktop\set.data', 'r')
data_lists = defaultdict(list)
border = 2.0 / 3
for line in data:
    entries = line.strip().split(',')
    data_lists[entries[-1]].append(entries[ : -1])

learning_sets = {}
test_sets = {}
for cls, values in data_lists.items():
    pos = int(round(len(values) * border))
    learning_sets[cls] = values[ : pos]
    test_sets[cls] = values[pos : ]

# print() calls so the loop also runs on Python 3
for cls in learning_sets:
    print("for class", cls)
    print("\tlearning set is", learning_sets[cls])
    print("\ttest set is", test_sets[cls])
    print()

answered May 15, 2011 at 16:11

MRAB


matchew · Accepted Answer · 2011-05-15 16:11:29Z

1

Consider using a dict/hash instead of a list.

I'd write more, but I am having trouble comprehending what you want to do afterwards.

answered May 15, 2011 at 16:11

matchew
