6

I am reading a whitespace-separated dataset from a file. I need to store every column except the last one in the array data, and the last column in the array target.

How should I proceed?

That's what I have so far:

with open(filename) as f:
    data = f.readlines()

Or should I be reading line by line?

PS: The columns have different data types.

Edit: Sample Data

faban       1   0   0.288   withspy
faban       2   0   0.243   withoutspy
simulated   1   0   0.159   withoutspy
faban       1   1   0.189   withoutspy
  • Can you provide the sample data? Commented Jan 4, 2016 at 7:31
  • Kindly check edit part. Commented Jan 4, 2016 at 7:33
  • You probably want to use the csv module. Commented Jan 4, 2016 at 7:34
  • Please describe the output as well Commented Jan 4, 2016 at 7:43
  • If you're going to do some sort of analysis later, you can probably also look at pandas (pandas.pydata.org). It provides functionality to read in data from CSV files. You can then separate the columns and play around with the data in the way you wish. Commented Jan 4, 2016 at 7:43
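
For reference, the csv suggestion in the comments could look roughly like the sketch below. It assumes the columns are separated by spaces only (not tabs) and that lines have no trailing spaces; the in-memory lines list stands in for an open file object and is purely illustrative.

```python
import csv

# Stand-in for an open file: csv.reader accepts any iterable of lines.
lines = [
    "faban       1   0   0.288   withspy",
    "simulated   1   0   0.159   withoutspy",
]

data, target = [], []
# skipinitialspace=True makes the reader ignore the extra spaces that
# follow each delimiter, so runs of spaces act as a single separator.
for row in csv.reader(lines, delimiter=" ", skipinitialspace=True):
    data.append(row[:-1])
    target.append(row[-1])

print(data)
print(target)
```

With a real file, pass the file object from open(filename, newline='') in place of lines.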

4 Answers

9

This would work:

data = []
target = []
with open('faban.txt') as fobj:
    for line in fobj:
        row = line.split()
        data.append(row[:-1])
        target.append(row[-1])

Now:

>>> data
[['faban', '1', '0', '0.288'],
 ['faban', '2', '0', '0.243'],
 ['simulated', '1', '0', '0.159'],
 ['faban', '1', '1', '0.189']]

>>> target
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']
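
Every value above is still a string. Since the question mentions that the columns have different types, here is a minimal sketch of one way to convert them afterwards. The helper name convert is hypothetical, and it assumes that anything which parses as an int or a float should become a number.

```python
def convert(value):
    """Return value as an int or float if it parses as one, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# Sample rows in the same shape as data above.
rows = [['faban', '1', '0', '0.288'],
        ['simulated', '1', '0', '0.159']]

typed = [[convert(v) for v in row] for row in rows]
print(typed)  # [['faban', 1, 0, 0.288], ['simulated', 1, 0, 0.159]]
```

Trying int before float keeps whole numbers as integers instead of turning them into floats.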

4

I think numpy has a clean, easy solution here.

>>> import numpy as np
>>> data, target = np.array_split(np.loadtxt('file', dtype=str), [-1], axis=1)

results in:

>>> data.tolist()
[['faban', '1', '0', '0.288'], 
 ['faban', '2', '0', '0.243'], 
 ['simulated', '1', '0', '0.159'], 
 ['faban', '1', '1', '0.189']]
>>> target.flatten().tolist()
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']


3

You could do that with pandas: read_table reads the file, iloc selects the columns, values returns the underlying numpy array of the DataFrame, and tolist converts that array to a list:

import pandas as pd
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table('path_to_your_file', sep=r'\s+', header=None)
print(df)
           0  1  2      3           4
0      faban  1  0  0.288     withspy
1      faban  2  0  0.243  withoutspy
2  simulated  1  0  0.159  withoutspy
3      faban  1  1  0.189  withoutspy


data = df.iloc[:,:-1].values.tolist()
target = df.iloc[:,-1].tolist()

print(data)
[['faban', 1, 0, 0.28800000000000003],
 ['faban', 2, 0, 0.243],
 ['simulated', 1, 0, 0.159],
 ['faban', 1, 1, 0.18899999999999997]]

print(target)
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']

1 Comment

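read_table was deprecated at one point (the deprecation was later reverted); an equivalent with read_csv for whitespace-separated data is pd.read_csv('path_to_your_file', sep=r'\s+', header=None). As a bonus, note that you can name the columns with names=[...], giving one name per column.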
0

The following works nicely:

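# Read the file line by line and close it automatically when done.
with open('<FILE>') as f:
    # split() with no argument splits on any run of whitespace;
    # blank lines are skipped.
    out = [line.split() for line in f if line.strip()]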

out will then have the format [['faban', '1', '0', '0.288', 'withspy'], [...], ...]. (Note that all elements are strings.)
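
To get the data and target lists the question asks for, out still has to be split by column. A minimal sketch, using a small hard-coded out in the format described above and skipping any empty rows:

```python
# Rows in the format described above (every element is a string).
out = [['faban', '1', '0', '0.288', 'withspy'],
       ['faban', '2', '0', '0.243', 'withoutspy']]

# Drop empty rows, then slice the last column off each remaining row.
rows = [row for row in out if row]
data = [row[:-1] for row in rows]
target = [row[-1] for row in rows]

print(data)    # [['faban', '1', '0', '0.288'], ['faban', '2', '0', '0.243']]
print(target)  # ['withspy', 'withoutspy']
```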
