
I would like to load a big text file (around 1 GB, with 3*10^6 rows and 10-100 columns) as a 2D NumPy array of strings. However, numpy.loadtxt() seems to expect floats by default. Is it possible to specify another data type for the entire array? I've tried the following without luck:

loadedData = np.loadtxt(address, dtype=np.str)

I get the following error message:

/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: cannot set an array element with a sequence

Any ideas? (I don't know the exact number of columns in my file beforehand.)

4 Answers


Use genfromtxt instead. It's a much more general method than loadtxt:

import numpy as np
print(np.genfromtxt('col.txt', dtype=str))

Using the file col.txt:

foo bar
cat dog
man wine

This gives:

[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]

If you expect each row to have the same number of columns, read the first row to determine that count and set the filling_values argument to fill in any missing entries.
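A minimal sketch of the filling_values idea. One assumption about usage: it applies where the delimiter leaves an explicit empty field, as in a comma-separated row like cat,,dog, so the example below uses comma-delimited data rather than the whitespace-delimited col.txt above:

import numpy as np
from io import StringIO

# Hypothetical comma-separated data with a missing middle field.
text = StringIO('foo,bar,baz\ncat,,dog')

# Empty fields count as missing values and are replaced by
# filling_values instead of staying as empty strings.
data = np.genfromtxt(text, delimiter=',', dtype=str, filling_values='?')
print(data)
# [['foo' 'bar' 'baz']
#  ['cat' '?' 'dog']]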


6 Comments

Thanks! It works fine, except it's extremely time consuming. But from what I've read, that is expected using genfromtxt instead of loadtxt. Any way to speed things up? Or any other faster way to load big quantities of data?
If you are going to be using the dataset again and again, you might want to think of a storage solution other than a raw text file. Personally I'd go with PyTables or another HDF5 solution (a minimal sketch follows after these comments).
Actually, Python crashes when trying to load an 800 MB text file. It fills up the memory, using 8 GB of RAM and 35 GB of swap.
I use the data to train a classifier algorithm, so I only need it once.
@Sigur you'll have to do that after you load the file. If you want to use pandas as indicated in some of the other answers, they have direct string methods for this kind of stuff.
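A minimal sketch of the HDF5 suggestion from the comments above, using pandas (requires PyTables to be installed; the file names and the store key are placeholders, and sep=r'\s+' is an assumption about the delimiter):

import pandas as pd

# Convert the raw text file once; later runs load the HDF5 store,
# which is much faster than re-parsing 1 GB of text.
df = pd.read_csv('data.txt', sep=r'\s+', header=None, dtype=str)
df.to_hdf('data.h5', key='table', mode='w')

# In later sessions:
df = pd.read_hdf('data.h5', key='table')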

There is also read_csv in Pandas, which is fast and supports non-comma column separators and automatic typing by column:

import pandas as pd
df = pd.read_csv('your_file', sep='\t')

It can be converted to a NumPy array if you prefer that type with:

import numpy as np
arr = np.array(df)

This is by far the easiest and most mature text import approach I've come across.
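For the whitespace-separated, all-string case from the question, a sketch might look like this (sep=r'\s+' and dtype=str are assumptions about the file format):

import pandas as pd

# header=None because the file has no header row; dtype=str keeps
# every column as strings instead of letting pandas infer types.
df = pd.read_csv('your_file', sep=r'\s+', header=None, dtype=str)
arr = df.to_numpy()  # newer pandas; equivalent to np.array(df)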



np.loadtxt(file_path, dtype=str)
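For context: newer NumPy versions accept the builtin str here, while the np.str alias from the question was deprecated in NumPy 1.20 and removed in 1.24. A runnable version, reusing col.txt from the first answer:

import numpy as np

# Each element is read as a string; unlike genfromtxt with
# filling_values, the file must have a consistent column count.
loadedData = np.loadtxt('col.txt', dtype=str)
print(loadedData)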



Is it essential that you need a NumPy array? Otherwise you could speed things up by loading the data as a nested list.

def load(fname):
    '''Load the file using the standard open().'''
    data = []
    with open(fname, 'r') as f:          # closes the file automatically
        for line in f:                   # iterate line by line instead of readlines()
            data.append(line.rstrip('\n').split(' '))
    return data

For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
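If you do end up needing an array, the conversion is one line, though as the comments below note it gives back most of the speed gained (the file name is a placeholder):

import numpy as np

data = load('myfile.txt')        # nested list of strings
arr = np.array(data, dtype=str)  # converting negates most of the speedup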

2 Comments

If you convert the list object to an ndarray when returning it, the time consumption will be almost the same.
Of course, in that case you don't save any time. That's what I meant with the first sentence, followed by the "Otherwise" :)
