
I would like to load a big text file (around 1 GB, with 3*10^6 rows and 10-100 columns) as a 2D NumPy array of strings. However, numpy.loadtxt() seems to expect floats by default. Is it possible to specify another data type for the entire array? I've tried the following without luck:

loadedData = np.loadtxt(address, dtype=np.str)

I get the following error message:

/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: cannot set an array element with a sequence

Any ideas? (I don't know the exact number of columns in my file beforehand.)

4 Answers


Use genfromtxt instead. It's a much more general method than loadtxt:

import numpy as np
print(np.genfromtxt('col.txt', dtype=str))

Using the file col.txt:

foo bar
cat dog
man wine

This gives:

[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]

If you expect each row to have the same number of columns, read the first row to determine that count and set the filling_values argument to fill in any missing entries.
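A minimal sketch of the filling_values idea. One assumption about usage: it applies where the delimiter leaves an explicit empty field, as in a comma-separated row like cat,,dog, so the example below uses comma-delimited data rather than the whitespace-delimited col.txt above:

import numpy as np
from io import StringIO

# Hypothetical comma-separated data with a missing middle field.
text = StringIO('foo,bar,baz\ncat,,dog')

# Empty fields count as missing values and are replaced by
# filling_values instead of staying as empty strings.
data = np.genfromtxt(text, delimiter=',', dtype=str, filling_values='?')
print(data)
# [['foo' 'bar' 'baz']
#  ['cat' '?' 'dog']]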


6 Comments

Thanks! It works fine, except it's extremely time consuming. But from what I've read, that is expected using genfromtxt instead of loadtxt. Any way to speed things up? Or any other faster way to load big quantities of data?
If you are going to be using the dataset again and again, you might want to think of a storage solution other than a raw text file. Personally I'd go with PyTables or another HDF5 solution (a minimal sketch follows after these comments).
Actually, Python crashes when trying to load an 800 MB text file. It fills up the memory, using 8 GB of RAM and 35 GB of swap.
I use the data to train a classifier algorithm, so I only need it once.
@Sigur you'll have to do that after you load the file. If you want to use pandas as indicated in some of the other answers, they have direct string methods for this kind of stuff.
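A minimal sketch of the HDF5 suggestion from the comments above, using pandas (requires PyTables to be installed; the file names and the store key are placeholders, and sep=r'\s+' is an assumption about the delimiter):

import pandas as pd

# Convert the raw text file once; later runs load the HDF5 store,
# which is much faster than re-parsing 1 GB of text.
df = pd.read_csv('data.txt', sep=r'\s+', header=None, dtype=str)
df.to_hdf('data.h5', key='table', mode='w')

# In later sessions:
df = pd.read_hdf('data.h5', key='table')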

There is also read_csv in Pandas, which is fast and supports non-comma column separators and automatic typing by column:

import pandas as pd
df = pd.read_csv('your_file', sep='\t')

It can be converted to a NumPy array if you prefer that type with:

import numpy as np
arr = np.array(df)

This is by far the easiest and most mature text import approach I've come across.
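For the whitespace-separated, all-string case from the question, a sketch might look like this (sep=r'\s+' and dtype=str are assumptions about the file format):

import pandas as pd

# header=None because the file has no header row; dtype=str keeps
# every column as strings instead of letting pandas infer types.
df = pd.read_csv('your_file', sep=r'\s+', header=None, dtype=str)
arr = df.to_numpy()  # newer pandas; equivalent to np.array(df)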



np.loadtxt(file_path, dtype=str)
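For context: newer NumPy versions accept the builtin str here, while the np.str alias from the question was deprecated in NumPy 1.20 and removed in 1.24. A runnable version, reusing col.txt from the first answer:

import numpy as np

# Each element is read as a string; unlike genfromtxt with
# filling_values, the file must have a consistent column count.
loadedData = np.loadtxt('col.txt', dtype=str)
print(loadedData)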



Is it essential that you need a NumPy array? Otherwise you could speed things up by loading the data as a nested list.

def load(fname):
    '''Load the file using the standard open().'''
    data = []
    with open(fname, 'r') as f:          # closes the file automatically
        for line in f:                   # iterate line by line instead of readlines()
            data.append(line.rstrip('\n').split(' '))
    return data

For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
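If you do end up needing an array, the conversion is one line, though as the comments below note it gives back most of the speed gained (the file name is a placeholder):

import numpy as np

data = load('myfile.txt')        # nested list of strings
arr = np.array(data, dtype=str)  # converting negates most of the speedup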

2 Comments

If you convert the list object to an ndarray when returning it, the time consumption will be almost the same.
Of course, in that case you don't save any time. That's what I meant with the first sentence, followed by the "Otherwise" :)
