
I have datasets containing data for frequent rule mining where each row has a different number of items like

9 10 5
8 9 10 5 12 15
7 3 5

Is there a way to read a file with the above contents at once and convert it to a numpy array of arrays, like

array([array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5])], dtype=object)

I have come across the numpy.loadtxt function, but it does not handle a varying number of columns the way I want. With differing column counts, loadtxt requires specifying which columns to read, but I want to read all the values in each row.
One way to achieve this would be to read the files manually and convert each line into a numpy array, but I don't want to take that route because the actual datasets will be much bigger than the tiny example shown here. For instance, I am planning to use datasets from the FIMI repository. One sample is the accident dataset.
Edit: I used the following code to achieve what I want:

import numpy as np
from io import StringIO

data = []
# d = np.loadtxt('datasets/grocery.dat')  # fails: rows have unequal lengths
with open('datasets/accidents.dat', 'r') as f:
    for l in f:
        ar = np.genfromtxt(StringIO(l))
        data.append(ar)
data = np.array(data, dtype=object)
print(data)

But this is what I want to avoid: looping in Python, because it took more than four minutes just to read the data and convert it into numpy arrays.
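One variation that might cut the per-line cost considerably (a hedged sketch, not something benchmarked on the full accidents.dat; the sample list here stands in for the real file): np.fromstring with sep=' ' is a thin numeric parser, much cheaper per call than building a full genfromtxt reader for every row.

```python
import numpy as np

# Hypothetical small sample standing in for lines read from datasets/accidents.dat.
lines = ["9 10 5", "8 9 10 5 12 15", "7 3 5"]

# np.fromstring(sep=' ') parses a string of numbers directly,
# avoiding genfromtxt's per-line setup overhead.
data = [np.fromstring(l, sep=' ', dtype=int) for l in lines]
data = np.array(data, dtype=object)
print(data)
```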

  • Why go through genfromtxt if you are just going to parse one line at a time? It will slow things down. Load everything as a list of lists, and forget numpy. Commented Apr 16, 2020 at 0:36
  • You could read all lines at once, but you then have a list of strings, which still have to be split and converted to numbers. Commented Apr 16, 2020 at 1:43

1 Answer

In [401]: txt="""9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5 
     ...: 9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5 
     ...: 9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5""".splitlines()                                                                        

(this approximates what we'd get with readlines)

Collecting a list of lists is straightforward, but converting the strings to numbers would require a list comprehension (or another loop):

In [402]: alist = []                                                                                   
In [403]: for line in txt: 
     ...:     alist.append(line.split()) 
     ...:                                                                                              
In [404]: alist                                                                                        
Out[404]: 
[['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5'],
 ['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5'],
 ['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5']]
In [405]: np.array(alist)                                                                              
Out[405]: 
array([list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
       list(['7', '3', '5']), list(['9', '10', '5']),
       list(['8', '9', '10', '5', '12', '15']), list(['7', '3', '5']),
       list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
       list(['7', '3', '5'])], dtype=object)

It might be faster to convert each line to an integer array (but that's just a guess):

In [406]: alist = [] 
     ...: for line in txt: 
     ...:     alist.append(np.array(line.split(), dtype=int)) 
     ...:      
     ...:                                                                                              
In [407]: alist                                                                                        
Out[407]: 
[array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5]),
 array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5]),
 array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5])]
In [408]: np.array(alist)                                                                              
Out[408]: 
array([array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5]), array([ 9, 10,  5]),
       array([ 8,  9, 10,  5, 12, 15]), array([7, 3, 5]),
       array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5])], dtype=object)

Given the irregular nature of the text and the mix of array lengths in the result, there isn't much of an alternative. Arrays or lists of diverse sizes are a pretty good indicator that fast multidimensional array operations are not possible.

We could load all numbers as a 1d array with:

In [413]: np.fromstring(' '.join(txt), sep=' ', dtype=int)                                             
Out[413]: 
array([ 9, 10,  5,  8,  9, 10,  5, 12, 15,  7,  3,  5,  9, 10,  5,  8,  9,
       10,  5, 12, 15,  7,  3,  5,  9, 10,  5,  8,  9, 10,  5, 12, 15,  7,
        3,  5])

but splitting that into per-line arrays still requires some sort of line count followed by an array split, so I doubt it would save any time.
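To illustrate that splitting step (one possible way to do it, assuming we keep the original line strings around to count items): record each line's item count, then cut the flat array at the cumulative offsets with np.split.

```python
import numpy as np

lines = ["9 10 5", "8 9 10 5 12 15", "7 3 5"]

# One flat 1-d array of all the numbers, as in the answer above.
flat = np.fromstring(' '.join(lines), sep=' ', dtype=int)

# Items per line, then split at the running totals (drop the last,
# since np.split wants interior cut points only).
counts = [len(l.split()) for l in lines]
arrays = np.split(flat, np.cumsum(counts)[:-1])

print(arrays)
```

Note that counting the items still touches every line in Python, which is why this is unlikely to beat the per-line loop by much.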
