I have datasets for frequent rule mining where each row has a different number of items, like
9 10 5
8 9 10 5 12 15
7 3 5
Is there a way to read a file with the above contents in one go and convert it into a numpy array of arrays, like
array([array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5])], dtype=object)
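For reference, that target structure is just a numpy object array holding one array per row; a minimal sketch using the toy rows above:

import numpy as np

rows = [[9, 10, 5], [8, 9, 10, 5, 12, 15], [7, 3, 5]]
target = np.empty(len(rows), dtype=object)  # one slot per ragged row
target[:] = [np.array(r) for r in rows]
print(target)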
I have come across the numpy.loadtxt function, but it does not handle a varying number of columns the way I want. With different numbers of columns, loadtxt requires specifying which columns to read, whereas I want to read all the values in each row.
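For illustration, a minimal sketch (assuming the three toy rows above are saved in a hypothetical file datasets/sample.dat) of where loadtxt gives up without usecols:

import numpy as np

try:
    np.loadtxt('datasets/sample.dat')  # ragged rows
except ValueError as e:
    # The exact wording depends on the numpy version, but it amounts to
    # an inconsistent-number-of-columns error.
    print(e)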
One way to achieve this could be to read the files manually and convert each line into a numpy array, but I don't want to take that route because the actual datasets will be a lot bigger than the tiny example shown here. For instance, I am planning to use datasets from the FIMI repository; one sample dataset is the accidents data.
Edit:
I used the following code to achieve what I want:
import numpy as np
from io import StringIO

data = []
# d = np.loadtxt('datasets/grocery.dat')
with open('datasets/accidents.dat', 'r') as f:
    for l in f.readlines():
        ar = np.genfromtxt(StringIO(l))  # parse one whitespace-separated line
        data.append(ar)

print(data)
data = np.array(data, dtype=object)  # dtype=object keeps the ragged rows
print(data)
But this is exactly what I want to avoid: looping in Python code, because it took more than four minutes just to read the data and convert it into numpy arrays.
Comment: Why use genfromtxt if you are just going to parse one line at a time? It will slow things down. Load everything as a list of lists, and forget numpy.
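Following that suggestion, a minimal sketch (assuming the same whitespace-separated integer format as accidents.dat) that parses each line with plain Python instead of a per-line genfromtxt call:

import numpy as np

with open('datasets/accidents.dat') as f:
    data = [[int(tok) for tok in line.split()] for line in f]
print(data[:3])

If the numpy object-array form is still needed afterwards, the np.empty(..., dtype=object) wrap shown earlier can be applied once at the end, which is far cheaper than running genfromtxt on every line.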