
I am trying to create 3 tensors for my language translation LSTM network.

import numpy as np

Num_samples = 50000
Time_step = 100
Vocabulary = 5000

# one-hot encoded sequences: (samples, time steps, vocabulary size)
shape = (Num_samples, Time_step, Vocabulary)
encoder_input_data = np.zeros(shape, dtype='float32')
decoder_input_data = np.zeros(shape, dtype='float32')
decoder_target_data = np.zeros(shape, dtype='float32')
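
For reference, a quick back-of-the-envelope estimate of what these dense arrays would occupy (float32 is 4 bytes per element):

bytes_per_array = Num_samples * Time_step * Vocabulary * 4   # float32 = 4 bytes per element
print(bytes_per_array / 1e9)   # ~100 GB per array, ~300 GB for all three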

Obviously, my machine doesn't have enough memory to do so. Since the data is represented as one-hot vectors, it seems that csc_matrix() from scipy.sparse would be the solution, as suggested in this thread and this thread.

But after trying csc_matrix() and csr_matrix(), it seems they only support 2D arrays.
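
The best I could come up with using only 2D sparse matrices is to flatten the sample and time-step axes into one, roughly like this (an untested sketch; token_ids is an assumed array holding the vocabulary index of each one-hot position):

from scipy.sparse import csr_matrix
import numpy as np

Num_samples, Time_step, Vocabulary = 50000, 100, 5000
# token_ids[i, t] = vocabulary index of sample i at time step t (placeholder data)
token_ids = np.random.randint(0, Vocabulary, size=(Num_samples, Time_step))

rows = np.arange(Num_samples * Time_step)        # one non-zero per (sample, time step) pair
cols = token_ids.ravel()                         # which vocabulary slot is "hot"
data = np.ones(Num_samples * Time_step, dtype='float32')
encoder_input_2d = csr_matrix((data, (rows, cols)),
                              shape=(Num_samples * Time_step, Vocabulary))
# a small batch can be densified back to 3D when needed:
# encoder_input_2d[:64 * Time_step].toarray().reshape(64, Time_step, Vocabulary)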

Old threads from 6 years ago did discuss this issue, but they are not machine learning oriented.

My question is: is there any Python library/tool that can help me create sparse 3D arrays to store one-hot vectors for machine learning later?
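
A minimal sketch of the kind of interface I am hoping for, written here against the pydata sparse package (an n-dimensional COO implementation; I have not verified that it fits this use case or integrates with training code):

import numpy as np
import sparse   # the pydata "sparse" package (pip install sparse)

shape = (50000, 100, 5000)
# coords has one column per non-zero entry; the three rows are the three axes
coords = np.array([[0,  0, 42],    # sample index
                   [0,  1,  7],    # time step
                   [1,  2,  3]])   # vocabulary index of the "hot" position
data = np.ones(coords.shape[1], dtype='float32')
encoder_input_sparse = sparse.COO(coords, data, shape=shape)
one_sample = encoder_input_sparse[0].todense()   # densify a single (100, 5000) sample when needed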

  • scipy.sparse matrices are used in machine learning, for example in the sklearn package. But no, they have not been extended to 3D. Commented May 21, 2018 at 4:00
  • If that is the case, is there a way to train my network with the scipy.sparse matrices? All of my machine learning experience so far starts by creating the 3D array up front, containing the batch, time_step, and the length of the one-hot vector. I am open to new ways of handling the training data (see the sketch after these comments). Commented May 21, 2018 at 4:24
  • I suddenly have another thought (not directly related to this problem): how powerful is the machine Google uses to train their translation deep network? Obviously, 50000 training examples is not a huge data size at all, yet it already takes 23Gb of memory for each array. I would expect Google to use an even larger training set, say 5 million examples. By the same calculation, they would need 2000Gb of RAM. Does such a machine even exist? Commented May 21, 2018 at 4:41
  • @RavenCheuk the sorts of models trained by giant companies are usually trained on large, distributed computing clusters, not a single machine. And probably significantly more training data than 5 million examples Commented May 21, 2018 at 5:27
  • But yes, you can nowadays buy a machine with 2000GB of ram anyway. Commented May 21, 2018 at 5:27
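
To make the comment about training from sparse storage concrete, this is the kind of batch-wise densification I have in mind (a rough sketch, assuming the flattened 2D csr_matrix layout from above and a Keras-style encoder-decoder model; names like sparse_batch_generator and model are illustrative):

def sparse_batch_generator(enc_2d, dec_in_2d, dec_tgt_2d, time_step, vocab, batch_size=64):
    # enc_2d / dec_in_2d / dec_tgt_2d are csr_matrix objects of shape (samples * time_step, vocab)
    n_samples = enc_2d.shape[0] // time_step
    while True:
        for start in range(0, n_samples, batch_size):
            stop = min(start + batch_size, n_samples)
            rows = slice(start * time_step, stop * time_step)
            # densify only one small batch at a time
            enc = enc_2d[rows].toarray().reshape(-1, time_step, vocab)
            dec_in = dec_in_2d[rows].toarray().reshape(-1, time_step, vocab)
            dec_tgt = dec_tgt_2d[rows].toarray().reshape(-1, time_step, vocab)
            yield [enc, dec_in], dec_tgt

# model.fit_generator(sparse_batch_generator(encoder_input_2d, decoder_input_2d,
#                                            decoder_target_2d, 100, 5000),
#                     steps_per_epoch=50000 // 64, epochs=10)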
