I have a problem importing a big XML file (1.3 GB) into MongoDB in order to find the most frequent words in a map & reduce manner.
http://dumps.wikimedia.org/plwiki/20141228/plwiki-20141228-pages-articles-multistream.xml.bz2
Here is an XML excerpt (the first 10,000 lines) cut out of this big file:
http://www.filedropper.com/text2
I know that I can't import XML directly into MongoDB. I tried some tools and some Python scripts to do so, and all of them failed.
Which tool or script should I use? What should the key & value be? I think the best document shape for finding the most frequent word would be this:
(_id : id, value: word)
Then I would sum all the elements as in the docs example:
http://docs.mongodb.org/manual/core/map-reduce/
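For illustration, a minimal pymongo sketch of that summing step, assuming pymongo 3.x (its map_reduce helper was removed in pymongo 4) and assuming the words already sit in a plwiki.words collection as { value: word } documents (both names are my placeholders):

```python
from bson.code import Code
from pymongo import MongoClient

client = MongoClient()       # assumes mongod on localhost:27017
words = client.plwiki.words  # placeholder database/collection names

# Emit 1 for every stored word, then sum the 1s per word,
# as in the linked map-reduce docs example.
map_f = Code("function () { emit(this.value, 1); }")
reduce_f = Code("function (key, values) { return Array.sum(values); }")

# Results land in a "word_counts" collection as {_id: word, value: count}.
counts = words.map_reduce(map_f, reduce_f, "word_counts")
for doc in counts.find().sort("value", -1).limit(10):
    print(doc["_id"], doc["value"])
```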
Any clues would be greatly appreciated. But how do I import this file into MongoDB so that I end up with a collection of documents like this?
(_id : id, value: word)
If you have any ideas, please share.
Edit: After research, I would use Python or JS to complete this task.
I would extract only the words in the <text></text> section, which sits under <page><revision>, exclude markup characters such as < and >, then split out the words and upload them to MongoDB with pymongo or JS.
So there are several pages, each with a revision and a text section.
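For illustration, a minimal streaming sketch of that plan in Python 3, assuming the dump has been decompressed and pymongo 3.x is installed; iterparse reads the file incrementally, so the 1.3 GB file is never loaded at once (the plwiki.words names and the 10,000-document batch size are my placeholders):

```python
import re
import xml.etree.ElementTree as ET
from pymongo import MongoClient

WORD_RE = re.compile(r"\w+")  # \w matches Unicode letters in Python 3

def iter_words(xml_path):
    """Stream the dump with iterparse and yield every word in <text>."""
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # grab the root element
    for event, elem in context:
        if event != "end":
            continue
        # The dump declares a MediaWiki namespace, so compare only
        # the local part of the tag name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "text" and elem.text:
            for word in WORD_RE.findall(elem.text.lower()):
                yield word
        elif tag == "page":
            root.clear()  # drop the finished <page> subtree to keep memory flat

client = MongoClient()       # assumes mongod on localhost:27017
coll = client.plwiki.words   # placeholder database/collection names
batch = []
for word in iter_words("plwiki-20141228-pages-articles-multistream.xml"):
    batch.append({"value": word})  # let MongoDB generate _id
    if len(batch) == 10000:
        coll.insert_many(batch)    # pymongo 3.x bulk insert
        batch = []
if batch:
    coll.insert_many(batch)
```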
Edit:
Use fileinput, because it loads only one line at a time instead of the whole file into memory; then you decide when to write out to another file (CSV or JSON). open will use all the memory. github.com/abdelouahabb/kouider-ezzadam/blob/master/…
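For illustration, a minimal sketch of the line-by-line approach that comment describes (the words.csv name and the naive whitespace split are my assumptions; a real extractor would also have to track whether the current line is inside <text>, which spans many lines):

```python
import csv
import fileinput

# fileinput yields one line at a time, so the whole 1.3 GB file
# is never held in memory.
with open("words.csv", "w", newline="") as out:  # hypothetical output file
    writer = csv.writer(out)
    for line in fileinput.input(["plwiki-20141228-pages-articles-multistream.xml"]):
        # Naive split on whitespace, one word per CSV row.
        for word in line.split():
            writer.writerow([word])
```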