
Scenario:

I have imported data from SQL Server to HDFS. The data is stored in the HDFS directory as multiple files:

part-m-00000
part-m-00001
part-m-00002
part-m-00003

Question:

When reading this stored data from the HDFS directory, do we have to read all the files (part-m-00000, 00001, 00002, 00003), or just part-m-00000? I ask because when I read the data, some of it appeared to be missing. Is that expected, or did I miss something?

3 Answers


You need to read all the files, not just part-m-00000. There are multiple files because Sqoop runs the import as a map-only MapReduce job, splitting the work across several mappers, and each mapper writes its output to a separate part file.
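For example, you can read every part file in one go by globbing on part-m-*, or merge the whole directory into a single local file. This is a minimal sketch; /user/hadoop/mytable is a hypothetical placeholder for wherever your Sqoop import landed.

$ hdfs dfs -cat /user/hadoop/mytable/part-m-*

$ hdfs dfs -getmerge /user/hadoop/mytable ./mytable.txt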


Sqoop runs the import with no reducers. As a result, there is no consolidation of the part files produced by the mappers, so you will see as many part files as the number of mappers you set in the Sqoop command with -m 4 or --num-mappers 4. If you run the import with -m 1, it will create only one part file:

$ sqoop import --connect jdbc:mysql://localhost/db --username <username> --table <table-name> -m 1
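To confirm how many part files an import produced, you can simply list the target directory; /user/hadoop/mytable below is again a hypothetical stand-in for your import path. With -m 4 you would expect part-m-00000 through part-m-00003, typically alongside a _SUCCESS marker.

$ hdfs dfs -ls /user/hadoop/mytable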


If your result set is large, the output is stored in chunks (multiple part files). If you want to read all of those files from the CLI, execute the command below, where <import-dir> stands for your HDFS import directory.

$ hdfs dfs -cat <import-dir>/part-m-*

This will give you the complete result, with no parts missing.
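As a sanity check against the "missing data" concern, you can compare the total row count across all part files with the row count in the source table; this is a sketch, with <import-dir> and <table-name> as placeholders for your actual HDFS path and SQL Server table.

$ hdfs dfs -cat <import-dir>/part-m-* | wc -l

If that count matches SELECT COUNT(*) FROM <table-name> on SQL Server, nothing was lost; a mismatch usually means some part files were skipped when reading.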
