
I'm storing files on HDFS in Snappy compression format. I'd like to be able to examine these files on my local Linux file system to make sure that the Hadoop process that created them has performed correctly.

When I copy them locally and attempt to decompress them with the standard Google library, it tells me that the file is missing the Snappy identifier. When I try to work around this by inserting a Snappy identifier, it messes up the checksum.

What can I do to read these files without having to write a separate Hadoop program or pass it through something like Hive?

4 Answers

26

I finally found out that I can use the following command to read the contents of a Snappy compressed file on HDFS:

hadoop fs -text /path/filename

With the newer command syntax (e.g. on Cloudera or HDP):

hdfs dfs -text /path/filename

If the intent is to download the file in text format for additional examination and processing, the output of that command can be piped to a file on the local system. You can also use head to just view the first few lines of the file.
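
For example (the paths here are only placeholders):

hadoop fs -text /path/filename | head
hadoop fs -text /path/filename > /tmp/filename.txt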


2 Comments

How can I do this programmatically in Scala or Java?
To write to a Snappy file programmatically, you need to import the Snappy codec class and get an instance of it as part of the mapper or reducer setup. You pass your output stream through the codec's createOutputStream method to get the encoded output stream. Here's a snippet (reading is the same in reverse):

codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
fileOut = fs.create(targetPath, false);
thiswriter = new LineRecordWriter<EtlKey, EtlValue>(
        new DataOutputStream(codec.createOutputStream(fileOut)));
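
A minimal read-side sketch along the same lines (the class name SnappyCat and the standalone-program framing are illustrative, not from the original comment; the codec is picked by file extension, so the file needs to keep its .snappy suffix and the Hadoop native Snappy libraries must be available):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SnappyCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);            // e.g. /path/filename.snappy
        FileSystem fs = path.getFileSystem(conf);
        // Resolve the codec from the file extension (.snappy -> SnappyCodec)
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(path))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
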
3

Please take a look at this post on the Cloudera blog. It explains how to use Snappy with Hadoop. Essentially, raw text files compressed with Snappy are not splittable, so a single file cannot be processed in parallel across multiple hosts.

The solution is to use Snappy in a container format, so essentially you're using a Hadoop SequenceFile with compression set to Snappy. As described in this answer, you can set the property mapred.output.compression.codec to org.apache.hadoop.io.compress.SnappyCodec and set up your job output format as SequenceFileOutputFormat.
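
A rough sketch of that driver-side setup, using the old mapred API so it matches the property name above (the surrounding JobConf wiring is an assumption, not taken from the linked answer):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

JobConf jobConf = new JobConf();
jobConf.setOutputFormat(SequenceFileOutputFormat.class);
jobConf.setBoolean("mapred.output.compress", true);
jobConf.set("mapred.output.compression.type", "BLOCK");
jobConf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");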

And then to read it, you should only need to use SequenceFile.Reader because the codec information is stored in the file header.
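
As a minimal reading sketch (the path argument and the assumption that keys and values are Writables are mine, not from the answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // e.g. /path/filename
        // The reader picks up the codec (Snappy here) from the SequenceFile header.
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}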

1 Comment

Thanks, Charles, but I don't think that this addresses my question. Let me simplify it. I use hadoop fs -get filename to move a file from HDFS to my local directory on Linux. Now that I have it here, why can't I use the snappy java libraries to decompress it?
0

That's because the Snappy format used by Hadoop carries extra framing metadata that is not understood by libraries like https://code.google.com/p/snappy/. You need to use Hadoop's native Snappy codec to decompress the data file that you downloaded.
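
One way to stay within the hadoop fs interface (assuming the downloaded copy keeps its .snappy extension so the codec can be detected) is to point the same -text command at the local path via a file:// URI:

hadoop fs -text file:///local/path/filename.snappy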

1 Comment

Could you be a little more specific? I'm looking to do this from the hadoop fs interface if possible.
0

If you land here and are trying to decompress a .snappy file via a local command line (like I was), try this tool:

https://github.com/kubo/snzip#hadoop-snappy-format
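
A usage sketch, with the flags quoted from memory of the snzip README (treat them as an assumption and check the tool's help output or the README for your version; the file name is a placeholder):

snzip -d -t hadoop-snappy filename.snappy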

