How to use Snappy in Hadoop in Container format

Question

I need to use Snappy to compress both the map output and the final MapReduce output. The output should also be splittable.

From what I have read online, making Snappy produce splittable output requires using it inside a container-like format.

Can you please suggest how to go about it? I tried finding examples online but could not find one. I am using Hadoop v0.20.203.

Thanks. Piyush

root1982 · Accepted Answer · 2012-04-25 05:10:20Z

5

For output

conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

For map output

Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");


4 Comments

Piyush Kansal Over a year ago

Thanks. However, I am not using the Sequence file format, but BufferedWriter. Can you suggest how to do it in that case?

root1982 Over a year ago

"One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce." (cloudera.com/blog/2011/09/snappy-and-hadoop)
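To illustrate why a block-oriented container makes compressed data splittable, here is a minimal toy sketch. It is not the SequenceFile or Avro format, and it uses `java.util.zip.Deflater` as a stand-in for Snappy so that it runs with the JDK alone. The class name `BlockContainerSketch` and its methods are hypothetical. The idea it shows is that each block is compressed independently, so a reader can decode any one block without touching the others.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy illustration only: Deflater stands in for Snappy, and a List of byte
// arrays stands in for a real container file with sync markers.
class BlockContainerSketch {

    // Compress one block on its own, so it carries no state from earlier blocks.
    static byte[] compressBlock(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a single block without needing any other block.
    static byte[] decompressBlock(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Group records into batches and compress each batch independently.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(i, end));
            blocks.add(compressBlock(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Read back the records of one block; records must not contain newlines.
    static List<String> readBlock(byte[] block) {
        String text = new String(decompressBlock(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        List<byte[]> blocks = writeBlocks(records, 2);
        // Decode only the second block, as a worker assigned that split would.
        System.out.println(readBlock(blocks.get(1)));
    }
}
```

A single compressed stream over the whole text offers no such independent entry points, which is the limitation the quoted passage describes for Snappy on plain text.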

Piyush Kansal Over a year ago

The data we are going to compress using Snappy will not be passed on to any MapReduce job; it will just stay on disk. We only want to use it for compression and measure the difference between Gzip and Snappy in terms of compression ratio and execution time. So it is okay with me even if it is not splittable.

root1982 Over a year ago

I think it should be OK then.
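For the Gzip versus Snappy comparison discussed in these comments, a small measurement harness along the following lines may help. This is a sketch under assumptions: `CompressionBenchmarkSketch`, `CodecFactory`, and `measure` are hypothetical names, and only Gzip (from the JDK) is actually wired in. A Snappy-backed `OutputStream` from whichever Snappy library is available would have to be supplied through the same interface; that part is not shown here.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

class CompressionBenchmarkSketch {

    // Wraps a raw sink in a compressing stream. A Snappy stream could be
    // plugged in here (assumption: not part of the JDK, so not shown).
    interface CodecFactory {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    // Outcome of one run: sizes in bytes and wall-clock time in nanoseconds.
    static final class Result {
        final long rawBytes;
        final long compressedBytes;
        final long elapsedNanos;

        Result(long rawBytes, long compressedBytes, long elapsedNanos) {
            this.rawBytes = rawBytes;
            this.compressedBytes = compressedBytes;
            this.elapsedNanos = elapsedNanos;
        }

        double ratio() {
            return (double) rawBytes / compressedBytes;
        }
    }

    // Writes the lines through a BufferedWriter on top of the chosen codec
    // and records how many bytes reached the sink and how long it took.
    static Result measure(List<String> lines, CodecFactory codec) {
        long rawBytes = 0;
        for (String line : lines) {
            rawBytes += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(codec.wrap(sink), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the writer also finishes the compressed stream.
        long elapsed = System.nanoTime() - start;
        return new Result(rawBytes, sink.size(), elapsed);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            lines.add("sample reducer output line " + (i % 10));
        }
        Result gzip = measure(lines, GZIPOutputStream::new);
        System.out.printf("gzip: ratio=%.2f time=%d ms%n",
                gzip.ratio(), gzip.elapsedNanos / 1_000_000);
    }
}
```

A single run like this is noisy, so repeating the measurement several times on representative data and discarding the first few runs would give more trustworthy timings.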

VeLKerr · Accepted Answer · 2015-03-03 12:18:34Z

1

In the new API, the output format is set on the Job rather than on the configuration, so the first part becomes:

Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);

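// new Job(conf) copies the configuration, so set the codec on the job's own copy
job.getConfiguration().set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");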

