
I want to use compression in big data processing, but there are two CompressionCodec classes: one in Spark and one in Hadoop.

Anyone know the difference?


2 Answers


From a high-level perspective, both codecs provide the same functionality: InputStream -> transformation -> OutputStream
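For illustration, here is a minimal sketch of that shared shape, with both APIs wrapping plain java.io streams side by side (GzipCodec and LZ4CompressionCodec are just example choices; note that Spark's concrete codec classes are DeveloperApi, not a stable public contract):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.SparkConf
    import org.apache.spark.io.LZ4CompressionCodec

    object CodecShape {
      def main(args: Array[String]): Unit = {
        val payload = "hello codecs".getBytes("UTF-8")

        // Hadoop side: org.apache.hadoop.io.compress.CompressionCodec wraps plain streams.
        val hadoopCodec = new GzipCodec
        hadoopCodec.setConf(new Configuration()) // GzipCodec is Configurable
        val hBytes = new ByteArrayOutputStream()
        val hOut = hadoopCodec.createOutputStream(hBytes)
        hOut.write(payload); hOut.close()
        val hIn = hadoopCodec.createInputStream(new ByteArrayInputStream(hBytes.toByteArray))

        // Spark side: org.apache.spark.io.CompressionCodec has the same shape.
        val sparkCodec = new LZ4CompressionCodec(new SparkConf())
        val sBytes = new ByteArrayOutputStream()
        val sOut = sparkCodec.compressedOutputStream(sBytes)
        sOut.write(payload); sOut.close()
        val sIn = sparkCodec.compressedInputStream(new ByteArrayInputStream(sBytes.toByteArray))

        // Both round-trip back to the original bytes.
        println(new String(hIn.readAllBytes(), "UTF-8"))
        println(new String(sIn.readAllBytes(), "UTF-8"))
      }
    }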

The same concept exists in two places because you can use Spark without Hadoop and Hadoop without Spark.

Just use whichever codec is already available in your dependencies. If both are available, I suggest the Spark one, as it is a bit more modern.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/CompressionCodec.java
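If you go the Spark route, note that the codec is selected by name through configuration rather than picked up from the classpath. A minimal sketch (spark.io.compression.codec is the standard key; "zstd" is just an example value, the default is lz4):

    import org.apache.spark.sql.SparkSession

    // Choose Spark's internal codec (used for shuffle, spills, broadcasts) by name.
    // Supported short names include lz4 (default), lzf, snappy, zstd.
    val spark = SparkSession.builder()
      .appName("codec-choice")
      .master("local[*]")
      .config("spark.io.compression.codec", "zstd")
      .getOrCreate()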

EDIT

If you are running Spark on Hadoop, the Spark codec uses the Hadoop codec under the hood anyway:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/HadoopCodecStreams.scala
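The Hadoop-side mechanism behind that delegation is visible directly: Hadoop's CompressionCodecFactory resolves a codec from a file extension, which is what file-reading paths typically lean on. A minimal sketch (the file path is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.compress.CompressionCodecFactory

    // Resolve a Hadoop codec from a file extension; returns null if none matches.
    val factory = new CompressionCodecFactory(new Configuration())
    val codec = factory.getCodec(new Path("logs/events.gz")) // -> GzipCodec
    println(if (codec == null) "no codec" else codec.getClass.getName)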




Hadoop CompressionCodec compresses data at rest.
Spark CompressionCodec compresses data in motion.

File compression is NOT shuffle compression: they are independent of each other and configured separately.

Why?

They solve different problems:

The Hadoop codec compresses data at rest and reduces storage cost.

The Spark codec compresses data in motion and reduces network pressure/cost.

The decompressed data from HDFS is passed to the Spark execution engine, and the Spark compression codec is then used if a shuffle or spill happens.
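As a sketch of where the Spark codec kicks in, these are the standard shuffle/spill knobs (the values shown are the defaults):

    import org.apache.spark.SparkConf

    // Shuffle and spill compression both go through the Spark codec.
    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")        // compress shuffle output (default: true)
      .set("spark.shuffle.spill.compress", "true")  // compress spilled data (default: true)
      .set("spark.io.compression.codec", "lz4")     // codec used for both (default: lz4)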

Sample execution flow:

  1. Read a compressed Parquet file from HDFS using the Hadoop codec

  2. Transform the data in memory

  3. Shuffle data between executors using the Spark codec

  4. Write output files using the Hadoop codec
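A minimal sketch of that flow (the paths and column name are hypothetical; the writer's compression option selects the at-rest codec, while the groupBy shuffle goes through the Spark codec):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("flow").getOrCreate()

    // 1. Read: Parquet pages are decompressed on the Hadoop/Parquet side.
    val events = spark.read.parquet("hdfs:///data/events")

    // 2-3. Transform in memory; groupBy shuffles between executors via the Spark codec.
    val perUser = events.groupBy("userId").count()

    // 4. Write: the compression option selects the at-rest codec again.
    perUser.write.option("compression", "snappy").parquet("hdfs:///data/per_user")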


Hope this helps!

