
I want to use compression in big data processing, but there are two CompressionCodec classes: one in Spark and one in Hadoop.

Anyone know the difference?


2 Answers


From a high-level perspective, both codecs provide the same functionality: InputStream -> transformation -> OutputStream
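For illustration, here is a minimal sketch of that shared shape, with both APIs wrapping plain java.io streams side by side (GzipCodec and LZ4CompressionCodec are just example choices; note that Spark's concrete codec classes are DeveloperApi, not a stable public contract):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.SparkConf
    import org.apache.spark.io.LZ4CompressionCodec

    object CodecShape {
      def main(args: Array[String]): Unit = {
        val payload = "hello codecs".getBytes("UTF-8")

        // Hadoop side: org.apache.hadoop.io.compress.CompressionCodec wraps plain streams.
        val hadoopCodec = new GzipCodec
        hadoopCodec.setConf(new Configuration()) // GzipCodec is Configurable
        val hBytes = new ByteArrayOutputStream()
        val hOut = hadoopCodec.createOutputStream(hBytes)
        hOut.write(payload); hOut.close()
        val hIn = hadoopCodec.createInputStream(new ByteArrayInputStream(hBytes.toByteArray))

        // Spark side: org.apache.spark.io.CompressionCodec has the same shape.
        val sparkCodec = new LZ4CompressionCodec(new SparkConf())
        val sBytes = new ByteArrayOutputStream()
        val sOut = sparkCodec.compressedOutputStream(sBytes)
        sOut.write(payload); sOut.close()
        val sIn = sparkCodec.compressedInputStream(new ByteArrayInputStream(sBytes.toByteArray))

        // Both round-trip back to the original bytes.
        println(new String(hIn.readAllBytes(), "UTF-8"))
        println(new String(sIn.readAllBytes(), "UTF-8"))
      }
    }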

The same concept exists in two places because you can use Spark without Hadoop and Hadoop without Spark.

Just use whichever codec is already available in your dependencies. If both are available, I suggest the Spark one, as it is a bit more modern.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/CompressionCodec.java
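If you go the Spark route, note that the codec is selected by name through configuration rather than picked up from the classpath. A minimal sketch (spark.io.compression.codec is the standard key; "zstd" is just an example value, the default is lz4):

    import org.apache.spark.sql.SparkSession

    // Choose Spark's internal codec (used for shuffle, spills, broadcasts) by name.
    // Supported short names include lz4 (default), lzf, snappy, zstd.
    val spark = SparkSession.builder()
      .appName("codec-choice")
      .master("local[*]")
      .config("spark.io.compression.codec", "zstd")
      .getOrCreate()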

EDIT

If you are running Spark on Hadoop, the Spark codec uses the Hadoop codec under the hood anyway:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/HadoopCodecStreams.scala
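The Hadoop-side mechanism behind that delegation is visible directly: Hadoop's CompressionCodecFactory resolves a codec from a file extension, which is what file-reading paths typically lean on. A minimal sketch (the file path is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.compress.CompressionCodecFactory

    // Resolve a Hadoop codec from a file extension; returns null if none matches.
    val factory = new CompressionCodecFactory(new Configuration())
    val codec = factory.getCodec(new Path("logs/events.gz")) // -> GzipCodec
    println(if (codec == null) "no codec" else codec.getClass.getName)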




Hadoop CompressionCodec compresses data at rest.
Spark CompressionCodec compresses data in motion.

File compression is NOT shuffle compression: they are independent of each other and configured separately.

Why?

They solve different problems:

The Hadoop codec compresses data at rest and reduces storage cost.

The Spark codec compresses data in motion and reduces network pressure/cost.

The decompressed data from HDFS is passed to the Spark execution engine, and the Spark compression codec is then used if a shuffle or spill happens.
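As a sketch of where the Spark codec kicks in, these are the standard shuffle/spill knobs (the values shown are the defaults):

    import org.apache.spark.SparkConf

    // Shuffle and spill compression both go through the Spark codec.
    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")        // compress shuffle output (default: true)
      .set("spark.shuffle.spill.compress", "true")  // compress spilled data (default: true)
      .set("spark.io.compression.codec", "lz4")     // codec used for both (default: lz4)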

Sample execution flow:

  1. Read a compressed Parquet file from HDFS using the Hadoop codec

  2. Transform the data in memory

  3. Shuffle data between executors using the Spark codec

  4. Write output files using the Hadoop codec
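A minimal sketch of that flow (the paths and column name are hypothetical; the writer's compression option selects the at-rest codec, while the groupBy shuffle goes through the Spark codec):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("flow").getOrCreate()

    // 1. Read: Parquet pages are decompressed on the Hadoop/Parquet side.
    val events = spark.read.parquet("hdfs:///data/events")

    // 2-3. Transform in memory; groupBy shuffles between executors via the Spark codec.
    val perUser = events.groupBy("userId").count()

    // 4. Write: the compression option selects the at-rest codec again.
    perUser.write.option("compression", "snappy").parquet("hdfs:///data/per_user")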


Hope this helps!

