I want to use compression in big data processing, but there are two compression codecs.
Does anyone know the difference?
From a high-level perspective, both codecs provide the same functionality: InputStream -> transformation -> OutputStream.
The same concept is available in two places, because you can use Spark without Hadoop and Hadoop without Spark.
Just use whichever codec is already available in your dependencies. If both are available, I suggest the Spark one, as it is a bit more modern.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/CompressionCodec.java
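To make the shared InputStream -> transformation -> OutputStream shape concrete, here is a minimal sketch, not production code, assuming Hadoop's GzipCodec and Spark's LZ4CompressionCodec are on the classpath; the file names are just placeholders:

```scala
import java.io.FileOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.util.ReflectionUtils
import org.apache.spark.SparkConf
import org.apache.spark.io.LZ4CompressionCodec

// Hadoop: CompressionCodec.createOutputStream wraps the target stream
val hadoopCodec = ReflectionUtils.newInstance(classOf[GzipCodec], new Configuration())
val hadoopOut = hadoopCodec.createOutputStream(new FileOutputStream("part-00000.gz"))
hadoopOut.write("some data".getBytes("UTF-8"))
hadoopOut.close()

// Spark: CompressionCodec.compressedOutputStream does the same stream wrapping
val sparkCodec = new LZ4CompressionCodec(new SparkConf())
val sparkOut = sparkCodec.compressedOutputStream(new FileOutputStream("shuffle-block.lz4"))
sparkOut.write("some data".getBytes("UTF-8"))
sparkOut.close()
```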
EDIT
If you are running Spark on Hadoop, Spark uses the Hadoop codec under the hood anyway when reading and writing files.
Hadoop CompressionCodec compresses data at rest.
Spark CompressionCodec compresses data in motion.
Why?
They solve different problems:
The Hadoop codec compresses data at rest and reduces storage cost.
The Spark codec compresses data in motion and reduces network pressure/cost.
The data decompressed from HDFS is passed to the Spark execution engine, and the Spark compression codec is then used if a shuffle or spill happens.
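As a rough illustration of where each knob lives, the properties below are standard Spark settings and the values are just examples:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codec-config")
  // Data in motion: codec used for shuffle blocks and spilled data
  .config("spark.io.compression.codec", "lz4")
  .config("spark.shuffle.compress", "true")
  .config("spark.shuffle.spill.compress", "true")
  // Data at rest: default codec for Parquet files written by Spark SQL
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()
```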
Sample execution flow:
Read a compressed Parquet file from HDFS using the Hadoop codec
Transform the data in memory
Shuffle data between executors using the Spark codec
Write output files using the Hadoop codec
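A sketch of that flow with the DataFrame API; the paths and column names are hypothetical, and the shuffle codec comes from spark.io.compression.codec:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("codec-flow").getOrCreate()

// 1. Read compressed Parquet from HDFS: decompression uses the file-level (Hadoop-side) codec
val events = spark.read.parquet("hdfs:///data/events")        // hypothetical path

// 2. Transform in memory
val withDay = events.withColumn("day", to_date(col("ts")))    // hypothetical column

// 3. Shuffle between executors: shuffle blocks are compressed with the Spark codec
val daily = withDay.groupBy("day").agg(count("*").as("events"))

// 4. Write output files: the at-rest codec is picked via the writer option
daily.write.option("compression", "snappy").parquet("hdfs:///data/daily_events")
```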
Hope this helps!