Different file formats in Hadoop and Spark: ORC, Avro, and Parquet
ORC (Optimized Row Columnar) is a columnar file format optimized for reading and writing large datasets. It was developed within the Apache Hive project (and is now a top-level Apache project) and is widely used in the Hadoop ecosystem. ORC provides features such as predicate pushdown (backed by built-in file- and stripe-level statistics), column projection, and row-level updates under Hive ACID, and it supports a range of compression codecs such as Snappy, Zlib, and others. ORC compresses well, so it typically produces the smallest files of the three formats.
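As a rough illustration, here is a minimal PySpark sketch (the paths, column names, and sample data are hypothetical) that writes a DataFrame as ORC with Snappy compression, then reads it back with a column projection and a filter, which Spark can push down to the ORC reader:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

# Hypothetical sample data; any DataFrame works the same way.
df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as ORC with Snappy compression.
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/events_orc")

# Read back: selecting columns and filtering lets Spark apply
# column projection and predicate pushdown against the ORC files.
orc_df = spark.read.orc("/tmp/events_orc")
orc_df.select("id", "amount").filter(orc_df.amount > 20.0).show()
```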
Parquet is another columnar file format optimized for reading and writing large datasets. It was developed under the Apache Parquet project and is likewise widely used in the Hadoop ecosystem, particularly with Spark. Parquet files are optimized for analytical processing and provide features such as column projection, efficient encodings (dictionary, run-length), and a range of compression codecs such as Snappy, Gzip, and others. Parquet's on-disk representation carries somewhat more metadata than ORC's, so for the same data Parquet files are often slightly larger, though the difference depends heavily on the data and the codec chosen.
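A comparable sketch for Parquet (again with hypothetical paths and data), writing with Gzip compression and reading back only a projected subset of columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as Parquet with Gzip compression ("snappy" is Spark's default).
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("/tmp/events_parquet")

# Column projection: only the requested columns are read from disk.
spark.read.parquet("/tmp/events_parquet").select("name", "amount").show()
```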
Avro is a row-oriented data serialization system that is widely used in the big data ecosystem. Avro files are optimized for data exchange and write-heavy workloads, and their main strength is schema evolution: the schema travels with the data, and readers can resolve older and newer schemas against each other. Avro supports compression codecs such as Snappy, Deflate, and others, but because it stores whole rows rather than columns, it compresses less effectively and generally produces larger files than ORC.
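Spark needs the external spark-avro module on the classpath for this format; below is a minimal sketch assuming that package is available (the package version shown is illustrative):

```python
# Launch with the spark-avro package, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 avro_example.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as Avro with Deflate compression (row-oriented layout).
df.write.mode("overwrite") \
    .format("avro") \
    .option("compression", "deflate") \
    .save("/tmp/events_avro")

# Reading Avro deserializes whole records (rows), so even a projected
# read still scans full rows from disk.
spark.read.format("avro").load("/tmp/events_avro").select("id", "name").show()
```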
Feature | ORC (Optimized Row Columnar) | Parquet | Avro |
---|---|---|---|
Provider | Apache Hive (now Apache ORC) | Apache Parquet | Apache Avro |
Storage layout | Columnar | Columnar | Row-oriented |
Compression | Snappy, Zlib, and others | Snappy, Gzip, and others | Snappy, Deflate, and others |
Schema evolution | Yes | Yes | Yes (primary strength) |
Column projection | Yes | Yes | No (reads whole rows) |
Encryption support | Yes (column encryption) | Yes (modular encryption, newer versions) | No |
Row-level updates (Hive ACID) | Yes | No | No |
File size | Typically smallest | Often slightly larger than ORC | Largest of the three |
Query performance (analytical scans) | Typically fastest | Fast, sometimes slightly behind ORC | Slowest (row-oriented) |
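To check the file-size row against your own data, one option is to write the same DataFrame in each format (as in the sketches above) and sum the bytes on disk. A hedged sketch, assuming local paths and noting that results vary widely with the data and codecs used:

```python
import os

def dir_size(path):
    """Total bytes of all files under path (local filesystem only)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Paths match the earlier write examples; run those first.
for fmt, path in [("orc", "/tmp/events_orc"),
                  ("parquet", "/tmp/events_parquet"),
                  ("avro", "/tmp/events_avro")]:
    print(fmt, dir_size(path), "bytes")
```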