Different file formats in Hadoop and Spark: ORC, Parquet, and Avro

ORC (Optimized Row Columnar) is a columnar file format optimized for reading and writing large datasets. It was developed within the Apache Hive project and is widely used in the Hadoop ecosystem. ORC files provide features such as predicate pushdown, column projection, and row-level access (via Hive ACID transactions), and they support a range of compression codecs, including Zlib (the default) and Snappy. ORC's lightweight encodings and aggressive compression typically produce the smallest files of the three formats.
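
As a quick illustration, here is a minimal PySpark sketch (the data and paths are made up for the example) that writes an ORC file with Snappy compression and reads it back; selecting a subset of columns exercises column projection, and the filter can be pushed down to the ORC reader:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

# A small, made-up DataFrame for illustration.
df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 29), (3, "carol", 41)],
    ["id", "name", "age"],
)

# Write as ORC with Snappy compression (Zlib is ORC's default).
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/users_orc")

# Column projection: only 'name' and 'age' are read, and the age filter
# can be pushed down into the ORC reader as a predicate.
spark.read.orc("/tmp/users_orc").select("name", "age").filter("age > 30").show()
```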

Parquet is another columnar file format optimized for reading and writing large datasets, and it is the default DataFrame format in Spark. It originated at Twitter and Cloudera and is now developed as the Apache Parquet project, widely used across the Hadoop ecosystem. Parquet files are optimized for analytical processing and provide features such as column projection, rich encodings (dictionary, run-length), and compression, with codecs such as Snappy (Spark's default) and Gzip. Its metadata is somewhat heavier than ORC's, so Parquet files are often somewhat larger than equivalent ORC files, though the difference depends on the data and codec.
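
A corresponding Parquet sketch (again with hypothetical data and paths); the API mirrors the ORC one, with Gzip chosen here in place of Spark's default Snappy codec:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 29), (3, "carol", 41)],
    ["id", "name", "age"],
)

# Parquet is Spark's default DataFrame format; override the codec to Gzip.
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/users_parquet")

# Thanks to the columnar layout, only the 'age' column is read from disk.
spark.read.parquet("/tmp/users_parquet").select("age").show()
```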

Avro is a row-based data serialization system that is widely used in the big data ecosystem. Because each file carries its full schema alongside the data, Avro is optimized for data exchange and excels at schema evolution; it supports compression codecs such as Snappy, Deflate, and others. As a row-oriented format it cannot skip unneeded columns, so Avro files are typically larger than ORC files for the same data.
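
In Spark, Avro support comes from the external spark-avro module rather than the core distribution, so it is addressed through format("avro"). A minimal sketch, assuming Spark 3.5 on Scala 2.12 (adjust the package coordinates to your build):

```python
from pyspark.sql import SparkSession

# spark-avro is an external package; match the version to your Spark/Scala build.
spark = (
    SparkSession.builder.appName("avro-example")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 29)],
    ["id", "name", "age"],
)

# Write Avro with Deflate compression (Snappy is the default codec).
(df.write.mode("overwrite").format("avro")
    .option("compression", "deflate").save("/tmp/users_avro"))

spark.read.format("avro").load("/tmp/users_avro").show()
```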

| Feature | ORC (Optimized Row Columnar) | Parquet | Avro |
| --- | --- | --- | --- |
| Provider | Apache Hive (now Apache ORC) | Apache Parquet | Apache Avro |
| Compression | Snappy, Zlib, and others | Snappy, Gzip, and others | Snappy, Deflate, and others |
| Schema evolution | Yes | Yes | Yes |
| Column projection | Yes | Yes | No (row-based) |
| Encryption support | Yes (column encryption) | Yes (modular encryption, in newer versions) | No |
| Row-level access | Yes (via Hive ACID) | No | No |
| File size | Typically smallest | Larger than ORC | Larger than ORC |
| Query performance | Generally fastest, especially in Hive | Slower than ORC in Hive-centric workloads | Slowest for analytical scans (row-based) |
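
The size and performance rows above depend heavily on the data, so it is worth measuring on your own datasets. A rough sketch for comparing on-disk sizes locally (the column names, row count, and paths are arbitrary; on HDFS you would use `hdfs dfs -du -s -h` instead of `du`):

```python
import subprocess
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("format-size-comparison")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
    .getOrCreate()
)

# Synthetic data; real results depend heavily on your schema and values.
df = spark.range(1_000_000).selectExpr(
    "id", "id % 100 AS bucket", "uuid() AS payload"
)

for fmt in ["orc", "parquet", "avro"]:
    path = f"/tmp/size_test_{fmt}"
    df.write.mode("overwrite").format(fmt).save(path)
    size = subprocess.run(["du", "-sh", path], capture_output=True, text=True).stdout
    print(fmt, size.strip())
```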