Different file formats in Hadoop and Spark: ORC, Avro, and Parquet
ORC (Optimized Row Columnar) is a columnar file format optimized for reading and writing large datasets. It was developed within the Apache Hive project (and is now a top-level Apache project) and is widely used in the Hadoop ecosystem. ORC provides features such as predicate pushdown (backed by built-in file- and stripe-level statistics), column projection, and row-level updates under Hive ACID, and it supports a range of compression codecs such as Snappy, Zlib, and others. ORC compresses well, so it typically produces the smallest files of the three formats.
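As a rough illustration, here is a minimal PySpark sketch (the paths, column names, and sample data are hypothetical) that writes a DataFrame as ORC with Snappy compression, then reads it back with a column projection and a filter, which Spark can push down to the ORC reader:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

# Hypothetical sample data; any DataFrame works the same way.
df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as ORC with Snappy compression.
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/events_orc")

# Read back: selecting columns and filtering lets Spark apply
# column projection and predicate pushdown against the ORC files.
orc_df = spark.read.orc("/tmp/events_orc")
orc_df.select("id", "amount").filter(orc_df.amount > 20.0).show()
```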
Parquet is another columnar file format optimized for reading and writing large datasets. It was developed under the Apache Parquet project and is likewise widely used in the Hadoop ecosystem, particularly with Spark. Parquet files are optimized for analytical processing and provide features such as column projection, efficient encodings (dictionary, run-length), and a range of compression codecs such as Snappy, Gzip, and others. Parquet's on-disk representation carries somewhat more metadata than ORC's, so for the same data Parquet files are often slightly larger, though the difference depends heavily on the data and the codec chosen.
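A comparable sketch for Parquet (again with hypothetical paths and data), writing with Gzip compression and reading back only a projected subset of columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as Parquet with Gzip compression ("snappy" is Spark's default).
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("/tmp/events_parquet")

# Column projection: only the requested columns are read from disk.
spark.read.parquet("/tmp/events_parquet").select("name", "amount").show()
```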
Avro is a row-oriented data serialization system that is widely used in the big data ecosystem. Avro files are optimized for data exchange and write-heavy workloads, and their main strength is schema evolution: the schema travels with the data, and readers can resolve older and newer schemas against each other. Avro supports compression codecs such as Snappy, Deflate, and others, but because it stores whole rows rather than columns, it compresses less effectively and generally produces larger files than ORC.
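Spark needs the external spark-avro module on the classpath for this format; below is a minimal sketch assuming that package is available (the package version shown is illustrative):

```python
# Launch with the spark-avro package, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 avro_example.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Write as Avro with Deflate compression (row-oriented layout).
df.write.mode("overwrite") \
    .format("avro") \
    .option("compression", "deflate") \
    .save("/tmp/events_avro")

# Reading Avro deserializes whole records (rows), so even a projected
# read still scans full rows from disk.
spark.read.format("avro").load("/tmp/events_avro").select("id", "name").show()
```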
Feature | ORC (Optimized Row Columnar) | Parquet | Avro |
---|---|---|---|
Provider | Apache Hive (now Apache ORC) | Apache Parquet | Apache Avro |
Storage layout | Columnar | Columnar | Row-oriented |
Compression | Snappy, Zlib, and others | Snappy, Gzip, and others | Snappy, Deflate, and others |
Schema evolution | Yes | Yes | Yes (primary strength) |
Column projection | Yes | Yes | No (reads whole rows) |
Encryption support | Yes (column encryption) | Yes (modular encryption, newer versions) | No |
Row-level updates (Hive ACID) | Yes | No | No |
File size | Typically smallest | Often slightly larger than ORC | Largest of the three |
Query performance (analytical scans) | Typically fastest | Fast, sometimes slightly behind ORC | Slowest (row-oriented) |
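To check the file-size row against your own data, one option is to write the same DataFrame in each format (as in the sketches above) and sum the bytes on disk. A hedged sketch, assuming local paths and noting that results vary widely with the data and codecs used:

```python
import os

def dir_size(path):
    """Total bytes of all files under path (local filesystem only)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Paths match the earlier write examples; run those first.
for fmt, path in [("orc", "/tmp/events_orc"),
                  ("parquet", "/tmp/events_parquet"),
                  ("avro", "/tmp/events_avro")]:
    print(fmt, dir_size(path), "bytes")
```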