HDFS Formats: Parquet vs AVRO

AVRO

  • Row-based storage format.
  • The schema is stored together with the data, so each file is self-describing.
  • Robust support for data schemas that change over time, i.e. schema evolution.
  • Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
  • When to use
    • Data from the landing zone is usually read as a whole by downstream systems for further processing, which suits a row-based format.
    • Source schema changes are handled easily through schema evolution.
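
The schema points above can be sketched in plain Python. This is only an illustration, not the Avro library: the `User` schemas, their field names, and the `resolve` helper are hypothetical, and `resolve` covers just the simplest part of Avro's schema resolution (a field missing from the written data takes the reader's default, and a field the reader does not list is dropped).

```python
import json

# Hypothetical writer schema (version 1): a record containing an array,
# an enumerated type, and a sub-record, written as an Avro-style JSON schema.
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["ACTIVE", "INACTIVE"]}},
        {"name": "address", "type": {"type": "record", "name": "Address",
                                     "fields": [{"name": "city", "type": "string"}]}},
    ],
}

# Hypothetical reader schema (version 2): drops "tags" and adds "email"
# with a default, so data written with version 1 can still be read.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["ACTIVE", "INACTIVE"]}},
        {"name": "address", "type": {"type": "record", "name": "Address",
                                     "fields": [{"name": "city", "type": "string"}]}},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Simplified schema resolution for a top-level record (illustration only)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the data but absent from the reader schema are dropped.
    return resolved


# A record as it would have been written with version 1 of the schema.
old_record = {
    "name": "Ada",
    "tags": ["admin", "beta"],
    "status": "ACTIVE",
    "address": {"city": "London"},
}

new_view = resolve(old_record, READER_SCHEMA)
print(json.dumps(new_view))
```

In a real pipeline an Avro library would perform this resolution while decoding the file; the sketch only shows why adding a field with a default does not break readers of older data.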

Parquet

  • Parquet stores data in a column-oriented way.
    • The values of each column are stored adjacent to one another, which enables a better compression rate because similar values sit together.
  • It is especially good for queries that read a few columns from a “wide” table (one with many columns), since only the needed columns are read and I/O (input/output) is minimized.
  • Nested data structures are represented in a flat columnar format.
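
A minimal sketch of why a column-oriented layout helps, using plain Python lists rather than the Parquet library. The sample rows and the `to_columns` and `run_length_encode` helpers are hypothetical, and run-length encoding stands in here as one simple example of a compression scheme that benefits from adjacent, similar values.

```python
from itertools import groupby

# Hypothetical sample data in row-oriented form: one dict per record.
rows = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "DE", "amount": 25},
    {"id": 3, "country": "DE", "amount": 7},
    {"id": 4, "country": "FR", "amount": 12},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query such as SUM(amount) only needs one column.
values_touched_row_layout = sum(len(row) for row in rows)   # every field of every row
values_touched_column_layout = len(columns["amount"])       # only the "amount" values
total = sum(columns["amount"])

# Values of one column sit next to each other, so repeated values form runs.
encoded_country = run_length_encode(columns["country"])

print(values_touched_row_layout, values_touched_column_layout, total)
print(encoded_country)
```

With the row layout every field of every record is touched to sum one column, while the column layout reads only the `amount` values; the gap widens as the table gets wider.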
