Over the last decade, Apache Parquet has become the standard format for storing tabular data on disk, regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability, and, not least, its strong out-of-the-box performance with default settings.
While these defaults and common access patterns already provide decent performance, understanding the format in more detail and using recent developments lets one achieve much better performance, produce smaller files, and utilise Parquet's newer partial-reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and recent developments that are useful in end users' daily workflows. The only prerequisite is knowing what a DataFrame, or tabular data in general, is.