As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing large data. Let’s add to our toolkit.
What is a parquet file? It is a file format that organizes data so that it is fast to search. In contrast to row-based storage, where the data is stored row by row, parquet is column-based: data is stored column by column. There are some particular advantages to this storage format:
- It is well suited to storing very large datasets.
- It is extremely fast to filter on a column, since the query engine can scan a single column without reading the rest of the file.
- It is compressed, so files are much smaller than plain-text formats like Comma-Separated Values (CSV).
- Many database systems are now optimized to work with parquet files.
The main drawback is that there is additional technology required to work with parquet files. We'll use the nanoparquet, duckdb, and duckplyr packages to interact with them using a tidy workflow.
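A minimal setup sketch, assuming these packages are already installed from CRAN:

```r
# Load the packages: nanoparquet for parquet metadata, duckdb for the
# query engine, and duckplyr/dplyr for the tidy workflow
library(nanoparquet)
library(duckdb)
library(duckplyr)
library(dplyr)
```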
You Don’t Load Parquet Files Into Memory
In general, you don’t load parquet files into memory to work with them. You tend to work with them through a database query engine. In our case, we’ll use a system called DuckDB. DuckDB is a database query engine that works with a wide variety of data types.
Other query engines that are important include Snowflake, Google BigQuery, and Amazon Redshift, all of which can read parquet files. So you can see that learning how to work with parquet files is important.
A Titanic Parquet file
We’ve got a folder called data/, and there is a file called titanic.parquet (link to data if you want to download) in there.
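A quick way to confirm that from R:

```r
# list what's in the data/ folder
list.files("data/")
#> [1] "titanic.parquet"
```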
Now that we know what’s in our data/ folder, we can get some info on this file using the {nanoparquet} package. We can get an idea of the size of the data with parquet_info().
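A sketch of that call, assuming the file sits at data/titanic.parquet:

```r
# high-level metadata: number of rows, columns, row groups, and file size
nanoparquet::parquet_info("data/titanic.parquet")
```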
Let’s get info about the column types in our file with parquet_column_types().
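Along the same lines, a sketch using the same assumed path:

```r
# per-column schema: parquet types and the R types they map to
nanoparquet::parquet_column_types("data/titanic.parquet")
```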
In general, nanoparquet is very useful for reading and writing parquet files. But remember, we are going to be interacting with the files through a database query engine. Enter DuckDB.
Querying your Data
To actually interact with the data, we can connect to the parquet file with the duckdb package. We first need to open what’s called a database connection, which links our R session to the DuckDB engine.
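A minimal sketch of opening that connection (an in-memory database, which is the duckdb default):

```r
# open a connection to an in-memory DuckDB database
con <- DBI::dbConnect(duckdb::duckdb())
```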
Now, we’re going to make a “View”, which is a temporary table in DuckDB to do our queries on. This also connects our parquet file into DuckDB with its PARQUET_SCAN() function.
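Something like the following, where the view name titanic is my choice rather than anything required:

```r
# register the parquet file as a view named "titanic" inside DuckDB
DBI::dbExecute(
  con,
  "CREATE VIEW titanic AS SELECT * FROM PARQUET_SCAN('data/titanic.parquet')"
)
```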
This is the only SQL we need to write to interact with the data. The rest we can do with dplyr commands, thanks to a package called duckplyr.
Now that we have our connection and our view, we can start to take a look at the data. We can pass our connection and view name to the tbl() function. Here we’re calling head(), followed by collect().
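A sketch of that chain, assuming the view is named titanic as in the SQL above:

```r
# point dplyr at the view, peek at the first rows, then pull them into R
tbl(con, "titanic") |>
  head() |>
  collect()
```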
Why do we need to call collect()? DuckDB uses what’s called Lazy Execution: it only calculates a value when you tell it to. Lazy Execution allows DuckDB to make an execution plan: it finds the best strategy for executing the entire query rather than calculating it piece by piece.
Summarizing the Passengers
Let’s look at who survived and their average fare. Note that we can remove missing values with the na.rm argument to mean().
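A sketch of that query, assuming the columns are named Survived and Fare, as in the standard Titanic dataset:

```r
# mean fare by survival status; na.rm = TRUE drops missing fares
tbl(con, "titanic") |>
  group_by(Survived) |>
  summarize(mean_fare = mean(Fare, na.rm = TRUE)) |>
  collect()
```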
Depressingly, it seems that those who survived paid more for their fare.
Try Constructing Your Own Query
Find out the mean Age by Pclass (1st, 2nd, 3rd class).
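One possible solution, along the same lines as the query above and assuming the columns are named Age and Pclass:

```r
# mean age within each passenger class
tbl(con, "titanic") |>
  group_by(Pclass) |>
  summarize(mean_age = mean(Age, na.rm = TRUE)) |>
  collect()
```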
Explain it to Me
You can see how DuckDB is building the query by asking it to explain(). It will show you the query plan for the search.
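For example, reusing the survival query sketched above:

```r
# ask for the query plan instead of the results
tbl(con, "titanic") |>
  group_by(Survived) |>
  summarize(mean_fare = mean(Fare, na.rm = TRUE)) |>
  explain()
```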
This output is a bit difficult to understand, but as you start writing more and more queries, it can be helpful to dive into the query plan.
The Tip of the, er, Iceberg
That’s just the beginning of working with DuckDB and parquet files. Hopefully you learned something useful, and I’ll write more about parquet files in a future data snack.
Thanks to Francois Michonneau for the DuckDB Tutorial.
Citation
@online{laderas2024,
author = {Laderas, Ted},
title = {Large {Data} {Work:} {Intro} to Parquet Files in {R}},
date = {2024-09-30},
langid = {en}
}