Jcipar/batching parquet writer #23391

Draft
wants to merge 2 commits into base: dev
Conversation

@jcipar (Contributor) commented Sep 19, 2024

The batching_parquet_writer is a high-level interface that ties
together all the low-level components for writing Parquet files from
iceberg::value. It:
1. Opens a ss::file to store the results.
2. Accepts iceberg::value instances and collects them in an arrow_translator.
3. Once the row count or size threshold is reached, it writes data to
   the file:
   1. Takes a chunk from the arrow_translator.
   2. Adds the chunk to the parquet_writer.
   3. Extracts iobufs from the parquet_writer.
   4. Writes them to the open file.
4. When finish() is called, it flushes all remaining data and closes
   the file.
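
A rough sketch of what such an interface could look like, assuming Seastar-style futures and the component names used above (arrow_translator, parquet_writer). The class shape, method names, and thresholds are illustrative, not copied from the PR:

```cpp
// Illustrative outline only: arrow_translator, parquet_writer, and
// iceberg::value are PR/codebase components referenced above and are not
// defined here; member and method names are assumptions based on the summary.
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>

#include <cstddef>
#include <filesystem>

namespace ss = seastar;

class batching_parquet_writer {
public:
    batching_parquet_writer(size_t row_count_threshold, size_t byte_threshold);

    // Step 1: open the ss::file that will hold the Parquet output.
    ss::future<> initialize(std::filesystem::path output_path);

    // Step 2: buffer one value in the arrow_translator; step 3 is triggered
    // internally once either threshold is crossed.
    ss::future<> add_data_struct(iceberg::value value, size_t approx_size);

    // Step 4: flush remaining buffered data, finalize the Parquet footer,
    // and close the file.
    ss::future<> finish();

private:
    // Step 3: take a chunk from the translator, hand it to the
    // parquet_writer, extract the resulting iobufs, and append them to the
    // open file.
    ss::future<> write_row_group();

    arrow_translator _translator;
    parquet_writer _writer;
    ss::file _output_file;
    size_t _row_count{0};
    size_t _buffered_bytes{0};
    size_t _row_count_threshold;
    size_t _byte_threshold;
};
```

Keeping the file I/O behind write_row_group() is what lets the Arrow-side work stay synchronous while the actual writes go through Seastar's async file API.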

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

None

This adds an arrow_to_iobuf interface that converts Arrow data to iobufs
representing Parquet files that can be written to disk. There are two
components:
1. An implementation of arrow::io::OutputStream that collects data in
   iobufs.
2. A class that creates a parquet::arrow::FileWriter using that output
   stream and allows the caller to extract the generated iobufs.

This allows us to separate the compute side of generating Parquet, which
still occurs in the Arrow library, from the file I/O, which can now be
made seastar-friendly.
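
As a sketch of the first component, an arrow::io::OutputStream that accumulates writes into an iobuf might look roughly like this. The class name, the extract method, and the exact iobuf::append overload are assumptions, not the PR's actual code; the overridden methods (Write, Close, Tell, closed) are Arrow's standard OutputStream interface:

```cpp
// Minimal sketch of an arrow::io::OutputStream backed by an iobuf, so Arrow
// writes land in memory instead of going to a file descriptor.
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include "bytes/iobuf.h" // Redpanda iobuf; the append overload used is assumed

#include <utility>

class iobuf_output_stream final : public arrow::io::OutputStream {
public:
    arrow::Status Write(const void* data, int64_t nbytes) override {
        _buf.append(reinterpret_cast<const char*>(data), nbytes);
        _position += nbytes;
        return arrow::Status::OK();
    }

    arrow::Status Close() override {
        _closed = true;
        return arrow::Status::OK();
    }

    arrow::Result<int64_t> Tell() const override { return _position; }

    bool closed() const override { return _closed; }

    // Hand the accumulated bytes back to the caller (e.g. to append them to
    // an open ss::file) and reset the internal buffer.
    iobuf extract_iobuf() { return std::exchange(_buf, iobuf{}); }

private:
    iobuf _buf;
    int64_t _position = 0;
    bool _closed = false;
};
```

The second component would then construct the Parquet FileWriter on top of a shared pointer to this stream and periodically call something like extract_iobuf() to drain the bytes Arrow has produced so far.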