Jcipar/batching parquet writer #23391

Draft
wants to merge 2 commits into base: dev
Conversation

@jcipar (Contributor) commented Sep 19, 2024

The batching_parquet_writer is a high-level interface that ties
together all the low-level components for writing Parquet files from
iceberg::value. It:
1. Opens a ss::file to store the results.
2. Accepts iceberg::value instances and collects them in an arrow_translator.
3. Once the row count or size threshold is reached, it writes data to
   the file:
   1. Takes a chunk from the arrow_translator.
   2. Adds the chunk to the parquet_writer.
   3. Extracts iobufs from the parquet_writer.
   4. Writes them to the open file.
4. When finish() is called, it flushes all remaining data and closes
   the file.
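
A rough sketch of what such an interface could look like, assuming Seastar-style futures and the component names used above (arrow_translator, parquet_writer). The class shape, method names, and thresholds are illustrative, not copied from the PR:

```cpp
// Illustrative outline only: arrow_translator, parquet_writer, and
// iceberg::value are PR/codebase components referenced above and are not
// defined here; member and method names are assumptions based on the summary.
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>

#include <cstddef>
#include <filesystem>

namespace ss = seastar;

class batching_parquet_writer {
public:
    batching_parquet_writer(size_t row_count_threshold, size_t byte_threshold);

    // Step 1: open the ss::file that will hold the Parquet output.
    ss::future<> initialize(std::filesystem::path output_path);

    // Step 2: buffer one value in the arrow_translator; step 3 is triggered
    // internally once either threshold is crossed.
    ss::future<> add_data_struct(iceberg::value value, size_t approx_size);

    // Step 4: flush remaining buffered data, finalize the Parquet footer,
    // and close the file.
    ss::future<> finish();

private:
    // Step 3: take a chunk from the translator, hand it to the
    // parquet_writer, extract the resulting iobufs, and append them to the
    // open file.
    ss::future<> write_row_group();

    arrow_translator _translator;
    parquet_writer _writer;
    ss::file _output_file;
    size_t _row_count{0};
    size_t _buffered_bytes{0};
    size_t _row_count_threshold;
    size_t _byte_threshold;
};
```

Keeping the file I/O behind write_row_group() is what lets the Arrow-side work stay synchronous while the actual writes go through Seastar's async file API.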

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

None

This adds an arrow_to_iobuf interface that converts Arrow data to iobufs
representing Parquet files that can be written to disk. There are two
components:
1. An implementation of arrow::io::OutputStream that collects data in
   iobufs.
2. A class that creates a parquet::arrow::FileWriter using that output
   stream and allows the caller to extract the generated iobufs.

This allows us to separate the compute side of generating Parquet, which
still occurs in the Arrow library, from the file I/O, which can now be
made seastar-friendly.
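
As a sketch of the first component, an arrow::io::OutputStream that accumulates writes into an iobuf might look roughly like this. The class name, the extract method, and the exact iobuf::append overload are assumptions, not the PR's actual code; the overridden methods (Write, Close, Tell, closed) are Arrow's standard OutputStream interface:

```cpp
// Minimal sketch of an arrow::io::OutputStream backed by an iobuf, so Arrow
// writes land in memory instead of going to a file descriptor.
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include "bytes/iobuf.h" // Redpanda iobuf; the append overload used is assumed

#include <utility>

class iobuf_output_stream final : public arrow::io::OutputStream {
public:
    arrow::Status Write(const void* data, int64_t nbytes) override {
        _buf.append(reinterpret_cast<const char*>(data), nbytes);
        _position += nbytes;
        return arrow::Status::OK();
    }

    arrow::Status Close() override {
        _closed = true;
        return arrow::Status::OK();
    }

    arrow::Result<int64_t> Tell() const override { return _position; }

    bool closed() const override { return _closed; }

    // Hand the accumulated bytes back to the caller (e.g. to append them to
    // an open ss::file) and reset the internal buffer.
    iobuf extract_iobuf() { return std::exchange(_buf, iobuf{}); }

private:
    iobuf _buf;
    int64_t _position = 0;
    bool _closed = false;
};
```

The second component would then construct the Parquet FileWriter on top of a shared pointer to this stream and periodically call something like extract_iobuf() to drain the bytes Arrow has produced so far.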