pub struct ArrowWriter<W: Write> {
writer: SerializedFileWriter<W>,
in_progress: Option<ArrowRowGroupWriter>,
arrow_schema: SchemaRef,
max_row_group_size: usize,
}Expand description
Encodes [RecordBatch] to parquet
Writes Arrow RecordBatches to a Parquet writer. Multiple [RecordBatch] will be encoded
to the same row group, up to max_row_group_size rows. Any remaining rows will be
flushed on close, leading the final row group in the output file to potentially
contain fewer than max_row_group_size rows
let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None).unwrap();
writer.write(&to_write).unwrap();
writer.close().unwrap();
let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
let read = reader.next().unwrap().unwrap();
assert_eq!(to_write, read);§Memory Limiting
The nature of parquet forces buffering of an entire row group before it can be flushed to the underlying writer. Data is mostly buffered in its encoded form, reducing memory usage. However, some data such as dictionary keys or large strings or very nested data may still result in non-trivial memory usage.
See Also:
ArrowWriter::memory_size: the current memory usage of the writer.ArrowWriter::in_progress_size: Estimated size of the buffered row group,
Call Self::flush to trigger an early flush of a row group based on a
memory threshold and/or global memory pressure. However, smaller row groups
result in higher metadata overheads, and thus may worsen compression ratios
and query performance.
writer.write(&batch).unwrap();
// Trigger an early flush if anticipated size exceeds 1_000_000
if writer.in_progress_size() > 1_000_000 {
writer.flush().unwrap();
}§Type Support
The writer supports writing all Arrow DataTypes that have a direct mapping to
Parquet types including StructArray and ListArray.
The following are not supported:
IntervalMonthDayNanoArray: Parquet does not support nanosecond intervals.
Fields§
§writer: SerializedFileWriter<W>Underlying Parquet writer
in_progress: Option<ArrowRowGroupWriter>The in-progress row group if any
arrow_schema: SchemaRefA copy of the Arrow schema.
The schema is used to verify that each record batch written has the correct schema
max_row_group_size: usizeThe length of arrays to write to each row group
Implementations§
Source§impl<W: Write + Send> ArrowWriter<W>
impl<W: Write + Send> ArrowWriter<W>
Sourcepub fn try_new(
writer: W,
arrow_schema: SchemaRef,
props: Option<WriterProperties>,
) -> Result<Self>
pub fn try_new( writer: W, arrow_schema: SchemaRef, props: Option<WriterProperties>, ) -> Result<Self>
Try to create a new Arrow writer
The writer will fail if:
- a
SerializedFileWritercannot be created from the ParquetWriter - the Arrow schema contains unsupported datatypes such as Unions
Sourcepub fn try_new_with_options(
writer: W,
arrow_schema: SchemaRef,
options: ArrowWriterOptions,
) -> Result<Self>
pub fn try_new_with_options( writer: W, arrow_schema: SchemaRef, options: ArrowWriterOptions, ) -> Result<Self>
Try to create a new Arrow writer with ArrowWriterOptions.
The writer will fail if:
- a
SerializedFileWritercannot be created from the ParquetWriter - the Arrow schema contains unsupported datatypes such as Unions
Sourcepub fn flushed_row_groups(&self) -> &[RowGroupMetaData]
pub fn flushed_row_groups(&self) -> &[RowGroupMetaData]
Returns metadata for any flushed row groups
Sourcepub fn memory_size(&self) -> usize
pub fn memory_size(&self) -> usize
Estimated memory usage, in bytes, of this ArrowWriter
This estimate is formed bu summing the values of
ArrowColumnWriter::memory_size all in progress columns.
Sourcepub fn in_progress_size(&self) -> usize
pub fn in_progress_size(&self) -> usize
Anticipated encoded size of the in progress row group.
This estimate the row group size after being completely encoded is,
formed by summing the values of
ArrowColumnWriter::get_estimated_total_bytes for all in progress
columns.
Sourcepub fn in_progress_rows(&self) -> usize
pub fn in_progress_rows(&self) -> usize
Returns the number of rows buffered in the in progress row group
Sourcepub fn bytes_written(&self) -> usize
pub fn bytes_written(&self) -> usize
Returns the number of bytes written by this instance
Sourcepub fn write(&mut self, batch: &RecordBatch) -> Result<()>
pub fn write(&mut self, batch: &RecordBatch) -> Result<()>
Encodes the provided [RecordBatch]
If this would cause the current row group to exceed WriterProperties::max_row_group_size
rows, the contents of batch will be written to one or more row groups such that all but
the final row group in the file contain WriterProperties::max_row_group_size rows.
This will fail if the batch’s schema does not match the writer’s schema.
Sourcepub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
pub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
Additional KeyValue metadata to be written in addition to those from WriterProperties
This method provide a way to append kv_metadata after write RecordBatch
Sourcepub fn inner_mut(&mut self) -> &mut W
pub fn inner_mut(&mut self) -> &mut W
Returns a mutable reference to the underlying writer.
It is inadvisable to directly write to the underlying writer, doing so will likely result in a corrupt parquet file
Sourcepub fn into_inner(self) -> Result<W>
pub fn into_inner(self) -> Result<W>
Flushes any outstanding data and returns the underlying writer.
Sourcepub fn finish(&mut self) -> Result<FileMetaData>
pub fn finish(&mut self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer
Unlike Self::close this does not consume self
Attempting to write after calling finish will result in an error
Sourcepub fn close(self) -> Result<FileMetaData>
pub fn close(self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer
Trait Implementations§
Auto Trait Implementations§
impl<W> Freeze for ArrowWriter<W>where
W: Freeze,
impl<W> !RefUnwindSafe for ArrowWriter<W>
impl<W> Send for ArrowWriter<W>where
W: Send,
impl<W> !Sync for ArrowWriter<W>
impl<W> Unpin for ArrowWriter<W>where
W: Unpin,
impl<W> !UnwindSafe for ArrowWriter<W>
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more