Improve memory usage for arrow-row -> String/BinaryView when utf8 validation disabled
#7917
Which issue does this PR close?
Rationale for this change
As described in the issue above, when constructing a StringViewArray from rows, we currently store inline strings twice: once through make_view, and again in the values buffer so that we can validate UTF-8 in one go. This is suboptimal in terms of memory consumption, so ideally we should avoid placing inline strings into the values buffer when UTF-8 validation is disabled.
What changes are included in this PR?
When UTF-8 validation is disabled, this PR modifies the string/bytes view array construction from rows as follows:
The capacity of the values buffer is set to accommodate only long strings plus 12 bytes for a single inline string placeholder.
All decoded strings are initially appended to the values buffer.
If a string turns out to be an inline string, it is included via make_view, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array.
Are these changes tested?
fuzz_test is updated to disable UTF-8 validation.
Are there any user-facing changes?
No.
Considered alternatives
One idea was to support separate buffers for inline strings even when UTF-8 validation is enabled. However, since we need to call decoded_len() first to determine the target buffer, this approach can be tricky or inefficient:
For example, precomputing a boolean flag per string to determine which buffer to use would increase temporary memory usage.
Alternatively, appending to the values buffer first and then moving inline strings to a separate buffer would lead to frequent memcpy overhead.
Given that DataFusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off.
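Illustrative sketch
The append-then-truncate flow described under "What changes are included in this PR?" can be sketched roughly as follows. This is a std-only illustration, not the code in this PR: make_view_sketch and build_views are hypothetical names, the view is modeled as a plain u128 following the Arrow view layout (4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset), and the input is assumed to be already-decoded byte slices, whereas the real decoder writes row bytes straight into the values buffer.

```rust
/// Strings of at most this many bytes fit inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical stand-in for arrow's `make_view` (assumption, not the real API).
/// Packs a little-endian view: length (4 bytes), then either the inline bytes
/// or prefix (4 bytes) + buffer index (4 bytes) + offset (4 bytes).
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Builds views plus a values buffer that ends up holding only long strings.
/// `decoded` stands in for the strings the row decoder would produce.
fn build_views(decoded: &[&[u8]]) -> (Vec<u128>, Vec<u8>) {
    // Capacity: all long strings, plus 12 bytes for one transient inline string.
    let long_bytes: usize = decoded
        .iter()
        .filter(|s| s.len() > MAX_INLINE_LEN)
        .map(|s| s.len())
        .sum();
    let mut values = Vec::with_capacity(long_bytes + MAX_INLINE_LEN);
    let mut views = Vec::with_capacity(decoded.len());

    for s in decoded {
        let start = values.len();
        // Step 1: every string is appended to the values buffer first.
        values.extend_from_slice(s);
        let len = values.len() - start;
        // Step 2: build the view from the bytes that were just written.
        views.push(make_view_sketch(&values[start..], 0, start as u32));
        // Step 3: an inline string now lives in its view, so drop the copy.
        if len <= MAX_INLINE_LEN {
            values.truncate(start);
        }
    }
    (views, values)
}
```

Calling build_views with one short and one long string should leave only the long string's bytes in the returned values buffer, while the short string is carried entirely by its view.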