Improve memory usage for `arrow-row -> String/BinaryView` when utf8 validation disabled #7917

ding-young · 2025-07-12T11:54:17Z

Which issue does this PR close?

Related to #6057 (Improve arrow-row --> StringView/BinaryView memory usage).

Rationale for this change

As described in the issue above, when constructing a StringViewArray from rows we currently store inline strings twice: once in the view created by `make_view`, and again in the values buffer so that UTF-8 can be validated in a single pass. This wastes memory, so ideally we should avoid placing inline strings in the values buffer when UTF-8 validation is disabled.
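As a rough illustration of the duplication, the sketch below estimates the bytes needed for the views plus the values buffer, with and without inline strings copied into the buffer. It assumes the Arrow view layout (16-byte views, values of at most 12 bytes stored inline) and ignores null buffers and allocator rounding. `estimated_bytes` is a hypothetical helper, not part of arrow-rs.

```rust
/// Size of one view in the Arrow view layout.
const VIEW_SIZE: usize = 16;
/// Values of at most this many bytes are stored inline in the view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical estimate: bytes for all views plus the values buffer.
/// When `buffer_inline_values` is true, inline strings are also counted
/// in the values buffer (the behaviour this PR wants to avoid).
fn estimated_bytes(lens: &[usize], buffer_inline_values: bool) -> usize {
    let views = lens.len() * VIEW_SIZE;
    let values: usize = lens
        .iter()
        .filter(|&&len| buffer_inline_values || len > MAX_INLINE_LEN)
        .sum();
    views + values
}
```

For 4096 strings of 10 bytes each, this gives 65536 + 40960 bytes when inline strings are buffered and 65536 bytes when they are not.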

What changes are included in this PR?

When UTF-8 validation is disabled, this PR changes how string/binary view arrays are constructed from rows:

1. The capacity of the values buffer is sized for long strings only, plus 12 bytes to hold a single inline string temporarily.
2. Every decoded string is first appended to the values buffer.
3. If the string turns out to be an inline string (12 bytes or fewer), its view is created with `make_view` and the bytes just appended are truncated from the values buffer, so the inline string is not stored twice in the resulting array.
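The steps above can be sketched with standard-library types only. This is a minimal illustration, not the PR's code: `make_view_sketch` is a hypothetical stand-in for arrow's `make_view`, `build_views` is a hypothetical driver, and copying the input stands in for decoding a row.

```rust
/// Values of at most this many bytes are stored inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical stand-in for `make_view`: packs a value into a 16-byte view.
/// Layout: 4-byte length, then either the inline bytes (zero padded) or
/// a 4-byte prefix, 4-byte buffer index and 4-byte offset.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len();
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(len as u32).to_le_bytes());
    if len <= MAX_INLINE_LEN {
        bytes[4..4 + len].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Append-then-truncate: every value is written to `values` first, and
/// inline values are removed again once their view has been built.
fn build_views(inputs: &[&[u8]]) -> (Vec<u128>, Vec<u8>) {
    // Capacity covers long values only, plus room for one inline value.
    let long_bytes: usize = inputs
        .iter()
        .map(|v| v.len())
        .filter(|&len| len > MAX_INLINE_LEN)
        .sum();
    let mut values: Vec<u8> = Vec::with_capacity(long_bytes + MAX_INLINE_LEN);
    let mut views = Vec::with_capacity(inputs.len());

    for input in inputs {
        let start = values.len();
        // A real decoder would decode the row into the buffer here.
        values.extend_from_slice(input);
        let view = make_view_sketch(&values[start..], 0, start as u32);
        if input.len() <= MAX_INLINE_LEN {
            // The bytes now live inside the view; drop the buffered copy.
            values.truncate(start);
        }
        views.push(view);
    }
    (views, values)
}
```

Calling `build_views` on a mix of short and long values leaves only the long values in the returned buffer, while every value still gets a view.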

Are these changes tested?

- Copied and modified the existing fuzz_test to disable UTF-8 validation.
- Ran the benchmarks and added a bench case where the array contains both inline and long strings.

Are there any user-facing changes?

No.

Considered alternatives

One idea was to support a separate buffer for inline strings even when UTF-8 validation is enabled. However, since we need to call `decoded_len()` first to determine the target buffer, this approach is either tricky or inefficient:

- Precomputing a boolean flag per string to choose the buffer would increase temporary memory usage.
- Appending to the values buffer first and then moving inline strings to a separate buffer would add frequent memcpy overhead.

Given that DataFusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off.

ding-young · 2025-07-14T03:36:23Z

@alamb @XiangpengHao @2010YOUY01

When running the existing benchmarks (`cargo bench --bench row_format "convert_rows 4096 string view"`), I noticed there might be a slight regression, but it seems relatively minor given the normal level of fluctuation.
I’d love to hear your thoughts on this PR — if you think this direction is useful, I’ll run more benchmark experiments and polish it further.

XiangpengHao · 2025-07-14T16:53:54Z

I took a high level look and it looks good to me. Curious to see the perf diff. Does the benchmark report memory usage yet? @ding-young

ding-young · 2025-07-15T09:50:24Z

@XiangpengHao The current benchmark doesn’t report memory usage directly, but I’ve been printing stats manually using jemalloc. It seems like there might be an issue with my implementation, so I’ll double-check that and share the perf once I’ve confirmed.

ding-young · 2025-07-16T07:41:16Z

cargo bench result

| Case (str_len, null prob) | main | issue-6057 |
| --- | --- | --- |
| string view(10, 0) | 51.23 µs | 52.18 µs |
| string view(30, 0) | 45.47 µs | 46.63 µs |
| string view(100, 0) | 64.18 µs | 68.54 µs |
| string view(100, 0.5) | 70.11 µs | 74.06 µs |
| string view(1..100, 0) | 100.72 µs | 103.80 µs |
| string view(1..100, 0.5) | 80.48 µs | 86.02 µs |

manual memory profiling result (unit: bytes)

I added code to collect jemalloc stats (allocated, resident, active) before and after decoding the binary view. Memory usage improved, especially when short strings are mixed with long strings. When the rows consist only of long strings, memory usage was unchanged.

let before = jemalloc_stat();

let view = if !validate_utf8 {
    decode_binary_view_inner_utf8_unchecked(rows, options)
} else {
    decode_binary_view_inner(rows, options, validate_utf8)
};

let after = jemalloc_stat();
// print ( after - before )

(To reproduce, see https://github.com/ding-young/arrow-rs/tree/issue-6057-bench-mem )
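The same before/after pattern can be tried without jemalloc. The sketch below swaps in a different technique: a standard-library counting global allocator that tracks live requested bytes. It does not reproduce jemalloc's allocated/active/resident statistics, and `CountingAlloc`, `live_bytes`, and `measure` are hypothetical names used for illustration only.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts live requested bytes.
struct CountingAlloc;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Bytes currently requested from the allocator and not yet freed.
fn live_bytes() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}

/// Runs `f` and returns its result with the change in live bytes.
/// The result is returned (not dropped) so its memory counts as live.
fn measure<T>(f: impl FnOnce() -> T) -> (T, isize) {
    let before = live_bytes();
    let out = f();
    let after = live_bytes();
    (out, after as isize - before as isize)
}
```

Wrapping the decode call in `measure` would report how many bytes the returned array still holds, which is closest in spirit to the "alloc" column; allocations made by other threads during the call are also counted.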

| Case | main (alloc / active) | issue-6057 (alloc / active) |
| --- | --- | --- |
| string view(10, 0) | 102656 / 114688 | 65536 / 69632 |
| string view(30, 0) | 196608 / 204800 | 196608 / 204800 |
| string view(100, 0) | 524288 / 532480 | 524288 / 532480 |
| string view(100, 0.5) | 294912 / 303104 | 294912 / 303104 |
| string view(1..100, 0) | 294912 / 303104 | 294912 / 303104 |
| string view(1..100, 0.5) | 180224 / 188416 | 163840 / 172032 |

ding-young added 2 commits July 11, 2025 08:21

avoid buffering all inline strings when decode_binary_view_inner_unch…

eeda228

…ecked

add bench for case when short & long string both exists

50a67c7

github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2025

fix wrong len value

a09b08b

chore: remove unnecessary test

ab8cb56

ding-young force-pushed the issue-6057 branch from fe7c44d to ab8cb56 Compare July 16, 2025 07:43

ding-young marked this pull request as ready for review July 16, 2025 07:44
