[Variant] Support appending complex variants in VariantBuilder #7914

Merged: 14 commits, Jul 16, 2025

@friendlymatthew friendlymatthew commented Jul 11, 2025

Which issue does this PR close?

Rationale for this change

Appending a VariantObject or VariantList directly to a VariantBuilder currently panics.

Changes to the public API

VariantBuilder now has these additional methods:

  • append_object, will panic if shallow validation fails or the object has duplicate field names

  • try_append_object, will perform full validation on the object before appending

  • append_list, will panic if shallow validation fails

  • try_append_list, will perform full validation on the list before appending
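
As a rough illustration of the split between the panicking and the `try_` methods, here is a minimal sketch on toy types. `ToyValue`, `ToyBuilder` and `DuplicateField` are hypothetical stand-ins, not the parquet-variant API, and treating "shallow" as "top level only" is an assumption made for the example.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a variant value; not the parquet-variant type.
#[derive(Debug, Clone, PartialEq)]
enum ToyValue {
    Int(i64),
    List(Vec<ToyValue>),
    Object(Vec<(String, ToyValue)>),
}

#[derive(Debug, PartialEq)]
struct DuplicateField(String);

/// Shallow check: looks only at the field names of this one object.
fn check_unique_names(fields: &[(String, ToyValue)]) -> Result<(), DuplicateField> {
    let mut seen = HashSet::new();
    for (name, _) in fields {
        if !seen.insert(name.as_str()) {
            return Err(DuplicateField(name.clone()));
        }
    }
    Ok(())
}

/// Full check: applies the shallow check to every nested object as well.
fn validate_deep(value: &ToyValue) -> Result<(), DuplicateField> {
    match value {
        ToyValue::Int(_) => Ok(()),
        ToyValue::List(items) => items.iter().try_for_each(validate_deep),
        ToyValue::Object(fields) => {
            check_unique_names(fields)?;
            fields.iter().try_for_each(|(_, v)| validate_deep(v))
        }
    }
}

#[derive(Default)]
struct ToyBuilder {
    values: Vec<ToyValue>,
}

impl ToyBuilder {
    /// Panics if the top-level object has duplicate field names.
    fn append_object(&mut self, fields: Vec<(String, ToyValue)>) {
        check_unique_names(&fields).expect("object has duplicate field names");
        self.values.push(ToyValue::Object(fields));
    }

    /// Validates the whole tree first and appends nothing on failure.
    fn try_append_object(&mut self, fields: Vec<(String, ToyValue)>) -> Result<(), DuplicateField> {
        let object = ToyValue::Object(fields);
        validate_deep(&object)?;
        self.values.push(object);
        Ok(())
    }
}
```

In this toy, a duplicate name buried in a nested object slips past `append_object` but is rejected by `try_append_object`, which is the trade-off the two method families are meant to expose.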

@friendlymatthew friendlymatthew changed the title Append complex variants [Variant] Append complex variants Jul 11, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch from b054960 to 9ba2156 Compare July 11, 2025 22:53
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch from 9ba2156 to 9bb416b Compare July 11, 2025 23:00
let variant = value.into();

match variant {
Variant::Object(obj) => self.append_object(obj),

Should we create a clone of this append_value function using the try_append functions

Yes, I agree -- I think it would make a lot of sense to have a try_append_value that returns an error (rather than panicking) if an invalid variant is passed in.
Then append_value would just call self.try_append_value(v).unwrap() 🤔
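
A minimal sketch of that delegation pattern, assuming a toy builder rather than the real VariantBuilder. `ToyVariant`, `ToyBuilder` and `render` are hypothetical names, and "invalid" here simply means a string payload that is not valid UTF-8.

```rust
use std::str::Utf8Error;

// Hypothetical stand-in for a variant whose payload has not been validated yet.
enum ToyVariant {
    Int(i64),
    /// String payload that may or may not be valid UTF-8.
    Str(Vec<u8>),
    List(Vec<ToyVariant>),
}

#[derive(Default)]
struct ToyBuilder {
    /// Rendered values, in append order.
    out: Vec<String>,
}

impl ToyBuilder {
    /// Fallible entry point: returns an error instead of panicking.
    fn try_append_value(&mut self, value: ToyVariant) -> Result<(), Utf8Error> {
        // Render (and thereby validate) first, so a failure appends nothing.
        let rendered = render(&value)?;
        self.out.push(rendered);
        Ok(())
    }

    /// Infallible wrapper: same logic, but panics on invalid input.
    fn append_value(&mut self, value: ToyVariant) {
        self.try_append_value(value).unwrap()
    }
}

/// Recursively renders a value, failing on the first invalid string.
fn render(value: &ToyVariant) -> Result<String, Utf8Error> {
    match value {
        ToyVariant::Int(i) => Ok(i.to_string()),
        ToyVariant::Str(bytes) => Ok(format!("{:?}", std::str::from_utf8(bytes)?)),
        ToyVariant::List(items) => {
            let parts = items.iter().map(render).collect::<Result<Vec<_>, _>>()?;
            Ok(format!("[{}]", parts.join(", ")))
        }
    }
}
```

Keeping all logic in the fallible method means the panicking one cannot drift out of sync with it.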

@alamb alamb left a comment

Thank you @friendlymatthew and @abacef -- I think this is pretty close

I think @abacef's suggestions are spot on and I left some similar ones

Let me know what you think

"Nested values are handled specially by ObjectBuilder and ListBuilder"
);
}
_ => unreachable!("Objects and lists must be appended using VariantBuilder::append_object and VariantBuilder::append_list"),

Why not call self.append_variant_object and self.append_variant_list here (to avoid the panic)?

@alamb alamb changed the title [Variant] Append complex variants [Variant] Support appending complex variants in VariantBuilder Jul 12, 2025
@friendlymatthew friendlymatthew marked this pull request as draft July 12, 2025 19:58

@scovich scovich left a comment

Interesting PR -- it certainly uncovered some questions in my mind!

Out of curiosity, why only support this for the top-level builder and not the list/object builders?

"Nested values are handled specially by ObjectBuilder and ListBuilder"
);
}
_ => unreachable!("Objects and lists must be appended using VariantBuilder::append_object and VariantBuilder::append_list"),

This is an awkward division of responsibilities. What if we got rid of ValueBuffer::append_variant entirely, and just had the builder invoke the primitive append_xxx methods directly?

I guess the method is currently important because all three builders need to call it. I wonder more and more whether there might be some way to harmonize the implementations more, leveraging a combination of ParentState and VariantBuilderExt?

/// Appends a [`VariantObject`] to the builder with full validation during iteration.
///
/// Recursively validates all nested variants in the object during iteration.
pub fn try_append_object<'m, 'v>(

I wish there were an efficient way to reuse code for fallible vs. infallible versions of this method...

Samyak2 added a commit to Samyak2/arrow-rs that referenced this pull request Jul 13, 2025
In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented yet.
- Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this.
- Perf is a TODO.

alamb commented Jul 14, 2025

I think this PR unlocks several other APIs so I hope to spend some time helping it along later today

@friendlymatthew

I think this PR unlocks several other APIs so I hope to spend some time helping it along later today

Hi, I am traveling today so I've been a bit preoccupied. I'll try to get this over the line during my flight.

alamb commented Jul 14, 2025

I am starting to hack on this a bit

@alamb alamb left a comment

I think we can significantly simplify this PR in this way

It isn't quite ready to go -- I need to run now -- but I got enough of the sketch done to show what it would look like.

I'll mess around with it a bit more tomorrow too

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch from 014ffad to db32eef Compare July 14, 2025 22:06
@friendlymatthew friendlymatthew marked this pull request as ready for review July 14, 2025 22:13

alamb commented Jul 15, 2025

I think this is the most important Variant PR at the moment from my perspective as we will likely need this for many of the other things we are discussing.

Thus I plan to help with this one first and then move on to more shredding related ones

@friendlymatthew please let me know if you get stuck or need some help. Otherwise I'll wait for a ping from you

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch 3 times, most recently from 5113739 to 850dd15 Compare July 16, 2025 10:23

friendlymatthew commented Jul 16, 2025

Hi, round-tripping variant objects would fail because we were writing out object field names in sorted order. This is fine for variants with a sorted dictionary, but anything else will produce inconsistent results.

What we want instead is to write out the object fields in insertion order. We can do this by passing in the VariantMetadata ahead of time, so that we know the insertion order of the field names.

I'm sure we can improve the implementation, but the feature is working
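
To see why the append order matters, here is a small sketch that models the metadata dictionary as a plain list in which a name's id is its position. `build_dictionary` and the two `rebuild_*` helpers are hypothetical, and the model assumes ids are handed out in first-seen order.

```rust
/// Assigns dictionary ids in first-seen order: the id of a name is its index.
/// Hypothetical stand-in for a metadata builder.
fn build_dictionary<'a>(names: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut dict: Vec<&'a str> = Vec::new();
    for name in names {
        if !dict.contains(&name) {
            dict.push(name);
        }
    }
    dict
}

/// Re-appends the fields in sorted-name order.
fn rebuild_sorted<'a>(dict: &[&'a str]) -> Vec<&'a str> {
    let mut names = dict.to_vec();
    names.sort_unstable();
    build_dictionary(names)
}

/// Re-appends the fields following the original dictionary order.
fn rebuild_in_dictionary_order<'a>(dict: &[&'a str]) -> Vec<&'a str> {
    build_dictionary(dict.iter().copied())
}
```

With an original dictionary built from "c", "b", "a", the sorted rebuild assigns different ids to every name except "b", while the dictionary-order rebuild reproduces the original. For a dictionary that was already sorted, both strategies agree.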

cc @alamb

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch from 850dd15 to e77576c Compare July 16, 2025 14:54
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/append-complex branch from e77576c to 1314f5c Compare July 16, 2025 14:54
@friendlymatthew

I think this is the most important Variant PR at the moment from my perspective as we will likely need this for many of the other things we are discussing.

Thus I plan to help with this one first and then move on to more shredding related ones

@friendlymatthew please let me know if you get stuck or need some help. Otherwise I'll wait for a ping from you

Hi, I'm curious to get your thoughts.

Sorry for the awolness, I am fighting jet lag/conferences

Cc @scovich

alamb commented Jul 16, 2025

Sorry for the awolness, I am fighting jet lag/conferences

No worries at all -- I think we are all struggling to find the time while juggling many other things. It is totally fine -- and the nature of open source work

@alamb alamb left a comment

Thank you @friendlymatthew -- I think there are several things to improve in this PR (such as dbg! and the need to sort the fields for appending) -- however in order to keep things moving I will merge this one as is and then make a follow on PR or ticket to improve

@@ -310,6 +376,8 @@ impl MetadataBuilder {
fn upsert_field_name(&mut self, field_name: &str) -> u32 {
let (id, new_entry) = self.field_names.insert_full(field_name.to_string());

dbg!(new_entry);

I think this is left over


let mut object_builder = self.new_object(metadata_builder);

// first add all object fields that exist in metadata builder

I see the code here is basically trying to preserve the order so that the variants compare deeply equal

Rather than trying to preserve the byte-for-byte equality here, I think a better idea could be to implement a Variant::eq_value type method that compares the Variants for logical equality, rather than byte-for-byte equality.

I am thinking we can merge this PR and then I will make a follow on PR for that idea

+1 for the logical equality function on variants. This will be needed in tests where the result and expected variants come from different sources (and thus might have a mismatch in metadata or ordering).
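
A minimal sketch of what such a logical comparison could look like on a toy value type. `ToyVariant` and `logical_eq` are hypothetical names, not the proposed `Variant::eq_value`, and the object case assumes field names are unique within each object.

```rust
// Hypothetical stand-in for a decoded variant value.
#[derive(Debug, Clone)]
enum ToyVariant {
    Int(i64),
    List(Vec<ToyVariant>),
    Object(Vec<(String, ToyVariant)>),
}

/// Compares two values by content: list order matters, object field order does not.
/// Assumes field names are unique within each object.
fn logical_eq(a: &ToyVariant, b: &ToyVariant) -> bool {
    match (a, b) {
        (ToyVariant::Int(x), ToyVariant::Int(y)) => x == y,
        (ToyVariant::List(xs), ToyVariant::List(ys)) => {
            xs.len() == ys.len() && xs.iter().zip(ys).all(|(x, y)| logical_eq(x, y))
        }
        (ToyVariant::Object(xs), ToyVariant::Object(ys)) => {
            // Same number of fields, and every field on the left has a
            // same-named, logically equal field on the right.
            xs.len() == ys.len()
                && xs.iter().all(|(name, x)| {
                    ys.iter().any(|(other, y)| name == other && logical_eq(x, y))
                })
        }
        _ => false,
    }
}
```

The nested lookup is quadratic in the number of fields, which is acceptable for test assertions but would want a map or a sorted merge for large objects.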

alamb commented Jul 16, 2025

Let's get this in and keep iterating

alamb commented Jul 16, 2025

(BTW @friendlymatthew if you are willing, if you make PRs from your own personal fork, rather than the pydantic one, I can push commits directly to the PR, which might speed up development)

@alamb alamb merged commit 7af62d5 into apache:main Jul 16, 2025
12 checks passed
@friendlymatthew

Thank you @friendlymatthew -- I think there are several things to improve in this PR (such as dbg! and the need to sort the fields for appending) -- however in order to keep things moving I will merge this one as is and then make a follow on PR or ticket to improve

I will have time for the next hour so I would like to clean up

alamb commented Jul 16, 2025

I will have time for the next hour so I would like to clean up

Awesome -- maybe a follow on PR!

alamb commented Jul 16, 2025

What I suggest doing is looking into changing the PartialEq implementation for Variant to implement "logical" equality rather than physical equality

So rather than the #[derive]d implementation of PartialEq, which will check equality of the underlying bytes, recursively check the variant and its children

I am not 100% sure what the semantics should be for field name order

alamb commented Jul 16, 2025

I think according to my reading of https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2, the field order can vary from object to object so when comparing two variants for equality the field order should not be taken into account

alamb commented Jul 16, 2025

Does that make sense?

@friendlymatthew

(BTW @friendlymatthew if you are willing, if you make PRs from your own personal fork, rather than the pydantic one, I can push commits directly to the PR, which might speed up development)

Ah sounds good. I haven't checked recently, but I noticed pushing up PRs from my fork requires approval to run CI which was kind of a pain

@friendlymatthew

the field order can vary from object to object so when comparing two variants for equality the field order should not be taken into account

Hm, I assume you are talking about

The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary.
However, the actual value entries do not need to be in any particular order.
This implies that the field_offset values may not be monotonically increasing.
For example, for the following object:

{
  "c": 3,
  "b": 2,
  "a": 1
}

The field_id list must be [<id for key "a">, <id for key "b">, <id for key "c">], in lexicographical order.
The field_offset list must be [<offset for value 1>, <offset for value 2>, <offset for value 3>, <last offset>].
The value list can be in any order.
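
The quoted rule can be sketched as follows. The `Field` struct, the `object_header` helper and the concrete ids and offsets used in the usage note are assumptions for illustration only; the spec excerpt does not give them.

```rust
/// One object field as written: its dictionary id, its name, and the byte
/// offset at which its value starts inside the value area. Hypothetical type.
struct Field<'a> {
    id: u32,
    name: &'a str,
    offset: u32,
}

/// Returns (field_ids, field_offsets), both ordered by field name.
/// `total_len` is appended as the final offset.
fn object_header(mut fields: Vec<Field<'_>>, total_len: u32) -> (Vec<u32>, Vec<u32>) {
    // Ids and offsets follow the lexicographical order of the names,
    // regardless of where the value bytes physically sit.
    fields.sort_by(|a, b| a.name.cmp(b.name));
    let ids = fields.iter().map(|f| f.id).collect();
    let mut offsets: Vec<u32> = fields.iter().map(|f| f.offset).collect();
    offsets.push(total_len);
    (ids, offsets)
}
```

If "c", "b", "a" were given ids 0, 1, 2 and their values were written in that order at offsets 0, 2, 4, the header would list ids 2, 1, 0 and offsets 4, 2, 0 followed by the total length, which shows how the offsets can end up non-monotonic.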

alamb added a commit that referenced this pull request Jul 16, 2025
# Which issue does this PR close?

- Closes #7893

# What changes are included in this PR?

In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to
create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define
a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a
`VariantArray`. In the future, this would also cast the values to a
specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented
yet.
- Only top-level object/list access is supported. It panics on finding a
nested object/list. Needs #7914
to fix this.
- Perf is a TODO.

# Are these changes tested?

Some basic unit tests are added.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

alamb commented Jul 16, 2025

Ah sounds good. I haven't checked recently, but I noticed pushing up PRs from my fork requires approval to run CI which was kind of a pain

This happens for first-time contributors. Once we have merged a PR from you, CI will run automatically

I think since we have now merged several, your CI runs should start automatically

scovich commented Jul 17, 2025

I think according to my reading of https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2, the field order can vary from object to object so when comparing two variants for equality the field order should not be taken into account

I thought the fields (as in, the field id array) did need to be in (lexical) order? The corresponding value bytes can be out of order but hopefully their physical placement doesn't change the logical equality test.

One observation tho -- an unordered metadata dictionary can have duplicate field ids. Two fields have the same name if their field ids are the same, but field ids differing does NOT prove that the fields have different names.
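
A tiny sketch of that observation, modelling the dictionary as a slice of names indexed by field id. `same_field_name` is a hypothetical helper, not part of the crate.

```rust
/// Returns whether two field ids resolve to the same name, or None if either
/// id is out of range. Comparing the ids alone is not enough when the
/// dictionary may contain the same name more than once.
fn same_field_name(dictionary: &[&str], id_a: usize, id_b: usize) -> Option<bool> {
    let a = dictionary.get(id_a)?;
    let b = dictionary.get(id_b)?;
    Some(a == b)
}
```

With a dictionary of "a", "b", "a", ids 0 and 2 differ but name the same field, so an equality check has to compare the resolved names.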

alamb commented Jul 17, 2025

One observation tho -- an unordered metadata dictionary can have duplicate field ids. Two fields have the same name if their field ids are the same, but field ids differing does NOT prove that the fields have different names.

I think @friendlymatthew correctly implemented these semantics in #7943

Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Panic when appending nested objects to VariantBuilder
5 participants