[Variant] Support appending complex variants in VariantBuilder
#7914
parquet-variant/src/builder.rs
Outdated
    let variant = value.into();

    match variant {
        Variant::Object(obj) => self.append_object(obj),
Should we create a clone of this `append_value` function using the `try_append` functions?
Yes, I agree -- I think it would make a lot of sense to have a `try_append_value` that returns an error (rather than panic'ing) if an invalid variant is passed in. Then `append_value` would just call `self.try_append_value(v).unwrap()` 🤔
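The fallible/infallible split suggested above could look roughly like the sketch below. It is a toy model only: `Value`, `InvalidValue`, `ToyBuilder`, and `validate` are hypothetical names invented for illustration and are not the real `parquet-variant` types or signatures.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a variant value (not `parquet_variant::Variant`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
    /// Field names paired with values; duplicate names count as invalid here.
    Object(Vec<(String, Value)>),
}

/// Hypothetical error type describing why a value was rejected.
#[derive(Debug, PartialEq)]
struct InvalidValue(String);

/// Hypothetical builder that simply collects validated values.
#[derive(Default)]
struct ToyBuilder {
    values: Vec<Value>,
}

/// Recursively rejects objects that contain duplicate field names.
fn validate(value: &Value) -> Result<(), InvalidValue> {
    if let Value::Object(fields) = value {
        let mut seen = HashSet::new();
        for (name, child) in fields {
            if !seen.insert(name.as_str()) {
                return Err(InvalidValue(format!("duplicate field name: {}", name)));
            }
            validate(child)?;
        }
    }
    Ok(())
}

impl ToyBuilder {
    /// Fallible version: validates first and appends only on success.
    fn try_append_value(&mut self, value: Value) -> Result<(), InvalidValue> {
        validate(&value)?;
        self.values.push(value);
        Ok(())
    }

    /// Infallible version: a thin wrapper that panics on invalid input.
    fn append_value(&mut self, value: Value) {
        self.try_append_value(value).unwrap()
    }
}
```

With this shape all validation logic lives in one place, and the panicking method adds nothing beyond the `unwrap()`.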
Thank you @friendlymatthew and @abacef -- I think this is pretty close. I think @abacef's suggestions are spot on and I left some similar ones. Let me know what you think.
parquet-variant/src/builder.rs
Outdated
            "Nested values are handled specially by ObjectBuilder and ListBuilder"
        );
    }
    _ => unreachable!("Objects and lists must be appended using VariantBuilder::append_object and VariantBuilder::append_list"),
Why not call `self.append_variant_object` and `self.append_variant_list` here (to avoid the panic)?
Interesting PR -- it certainly uncovered some questions in my mind!
Out of curiosity, why only support this for the top-level builder and not the list/object builders?
parquet-variant/src/builder.rs
Outdated
            "Nested values are handled specially by ObjectBuilder and ListBuilder"
        );
    }
    _ => unreachable!("Objects and lists must be appended using VariantBuilder::append_object and VariantBuilder::append_list"),
This is an awkward division of responsibilities. What if we got rid of `ValueBuffer::append_variant` entirely, and just had the builder invoke the primitive `append_xxx` methods directly?
I guess the method is currently important because all three builders need to call it. I wonder more and more whether there might be some way to harmonize the implementations more, leveraging a combination of `ParentState` and `VariantBuilderExt`?
parquet-variant/src/builder.rs
Outdated
    /// Appends a [`VariantObject`] to the builder with full validation during iteration.
    ///
    /// Recursively validates all nested variants in the object during iteration.
    pub fn try_append_object<'m, 'v>(
I wish there were an efficient way to reuse code for fallible vs. infallible versions of this method...
In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply.
In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.
Current limitations:
- It can only return another VariantArray. Casts are not implemented yet.
- Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this.
- Perf is a TODO.
I think this PR unlocks several other APIs so I hope to spend some time helping it along later today
Hi, I am traveling today so I've been a bit preoccupied. I'll try to get this over the line during my flight.
I am starting to hack on this a bit
I think we can significantly simplify this PR in this way
It isn't quite ready to go -- I need to run now -- but I got enough of the sketch done to show what it would look like.
I'll mess around with it a bit more tomorrow too
Unify nested variant handling
I think this is the most important Variant PR at the moment from my perspective, as we will likely need this for many of the other things we are discussing. Thus I plan to help with this one first and then move on to more shredding related ones. @friendlymatthew please let me know if you get stuck or need some help. Otherwise I'll wait for a ping from you
Hi, round tripping variant objects would fail because we were writing out object field names by the sort order. This is fine for variants with a sorted dictionary, but everything else will produce inconsistent results. What we want instead is to write out the object fields by insertion order; we can do this by passing in the [...] I'm sure we can improve the implementation, but the feature is working. cc @alamb
Hi, I'm curious to get your thoughts. Sorry for the awolness, I am fighting jet lag/conferences. Cc @scovich
No worries at all -- I think we are all struggling to find the time while juggling many other things. It is totally fine -- and the nature of open source work
Thank you @friendlymatthew -- I think there are several things to improve in this PR (such as `dbg!` and the need to sort the fields for appending) -- however in order to keep things moving I will merge this one as is and then make a follow on PR or ticket to improve
@@ -310,6 +376,8 @@ impl MetadataBuilder {
    fn upsert_field_name(&mut self, field_name: &str) -> u32 {
        let (id, new_entry) = self.field_names.insert_full(field_name.to_string());

        dbg!(new_entry);
I think this is left over

        let mut object_builder = self.new_object(metadata_builder);

        // first add all object fields that exist in metadata builder
I see the code here is basically trying to preserve the order so that the variants compare deeply equal
Rather than trying to preserve the byte-for-byte equality here, I think a better idea could be to implement a `Variant::eq_value` type method that compares the Variants for logical equality, rather than byte-for-byte equality.
I am thinking we can merge this PR and then I will make a follow on PR for that idea
+1 for the logical equality function on variants. This will be needed in tests where the result and expected variants come from different sources (and thus might have a mismatch in metadata or ordering).
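As a rough illustration of what logical (rather than byte-for-byte) equality could mean, the sketch below compares objects by field name and ignores field order. The `Value` enum and the free function `eq_value` are hypothetical stand-ins written for this example; they are not the `parquet-variant` API, and the real semantics were still under discussion in this thread.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a decoded variant value.
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Text(String),
    List(Vec<Value>),
    /// Fields in whatever order the writer happened to emit them.
    Object(Vec<(String, Value)>),
}

/// Logical equality: list order matters, object field order does not.
fn eq_value(a: &Value, b: &Value) -> bool {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => x == y,
        (Value::Text(x), Value::Text(y)) => x == y,
        (Value::List(xs), Value::List(ys)) => {
            xs.len() == ys.len() && xs.iter().zip(ys).all(|(x, y)| eq_value(x, y))
        }
        (Value::Object(xs), Value::Object(ys)) => {
            // Index both sides by field name so physical order is irrelevant.
            let left: BTreeMap<&str, &Value> =
                xs.iter().map(|(k, v)| (k.as_str(), v)).collect();
            let right: BTreeMap<&str, &Value> =
                ys.iter().map(|(k, v)| (k.as_str(), v)).collect();
            // Objects with duplicate field names are treated as never equal.
            left.len() == xs.len()
                && right.len() == ys.len()
                && left.len() == right.len()
                && left
                    .iter()
                    .all(|(k, v)| right.get(k).map_or(false, |other| eq_value(v, other)))
        }
        _ => false,
    }
}
```

A comparison like this lets tests compare a result against an expected value built from a different source, which is the use case mentioned above.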
Let's get this in and keep iterating
(BTW @friendlymatthew if you are willing, if you make PRs from your own personal fork, rather than the pydantic one, I can push commits directly to the PR, which might speed up development)
I will have time for the next hour so I would like to clean up
Awesome -- maybe a follow on PR!
What I suggest doing is looking into changing the [...] So rather than the [...] I am not 100% sure what the semantics should be for field name order
I think according to my reading of https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2, the field order can vary from object to object, so when comparing two variants for equality the field order should not be taken into account
Does that make sense?
Ah sounds good. I haven't checked recently, but I noticed pushing up PRs from my fork requires approval to run CI which was kind of a pain
Hm, I assume you are talking about
# Which issue does this PR close?
- Closes #7893
# What changes are included in this PR?
In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply.
In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.
Current limitations:
- It can only return another VariantArray. Casts are not implemented yet.
- Only top-level object/list access is supported. It panics on finding a nested object/list. Needs #7914 to fix this.
- Perf is a TODO.
# Are these changes tested?
Some basic unit tests are added.
# Are there any user-facing changes?
Yes
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
This happens for first time contributors. Once we have merged a PR from you, CI will run automatically. I think since we have now merged several, your CI runs should go automatically
I thought the fields (as in, the field id array) did need to be in (lexical) order? The corresponding value bytes can be out of order but hopefully their physical placement doesn't change the logical equality test. One observation tho -- an unordered metadata dictionary can have duplicate field ids. Two fields have the same name if their field ids are the same, but field ids differing does NOT prove that the fields have different names.
I think @friendlymatthew correctly implemented these semantics in #7943 |
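The observation about duplicate dictionary entries can be made concrete with a small sketch. It assumes a simplified metadata dictionary in which a field id is just an index into a list of names; `Dictionary` and `same_field` are hypothetical names for illustration, not the types used in #7943.

```rust
/// Hypothetical, simplified metadata dictionary: a field id is an index into `names`.
/// An unordered dictionary may contain the same name more than once.
struct Dictionary {
    names: Vec<String>,
}

impl Dictionary {
    /// Resolves a field id to its name, if the id is in range.
    fn name(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}

/// Two field ids denote the same field only if their resolved names are equal.
/// Comparing the ids alone would wrongly separate duplicate dictionary entries.
fn same_field(dict_a: &Dictionary, id_a: u32, dict_b: &Dictionary, id_b: u32) -> bool {
    match (dict_a.name(id_a), dict_b.name(id_b)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```

For a dictionary holding the names `b`, `a`, `b`, ids 0 and 2 differ but resolve to the same name, so a name-based comparison treats them as the same field while an id-based one would not.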
Which issue does this PR close?
Rationale for this change
When trying to append `VariantObject` or `VariantList`s directly on the `VariantBuilder`, it will panic.
Changes to the public API
`VariantBuilder` now has these additional methods:
- `append_object`, will panic if shallow validation fails or the object has duplicate field names
- `try_append_object`, will perform full validation on the object before appending
- `append_list`, will panic if shallow validation fails
- `try_append_list`, will perform full validation on the list before appending