Tags: parquet-go/parquet-go
simplify big endian compat (#182) Follow up to #164, this PR removes the `getOffset` function and moves all offset computation to build time constants. I also moved most use of `cpu.IsBigEndian` to using separate files with build tags for consistency and to keep the main code logic as clean as possible. In some cases, this might also help with inlining since functions contain less code. Let me know if you have any concerns about the change.
Sorting merge data corruption (#140) This adds a unit test that recreates the failure seen in #139. The merged row reader overwrites values in its internal buffers when filling them up during the `ReadRows` call, which corrupts the final `rows` returned to the caller. This adds a potential fix for the issue: by calling the `rowAllocator` capture method, we can copy in the byte arrays. However, this causes some significant hits to the benchmarks seen below. Is there a better solution to prevent this buffer corruption?
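The aliasing problem described in #140 can be illustrated with a minimal standard-library sketch. It does not use the parquet-go API: `reusingReader`, `next`, and `collect` are hypothetical names, and the copy step only stands in for what capturing values through an allocator is assumed to achieve.

```go
package main

import "fmt"

// reusingReader is a hypothetical reader that refills one internal buffer
// on every call, which is the pattern that leads to aliasing bugs.
type reusingReader struct {
	buf  []byte
	data []string
	pos  int
}

// next returns a view of the internal buffer. The view is only valid
// until the following call, because the buffer is overwritten in place.
func (r *reusingReader) next() ([]byte, bool) {
	if r.pos >= len(r.data) {
		return nil, false
	}
	r.buf = append(r.buf[:0], r.data[r.pos]...)
	r.pos++
	return r.buf, true
}

// collect gathers every value from the reader. When copyValues is false,
// the returned slices all alias the reader's buffer.
func collect(data []string, copyValues bool) [][]byte {
	r := &reusingReader{buf: make([]byte, 0, 16), data: data}
	var out [][]byte
	for {
		v, ok := r.next()
		if !ok {
			return out
		}
		if copyValues {
			v = append([]byte(nil), v...) // detach from the reader's buffer
		}
		out = append(out, v)
	}
}

func main() {
	data := []string{"aaa", "bbb", "ccc"}
	fmt.Printf("aliased: %q\n", collect(data, false))
	fmt.Printf("copied:  %q\n", collect(data, true))
}
```

The copy is what costs time and allocations, which is consistent with the benchmark regression the message mentions.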
Bug: Respect current page offset in reslice (#109) Currently, if you reslice a boolean page, it will not respect its internal bit offset into the first byte. The impact is that if you slice a boolean page multiple times on values that are not byte-aligned, the page will return misaligned values when reading.
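A minimal sketch of the idea behind #109, assuming a bit-packed page that stores values least-significant bit first. The `bitPage` type and its methods are hypothetical and are not the library's page implementation; the point is only that a reslice has to add the existing bit offset before computing the new one.

```go
package main

import "fmt"

// bitPage is a hypothetical bit-packed boolean page.
type bitPage struct {
	bits   []byte
	offset int // bit offset of the first value within bits[0]
	n      int // number of values in the page
}

// get returns the i-th value, taking the page's bit offset into account.
func (p bitPage) get(i int) bool {
	k := p.offset + i
	return p.bits[k/8]&(1<<(k%8)) != 0
}

// slice returns the values in [i, j). The current offset is added to i
// first; dropping it is what misaligns a page that is sliced repeatedly.
func (p bitPage) slice(i, j int) bitPage {
	start := p.offset + i
	return bitPage{bits: p.bits[start/8:], offset: start % 8, n: j - i}
}

func main() {
	page := bitPage{bits: []byte{0xB4, 0x03}, n: 16}
	twice := page.slice(3, 16).slice(2, 13) // starts at original value 5
	for i := 0; i < twice.n; i++ {
		fmt.Println(i, twice.get(i), page.get(i+5))
	}
}
```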
Return errors for missing page indices (#94) As a result of #84, we can now return a `nil` column or offset index when the `SkipPageIndex` option is set. Furthermore, if the page index is not skipped when opening a file, the return value for missing indices will be a zero-value struct instead of nil. This commit reconciles this inconsistency by returning `ErrMissingColumnIndex` and `ErrMissingOffsetIndex` respectively when either index is missing. Follow up to #85 (comment).
Migrate WriterTest to use parquet-cli (#52) Replace the deprecated parquet-tools command used in the WriterTest with parquet-cli. Validate the output of the "meta" and "pages" subcommands, which give most of the same coverage as parquet-tools. Ideally this could have included the "cat" command to validate each row. However, the command has some problems with parquet schemas that don't map cleanly onto Avro definitions, including many uses of repeated fields, so this doesn't reliably work. The PARQUET_GO_TEST_CLI environment variable provides an alternative way to configure the parquet-cli command for use in the test, and supports both an alternative executable name and prefix CLI arguments. This is necessary to support the method for running the CLI described in the parquet-cli README: `java -cp 'target/parquet-cli-1.12.3.jar:target/dependency/*' org.apache.parquet.cli.Main` See: https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md
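One way a test helper could interpret PARQUET_GO_TEST_CLI is sketched below. This is an assumption about the behaviour, not the test's actual code: `cliCommand` is a hypothetical helper, the default executable name `parquet` is assumed, and splitting on whitespace means shell-style quoting in the variable is not interpreted.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cliCommand splits the environment value into an executable name and
// prefix arguments, then appends the subcommand arguments. An empty
// value falls back to an assumed default executable name.
func cliCommand(env string, args ...string) (string, []string) {
	fields := strings.Fields(env)
	if len(fields) == 0 {
		return "parquet", args
	}
	return fields[0], append(fields[1:], args...)
}

func main() {
	name, args := cliCommand(os.Getenv("PARQUET_GO_TEST_CLI"), "meta", "file.parquet")
	// The result could be handed to exec.Command(name, args...).
	fmt.Println(name, args)
}
```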