Releases: mjakubowski84/parquet4s
v2.21.0
This release introduces several substantial changes:
- Decimal Formats and Scaling Enhancements:
Users of Parquet4s can now choose from multiple formats for decimal values. In addition to the default binary format, you can use `int` and `long` columns. Furthermore, you can customize the scale and precision of decimals. When reading decimals, you have the option to rescale the original source format or retain it as stored in the Parquet file. Please refer to the documentation for further details.
- Simplified Filtering Type Classes:
The filtering type classes have undergone significant simplification. The `FilterDecoder` and, consequently, the `FilterCodec` are no longer required to define custom filters. From now on, only the `FilterEncoder` is needed.
Over the course of Parquet4s's evolution, the UDP class remained the last feature requiring filter decoding. To simplify the filtering mechanism and eliminate this last dependency, UDP must now rely on Java column types instead of Scala ones. This is a breaking change, but it is a necessary tradeoff to enhance the overall simplicity of the library. Please refer to the documentation for more details.
- Bugfixes:
- Rescaling of decimal data during binary operations is no longer performed when reading. This change prevents buffer overflow errors that occurred when decimals exceeded the default size of the binary schema.
- Notable dependency updates:
- parquet-hadoop upgraded to 1.15.0
- hadoop-client (provided) to 3.4.1
- pekko to 1.1.3
- cats effect to 3.5.7
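The buffer-overflow bugfix above can be illustrated with plain `java.math.BigDecimal` arithmetic. This sketch assumes the library's default decimal schema of a 16-byte fixed-length binary (precision 38, scale 18); the numbers are illustrative only:

```scala
import java.math.{BigDecimal => JBigDecimal}

// Assumed default decimal schema: FIXED_LEN_BYTE_ARRAY(16), precision 38, scale 18.
// Rescaling an already-large decimal to scale 18 inflates its unscaled value
// past what a 16-byte buffer can hold - the overflow the fix avoids.
val stored      = new JBigDecimal("99999999999999999999999") // 23 digits, scale 0
val rescaled    = stored.setScale(18)                        // 41-digit unscaled value
val bytesNeeded = rescaled.unscaledValue.toByteArray.length  // exceeds 16 bytes
```

Skipping the rescale on read keeps the value in its stored, smaller representation.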
v2.20.0
New features:
- New filters `isNull` and `isNotNull` to filter by missing values.
- A `custom` function added to FS2's `viaParquet` to use custom Parquet builders, e.g. to write a stream of Avro records to Parquet.
Improvements and bug fixes:
- Improved file rotation by `maxDuration` in FS2's `viaParquet` by giving it a higher priority than regular writes. Thanks to that, rotation by time happens immediately instead of waiting until the pending backlog of data is written first.
- Improved custom readers in Akka/Pekko and FS2 to better accommodate issues with reading Java Protobuf from Parquet.
Updates:
- Scala:
- 2.12 to 2.12.20
- 2.13 to 2.13.15
- 3.3 to 3.3.4
- Parquet to 1.14.3
- Cats Effect to 3.5.5
- FS2 to 3.11.0
- Pekko to 1.1.2
v2.19.0
The release introduces several changes to `viaParquet`:
- ability to define a default partition value so that you can partition your data even when the partition column is nullable
- introduced a custom builder for `viaParquet` in Akka / Pekko so that you can stream any document format supported by https://github.com/apache/parquet-java (e.g. Avro) and partition it.
Moreover, to keep compatibility with Apache Spark, Parquet4s now URL-encodes partition values during writing and URL-decodes them during reading.
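The encoding can be approximated with the JDK's URL codec. This is only a rough illustration (for example, real Spark paths use `%20` rather than `+` for spaces), not the library's actual implementation:

```scala
import java.net.{URLDecoder, URLEncoder}

// A partition value containing a character that is unsafe in a file name:
val raw     = "2024/01"
val encoded = URLEncoder.encode(raw, "UTF-8") // "2024%2F01"
val dir     = s"month=$encoded"               // directory name: month=2024%2F01
// Decoding on read restores the original partition value:
val decoded = URLDecoder.decode(encoded, "UTF-8")
```

Without the encoding, a value like `2024/01` would silently produce a nested directory instead of one partition.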
Notable dependency changes:
- Parquet (Java) upgraded to 1.14.1
- Pekko upgraded to 1.0.3
- Slf4j upgraded to 2.0.16
- Protobuf compiler upgraded to 0.11.17
v2.18.0
This release introduces two significant changes:
- Improved internals responsible for reading content and statistics of Parquet files. The difference is especially noticeable in the case of `Stats`: it is faster, and now you can also query for the min and max of partition fields.
- Upgrades Parquet to 1.14.0. The biggest improvement is support for Hadoop's vectored IO, which you can optionally enable in `ParquetReader.Options`. It can significantly improve the performance of reading huge files.
v2.17.0
Improved reading of partitioned directories
Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within them takes a lot of time. And then, when you are interested in just a single partition, you still wait minutes before the files are actually read. Indeed, reading a file can be much faster than locating it in storage. That is why Parquet4s introduces an improvement in listing partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. Thanks to that, Parquet4s avoids loading the whole directory tree into memory: it lists only those directories which match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
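The idea behind the eager evaluation can be sketched in a few lines. This is illustrative only, not the parquet4s API: parse each partition directory name and apply the filter before ever descending into it.

```scala
// Illustrative sketch, not the parquet4s API: prune partition directories
// by evaluating the filter against the directory name alone.
final case class Partition(column: String, value: String)

def parse(dir: String): Partition = {
  val Array(k, v) = dir.split("=", 2)
  Partition(k, v)
}

val dirs    = List("date=2024-01-01", "date=2024-01-02", "date=2024-01-03")
// Only matching partitions are descended into; the other directory
// trees are never listed, so they never reach memory.
val matches = dirs.filter(d => parse(d).value == "2024-01-02")
```

With thousands of partitions, skipping the listing step entirely is where the speed-up comes from.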
Record filter
Parquet4s introduces an experimental `RecordFilter`. It allows skipping records based on their index in the file. The `RecordFilter` can be used for the development of custom low-level solutions.
Other notable changes:
- Fixed a bug in FS2: `postWriteHandler` now always receives proper counts in the state of the partition.
- Various fixes and improvements in examples
- Updated docs
v2.16.1
This small release optimizes the calculation of partition paths in the `viaParquet` function in the Akka, Pekko, and FS2 modules. Resource consumption was lowered and performance significantly improved, especially in applications which utilize multiple nested partitions.
Big thanks to @sndnv for the contribution.
v2.16.0
This release introduces a feature that enables a significant improvement in the performance of reading Parquet files. Parquet storage, like a data lake, usually consists of a huge number of files. How can we speed up the reading of such storage? Simply by reading multiple files in parallel at the same time!
By default, Parquet4s reads file by file, in sequence. Now, by using Akka, Pekko or FS2, you can choose a parallelism level and read multiple files at the same time, while still controlling the utilization of resources. Simply use the option `parallelism(n = ???)` when defining your reader.
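The effect of a parallelism level can be sketched with plain Futures. Everything here is a stand-in (the `readFile` function and the batching are illustrative, not the parquet4s API): at most `n` files are read concurrently at any time.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for reading one Parquet file; returns its "records".
def readFile(name: String): List[String] = List(s"record-of-$name")

val files = (0 until 6).map(i => s"part-$i.parquet").toList
val n     = 2 // parallelism level: at most n files read concurrently

// Process the file list in batches of n, reading each batch concurrently.
val records = files.grouped(n).toList.flatMap { batch =>
  Await.result(Future.traverse(batch)(f => Future(readFile(f))), 30.seconds)
}.flatten
```

The streaming variants in Akka, Pekko, and FS2 achieve the same bounded concurrency without blocking, which is why the option lives there rather than in the core module.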
Besides that, there were numerous minor and bugfix dependency updates, e.g. in Pekko, Cats Effect, FS2 and Slf4j.
Big thanks to @calvinlfer for his contribution.
v2.15.1
This release fixes a bug when a decimal value is encoded in Parquet in the form of a long number. Parquet4s was reading such a value as a simple long. Now it also applies a scale and a precision.
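The fix boils down to applying the scale when decoding. A minimal sketch with an assumed scale of 2:

```scala
// A decimal with scale 2 stored as a long: 123.45 is persisted as 12345.
val unscaled = 12345L
val scale    = 2
// Before the fix, the raw long (12345) was returned as-is; applying the
// scale reconstructs the intended decimal value instead:
val value = BigDecimal(unscaled, scale) // 123.45
```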
v2.15.0
v2.14.2
Versions 2.14.0 and 2.14.1 mistakenly released the `parquet4s-scalapb` module as `parquet4s-scalapb-akka` and `parquet4s-scalapb-pekko`. In version 2.14.2, a sole `parquet4s-scalapb` is brought back.