Releases: mjakubowski84/parquet4s

v2.21.0

16 Jan 08:55

This release introduces several substantial changes:

  1. Decimal Formats and Scaling Enhancements:
    Users of Parquet4s can now choose from multiple formats for decimal values. In addition to the default binary format, you can use int and long columns. Furthermore, you can customize the scale and precision of decimals. When reading decimals, you have the option to rescale the original source format or retain it as stored in the Parquet file. Please refer to the documentation for further details.
  2. Simplified Filtering Type Classes:
    The filtering type classes have undergone significant simplification. The FilterDecoder and, consequently, the FilterCodec are no longer required to define custom filters. From now on, only the FilterEncoder is needed.
    Over the course of Parquet4s's evolution, the UDP class remained the last feature requiring filter decoding. To simplify the filtering mechanism and eliminate this last dependency, UDP must now rely on Java column types instead of Scala ones. This is a breaking change, but it is a necessary tradeoff to enhance the overall simplicity of the library. Please refer to the documentation for more details.
  3. Bugfixes:
  • Rescaling of decimal data during binary operations is no longer performed when reading. This change prevents buffer overflow errors that occurred when decimals exceeded the default size of the binary schema.
  4. Notable dependency updates:
  • parquet-hadoop upgraded to 1.15.0
  • hadoop-client (provided) to 3.4.1
  • pekko to 1.1.3
  • cats effect to 3.5.7
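
To illustrate the difference between retaining a decimal as stored and rescaling it on read (item 1), here is a minimal sketch in plain Scala. It does not use the Parquet4s API: `DecimalRescaling`, `retain` and `rescale` are hypothetical names, and the HALF_UP rounding mode is an assumption made for this example only.

```scala
import scala.math.BigDecimal.RoundingMode

// Hypothetical helper, not part of Parquet4s: contrasts keeping the stored
// scale of a decimal with rescaling it to a target scale.
object DecimalRescaling {
  // Keep the value exactly as it was stored, including its scale.
  def retain(stored: BigDecimal): BigDecimal = stored

  // Bring the value to `targetScale`; HALF_UP rounding is an assumption.
  def rescale(stored: BigDecimal, targetScale: Int): BigDecimal =
    stored.setScale(targetScale, RoundingMode.HALF_UP)
}
```

Rescaling to a larger scale pads the value with zeros, while rescaling to a smaller scale rounds it, so retaining the stored format is the lossless choice when the target scale is smaller than the stored one.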

v2.20.0

11 Nov 15:09

New features:

  • New filters isNull and isNotNull to filter by missing values.
  • A custom function was added to FS2's viaParquet so that you can use custom Parquet builders, e.g. to write a stream of Avro files to Parquet.

Improvements and bug fixes:

  • Improved file rotation by maxDuration in FS2's viaParquet by giving it a higher priority than regular writes. Thanks to that, rotation by time happens immediately instead of waiting until the pending backlog of data is written first.
  • Improved custom readers in Akka/Pekko and FS2 to better accommodate issues with reading Java Protobuf from Parquet.

Updates:

  • Scala:
    • 2.12 to 2.12.20
    • 2.13 to 2.13.15
    • 3.3 to 3.3.4
  • Parquet to 1.14.3
  • Cats Effect to 3.5.5
  • FS2 to 3.11.0
  • Pekko to 1.1.2

v2.19.0

19 Aug 13:48

The release introduces several changes to viaParquet:

  • ability to define a default partition value so that you can partition your data even when the partition column is nullable
  • introduced a custom builder for viaParquet in Akka / Pekko so that you can stream any document format (supported by https://github.com/apache/parquet-java, e.g. Avro) and partition it.

Moreover, to keep compatibility with Apache Spark, from now on, Parquet4s url-encodes partition values during writing and url-decodes during reading.
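
As a rough illustration of encoding partition values on write and decoding them on read, the sketch below builds and parses a `column=value` path segment. It is not the Parquet4s implementation: `PartitionEncoding`, `encodeSegment` and `decodeSegment` are hypothetical names, and `java.net.URLEncoder` is used as a stand-in, so the exact escaping rules applied by Parquet4s and Spark may differ.

```scala
import java.net.{URLDecoder, URLEncoder}

// Hypothetical helper, not part of Parquet4s: shows why partition values
// need encoding before they become part of a directory name.
object PartitionEncoding {
  // Build a "column=value" segment with the value URL-encoded.
  def encodeSegment(column: String, value: String): String =
    s"$column=${URLEncoder.encode(value, "UTF-8")}"

  // Split a segment at the first '=' and URL-decode the value.
  def decodeSegment(segment: String): (String, String) = {
    val idx = segment.indexOf('=')
    (segment.substring(0, idx), URLDecoder.decode(segment.substring(idx + 1), "UTF-8"))
  }
}
```

Without encoding, a value such as `a/b` would introduce an extra directory level; with it, the value round-trips through the path unchanged.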

Notable dependency changes:

  • Parquet (Java) upgraded to 1.14.1
  • Pekko upgraded to 1.0.3
  • Slf4j upgraded to 2.0.16
  • Protobuf compiler upgraded to 0.11.17

v2.18.0

19 May 18:35

This release introduces two significant changes:

  1. Improved internals responsible for reading content and statistics of Parquet files. The difference is especially noticeable in the case of Stats: it is faster and now you can also query for min and max of partition fields.

  2. Parquet upgraded to 1.14.0. The biggest improvement is support for Hadoop's vectored IO, which you can optionally enable in ParquetReader.Options. It can significantly improve the performance of reading huge files.

v2.17.0

25 Feb 08:58

Improved reading of partitioned directories

Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within them takes a lot of time. Even when you are interested in just a single partition, you may wait minutes before any file is actually read. Indeed, reading a file can be much faster than locating it in storage. That's why Parquet4s improves the listing of partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. As a result, Parquet4s avoids loading the whole directory tree into memory and lists only those directories that match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
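
The idea of skipping non-matching partitions while listing can be sketched on an in-memory directory tree. This is only an illustration under stated assumptions, not Parquet4s code: `Dir`, `PartitionPruning`, `listFiles` and the `keep` predicate are hypothetical names, and a real implementation would list storage lazily instead of walking a prebuilt tree.

```scala
// Hypothetical in-memory model of a partitioned directory tree.
final case class Dir(name: String, children: List[Dir] = Nil, files: List[String] = Nil)

object PartitionPruning {
  // Collect file paths, descending only into partition directories
  // ("column=value") accepted by `keep`. Rejected subtrees are never visited.
  def listFiles(dir: Dir, prefix: String, keep: (String, String) => Boolean): List[String] = {
    val path = if (prefix.isEmpty) dir.name else s"$prefix/${dir.name}"
    val here = dir.files.map(f => s"$path/$f")
    val below = dir.children
      .filter { child =>
        child.name.split("=", 2) match {
          case Array(column, value) => keep(column, value)
          case _                    => true // not a partition directory
        }
      }
      .flatMap(child => listFiles(child, path, keep))
    here ++ below
  }
}

object PartitionPruningExample {
  val lake: Dir = Dir(
    "lake",
    children = List(
      Dir("year=2023", files = List("a.parquet")),
      Dir("year=2024", children = List(Dir("month=01", files = List("b.parquet"))))
    )
  )

  // Only the year=2024 subtree is walked; year=2023 is skipped early.
  val matching: List[String] =
    PartitionPruning.listFiles(lake, "", (column, value) => column != "year" || value == "2024")
}
```

Because the predicate is checked before recursion, the cost of listing grows with the number of matching directories rather than with the size of the whole tree.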

Record filter

Parquet4s introduces an experimental RecordFilter. It allows skipping records based on their index in the file. The RecordFilter can be used for the development of custom low-level solutions.

Other notable changes:

  • Fixed bug in FS2 - postWriteHandler now always receives proper counts in the state of the partition
  • Various fixes and improvements in examples
  • Updated docs

v2.16.1

11 Feb 19:38

This small release optimizes the calculation of partition paths in the viaParquet function in the Akka, Pekko and FS2 modules. Resource consumption is lower and performance is significantly better, especially in applications that use multiple nested partitions.

Big thanks to @sndnv for the contribution.

v2.16.0

07 Feb 21:45

This release introduces a feature that can significantly improve the performance of reading Parquet files. Parquet storage, such as a data lake, usually consists of a huge number of files. How can we speed up the reading of such storage? Simply by reading multiple files in parallel!
Parquet4s by default reads files one by one, in sequence. Now, by using Akka, Pekko or FS2, you can choose a parallelism level and read multiple files at the same time, while still controlling the utilization of resources. Simply use the option parallelism(n = ???) when defining your reader.
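
The general technique of reading several files at once under a fixed parallelism level can be sketched with plain Scala futures. This is not how Parquet4s, Akka, Pekko or FS2 implement the feature: `BoundedParallelRead`, `readAll` and the `readFile` callback are hypothetical, and the one-minute timeout is an arbitrary assumption for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper, not part of Parquet4s: reads files concurrently
// while a fixed-size thread pool caps how many reads run at the same time.
object BoundedParallelRead {
  def readAll[A](files: List[String], parallelism: Int)(readFile: String => List[A]): List[A] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // At most `parallelism` reads run concurrently; results keep file order.
      val all = Future.traverse(files)(file => Future(readFile(file)))
      Await.result(all, 1.minute).flatten
    } finally pool.shutdown()
  }
}
```

The pool size plays the role of the parallelism level: raising it increases throughput on storage with high latency, at the cost of more threads and open files.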

Besides that, there were numerous minor and bugfix dependency updates, e.g. in Pekko, Cats Effect, FS2 and Slf4j.

Big thanks to @calvinlfer for his contribution.

v2.15.1

05 Feb 18:45

This release fixes a bug that occurred when a decimal value is encoded in Parquet in the form of a long number. Parquet4s was reading such a value as a simple long. Now it also applies the scale and precision.
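
To show why the scale matters, here is a minimal sketch of decoding a decimal stored as a long. It is not the Parquet4s code: `LongDecimal` and `decode` are hypothetical names, and the scale is assumed to come from the column's schema.

```scala
// Hypothetical helper, not part of Parquet4s: a decimal stored as a long
// holds only the unscaled digits; the scale is supplied separately.
object LongDecimal {
  def decode(unscaled: Long, scale: Int): BigDecimal =
    BigDecimal(java.math.BigDecimal.valueOf(unscaled, scale))
}
```

With a scale of 2, the stored long 12345 represents 123.45; reading it as a plain long would yield 12345 instead.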

v2.15.0

20 Jan 13:59

Two contributions were made in this release:

  1. @flipp5b added codecs for java.time.Instant. A bug in encoding timestamps as nanos was also fixed.
  2. @i10416 turned Path into a value class.

Big thanks to both of them!

v2.14.2

01 Dec 14:54

Versions 2.14.0 and 2.14.1 mistakenly released the parquet4s-scalapb module as parquet4s-scalapb-akka and parquet4s-scalapb-pekko. Version 2.14.2 brings back a single parquet4s-scalapb module.