feat(sql): implement mode(Ø/T/L/K/S) group by functions #5623

Draft
wants to merge 6 commits into
base: master
from

Conversation

nwoolmer
Contributor

@nwoolmer nwoolmer commented Apr 24, 2025

Closes #5580

mode() returns the most frequently occurring value of a column within each group. Without mode(), you must first run a count() for every distinct value, then sort the result set and take the value with the highest count, which requires complex subqueries.

mode() has a variety of uses: for example, finding the symbol each trader trades most often, or identifying the most frequent error state for a sensor.
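
To make the semantics concrete, here is a minimal sketch of a per-group mode computed by counting occurrences and tracking the running maximum. It is not the code in this PR: the class name `ModePerGroupSketch`, the method `modePerGroup`, and the sample data are hypothetical, it uses plain `java.util.HashMap` rather than the group by maps added here, and the tie-breaking rule (the first value to reach the top count wins) is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class ModePerGroupSketch {

    // For each group key, returns the value that occurs most often.
    // groups[i] is the group of row i and values[i] is its value.
    // Assumption: on a tie, the value that reached the top count first is kept.
    static Map<String, String> modePerGroup(String[] groups, String[] values) {
        Map<String, Map<String, Long>> counts = new HashMap<>(); // group -> value -> count
        Map<String, Long> bestCount = new HashMap<>();           // group -> highest count so far
        Map<String, String> mode = new HashMap<>();              // group -> value with that count

        for (int i = 0; i < groups.length; i++) {
            String group = groups[i];
            String value = values[i];
            long count = counts
                    .computeIfAbsent(group, k -> new HashMap<>())
                    .merge(value, 1L, Long::sum);
            // Counts only grow, so the mode can only change when a count
            // strictly exceeds the best one seen for this group.
            if (count > bestCount.getOrDefault(group, 0L)) {
                bestCount.put(group, count);
                mode.put(group, value);
            }
        }
        return mode;
    }

    public static void main(String[] args) {
        // Hypothetical data: the most frequently traded symbol per trader.
        String[] traders = {"alice", "alice", "bob", "alice", "bob", "bob"};
        String[] symbols = {"BTC-USD", "ETH-USD", "ETH-USD", "BTC-USD", "SOL-USD", "ETH-USD"};
        System.out.println(modePerGroup(traders, symbols));
    }
}
```

Tracking the maximum while counting avoids the separate sort step that the count()-based workaround needs.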

todo:

  • implement single-threaded versions of the functions
  • implement parallel mode(BOOLEAN)
  • add new GroupByLongLongHashMap
  • add benchmark for new map
  • add tests for new map
  • make mode(LONG) run in parallel using the new off-heap map
  • make mode(VARCHAR) run in parallel using the new off-heap map
  • make mode(STRING) run in parallel using the new off-heap map
  • make mode(SYMBOL) run in parallel using the new off-heap map
  • add tests for all five functions
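
As an illustration of why a long-to-long map is useful for mode(LONG), the sketch below counts values in an open-addressing map and tracks the most frequent one. It is an assumption-laden stand-in, not the GroupByLongLongHashMap from this PR: it stores data in on-heap `long[]` arrays rather than off-heap memory, and the class name `LongCountMapSketch`, the `Long.MIN_VALUE` empty-slot sentinel, the hash function, and the resize policy are all hypothetical choices.

```java
import java.util.Arrays;

final class LongCountMapSketch {
    // Assumption: Long.MIN_VALUE marks an empty slot, so it cannot be stored as a key.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys;
    private long[] counts;
    private int mask;
    private int size;

    LongCountMapSketch() {
        allocate(16); // capacity is always a power of two
    }

    // Adds one to the count stored for key and returns the new count.
    long increment(long key) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("sentinel key is not supported");
        }
        if (size * 2 >= keys.length) {
            rehash(); // keep the table at most half full so probing terminates
        }
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        return ++counts[i];
    }

    private int index(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // multiplicative hash to spread the bits
        return (int) (h >>> 32) & mask;
    }

    private void allocate(int capacity) {
        keys = new long[capacity];
        Arrays.fill(keys, EMPTY);
        counts = new long[capacity];
        mask = capacity - 1;
    }

    private void rehash() {
        long[] oldKeys = keys;
        long[] oldCounts = counts;
        allocate(oldKeys.length * 2);
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) {
                int i = index(oldKeys[j]);
                while (keys[i] != EMPTY) {
                    i = (i + 1) & mask;
                }
                keys[i] = oldKeys[j];
                counts[i] = oldCounts[j];
            }
        }
    }

    // Returns the most frequent value; values must be non-empty.
    // Assumption: on a tie, the value that reached the top count first wins.
    static long mode(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        LongCountMapSketch map = new LongCountMapSketch();
        long best = values[0];
        long bestCount = 0;
        for (long v : values) {
            long c = map.increment(v);
            if (c > bestCount) {
                bestCount = c;
                best = v;
            }
        }
        return best;
    }
}
```

Keeping keys and counts in flat primitive arrays avoids boxing every value; the map in this PR is described as off-heap and passable by pointer, which this on-heap sketch does not attempt to reproduce.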

Benchmark

For GroupByLongHashSet, GroupByLongLongHashMap, LongLongHashMap: (not isolated)

Benchmark (size) Mode Cnt Score Error Units
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 5000 avgt 3 8.506 1.401 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 50000 avgt 3 14.323 2.399 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 500000 avgt 3 31.933 4.347 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 5000000 avgt 3 52.546 16.526 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 5000 avgt 3 14.065 0.740 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 50000 avgt 3 22.420 3.034 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 500000 avgt 3 56.049 12.629 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 5000000 avgt 3 80.433 8.899 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 5000 avgt 3 16.445 0.487 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 50000 avgt 3 20.678 1.930 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 500000 avgt 3 53.339 21.817 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 5000000 avgt 3 71.944 6.283 ns/op

For GroupByUtf8SequenceLongHashMap and Utf8SequenceLongHashMap: (not isolated)

Benchmark (size) Mode Cnt Score Error Units
GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 10 avgt 3 1078.801 ± 90.283 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 50 avgt 3 2287.050 ± 213.875 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 200 avgt 3 7145.411 ± 86.704 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 1000 avgt 3 34169.157 ± 2721.831 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 10 avgt 3 778.603 ± 51.911 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 50 avgt 3 1972.678 ± 376.840 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 200 avgt 3 6501.634 ± 133.595 ns/op
GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 1000 avgt 3 29782.794 ± 2519.313 ns/op

For GroupByCharSequenceLongHashMap and CharSequenceLongHashMap: (not isolated)

Benchmark (size) Mode Cnt Score Error Units
GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 10 avgt 3 503.610 ± 415.333 ns/op
GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 50 avgt 3 756.659 ± 178.424 ns/op
GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 200 avgt 3 2039.515 ± 222.997 ns/op
GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 1000 avgt 3 8263.961 ± 811.661 ns/op
GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 10 avgt 3 583.044 ± 105.280 ns/op
GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 50 avgt 3 1159.440 ± 954.187 ns/op
GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 200 avgt 3 3150.716 ± 132.363 ns/op
GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 1000 avgt 3 11260.364 ± 256.576 ns/op

@nwoolmer nwoolmer added the SQL Issues or changes relating to SQL execution label Apr 24, 2025
@nwoolmer nwoolmer changed the title implement mode(Ø/T/L/K/S) group by function feat(sql): implement mode(Ø/T/L/K/S) group by function Apr 24, 2025
… off-heap hashmap, passable by pointer, for use in group by queries.
@nwoolmer nwoolmer changed the title feat(sql): implement mode(Ø/T/L/K/S) group by function feat(sql): implement mode(Ø/T/L/K/S) group by functions Apr 25, 2025
Labels
SQL Issues or changes relating to SQL execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mode() aggregate function
1 participant