ENH: np.unique: support hash based unique for string dtype #28767

math-hiyoko · 2025-04-18T11:14:36Z

Description

This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
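
To make the idea above concrete, here is a minimal pure-Python sketch contrasting sort-based and hash-based deduplication. It is not the C++ code in unique.cpp; the helper names `unique_sorted` and `unique_hashed` are hypothetical, and Python's built-in `set` stands in for the hash table.

```python
def unique_sorted(values):
    """Sort-based: sort all n inputs, then drop adjacent duplicates."""
    out = []
    for v in sorted(values):          # O(n log n) comparisons over every input
        if not out or out[-1] != v:
            out.append(v)
    return out


def unique_hashed(values):
    """Hash-based: one expected O(1) hash-table insertion per input."""
    seen = set()
    for v in values:                  # O(n) expected; duplicates are dropped here
        seen.add(v)
    # Sorting only the k unique items is cheap when k << n. It is done here
    # just so both helpers return the same list.
    return sorted(seen)


data = ["pear", "apple", "pear", "fig", "apple"]
print(unique_hashed(data) == unique_sorted(data))
```

When the number of distinct values is small relative to the input size, as in the benchmark below (1,000 distinct strings repeated to 1 billion entries), the hash-based variant avoids sorting the full input.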

Benchmark Results

The following benchmark shows the new implementation running about 14x faster than NumPy main (35.1 s versus 498.0 s) and about 2x faster than polars (74.0 s).
The test scenario (an array of 1 billion strings) follows the experimental setup described in #26018 (comment).

import random
import string
import time

import numpy as np
import polars as pl

chars = string.ascii_letters + string.digits
arr = np.array(
    [
        ''.join(random.choices(chars, k=random.randint(5, 10)))
        for _ in range(1_000)
    ] * 1_000_000,
    dtype='T',
)
np.random.shuffle(arr)

time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

Result

unique count (hash based):  1000
35.127 secs
unique count (numpy main):  1000
498.011 secs
unique count (polars):  1000
74.023 secs

Closes #28364

numpy/_core/src/multiarray/multiarraymodule.c

numpy/_core/src/multiarray/unique.cpp

math-hiyoko · 2025-05-04T07:53:26Z

@seberg @ngoldbaum
I've addressed all comments received so far.

ngoldbaum

I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation: https://github.com/lcn2/fnv.

I didn't look closely at the rest after I noticed this issue.

ngoldbaum · 2025-05-22T19:16:51Z

numpy/_core/src/multiarray/unique.cpp

-template<typename T>
+// function to caluculate the hash of a string
+template <typename T>
+size_t str_hash(const T *str, npy_intp num_chars) {


Note that npy_ucs4 is four bytes, while fnv-1a operates on octets of data (i.e. individual bytes).

The reference implementation takes a void * pointer and immediately casts it to unsigned char *. You could do something similar.

We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.
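
To illustrate the point about octets, here is a minimal Python sketch of 64-bit FNV-1a applied byte by byte. The constants are the standard 64-bit FNV offset basis and prime. The helper names are hypothetical, and `hash_per_code_point` is only an illustrative non-conforming variant, not necessarily what this PR did. Encoding with utf-32-le is an assumption standing in for a little-endian npy_ucs4 buffer.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR in one octet, then multiply by the prime, for every octet."""
    h = FNV64_OFFSET_BASIS
    for octet in data:
        h ^= octet
        h = (h * FNV64_PRIME) & MASK64   # keep the state at 64 bits
    return h


def hash_ucs4(text: str) -> int:
    """Hash a string through its 4-byte-per-code-point buffer, octet by octet."""
    # "utf-32-le" writes no BOM, so every code point becomes exactly 4 bytes.
    return fnv1a_64(text.encode("utf-32-le"))


def hash_per_code_point(text: str) -> int:
    """Non-conforming variant: one round per 4-byte code point, not per octet."""
    h = FNV64_OFFSET_BASIS
    for ch in text:
        h ^= ord(ch)
        h = (h * FNV64_PRIME) & MASK64
    return h


print(hex(hash_ucs4("a")), hex(hash_per_code_point("a")))
```

The two helpers produce different values for the same string, so only the octet-wise version can be checked against the reference implementation on the same byte buffer.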

math-hiyoko · 2025-05-25

Thanks for the suggestion.
If we go the submodule route for lcn2/fnv, where in the NumPy tree would you like the submodule to live? Let me know your preferred path and I’ll add it accordingly.

math-hiyoko · 2025-05-26

Vendoring would work for me as well (e.g. copying a single header the way stlab does in adobe/fnv.hpp).
If we go with a vendored file instead of a submodule, which path in the NumPy tree would you prefer for it to live?

ngoldbaum · 2025-05-26

I would probably vendor it as a git submodule in numpy/_core/src/common, next to pythoncapi-compat. Maybe take a look at the PR adding pythoncapi-compat as a vendored dependency to see how that header-only vendored dependency is integrated into the numpy build system.

I only have a slight preference for a git submodule, since it makes updating the vendored code marginally easier. If you feel like it's easier to structure it as just a copy/pasted new header file (that includes a note about the original copyright and license), I'd probably just put that header in numpy/_core/src/multiarray.

math-hiyoko · 2025-05-27

Tried the submodule approach, but ran into two blockers:

1. The reference source expects to be built with a plain make step that is not available on every platform/toolchain we target.

2. The code still uses legacy BSD typedefs such as u_int32_t; that builds fine on Linux/macOS but fails to compile on WASM, Windows/Clang, etc.

Because of these two issues, the submodule route looks impractical for NumPy at the moment.

ngoldbaum · 2025-05-27

Fair enough, thanks for checking. Vendoring a somewhat adapted version makes sense.

math-hiyoko added 13 commits April 16, 2025 01:20

Support NPY_STRING, NPY_UNICODE

f620f3b

unique for NPY_STRING and NPY_UNICODE

20ccefe

fix construct array

38626b9

remove unneccessary include

56bd858

refactor

f79736a

refactoring

c4e5438

comment

7c51049

feature: unique for NPY_VSTRING

bd70552

refactoring

cc8ece6

remove unneccessary include

f7b20a0

add test

d0170ed

add error message

dbb140f

linter

49ed502

math-hiyoko marked this pull request as draft April 18, 2025 11:14

github-actions bot added the 01 - Enhancement label Apr 18, 2025

math-hiyoko added 15 commits April 18, 2025 20:16

linter

0238cee

reserve bucket

6905978

remove emoji from testcase

2fc1378

fix testcase

1ad6d6c

remove error

b478e15

fix testcase

95bc405

fix testcase name

3f1811b

use basic_string

99e3662

fix testcase

b99542a

add ValueError

2589dd7

fix testcase

3f40cdc

fix memory error

68d5a7b

remove multibyte char

d38c3e3

refactoring

8cf2c63

add multibyte char

0165d6a

math-hiyoko added 7 commits April 29, 2025 21:59

FIX: include

aa0db48

FIX: cast

7a2892f

ENH: support equal_nan=False

896bcba

FIX: function equal

f1c1947

FIX: check the case if pack_status douesn't return NULL

f35123a

FIX: check the case if pack_status douesn't return NULL

e6ea015

FIX: stderr

ddff98f

seberg reviewed May 1, 2025

numpy/_core/src/multiarray/multiarraymodule.c (outdated)

math-hiyoko added 2 commits May 1, 2025 18:39

ENH: METH_VARARGS -> METH_FASTCALL

2758e27

FIX: log

a6dc86a

ngoldbaum reviewed May 2, 2025

numpy/_core/src/multiarray/unique.cpp (outdated)

math-hiyoko force-pushed the feature/#28364 branch from d9333fa to a6dc86a Compare May 3, 2025 13:19

math-hiyoko added 7 commits May 3, 2025 07:23

FIX: release allocator

9a936eb

FIX: comment

1e967ee

FIX: delete log

52c2326

ENH: implemented FNV-1a as hash function

6f18a43

bool -> npy_bool

2a1bd41

FIX: cast

8b632f2

34sec -> 35.1sec

a7bfc08

math-hiyoko and others added 2 commits May 21, 2025 22:11

Merge branch 'main' into feature/numpy#28364

dd0d8f5

fix: lint

9fc9ce3

ngoldbaum reviewed May 22, 2025

melissawm added this to NumPy first-time contributor PRs May 22, 2025

melissawm moved this to Pending authors' response in NumPy first-time contributor PRs May 22, 2025

math-hiyoko added 5 commits May 26, 2025 23:49

fix: cast using const void *

998ca00

enh: add submodule lcn2/fnv @ tag 5.0.6

639fcd5

fix: submodule fnv add

ea3612f

fix: use fnv hash

5405012

fix: build submodule

b6394ed
