ENH: np.unique: support hash based unique for string dtype #28767

Open
wants to merge 83 commits into
base: main

Conversation

@math-hiyoko math-hiyoko commented Apr 18, 2025

Description

This PR adds hash-based unique-value extraction for the NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
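
As a rough illustration of the difference between the two strategies (a pure-Python sketch, not the C++ code in this PR; the function names `unique_sorted` and `unique_hashed` are made up for the example), a sort-based unique needs O(n log n) comparisons, while a hash-based unique does one expected O(1) set lookup per element:

```python
def unique_sorted(values):
    """Sort-based unique: sort first, then drop adjacent duplicates."""
    out = []
    for v in sorted(values):
        if not out or out[-1] != v:
            out.append(v)
    return out


def unique_hashed(values):
    """Hash-based unique: keep a value the first time its hash-set lookup misses."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


data = ["b", "a", "b", "c", "a"]
print(unique_sorted(data))  # sorted order
print(unique_hashed(data))  # first-seen order in this sketch
```

Note that the hashed variant here returns values in first-seen order; whether and how the real implementation orders its output is not something this sketch tries to model.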

Benchmark Results

The following benchmark demonstrates a significant performance improvement from the new implementation.
The test scenario (an array of 1 billion strings) follows the experimental setup described in #26018 (comment).

import random
import string
import time

import numpy as np
import polars as pl

chars = string.ascii_letters + string.digits
arr = np.array(
    [
        ''.join(random.choices(chars, k=random.randint(5, 10)))
        for _ in range(1_000)
    ] * 1_000_000,
    dtype='T',
)
np.random.shuffle(arr)

time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr)))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

Result

unique count (hash based):  1000
35.127 secs
unique count (numpy main):  1000
498.011 secs
unique count (polars):  1000
74.023 secs
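
For a quick read of these numbers, the relative speedups can be computed directly from the timings above (a small sketch; the dictionary keys are just labels copied from the output):

```python
# Timings in seconds, copied from the result listing above.
timings = {"hash based": 35.127, "numpy main": 498.011, "polars": 74.023}

baseline = timings["hash based"]
speedups = {name: secs / baseline for name, secs in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the hash-based time")
```

On this single run, NumPy main takes about 14.2x as long as the hash-based implementation, and Polars about 2.1x as long.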

close #28364

@math-hiyoko math-hiyoko marked this pull request as draft April 18, 2025 11:14
@math-hiyoko
Author

@seberg @ngoldbaum
I've addressed all comments received so far.

Member

@ngoldbaum ngoldbaum left a comment

I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation: https://github.com/lcn2/fnv.

I didn't look closely at the rest after I noticed this issue.

template<typename T>
// function to calculate the hash of a string
template <typename T>
size_t str_hash(const T *str, npy_intp num_chars) {
Member

Note that npy_ucs4 is four bytes and fnv-1a operates on octets of data (i.e. individual bytes).

The reference implementation takes a void * pointer and immediately casts it to unsigned char *. You could do something similar.

We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.
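
To make the octet point concrete, here is a minimal Python sketch of FNV-1a using the published 32-bit and 64-bit offset basis and prime constants. The helper names (`fnv1a`, `hash_ucs4`, `fnv1a_per_code_unit`) are hypothetical, and the per-code-unit variant is only an illustration of the kind of deviation described above, not a copy of the code in this PR:

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a(data: bytes, offset_basis=FNV64_OFFSET_BASIS, prime=FNV64_PRIME, bits=64) -> int:
    """FNV-1a: for each octet, XOR it into the hash, then multiply by the prime."""
    mask = (1 << bits) - 1
    h = offset_basis
    for octet in data:
        h ^= octet
        h = (h * prime) & mask
    return h


def hash_ucs4(text: str) -> int:
    """Hash UCS4 text octet by octet, like casting the buffer to unsigned char *."""
    return fnv1a(text.encode("utf-32-le"))


def fnv1a_per_code_unit(text: str) -> int:
    """Illustrative non-conforming variant: one round per 4-byte code unit."""
    mask = (1 << 64) - 1
    h = FNV64_OFFSET_BASIS
    for ch in text:
        h ^= ord(ch)
        h = (h * FNV64_PRIME) & mask
    return h


print(hex(hash_ucs4("a")))            # four rounds, one per octet
print(hex(fnv1a_per_code_unit("a")))  # one round, a different value
```

Both variants still work as hash functions, but only the octet-wise one matches what the reference implementation computes for the same bytes.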

Author

Thanks for the suggestion.
If we go the submodule route for lcn2/fnv, where in the NumPy tree would you like the submodule to live? Let me know your preferred path and I’ll add it accordingly.

Author

Vendoring would work for me as well (e.g. copying a single header the way stlab does in adobe/fnv.hpp).
If we go with a vendored file instead of a submodule, which path in the NumPy tree would you prefer for it to live?

Member

@ngoldbaum ngoldbaum May 26, 2025

I would probably vendor it as a git submodule in numpy/_core/src/common, next to pythoncapi-compat. Maybe take a look at the PR adding pythoncapi-compat as a vendored dependency to see how that header-only vendored dependency is integrated into the numpy build system.

I only have a slight preference for a git submodule, since it makes updating the vendored code marginally easier. If you feel like it's easier to structure it as just a copy/pasted new header file (that includes a note about the original copyright and license), I'd probably just put that header in numpy/_core/src/multiarray.

Author

@math-hiyoko math-hiyoko May 27, 2025

Tried the submodule approach, but ran into two blockers:

  • the reference source expects to be built with a plain make step that is not available on every platform/toolchain we target.
  • the code still uses legacy BSD typedefs such as u_int32_t; that builds fine on Linux/macOS but fails to compile on WASM, Windows/Clang, etc.

Because of these two issues the submodule route looks impractical for NumPy at the moment.

Member

Fair enough, thanks for checking. Vendoring a somewhat adapted version makes sense.

Projects
Status: Pending authors' response
Development

Successfully merging this pull request may close these issues.

np.unique: support string dtypes
3 participants