
ENH, API: New sorting mechanism for DType API #28516

Open
wants to merge 30 commits into
base: main
from

Conversation

MaanasArora
Contributor

@MaanasArora MaanasArora commented Mar 14, 2025

Resolves #26510.

Allocates the lock for the StringDType array before sort and releases after.

I noticed the sorting algorithms independently get the compare function from the descriptor, so I have created new helper functions in stringdtype/dtype.c but not sure if that's the right place. Changes have been made only in quicksort.cpp (but will add others later), so this is a draft but would appreciate feedback.
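To illustrate why taking the lock once per sort (rather than once per comparison) can matter, here is a toy Python model. It is not NumPy's C implementation: `CountingLock`, `sort_locking_per_compare`, and `sort_locking_once` are hypothetical names used only for this sketch.

```python
import functools
import threading


class CountingLock:
    """Lock wrapper that counts acquisitions (stand-in for the allocator lock)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False


def compare(a, b):
    # Three-way comparison: negative, zero, or positive.
    return (a > b) - (a < b)


def sort_locking_per_compare(values, lock):
    # Model of the old behaviour: every comparison takes the lock.
    def locked_compare(a, b):
        with lock:
            return compare(a, b)

    return sorted(values, key=functools.cmp_to_key(locked_compare))


def sort_locking_once(values, lock):
    # Model of the proposed behaviour: one acquisition around the whole sort.
    with lock:
        return sorted(values, key=functools.cmp_to_key(compare))


data = ["ccc", "a", "dddd", "bb", "a", "ccc"]
per_compare_lock = CountingLock()
once_lock = CountingLock()
result_per_compare = sort_locking_per_compare(data, per_compare_lock)
result_once = sort_locking_once(data, once_lock)
```

Both variants return the same ordering; only the number of lock acquisitions differs.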

Posting simple benchmarks in a comment below. Thank you for reviewing!

@MaanasArora
Contributor Author

Running this script on both branches:

import numpy
import random
import timeit

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))

produces on master:

2.3.0.dev0+git20250310.c275e25
3.481267879999905

and on this branch:

2.3.0.dev0+git20250314.0eb0c8e
1.1663875859994732
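For reference, the two timings above work out to roughly a 2.98x speedup. The arithmetic, using only the numbers reported in this comment:

```python
# Timings copied from the two runs above (seconds for 10000 calls each).
master_seconds = 3.481267879999905
branch_seconds = 1.1663875859994732
calls = 10000

speedup = master_seconds / branch_seconds
master_us_per_call = master_seconds / calls * 1e6
branch_us_per_call = branch_seconds / calls * 1e6

print(f"speedup: {speedup:.2f}x")
print(f"per call: {master_us_per_call:.0f} us -> {branch_us_per_call:.0f} us")
```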

@seberg
Member

seberg commented Mar 25, 2025

@ngoldbaum just to let you know, I'll let you decide on whether you want this. I am starting to think it is time to implement a get_sortfunction slot but that doesn't mean we can't do this in the mean-time as it's a pretty big speed advantage.

@ngoldbaum
Member

I am starting to think it is time to implement a get_sortfunction slot

Agreed. @MaanasArora as you said this was a draft, would you be up for a bigger refactor? IMO this functionality deserves support in the new DType system without relying on the ArrFuncs baggage.

but that doesn't mean we can't do this in the mean-time as it's a pretty big speed advantage.

Also agreed if you don't want to take this further.

@MaanasArora
Contributor Author

MaanasArora commented Mar 25, 2025

Yes, agreed, and willing to do a larger refactor! I actually began by considering special casing array sorting for strings overall, but wondered what the preferred approach would be. I think the sorting routines are not very flexible and could use an overhaul.

PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers.

@ngoldbaum
Member

ngoldbaum commented Mar 25, 2025

adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.?

Take a look at numpy/_core/src/multiarray/dtypemeta.h - I'm talking about adding a new entry in NPY_DType_Slots that handles comparison. Adding a new slot takes some ceremony - there are some magic constants doing offsets on structs elsewhere in NumPy that need to be updated alongside any changes - but there are comments that should hopefully guide you along your way.

We already have entries for getitem and setitem as well as the legacy arrfuncs slots. You'd be migrating from using NPY_DT_SLOTS(dtype)->f->compare to some new api like NPY_DT_COMPARE(dtype) that uses its own slot and allows per-dtype setup for sorting.

@seberg
Member

seberg commented Mar 25, 2025

PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers.

Yes, if you look around, you will find for example get_clear_loop and the way users can specify it. I think sorting would look similar.
(I.e. the core sort loop probably always works on a contiguous, aligned chunk of memory -- that part may be simpler, I would have to check. I assume it would work in-place, but we will also need an argsort.)
The get_sort_function -- or what we name it -- would then get the desired sort-kind passed in, that way we will also have an easier time of adding new sort methods in the future.
(We may want to provision for an ascending/descending flag even if we don't use it).
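A toy Python model of the shape described here may help. It is only a sketch: `ToyDType` and its methods are hypothetical and are not the C slot signatures, which were still being designed at this point in the thread.

```python
class ToyDType:
    """Hypothetical dtype with sort slots; not a NumPy class."""

    def get_sort_function(self, kind, descending=False):
        # `kind` is accepted so new sort kinds can be added later; this toy
        # ignores it because list.sort is always stable.
        # Any per-dtype setup would happen once here, not once per comparison.
        def sort_in_place(buffer):
            buffer.sort(reverse=descending)

        return sort_in_place

    def get_argsort_function(self, kind, descending=False):
        def argsort(buffer):
            # Returns the indices that would sort the buffer.
            return sorted(
                range(len(buffer)), key=buffer.__getitem__, reverse=descending
            )

        return argsort


dtype = ToyDType()
values = ["bb", "a", "ccc"]
indices = dtype.get_argsort_function("stable")(values)
indices_descending = dtype.get_argsort_function("stable", descending=True)(values)
dtype.get_sort_function("stable")(values)  # sorts `values` in place
```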


I see Nathan has some other pointers. I don't expect mine to be quite enough, so please ask!

@MaanasArora
Contributor Author

Thank you both, this was helpful! Starting to plan this now and will surely clarify if needed.

@MaanasArora
Contributor Author

I've added the slots and done some patchy work around using it, and the stringdtype integration. Looking into how to better relate to the array funcs. WIP, but hopefully this is in the right direction!

@ngoldbaum
Member

Yeah, I think you see we still have a lot of other functionality that we should have slots for. Definitely nonzero, at least.

@MaanasArora
Contributor Author

Yes, nonzero and some other arrayfuncs could definitely use a slot!

Thank you for the guidance--I've completed most of the missing pieces I think. I assume we would deprecate some of the earlier uses more gradually, so I've fallen back on array funcs as defaults in some places. I think this should be ready for a first pass!

}

return 0;
}
Contributor Author

@MaanasArora MaanasArora Mar 27, 2025

This is probably a lot of somewhat redundant code, but I added it here as a 'test' use which provides a boilerplate to replace the routines with a more efficient special-case indirect sort in a future PR. As a "bonus" it allows us to at least temporarily do the allocation this PR was intended for

Member

Nice! I'll let @seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.

That said - there definitely are specialized string sorting implementations we could be using here.

Member

This is good, the thing that I would like to change is the PyArray_(Arg)SortFunc itself, so that it gets a context and auxdata instead of the "array". And this function would get NpyAuxData *out_auxdata in also.
(Although, maybe for sorting this is less interesting as it is not as common to sort many arrays in one.)

The unfortunate thing is that you need to wrap the existing functions if you do this or have a second path for the old function.

I am also considering if we should have a return -2 or so to indicate that the sort-kind is not supported (no error set), to allow NumPy to fall back to a different one.
(But I am not sure we need it, it is useful only for something like mergesort/stablesort, explicitly.)
@charris may have a thought on that.
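The fallback idea could look roughly like the following toy Python model. The sentinel value `-2` comes from the comment above; `toy_get_sort_function` and `resolve_sort_function` are hypothetical helpers, and the tuple return is just a Python stand-in for a C status code plus an output pointer.

```python
NOT_SUPPORTED = -2  # "sort kind not supported", with no error set


def toy_get_sort_function(kind):
    """Hypothetical dtype slot that only implements a stable sort."""
    if kind == "stable":
        return 0, lambda buffer: buffer.sort()
    return NOT_SUPPORTED, None


def resolve_sort_function(get_sort_function, requested, fallback_order):
    """Try the requested kind first, then each fallback in order."""
    candidates = [requested] + [k for k in fallback_order if k != requested]
    for kind in candidates:
        status, func = get_sort_function(kind)
        if status == 0:
            return kind, func
        if status != NOT_SUPPORTED:
            # Any other negative status models a real error.
            raise RuntimeError(f"sort lookup failed for {kind!r}")
    raise TypeError("dtype supports none of the candidate sort kinds")


kind, sort_func = resolve_sort_function(
    toy_get_sort_function, "mergesort", ["stable"]
)
numbers = [3, 1, 2]
sort_func(numbers)
```

Here a request for `"mergesort"` quietly resolves to `"stable"`, while a dtype that supports no candidate still produces an error.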

Contributor Author

@MaanasArora MaanasArora Mar 27, 2025

Looking into passing context now! It looks like a good idea, will try to implement. Thanks.

Contributor Author

This change would be really nice; unfortunately both the PyArray_CompareFunc and the sort functions use the PyArrayObject right now, which we probably (!) shouldn't get from the context.

If I'm thinking right, we can define a new type such as PyArray_SortCompareFunc that uses the descr instead of the array and make new sort functions that do not use the array somehow (as we can no longer interchange the SortCompareFunc and the CompareFunc!), but we would still probably need the old functions to use with the older compare functions; I think the duplication will be quite complicated.

At the same time, this would be a missed opportunity to have the new CompareFunc type if we do deprecate later and want to go down this route...

Member

There are two possible approaches here:

  1. We deal with it in the sort function, and just have two different calls depending on whether it is an old or new sort function.
  2. You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
    That is, we wrap it into a dummy object for which basically the only valid field is arr->descr (and maybe arr->flags, don't recall). (See get_dummy_stack_array, yes this is terrible and even reviewers stumble over it, but...)
    If you do that, you can write a short function that wraps the old call into a function taking the new one.

I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.
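The two approaches can be contrasted with a toy Python model. All names here (`legacy_sort`, `new_sort`, `call_sort`, `wrap_legacy_sort`, and the `sort_key` attribute) are hypothetical; the sketch only assumes that the old signature receives an array-like object and reads its `descr`, while the new one receives a context.

```python
from types import SimpleNamespace


def legacy_sort(buffer, arr):
    """Old-style signature: receives the array, but only reads arr.descr."""
    buffer.sort(key=arr.descr.sort_key)
    return 0


def new_sort(context, buffer):
    """New-style signature: receives a context that carries the descriptor."""
    buffer.sort(key=context.descr.sort_key)
    return 0


def call_sort(func, is_legacy, context, buffer, arr):
    # Approach 1: branch at the call site on old vs. new signature.
    if is_legacy:
        return func(buffer, arr)
    return func(context, buffer)


def wrap_legacy_sort(legacy_func):
    # Approach 2: adapt the old function to the new signature by passing a
    # dummy "array" whose only valid field is descr.
    def wrapped(context, buffer):
        dummy_array = SimpleNamespace(descr=context.descr)
        return legacy_func(buffer, dummy_array)

    return wrapped


descr = SimpleNamespace(sort_key=str.lower)
context = SimpleNamespace(descr=descr)
array = SimpleNamespace(descr=descr)

first = ["b", "A", "c"]
status_first = call_sort(legacy_sort, True, context, first, array)

second = ["b", "A", "c"]
status_second = wrap_legacy_sort(legacy_sort)(context, second)

third = ["b", "A", "c"]
status_third = call_sort(new_sort, False, context, third, array)
```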


Allowing a compare function to be set seems like a nice idea (also to have a simple default).

It would be nice to move that into a default slot function. I.e. rather than setting it for StringDType here, auto-fill the slot with the function that tries to use the SortCompareFunc (If that slot is undefined, we can keep the slot filled with NULL).
That also removes the second check later.
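A minimal sketch of that default-filling step, modelled in Python with a plain dict standing in for the slots struct. `fill_default_sort_slot` and `make_default_get_sort_function` are hypothetical names, not NumPy functions.

```python
import functools


def make_default_get_sort_function(sort_compare):
    """Build a generic get_sort_function on top of a three-way compare."""

    def get_sort_function(kind, descending=False):
        key = functools.cmp_to_key(sort_compare)

        def sort_in_place(buffer):
            buffer.sort(key=key, reverse=descending)

        return sort_in_place

    return get_sort_function


def fill_default_sort_slot(slots):
    """At registration time, derive the sort slot from sort_compare if needed."""
    if slots.get("get_sort_function") is None:
        compare = slots.get("sort_compare")
        # Without a compare function the slot simply stays empty (NULL in C).
        slots["get_sort_function"] = (
            make_default_get_sort_function(compare) if compare is not None else None
        )
    return slots


with_compare = fill_default_sort_slot(
    {"sort_compare": lambda a, b: (a > b) - (a < b)}
)
without_compare = fill_default_sort_slot({})

numbers = [3, 1, 2]
with_compare["get_sort_function"]("quicksort")(numbers)
```

Because the slot is resolved once at registration, callers only need a single "is the slot empty?" check later.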


About this SortCompareFunc, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect the dtype to do the comparisons -- for example structured has to do this).

But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e. <=> in C++ can return a partial order, which means that:

  • For us probably an "error" is a valid return (right now we can't propagate errors!).
  • "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve with compare(b, b) and compare(a, a).
    But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to have unordered_left, unordered_both, unordered_right.

Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.
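To make the `unordered_left` / `unordered_both` / `unordered_right` idea concrete, here is a toy Python model for floats with NaN. The `Order` enum and both functions are hypothetical; the policy of always placing unordered values last is an assumption of this sketch, not something decided in the thread.

```python
import functools
import math
from enum import Enum


class Order(Enum):
    LESS = -1
    EQUAL = 0
    GREATER = 1
    UNORDERED_LEFT = 2   # only the left operand is unordered (NaN)
    UNORDERED_RIGHT = 3  # only the right operand is unordered
    UNORDERED_BOTH = 4   # both operands are unordered


def compare_floats(a, b):
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan and b_nan:
        return Order.UNORDERED_BOTH
    if a_nan:
        return Order.UNORDERED_LEFT
    if b_nan:
        return Order.UNORDERED_RIGHT
    if a < b:
        return Order.LESS
    if a > b:
        return Order.GREATER
    return Order.EQUAL


def sort_with_unordered_last(values, descending=False):
    def cmp(a, b):
        order = compare_floats(a, b)
        if order is Order.UNORDERED_BOTH:
            return 0
        if order is Order.UNORDERED_LEFT:
            return 1   # a is NaN, so it goes after b
        if order is Order.UNORDERED_RIGHT:
            return -1  # b is NaN, so a goes first
        # Only ordered pairs are flipped for a descending sort.
        return -order.value if descending else order.value

    return sorted(values, key=functools.cmp_to_key(cmp))


nan = float("nan")
ascending = sort_with_unordered_last([2.0, nan, 1.0, 3.0])
descending = sort_with_unordered_last([2.0, nan, 1.0, 3.0], descending=True)
```

Knowing which side is unordered lets the sort reverse the ordered values without also moving the NaNs to the front.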

Contributor Author

I'll try the first approach I think! I think it will help isolate the new API in a way that makes deprecation easier later too. I will also do this default logic for both compare and sort compare funcs.

As for allowing partial order: yes that could break the symmetry, I suppose. But it could be useful to make sorting more precise in the long run too, and so "unordered" does seem a better way to allow for those kinds of extensions. And then we can use this machinery as the go-to for anywhere comparison decisions need to be made as you said.

Contributor Author

Just an update: after some thought, I think it might be quite nice to go with unordered_left, unordered_both, unordered_right, mainly because it saves us any issues with reverse sorting down the line, might as well get everything in, especially as you mentioned dataframes and that's clearly a very important use case. Working on this! I'll try to draft an API that can easily fill in 'defaults' somehow, so that the user-facing side can be used at different levels of complexity / customization.

npy_intp, int, PyArray_SortFunc **);
typedef int *(PyArrayDTypeMeta_GetArgSortFunction)(PyArray_Descr *,
npy_intp, int, PyArray_ArgSortFunc **);

Member

New stuff in the public API needs new API docs as well as a release note describing the new features.

Maybe also as a proof-of-concept, it looks like both quaddtype and mpfdtype in numpy-user-dtypes implement sorting - would you be willing to update them to use the new API in a PR to numpy-user-dtypes that depends on this PR to numpy? That should give you a feeling for whether this API is helpful for someone writing a new user dtype. It'll also be a form of documentation - we don't have great docs for writing user dtypes besides the examples in numpy-user-dtypes.

Member

@ngoldbaum ngoldbaum Mar 27, 2025

Also what should we do about the flags that got added before we made the dtype API public, e.g. NPY_DT_PyArray_ArrFuncs_compare? I guess we can deprecate them although I don't know how hard it would be to generate deprecation warnings if those are used.

Member

It's easy to generate a deprecation warning during registration (a bit tedious maybe, as you need explicit check).

Contributor Author

Sure, I'll add API docs and a release note, and willing to make a PR to numpy-user-dtypes! Will look into that.

Just to be clear, NPY_DT_PyArray_ArrFuncs_compare is still needed right? We can move it to a new slot rather than an arrayfunc but it's going to be different from the sort comparison for now if I'm thinking right (as it is user-facing rather than used in the sorting). Do we need to do this another way?

Member

We can't change slot numbers (unless they are guarded as private)! So the numbers are fixed (until they have not been used for a bit at least).
So yeah, I think we should keep it in the old slot for now; it is probably easier to make the deprecation a follow-up.[^depr]

So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.
(It's not a big issue, the only thing is the convenience if slot numbers == slot offset so you don't need to translate it.)

[^depr] I think this is as simple as asking users to compile with the new NumPy, and then adding PyArray_RUNTIME_VERSION, but this PR is complicated enough due to API decisions for the new loops, etc.

Member

So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.

There is an offset, _NPY_DT_ARRFUNCS_OFFSET:

#define NPY_DT_MAX_ARRFUNCS_SLOT \
NPY_NUM_DTYPE_PYARRAY_ARRFUNCS_SLOTS + _NPY_DT_ARRFUNCS_OFFSET
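As a rough picture of what such an offset buys, the toy Python model below splits a public slot ID into a table and an index. The concrete numbers (`1 << 10` and `22`) are assumptions for illustration only and should be checked against the NumPy headers; `classify_slot` is a hypothetical helper.

```python
# Assumed values for illustration only; check the NumPy headers for the real ones.
ARRFUNCS_OFFSET = 1 << 10
NUM_ARRFUNCS_SLOTS = 22
MAX_ARRFUNCS_SLOT = NUM_ARRFUNCS_SLOTS + ARRFUNCS_OFFSET


def classify_slot(slot_id):
    """Split a public slot ID into (table name, index within that table)."""
    if slot_id > ARRFUNCS_OFFSET:
        if slot_id > MAX_ARRFUNCS_SLOT:
            raise ValueError(f"invalid slot ID {slot_id}")
        # Legacy arrfuncs IDs need translating back to a small index.
        return "arrfuncs", slot_id - ARRFUNCS_OFFSET
    # Regular DType slots: the ID can be used directly as the offset.
    return "dtype", slot_id


regular = classify_slot(11)
legacy = classify_slot(ARRFUNCS_OFFSET + 5)
```

Keeping the two ID ranges disjoint means new regular slots can be appended without colliding with the legacy ones.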

}

static inline PyArray_CompareFunc *
PyArray_SortCompare(PyArray_Descr *descr)
Member

I'd call this PyArray_GetSortCompareFunction

Contributor Author

Yes, made this change. Thanks

@@ -44,7 +44,7 @@
* the below code implements this converted to an iteration and as an
* additional minor optimization skips the recursion depth checking on the
* smaller partition as it is always less than half of the remaining data and
* will thus terminate fast enough
* will thus terminate fast enough`
Member

I think this was added by mistake?

Contributor Author

Yes, sorry!

@MaanasArora MaanasArora changed the title from "ENH: Allocate lock only once in StringDType quicksort" to "ENH, API: New sorting mechanism for DType API" Mar 28, 2025
Member

@ngoldbaum ngoldbaum left a comment

Thanks for iterating so quickly :)

I think the new error path needs a little more thought, sorry for the back-and-forth.

sort-kind and order.

Additionally, the new `NPY_DT_sort_compare` slot can be used to provide a comparison function for
sorting, which will replace the default comparison function for the dtype in sorting functions.
Member

maybe a note that the old arrfuncs slots may be deprecated in the future.

Contributor Author

Added, thanks!

#define NPY_DT_sort_compare 11
#define NPY_DT_get_clear_loop 12
#define NPY_DT_get_fill_zero_loop 13
#define NPY_DT_finalize_descr 14
Member

can you re-order these so the slots that were already in the struct keep their old values? I don't know offhand if changing this order is problematic but it seems more consistent to not change the old values even if it's fine.

Contributor Author

Yes, done!

Member

@seberg seberg left a comment

Sorry, long comments, and I realize this is becoming a lot more complex than it may have looked initially. But, I would really like to see the context passed in/a new signature.

That also probably includes filling in/returning ARRAY_METHODFLAGS, even if the only useful flag is "requires GIL".

I also still tend to think it may make sense to have a magic return for "unsupported sort method", although should maybe ask Chuck once in a meeting about that.
(In principle I agree we usually just need stable and not-stable, but if we want users to be able to choose more precisely, I think it may make sense to allow us to fall back here. We could even still use something like "no error indicated, but func == NULL" for it, but maybe a special return is easier.)


If needed, maybe we have to talk briefly about it synchronously? Or maybe just write the docs/signatures first that we want for the public API.

@@ -0,0 +1 @@
* `PyArray_GetSortFunction`, `PyArray_GetArgSortFunction`, and `PyArray_GetSortCompareFunction` have been added to the C-API. These functions return the sorting, argsorting, and sort comparison functions if provided for a given dtype in new slots.
Member

You did not actually add them to the public C-API. Which is totally fine, though.

(I might start with adding a SortBuffer() function or so.)

NPY_SORTKIND which, int descending, PyArray_SortFunc **out_sort)
{
if (NPY_DT_SLOTS(NPY_DTYPE(descr))->get_sort_function == NULL) {
return -1;
Member

This needs to set an error (TypeError or DTypeTypeError, which is defined somewhere I think.)

(An error here will make sense after you move the fallback logic into a default slot filling. Or you could have the default slot raise an error, that is also completely fine.)

@MaanasArora
Contributor Author

No worries; thank you for the detailed feedback actually! It's nice to be able to iron out the direction for the API. I'll address the docs and public API changes and keep working away at the SortFunc changes. Happy to have a synchronous chat if it seems useful.

@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *);
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *);

typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *,
PyArray_Descr *);
Contributor Author

The naming is a bit weird here, but I didn't want to disturb the original type as it's used a lot. I think the SortCompareFunc should still be a unique type so will do that (even if only a clone of this type).

Member

I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function up from the np.less or np.greater ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements, a compare ufunc would work better...)

But, on the other hand, it seems pragmatic even if it won't work well e.g. for structured dtypes (performance issues), it will always work and provides an easy entry-point (we can also use this to define default comparison ufuncs).

So overall, I think I end up at just doing this, although I could imagine punting if we don't need it for StringDType (I suspect we do, though).

Would like to hear if @ngoldbaum has an opinion.

(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)

Member

IMO this is fine, if only because it exists right now 😄

@MaanasArora
Contributor Author

MaanasArora commented Apr 5, 2025

Sorry for the bit of delay, I was thinking through this and essentially ended up with separating more the legacy sorting machinery from this API. This way, the new signatures can freely use context-related features and we do not have to create some sort of empty array or refactor a very large number of sorting-related files. Aside from how nice the feature is, I think this separation is actually a plus and even rolling back the stringdtype integration was worth it (in any case, user dtypes are not to define sorting with the internal, now legacy functions, so may be best to add a specialized routine).

We also have the new compare slot which defaults to sort_compare and related features now, though they're not used yet. Hopefully this is in the right direction. If it is, we can gradually move the older sorting machinery to the context signatures, thus converging. Thank you!

@ngoldbaum
Member

Sorry for not getting to this yet. I'm going to try to make sure to give this a once-over next week.

I think you can fix the test failures by rebasing?

@MaanasArora
Contributor Author

No worries, I have some things to address as well.

Just rebased--sorry not sure if things went perfectly smoothly.

@MaanasArora
Contributor Author

MaanasArora commented Apr 11, 2025

Just brought this implementation with the new signature to parity with the previous one, including the usage in StringDType and ensuring use cases for the new and old sort functions are handled properly in the handlers. There is repetitive code but I guess we will phase out the legacy slots. Now we can make the context nicer if needed, and incorporate the auxdata!

@MaanasArora MaanasArora force-pushed the enh/faster-string-sorting branch from d7cd9ed to 0e8b6a5 Compare April 11, 2025 21:29
@MaanasArora
Contributor Author

MaanasArora commented May 8, 2025

Hello! Getting back to this, is there anything I need to address? Thinking of adding the functions to the public C-API if things look fine.

Would we need to create a new C-API version (regenerate the hashes and such), and I guess it would come under 2.4, given how close 2.3 is?

Member

@seberg seberg left a comment

A few comments, yeah, this won't make 2.3, sorry. I think it might be good to discuss a bit in depth with @ngoldbaum some time (not next week, sorry).

Another thing that I would like addressed/discussed is the problem of reverse sorting.
I do think we need at least a reverse=True, I think it might make sense to also provision for a nan_position (if nan goes first or last).

(NULL/NA ordering is very important in dataframe world, and I am tempted to include this, even if we say that the value for now is always "last").
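A small Python sketch of why a `reverse` flag and a NaN position are separate knobs: simply reversing an ascending, NaN-last result moves the NaNs to the front. `sort_floats` and its `nan_position` parameter are hypothetical names taken from the wording above, not an existing NumPy API.

```python
import math


def sort_floats(values, reverse=False, nan_position="last"):
    """Sort floats, placing NaNs independently of the sort direction."""
    ordered = sorted((v for v in values if not math.isnan(v)), reverse=reverse)
    nans = [v for v in values if math.isnan(v)]
    return nans + ordered if nan_position == "first" else ordered + nans


values = [2.0, float("nan"), 1.0]

ascending = sort_floats(values)
descending = sort_floats(values, reverse=True)
descending_nan_first = sort_floats(values, reverse=True, nan_position="first")

# Naively reversing the ascending result flips the NaN position as well.
naive_reverse = list(reversed(ascending))
```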

@@ -1873,6 +1873,29 @@ described below.
pointer. Currently this is used for zero-filling and clearing arrays storing
embedded references.
.. c:type:: int (PyArray_SortFunc)( \
void *start, npy_intp num, PyArrayMethod_Context *context, \
Member

Let's move the context to the first spot just for similarity. I think I added a context for traversal functions, I am not sure that was smart, but since we have it, it may be a slightly better fit.

I might call start, data (not that it matters).

Contributor Author

This is done!

NpyAuxData *auxdata, NpyAuxData **out_auxdata)
A function to sort a buffer of data. The *start* is a pointer to the
beginning of the buffer containing *num* elements. A function of this
Member

Suggested change
beginning of the buffer containing *num* elements. A function of this
beginning of the contiguous containing *num* elements. A function of this

It also should be aligned, but I have to think whether we should allow indicating support for unaligned data here. (Which would require flags, for ufuncs "supports unaligned" is flagged before get_loop(), although since here we always do contiguous, flagging it inside get_loop() is OK also -- that is, because get_loop() is not passed any strides).

Contributor Author

Makes sense! Committed (modulo typo).

NpyAuxData **out_auxdata)
A function to arg-sort a buffer of data. The *start* is a pointer to the
beginning of the buffer containing *num* elements. The *tosort* is a
Member

@charris to confirm, even for argsorting it probably makes sense to always use a contiguous buffer for sorting?

Comment on lines 1888 to 1889
PyArrayMethod_Context *context, NpyAuxData *auxdata, \
NpyAuxData **out_auxdata)
Member

Suggested change
PyArrayMethod_Context *context, NpyAuxData *auxdata, \
NpyAuxData **out_auxdata)
PyArrayMethod_Context *context, NpyAuxData *auxdata)

The out_auxdata belongs on the get_loop function!

Contributor Author

This is done, thank you :)

.. c:macro:: NPY_DT_get_sort_function
.. c:type:: int *(PyArrayDTypeMeta_GetSortFunction)(PyArray_Descr *, \
npy_intp sort_kind, int descending, PyArray_SortFunc **out_sort);
Member

This needs *out_flags, since we need the ability to indicate whether the GIL is required (I think we can ignore FPEs), but who knows if we'll have a reason for other flags eventually.

It also needs **out_auxdata, since auxdata needs to come from somewhere :).

Contributor Author

Done, thanks!
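
A rough Python model of the lookup shape discussed in this thread: the getter receives the descriptor, sort kind, and direction, and hands back the sort function together with flags and auxiliary data. Every name here (`get_sort_function`, `REQUIRES_GIL`, the dict-based descriptor, the accepted sort kinds) is a hypothetical stand-in. The real slot is a C function with out-parameters, and this sketch does not reproduce its signature.

```python
REQUIRES_GIL = 1 << 0  # hypothetical flag bit, for illustration only


def get_sort_function(descr, sort_kind, descending):
    """Return (status, sort_func, flags, auxdata); status < 0 signals an error."""
    if sort_kind not in ("quicksort", "stable"):
        return -1, None, 0, None
    # Auxdata carries per-lookup state that the sort function needs later.
    auxdata = {"descending": bool(descending)}
    # Flags tell the caller what the sort requires (in this model, the GIL
    # is needed only when the elements are Python objects).
    flags = REQUIRES_GIL if descr.get("is_object") else 0

    def sort_func(buf, aux):
        buf.sort(reverse=aux["descending"])
        return 0

    return 0, sort_func, flags, auxdata


status, sort_func, flags, auxdata = get_sort_function(
    {"is_object": False}, "stable", descending=True)
assert status == 0  # on success, sort_func is assumed to be usable
buf = [2, 3, 1]
sort_func(buf, auxdata)
print(buf, flags)  # [3, 2, 1] 0
```

Returning the flags and auxdata from the getter, rather than from the sort function itself, lets the caller decide once, before any sorting happens, whether it may release the GIL.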

@@ -1570,20 +1592,41 @@ PyArray_Sort(PyArrayObject *op, int axis, NPY_SORTKIND which)
return -1;
}

sort = PyDataType_GetArrFuncs(PyArray_DESCR(op))->sort[which];
PyArray_GetSortFunction(PyArray_DESCR(op), which, 0, &sort);
Member

Hmmm, let's just use < 0 to decide if it's an error. In which case sort != NULL is assumed.

Contributor Author

Makes sense, this is done!

@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *);
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *);

typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *,
PyArray_Descr *);
Member

I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function up from the np.less or np.greater ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements; a compare ufunc would work better...)

Still, it seems pragmatic: even if it won't perform well for, e.g., structured dtypes, it will always work and provides an easy entry point (we can also use it to define default comparison ufuncs).

So overall, I think I end up at just doing this, although I could imagine punting if we don't need it for StringDType (I suspect we do, though).

Would like to hear if @ngoldbaum has an opinion.

(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)
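
The trade-off between a three-way compare function and a less-than ufunc can be sketched in Python. A three-way compare can place unordered elements (NaN here) explicitly, whereas `<` alone cannot tell "greater" apart from "unordered". `compare_with_descr` and the `descr` dict are hypothetical and only mirror the shape of the typedef above (two elements plus a descriptor); they are not NumPy API.

```python
import functools
import math


def compare_with_descr(a, b, descr):
    """Three-way compare: negative, zero, or positive, like a C compare function."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # Unordered elements need an explicit rule; here NaN always sorts last.
        return int(a_nan) - int(b_nan)
    result = (a > b) - (a < b)
    return -result if descr["descending"] else result


descr = {"descending": False}  # stand-in for per-dtype state
key = functools.cmp_to_key(lambda a, b: compare_with_descr(a, b, descr))
values = [2.0, float("nan"), 1.0, 3.0]
print(sorted(values, key=key))  # [1.0, 2.0, 3.0, nan]

# With `<` alone, NaN compares False against everything, so it looks "equal"
# to every element and the result depends on the input order.
print(sorted(values))
```

The descriptor argument is what lets a single compare function serve every instance of a parametric dtype, at the cost of one indirect call per comparison.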

@MaanasArora
Contributor Author

MaanasArora commented May 10, 2025

Thanks for the comments! And no worries, I was just making sure I wasn't missing something to do :)

I think I need to think a bit more about the best way to adjust this for the extra features you mentioned, yes. Handling unordered elements is definitely something to consider at this stage, so I might try to draft something for that soon; that should hopefully create a clearer story around these features!

@MaanasArora MaanasArora force-pushed the enh/faster-string-sorting branch from 39f5ef2 to 237b7f0 Compare May 13, 2025 22:39
@ngoldbaum
Member

I want to call your attention to this suggestion: #28516 (comment).

Did you ever take a look at numpy-user-dtypes? A worked example would help.

@MaanasArora
Contributor Author

Yes, sorry, I did take a look, but I was having some trouble installing the dtype packages over the editable install of numpy. I'll push my draft anyway and try to address that. Thanks for the reminder.

@MaanasArora MaanasArora force-pushed the enh/faster-string-sorting branch from be43aeb to 0edb4ea Compare May 20, 2025 04:04
@ngoldbaum
Member

Just a heads-up: I haven't forgotten about this. I'm planning to spend some time later this week or next looking closely at this and the accompanying numpy-user-dtypes PR. Thanks so much for working on this and for your patience 🙂.

@MaanasArora
Contributor Author

MaanasArora commented May 20, 2025

Thank you @ngoldbaum! I realize there's a lot here, so I appreciate you taking the time on this. I'll try to address Sebastian's other comments and iterate soon.

Member

@ngoldbaum ngoldbaum left a comment

This is getting there! Left a few comments inline.

NpyAuxData *);
typedef int (PyArray_ArgSortFunc)(PyArrayMethod_Context *,
void *, npy_intp *, npy_intp,
NpyAuxData *);
Member

These two need different names and you need to leave the original typedefs in ndarraytypes.h that had these names, since they're public API.

Contributor Author

@MaanasArora MaanasArora May 24, 2025

Thanks for reviewing! This is done.

@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *);
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *);

typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *,
PyArray_Descr *);
Member

IMO this is fine, if only because it exists right now 😄

Successfully merging this pull request may close these issues.

ENH: np.unique and sorting is slow for StringDType