
BUG: Polynomial package slower than polynomial module for python <= 3.11. Both are worse for python > 3.11. #28948


Open
eigenbrot opened this issue May 12, 2025 · 15 comments


@eigenbrot

Describe the issue:

I really would like to use the new Polynomial package, but I'm finding the evaluation performance is much worse than np.poly1d for python versions 3.10 and 3.11. Running the example code below on different python versions gives the following results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 51.5 | 36.2 | 35.5 | 44.3 |
| np.poly1d | 37.2 | 21.4 | 32.4 | 49.7 |

Notice that for python versions <= 3.11 np.poly1d is significantly faster than using the Polynomial package machinery. The performance of the two methods converges for python > 3.11, but mostly because np.poly1d is slowing down.

Is this expected behavior? I didn't see any mention of performance in the docs for the new Polynomial package. I have seen some reports that the Polynomial package is faster than np.poly1d (which would be great), but that's not what I'm seeing with my tests.

Any advice/insight would be greatly appreciated. For now the clear solution is to just use np.poly1d.

Reproduce the code example:

import numpy as np
from numpy.polynomial import Polynomial
import timeit
d = np.random.random((14, 2048, 2048))
P = Polynomial([1., 2., 3., 4.])
old_p = np.poly1d([4., 3., 2., 1.])

print(timeit.timeit("P(d)", globals={"P": P, "d": d}, number=10))

print(timeit.timeit("old_p(d)", globals={"old_p": old_p, "d": d}, number=10))

Error message:

Python and NumPy Versions:

For 3.10:

2.2.5
3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0]

For 3.11:

2.2.5
3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0]

For 3.12:

2.2.5
3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]

For 3.13:

2.2.5
3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) [GCC 13.3.0]

Runtime Environment:

For 3.10:

[{'numpy_version': '2.2.5',
  'python': '3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp10/lib/python3.10/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.11:

[{'numpy_version': '2.2.5',
  'python': '3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp11/lib/python3.11/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.12:

[{'numpy_version': '2.2.5',
  'python': '3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp12/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.13:

[{'numpy_version': '2.2.5',
  'python': '3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp/lib/python3.13/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

Context for the issue:

I need to apply polynomials to many large arrays and do it quickly. I want to use the new-and-improved Polynomial package, but its performance is forcing me to use the older np.poly1d.

@charris
Member

charris commented May 12, 2025

Could you give more details on your use case?

@eigenbrot
Author

Sure, what exactly would you like to know? The use case is not much more complicated than the example code I provided.

We have image arrays that need a correction applied and the correction is parameterized as a polynomial of arbitrary degree. The full example looks something like this:

def correct_image(data: np.ndarray, correction_polynomial: np.poly1d | np.polynomial.polynomial.Polynomial) -> np.ndarray:
    correction = correction_polynomial(data)
    corrected_data = data / correction
    return corrected_data

So there's an extra division, but I removed that common step in my tests above.

In the tests I ran above I used a stack of images (i.e., fourteen 2048 x 2048 images), mostly just to give numpy enough work to make the time differences obvious. In real life (as shown here) we typically correct a single 2D array at a time, but it really doesn't matter which we choose. The performance difference between Polynomial and poly1d is roughly the same regardless of the size/dimensionality of our data.
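The claim that the gap is roughly shape-independent can be spot-checked with a sketch like the following (a minimal version of the benchmark above; the shapes and iteration count here are arbitrary choices, not the ones from the reported tests):

```python
import timeit
import numpy as np
from numpy.polynomial import Polynomial

P = Polynomial([1., 2., 3., 4.])      # 1 + 2x + 3x^2 + 4x^3
old_p = np.poly1d([4., 3., 2., 1.])   # same polynomial, highest degree first

# Evaluate the same polynomial with both APIs across a few shapes;
# if per-call overhead dominated, the gap would shrink for larger arrays.
for shape in [(256, 256), (3, 512, 512)]:
    d = np.random.random(shape)
    t_new = timeit.timeit(lambda: P(d), number=20)
    t_old = timeit.timeit(lambda: old_p(d), number=20)
    print(f"{shape}: Polynomial {t_new:.3f}s, np.poly1d {t_old:.3f}s")
```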

@ngoldbaum
Member

It'd be useful to see a low-level profile generated using e.g. samply on a Python version that is seeing a slowdown and comparing that with one that isn't. That'll show where Python is spending its time, which might give a hint at where the difference is coming from.

@eendebakpt
Contributor

I am having some problems confirming the dependence on the python version and the slowdowns. @eigenbrot Could you also benchmark with smaller d and higher number argument for timeit?

There are a couple of open PRs to improve the performance btw: #24499, #24467, #26885, #24531

@eigenbrot
Author

eigenbrot commented May 13, 2025

@ngoldbaum, I'm attaching some profiles captured with samply via

$ samply record --save-only -o NN_polynomial.json -- python -c "import numpy as np; d = np.random.random((14, 2048, 2048)); p = np.polynomial.Polynomial([1, 2, 3, 4]); _ = p(d)"
$ samply record --save-only -o NN_poly1d.json -- python -c "import numpy as np; d = np.random.random((14, 2048, 2048)); p = np.poly1d([4, 3, 2, 1]); _ = p(d)"

The traces mean nothing to me, but if someone would like different samples please let me know.
10_poly1d.json
10_polynomial.json
11_poly1d.json
11_polynomial.json
12_poly1d.json
12_polynomial.json
13_poly1d.json
13_polynomial.json

@eendebakpt, I re-ran the test described in my initial post but with two modifications you requested: the shape of the data is now (3, 256, 256) and I set number=10_000 for the timeit call. Here are the results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 20.5 | 19.9 | 19.8 | 19.6 |
| np.poly1d | 12.3 | 12.3 | 16.8 | 17.2 |

@ngoldbaum
Member

I'm attaching some profiles captured with samply

Where? If you upload the profile you should be able to share Firefox profiler links.

@eigenbrot
Author

@ngoldbaum, sorry. I borked the upload. They should be there now.

@ngoldbaum
Member

Ah darn, unfortunately the profiles you uploaded don't have debug symbols, so they're pretty useless.

If I have time I can try doing this myself to give you something more useful to look at. In the meantime you should be able to build NumPy from source using the meson debugoptimized build profile to get an optimized executable but with useful debugging info and then regenerate the profiles.

If you're on Linux, you can also pass -X perf to Python to also get Python-level frames in the profiles.

@eigenbrot
Author

Ok, I've built numpy with the debugoptimized profile and re-generated the profiles and test results.

Just to make sure we're all on the same page, here is how I built numpy for each python version. N in these steps refers to the minor python version (10, 11, 12, 13):

$ mamba create -n nbuild_N python=3.N cython compilers openblas meson-python pkg-config
$ git clone https://github.com/numpy/numpy.git numpyN
$ cd numpyN
$ git checkout v2.2.5
$ git submodule update --init
$ python -m pip install -r requirements/build_requirements.txt
$ spin build -- -Dbuildtype=debugoptimized

All were built against OpenBLAS 0.3.29.

The profiles were then collected with (note the use of -X perf)

$ PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages' samply record --save-only -o N_poly1d_debug_perf.json -- python -X perf -c "import sys; del(sys.path[0]); import numpy as np; d = np.random.random((14, 2048, 2048)); P = np.poly1d([4, 3, 2, 1]); _ = P(d)"
$ PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages' samply record --save-only -o N_polynomial_debug_perf.json -- python -X perf -c "import sys; del(sys.path[0]); import numpy as np; d = np.random.random((14, 2048, 2048)); P = np.polynomial.Polynomial([1, 2, 3, 4]); _ = P(d)"

I had to go the PYTHONPATH route because spin python doesn't work for python 3.10 and I wanted a consistent test for all versions.

Profiles:
10_poly1d_debug_perf.json
10_polynomial_debug_perf.json

11_poly1d_debug_perf.json
11_polynomial_debug_perf.json

12_poly1d_debug_perf.json
12_polynomial_debug_perf.json

13_poly1d_debug_perf.json
13_polynomial_debug_perf.json

And here are the results of the timing test (with smaller arrays and more samples). Note that these were run in the same conda environment used to build numpy, prefixed with PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages', and include import sys; del(sys.path[0]):

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 19.4 | 20.7 | 19.5 | 19.9 |
| np.poly1d | 12.3 | 12.0 | 16.5 | 16.6 |

@ngoldbaum
Member

Unfortunately the profiles still don't have C debugging information so that won't tell me anything useful.

@ngoldbaum
Member

ngoldbaum commented May 14, 2025

Here's a profile generated on Python 3.11 using just Polynomial with 100 timeit iterations: https://share.firefox.dev/4j0GCyo

And here's the same thing, generated by calling poly1d with timeit 100 times: https://share.firefox.dev/4miarx9

Unfortunately this is on a Mac, so I can't use the perf integration.

@ngoldbaum
Member

And here's a profile that does both: https://share.firefox.dev/4jQ2MEN

At a high level, without any Python frames, it sort of looks like Polynomial is doing more ufunc operations? I don't see anything obvious in the profiles though - both are spending almost all their time in the add and multiply ufunc implementations.

There's probably something different in the algorithm that's being used that's causing some extra ufunc calculations to happen that aren't needed by poly1d.

@eendebakpt
Contributor

Both implementations use polyval in the end. Polynomial via

def __call__(self, arg):
    arg = pu.mapdomain(arg, self.domain, self.window)
    return self._val(arg, self.coef)

_val = staticmethod(polyval)

and poly1d calls directly:

def __call__(self, val):
    return polyval(self.coeffs, val)

For Polynomial there are extra calculations involved in the domain mapping through pu.mapdomain. Since the polynomial in the example is of relatively low degree (degree 3, four coefficients), the domain mapping explains (part of) the performance difference between Polynomial and poly1d. It does not explain why the performance of poly1d seems to have become worse when moving from 3.11 to 3.12.
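The cost of that domain-mapping step can be measured in isolation with a minimal sketch like this (not part of the original thread; the shape matches the smaller benchmark above, the iteration count is arbitrary):

```python
import timeit
import numpy as np
from numpy.polynomial import Polynomial, polyutils as pu

P = Polynomial([1., 2., 3., 4.])
d = np.random.random((3, 256, 256))

# The domain-mapping step that Polynomial.__call__ performs
# before delegating to polyval.
t_map = timeit.timeit(lambda: pu.mapdomain(d, P.domain, P.window), number=1_000)
# The full evaluation, for comparison.
t_full = timeit.timeit(lambda: P(d), number=1_000)
print(f"mapdomain alone: {t_map:.2f}s, full evaluation: {t_full:.2f}s")
```

Note that with the default domain and window (both `[-1, 1]`) the mapping is numerically an identity, so all of its cost is overhead.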

@eigenbrot What timings do you get if you use polyval directly?

@eigenbrot
Author

@eendebakpt just to make sure we're all on the same page: there are two versions of polyval. np.poly1d uses np.polyval, while np.polynomial.Polynomial uses numpy.polynomial.polynomial.polyval. Confusingly, the argument order (and coefficient order) is swapped between these two functions, so it's easy to mix them up.
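As a sanity check (a minimal sketch, not from the thread), the two functions can be confirmed to evaluate the same polynomial despite the swapped conventions:

```python
import numpy as np
from numpy.polynomial.polynomial import polyval as new_val
from numpy import polyval as old_val

x = np.linspace(-2.0, 2.0, 5)

# old style: coefficients first, highest degree first
old_res = old_val([4., 3., 2., 1.], x)
# new style: evaluation points first, lowest degree first
new_res = new_val(x, [1., 2., 3., 4.])

# both compute 4x^3 + 3x^2 + 2x + 1
print(np.allclose(old_res, new_res))
```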

Here are the results of running similar timing tests with the two versions of polyval. These were NOT run with the custom numpy builds mentioned above. The script is:

import numpy as np
from numpy.polynomial.polynomial import polyval as new_val
from numpy import polyval as old_val
import timeit

d = np.random.random((3, 256, 256))

old_res = timeit.timeit("old_val([4., 3., 2., 1.], d)", globals={"old_val":old_val, "d": d}, number=10_000)

new_res = timeit.timeit("new_val(d, [1., 2., 3., 4.])", globals={"new_val":new_val, "d": d}, number=10_000)

and the results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial's polyval | 14.3 | 15.2 | 13.8 | 13.5 |
| np.poly1d's polyval | 11.4 | 11.5 | 14.3 | 14.3 |

I'd say the overall trend is still the same, but the difference between the two implementations is much much smaller. There still seems to be an inflection point across the 3.11 -> 3.12 transition.

@eendebakpt
Contributor

@eigenbrot You are completely right, there are two versions of polyval.

In the old version there is an iteration over the coefficients with `for pv in p`, which yields scalars such as `pv = np.float64(4.0)`. In the new version the coefficients are reshaped via `c.reshape(c.shape + (1,) * x.ndim)` and the iteration is over `c[-i]`, which yields one-element arrays such as `array([4.])`. I suspect using scalars is slower for small arrays (adding a scalar to an array has quite a bit of overhead).
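That scalar-versus-one-element-array difference can be probed directly with a micro-benchmark (a sketch of my own, not from the thread; the array shape and repeat count are arbitrary):

```python
import timeit
import numpy as np

x = np.random.random((3, 16, 16))
pv_scalar = np.float64(4.0)                        # what the old polyval iterates over
pv_array = np.array([4.0]).reshape((1,) * x.ndim)  # what the new polyval produces

# Both additions broadcast to the same result; only the overhead differs.
t_scalar = timeit.timeit(lambda: x + pv_scalar, number=50_000)
t_array = timeit.timeit(lambda: x + pv_array, number=50_000)
print(f"scalar add: {t_scalar:.3f}s, 1-element-array add: {t_array:.3f}s")
```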

With your example script I do not get the inflection across 3.11 to 3.12, but I do see another effect: for d of shape (2, 256, 256) the old version is faster, but for shape (3, 26, 256) the new version is faster. If I use a 1d array for d I see a similar effect (depending on the size).

Because I do not see the performance change from 3.10 to 3.11 (or 3.12 or 3.13) on my system, it is a bit hard to investigate this further. Could you try to narrow down the issue even more by testing a few different shapes of d and benchmarking only part of polyval? I think benchmarking `y * x` would be interesting, and, with `w = y * x`, benchmarking `w + pv`.
