
BUG: Polynomial package slower than polynomial module for python <= 3.11. Both are worse for python > 3.11. #28948


Open
eigenbrot opened this issue May 12, 2025 · 15 comments


@eigenbrot

Describe the issue:

I really would like to use the new Polynomial package, but I'm finding the evaluation performance is much worse than np.poly1d for python versions 3.10 and 3.11. Running the example code below on different python versions gives the following results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 51.5 | 36.2 | 35.5 | 44.3 |
| np.poly1d | 37.2 | 21.4 | 32.4 | 49.7 |

Notice that for python versions <= 3.11 np.poly1d is significantly faster than using the Polynomial package machinery. The performance of the two methods converges for python > 3.11, but mostly because np.poly1d is slowing down.

Is this expected behavior? I didn't see any mention of performance in the docs for the new Polynomial package. I have seen some reports that the Polynomial package is faster than np.poly1d (which would be great), but that's not what I'm seeing with my tests.

Any advice/insight would be greatly appreciated. For now the clear solution is to just use np.poly1d.

Reproduce the code example:

import numpy as np
from numpy.polynomial import Polynomial
import timeit
d = np.random.random((14, 2048, 2048))
P = Polynomial([1., 2., 3., 4.])
old_p = np.poly1d([4., 3., 2., 1.])

print(timeit.timeit("P(d)", globals={"P": P, "d": d}, number=10))

print(timeit.timeit("old_p(d)", globals={"old_p": old_p, "d": d}, number=10))

Error message:

Python and NumPy Versions:

For 3.10:

2.2.5
3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0]

For 3.11:

2.2.5
3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0]

For 3.12:

2.2.5
3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]

For 3.13:

2.2.5
3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) [GCC 13.3.0]

Runtime Environment:

For 3.10:

[{'numpy_version': '2.2.5',
  'python': '3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp10/lib/python3.10/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.11:

[{'numpy_version': '2.2.5',
  'python': '3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp11/lib/python3.11/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.12:

[{'numpy_version': '2.2.5',
  'python': '3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp12/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

For 3.13:

[{'numpy_version': '2.2.5',
  'python': '3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:44:03) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='XXX', release='6.14.4-arch1-2', version='#1 SMP PREEMPT_DYNAMIC Tue, 29 Apr 2025 09:23:13 +0000', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
 {'architecture': 'SkylakeX',
  'filepath': '/home/XXX/micromamba/envs/tmp/lib/python3.13/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so',
  'internal_api': 'openblas',
  'num_threads': 8,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

Context for the issue:

I need to apply polynomials to many large arrays and do it quickly. I want to use the new-and-improved Polynomial package, but its performance is forcing me to use the older np.poly1d.

@charris
Member

charris commented May 12, 2025

Could you give more details on your use case?

@eigenbrot
Author

Sure, what exactly would you like to know? The use case is not much more complicated than the example code I provided.

We have image arrays that need a correction applied and the correction is parameterized as a polynomial of arbitrary degree. The full example looks something like this:

def correct_image(data: np.ndarray, correction_polynomial: np.poly1d | np.polynomial.polynomial.Polynomial) -> np.ndarray:
    correction = correction_polynomial(data)
    corrected_data = data / correction
    return corrected_data

So there's an extra division, but I removed that common step in my tests above.

In the tests I ran above I used a stack of images (i.e., fourteen 2048 x 2048 images), mostly just to give numpy enough work to make the time differences obvious. In real life (as shown here) we typically correct a single 2D array at a time, but it really doesn't matter which we choose. The performance difference between Polynomial and poly1d is roughly the same regardless of the size/dimensionality of our data.
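The claim that the gap is roughly shape-independent can be spot-checked with a sketch like the following (a minimal version of the benchmark above; the shapes and iteration count here are arbitrary choices, not the ones from the reported tests):

```python
import timeit
import numpy as np
from numpy.polynomial import Polynomial

P = Polynomial([1., 2., 3., 4.])      # 1 + 2x + 3x^2 + 4x^3
old_p = np.poly1d([4., 3., 2., 1.])   # same polynomial, highest degree first

# Evaluate the same polynomial with both APIs across a few shapes;
# if per-call overhead dominated, the gap would shrink for larger arrays.
for shape in [(256, 256), (3, 512, 512)]:
    d = np.random.random(shape)
    t_new = timeit.timeit(lambda: P(d), number=20)
    t_old = timeit.timeit(lambda: old_p(d), number=20)
    print(f"{shape}: Polynomial {t_new:.3f}s, np.poly1d {t_old:.3f}s")
```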

@ngoldbaum
Member

It'd be useful to see a low-level profile generated using e.g. samply on a Python version that is seeing a slowdown and comparing that with one that isn't. That'll show where Python is spending its time, which might give a hint at where the difference is coming from.

@eendebakpt
Contributor

I am having some problems confirming the dependence on the python version and the slowdowns. @eigenbrot Could you also benchmark with smaller d and higher number argument for timeit?

There are a couple of open PRs to improve the performance btw: #24499, #24467, #26885, #24531

@eigenbrot
Author

eigenbrot commented May 13, 2025

@ngoldbaum, I'm attaching some profiles captured with samply via

$ samply record --save-only -o NN_polynomial.json -- python -c "import numpy as np; d = np.random.random((14, 2048, 2048)); p = np.polynomial.Polynomial([1, 2, 3, 4]); _ = p(d)"
$ samply record --save-only -o NN_poly1d.json -- python -c "import numpy as np; d = np.random.random((14, 2048, 2048)); p = np.poly1d([4, 3, 2, 1]); _ = p(d)"

The traces mean nothing to me, but if someone would like different samples please let me know.
10_poly1d.json
10_polynomial.json
11_poly1d.json
11_polynomial.json
12_poly1d.json
12_polynomial.json
13_poly1d.json
13_polynomial.json

@eendebakpt, I re-ran the test described in my initial post but with two modifications you requested: the shape of the data is now (3, 256, 256) and I set number=10_000 for the timeit call. Here are the results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 20.5 | 19.9 | 19.8 | 19.6 |
| np.poly1d | 12.3 | 12.3 | 16.8 | 17.2 |

@ngoldbaum
Member

I'm attaching some profiles captured with samply

Where? If you upload the profile you should be able to share Firefox profiler links.

@eigenbrot
Author

@ngoldbaum, sorry. I borked the upload. They should be there now.

@ngoldbaum
Member

Ah darn, unfortunately the profiles you uploaded don't have debug symbols, so they're pretty useless.

If I have time I can try doing this myself to give you something more useful to look at. In the meantime you should be able to build NumPy from source using the meson debugoptimized build profile to get an optimized executable but with useful debugging info and then regenerate the profiles.

If you're on Linux, you can also pass -X perf to Python to also get Python-level frames in the profiles.

@eigenbrot
Author

Ok, I've built numpy with the debugoptimized profile and re-generated the profiles and test results.

Just to make sure we're all on the same page, here is how I built numpy for each python version. N in these steps refers to the minor python version (10, 11, 12, 13):

$ mamba create -n nbuild_N python=3.N cython compilers openblas meson-python pkg-config
$ git clone https://github.com/numpy/numpy.git numpyN
$ cd numpyN
$ git checkout v2.2.5
$ git submodule update --init
$ python -m pip install -r requirements/build_requirements.txt
$ spin build -- -Dbuildtype=debugoptimized

All were built against OpenBLAS 0.3.29.

The profiles were then collected with (note the use of -X perf)

$ PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages' samply record --save-only -o N_poly1d_debug_perf.json -- python -X perf -c "import sys; del(sys.path[0]); import numpy as np; d = np.random.random((14, 2048, 2048)); P = np.poly1d([4, 3, 2, 1]); _ = P(d)"
$ PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages' samply record --save-only -o N_polynomial_debug_perf.json -- python -X perf -c "import sys; del(sys.path[0]); import numpy as np; d = np.random.random((14, 2048, 2048)); P = np.polynomial.Polynomial([1, 2, 3, 4]); _ = P(d)"

I had to go the PYTHONPATH route because spin python doesn't work for python 3.10 and I wanted a consistent test for all versions.

Profiles:
10_poly1d_debug_perf.json
10_polynomial_debug_perf.json

11_poly1d_debug_perf.json
11_polynomial_debug_perf.json

12_poly1d_debug_perf.json
12_polynomial_debug_perf.json

13_poly1d_debug_perf.json
13_polynomial_debug_perf.json

And here are the results of the timing test (with smaller arrays and more samples). Note that these were run in the same conda environment used to build numpy, prefixed with PYTHONPATH='/home/XXX/opensource/numpy_poly/numpyN/build-install/usr/lib/python3.N/site-packages', and include import sys; del(sys.path[0]):

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial | 19.4 | 20.7 | 19.5 | 19.9 |
| np.poly1d | 12.3 | 12.0 | 16.5 | 16.6 |

@ngoldbaum
Member

Unfortunately the profiles still don't have C debugging information so that won't tell me anything useful.

@ngoldbaum
Member

ngoldbaum commented May 14, 2025

Here's a profile generated on Python 3.11 using just Polynomial with 100 timeit iterations: https://share.firefox.dev/4j0GCyo

And here's the same thing, generated by calling poly1d with timeit 100 times: https://share.firefox.dev/4miarx9

Unfortunately this is on a Mac, so I can't use the perf integration.

@ngoldbaum
Member

And here's a profile that does both: https://share.firefox.dev/4jQ2MEN

At a high level, without any Python frames, it sort of looks like Polynomial is doing more ufunc operations? I don't see anything obvious in the profiles though - both are spending almost all their time in the add and multiply ufunc implementations.

There's probably something different in the algorithm that's being used that's causing some extra ufunc calculations to happen that aren't needed by poly1d.

@eendebakpt
Contributor

Both implementations use polyval in the end. Polynomial via

def __call__(self, arg):
    arg = pu.mapdomain(arg, self.domain, self.window)
    return self._val(arg, self.coef)

_val = staticmethod(polyval)

and poly1d calls directly:

def __call__(self, val):
    return polyval(self.coeffs, val)

For Polynomial there are extra calculations involved in the domain mapping through pu.mapdomain. Since the polynomial in the example is of relatively low degree (degree 3, four coefficients), the domain mapping explains (part of) the performance difference between Polynomial and poly1d. It does not explain why the performance of poly1d seems to have become worse when moving from 3.11 to 3.12.
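The cost of that domain-mapping step can be measured in isolation with a minimal sketch like this (not part of the original thread; the shape matches the smaller benchmark above, the iteration count is arbitrary):

```python
import timeit
import numpy as np
from numpy.polynomial import Polynomial, polyutils as pu

P = Polynomial([1., 2., 3., 4.])
d = np.random.random((3, 256, 256))

# The domain-mapping step that Polynomial.__call__ performs
# before delegating to polyval.
t_map = timeit.timeit(lambda: pu.mapdomain(d, P.domain, P.window), number=1_000)
# The full evaluation, for comparison.
t_full = timeit.timeit(lambda: P(d), number=1_000)
print(f"mapdomain alone: {t_map:.2f}s, full evaluation: {t_full:.2f}s")
```

Note that with the default domain and window (both `[-1, 1]`) the mapping is numerically an identity, so all of its cost is overhead.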

@eigenbrot What timings do you get if you use polyval directly?

@eigenbrot
Author

@eendebakpt just to make sure we're all on the same page: there are two versions of polyval. np.poly1d uses np.polyval, while np.polynomial.Polynomial uses numpy.polynomial.polynomial.polyval. Confusingly, the argument order (and coefficient order) is swapped between these two functions, so it's easy to mix them up.
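As a sanity check (a minimal sketch, not from the thread), the two functions can be confirmed to evaluate the same polynomial despite the swapped conventions:

```python
import numpy as np
from numpy.polynomial.polynomial import polyval as new_val
from numpy import polyval as old_val

x = np.linspace(-2.0, 2.0, 5)

# old style: coefficients first, highest degree first
old_res = old_val([4., 3., 2., 1.], x)
# new style: evaluation points first, lowest degree first
new_res = new_val(x, [1., 2., 3., 4.])

# both compute 4x^3 + 3x^2 + 2x + 1
print(np.allclose(old_res, new_res))
```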

Here are the results of running similar timing tests with the two versions of polyval. These were NOT run with the custom numpy builds mentioned above. The script is:

import numpy as np
from numpy.polynomial.polynomial import polyval as new_val
from numpy import polyval as old_val
import timeit

d = np.random.random((3, 256, 256))

old_res = timeit.timeit("old_val([4., 3., 2., 1.], d)", globals={"old_val":old_val, "d": d}, number=10_000)

new_res = timeit.timeit("new_val(d, [1., 2., 3., 4.])", globals={"new_val":new_val, "d": d}, number=10_000)

and the results:

| Method | 3.10 time (s) | 3.11 time (s) | 3.12 time (s) | 3.13 time (s) |
| --- | --- | --- | --- | --- |
| Polynomial's polyval | 14.3 | 15.2 | 13.8 | 13.5 |
| np.poly1d's polyval | 11.4 | 11.5 | 14.3 | 14.3 |

I'd say the overall trend is still the same, but the difference between the two implementations is much much smaller. There still seems to be an inflection point across the 3.11 -> 3.12 transition.

@eendebakpt
Contributor

@eigenbrot You are completely right, there are two versions of polyval.

In the old version there is an iteration over the coefficients with `for pv in p`, which yields scalars such as `pv = np.float64(4.0)`. In the new version the coefficients are reshaped via `c.reshape(c.shape + (1,) * x.ndim)` and the iteration is over `c[-i]`, which yields one-element arrays such as `array([4.])`. I suspect using scalars is slower for small arrays (adding a scalar to an array has quite a bit of overhead).
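That scalar-versus-one-element-array difference can be probed directly with a micro-benchmark (a sketch of my own, not from the thread; the array shape and repeat count are arbitrary):

```python
import timeit
import numpy as np

x = np.random.random((3, 16, 16))
pv_scalar = np.float64(4.0)                        # what the old polyval iterates over
pv_array = np.array([4.0]).reshape((1,) * x.ndim)  # what the new polyval produces

# Both additions broadcast to the same result; only the overhead differs.
t_scalar = timeit.timeit(lambda: x + pv_scalar, number=50_000)
t_array = timeit.timeit(lambda: x + pv_array, number=50_000)
print(f"scalar add: {t_scalar:.3f}s, 1-element-array add: {t_array:.3f}s")
```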

With your example script I do not get the inflection across 3.11 to 3.12, but I do see another effect: for d of shape (2, 256, 256) the old version is faster, but for shape (3, 26, 256) the new version is faster. If I use a 1d array for d I see a similar effect (depending on the size).

Because I do not see the performance change from 3.10 to 3.11 (or 3.12 or 3.13) on my system, it is a bit hard to investigate this further. Could you try to narrow down the issue even more by testing a few different shapes of d and benchmarking only part of polyval? I think benchmarking `y * x` would be interesting, and, with `w = y * x`, benchmarking `w + pv`.
