BUG: Limit the maximal number of bins for automatic histogram binning #28426


Merged: 22 commits into numpy:main, Mar 17, 2025

Conversation

@eendebakpt (Contributor) commented Mar 4, 2025

Fixes #28400.

We limit the maximum number of bins in automatic histogram binning. A heuristic rule is used: the minimal bin width is 10% of the bin width from the Sturges rule.

Some code to illustrate the maximum number of bins:

import numpy as np
from numpy.lib._histograms_impl import _hist_bin_sturges, _hist_bin_sqrt

for n in [2, 10, 40, 100, 1_000, 10_000, 100_000, 1_000_000]:
    x = 100 * np.arange(n)
    range_arg = None  # the range argument is not used by these estimators
    sturges_bins = np.ptp(x) / _hist_bin_sturges(x, range_arg)
    sqrt_bins = np.ptp(x) / _hist_bin_sqrt(x, range_arg)

    # heuristic cap: ten times the bin count from the Sturges rule
    maximum_number_of_bins = 10 * sturges_bins
    print(f'{n=} {maximum_number_of_bins=:.1f} {sturges_bins=:.1f} {sqrt_bins=:.1f}')

Output:

n=2 maximum_number_of_bins=20.0 sturges_bins=2.0 sqrt_bins=1.4
n=10 maximum_number_of_bins=43.2 sturges_bins=4.3 sqrt_bins=3.2
n=40 maximum_number_of_bins=63.2 sturges_bins=6.3 sqrt_bins=6.3
n=100 maximum_number_of_bins=76.4 sturges_bins=7.6 sqrt_bins=10.0
n=1000 maximum_number_of_bins=109.7 sturges_bins=11.0 sqrt_bins=31.6
n=10000 maximum_number_of_bins=142.9 sturges_bins=14.3 sqrt_bins=100.0
n=100000 maximum_number_of_bins=176.1 sturges_bins=17.6 sqrt_bins=316.2
n=1000000 maximum_number_of_bins=209.3 sturges_bins=20.9 sqrt_bins=1000.0

eendebakpt and others added 2 commits March 4, 2025 12:05
Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 4, 2025
@seberg seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 5, 2025

# heuristic to limit the maximal number of bins
maximum_number_of_bins = 2 * x.size / math.log1p(x.size)
minimal_bw = np.subtract(*np.percentile(range, (100, 0))) / maximum_number_of_bins
Review comment (Member):

Two comments:

  1. The logarithm rule is basically the sturges rule, I think. So I think you could just re-use that directly with some factor? (I don't have an intuition for how the two rules behave, so not sure what this changes in practice yet.)
    There might be a fun difference, in that you use the range, and the sturges rule seems to calculate the min/max?
  2. percentile on the range seems odd, since it should just be two values.

(As said, didn't think about the actual heuristic choice yet, i.e. why does this mix fd and sturges? Is fd usually smaller or larger for "reasonable" data?)
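
For intuition, a small side calculation (my own sketch, not part of the PR) comparing the proposed cap 2*n/log1p(n) with the Sturges bin count log2(n) + 1:

import math

# Sketch: how the proposed cap on the number of bins scales compared
# with the Sturges bin count, for a range of sample sizes.
for n in [10, 100, 1_000, 100_000, 1_000_000]:
    cap = 2 * n / math.log1p(n)  # cap from the quoted heuristic
    sturges = math.log2(n) + 1   # bin count from the Sturges rule
    print(f'{n=} cap={cap:.1f} sturges={sturges:.1f}')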

@ngoldbaum (Member)
I ran the matplotlib, scikit-image, and jax tests against this PR branch and didn't see any issues.

@ngoldbaum (Member) left a comment

I think this is a good change. Thanks for working on it. Since this is a visible behavior change in the Python API, maybe ping the mailing list? That might also shake someone out who is an expert on histogram binning.

@eendebakpt (Contributor, Author)

The message was sent to the mailing list. I also updated the description in the first comment. Looking at the numbers now, I might pick a more conservative option (e.g. 50 * Sturges, or a combination of Sturges and sqrt).

@seberg (Member) commented Mar 6, 2025

Yeah, I think limiting bins, maybe based on the Sturges estimate, makes sense. But the current logic feels a bit awkward.

After looking at the reasoning, FD seems to be used because we want more bins when many bins may be (almost) empty, so that Sturges estimates fewer than ideal bins, I think.
It may make sense to me to use min(fd_bw, sturges_bw * factor), but right now I think you have a step: if the width is smaller than sturges_bw * 0.1, you instead go with sturges_bw (i.e. 10x fewer bins).

In general, I am in favor though and don't want to think too much about the ideal logic. "auto" should mostly be used for plotting and there anything beyond a few thousand bins is likely not useful.
Just feel we should avoid obvious artifacts :).

(I don't remember who wrote the initial version, I think someone from matplotlib; if we are worried, it would be enough to ping them, I think.)
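
As an illustration of the step described above (a hypothetical sketch; the exact code in the PR at this point may differ), compare a stepped fallback with the continuous variant that appears later in this thread:

def stepped_bw(fd_bw, sturges_bw):
    # hypothetical stepped rule: once the FD width drops below
    # 0.1 * sturges_bw, fall back to the full Sturges width,
    # a sudden factor-of-10 reduction in the number of bins
    return fd_bw if fd_bw >= 0.1 * sturges_bw else sturges_bw


def continuous_bw(fd_bw, sturges_bw, sqrt_bw):
    # continuous alternative (the shape used in _hist_bin_pr below):
    # relax FD via the sqrt estimate, then cap the bin count via Sturges
    return min(max(fd_bw, sqrt_bw / 2), sturges_bw)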

@lorentzenchr (Contributor)

I also have a use case for plotting where "auto" produces unpleasantly many bins and Sturges seems just fine. So in principle, I'm in favor of changing the behavior of auto. I think, however, that the goal should become clearer:

  1. Avoid OOM errors; or
  2. A more pleasant auto option

For the 2. point, aren't there already enough options? (playing the devil's advocate here)

@seberg (Member) commented Mar 6, 2025

I think it makes more sense to eyeball something for 2., because that is what "auto" is meant to be used for.
But if that's hard, sure, anything is OK just to not fail in silly ways (heck, even something like nbins <= n_points or so).

@eendebakpt (Contributor, Author)

My main objective here is 1. No objections to 2., but I would like to avoid long discussions. I modified the code to make the bin estimation continuous and verified that the behavior is (mostly) unchanged on several distributions.

Test script
import numpy as np
from numpy.lib._histograms_impl import _hist_bin_sturges, _hist_bin_sqrt, _hist_bin_fd, _hist_bin_auto
import matplotlib.pyplot as plt
from collections import defaultdict


def uniform_dataset(size):
    return np.random.rand(size)


def poisson_dataset(size):
    return np.random.poisson(size=size)


def normal_dataset(size):
    return np.random.normal(size=size)


def double_gaussian_dataset(size):
    x = np.random.normal(size=size)
    x[:x.size//2] += 10
    return x


def iqr(x):
    return np.diff(np.percentile(x, [25, 75])).item()


x = normal_dataset(10_000)
print(f'IQR for normal dataset: {iqr(x)}/{np.ptp(x)}')


def small_iqr_dataset(size):
    x = np.random.rand(size)
    x[:size//3] = .5
    x[size//3:(2*size)//3] = .5+1e-3
    return x

def _hist_bin_pr(x, range):
    # properties: continuous, behavior (mostly) unchanged on several distributions, no out-of-memory
    fd_bw = _hist_bin_fd(x, range)
    sturges_bw = _hist_bin_sturges(x, range)
    sqrt_bw = _hist_bin_sqrt(x, range)
    fd_bw_corrected = max(fd_bw, sqrt_bw / 2)
    return min(fd_bw_corrected, sturges_bw)


iterations = 400
nn = [2, 6, 10, 20, 40, 60, 100, 200, 500, 1_000, 10_000, 100_000, 1_000_000]  # , 10_000_000]
# nn=[10, 20, 40, 60, 100, 200, 500, 1_000, 10_000]
sturges = []
sqrt = []
fd_normal = []
fd_uniform = []
fd_iqr = []
current_main = []
pr = []

methods = {'Sturges': _hist_bin_sturges, 'FD': _hist_bin_fd,
           'Sqrt': _hist_bin_sqrt, 'Main': _hist_bin_auto, 'PR': _hist_bin_pr}
datasets = {'Uniform': uniform_dataset, 'Poisson': poisson_dataset, 'Normal': normal_dataset,
            'Double Gaussian': double_gaussian_dataset, 'Small IQR': small_iqr_dataset}

range_arg = None  # the range argument is ignored by the bin estimators
results = defaultdict(dict)

for dataset, dataset_method in datasets.items():
    print(f'Generating data for {dataset}')
    for method_name in methods:
        results[dataset][method_name] = np.zeros(len(nn))
    for idx, n in enumerate(nn):
        print(n)
        for it in range(iterations):
            x = dataset_method(n)
            for method_name, method in methods.items():
                bw = method(x, range_arg)
                if bw == 0:
                    bins = 1
                else:
                    bins = np.ptp(x) / bw
                results[dataset][method_name][idx] += bins
    for method_name in methods:
        results[dataset][method_name] /= iterations


# %%
markers = defaultdict(lambda: '.', {'Main': 'd', 'PR': '.'})
linewidth = defaultdict(lambda: 1, {'Main': 2, 'PR': 2})
for idx, dataset in enumerate(datasets):
    print(f'Plotting data for {dataset}')

    r = results[dataset]
    plt.figure(100+idx)
    plt.clf()
    for name, counts in r.items():
        marker = markers[name]
        plt.plot(nn, counts, '-', marker=marker, linewidth=linewidth[name], label=f'{name}')
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Number of data points')
    plt.ylabel('Number of bins')
    plt.legend()
    plt.title(f'Dataset {dataset}')

For example, on a uniform dataset:

[figure: average number of bins vs. number of data points for each method, uniform dataset]

And on a dataset with small IQR:

[figure: average number of bins vs. number of data points for each method, small-IQR dataset]

@seberg (Member) left a comment

Thanks, approving just as an indicator that to me this seems like a good approach (I am trusting that the "relaxed FD" exists ;)).

Ping @nayyarv (original author), just in case you want to comment.

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg (Member) commented Mar 14, 2025

Unless anyone comments soon, I think we should merge this in the next few days (modulo that one doc nit); I may just apply and merge when I go through the next time.

@RonaldAJ (Contributor) commented Mar 15, 2025

The Freedman-Diaconis implementation calculates the 25th and 75th percentiles as intermediate data. It surprised me that these don't end up determining the location of the outer_edges. Not using them makes the bin edges dependent on two likely outliers: the maximum and minimum of the data.

Enough reason to look up the original paper, and the Freedman-Diaconis paper states:

However, numerical computations, which will be reported elsewhere, suggest that the following simple, robust rule for choosing the cell width h often gives quite reasonable results.

(1.8) Rule: Choose the cell width as twice the interquartile range of the data, divided by the cube root of the sample size.

This doesn't spell it out, but it only makes sense if the outer edges are related to the quartile positions. The cube root of the sample size follows from their theoretical considerations, but those assume that the interval over which the distribution is non-zero is known in advance, which is clearly not the case here. I haven't found the numerical paper they mention.
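
For reference, the quoted rule in code (a minimal sketch of the rule as stated, not numpy's implementation):

import numpy as np

def fd_bin_width(x):
    # (1.8) Rule: twice the interquartile range of the data,
    # divided by the cube root of the sample size
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 2 * iqr / len(x) ** (1 / 3)

x = np.random.normal(size=10_000)
h = fd_bin_width(x)
print(f'bin width: {h:.4f}, bins over the full data range: {np.ptp(x) / h:.1f}')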

@RonaldAJ (Contributor)

I think that my previous comment hints at a conceptual flaw in the whole binning procedure. The focus has shifted from finding bin boundaries to finding only the number of bins. But the bin boundaries also depend on the interval being divided into bins.

Within the design I think the PR addresses the original bug and the choices are well motivated. Maybe a new issue should be created to address the conceptual problem.

@seberg (Member) commented Mar 16, 2025

@RonaldAJ, my opinion here is that what you are saying makes sense as an auto-binning method, but not at all as the default bins="auto" method. The auto method is for plotting, and I think not including the extreme points is something that the user must choose very explicitly, to not be surprised.

So I am very much opposed to thinking about it here. But it does seem potentially very interesting as a new bins= method.

@RonaldAJ (Contributor)

I agree with the focus on plotting.

@RonaldAJ (Contributor) commented Mar 16, 2025

I reread the code now, and the conceptual problem is not there for Freedman-Diaconis, but the other bin widths do rely on the outliers. They could be made to rely on the quartiles instead, but I guess that also requires some studying of the theory behind those. Also, because the quartiles require (partial) sorting, there might be some performance impact.

To address @seberg's concern about data outside the bins: it was not my point to limit the total range covered, but the number of bins should vary with the total range covered. That is, contrary to what I believed earlier, actually happening in the Freedman-Diaconis case, where the number of bins between the quartile points is fixed for a fixed number of data points, and extra bins are created to capture both extremes.
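
A small sketch of that last observation (my own illustration, under the rule h = 2*IQR/n**(1/3)): the number of bins spanning the interquartile range is IQR/h = n**(1/3)/2, independent of the extremes, while outliers only add bins in the tails:

import numpy as np

n = 8_000
x = np.random.normal(size=n)
iqr = np.subtract(*np.percentile(x, [75, 25]))
h = 2 * iqr / n ** (1 / 3)  # Freedman-Diaconis bin width

# fixed number of bins between the quartiles, regardless of outliers
print(f'bins inside the IQR: {iqr / h:.1f} (= n**(1/3)/2 = {n ** (1 / 3) / 2:.1f})')

# one far outlier stretches the total range, adding bins only in the tail
x_out = np.concatenate([x, [50.0]])
print(f'total bins without the outlier: {np.ptp(x) / h:.0f}')
print(f'total bins with the outlier:    {np.ptp(x_out) / h:.0f}')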

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg seberg merged commit 53f4cc5 into numpy:main Mar 17, 2025
72 of 73 checks passed
MaanasArora pushed a commit to MaanasArora/numpy that referenced this pull request Apr 11, 2025
…numpy#28426)

* Limit the maximal number of bins for automatic histogram binning

* BUG: Limit the maximal number of bins for automatic histogram binning

* fix test

* fix test import

* Update numpy/lib/_histograms_impl.py

Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>

* lint

* lint

* add release note

* fix issues with overflow

* review comments

* remove unused import

* remove unused import

* typo

* use continuous bin approximation

* update test

* Apply suggestions from code review

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* Apply suggestions from code review

* Update numpy/lib/tests/test_histograms.py

* fix test

* Update numpy/lib/_histograms_impl.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

---------

Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

Successfully merging this pull request may close these issues.

BUG: numpy.histogram tries to allocate 98TB of memory with bins="auto"