BUG: Limit the maximal number of bins for automatic histogram binning #28426


Merged: 22 commits into numpy:main, Mar 17, 2025

Conversation

@eendebakpt (Contributor) commented Mar 4, 2025

Fixes #28400.

We limit the maximum number of bins in automatic histogram binning. A heuristic rule is used: the minimal bin width is 10% of the bin width from the Sturges rule.

Some code to illustrate the maximum number of bins:

import numpy as np
from numpy.lib._histograms_impl import _hist_bin_sturges, _hist_bin_sqrt

for n in [2, 10, 40, 100, 1_000, 10_000, 100_000, 1_000_000]:
    x = 100 * np.arange(n)
    range_arg = None  # the range argument is not used by these estimators
    sturges_bins = np.ptp(x) / _hist_bin_sturges(x, range_arg)
    sqrt_bins = np.ptp(x) / _hist_bin_sqrt(x, range_arg)

    # heuristic cap: ten times the bin count from the Sturges rule
    maximum_number_of_bins = 10 * sturges_bins
    print(f'{n=} {maximum_number_of_bins=:.1f} {sturges_bins=:.1f} {sqrt_bins=:.1f}')

Output:

n=2 maximum_number_of_bins=20.0 sturges_bins=2.0 sqrt_bins=1.4
n=10 maximum_number_of_bins=43.2 sturges_bins=4.3 sqrt_bins=3.2
n=40 maximum_number_of_bins=63.2 sturges_bins=6.3 sqrt_bins=6.3
n=100 maximum_number_of_bins=76.4 sturges_bins=7.6 sqrt_bins=10.0
n=1000 maximum_number_of_bins=109.7 sturges_bins=11.0 sqrt_bins=31.6
n=10000 maximum_number_of_bins=142.9 sturges_bins=14.3 sqrt_bins=100.0
n=100000 maximum_number_of_bins=176.1 sturges_bins=17.6 sqrt_bins=316.2
n=1000000 maximum_number_of_bins=209.3 sturges_bins=20.9 sqrt_bins=1000.0

eendebakpt and others added 2 commits March 4, 2025 12:05
Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 4, 2025
@seberg seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 5, 2025

# heuristic to limit the maximal number of bins
maximum_number_of_bins = 2 * x.size / math.log1p(x.size)
minimal_bw = np.subtract(*np.percentile(range, (100, 0))) / maximum_number_of_bins
Review comment (Member):

Two comments:

  1. The logarithm rule is basically the sturges rule, I think. So I think you could just re-use that directly with some factor? (I don't have an intuition for how the two rules behave, so not sure what this changes in practice yet.)
    There might be a fun difference, in that you use the range, and the sturges rule seems to calculate the min/max?
  2. percentile on the range seems odd, since it should just be two values.

(As said, didn't think about the actual heuristic choice yet, i.e. why does this mix fd and sturges? Is fd usually smaller or larger for "reasonable" data?)
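
For intuition, a small side calculation (my own sketch, not part of the PR) comparing the proposed cap 2*n/log1p(n) with the Sturges bin count log2(n) + 1:

import math

# Sketch: how the proposed cap on the number of bins scales compared
# with the Sturges bin count, for a range of sample sizes.
for n in [10, 100, 1_000, 100_000, 1_000_000]:
    cap = 2 * n / math.log1p(n)  # cap from the quoted heuristic
    sturges = math.log2(n) + 1   # bin count from the Sturges rule
    print(f'{n=} cap={cap:.1f} sturges={sturges:.1f}')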

@ngoldbaum (Member)
I ran the matplotlib, scikit-image, and jax tests against this PR branch and didn't see any issues.

@ngoldbaum (Member) left a comment

I think this is a good change. Thanks for working on it. Since this is a visible behavior change in the Python API, maybe ping the mailing list? That might also shake someone out who is an expert on histogram binning.

@eendebakpt (Contributor, Author)

The message was sent to the mailing list. I also updated the description in the first comment. Looking at the numbers now, I might pick a more conservative option (e.g. 50 * Sturges, or a combination of Sturges and sqrt).

@seberg (Member) commented Mar 6, 2025

Yeah, I think limiting bins, maybe based on the Sturges estimate, makes sense. But the current logic feels a bit awkward.

After looking at the reasoning, FD seems to be used because we want more bins when many bins may be (almost) empty, so that Sturges estimates fewer than ideal bins, I think.
It may make sense to me to use min(fd_bw, sturges_bw * factor), but right now I think you have a step: if the width is smaller than sturges_bw * 0.1, you instead go with sturges_bw (i.e. 10x fewer bins).

In general, I am in favor though and don't want to think too much about the ideal logic. "auto" should mostly be used for plotting and there anything beyond a few thousand bins is likely not useful.
Just feel we should avoid obvious artifacts :).

(I don't remember who wrote the initial version, I think someone from matplotlib; if we are worried, it would be enough to ping them, I think.)
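
As an illustration of the step described above (a hypothetical sketch; the exact code in the PR at this point may differ), compare a stepped fallback with the continuous variant that appears later in this thread:

def stepped_bw(fd_bw, sturges_bw):
    # hypothetical stepped rule: once the FD width drops below
    # 0.1 * sturges_bw, fall back to the full Sturges width,
    # a sudden factor-of-10 reduction in the number of bins
    return fd_bw if fd_bw >= 0.1 * sturges_bw else sturges_bw


def continuous_bw(fd_bw, sturges_bw, sqrt_bw):
    # continuous alternative (the shape used in _hist_bin_pr below):
    # relax FD via the sqrt estimate, then cap the bin count via Sturges
    return min(max(fd_bw, sqrt_bw / 2), sturges_bw)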

@lorentzenchr (Contributor)

I also have a use case for plotting where "auto" produces unpleasantly many bins and Sturges seems just fine. So in principle, I'm in favor of changing the behavior of auto. I think, however, that the goal should become clearer:

  1. Avoid OOM errors; or
  2. A more pleasant auto option

For the 2. point, aren't there already enough options? (playing the devil's advocate here)

@seberg (Member) commented Mar 6, 2025

I think it makes more sense to eyeball something for 2., because that is what "auto" is meant to be used for.
But if that's hard, sure, anything is OK just to not fail in silly ways (heck, even something like nbins <= n_points or so).

@eendebakpt (Contributor, Author)

My main objective here is 1. No objections to 2., but I would like to avoid long discussions. I modified the code to make the bin estimation continuous and verified that the behavior is (mostly) unchanged on several distributions.

Test script
import numpy as np
from numpy.lib._histograms_impl import _hist_bin_sturges, _hist_bin_sqrt, _hist_bin_fd, _hist_bin_auto
import matplotlib.pyplot as plt
from collections import defaultdict


def uniform_dataset(size):
    return np.random.rand(size)


def poisson_dataset(size):
    return np.random.poisson(size=size)


def normal_dataset(size):
    return np.random.normal(size=size)


def double_gaussian_dataset(size):
    x = np.random.normal(size=size)
    x[:x.size//2] += 10
    return x


def iqr(x):
    return np.diff(np.percentile(x, [25, 75])).item()


x = normal_dataset(10_000)
print(f'IQR for normal dataset: {iqr(x)}/{np.ptp(x)}')


def small_iqr_dataset(size):
    x = np.random.rand(size)
    x[:size//3] = .5
    x[size//3:(2*size)//3] = .5+1e-3
    return x

def _hist_bin_pr(x, range):
    # properties: continuous, behavior (mostly) unchanged on several distributions, no out-of-memory
    fd_bw = _hist_bin_fd(x, range)
    sturges_bw = _hist_bin_sturges(x, range)
    sqrt_bw = _hist_bin_sqrt(x, range)
    fd_bw_corrected = max(fd_bw, sqrt_bw / 2)
    return min(fd_bw_corrected, sturges_bw)


iterations = 400
nn = [2, 6, 10, 20, 40, 60, 100, 200, 500, 1_000, 10_000, 100_000, 1_000_000]  # , 10_000_000]
# nn=[10, 20, 40, 60, 100, 200, 500, 1_000, 10_000]
sturges = []
sqrt = []
fd_normal = []
fd_uniform = []
fd_iqr = []
current_main = []
pr = []

methods = {'Sturges': _hist_bin_sturges, 'FD': _hist_bin_fd,
           'Sqrt': _hist_bin_sqrt, 'Main': _hist_bin_auto, 'PR': _hist_bin_pr}
datasets = {'Uniform': uniform_dataset, 'Poisson': poisson_dataset, 'Normal': normal_dataset,
            'Double Gaussian': double_gaussian_dataset, 'Small IQR': small_iqr_dataset}

range_arg = None  # the range argument is ignored by the bin estimators
results = defaultdict(dict)

for dataset, dataset_method in datasets.items():
    print(f'Generating data for {dataset}')
    for method_name in methods:
        results[dataset][method_name] = np.zeros(len(nn))
    for idx, n in enumerate(nn):
        print(n)
        for it in range(iterations):
            x = dataset_method(n)
            for method_name, method in methods.items():
                bw = method(x, range_arg)
                if bw == 0:
                    bins = 1
                else:
                    bins = np.ptp(x) / bw
                results[dataset][method_name][idx] += bins
    for method_name in methods:
        results[dataset][method_name] /= iterations


# %%
markers = defaultdict(lambda: '.', {'Main': 'd', 'PR': '.'})
linewidth = defaultdict(lambda: 1, {'Main': 2, 'PR': 2})
for idx, dataset in enumerate(datasets):
    print(f'Plotting data for {dataset}')

    r = results[dataset]
    plt.figure(100+idx)
    plt.clf()
    for name, counts in r.items():
        marker = markers[name]
        plt.plot(nn, counts, '-', marker=marker, linewidth=linewidth[name], label=f'{name}')
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Number of data points')
    plt.ylabel('Number of bins')
    plt.legend()
    plt.title(f'Dataset {dataset}')

For example, on a uniform dataset:

[figure: average number of bins vs. number of data points for each method, uniform dataset]

And on a dataset with small IQR:

[figure: average number of bins vs. number of data points for each method, small-IQR dataset]

@seberg (Member) left a comment

Thanks, approving just as an indicator that to me this seems like a good approach (I am trusting that the "relaxed FD" exists ;)).

Ping @nayyarv (original author), just in case you want to comment.

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg (Member) commented Mar 14, 2025

Unless anyone comments soon, I think we should merge this in the next few days (modulo that one doc nit); I may just apply and merge when I go through the next time.

@RonaldAJ (Contributor) commented Mar 15, 2025

The Freedman-Diaconis implementation calculates the 25th and 75th percentiles as intermediate data. It surprised me that these don't end up determining the location of the outer_edges. Not using them makes the bin edges dependent on two likely outliers: the maximum and minimum of the data.

Enough reason to look up the original paper, and the Freedman-Diaconis paper states:

However, numerical computations, which will be reported elsewhere, suggest that the following simple, robust rule for choosing the cell width h often gives quite reasonable results.

(1.8) Rule: Choose the cell width as twice the interquartile range of the data, divided by the cube root of the sample size.

This doesn't spell it out, but it only makes sense if the outer edges are related to the quartile positions. The cube root of the sample size follows from their theoretical considerations, but those assume that the interval over which the distribution is non-zero is known in advance, which is clearly not the case here. I haven't found the numerical paper they mention.
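
For reference, the quoted rule in code (a minimal sketch of the rule as stated, not numpy's implementation):

import numpy as np

def fd_bin_width(x):
    # (1.8) Rule: twice the interquartile range of the data,
    # divided by the cube root of the sample size
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 2 * iqr / len(x) ** (1 / 3)

x = np.random.normal(size=10_000)
h = fd_bin_width(x)
print(f'bin width: {h:.4f}, bins over the full data range: {np.ptp(x) / h:.1f}')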

@RonaldAJ (Contributor)

I think that my previous comment hints at a conceptual flaw in the whole binning procedure. The focus has shifted from finding bin boundaries to finding only the number of bins. But the bin boundaries also depend on the interval being divided into bins.

Within the design I think the PR addresses the original bug and the choices are well motivated. Maybe a new issue should be created to address the conceptual problem.

@seberg (Member) commented Mar 16, 2025

@RonaldAJ, my opinion here is that what you are saying makes sense as an auto-binning method, but not at all as the default bins="auto" method. The auto method is for plotting, and I think not including the extreme points is something that the user must choose very explicitly, to not be surprised.

So I am very much opposed to thinking about it here. But it does seem potentially very interesting as a new bins= method.

@RonaldAJ (Contributor)

I agree with the focus on plotting.

@RonaldAJ (Contributor) commented Mar 16, 2025

I reread the code now, and the conceptual problem is not there for Freedman-Diaconis, but the other bin widths do rely on the outliers. They could be made to rely on the quartiles instead, but I guess that also requires some studying of the theory behind those. Also, because the quartiles require (partial) sorting, there might be some performance impact.

To address @seberg's concern about data outside the bins: it was not my point to limit the total range covered, but the number of bins should vary with the total range covered. That is, contrary to what I believed earlier, actually happening in the Freedman-Diaconis case, where the number of bins between the quartile points is fixed for a fixed number of data points, and extra bins are created to capture both extremes.
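
A small sketch of that last observation (my own illustration, under the rule h = 2*IQR/n**(1/3)): the number of bins spanning the interquartile range is IQR/h = n**(1/3)/2, independent of the extremes, while outliers only add bins in the tails:

import numpy as np

n = 8_000
x = np.random.normal(size=n)
iqr = np.subtract(*np.percentile(x, [75, 25]))
h = 2 * iqr / n ** (1 / 3)  # Freedman-Diaconis bin width

# fixed number of bins between the quartiles, regardless of outliers
print(f'bins inside the IQR: {iqr / h:.1f} (= n**(1/3)/2 = {n ** (1 / 3) / 2:.1f})')

# one far outlier stretches the total range, adding bins only in the tail
x_out = np.concatenate([x, [50.0]])
print(f'total bins without the outlier: {np.ptp(x) / h:.0f}')
print(f'total bins with the outlier:    {np.ptp(x_out) / h:.0f}')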

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg seberg merged commit 53f4cc5 into numpy:main Mar 17, 2025
72 of 73 checks passed
MaanasArora pushed a commit to MaanasArora/numpy that referenced this pull request Apr 11, 2025
…numpy#28426)

* Limit the maximal number of bins for automatic histogram binning

* BUG: Limit the maximal number of bins for automatic histogram binning

* fix test

* fix test import

* Update numpy/lib/_histograms_impl.py

Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>

* lint

* lint

* add release note

* fix issues with overflow

* review comments

* remove unused import

* remove unused import

* typo

* use continuous bin approximation

* update test

* Apply suggestions from code review

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* Apply suggestions from code review

* Update numpy/lib/tests/test_histograms.py

* fix test

* Update numpy/lib/_histograms_impl.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

---------

Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

Successfully merging this pull request may close these issues.

BUG: numpy.histogram tries to allocate 98TB of memory with bins="auto"