BUG: Limit the maximal number of bins for automatic histogram binning #28426
Conversation
Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
numpy/lib/_histograms_impl.py (outdated)

```python
# heuristic to limit the maximal number of bins
maximum_number_of_bins = 2 * x.size / math.log1p(x.size)
minimal_bw = np.subtract(*np.percentile(range, (100, 0))) / maximum_number_of_bins
```
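As a self-contained illustration, the lines under review behave like the sketch below. The sample data and the `hist_range` tuple are stand-ins for the variables in the PR, and the percentile call from the diff is replaced by a plain subtraction, since the tuple holds only two values (the point raised in the review below):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
hist_range = (x.min(), x.max())  # stand-in for the `range` tuple in the PR

# cap the bin count at roughly 2n / log(1 + n)
maximum_number_of_bins = 2 * x.size / math.log1p(x.size)
# the smallest bin width an estimator is then allowed to return
minimal_bw = (hist_range[1] - hist_range[0]) / maximum_number_of_bins
print(maximum_number_of_bins, minimal_bw)
```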
Two comments:
- The logarithm rule is basically the Sturges rule, I think. So you could just reuse that directly with some factor? (I don't have an intuition for how the two rules behave, so I'm not sure what this changes in practice yet.) There might be a fun difference, in that you use the range, while the Sturges rule seems to calculate the min/max.
- Percentile on the range seems odd, since it should just be two values.

(As said, I haven't thought about the actual heuristic choice yet, i.e. why does this mix FD and Sturges? Is FD usually smaller or larger for "reasonable" data?)
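To get a feel for that last question, here is a small comparison of the two rules, assuming their textbook definitions (the data sets are arbitrary choices, not from the PR):

```python
import numpy as np

def sturges_bins(x):
    # Sturges: about log2(n) + 1 bins
    return int(np.ceil(np.log2(x.size))) + 1

def fd_bins(x):
    # Freedman-Diaconis: bin width 2 * IQR * n**(-1/3)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2.0 * iqr * x.size ** (-1.0 / 3.0)
    return int(np.ceil(np.ptp(x) / width))

rng = np.random.default_rng(0)
for name, x in [("normal", rng.normal(size=100_000)),
                ("heavy-tailed", rng.standard_cauchy(size=100_000))]:
    print(f"{name}: Sturges={sturges_bins(x)}, FD={fd_bins(x)}")
```

Even on well-behaved data, FD asks for roughly an order of magnitude more bins at this sample size; on heavy-tailed data its count explodes, because the range grows much faster than the IQR. That runaway case is exactly the failure mode this PR caps.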
I ran the matplotlib, scikit-image, and jax tests against this PR branch and didn't see any issues.
I think this is a good change. Thanks for working on it. Since this is a visible behavior change in the Python API, maybe ping the mailing list? That might also shake someone out who is an expert on histogram binning.
A message was sent to the mailing list. I also updated the description in the first comment. Looking at the numbers now, I might pick a more conservative option (e.g. 50 * Sturges, or a combination of Sturges and sqrt).
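For scale, a rough look at how those candidate caps grow with the sample size (a sketch; the exact "combination of Sturges and sqrt" is not spelled out above, so only the two plain options are shown):

```python
import numpy as np

# How the candidate caps scale with sample size n.
for n in (10**3, 10**6, 10**9):
    sturges = np.log2(n) + 1
    print(f"n={n:>13,}: 50*Sturges={50 * sturges:,.0f}, sqrt(n)={np.sqrt(n):,.0f}")
```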
Yeah, I think limiting bins, maybe based on the Sturges estimate, makes sense. But the current logic feels a bit awkward. After looking at the reasoning, FD seems to be used because we want more bins when many bins may be (almost) empty, in which case Sturges estimates fewer than ideal bins, I think. In general, I am in favor though and don't want to think too much about the ideal logic. "auto" should mostly be used for plotting, and there anything beyond a few thousand bins is likely not useful. (I don't remember who wrote the initial version; I think someone from matplotlib. If we are worried, it would be enough to ping them, I think.)
I also have a use case for plotting where "auto" produces unpleasantly many bins and Sturges seems just fine. So in principle, I'm in favor of changing the behavior of auto. I think, however, that the goal should become clearer:
For the second point, aren't there already enough options? (Playing the devil's advocate here.)
I think it makes more sense to eyeball something for 2., because that is what…
My main objective here is 1. No objections to 2., but I would like to avoid long discussions. I modified the code to make the bin estimation continuous and verified that the behavior is (mostly) unchanged on several distributions. Test script:
For example, on a uniform dataset, and on a dataset with small IQR:
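The test script and its output are collapsed above; the following is a minimal sketch of the kind of check it might perform (the data sets here are my own stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform data: "auto" is well behaved.
uniform = rng.uniform(size=100_000)
print("uniform:", len(np.histogram_bin_edges(uniform, bins='auto')) - 1)

# Tiny IQR plus two far-away outliers: without a cap, the FD width
# stays tiny while the range is huge, so "auto" requests a huge bin count.
small_iqr = np.concatenate([rng.normal(0.0, 0.01, 100_000), [-50.0, 50.0]])
print("small IQR:", len(np.histogram_bin_edges(small_iqr, bins='auto')) - 1)
```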
Thanks, approving just as an indicator that to me this seems like a good approach (I am trusting that the "relaxed FD" exists ;)).
Ping @nayyarv (original author), just in case you want to comment.
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Unless anyone comments soon, I think we should merge this in the next few days (modulo that one doc nit); and I may just apply+merge when I go through the next time.
The Freedman-Diaconis implementation calculates the 25th and 75th percentiles as intermediate data. It surprised me that these don't end up in the locations of the outer_edges. Not using them makes the bin edges dependent on two likely outliers: the maximum and minimum of the data. That was enough reason to look up the original Freedman-Diaconis paper, which states:
This doesn't spell it out, but it only makes sense if the outer edges are related to the quartile positions. The cube root of the sample size follows from their theoretical considerations, but those assume that the interval over which the distribution is non-zero is known in advance, which is clearly not the case here. I haven't found the numerical paper they mention.
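For reference, a sketch of the Freedman-Diaconis construction as described above (a simplified model of the logic, not numpy's actual implementation). It shows the quartiles setting only the bin width, while the outer edges still come from the data extremes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

q25, q75 = np.percentile(x, [25, 75])
width = 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)  # FD bin width

# The quartiles only determine the width; the outermost edges are the
# data min/max, so two outliers shift every bin boundary.
n_bins = int(np.ceil((x.max() - x.min()) / width))
edges = np.linspace(x.min(), x.max(), n_bins + 1)
print(n_bins, edges[0], edges[-1])
```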
I think that my previous comment hints at a conceptual flaw in the whole binning procedure. The focus has shifted from finding bin boundaries to only the number of bins, but the bin boundaries also depend on the interval being divided into bins. Within the current design, I think the PR addresses the original bug and the choices are well motivated. Maybe a new issue should be created to address the conceptual problem.
@RonaldAJ, my opinion here is that what you are saying makes sense as an auto-binning method, but not at all as the default. So I am very much opposed to thinking about it here. But it does seem potentially very interesting as a new method.
I agree with the focus on plotting.
Rereading the code now, the conceptual problem is not there for Freedman-Diaconis, but the other bin widths do rely on the outliers. They could be made to rely on the quartiles instead, though I guess that also requires some study of the theory behind them. And because the quartiles require (partial) sorting, there might be some performance impact.

To address @seberg's concern about data outside the bins: it was not my point to limit the total range covered, but the number of bins should vary with the total range covered. That is, contrary to what I believed earlier, actually happening in the Freedman-Diaconis case, where the number of bins between the quartile points is fixed for a fixed number of data points, and extra bins are created to capture both extremes.
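A quick sketch of that last observation, using the simplified FD model from before: the bulk of the data (and hence the IQR and the bin width) is held fixed while a single outlier stretches the range.

```python
import numpy as np

rng = np.random.default_rng(0)
core = rng.normal(size=10_000)  # fixed bulk: IQR and n barely change

for outlier in (5.0, 50.0, 500.0):
    x = np.append(core, outlier)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2.0 * iqr * x.size ** (-1.0 / 3.0)
    # The width is stable, so the bin count grows with the covered range.
    print(f"outlier at {outlier}: {int(np.ceil(np.ptp(x) / width))} bins")
```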
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
…numpy#28426)

* Limit the maximal number of bins for automatic histogram binning
* BUG: Limit the maximal number of bins for automatic histogram binning
* fix test
* fix test import
* Update numpy/lib/_histograms_impl.py
  Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
* lint
* lint
* add release note
* fix issues with overflow
* review comments
* remove unused import
* remove unused import
* typo
* use continuous bin approximation
* update test
* Apply suggestions from code review
  Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
* Apply suggestions from code review
* Update numpy/lib/tests/test_histograms.py
* fix test
* Update numpy/lib/_histograms_impl.py
  Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

---------

Co-authored-by: Joren Hammudoglu <jhammudoglu@gmail.com>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Fixes #28400.
We limit the maximum number of bins in automatic histogram binning. A heuristic rule is used: the minimal bin width is 10% of the bin width given by the Sturges rule.
Some code to illustrate the maximum number of bins:
Output:
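As a stand-in for the snippet above, here is a small sketch assuming the stated rule (minimal bin width = 10% of the Sturges width), which caps the count at 10 × the Sturges bin count regardless of the data range:

```python
import numpy as np

# With minimal width = 0.1 * (range / (log2(n) + 1)), the cap is
# range / minimal_width = 10 * (log2(n) + 1), independent of the range.
for n in (10**3, 10**6, 10**9):
    print(f"n={n:>13,}: max bins ~ {10 * (np.log2(n) + 1):.0f}")
```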