A "pop-up choice" menu will appear on the left when the applet is finished loading. This may take a minute or two depending on the speed of your internet connection and computer. Please be patient.

Click and hold the choice menu and choose a dataset. After a slight delay, a window will open containing a histogram and cross validation plots.

If you wish to enter or paste in your own data, choose "Enter Data." A window will open for you to enter your data. You should have one data point per line.

The histogram is the most important graphical tool for exploring the shape of data distributions. Textbooks generally provide detailed instructions on histogram construction but offer only a few examples. Research in nonparametric density estimation (Scott, 1992) has produced a wealth of results on how to tell "good" histograms from "bad" ones. We have selected a so-called "cross-validation" criterion here (see below). The histogram applet automatically provides auxiliary graphs of the cross-validation functions, which estimate the quality of the currently displayed histogram. Smaller values of the cross-validation function generally imply smaller errors in the approximation. Clicking on these graphs (or the plus/minus widgets) takes you to other histograms with different bin widths, or to histograms with the same bin width but shifted bin edge locations. The graphs therefore also predict the quality of those other histograms.

The graph on the left shows the cross-validation values of 30 bin widths for the value of the "lower limit of the first bin" specified on the right graph. The graph on the right shows the cross-validation values of 20 "lower limits of the first bin" for the bin width specified on the left graph.

The default histogram uses the lowest of the lower limits shown on the right graph in conjunction with the bin width that gives the lowest cross-validation value of the 30 original bin widths. Typically, the lowest of the lower limits will not produce the lowest cross-validation value. Finding the lowest value is an iterative process. Try out different combinations of the lower limit and the bin width and observe the results.
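The iterative process described above can be sketched in Python as an alternating search. This is only an illustration under stated assumptions, not the applet's code: the names `cv_score` and `alternating_search` are hypothetical, the candidate lower limits are taken as one fixed list that does not depend on the bin width, and the score uses the closed-form leave-one-out expression for histogram bin counts (Rudemo, 1982; Scott, 1992).

```python
import math
from collections import Counter

def cv_score(data, origin, h):
    """Closed-form leave-one-out cross-validation score of a histogram.

    Bins are [origin + k*h, origin + (k+1)*h) for every integer k, so any
    origin is valid. Requires h > 0 and at least two data points.
    """
    n = len(data)
    counts = Counter(math.floor((x - origin) / h) for x in data)
    sum_sq = sum(c * c for c in counts.values())
    return 2.0 / ((n - 1) * h) - (n + 1) * sum_sq / (n * n * (n - 1) * h)

def alternating_search(data, origins, widths, max_rounds=20):
    """Alternate between the two graphs until neither choice changes."""
    # Default histogram: lowest lower limit, best bin width for that limit.
    origin = min(origins)
    h = min(widths, key=lambda w: cv_score(data, origin, w))
    for _ in range(max_rounds):
        # Best lower limit for the current width, then best width for it.
        new_origin = min(origins, key=lambda o: cv_score(data, o, h))
        new_h = min(widths, key=lambda w: cv_score(data, new_origin, w))
        if (new_origin, new_h) == (origin, h):
            break
        origin, h = new_origin, new_h
    return origin, h, cv_score(data, origin, h)
```

Each step can only lower (or keep) the score, so the result is never worse than the default histogram, but it may be a local rather than a global minimum over the grid.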

You can change the lower limit and/or bin width parameters in three ways: (1) clicking on a point in a cross-validation graph, (2) clicking the "+/-" buttons, or (3) entering a value in the text field and pressing Return. The red points on the cross-validation graphs show the values used by the histogram displayed above.

For problems such as calibrating a histogram, conventional parametric algorithms such as maximum likelihood do not apply. Nonparametric problems such as the histogram are more amenable to so-called minimum distance criteria. The idea is to find a histogram that is close point-by-point to the true (but unknown) density function, f(x). One popular criterion is the integrated squared error:

$$\mathrm{ISE}(h) = \int \left[ \hat{f}_h(x) - f(x) \right]^2 \, dx$$

where $\hat{f}_h$ is the histogram estimator and the subscript h gives its bin width. This criterion can be expanded into three terms, which can be considered individually:

$$\mathrm{ISE}(h) = \int \hat{f}_h(x)^2 \, dx \; - \; 2 \int \hat{f}_h(x) \, f(x) \, dx \; + \; \int f(x)^2 \, dx$$

The first term can be computed easily and exactly as the bin width (or the bin location) is changed. The third term involves the unknown density function, but fortunately it can be ignored, since it does not change as various histogram parameters are tried. Dropping this term lowers the level of the criterion, which is then usually negative. However, the term is only a constant shift in the level of the cross-validation curve, so dropping it does not change the location of the best bin width on the curve. Therefore, the second term is where the action lies. Ignoring the factor of -2, the integral represents the average height of the histogram, that is, its expected height at a point drawn from f. Rudemo (1982) suggested a leave-one-out estimator for this integral. Specifically, he suggested constructing the histogram with one data point held out and then evaluating this histogram at the held-out data point. This can be repeated n times, once for each data point, and the n histogram evaluations averaged.
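The leave-one-out procedure can be written out directly in Python. This is a minimal, deliberately brute-force sketch rather than the applet's implementation: the names `bin_index` and `loo_average_height`, the parameter `origin` (the lower limit of the first bin), and the half-open bin convention are assumptions made for illustration, and `data` is assumed to be a list with at least two values.

```python
import math

def bin_index(x, origin, h):
    """Index k of the bin [origin + k*h, origin + (k+1)*h) containing x."""
    return math.floor((x - origin) / h)

def loo_average_height(data, origin, h):
    """Leave-one-out estimate of the integral of fhat_h(x) * f(x).

    For each point, build the histogram from the other n - 1 points,
    evaluate its height at the held-out point, and average the n heights.
    """
    n = len(data)
    total = 0.0
    for i, x_i in enumerate(data):
        rest = data[:i] + data[i + 1:]          # hold out point i
        k = bin_index(x_i, origin, h)
        count = sum(1 for x in rest if bin_index(x, origin, h) == k)
        total += count / ((n - 1) * h)          # histogram height at x_i
    return total / n
```

For the four points 0.1, 0.2, 0.3, 1.5 with `origin = 0` and `h = 1`, each of the first three points sees two neighbours in its bin and the last sees none, so the average height is (3 × 2/3 + 0) / 4 = 0.5.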

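The cross-validation criterion is given by:

$$\mathrm{CV}(h) = \int \hat{f}_h(x)^2 \, dx \; - \; \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(x_i)$$

where $\hat{f}_h$ is the histogram with bin width h and $\hat{f}_{h,-i}$ is the same histogram constructed with the data point $x_i$ held out.

For a histogram with bin counts $\{c_k\}$, this expression can be computed explicitly as

$$\mathrm{CV}(h) = \frac{2}{(n-1)h} \; - \; \frac{n+1}{n^2 (n-1) h} \sum_k c_k^2$$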
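As a rough illustration, the explicit bin-count form of the criterion, 2/((n-1)h) - (n+1)/(n^2 (n-1) h) times the sum of squared bin counts (Rudemo, 1982; Scott, 1992), can be computed in a few lines of Python. The function name `histogram_cv` and the parameter `origin` (the lower limit of the first bin) are hypothetical, and the half-open bin convention is an assumption; the applet's own code is not shown in this document.

```python
import math
from collections import Counter

def histogram_cv(data, origin, h):
    """Cross-validation score of a histogram from its bin counts.

    Bins are [origin + k*h, origin + (k+1)*h) for every integer k.
    Requires h > 0 and at least two data points.
    """
    n = len(data)
    # Bin counts c_k; empty bins contribute nothing to the sum of squares.
    counts = Counter(math.floor((x - origin) / h) for x in data)
    sum_sq = sum(c * c for c in counts.values())
    return 2.0 / ((n - 1) * h) - (n + 1) * sum_sq / (n * n * (n - 1) * h)
```

For the points 0.1, 0.2, 0.3, 1.5 with `origin = 0` and `h = 1`, the bin counts are 3 and 1, and the function returns approximately -0.375. The same value follows from the integral form: the integral of the squared histogram is 10/16 = 0.625 and the leave-one-out average height is 0.5, giving 0.625 - 2(0.5).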

Observe that the bin counts must be recomputed after any change in the bin width and/or the bin location parameter. In our particular implementation, we have chosen 20 bin shifts and 30 bin widths. The largest bin width is guaranteed to be too large (see the topic of oversmoothing in Scott, 1992). If you look carefully, you will see that the available bin widths are not equally spaced. This is important because it is relative changes (rather than absolute changes) in the bin width that are comparable across the range. Plotted on a logarithmic scale, the available bin widths would be equally spaced.
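A grid of bin widths that is equally spaced on a logarithmic scale can be generated as a geometric sequence, as in the following Python sketch. The function name `log_spaced_widths` and the endpoints 0.1 and 10.0 are hypothetical; the text does not say which smallest and largest widths the applet uses.

```python
def log_spaced_widths(h_min, h_max, count=30):
    """Return `count` bin widths from h_min to h_max, equally spaced in log(h).

    Consecutive widths differ by a constant ratio, so each step is the
    same relative change in bin width. Requires 0 < h_min < h_max.
    """
    ratio = (h_max / h_min) ** (1.0 / (count - 1))
    return [h_min * ratio ** i for i in range(count)]

# Illustrative endpoints only (assumed, not taken from the applet).
widths = log_spaced_widths(0.1, 10.0)
ratios = [b / a for a, b in zip(widths, widths[1:])]
```

The absolute gaps between neighbouring widths grow along the sequence, while every entry of `ratios` is the same up to rounding error.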

Rudemo, M. (1982) Empirical choice of histograms and kernel density estimates. Scandinavian Journal of Statistics, 9, 65-78.

Scott, D. W. (1992) Multivariate Density Estimation, John Wiley & Sons, New York.

The simulations were developed as part of a grant from NSF to David Lane of Rice University. Partial support for this work was provided by the National Science Foundation's Division of Undergraduate Education through grant DUE 9751307. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.