Most people are familiar with Excel Histograms and understand that data is divided into a series of intervals, or bins and then how many values are in each bin are counted. These counts or frequencies are then typically shown on a chart that uses a bar equal to the count for every bin.
The above is the standard Excel histogram produced in Excel 2013. Confusing Excel uses the midpoint of the bin to plot the data, and has this habit of putting the lowest and highest values in bins on their own. Excel 2016 does seem to handle histograms slightly better.
One important point to note is then the number of bins chosen is important. Too few and the distribution is not revealed, too many and the histogram looks flat with too many counts of 1. The following plot gives an example or different bin numbers, using the same data set as used in the Excel example.
Excel has chosen 11 bins but it seems from the third histogram that a few more would help in highlighting the dip in the middle of the distribution.
So typically when producing a histogram it takes a few attempts to refine the best number of bins. There are methods to determine the number of bins which can be useful to get things started. Excel Histograms use a bin count equal to the square root of the number of data points. Excel 2016 uses Scott’s normal reference rule instead.
Histograms are difficult to compare for 2 reasons. The most obvious is the design makes putting more than one on the same axis impractical, bars are not the best graphic to overlap, although with 2 data sets it can be useful. The second is that the frequency is dependent on the data set and the number of bins. This can be significantly different between data sets.
Instead of using the frequency to scale the Y axis the probability density can be used instead. Scaling to the probability density results in the area of the histogram being 1. This can make comparisons between data sets easier as the Y axis is now a similar scale.
This idea of the probability density can be taken a step further with density curves. This is where instead on binning data an algorithm is used to estimate a density curve, which can be thought of a smoothed out histogram.
As with histograms density plots are also subjected to issues similar to the number of bins. Density curves are calculated using a value called the bandwidth. A too small bandwidth and the curve shows all sorts of lumps, too bin and the underlying distribution is smoothed away. So there is always a subjective element in both types of plots.
As with histograms bandwidth calculations are used. By sticking to one, density curves make it much easier to compare different data sets as having many on the chart is easy.
Frequently a normal curve is laid on a histogram to see how normally distributed the data is. A density curve is not the same as showing a normal distribution curve. Density is an estimate of the density of the data and so does not follow the normal shape, unless the data is normally distributed.
After all that discussion here are some details of how those ideas can be incorporated in Sharp-R and made available to any Excel user, so everyone can have a better Excel histograms. I’ve also included these function into another function file, Histograms.XML which is include in the Sharp R download.
Histogram with Normal curve
Plot a histogram with a normal distribution curve, based on code from a Stack Overflow question. Leave the number of bins set to zero for R to choose.
Histogram with density curve
Similar to the above function except this shows a density curve instead.
This function allows histograms for 2 data sets to be shown on the same chart. The R code it is based on was taken from R Bloggers article here. As with the previous function leave the number of bins set to zero for the defaults. Note that sometimes R decides that the number of bins you enter are not suitable and leaves one of the histograms alone. Probably another reason why over lapping histograms isn’t a good idea.
Kernel Density Plot
Allows you to do away with histograms all together and replace then with kernel density curves. Allows multiple curves on one chart. You can select any number of columns of data. The function is based on code from a Stack Overflow question . Leave the ‘Bw Value’ set to zero for the default value, dependent on the ‘Bw Method’.