The post Bar Charts, Error Bars and R appeared first on Sharp Statistics.

I’ve written about the issue with dynamite plots before, so I won’t revisit the details here; I’ll leave that to the interested reader. My issue with the article, as with the others, is that the instructions show how to draw the chart using ggplot2. I can understand that if you are in Excel or some other package with limited chart options you may resort to dynamite plots, but if you are using R and ggplot2 and are still required to show the bars, why not try something different?

So the article used the ‘mtcars’ data set and produced the following chart,

The code to produce the chart can be found from the above link. Mine is slightly different as I’ve used theme_bw().

Now ggplot2 is all about generating a plot in layers, so instead of error bars how about showing the actual data points against the bar of the mean, as shown below?

In this case there are only a few data points, so the chart clearly shows the distribution of each group of data and highlights that some groups have very few data points. Reducing the alpha level of the bars allows all the data points to be shown. The code to produce the above chart is also much shorter than having to pre-calculate means and standard deviations.

First I added an extra column to the ‘mtcars’ data frame,

mtcars$CylGear <- paste(mtcars$cyl, "/", mtcars$gear)

and then to generate the plot,

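library(ggplot2)

p <- ggplot(mtcars) +
  theme_bw() +
  # raw observations for each cylinder/gear group
  geom_point(aes(CylGear, mpg), color = "#FF0000", size = 4) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 3.3.0 and later)
  stat_summary(aes(x = CylGear, y = mpg), fun = mean, geom = "bar",
               fill = "#548DD4", alpha = 0.4)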
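Since the point of plotting the raw data is to expose how few observations sit behind some of the bars, a quick table of group sizes and means can be a useful companion to the chart. The following is a minimal base R sketch; the `cars` copy and the `summary_tab` name are illustrative assumptions and are not part of the original code.

```r
# Work on a copy so the built-in mtcars data is left untouched
cars <- mtcars
cars$CylGear <- paste(cars$cyl, "/", cars$gear)

# Number of cars and mean mpg in each cylinder/gear group
# (table() and tapply() both order the groups by sorted label)
group_n <- table(cars$CylGear)
summary_tab <- data.frame(
  n         = as.vector(group_n),
  mean_mpg  = as.vector(tapply(cars$mpg, cars$CylGear, mean)),
  row.names = names(group_n)
)
print(summary_tab)
```

Groups with only one or two cars stand out immediately in the `n` column, which is exactly the information a bar with an error bar hides.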

If you had a large amount of data then you could add a box plot to each bar.

Personally I would lose the bars altogether and just have the box plot, but if users insist on having the bars then ggplot2 lets you produce a chart that offers a compromise. The code,

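p <- ggplot(mtcars) +
  theme_bw() +
  geom_boxplot(aes(CylGear, mpg), outlier.size = 3, notch = FALSE, notchwidth = 0.5,
               fill = "#FF0000", alpha = 0.8) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 3.3.0 and later)
  stat_summary(aes(x = CylGear, y = mpg), fun = mean, geom = "bar",
               fill = "#548DD4", alpha = 0.4)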
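For comparison, here is a rough sketch of the version without bars: a box plot with the raw points jittered on top. The object names (`cars`, `p2`), the jitter width and the colours are my own assumptions for illustration, not code generated by the interface mentioned in this post.

```r
library(ggplot2)

# Copy of mtcars with the combined cylinder/gear label
cars <- transform(mtcars, CylGear = paste(cyl, "/", gear))

p2 <- ggplot(cars, aes(CylGear, mpg)) +
  theme_bw() +
  # outlier.shape = NA stops outliers being drawn twice (box plot and points)
  geom_boxplot(fill = "#548DD4", alpha = 0.4, outlier.shape = NA) +
  # small horizontal jitter only, so the mpg values are not altered
  geom_jitter(width = 0.1, height = 0, color = "#FF0000", size = 2)
print(p2)
```

Groups with a single observation collapse to a line, which is itself a useful warning about how little data is behind them.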

There are other options for displaying data distributions, and the flexibility of R and ggplot2 means you don’t have to stick with how things have always been done.

Note that my ggplot2 statements may be a bit long-winded and in an odd order, as they have been generated automatically from a user interface that I’m currently developing, but more on that another time.

The post Excel and R Data appeared first on Sharp Statistics.

Getting data from Excel into R is typically accomplished with one of the many R data import packages. These work well for repetitive data import and export but are often cumbersome for a quick look at some Excel data in R or to get a couple of R variables into Excel.

I’ve come at the problem from the other side by building an Excel Add In that gives access to an R workspace directly from the worksheet. I’m frequently given data in Excel files, and the nature of the data often means I just want to export a small amount into R for a quick first-look analysis.

The Add In works with Excel 2010 and 2013 and allows data to be selected directly from the worksheet as you would for any other Excel operation and sent to R where it can be saved as a workspace. You can also load an existing workspace and then have the data required copied straight to the worksheet. All this can be done without the need to type any R code.

There is an overhead in converting data between the Excel and R formats, so the Add In may not be suitable for importing and exporting large data sets, but variables with a million elements can be transferred in a second or two.

You can read more and find the download link here.

The post Excel R Add In appeared first on Sharp Statistics.

The huge range of functions available from R means making a user interface can be difficult, and so there are several approaches to building R functions into Excel. The easiest approach is to hard code an interface into the Excel ribbon, giving the user control over the required functions. As an example, here is an experimental interface to perform Bayesian linear regression in an Excel worksheet.

This has been achieved by using the rjags package to link Excel to the JAGS system. The interface gives the user the ability to control the 3 priors and gives some control over the MCMC component. For the sake of this demonstration the R script for this example has been taken from the rather good book ‘Doing Bayesian Data Analysis’ by J.K. Kruschke, so it perhaps does not show all the output required for a proper tool, but it proves that complicated R tools can be run from Excel.

A fixed interface that is built around the required functions does work well, but it also means that the huge flexibility of the R system is locked away, and any change to the function controls means building a new version of the Excel R Add In.

Sharp Statistics have recently built a dynamic interface to an Excel R Add In that allows user defined R scripts to be run through Excel and obtain the results in a worksheet. This is achieved by wrapping the R script into a file that contains additional commands that tell the Excel R Add In which data frames and charts to import into Excel after the script is run. The user selects the required data as they would for any other Excel operation and then the Excel R Add In sends the data to R and runs the script.

The implementation also has additional commands that allow extra descriptive text to be displayed and that specify whether the output should be cut short or highlighted if a condition is not met, so users can be warned that there is not enough data or that the result of a test is outside a specified limit. This system allows non-R users to run an R analysis without ever having used R or having any knowledge of it, and saves the time and effort of coding a fixed interface. The biggest benefit is that the user can now modify the analysis if needed and build completely new ones without any need to change the Add In.

Sharp Statistics have experience of using R and building software tools that use R for computation, so if this type of system could be useful for your company or you have an idea for an R Excel Add In, please get in contact to see if we can help. We also offer a free Add In that allows you to import and export data between an Excel worksheet and an R workspace.

The post Singular Spectrum Analysis in Excel appeared first on Sharp Statistics.

Singular Spectrum Analysis (SSA) is a technique for analysing time series. The method is relatively simple to implement and relies on applying some linear algebra. There is no requirement to do any pre-processing before applying SSA, so practical implementation is straightforward.

SSA can be used for extracting the underlying trend of a time series or depending on the requirement the seasonality or in fact any oscillation that may be present in the data. SSA is not model based and the resulting extracted series is obtained from the data itself. The method can also be used to look for points of change as well as for forecasting.

The SSA algorithm involves using the whole of the selected time series to construct a matrix. The construction of this trajectory matrix requires selecting a window length to split the data into overlapping sections. A matrix decomposition called Singular Value Decomposition (SVD) is then used to extract the eigenvectors of the trajectory matrix, as would be done in Principal Component Analysis. A subset of the resulting components is then selected, depending on which oscillations need to be extracted, and this subset is diagonally averaged to reconstruct the required signal.
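The steps above can be sketched in a few lines of base R. This is a minimal illustration under my own assumptions (the function name `ssa_reconstruct`, its arguments and the simulated series are all hypothetical), not the code behind the Excel task pane.

```r
# Basic SSA: embed, decompose, select components, diagonally average.
ssa_reconstruct <- function(x, window, components) {
  n <- length(x)
  k <- n - window + 1                       # number of overlapping sections
  # Trajectory matrix: column j holds x[j], ..., x[j + window - 1]
  traj <- sapply(seq_len(k), function(j) x[j:(j + window - 1)])
  s <- svd(traj)
  # Rebuild the matrix from the selected components only
  u <- s$u[, components, drop = FALSE]
  v <- s$v[, components, drop = FALSE]
  d <- s$d[components]
  approx <- u %*% diag(d, nrow = length(d)) %*% t(v)
  # Diagonal averaging: entries with the same row + column index
  # all correspond to the same time point, so average them
  idx <- as.vector(row(approx) + col(approx) - 1)
  as.numeric(tapply(as.vector(approx), idx, mean))
}

# Simulated series (made up for illustration): trend + cycle + noise
set.seed(1)
t_idx <- 1:200
series <- 0.01 * t_idx + sin(2 * pi * t_idx / 25) + rnorm(200, sd = 0.3)

smooth <- ssa_reconstruct(series, window = 40, components = 1:3)
full   <- ssa_reconstruct(series, window = 40, components = 1:40)
```

Keeping only the first few components gives a smoothed series, while keeping all 40 components with a window of 40 reproduces the original data, which is a handy sanity check on the implementation.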

To illustrate how SSA works, the animated chart below shows screen shots from a custom task pane built to calculate Singular Spectrum Analysis in Excel. As the number of retained components is increased, the SSA progressively captures more and more of the detail of the trend. As the number of components collected approaches the size of the window, the SSA extracted trend (green) gets closer and closer to the original (blue). This example just uses consecutive components, but it is possible to select any number of components so the required part of the trend can be extracted and examined.

SSA can also be used for forecasting future values, and the picture below shows an example using the same 1 minute data of the Euro/US Dollar exchange rate as used above. The data used for the SSA and forecast was a subset of the original series, and the Excel chart on the left shows the last 10 points of the SSA reconstruction and 10 points of the forecast against the original data. The vertical line indicates the change from signal to forecast.

There is much more to SSA than the brief example shown here as there are various measures that can be captured to determine the best components to use to extract and forecast a trend. The example also demonstrates how complex analysis tools can be easily built into Excel with added GUI controls in a custom task pane, enabling easy importing and further investigation of any results generated.

The post Linear Regression and Matrix Operations in Excel appeared first on Sharp Statistics.

Linear regression is about trying to find linear combinations of variables (predictors) that can be used to model a response variable. So given a response y and a predictor x, the model takes the form of,

$$y = \alpha + \beta x$$

Strictly, the equations in these notes should all contain an error term to denote the differences between the model predictions and the actual values. I’ve omitted it as the error term is not used to calculate the model.

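Basic statistical texts detail that the way to find the regression coefficients alpha and beta is with the following two formulas,

$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \alpha = \bar{y} - \beta \bar{x}$$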

In practice the above formulas aren’t used, as they can’t be extended to more predictor variables. More advanced books detail how regression is calculated using matrices, but the method is often buried in complex linear algebra. Excel allows calculations with matrices and in doing so gives a nice simple way of illustrating how linear regression is actually calculated in software.

As an example here is how to fit a quadratic function,

$$y = \alpha + \beta x + \gamma x^2$$

using matrices and a very simple data set with just 4 observations, x = 1, 2, 3, 4 and y = 0.45, 0.72, 0.83, 0.95. Note that even though the equation is a quadratic it is still classed as linear regression, as the model relates y to the three coefficients in a linear way, that is, by simple addition.

The first step is to form the X matrix from the predictor variables. The X matrix contains 1 in the first column as this determines the intercept. Each subsequent column contains the x variable corresponding to the required formula of the regression model that is being fitted. In this case the second column contains x and the third column contains x squared, matching the formula above. So the X matrix becomes,

$$X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}$$

The Y matrix is simply the values given for y so,

$$Y = \begin{bmatrix} 0.45 \\ 0.72 \\ 0.83 \\ 0.95 \end{bmatrix}$$

So now the regression coefficients can be calculated from,

$$\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = (X^T X)^{-1} X^T Y$$

where X and Y are the matrices just described, the superscript T indicates the matrix transpose and -1 indicates an inverse matrix. The result of plugging the data into this formula is a column matrix with the same number of rows as the coefficients. The first value is alpha, second beta with gamma last.
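For readers who also use R, the same matrix calculation can be written out directly and checked against `lm()`. This is a small sketch using the four data points above; the variable names are my own choices.

```r
# Data from the example above
x <- c(1, 2, 3, 4)
y <- c(0.45, 0.72, 0.83, 0.95)

# X matrix: intercept column, x, and x squared; Y as a column matrix
X <- cbind(1, x, x^2)
Y <- matrix(y, ncol = 1)

# Coefficients from (X'X)^-1 X'Y, in the order alpha, beta, gamma
coefs <- solve(t(X) %*% X) %*% t(X) %*% Y
print(coefs)

# The built-in regression function should give the same three values
fit <- lm(y ~ x + I(x^2))
print(coef(fit))
```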

One point to note is that performing matrix inversion in software is a topic in itself and I don’t know the method used by Excel, but generally when solving regression problems something called a QR decomposition is used, as this is less prone to numerical errors than direct inversion. Typically direct inversion methods don’t cause problems for small simple models but can have an impact when trying to fit high order polynomial functions.

Performing the same calculation in Excel is straightforward. First build the X and Y matrices,

Calculating with matrices in Excel can be done manually by writing a formula for each matrix element into its own cell, but this is a bit tedious. Instead, use can be made of Excel’s array formulas. To calculate the transpose of the X matrix, select an empty range with 3 rows and 4 columns and then in the formula bar enter the formula =TRANSPOSE(array) where array is the range of the X matrix. In the example shown above =TRANSPOSE(B8:D11) is used. Now while the cursor is still in the formula bar press Ctrl+Shift+Enter and the formula should get wrapped in curly brackets and the range selected for the result will be filled in, as shown in the above picture.
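As a rough illustration of the QR route in R (I am not claiming this is what Excel does internally), `qr.solve()` returns the least squares coefficients without ever forming the inverse of X'X. The comparison of condition numbers is my own addition to hint at why the direct method is more fragile.

```r
x <- c(1, 2, 3, 4)
y <- c(0.45, 0.72, 0.83, 0.95)
X <- cbind(1, x, x^2)

# Least squares coefficients via a QR decomposition of X
coef_qr <- qr.solve(X, y)

# The same coefficients from the normal equations, for comparison
XtX <- t(X) %*% X
coef_direct <- solve(XtX, t(X) %*% y)

# Forming X'X squares the condition number, so rounding errors are
# amplified more when the normal equations are solved directly
cond_X   <- kappa(X, exact = TRUE)
cond_XtX <- kappa(XtX, exact = TRUE)
print(c(cond_X = cond_X, cond_XtX = cond_XtX))
```

For this tiny model both methods agree, but the gap between the two condition numbers grows quickly as higher powers of x are added.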

Once all the other parts of the calculation are filled in the 3 regression coefficients returned are the same as those generated when a quadratic curve is fitted in a scatter chart of the data as shown below.

Using matrices allows the calculation of very complex regression equations, but it also has the advantage of making some more advanced diagnostic data easy to calculate. One example is the hat matrix,

$$H = X (X^T X)^{-1} X^T$$

which has the interesting property that its diagonal elements are the leverages, which show how much influence each point has on the regression. Generally the leverages are used in calculations to generate studentized residuals and Cook’s distance, which can be used to see if any data point is having too much influence and distorting the model.
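The hat matrix for the quadratic example can be sketched in R as follows, with the leverages checked against the built-in `hatvalues()` function. Variable names are illustrative only.

```r
x <- c(1, 2, 3, 4)
y <- c(0.45, 0.72, 0.83, 0.95)
X <- cbind(1, x, x^2)

# Hat matrix: maps the observed y values onto the fitted values
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Leverages are the diagonal elements; they sum to the number of coefficients
leverage <- diag(H)
print(leverage)

# Cross-check against R's regression diagnostics
fit <- lm(y ~ x + I(x^2))
print(hatvalues(fit))
print(cooks.distance(fit))
```

With only 4 points and 3 coefficients every leverage is high here, so the example shows the mechanics rather than a realistic diagnostic.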

I’m not advocating manually performing regression in Excel as shown above, but if you ever need to look at leverages or something more complex then it is possible in Excel. For those who like to know how things work, the above also demonstrates how regression is actually performed in many software packages.

The post SharpER, Connecting R to Excel appeared first on Sharp Statistics.

As well as being a useful tool in itself, SharpER also illustrates that an R connection to Excel could be very powerful, as it opens up the opportunity for carrying out a whole range of sophisticated statistical routines from within Excel. Repeated analyses could have custom ribbon and task pane controls, which would allow non-R users to access the extensive methods available without needing to go through the steep learning curve R presents.

SharpER allows you to run R code directly from a worksheet and quickly import any R variables into Excel, as well as sending Excel data to R as a data frame. R workspaces can also be saved and loaded from the Add In. SharpER does not handle R graphics, but any graphical output is displayed by the standard R graphics window and can be easily pasted into Excel as an image. It is a basic tool, but for quickly getting the feel of an Excel data set it can be useful.

If you want to use SharpER you can find more details and the download link on the SharpER page. If you do download it please let us know how you get on, either with a comment on this post or using the contact form.

Update:

SharpER is now obsolete and we offer another free Add In, SharpER Data, that allows data to be exchanged between an R workspace and Excel.

The post Sharp Statistics helps Aquafuel appeared first on Sharp Statistics.

Aquafuel Research Limited have developed technology that makes it possible to run standard diesel powered combined heat and power (CHP) plants on renewable biofuels. To be able to analyse the performance of installed CHP units and to assess the requirements to replace existing conventionally fuelled installations, a certain amount of data analysis is required.

Excel files containing over a year of data are received, but the values are stored in a matrix, making analysis difficult. As it is possible to have a data point every half an hour, each file can have over 17,000 data points, and to perform any meaningful analysis, data from several files needs to be combined.

To enable Aquafuel to quickly manage these data files Sharp Statistics have built an Excel Add In that makes handling the data quick and efficient. Several data files can now be selected and the data imported and formatted onto a single worksheet with automatic adjustment to make sure the time stamps from each file line up correctly.

Once the data is formatted the Add In allows the user to adjust the parameters of the required calculation, allowing different situations to be quickly assessed. As the Add In can treat the data in memory rather than having to loop through each Excel worksheet cell, calculations can be performed very quickly.

Viewing thousands of data points in a standard Excel plot is difficult as the trend lines just look like a coloured mass. As the Add In has been built using VSTO instead of the more common VBA approach, it has been easy to add an interactive time series chart for Excel that allows the user to zoom in and scroll through the whole period, allowing periods of interest to be examined. Paul Day, Aquafuel’s CEO, says

Sharp Statistics have allowed us to analyse our data in an efficient and timely manner that was previously impossible.

If you need some help to make your data analysis simpler and quicker contact us for a free no obligation consultation. To see a sample of our work, download our free demonstration Add In that seamlessly adds some simple statistical plots into Excel along with an interactive linear regression plot.

The post Data Analysis in Excel appeared first on Sharp Statistics.

Typically when using a statistical software package, data is loaded in as a file and the software then extracts the data to produce the analysis and charts required. The data file is not manipulated and remains separate from the analysis and the report. Excel, however, allows users to mix the data with calculations and the presentation of the results.

Opening a new workbook presents the user with an empty sheet with no guide on where to put data. Some types of data are met so frequently in certain industries that they have a standard layout, for example in accountancy. With engineering and scientific data the layout seems to come down to the personal preference of the user. This freedom means it becomes very easy to mix up the data with the mechanics of the analysis, which can make it difficult for others to analyse the data, to update the analysis when more data is obtained, or to apply another analysis method when required.

When putting data into a new worksheet keep it as simple as possible. Aim for a rectangular matrix of values, and if this means repeating values that is fine as they can often be used later for filtering. Don’t think about how you are going to present the data; that can be done on another sheet. This type of layout will also make it straightforward to save the data as a CSV file to be imported into other packages if needed. Think of this sheet as your raw data file and add further sheets to perform the analysis and to present results.

Use the top row to place meaningful labels for each column of data, and if you know the data may be used by others using different software it is a good idea to make sure the labels do not contain spaces. Use increase and decrease decimal to make sure all the values in a column are displayed the same way, and align text right so that all the values are lined up with the decimal place in the same position. This makes spotting errors and data entry mistakes easier.

Here is an example based on the Iris data set, and although the situation is very simple and made up it is based on experience. The user has laid out the results by separating the 3 species to perform the required calculations. So the layout of the data is a direct result of how the user wants to perform the analysis; no thought has gone into how others might want to use the values or how new data might be added.

Now look at a simpler layout devoid of formatting and presentation. As the species is repeated it makes it easy to filter the data to get subsets, use lookup tables and use pivot tables if needed. If more data is collected it can easily be added as extra rows. This layout also makes adding more variables easy. If more data is added, an extra column can be used to denote the sample set.
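The same benefits show up as soon as a sheet laid out this way is exported to other software. As a small sketch in R, using the built-in `iris` data as a stand-in for the worksheet (the `flowers` copy, the `SampleSet` column and the temporary file are assumptions for illustration):

```r
# One row per flower, with the species repeated on every row
flowers <- iris

# Filtering to a subset only needs the repeated Species column
setosa <- subset(flowers, Species == "setosa")

# A pivot-table style summary: mean of each measurement by species
species_means <- aggregate(. ~ Species, data = flowers, FUN = mean)
print(species_means)

# New variables or a sample-set marker are just extra columns
flowers$SampleSet <- 1

# The rectangular layout round-trips through CSV without any reshaping
csv_file <- tempfile(fileext = ".csv")
write.csv(flowers, csv_file, row.names = FALSE)
reloaded <- read.csv(csv_file)
```

None of these steps depends on how the results will eventually be presented, which is the point of keeping the raw data sheet separate.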

In this example the data set is small so time saved is negligible but as the amount of data increases having a simple standard layout independent of the analysis can save time and effort when data needs to be investigated with different methods, different data sets need to be combined or data exported to other software packages.

The post Excel Chart Demo appeared first on Sharp Statistics.

It can produce 3 simple statistical charts (box, kernel density and normal quantile plots) that are automated versions of Excel charts, along with two types of scatter plot.

Click here for more information and a download link.

The post Wily data analysis appeared first on Sharp Statistics.

If you rely on this simple method to compare data it will inevitably backfire, as was always the case with Wile E. Coyote and his plans.

The above plot shows an example dynamite plot where the error bars overlap. If the data was a comparison of, say, Method A against Method B, the researcher might decide that Method B does not give significantly higher results. However, if a t-test is used on the data it suggests there is a difference in the means.
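This situation is easy to reproduce in R. The sketch below uses made-up, evenly spread normal values (not the data behind the plot above) for two hypothetical methods with 100 points each, and shows mean plus or minus one standard deviation bars overlapping even though a t-test finds a difference.

```r
# Hypothetical data: 100 evenly spread normal quantiles per method
method_a <- qnorm(ppoints(100), mean = 10, sd = 2)
method_b <- qnorm(ppoints(100), mean = 11, sd = 2)

# Ends of the "dynamite plot" error bars (mean +/- 1 standard deviation)
upper_a <- mean(method_a) + sd(method_a)
lower_b <- mean(method_b) - sd(method_b)
bars_overlap <- upper_a > lower_b
print(bars_overlap)

# Welch two-sample t-test on the same data
test <- t.test(method_a, method_b)
print(test$p.value)
```

The standard deviation describes the spread of individual values, while the t-test works with the much smaller uncertainty in the means, which is why overlapping bars say very little about whether the means differ.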

Why does the dynamite plot not highlight the difference?

There are several reasons.

- The plot hides all the data.
- The plot assumes the data is symmetrically distributed.
- The mean and standard deviation are calculated using all the data so any extremely high or low values influence the result.

A better way of visually comparing data is using a dot or a box plot, as these plots give a better idea of the distribution of the data set. A dot plot (sometimes called a strip plot) is simply a plot of each value of each variable as a dot, and works well with only a few data points. Box plots are more complicated and are more useful when the data sets are large. Both plots are shown below using the same data as the first figure.

The dot plot simply shows all the data points and hides nothing, so extreme values can be seen and the full spread of the data is exposed, and in this case with 100 points the density of data points can be seen.

Box plots are based on percentile values rather than statistics like the mean, which are based on all the values. There are various different ways of drawing them but in general the horizontal bar in the middle of the box is the median (50th percentile). The box represents the inter-quartile range, which indicates where the middle 50% of the data lies. The whiskers then show the tails of the distribution and points considered extreme are plotted as dots.

It can be seen from the dot and box plots, when compared to the dynamite plot, that they give a much better indication of the spread of the data and, like the t-test, suggest that there is indeed a difference between the means.

A vital consequence of using a dot or box plot is that any extreme values, which remain hidden in the dynamite plot, are highlighted instantly. The next 2 figures illustrate this with an outlier in the group B data being completely obscured in the dynamite plot but exposed by the box plot, leading people who looked at only one of these plots to very different conclusions.
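The effect of a single extreme value on the numbers behind each plot can be sketched in R. The values below are made up for illustration (20 well-behaved points plus one outlier of 25), and `boxplot.stats()` is used to show what a box plot would flag.

```r
# Hypothetical group B: 20 evenly spread values around 10, plus one outlier
group_b <- c(qnorm(ppoints(20), mean = 10, sd = 1), 25)

# What a dynamite plot summarises: both are dragged upwards by the outlier
print(c(mean = mean(group_b), sd = sd(group_b)))

# What a box plot summarises: percentile based, so barely affected
print(c(median = median(group_b), iqr = IQR(group_b)))

# Points beyond 1.5 x IQR from the box are drawn individually as dots
flagged <- boxplot.stats(group_b)$out
print(flagged)
```

In the dynamite plot the outlier only shows up as a slightly taller bar and a longer error bar, whereas the box plot draws it as a separate point.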

I think the main reason dynamite plots are used is due to the wide use of spreadsheets for data analysis which don’t tend to have statistical plotting capabilities. You can draw a box plot in Excel using a stacked column plot and some clever formatting, details can be found from Google, and dot plots are simple to do.

Using dot and box plots will make sure that your analysis doesn’t result in your own Wile E. Coyote moment. If you are working in Excel, dot plots are simple to implement but box plots are an issue; check out our Excel chart demo that has an option for box plots, or alternatively contact us to see how we can build R plots the same as those used in this article into an R Excel Add In for your data.
