Explaining Statistical Methods – Bootstrapping

I’m trying to write a series of posts explaining statistical methods in layman’s terms.  Bootstrapping is one of my favorite techniques, and it isn’t necessarily common parlance among non-statisticians.

Most people are probably familiar with the word bootstrap, as in pulling yourself up by your bootstraps, which basically means using what you’ve got to make your way.  That’s what statistical bootstrapping is.  When we’re not confident about the shape of our distribution, or we don’t have a formula for a parameter’s confidence interval, we take what we have, a sample, and treat it as our best estimate of the population.  Then we resample the original sample with replacement, many times over, to build up a distribution of the estimate that can be used to construct a confidence interval.

There are 4 commonly used bootstrapping methods for estimating confidence intervals.

  1. Standard – uses the bootstrap estimates’ mean squared error to build an interval around the original estimate.  A good option when no formula for the estimator’s variance is available.
  2. Percentile – the simplest method.  Estimate the parameter from each resample, then take the middle range of those values (for a 95% interval, take the middle 95% of the estimates).  Can be biased and skewed.
  3. T-pivot – requires a pivot quantity relating the values.  For example, the t distribution does not depend on the mean or variance, so t can be used as a pivot to estimate either.  Gives the best intervals when a pivot quantity exists.
  4. BCa – corrects the percentile method for bias and skew.

Bootstrapping is useful because, unlike many common statistical tests, it doesn’t assume anything about the underlying distribution, yet it still gives you useful estimates of the confidence intervals.
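
To make the percentile method concrete, here is a minimal sketch in R.  The data and the number of resamples are made up purely for illustration; for real work, the boot package’s boot() and boot.ci() functions will also handle the BCa correction for you.

# Percentile bootstrap CI for a mean, using made-up data
set.seed(42)
x <- rnorm(30, mean = 10, sd = 3)      # pretend this is the only sample we have
B <- 10000                             # number of bootstrap resamples
boot.means <- replicate(B, mean(sample(x, replace = TRUE)))
# The middle 95% of the resampled means is the percentile interval
quantile(boot.means, c(0.025, 0.975))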

This Is Why I’m an R User

“While Excel (ahem) excels at things like arithmetic and tabulations (and some complex things too, like presentation), R’s programmatic focus introduces concepts like data structures, iteration, and functions. Once you’ve made the investment in learning R, these abstractions make reducing complex tasks into discrete steps possible, and automating repeated similar tasks much easier. For the full, compelling argument, follow the link to the Yhat blog, below.”

https://www.r-bloggers.com/the-difference-between-r-and-excel/

Friday ShoutOut

Every data science post I’ve been looking at references Python, so I started poking around for somewhere to learn it.  Enter Code Academy.


I found a simple educational module on Code Academy that has me off and running with picking up Python.  It looks like they have about 25 languages and tools that you can pick up.

I’m a big fan.  Check it out.

Explaining Statistical Tests – Wilcoxon Rank Sum (and Kruskal Wallis)

So, mostly on here, I’ve talked about parametric tests, or tests based on the normal distribution, but now we’re going to venture into a simple nonparametric test.  Much like the t test and its cousin the F test, the Wilcoxon rank sum (WRS) test and the Kruskal-Wallis (KW) test are tests of difference; WRS compares two groups, and KW extends the idea to more than two.  They are both rank tests, so the first step is to pool all of the values and rank them while keeping track of which group each came from.  Ties are given the average of the ranks they span (i.e., if three values tie for ranks 3, 4, and 5, each is assigned rank 4).  Then you sum the ranks within each group; the difference between those rank sums (equivalently, the rank sum of one group, since the total is fixed) is the test statistic.

To find a p value, assume that under the null hypothesis each value is equally likely to have landed in either group, so you work through all the possible ways of splitting the values between the groups and compute the statistic for each split.  Your p value is how extreme your original value is: if there are 32 possible splits and your original rank sum was higher than all but 3 of them, your p value is 3/32, or about .094.

These tests are most useful when it’s hard to make many assumptions about the distribution; they only assume that the values were randomly assigned to the groups.
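
Here’s a toy illustration in R of the procedure described above, with made-up numbers; in practice the built-in wilcox.test() and kruskal.test() functions do the real work for you.

# Toy version of the rank-sum procedure, with made-up data
a <- c(1.2, 3.4, 2.2)                  # group A
b <- c(5.1, 4.8, 2.9)                  # group B
r <- rank(c(a, b))                     # pooled ranks; ties get averaged automatically
obs <- sum(r[seq_along(a)])            # observed rank sum for group A
# Enumerate every possible split of the ranks between the two groups
combos <- combn(length(r), length(a))
sums <- apply(combos, 2, function(idx) sum(r[idx]))
mean(sums <= obs)                      # one-sided p value: how extreme is the observed sum?
# The built-in tests give comparable answers
wilcox.test(a, b)
kruskal.test(list(a, b))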

Data Cleaning in R

So I ran across a bit of a strange problem this week, and I thought I’d share my code in case others needed it.

I was trying to download a data set that has some 16-digit ID numbers as one of the variables and transfer the data to another system.  The “easiest” way would be to just open the files and clean them up in Excel.  However, Excel only keeps 15 significant digits for numbers, so it turns the last digit of each ID into a zero, which renders these IDs effectively useless.

So, as a workaround, I wrote a little R script to clean up those files.

Now, it’s not a perfect script, and it does require checking the numbers in your files, but it does what I want it to: take the files from csv to txt, leaving the name of the activity and the date on the first line and the ID numbers on the following lines.

If you want to use it, submit $25 to my Paypal.  Just kidding.  Here you go.

# Loop over the downloaded attendance files (here, files 9 and 10)
for (i in 9:10) {
  filemade <- paste("AttendanceByEvent ", "(", i, ")", ".csv", sep = "")
  # Keep only the ID column, read as character so the 16-digit IDs aren't mangled
  event <- read.csv(file = filemade, skip = 6,
                    colClasses = c("NULL", "character", rep("NULL", 11)), skipNul = TRUE)
  # The first few rows of the file hold the activity name and the date
  details <- read.csv(file = filemade, nrows = 4)
  name <- paste(strtrim(details[1, 1], 25), details[4, 1])
  file.name <- paste("event", i, ".txt", sep = "")
  names(event) <- name
  # Plain text output: activity name and date on the first line, IDs below
  write.table(event, file = file.name, row.names = FALSE, quote = FALSE)
}
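
The key trick is the colClasses argument: reading only the ID column, and reading it as character, means the 16-digit IDs are never treated as numbers, so nothing gets rounded along the way.  The file names, the skip value, and the number of columns are specific to my files, so adjust those for yours.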