Basketball Predictions – How’d It Go?

Here’s the post on methodology I used for assessing tournament teams this year.

How’d it do?

First round: pretty good!  27/32 (84%).  I had a lot of these games as outright winners and very few as coin flips.  My approach for picking coin-flip games was pretty basic: if a team’s upper bound was over their opponent’s mean, I picked the game by flipping a coin.

Second round: 9/16 (56%).  Yeesh.  I got more and more coin-flip games as the tournament went on, but even some games I had as outright winners didn’t go the way I projected.  It’s logical if you think about it, though.

Take, for instance, Duke and FSU.  Both teams had been erratic all season, laying occasional stinkers.

FSU got absolutely stomped by Xavier, who posted a .657 true shooting percentage for the game.  For context, that’s .04 better than the season mark of the best shooting team in the country (UCLA).  Scorching.  FSU, meanwhile, laid a dud (.433 TS).

Duke’s Achilles heel all season has been Grayson Allen, though he is occasionally brilliant.  The problem with Grayson is that he’s trick or treat: he’s either executing and hitting shots, or he’s dribbling the ball off his feet and throwing it into the stands.  Duke had 18 turnovers last night (nearly 1 every 2 minutes, or roughly 1 every 3 possessions), which led to South Carolina taking 10 more shots than them over the course of the game.

This leads me to the biggest limitation in my model: I don’t think I included enough variance.  The reality is that these teams aren’t all that different, outside of a few really good ones and a few really bad ones.  There are a lot of teams in the middle, and when you combine a fairly short game with a one-and-done format, you’re going to get some bizarre results.

Let’s take the 2016 NBA playoffs as an example.  There were 15 series, 2 of which were sweeps.  Of the other 13, 3 were 4-1.  That means 10 of the 15 series (two-thirds) went at least six games, finishing 4-2 or 4-3!

Let’s look at just the Cavs series with Golden State.  The series went 4-3, and the margins were 15, 33, 30, 11, 15, 14, and 4 points.  The average margin was 17 points!  That’s why it gets complicated when you say that one team beating another badly on a given day means they’re better.  Particularly in small samples, there’s just a lot of variance in a game that can be decided by a couple of errant passes or bad bounces, and playing these games with college freshmen who have been playing together for about 15 minutes can only increase the variance.
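To put a rough number on that intuition, here’s a toy simulation in R.  The parameters are my assumption, not anything fit to real data: game margins are normal with a 2-point true edge for the better team and an 11-point standard deviation.

set.seed(1)
# chance the better team wins a single game
one_game <- mean(rnorm(1e5, mean = 2, sd = 11) > 0)
# chance the better team wins a best-of-7 (wins at least 4 of 7)
best_of_7 <- mean(replicate(1e4, sum(rnorm(7, mean = 2, sd = 11) > 0) >= 4))
c(one_game = one_game, best_of_7 = best_of_7)

Even a genuinely better team only wins the single game a bit more than half the time; the longer series pulls the outcome back toward the true ordering.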

In conclusion, my pick for champion, Duke, lost in the second round, but I can take some solace in the fact that Vegas picked the same champ.  Even Vegas doesn’t know!  It’s just a coin flip when it comes down to it.  The house (and the field) always wins.


NCAA Tournament Data

For your perusal:

tournament projections

Methodology:

Last year I did my picks with KenPom (and did well!); the year before I did them with RPI, and it didn’t go so well.  This year I’m using a different method and adding my own special sauce.

  1. I aggregated the RPI, KenPom, Sagarin, and ESPN’s BPI, then took a simple average.
  2. With that simple average, I rescaled the rankings using Sagarin’s ratings to get a sensible scale (averaged 1-100 ordinal rankings don’t capture how far apart teams actually are).
  3. Next, I took Sagarin’s conference ratings and scaled the ratings based on conference difficulty.  My going theory is that playing tougher teams makes you better, and vice versa.
  4. Next, I took true shooting percentage and scaled teams up or down.  True shooting is heavily affected by made 3-pointers, so I wanted to give a boost to teams that make threes.
  5. Lastly, I added some variance.  In a 7-game series, variance is greatly reduced and the best team tends to win.  In college, with a single-game format, variance plays a bigger role.  I’m still refining how to handle this, but my rule is: if a lower-rated team’s upper bound is greater than a higher-rated team’s mean, I treat the game as a coin toss.  My variance is 2.5 points (see the sketch after this list).
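Here’s a minimal sketch of that pick rule in R.  The team names and ratings below are illustrative, not output from my actual model:

variance <- 2.5
pick_game <- function(team_a, team_b, rating_a, rating_b) {
  # coin toss if the underdog's upside (rating + variance) clears the favorite's mean
  if (min(rating_a, rating_b) + variance > max(rating_a, rating_b)) {
    return(sample(c(team_a, team_b), 1))
  }
  if (rating_a >= rating_b) team_a else team_b
}

pick_game("Villanova", "Duke", 94.4, 92.9)      # bands overlap -> coin flip
pick_game("Villanova", "Winthrop", 94.4, 80.1)  # clear favorite -> "Villanova"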

Adjusted College Basketball Top 100

The methodology is the same as described above (aggregate the rating systems, rescale with Sagarin, adjust for conference strength and true shooting, then add variance), with one difference: for this version the variance is 3 points.  I need to play with this range a bit.  The top of the resulting list:
School           Conference Scalar   Adjusted Aggregate   TS%     TS Deviance   Total Aggregate   Low Range   High Range
Villanova        0.991439            93.80008             0.616    0.5679       94.36798          91.36798    97.36798
North Carolina   0.996344            94.07667             0.556   -0.0321       94.04457          91.04457    97.04457
Kansas           1                   93.48151             0.58     0.2079       93.68941          90.68941    96.68941
West Virginia    1                   93.29343             0.548   -0.1121       93.18133          90.18133    96.18133
Louisville       0.996344            93.3271              0.543   -0.1621       93.165            90.165      96.165
Duke             0.996344            92.57753             0.587    0.2779       92.85543          89.85543    95.85543
Virginia         0.996344            92.76492             0.553   -0.0621       92.70282          89.70282    95.70282
Florida          0.985055            92.64026             0.555   -0.0421       92.59816          89.59816    95.59816
Baylor           1                   92.54111             0.563    0.0379       92.57901          89.57901    95.57901
Kentucky         0.985055            92.45499             0.567    0.0779       92.53289          89.53289    95.53289

NCAA Men’s Basketball Data

We have almost a full season of men’s basketball and are well into tournament time.  I wanted to play around a bit with the data and see if anything interesting was there.  I used the advanced statistics from Sports Reference’s college basketball site: http://www.sports-reference.com/cbb/seasons/2017-advanced-school-stats.html.

I wanted to look at what correlates highly with win percentage.  To be even the least bit intellectually honest, you have to admit that you can’t hold any of these measures independent from each other (even though I’m looking at them separately), since there’s all kinds of correlation among them.  For instance, true shooting percentage and assist rate are correlated, because teams only rack up assists when people are hitting shots.  This is just exploratory playing around on an internet site, so I think we can violate a few rules of reality in the name of a little fun.  It’s sports!

First, I looked at pace, which is a measure of possessions per 40 minutes.

Oof.  Rough start.  A slight negative correlation and a fair amount of variance.  Interestingly, some of the outliers that play slower are better.  Anyway, not a lot here.

Assist percentage is a measure of how many made field goals were assisted by a teammate.  An assist is a pass that leads directly to a score (not “I pass it to you and you dribble around for 15 seconds”).

This one’s a bit of a mess.  There is a positive correlation but there’s a ton of variance.  This might be one to look at in a bit more detail.

Three-point attempt rate measures the percentage of field goal attempts that were 3-pointers.  If you’ve watched the college game lately, there is an absolute plague of 3-point chuckers who are not that good but take a lot of threes since the line is so close.

Yeah, that’s about what I guessed.  A decent amount of variance, but not a lot of relationship to winning.  By the looks of this plot, this one could use some transformation.

A better measure of whether you’re actually a good 3-point shooter is true shooting percentage, which is your shooting percentage weighted for the value of your shots.  That is, shooting 40 percent on threes counts the same as shooting 60 percent on twos.  Likely, if you have a high true shooting percentage, you’re not only taking a lot of threes, you’re actually hitting them.
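For reference, the standard true shooting formula in R, where 0.44 × FTA approximates trips to the free throw line:

true_shooting <- function(pts, fga, fta) {
  # points per shooting possession, halved so it reads like a field goal percentage
  pts / (2 * (fga + 0.44 * fta))
}
true_shooting(pts = 120, fga = 100, fta = 0)  # 40% on threes (120 pts on 100 FGA) -> 0.60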

Ahh… there we go.  If you take shots and hit them, especially threes, it stands to reason that you’re going to win a lot, which is why the 3-pointer has also taken over the NBA game.

But there’s one measure that will always be highly correlated with winning: point differential.

I did this in R and I’m pasting my code below in case anyone wants to use this same data.

ncaa <- read.table(file = "untitled.txt", sep = ",", header = TRUE, quote = "")
ncaa <- ncaa[, -c(1, 17)]  # drop two columns not used in the analysis
names(ncaa) <- c("school", "G", "W", "L", "WLPct", "SRS", "SOS", "ConfWins", "ConfLoss", "HomeWins", "HomeLoss", "AwayWins", "AwayLoss",
                 "TeamPts", "OppPts", "Pace", "ORtg", "FTrate", "ThreeAttRate", "TrueShoot", "TotalRb", "AssPct", "StlPct", "BlkPct",
                 "eFGPct", "TOPct", "ORBPct", "FTPerFGA")

# Pace
plot(ncaa$WLPct, ncaa$Pace, xlab = "Win Loss Percentage", ylab = "Pace", main = "Pace and Winning")
abline(lm(ncaa$Pace ~ ncaa$WLPct))

# Assists
plot(ncaa$WLPct, ncaa$AssPct, xlab = "Win Loss Percentage", ylab = "Assist Percentage", main = "Assist % and Winning")
abline(lm(ncaa$AssPct ~ ncaa$WLPct))

# Three Point Attempt Rate
plot(ncaa$WLPct, ncaa$ThreeAttRate, xlab = "Win Loss Percentage", ylab = "Three Point Attempt Rate", main = "3 Point Attempt Rate and Winning")
abline(lm(ncaa$ThreeAttRate ~ ncaa$WLPct))

# True Shooting
plot(ncaa$WLPct, ncaa$TrueShoot, xlab = "Win Loss Percentage", ylab = "True Shooting", main = "True Shooting and Winning")
abline(lm(ncaa$TrueShoot ~ ncaa$WLPct))

# Point Differential (per game)
ncaa$PtDiff <- (ncaa$TeamPts - ncaa$OppPts) / ncaa$G
plot(ncaa$WLPct, ncaa$PtDiff, xlab = "Win Loss Percentage", ylab = "Point Differential", main = "Point Differential and Winning")
abline(lm(ncaa$PtDiff ~ ncaa$WLPct))

Premier League Data Visualization

I’m a Premier League fan.  Who doesn’t enjoy getting up at 7am to watch some of the best players in the world and some of the best fans go to battle across the pond?

I got a little annoyed this morning watching the coverage as they discussed who was in what place.  Certainly, the Premier League is not the only place you hear this, but I think the way we discuss who’s “winning” is occasionally, well, stupid.  If a team has played two fewer games, it’s a bit asinine to worry about who’s ahead of whom.  I realize it comes down to brass tacks (you either have the points or you don’t), but in a league where every team eventually plays the same number of games, it seems dumb to compare point totals when teams haven’t played the same number of games.

As an illustration, here’s the current leaderboard plotted by games played and points.

As you can see, Manchester City is the lone team still at 25 games played, yet according to commentators they had “fallen” to 4th in the league.  Liverpool has had the opportunity to bank 6 more points!

Now let’s look at the same teams plotted by points per match.

Chelsea still dominates; it’s pretty obvious that they’ve run away from the pack.  Now notice that Liverpool actually sits just barely in the top four, by only two thousandths of a point per match.

The place I think this gets most interesting is the relegation zone.  The bottom 3 teams are relegated at the end of the season to the league below.  The generally accepted threshold for safety is 40 points, which, in a 38-game season, amounts to about 1 point per match.  If we use points per match as the metric and plot it against that line, it’s interesting what emerges.

You’ll notice that Burnley and Watford, while certainly still in danger, are above the 1-point-per-match line.  Bournemouth, Leicester, and Swansea are right at 1 point per match and still in danger, and 4 teams are below the line.  Look out, Crystal Palace, Middlesbrough, Hull, and Sunderland.
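If you want to reproduce this kind of chart, here’s a minimal sketch in base R.  The epl data frame (with team, points, and played columns) is a hypothetical stand-in for however you load the standings:

epl$ppm <- epl$points / epl$played
plot(epl$played, epl$ppm, xlab = "Matches Played", ylab = "Points per Match",
     main = "Points per Match and the Safety Line")
abline(h = 40 / 38, lty = 2)  # the 40-point survival pace, about 1.05 points per match
text(epl$played, epl$ppm, labels = epl$team, pos = 3, cex = 0.7)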

One more since I’m enjoying this…

I think most folks would agree that, given a large enough sample, point differential (or, in the case of football, goal differential) is a strong measure of whether a team is good.  Now, you can certainly be a determinist about this and say that points and winning are all that matter; it’s true that the table awards no points for goal differential.  Let’s test this.

As you can see, Manchester City, Swansea, and Hull (among others) are outperforming their goal differential in points, perhaps hinting at a coming slide, while Middlesbrough and Spurs are underperforming their goal differential in points, perhaps hinting at some improvement to come.
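A sketch of how that comparison might look in R, again assuming the hypothetical epl data frame, now with a goal_diff column:

fit <- lm(points ~ goal_diff, data = epl)
plot(epl$goal_diff, epl$points, xlab = "Goal Differential", ylab = "Points",
     main = "Points vs. Goal Differential")
abline(fit)
# positive residuals = banking more points than goal differential suggests;
# negative residuals = results may be due to improve
round(resid(fit), 1)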

Explaining Statistical Methods – Bootstrapping

I’m trying to write a series of posts explaining statistical methods in layman’s terms.  Bootstrapping is one of my favorite techniques, and it isn’t necessarily in the common parlance among non-statisticians.

Most people are probably familiar with the word bootstrap, as in pulling oneself up by one’s bootstraps, which means, basically, to use what you’ve got in order to make your way.  That’s what statistical bootstrapping is.  When we’re not confident about what our distribution is, or we don’t have a method for determining a parameter’s confidence interval, we can take what we have (a sample), treat it as our best estimate of the population, and then resample it with replacement to generate an approximate sampling distribution for our statistic, which can be used to estimate a confidence interval.

There are 4 commonly used bootstrapping methods for estimating confidence intervals.

  1. Standard – uses the mean squared error to estimate the interval.  Can be a good method if an analytic estimator of the variance is not available.
  2. Percentile – the simple method.  Estimate the parameter on each new sample, then take the middle range of the values (for a desired 95% interval, take the middle 95% of values); see the sketch after this list.  Can be biased and skewed.
  3. T-pivot – requires a pivot quantity relating the values.  For example, the t distribution does not depend on the mean or variance, so t can be used as a pivot to estimate either.  Gives the best estimates when a pivot quantity exists.
  4. BCa – corrects the percentile method for bias and skew.
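Here’s a minimal sketch of the percentile method in R, using a made-up sample purely for illustration:

set.seed(42)
x <- rnorm(50, mean = 10, sd = 3)  # pretend this sample is all we have

# resample with replacement, recomputing the mean each time
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))  # middle 95% = the confidence interval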

Bootstrapping is useful because, unlike many common statistical tests, it doesn’t assume a particular distribution, yet it can still give you useful estimates of confidence intervals.

This Is Why I’m an R User

“While Excel (ahem) excels at things like arithmetic and tabulations (and some complex things too, like presentation), R’s programmatic focus introduces concepts like data structures, iteration, and functions. Once you’ve made the investment in learning R, these abstractions make reducing complex tasks into discrete steps possible, and automating repeated similar tasks much easier. For the full, compelling argument, follow the link to the Yhat blog, below.”

https://www.r-bloggers.com/the-difference-between-r-and-excel/