Big Ten Forecasting & Stats

#1      

danielb927

Orange Krush Class of 2013
Rochester, MN
Thought it might be nice to have a thread for stats-heavy discussions as they come up this year, rather than leaving them at the end of various pre- and post-game threads. I haven't run any B1G outcome forecasts yet this year as I've been working more on the data visualization things, but those will come back at some point once the conference race becomes a bit more interesting!

In the meantime, I wanted to start off with a post from deep in the Mizzou postgame thread about free-throw shooting. Maybe it's too long to be interesting, but it was right up the alley of a stats book I'd just finished reading, so I couldn't resist the chance to practice using what I'd learned a bit!
 
#2      

danielb927

There was a post in the Mizzou postgame thread from @CurbeloYourEnthusiasm saying the following:

_____________________________________________________________________
[Kofi] had a 6 game stretch where he hit 28/32 FTs. That’s 88%. He’s clearly capable of making them. Not many 50-60% foul shooters ever have that good stretch.
------------------------------------------------------------------------

That got me thinking, how unlikely is this? In other words, given that Kofi had a stretch of 28 for 32 FTs made, what's the probability that he's actually a <60% FT shooter (or was last year, anyway)?

TL;DR -- using only the 28-for-32 information about Kofi, combined with just a little league-wide information, we can estimate that there's about a 1% chance Kofi is really a 60% FT shooter or worse. This is a bit lower than, but still close to, the value of 2% we would have estimated using the full 2019-20 season data.

Full Analysis

Well, first let's look at the initial statement. "Not many 50-60% foul shooters ever have that good stretch." How many really do?

The "binomial distribution" can give us the probability of N made FTs, out of a random set of 32, for a 60% FT shooter:

[Figure 1608067987755.png: binomial distribution of the number of made FTs, out of 32, for a 60% FT shooter]

The probability of exactly 28 is about 0.06%. If we also add in the probabilities from 29 to 32, we get 0.0007, or a 0.07% chance. That's pretty small, but there are a couple of important caveats here.
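For anyone who wants to reproduce these numbers, here is a minimal Python sketch of the binomial calculation. The function names are my own, and it assumes every shot is independent with a fixed 60% chance of going in.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k makes in n independent attempts."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k_min, n, p):
    """Probability of k_min or more makes in n independent attempts."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

p_exact = binom_pmf(28, 32, 0.60)   # exactly 28 of 32
p_tail = binom_tail(28, 32, 0.60)   # 28, 29, 30, 31 or 32 of 32

print(f"P(exactly 28 of 32) = {p_exact:.4%}")
print(f"P(28+ of 32)        = {p_tail:.4%}")
```

Changing the 0.60 to another value shows how quickly the tail probability grows for better shooters.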

First, this assumes every single foul shot is a completely independent, unrelated occurrence. There are a ton of reasons that assumption might not be true: confidence, tiredness in a game or a season, injuries, "streaky shooting". Those almost certainly matter, but for the purposes of a simple analysis, let's let them go.

Second, we didn't just take any random set of shots, but cherry-picked the best stretch (whole games only) from Kofi's freshman season. Again, this would make the probability quite a bit higher.

Both of these are really important, and probably mean that our 0.07% estimate could be way off. If we really wanted to address the question, "how many 50-60% shooters have a stretch like 28/32", we would need to account for this. But that's not what I'm after here, and in fact, all we need is some prior data about NCAA-wide foul shooting to come up with a useful correction for our purposes.

(Other note: there's probably a way to correct for the cherry-picking of the "best" stretch directly, but it's either a bit of math I don't know yet, or it would likely involve a lot of tedious counting -- in either case, I won't use it here.)

So far we have found P(make 28+ of 32, given a "true" FT% < 60%). Here I sloppily swapped <60% for =60%, since worse shooters certainly won't be any more likely to hit that many. But what we really want is the opposite: P("true" FT% < 60%, given 28+ made of 32). For that, we can use Bayes' Theorem:

P(A | B) = P(B | A) x P(A) / P(B)

If you haven't seen this notation before, P(A | B) means "probability of A occurring, given that B did occur".

In this case, A = "true FT% < 60%", and B = "made 28+ of 32". So P(A) is the probability of any player having a true FT% under 60%, and P(B) is the probability of any player making 28+ of 32. How does this change our beliefs based only on the 28-for-32 stat? Well, first we need some estimates for P(A) and P(B).

Kofi is a true center. In the NBA (the only place I could find position-specific data), the bottom 1/4 of centers shoot about 65% or worse. Since college players usually have room for improvement, let's say that cutoff is about 5 percentage points higher than it would be in the NCAA, so the bottom 1/4 of college centers shoot about 60% or worse. This gives us P(A) = 25%. We could also just use intuition here; to me, this number seems pretty reasonable (maybe a bit aggressive) given that Kofi is <60% this year, but was well over 60% in a full season last year.

We'll estimate P(B) in the same way we figured the first probability -- using the binomial distribution. This way, the same errors we made in finding P(B | A) will also show up in P(B), and they should (hopefully) cancel out to some degree. Since an "average" college hoops player is a 70% FT shooter, we can plug that into the binomial distribution to get P(B) = 1.9%.

Putting that all together, we get:
0.07% (odds a 60% FT shooter would make 28+ of 32)
x
25% (odds a college center is a <60% FT shooter)
/
1.9% (odds an average college player would make 28+ of 32).
=
A 1% chance that Kofi is truly a <60% FT shooter, given his 28-for-32 stretch last year.
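The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the post (independent shots, a 60% shooter standing in for "<60%", a 25% prior, and a 70% "average" shooter for P(B)); the variable names are mine.

```python
from math import comb

def binom_tail(k_min, n, p):
    """Probability of k_min or more makes in n independent attempts."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p_b_given_a = binom_tail(28, 32, 0.60)  # P(B | A): a 60% shooter makes 28+ of 32
p_a = 0.25                              # P(A): prior that a college center is <60%
p_b = binom_tail(28, 32, 0.70)          # P(B): an average (70%) shooter makes 28+ of 32

# Bayes' Theorem: P(A | B) = P(B | A) x P(A) / P(B)
posterior = p_b_given_a * p_a / p_b

print(f"P(B | A) = {p_b_given_a:.3%}")
print(f"P(B)     = {p_b:.3%}")
print(f"P(A | B) = {posterior:.3%}")
```

Swapping in a different prior for `p_a` is an easy way to see how sensitive the final answer is to that guess.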

Is there any way to check whether this is right? Well, we can never know Kofi's "true" FT%, but we can at least use more data. We can also use something called the "beta distribution", which will give us what we want directly: a distribution on the likelihood of a certain true FT%, given a certain number of makes and misses.

If we plug Kofi's 2019-20 stats in (111 of 164), we get the following distribution:

[Figure 1608070849915.png: beta distribution of Kofi's "true" FT%, given 111 makes in 164 attempts]

As we can see, there's a small but non-zero chance he's a "true" 60% FT shooter or worse, even taking 150+ free throws into account. If we add up the probabilities from 0 to 0.6, we get a better estimate: a 2% chance that Kofi is truly a <60% FT shooter, given his entire 2019-20 season.
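Here is a rough Python sketch of that "add up the probabilities from 0 to 0.6" step. The post doesn't say which beta parameters were used, so this assumes a flat prior, i.e. Beta(makes + 1, misses + 1), and integrates the density numerically with a simple midpoint rule instead of calling a stats library.

```python
from math import lgamma, log, exp

def beta_cdf(x, a, b, steps=20000):
    """Approximate P(p <= x) for a Beta(a, b) distribution (midpoint rule)."""
    # Log of the normalizing constant 1 / B(a, b)
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    width = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width  # midpoint of the i-th slice, strictly inside (0, x)
        total += exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))
    return total * width

makes, attempts = 111, 164
misses = attempts - makes

# Assumption: flat prior, so the posterior is Beta(makes + 1, misses + 1)
p_below_60 = beta_cdf(0.60, makes + 1, misses + 1)
print(f"P(true FT% < 60% | {makes} of {attempts}) = {p_below_60:.2%}")
```

A different prior (for example Beta(makes, misses)) would shift the result slightly, so treat the output as a ballpark figure.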

These aren't exactly the same, but given that we were handed some very cherry-picked stats to work with, the fact that we ended up off by only a factor of 2 is pretty cool! And it's actually a bit comforting to know that our estimate was still smaller given only the 28-for-32 stat, since that's a pretty good stretch. This was all honest math -- I didn't go back and tweak P(A) or P(B) to get a close answer.

By contrast, if we had just used the binomial distribution alone, or if we had plugged in the 28 of 32 stretch to the beta distribution, we would have guessed something 50-100x smaller. Using information about other free throw shooters helped us use that bit of information with more context, leading to a better guess.
 
#4      

danielb927

Oops, sorry about the missing figures there, didn't get those copied over from the other post correctly.
 
#5      

danielb927

Kofi shot 67.7 percent from the line last year

Right on, the last bit of analysis above takes a look at that exact point. Given the # of free throws Kofi shot, you'd expect a "true" 60% shooter to make 67.7% or better about 2% of the time.
 
#6      

danielb927

Here's the stats breakdown from the Minnesota game! I don't have time for much analysis here, except to say that we completely owned on a shot-for-shot basis in all phases.

20201215_il_minn_home_full-png.6821
 
#7      
I quickly created a spreadsheet in Excel which simulates 10,000 164-shot seasons for a 60% shooter.

(Feel free to check my work or recreate / add to it if you want)
* Across the Top Row I assigned a season number from 1 to 10,000
* The next 164 rows have the equation =ROUND(1.2*RAND(),0). If I'm not mistaken this should report a 1 for a shot made and a 0 for a shot missed for a 60% shooter.
* The next 133 rows have the equation =IF(SUM("32 consecutive rows of the current column, starting with Rows 2 thru 33")>27,1,0), with the 32-row window shifted down one row each time. If I'm not mistaken this should report a 1 for any consecutive 32 shots where 28 or more were made.
* The next Row sums "Row 2 thru 165 of the current column" to determine the number of made shots that season
* The next Row takes the number of made shots divided by 164 shots to determine the % of made shots that season
* The final Row sums the 133 rows that determine if 28+ shots were made in each possible 32 consecutive shot period to determine how many times that particular season the player hit 28+ of 32.
* At the very end I report the min/max values for Shots made per season, % made per season, and # of times 28+ of 32 shots were made in each season. I also summed the total number of seasons with at least 1 time 28+shots were made in 32 consecutive shots.

The Results:
Shots Made Per Season: Min 72 / Max 118 / Average 95.66
% Shots Made Per Season: Min 44% / Max 72% / Average 58%
Number of times per season 28+ out of 32 shots were made: Min 0 / Max 7
Number of seasons with at least one 28+ out of 32 shot period: 25

So it happens in 0.25% of seasons based on my 10,000 simulated seasons.
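For anyone who would rather not build the spreadsheet, the steps above could be sketched in Python roughly like this. It is not the spreadsheet itself: the function names, the seed, and the choice to draw each shot as "made with probability p" are my own assumptions.

```python
import random

def has_hot_window(shots, window=32, min_makes=28):
    """True if any `window` consecutive shots contain at least `min_makes` makes."""
    makes = sum(shots[:window])
    if makes >= min_makes:
        return True
    # Slide the window one shot at a time: add the new shot, drop the oldest.
    for i in range(window, len(shots)):
        makes += shots[i] - shots[i - window]
        if makes >= min_makes:
            return True
    return False

def simulate(n_seasons=10_000, shots_per_season=164, p=0.60, seed=0):
    """Fraction of simulated seasons with at least one 28+ of 32 stretch."""
    rng = random.Random(seed)
    hot_seasons = 0
    for _ in range(n_seasons):
        shots = [1 if rng.random() < p else 0 for _ in range(shots_per_season)]
        if has_hot_window(shots):
            hot_seasons += 1
    return hot_seasons / n_seasons

if __name__ == "__main__":
    print(f"Seasons with a 28+ of 32 stretch: {simulate():.2%}")
```

Re-running with different `seed` values gives a feel for how much the answer moves around from one batch of 10,000 seasons to the next.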
 
#8      
Random number scenarios are marvelous!
They allow you to model complicated situations without having a PhD in Stats.
Thanks
 
#10      

danielb927

I've seen a few of these posted for various games, and still have zero idea what it means lol.

It's admittedly a bit of a work in progress, and even though I'm the one making these, I'm also still trying to figure out how they're useful (if at all ;)). Mostly, I make them because I think they look cool.

Broadly, each chart is a graphical representation of the entire game, organized in terms of "shot opportunities". If you think of a basketball game as a series of chances to score points for each team, this chart helps answer two questions: 1) how many chances to score did each team have? and 2) how did each team use their chances to score?

The first question is represented by the length of each team's 3P/2P/FT attempt bars. These can differ due to offensive rebounds (purple, add scoring chances) and turnovers (red, take away scoring chances), which are represented at the right side of the chart.

The second question is represented by the breakdown between the types of shot, and the amount of filled-in space. The boxes are also sized so that the area in each box is proportional to the # of points it represents -- so if you took out all the white space, whoever has more shaded area would have won the game.

The number values are makes and misses in the three categories of scoring (3Ps, 2Ps, and FTs), and then "points per shot" (PPS) or "points per (shooting) foul" (PPF) are just the average # of points per attempt in the respective category of scoring, over the full game.
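As a small illustration of the PPS numbers, here is a Python sketch of the calculation. The makes and misses below are made up for the example and do not come from any chart in this thread; the function name is also mine.

```python
def points_per_attempt(makes, misses, points_per_make):
    """Average points per attempt for one shot type (PPS on the chart)."""
    attempts = makes + misses
    if attempts == 0:
        return 0.0
    return points_per_make * makes / attempts

# Hypothetical makes/misses for illustration only (not from any chart here).
box = {"3P": (8, 12, 3), "2P": (20, 20, 2)}

for shot_type, (makes, misses, value) in box.items():
    pps = points_per_attempt(makes, misses, value)
    print(f"{shot_type}: {makes}-for-{makes + misses}, PPS = {pps:.2f}")
```

PPF would presumably work the same way, dividing points scored at the line by the number of shooting fouls drawn, but that reading is my assumption.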
 
#11      

danielb927

Cool stuff @Goode-for-3 -- nice work!

I think ROUND(1.2 * RAND(), 0) will actually give you something a bit lower than a 60% shooter. The inner value rounds to 0 on [0, 0.5) and to 1 on [0.5, 1.2). This means the made FT percentage is (1.2 - 0.5) / 1.2, or 7/12, which is 58.3%. Probably won't change the outcome a ton but I'd be interested to see how much it does.

You could instead do IF( RAND() < 0.6, 1, 0), and change the 0.6 to refer to a cell if you want to update the FT% on the fly quickly. Just an idea!
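A quick way to check this outside of Excel is a small Python simulation of both formulas. This is only a sketch: `excel_round` is my own stand-in for Excel's ROUND(x, 0) on non-negative numbers (round half up), and the seed and sample size are arbitrary.

```python
import math
import random

def excel_round(x):
    """Round half up, mimicking Excel's ROUND(x, 0) for non-negative x."""
    return math.floor(x + 0.5)

rng = random.Random(0)
n = 200_000

# Formula 1: ROUND(1.2 * RAND(), 0)
round_makes = sum(excel_round(1.2 * rng.random()) for _ in range(n))
# Formula 2: IF(RAND() < 0.6, 1, 0)
if_makes = sum(1 if rng.random() < 0.6 else 0 for _ in range(n))

print(f"ROUND(1.2*RAND(), 0):   {round_makes / n:.3f}  (theory: {7 / 12:.3f})")
print(f"IF(RAND() < 0.6, 1, 0): {if_makes / n:.3f}  (theory: 0.600)")
```

The simulated make rates should land close to the theoretical values, with the ROUND version sitting a bit under 60%.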

Even with the slight difference, the result seems pretty reasonable -- the odds of making 28+ in a random 32 are 0.07%, while the simulation shows that over the course of ~5x that many shots, your odds would go up by something like 4x. Very cool!
 
#12      
Thanks! I was curious why my overall shooting average was coming out to 58% even when I increased the number of seasons (columns) from 1,000 to 10,000, and I figured it had to do with my equation being wrong. I just updated my file per your suggested equation =IF(RAND()<0.6,1,0).

The corrected results are:

Shots Made Per Season: Min 74 / Max 123 / Average 98.40
% Shots Made Per Season: Min 45% / Max 75% / Average 60%
Number of times per season 28+ out of 32 shots were made: Min 0 / Max 16 (total times occurring over the 10,000 seasons = 195)
Number of seasons with at least one 28+ out of 32 shot period: 52

So it happens in 0.52% of seasons based on my 10,000 simulated seasons. Note that there's a lot of variation in the results when I force Excel to recompute the RAND() function, which to me indicates that 10,000 simulated seasons is not enough to narrow in on the real solution. Unfortunately, Excel only allows about 16,000 columns, and my computer didn't like it when I tried to create 10 different 10,000-column pages and then find the cumulative values of all those pages. I could probably write down the results for 10,000 seasons, force Excel to update, repeat that 10 times, and then manually compute the combined averages to see what it looks like over 100,000 seasons, but I'm not convinced even that would really be enough to get rid of the noise. I guess that's where the value in actually knowing how to do the math comes in!
 
#13      
Are the purple and red blocks accidentally reversed? Purple OR should add to scoring opps and red TO should reduce them. Am I thinking about this correctly?
 
#14      

danielb927

Someone else asked a similar question on Twitter. The blocks are in the correct places, but I think I've been inconsistent in defining a "scoring opportunity", and that's probably the cause of the confusion. One definition would be "any time you gain possession of the ball, either from the other team or after you take a shot". That definition would include turnovers, which is how things are organized on the plot. Another, the kind you're talking about, is just "field goal attempts + shooting fouls".

I should probably come up with different terms for these two things (I think the second is better termed a "scoring opportunity" or perhaps a "weighted shot"). They're also both different from possessions, which are basically just "times we got the ball back from the other team".

Anyway, the key thing is that the end of the purple/red bars will always align. So, using your definition for scoring opportunities (field goal attempts + shooting fouls), think of the chart this way:

- Add +1 TO -> add one red block -> have to remove a 3P, 2P, or FTs to make room for the TO -> -1 "scoring opportunity"
- Add +1 OR -> add one purple block -> have to add a 3P, 2P, or FTs to fill that extra space -> +1 "scoring opportunity"
 
#15      

danielb927

Very cool! Interesting that a change of less than 2 percentage points in FT shooting (58.3% to 60%) makes that outcome almost 2x more likely.

The variability at 10k simulations is also a really good thing to note. Given that you're only hitting this outcome about 50 times, it's not too surprising that it's still noisy. And in modeling, knowing how certain you are of your result is just as important as the result itself -- sometimes even more, if your uncertainty is very large!

If you're able to provide a handful of results from 10k trials each, we can still use some pretty simple statistics to come up with our confidence on the average value.
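As a rough sketch of what that could look like in Python: treat "season had a 28+ of 32 stretch" as a yes/no outcome and use the usual normal approximation for a proportion. The function names are mine, and with only ~50 hits the approximation is fairly crude.

```python
from math import sqrt

def proportion_interval(hits, trials, z=1.96):
    """Point estimate, standard error and ~95% interval for a proportion."""
    p_hat = hits / trials
    se = sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

def pool_runs(hits_per_run, trials_per_run):
    """Combine several equally sized runs into one larger estimate."""
    total_trials = trials_per_run * len(hits_per_run)
    return proportion_interval(sum(hits_per_run), total_trials)

# The one corrected run reported above: 52 of 10,000 seasons
p_hat, se, (low, high) = proportion_interval(52, 10_000)
print(f"estimate = {p_hat:.2%}, standard error = {se:.3%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

For the single 52-in-10,000 run this gives an interval of roughly 0.38% to 0.66%, which fits with the noise seen between recalculations. Passing a list of hit counts from several runs to `pool_runs` would narrow it.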
 
#16      

danielb927

Update following the Rutgers game.

We scored more efficiently in every phase of the game — but this game is a good example of why that alone isn't always enough. Two reasons why:

1) Scoring opportunities are important. Even though each team gets roughly the same # of possessions in a game, grabbing offensive boards and not turning the ball over can lead to big differences in the number of times you shoot the ball towards the hoop. In this case, Rutgers did that 9 more times than we did.

2) Distribution is important. For both teams in this game, 2-pointers were the least efficient way to score (about 1 point per shot). But while 2-point attempts made up 68% of our total scoring attempts, they were only 52% of Rutgers'. This meant that the Scarlet Knights' offense spent comparatively more time on higher-efficiency 3P and FT attempts (especially free throws, as we all saw...)

20201220_IL_Rutgers_Away_Full.png
 
#17      

danielb927

Latest chart, this time for the PSU win. First time I can remember us having more ORs than TOs against good competition this year — that's nice to see! We generated 8 more real scoring opportunities than Penn State, and also scored more efficiently (mainly by getting to the line so much). That led to a 17 point win that has us up to #7 in KenPom!

20201223_IL_PSU_Away_Full.png
 
#18      

danielb927

Had some time with the holiday today, so I am currently "scraping" data from (I think) every game going back to the 2012-2013 season, including the team and player lines from each game. Hoping to use this as the basis for future analysis, rather than manually doing everything game-by-game as I've been doing.
 
#19      
I'd like to see that 3PPS stay above 1.25 in most of our remaining games. We’re a more fun team when we make our 3s!
 
#22      

danielb927

@danielb927

Any version of the B1G standings being created yet?

Or still far too many possibilities?

They could exist, I just haven't made the time for it yet, sorry! Was hoping to automate a few things this year so I don't have to generate these by hand every night, but alas, didn't get as much done over the holidays as I'd hoped to....
 
#23      
No worries, appreciate whatever you put out there!
 
#25      

danielb927

Daniel,
Are you going to add a random "game cancelled" factor to the process? /S

Hah, probably should! Torvik's site even has a page for crowd-sourcing the cancellations this year. I'm not sure how realistic the B1G forecast would look given all the TBD/PPD games right now; I can't imagine many stats people have considered how to build that into a model.