by Jess Behrens
© 2005-2018 Jess Behrens, All Rights Reserved
"I think you should always bear in mind that entropy is not on your side."
Musk is right about this: when it comes to understanding things, nature's tendency is towards disorder, which gets in the way of prediction. But do we make this tendency worse by misunderstanding what we are trying to predict? When it comes to the NCAA Tournament, the answer to that question is a resounding yes. I'll get to why I say that in a little bit. First, though, I'm going to briefly outline what I will be doing over the next few posts:
1. Introduce the idea of entropy and why it is the ideal tool for quantifying tournament results.
2. Cover the difference in entropy by tournament seed vs. Evolutionary Game Theory (EGT) Strategy.
3. Show how the Tournaments group together into 2 clusters (Seaborn clustermap) based on regression
results from EGT simulations.
4. Examine the 2nd Round Game Theory equilibrium state in these two clusters.
5. Successfully predict the entropy in each of the Tournaments using EGT simulation totals within a
In this post, I will be covering points 1 & 2. Chapter 22 will cover 3 & 4; Chapter 23 will focus on 5.
So, right to it: why am I so enthused about entropy? And how does it show that tournament seed confuses our understanding of the tournament?
Entropy (calculated in SciPy) works really well as a tool for uniquely quantifying each tournament by year. This is because the seeds, and by extension the Evolutionary Game Theory strategies, represent a natural way to group the teams, something that is necessary for calculating entropy. So, all 1 seeds are grouped together, all 2 seeds together, all 3 seeds together, and so on; the same grouping can be done by EGT strategy.
Since entropy measures disorder, comparing entropy scores for the same tournaments grouped by seed vs. strategy will tell us which of the two methods is more predictable. And because the tournament is divided into successive rounds, we can calculate a separate entropy score for every round in a given year.
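A minimal sketch of this kind of calculation, assuming the entropy of a round is the Shannon entropy of how the surviving teams are spread across groups. That definition, the function name `group_entropy`, and the sample fields are assumptions made for illustration, not the exact code behind the figures in this post.

```python
import math
from collections import Counter

def group_entropy(labels):
    """Shannon entropy (in bits) of how teams are spread across groups.

    `labels` holds one group label per surviving team, e.g. its seed
    or its EGT strategy.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical Final Four fields labelled by seed (illustrative only).
chalk_final_four = [1, 1, 1, 1]   # every team falls in the same group
mixed_final_four = [1, 2, 5, 11]  # every team falls in a different group

print(group_entropy(chalk_final_four))
print(group_entropy(mixed_final_four))
```

The all-1-seed field scores 0 bits (perfect order), while the field with four different seeds scores 2 bits. Passing strategy labels instead of seeds works the same way, and `scipy.stats.entropy` returns the same quantity when it is given the group counts and `base=2`.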
Finally, and probably most importantly, we can inject a null hypothesis into this process. Seeding is a null hypothesis. It's one that the Tournament committee invariably knows is incorrect, but just because you believe something is wrong does not mean that it isn't a hypothesis. By seeding each team, the tournament is effectively telling us how the tournament 'should' go: 1 beats 16, 8 beats 9, 5 beats 12, and so on up to having only 1 seeds in the Final Four. We can do this same thing, determine how a tournament 'should' go, for entropy calculated by EGT strategy as well. Thus, it is possible to measure the entropy for each round in an 'ideal' tournament. This 'ideal' amount can then be subtracted from the actual entropy, by round, as a measure of the degree to which a given Tournament varies from what is expected. If you then sum all of these round-adjusted entropy values, you get a measure for the total amount a given Tournament year varies from expected. Figure 1 shows the expected entropy by round as well as the total adjusted entropy scores for each tournament year by seed & EGT strategy.
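The subtraction and sum described above can be sketched as follows. This is a hedged illustration: it assumes Shannon entropy over seed groups, builds the 'ideal' field by letting the better seed always win in each of four regions, and uses made-up fields for a single hypothetical year. The names `group_entropy`, `chalk_field`, and the sample data are not from my database.

```python
import math
from collections import Counter

def group_entropy(labels):
    """Shannon entropy (in bits) of how teams are spread across groups."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def chalk_field(n_seeds, regions=4):
    """Seeds still alive if the better seed always wins: 1..n_seeds per region."""
    return [seed for seed in range(1, n_seeds + 1) for _ in range(regions)]

# Number of distinct seeds left per region in an 'ideal' tournament.
ideal_seeds = {"Sweet 16": 4, "Elite 8": 2, "Final Four": 1}

# Made-up fields for one hypothetical year (illustrative only).
actual = {
    "Sweet 16": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 10, 12],
    "Elite 8": [1, 1, 1, 2, 2, 3, 5, 7],
    "Final Four": [1, 1, 2, 5],
}

# Adjusted entropy per round: actual minus ideal.
adjusted = {
    rnd: group_entropy(actual[rnd]) - group_entropy(chalk_field(n))
    for rnd, n in ideal_seeds.items()
}

# Total adjusted entropy for the year: the sum over rounds.
total_adjusted = sum(adjusted.values())

for rnd, value in adjusted.items():
    print(f"{rnd}: {value:.3f}")
print(f"Total adjusted entropy: {total_adjusted:.3f}")
```

Under these assumptions, a year in which every better seed wins scores exactly zero in every round, and the total grows as the field drifts away from the seeding.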
Figure 1. Expected Entropy by Round & Total Adjusted Entropy by Tournament Year, Seed & Strategy
As a quick gut check for accuracy, you can see that 2008 has the lowest entropy (the highest amount of order) when measured by seed. This should make sense given that in 2008 all four 1 seeds made the Final Four. In Figure 1, the larger the number, the greater the entropy. And since entropy measures disorder, entropy by seed is much more 'random' than by strategy. If we follow Elon Musk's advice in the quote above, it's tempting to conclude that seed-based entropy is more 'accurate' because, well, it's the second law of thermodynamics. I'm not so sure that is the case here.
Figure 2 shows the coefficient of variation for link/triangle formation within my NCAA Tournament database for the 'win' & 'loss' networks by seed and strategy. These two graphics are reproduced from the previous chapter, where I covered 'homophily'. While not identical to entropy, the coefficient of variation is a good measure of the degree to which teams bind together by class (i.e., seed or strategy). A low coefficient of variation is an indication of consistency, while higher COVs indicate that the group has a lot more variation in terms of the rate at which it forms links & triangles. Thus, as measured by COV, strategy is a much more consistent 'grouping' method than seed.
Figure 2. Coefficient of Variation (COV) for In vs. Out Caste Link & Triad Formation by Seed & Strategy, Win & Loss Networks
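For readers who have not met the coefficient of variation before, it is simply the standard deviation divided by the mean. The sketch below uses made-up link-formation rates, not values from my database, and assumes the population standard deviation; the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Made-up link-formation rates for the groups in two grouping schemes.
steady_rates = [0.48, 0.50, 0.52, 0.50]  # groups behave alike
uneven_rates = [0.10, 0.90, 0.30, 0.70]  # groups behave very differently

print(f"steady: {coefficient_of_variation(steady_rates):.3f}")
print(f"uneven: {coefficient_of_variation(uneven_rates):.3f}")
```

The steadier set of rates produces the smaller COV, which is the sense in which a low COV signals a more consistent grouping. `scipy.stats.variation` computes the same ratio and, as far as I know, also defaults to the population standard deviation.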
I believe that the difference in the entropy scores shown in Figure 1, with entropy by 'seed' being so much higher than by 'strategy', is a function of the fact that the teams do not consistently bind together by 'seed'. Thus, much of the 'surprise' we feel when a higher seed loses to a lower seed in the tournament comes from the fact that the Tournament committee is required to make very fine-grained distinctions among teams that are more or less the same. In other words, the method we choose to use in seeding the tournament is a function of our own 'rules', not what actually makes a team 'good' or 'bad'.
That's it for the basics of entropy! Next up, I will cover how the Tournament years group together into clusters as measured by a machine learning 'clustermap'. I will then look to see if economic game theory indicates that one or more EGT strategies have an advantage when seed & strategy are considered simultaneously. Based on results as described in my previous posts, my focus will be on the 2nd Round.