2019 NCAA Tournament Lessons #4, cont'd: Seaborn Clustermaps

Updated: Jun 4, 2019


by Jess Behrens

© 2005-2019 Jess Behrens, All Rights Reserved

Continuing in the vein of my last post, I'm going to present two more examples of how Seaborn Clustermaps can be used to identify important patterns within the tournament. I will use the same approach in this post as I did in the last.

Few tournaments were as memorable as 2011 & 2014. Both of these saw surprising teams play in the final for the championship. I'm sure if you are reading this blog, you recall Butler's seemingly impossible run to the final in 2011 & the cardiac 'Cats in 2014, who won multiple games in a row on last second shots. Of course, UConn dispensed with each of these challengers & won both national championships.

But they weren't the only teams to march improbably through the tournament in these years. As I mentioned in my previous post, 2011 VCU & 2014 Dayton were beneficiaries of the Index 16, Ranks 18-28 split, falling more or less in the exact same location. As you might guess (or why would I write this blog post?), both Butler & Kentucky, as well as both UConn teams, fell in nearly identical locations too. In fact, the two UConn teams were, more or less, identical, at least as far as my data vector is concerned.

As I described previously, I believe that the strength of 2011 Butler, etc., & the tournament results we all now admire, is definitively tied to the proportions of Evolutionary Game Theory species within the community of those specific years. Furthermore, I will show that it is possible to use the output from the same regression analysis, derived from the EGT simulation energy totals I described in the last post, to identify groupings of tournament years as a possible explanation for how those magical tournament runs happened.

Figure 1 shows the Seaborn Clustermap of the regression slope for three different 'species' measures. As I described in my previous post, these regression analyses utilized not only the energy totals for the 4 species included in the EGT 'game', but several products of those energy totals as well. In Figure 1,

1. the Dove Total (DT),

2. Owl Total - Hawk Total + Dove Total (OT-HT+DT)

3. Owl Total - Hawk Total (OT-HT)

were used. Obviously, regression slopes calculated using these energy totals would have a significant amount of auto-correlation. The Dove Total (DT) is included in items 1 & 2, while OT-HT is included in items 2 & 3. And that's the point. It's the degree to which these values auto-correlate that produces the Clustermap seen in Figure 1.

Figure 1. Seaborn Clustermap, Dove, OT-HT+DT, & OT-HT, Regression slope

I've also included a second Clustermap, Figure 2, with each of the three underlying energy totals included, but without any of the calculated values - so no purposeful auto-correlation (or co-linearity, if you prefer). As you can see, the order of the years as columns is similar, but still substantially different.


Figure 2. Seaborn Clustermap, Dove, Owl, & Hawk, Regression slope
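For readers who want to try this themselves, here is a minimal sketch of the general workflow: build a year-by-year matrix of regression slopes, hand it to Seaborn's clustermap, and read back the clustered column order. The energy totals below are random placeholders, not the EGT simulation output, and the way each cell is computed (the row year's three measures regressed on the column year's three measures) is only an assumption made for illustration. The linkage method and distance metric are likewise my own choices here.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
years = list(range(2005, 2020))

# Placeholder per-year energy totals (NOT the real simulation output).
totals = pd.DataFrame(
    {
        "DT": rng.uniform(50, 150, len(years)),
        "OT": rng.uniform(200, 400, len(years)),
        "HT": rng.uniform(5, 80, len(years)),
    },
    index=years,
)

# The three purposely overlapping measures used for Figure 1.
measures = pd.DataFrame(
    {
        "DT": totals["DT"],
        "OT-HT+DT": totals["OT"] - totals["HT"] + totals["DT"],
        "OT-HT": totals["OT"] - totals["HT"],
    }
)


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]


# Hypothetical year-by-year matrix: each cell is the slope of the row year's
# measures regressed on the column year's measures.
slopes = pd.DataFrame(
    [
        [slope(measures.loc[c].to_numpy(), measures.loc[r].to_numpy()) for c in years]
        for r in years
    ],
    index=years,
    columns=years,
)

# Hierarchically cluster rows and columns and draw the heatmap.
grid = sns.clustermap(slopes, method="average", metric="euclidean", cmap="coolwarm")

# Column order after clustering, i.e. the order of years along the x-axis.
column_order = [years[i] for i in grid.dendrogram_col.reordered_ind]
print(column_order)
```

Swapping the three overlapping measures for the plain Dove, Owl, & Hawk totals and comparing the two column orders is one way to see how much the purposeful auto-correlation drives the grouping.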

The Index/Rank combination that coincides with Figure 1 is Index 10, Ranks 51-61. Of primary importance to Figure 1 is the large block of bright red on the left side that runs from column 2005 - column 2013. The three years producing these bright red, high correlations are rows 2010, 2017, & 2019, which are all years with a large number of Hawks. Tournaments 2005 - 2013 all have fewer than 10 Hawks, a count that drops as low as 3 in 2013. Within the regression results derived from the EGT simulations, the combination of these years produces slopes of at least 2 - so very steep. What is happening is that, since these are all binary regressions of quantities that are dependent on one another (i.e., DT is a part of both the independent & dependent variables), the third quantity is actually growing with, building on, the skew associated with including the independent variable in the dependent variable. The slope is large & positive because the Hawk Total is small (i.e. when HT << OT then OT-HT -> OT).
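The effect of putting the same quantity on both sides of a regression can be checked with a few lines of Python. The sketch below is only an illustration of the algebra, not the regression I actually ran: all numbers are made up, and the assumption that the Owl Total rises with the Dove Total is purely hypothetical. It shows that regressing (OT-HT+DT) on DT gives exactly the slope of (OT-HT) on DT plus 1, and that when HT is tiny compared to OT, the slope for OT-HT is nearly the slope for OT alone.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(1)

# Made-up energy totals (assumptions for illustration only).
dt = [random.uniform(50, 150) for _ in range(200)]   # Dove Total
ot = [2.0 * d + random.gauss(0, 10) for d in dt]     # Owl Total, assumed to track DT
ht = [random.uniform(0, 3) for _ in dt]              # Hawk Total, small: HT << OT

ot_minus_ht = [o - h for o, h in zip(ot, ht)]
combined = [v + d for v, d in zip(ot_minus_ht, dt)]  # OT - HT + DT

slope_ot = ols_slope(dt, ot)                # OT on DT
slope_plain = ols_slope(dt, ot_minus_ht)    # (OT - HT) on DT
slope_combined = ols_slope(dt, combined)    # (OT - HT + DT) on DT

print(slope_ot, slope_plain, slope_combined)
```

Because DT appears on both sides, the combined slope is always the plain slope plus 1, whatever the data look like.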

From the standpoint of identifying groups of tournaments within Figure 1:

Cluster A. Starts with 2013 & ends at 2016 - Effect is the strongest.

Cluster B. Starts with 2010 & ends at 2019.

Cluster C. Includes everything else - 2005 to 2008 & 2015 to 2018.

Comparing Cluster A to Cluster B:

1. Seed T-Test: p < 0.71

2. Wins T-Test: p < 0.21
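For anyone wanting to run the same kind of comparison, here is a minimal sketch of a two-sample t statistic. It assumes the unequal-variance (Welch) form, which may not be the exact variant behind the p-values above, and the win counts are placeholders rather than the real cluster data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Squared standard error of each sample mean.
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Placeholder tournament win counts (NOT the real Cluster A / Cluster B teams).
wins_a = [0, 1, 2, 4, 5]
wins_b = [0, 0, 1, 1, 2, 3]

t_stat, dof = welch_t(wins_a, wins_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Turning the statistic into a p-value requires the t distribution; `scipy.stats.ttest_ind` with `equal_var=False` does both steps in one call.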

There is no significant difference between Cluster A & B, whether based on Seed or Tournament Wins. However, the Odds Ratios of finding a Final Four team within Index 10, Ranks 51-61 are significantly different:

Odds Ratio, Final Four Teams, Cluster A to Cluster B:

1. Cluster A: OR = 5.02, Range 1.97 - 12.77, Z-Score: 3.382, p = 0.0007

2. Cluster B: OR = 1.01, Range 0.21 - 4.86, Z-Score: .023, p = 0.982
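The z-scores and ranges above appear consistent with the usual way of computing an odds ratio from a 2x2 table, with a 95% interval built on the log scale. Here is a minimal sketch of that standard calculation. The counts passed in at the bottom are hypothetical and chosen only to exercise the function; they are not the counts behind the Cluster A or Cluster B results.

```python
import math


def odds_ratio(a, b, c, d, z_crit=1.96):
    """Odds ratio for a 2x2 table with a Wald interval on the log scale.

    a, b = Final Four / non-Final Four teams inside the Index/Rank range
    c, d = Final Four / non-Final Four teams outside the range
    """
    ratio = (a * d) / (b * c)
    log_ratio = math.log(ratio)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(log_ratio - z_crit * se)
    high = math.exp(log_ratio + z_crit * se)
    z = log_ratio / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return ratio, low, high, z, p


# Hypothetical counts, for illustration only.
ratio, low, high, z, p = odds_ratio(10, 20, 5, 50)
print(f"OR = {ratio:.2f}, Range {low:.2f} - {high:.2f}, Z-Score: {z:.3f}, p = {p:.4f}")
```

An interval that straddles 1, as in the Cluster B result, means the data cannot distinguish the odds inside the range from the odds outside it.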

So, in terms of overall wins, there is statistically no difference between Cluster A & B, but Cluster A is much more likely to include a Final Four Team within this range. This is because Cluster A includes 2013 Michigan, 2006 UCLA, 2011 Butler & UConn, 2014 Kentucky & UConn, 2006 LSU, & 2011 VCU (who also fell in the Index 16, Ranks 18-28 range). Interestingly, it also includes 2016 Louisville, who did not participate in the tournament. Given their location within the vector, it's highly probable that they would have done quite well had they been allowed to participate.

Comparing Cluster A & B to Cluster C:

1. Seed T-Test: p < 0.72

2. Wins T-Test: p < 0.05

The story changes, however, when you combine Clusters A & B and compare them to Cluster C, with the two together producing significantly stronger results than Cluster C. This is because Cluster B does include 2 Final Four teams, 2010 Michigan State & 2019 Auburn, as well as a bunch of teams who lost close games to the eventual champion or runner-up.

For example, in 2010, Baylor, Kansas State, Xavier, Tennessee, & San Diego State all fall within this range. San Diego State fell to Tennessee in a very, very close game in the first round; Tennessee then fell to Michigan State in the Elite 8. Baylor fell to eventual champion Duke. Xavier fell to Kansas State in overtime in the Sweet 16; & Kansas State then fell to Butler in the Elite 8.

Likewise, the Butler team who was leading eventual champion North Carolina late in the game in the 2017 Sweet 16 also falls in this range. And Purdue from this last season, who fell to Virginia in overtime before Virginia beat Auburn, by one point, is also included in the range. Thus, the significant Odds Ratio for finding a Final Four team falls in a narrow, 5-year window where the auto-correlation among the three regression terms included in Figure 1 makes it possible for them to sneak into the Final Four.

Another significant tournament-wide phenomenon that may be related to Evolutionary Game Theory involves years where a Hawk wins the championship. As I pointed out in my last post, Hawks have won the tournament 6 times in the last 15 years. This is despite the fact that nearly half (14 of 30) of the teams who have played in the final were Hawks. In fact, in 11 of the 15 tournaments, a Hawk has played an Owl in the championship game, something that is significant at p < 0.000001.

The other four tournaments have involved teams of the same species (2005 & 2012 - Owls; 2010 & 2019 - Hawks). Thus, of the 6 Hawks who have won the championship, only 4 of them actually beat an Owl (2018, 2016, 2013, & 2006). Of course, as I pointed out in the Lotka-Volterra posts as well as the post defining the EGT species, the network is set up to support Owls. They are, energetically speaking, dominant over all of the other species.

It turns out that the Seaborn Clustermap tool can be used to group the 6 Hawk champions together in the same way Figure 1 identifies the years where Index 10, Ranks 51-61 are especially strong. Figure 3 is the Clustermap that groups these Hawk champion years together, and it involves looking at the regression slope of:

1. Owl Total: OT

2. Dove-Owl Total: DOT

3. Owl Total - Dove Total: OT - DOT

Of course, these are all the species who utilize sharing of energy as their Evolutionary Game Theory strategy. The only energy total missing in some form from Figure 3 is the Hawk Total. Which means that, at least from the perspective of this Clustermap, the increased strength of Hawks in these six years is most likely a result of how these three species 'share' energy in the simulations.

Figure 3. Seaborn Clustermap, Dove-Owl, Owl, & OT-DT, Regression slope

As you can see in Figure 3, the 6 Hawk champions are in consecutive columns, beginning at 2019 & ending with 2018. Of note is the fact that 2008, when Memphis (Hawk) took Kansas (Owl) to overtime before falling, is right next to 2018. Also, the two years in which 2 Hawks made the final (2010 & 2019) are right next to each other.

Because a Hawk champion represents a tournament-wide phenomenon, there is no need to consider the relative seed distribution of each group (Hawk Champion years vs. Owl Champion Years). However, the Poisson probability of grouping all 6 Hawk Champions together is p < 0.05, & p < 0.002 if you consider the 2008 championship game part of the cluster.

That will do it for this post. As you can see, the EGT simulation data can be used in conjunction with standard Python (Seaborn) tools to identify & help understand the processes behind system-wide (tournament level) phenomena.


#KeyPlayer #EvolutionaryGameTheory #MensCollegeBasketball #NCAA #NetworkAnalysis #Seaborn #NCAATournament #MonteCarloSimulations
