by Jess Behrens
© 2005-2018 Jess Behrens, All Rights Reserved
Back in Chapter 6, I talked about how the primary factor affecting who wins the NCAA Tournament is the linear nature of competition between Hawks & Owls. I've done a bit of work since I posted that, and I need to update it with new graphics.
The overall conclusions based on the Poisson charts I posted in Chapter 6 are still the same. However, using Seaborn & Matplotlib's Cluster Map tool, I discovered that two of the years need to be switched. Figure 1 shows a cluster map developed from a year by year pairwise regression analysis I did for all 14 tournaments. The process involved regressing Hawk & Owl total population fitness (energy) from the Monte Carlo simulations for each pair of tournaments (e.g. 2005 & 2006, 2005 & 2007, 2005 & 2008, etc.). This resulted in 182 (14 years by 13 years) separate sets of regression slopes, intercepts, r-values, & standard deviations.
Figure 1. Cluster Map, Hawk vs. Owl Fitness, Regression R-Value Year by Year Analysis
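The shape of that pairwise loop can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind Figure 1: the `fitness` dictionary is filled with random placeholder numbers, the assumption that each year is represented by a single list of simulated fitness values is a simplification, and `linear_regression` is a hypothetical helper written for this example.

```python
import itertools
import random


def linear_regression(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


years = list(range(2005, 2019))  # the 14 tournaments, 2005-2018

# PLACEHOLDER DATA ONLY: random numbers standing in for the Monte Carlo
# fitness values, which are not reproduced here.
rng = random.Random(0)
fitness = {year: [rng.gauss(0, 1) for _ in range(200)] for year in years}

# One regression per ordered pair of distinct years: 14 * 13 = 182.
results = {}
for year_one, year_two in itertools.permutations(years, 2):
    results[(year_one, year_two)] = linear_regression(
        fitness[year_one], fitness[year_two]
    )

print(len(results))  # 182
```

Using ordered pairs (2005 vs. 2006 and 2006 vs. 2005 counted separately) is what produces 182 rather than 91 regressions. The r-value is the same in both directions, but the slope and intercept generally are not.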
After a bit of exploratory analysis, I discovered that the r-value & slope had the strongest relationship to tournament champion & the linear model I talked about in Chapter 6. Don't be confused by Figure 1. Think of Figure 1 as a modified network, where the nodes are each of the 14 years in the tournament data set. This hypothetical network would have a link between each node that is weighted by the r-value from the regression for those two years. Year One (rows), then, is just one of the nodes linked to the second node, labeled Year Two (columns). Each square contains the r-value for the two years (row & column). Seaborn's Cluster Map application has rearranged the 14 tournament years based on the similarity given this joint r-value. The lines at the top and the side indicate increasingly fine clusters, of which there are 4 (1 has a subset of two smaller clusters).
If you look at the bottom row, from left to right, you'll see that the first 8 columns are identical to the linear years I outlined in Chapter 6 (2006, 2007, 2008, 2009, 2011, 2014, 2017, & 2018) save for one: 2007. It has been shifted over to the first 6 years running right to left & replaced by 2005. Those years, of course, match almost perfectly with the Less Linear years I talk about in Chapter 6 (2005, 2010, 2012, 2013, 2015, & 2016) save for 2005 which, of course, has been shifted to the linear group. Figure 2 shows the competition plot for the new set of linear years (2006, 2005, 2008, 2009, 2011, 2014, 2017, & 2018) & Figure 3 the same for the new set of Less Linear years (2007, 2010, 2012, 2013, 2015, & 2016).
Figure 2. Hawks vs. Owl, Cluster Map Linear Years
Figure 3. Hawks vs. Owl, Cluster Map Less Linear Years
I've included 2014 in the linear group, despite its being grouped with the Less Linear cluster in Figure 1 (far right lines on the top of the plot), because of its effect on the competition plot when a regression is performed on all the linear years lumped together. Figure 1 only shows results of regressions performed on pairs of years. When you include 2014 in the Linear Years group, it improves the r-value and standard deviation, thus reducing the stochasticity. That makes 2014 more likely to be a linear rather than a less linear year. When you run the two groups this way, Tables 2 & 3 in Chapter 6, which show the Poisson significance of tournament champions by species type in linear & less linear years, remain exactly the same.
While Figure 2 is almost identical to the linear years competition plot from Chapter 6, Figure 3 is significantly different. The distribution in Figure 3 is more stochastic than in the Less Linear plot shown in Chapter 6, & you can see that in the Pearson correlation (0.55 in Chapter 6; 0.45 in Figure 3). Obviously, the cluster map is a useful tool! The results that came out of it have reaffirmed the patterns discussed in Chapter 6.
One of the most striking aspects of the cluster map seen in Figure 1 is the line of red squares. These red squares occur when a tournament year, say 2005's column, matches up with its own row. Obviously, the correlation between something and itself is always 1, so that square will always be bright red. As you move from upper left to lower right, you'll see that they begin to break from a 1 to 1 linear relationship in the last 5 or 6 column/row combinations. This is due to the increasingly stochastic nature of the fitness relationship between Hawks & Owls. As I talked about extensively in Chapter 6, high levels of stochasticity co-occur with a correspondingly high number of upsets among 1-3 seeds. Switching 2005 to linear and 2007 to less linear, however, has reduced this significance. The Poisson value for the 1-3 seed losses in the newly updated less linear years rises from 0.02 to 0.053, just barely missing significance at the 95% level.
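For readers unfamiliar with how a Poisson value of this kind is read, here is a minimal sketch. It assumes an upper-tail test, that is, the probability of seeing at least the observed number of upsets given an expected rate. That choice, the helper name `poisson_upper_tail`, and the input numbers are all assumptions made for illustration; they are not the inputs behind the 0.02 or 0.053 figures above.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# HYPOTHETICAL inputs, for illustration only.
expected_upsets = 1.0   # assumed long-run average count
observed_upsets = 4     # assumed observed count

p_value = poisson_upper_tail(observed_upsets, expected_upsets)
print(p_value < 0.05)  # True: 4 or more events is unlikely when 1 is expected
```

A value below 0.05 is what "significant at the 95% level" means here, which is why 0.053 just misses the cutoff.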
While Figure 1 can be used to reconfigure the linear & less linear tournament years, Figure 4 is more directly related to the number of major upsets (Seeds 1-3) that occur in the first round. Figure 4, a cluster map developed using the same method that produced Figure 1 but showing the regression slope rather than the regression r-value, makes this relationship between upsets and Less Linear years even clearer.
Figure 4. Cluster Map, Hawk vs. Owl Fitness, Regression Slope Year by Year Analysis
As you can see, the line of red squares no longer breaks apart. But now the final 6 years all include at least one 1-3 seed loss (2014, 2016, 2010, 2012, 2013, & 2015) and the final 5 have two each (if you count the overtime win by Villanova as a loss, as I did in Chapter 6). The significance level is p<0.05 for these final 6 years & p<0.02 for the final 5 years. Figure 4 also groups the 4 un-separated years (lower left corner: 2006, 2017, 2009, 2018), making it clear once and for all that something different is happening in those years as well.
What all of this means is that when the relationship between the Hawks & Owls, as seen in the regression, is more certain, there are fewer upsets among the top seeds in the tournament. As Figure 4 shows, the blues lighten as you move from the upper left to the lower right, which is an indication of a reduction in or a flattening out of the regression slope. Seaborn transforms the slope values so that they fall between 0 & 1. As we know from Figures 2 & 3, the actual relationship/slope of a Hawk/Owl regression is negative (upper left to lower right) because the two species are competing. And to the extent that the relationship is linear, Hawks & Owls cannot have high fitness at the same time. Thus, values that are dark blue are in fact negative while those that approach a light blue are closer to 0. In other words, a scaled value of 0.5 is in fact a slope of near 0. While more detailed analysis is needed to say for sure, it is likely that the uncertainty in the relationship between these two groups is the primary driver behind these major upsets. Since the regression results shown in Figures 1-4 are derived from Monte Carlo simulations which only consider the relative percentage of each species type within the total tournament population in a given year, the implication is that major upsets are not exclusively a function of the game in which the upset occurs. Instead, these findings suggest that the entire season's results, as manifested in the teams invited to the tournament, actually destabilize the most highly seeded teams in the tournament. While that may seem heretical, based on my experience playing basketball, it very much makes sense.
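As a footnote on the 0-to-1 color scale discussed above, the sketch below shows why a scaled value of 0.5 can correspond to a slope near 0. It assumes a simple min-max rescaling over slopes that run from -1 to 1. Both the helper `min_max_scale` and the slope values are hypothetical, chosen for illustration rather than taken from Figure 4.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# HYPOTHETICAL slopes, for illustration only.
slopes = [-1.0, -0.5, 0.0, 1.0]
scaled = min_max_scale(slopes)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Under that assumption, negative slopes land below 0.5 and a flat relationship lands at the midpoint of the scale.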