by Jess Behrens
© 2005-2018 Jess Behrens, All Rights Reserved
"Similarity breeds connection. This principle—the homophily principle—structures network ties of every type, including marriage, friendship, work, advice, support, information transfer, exchange, comembership, and other types of relationship."
-Abstract, McPherson et al. 2001
One of the primary forces, if not THE primary force, driving network growth & analysis is the homophily principle. The above quote, which I pulled from the abstract of the McPherson et al. article listed below, describes homophily better than I ever could. It is one of the underlying 'facts' of nature. You can find a version of it in every field, including Geography (my academic home, if you will), where it is nothing more than a restatement of Tobler's First Law of Geography. In fact, many of the techniques described throughout these posts about basketball can be applied to geography, specifically to finding regions of maps that are not random. My hope is that these methods can be applied to planning & other population level analyses.
As such, homophily's role in the development of the NCAA Tournament networks I have put together must be examined. Effectively, given how I define the 'queries' used in building both networks (Win & Loss), homophily IS what I'm measuring. Using two, sometimes more, indexes, I quantify ranges of teams that exhibit the same or similar results in terms of tournament wins & losses. The tendency to lose in the first round is built into the 'Loss' Network & the tendency to win at least one game into the 'Win' Network. The thousands of individual queries I've written are then combined to build the networks based on how the teams, which are the 'nodes' in both networks, overlap & wire themselves in systemic totality. Each query is effectively a measure of homophily, saying, "the teams in this query are the same and they are linked together with this strength."
I then divide the teams into Evolutionary Game Theoretic (EGT) 'Hawk/Dove/Owl' game strategies using the 'Key Player' metric & other network centralities with an eye toward their risk of winning or losing in the first round. While I do say that a first round win/loss is considered, I do not consider it explicitly when grouping the teams into their strategies. Instead, I divide both networks into 5 ranges based on each team's relationship to the yearly average betweenness centrality weighted Key Player metric (reference below). These risk ranges are then used to define the different strategy types. Tournament success or failure is never explicitly considered, but is noted. Note: I also add another strategy type to this game, the 'Dove-Owl', which behaves like an Owl when facing a Dove & like a Dove when facing either a Hawk or an Owl. As in the EGT game, the Hawk strategy incurs an additional cost term that the other strategies do not. This cost, effectively, places the Hawks at greater risk. As such, the teams are divided into their categories based on the following breakdown of risk:
Dove - greatest risk of loss in the first round;
Dove-Owl - second greatest risk of loss in the first round;
Owl - least risk of a loss in the first round;
Hawk - greater risk of a first round loss, but still very likely to win at least one game.
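The Dove-Owl rule described above can be written down as a small lookup. This is only a sketch of the rule as stated in this post; the names `Strategy` and `dove_owl_plays_as` are hypothetical, and the Dove-Owl vs. Dove-Owl matchup is left undefined because the post does not specify it.

```python
from enum import Enum


class Strategy(Enum):
    DOVE = "Dove"
    DOVE_OWL = "Dove-Owl"
    OWL = "Owl"
    HAWK = "Hawk"


def dove_owl_plays_as(opponent: Strategy) -> Strategy:
    """Return the pure strategy a Dove-Owl imitates against `opponent`."""
    if opponent is Strategy.DOVE:
        # Facing a Dove, the Dove-Owl behaves like an Owl.
        return Strategy.OWL
    if opponent in (Strategy.HAWK, Strategy.OWL):
        # Facing a Hawk or an Owl, the Dove-Owl behaves like a Dove.
        return Strategy.DOVE
    # Assumption: the post does not say what happens against another Dove-Owl.
    raise ValueError("Dove-Owl vs. Dove-Owl is not specified")
```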
As with all other attempts at understanding & predicting tournaments, these strategies must be compared against the one and only absolute thing you ever really know about a team in any tournament: its seed. Seeding not only determines who plays whom in the first round, it also represents a partially objective 'label' that most definitely impacts team psychology. Any attempt at tournament prediction, or analysis of the factors affecting success or failure, must be measured relative to the role of seeding.
Thus, a fantastic question that needs to be answered is: does network homophily increase or decrease when teams are grouped based on the Evolutionary Game Theory Strategy vs. their Seed? Is homophily significantly high or low within the system? How does homophily in the Win network compare to the Loss network?
While this post does not provide complete answers to those questions, the results I present here do point us in the direction of concluding that Evolutionary Game Theory is a better descriptor of team identity than seed. Figures 1 & 2, below, show Out vs. In triads & links for both networks. Figure 1 depicts homophily based on EGT Strategy & Figure 2 on tournament seeding.
Table 1. Evolutionary Game Theory Strategy by Seed
The ratios depicted in Figures 1 & 2 were taken from work by Chandrasekhar & Jackson (cited below), and show the ratio of Out 'caste' triads or links / In 'caste' triads or links. In the article listed below, which examines rural Indian villages, Chandrasekhar & Jackson hypothesize that out caste link formation will grow faster than out caste triad formation among previously homophilous links. They theorize that this is due to the presence of a social pressure, or stigma, being associated with out caste relationships.
The idea is that two members of the same caste with the same friend from a different caste (a triad) may face a social stigma whereas simply having a link with a member of another caste will not. A link to a member of another caste can be kept secret, while a mutually shared link, a triad, cannot. Thus, the explicit assumption here is that NCAA tournament teams are exposed to a similar sort of pressure, and that their season, as encapsulated by the two networks, provides association with a 'caste' that exists beyond their seeding.
This is in spite of the fact that all of the teams consciously tell themselves that they belong to the dominant caste; their beliefs about how good they are and what they are going to do (win a championship) don't vary by team or player. Having played as much basketball as I have, I can speak from experience on this: you go out there to win and invariably tell yourself that's what you're going to do. However, it's the unknown reality of where the team actually sits within each network, which archetype they are most like, that determines how well they actually do rather than any individual predisposition or earnest decision to 'play well'.
Thus, this post is intended to examine my dual network method of analyzing the NCAA tournament for the explicit purpose of understanding whether or not the teams form links & triads in a non-random manner that is, essentially, hidden from the players themselves (exists beyond seeding, which the players know). It also seeks to see if the method of link/triad formation is commensurate with the method described by Chandrasekhar & Jackson. Further, it tests to see if the manner in which subgraphs form among teams is more effectively described by strategy (the factor that is outside of the teams' knowledge and is beyond seeding) than by seed.
Thus, the results presented here examine the subgraph-based model for network formation presented by Chandrasekhar & Jackson, in which larger graphs are composed of smaller graphs (i.e., dyads & triads) linked together. Figure 1 depicts the triad ratio on the y-axis and the link ratio on the x-axis. To be consistent with the method outlined in Chandrasekhar & Jackson, the link ratio has been adjusted, using the natural log, to the 3/2 power. This compensates for the fact that it is much easier to form a link than a triad when the process is random.
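To make the Out/In ratios concrete, here is a minimal sketch on a toy six-team graph. It is not the code behind the figures. It assumes an in-caste triad is a triangle whose three teams all share a caste and that any other triangle is out-caste, and it reads the 3/2 adjustment as ln(link_ratio ** 1.5); both readings are my assumptions, and the function names, teams, and castes are hypothetical. The brute-force triangle search is fine for a toy graph but slow for hundreds of nodes.

```python
from itertools import combinations
import math


def out_in_ratios(edges, caste):
    """Return (link_ratio, triad_ratio), each computed as out-caste / in-caste."""
    # Build an undirected adjacency map; duplicate edges collapse in the sets.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    in_links = out_links = 0
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            if caste[u] == caste[v]:
                in_links += 1
            else:
                out_links += 1

    in_triads = out_triads = 0
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:  # closed triangle
            if caste[a] == caste[b] == caste[c]:
                in_triads += 1
            else:
                out_triads += 1

    # Raises ZeroDivisionError if there are no in-caste links or triads.
    return out_links / in_links, out_triads / in_triads


def adjusted_log_link_ratio(link_ratio):
    """Assumed reading of the adjustment: ln(link_ratio ** 1.5)."""
    return 1.5 * math.log(link_ratio)


# Hypothetical toy data: two castes of three teams each.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D"), ("C", "E")]
caste = {"A": "Dove", "B": "Dove", "C": "Dove",
         "D": "Owl", "E": "Owl", "F": "Owl"}

link_ratio, triad_ratio = out_in_ratios(edges, caste)
x = adjusted_log_link_ratio(link_ratio)  # adjusted link ratio (x-axis)
y = math.log(triad_ratio)                # log triad ratio (y-axis)
```

In this toy graph there are 2 out-caste and 6 in-caste links, and 1 out-caste and 2 in-caste triads, so the point (x, y) lands above the one to one line.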
Figure 1. Out vs. In Connections by Evolutionary Game Theory Strategy
For the purposes of this analysis, 'caste' is defined as a team's EGT strategy in Figure 1, while tournament seed represents caste in Figure 2. Again, to be consistent with Chandrasekhar & Jackson, only triads formed from homophilous links are considered. A one to one line is shown as well. This is because the process should be a coin flip, or Poisson process. Thus, the number of ratio points above and below the 1 to 1 should approximate a 50/50 split if the networks are forming randomly.
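One simple way to check the 50/50 expectation is an exact two-sided sign test on the counts of points above and below the one to one line. This is a sketch of a standard test, not necessarily the one used to produce the results in this post; `sign_test_p` is a hypothetical name and the counts in the usage lines are made up.

```python
from math import comb


def sign_test_p(above, below):
    """Exact two-sided p-value for a fair coin given counts above/below the line."""
    n = above + below
    k = max(above, below)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
p_balanced = sign_test_p(5, 5)    # an even split is fully consistent with 50/50
p_lopsided = sign_test_p(10, 0)   # every point on one side of the line
```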
While I'm considering seed & strategy separately, they are simply different classifications of the same teams. Table 1 shows the breakdown of Evolutionary Game Theory Strategy by tournament seed. The distribution is highly significant (assumes equal distribution by strategy - i.e. 25% for each) with only the grayed out cells not significant at p < 0.1 or better. If you're wondering who the three 'Dove-Owl' 1 seeds are, they are 2010 Kansas, 2013 Gonzaga, & 2018 Virginia; 2016 Michigan State & 2013 Georgetown are the 2-seeded Doves. Thus, all of the teams in this dataset, which includes all tournament teams from 2005-2018 (14 tournaments), have been assigned a strategy independent of their seed.
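As an illustration of how one cell of Table 1 could be checked against the 25% baseline, the sketch below runs a one-sided binomial test on the three Dove-Owl 1 seeds. The post does not say which test was used for Table 1, so the binomial test is an assumption, and `binom_cdf` is a hypothetical helper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


n_one_seeds = 14 * 4      # 14 tournaments, four 1 seeds in each
observed_dove_owls = 3    # 2010 Kansas, 2013 Gonzaga, 2018 Virginia
baseline = 0.25           # equal share for each of the four strategies

# Chance of seeing 3 or fewer Dove-Owls among the 1 seeds under the baseline.
p_lower = binom_cdf(observed_dove_owls, n_one_seeds, baseline)
```

Under the equal-share baseline, 14 of the 56 teams would be expected in each strategy, so 3 is far in the lower tail.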
I've added Figure 3a, which includes query results from the Loss network, to illustrate how all of this works. The point I'm trying to make is that in order to determine Game Theory Strategy, a team's entire set of connections must be considered. As you can see, 2015 Duke & 2007 Southern Illinois are paired with 2018 Virginia, 2012 Missouri, 2016 Georgetown, 2010 Kansas, etc. in this query. The difference in results for these teams comes down to their entire profile in both the Win & Loss Network. The difference between losing in the first round & winning the entire thing can be very, very small.
Figure 3a. Loss Query w/Highly Seeded Doves & Dove-Owls
Moving on, Figures 1 & 2 simply show the link/triad rate when it has been aggregated separately for these two team 'type' designations. For seeds 1-10 & 15, there are 56 teams in each seed (14 tournaments * 4 seeds each tournament = 56 teams). There are more teams seeded 11, 12, 13, 14, & 16 because of the move to 68 teams in 2011.
Figure 2. Out vs. In Connections by Tournament Seed
If the process of homophilous link/triad formation approaches 50/50, then the Loss Network in Figure 1 appears to be pretty close to random, a fact that is mirrored in the loss network line in Figure 2. Appearances can be deceiving, however. As Figure 3 shows, only the Owls fall within their 95% confidence interval (CI). Figure 3 includes the color coded CI for all 4 strategies: black for doves; grey for dove-owls; blue for owls; & green for hawks. The CI for doves is so small that it almost can't be seen in Figure 3. Thus, 3 of the 4 strategies in the loss network are significantly different from random or 50/50, with 2 (hawk & dove-owl) occurring below the line & 1 above it (dove).
Figures 1 & 2 are borne out by some of the basic statistics for both the Win & Loss Networks. As of December 2018, the loss network is composed of 3161 dual index queries & 1 tri-index query, while the Win network is made up of 3561 dual index & 4 tri-index queries. All 934 teams from 2005-2018 are represented in both networks, making the Metcalfe number, 435,711, identical in both networks. I mention the date here because constructing the network, growing the sample size, is an ongoing project. The whole point of calculating the network weights is to be able to quantify the relative value of ongoing observations of teams vis a vis their positions within the network. The query list is, more or less, a historical record of these observations.
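The team count and the Metcalfe number above can be reproduced with a few lines of arithmetic. The sketch assumes the Metcalfe number here means the number of possible unordered pairs, n(n-1)/2, and that the field was 65 teams from 2005-2010 (a figure not stated in this post) before the move to 68 in 2011.

```python
# Field size per tournament: 65 teams for 2005-2010 (assumption, not stated
# in the post), 68 teams for 2011-2018.
fields = [65] * 6 + [68] * 8

teams = sum(fields)

# Number of possible unordered pairs of teams, n(n-1)/2.
possible_links = teams * (teams - 1) // 2

print(teams, possible_links)
```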
Despite having nearly 400 more queries, the win network is thinner, having a graph density of 0.251 from 110,739 total unique links compared to a graph density of 0.453 from 197,479 unique links in the Loss network. The global clustering coefficients & average path lengths also follow this same pattern; the win network has values of 0.56 & 1.768, respectively, while the loss network comes in at 0.722 & 1.549. Thus, it's no wonder that the Doves, who are most at home in the Loss network, are located above the 1 to 1 out- vs. in-caste line in Figure 1: the loss network is denser and has more triads. Likewise, the less dense Win network, constructed from more queries that result in fewer links, bears out the heavily non-random preference for out caste links in that network, a fact that is apparent in Figures 1 & 2.
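Graph density for an undirected network is the number of unique links divided by the number of possible pairs. The sketch below applies that plain ratio to the link counts quoted above; `graph_density` is a hypothetical helper, and the software used for the reported figures may count links slightly differently. The plain ratio reproduces the Loss figure to three decimals and comes out close to the Win figure.

```python
def graph_density(unique_links, nodes):
    """Undirected density: links divided by the n(n-1)/2 possible pairs."""
    possible = nodes * (nodes - 1) / 2
    return unique_links / possible


teams = 934
win_density = graph_density(110_739, teams)
loss_density = graph_density(197_479, teams)

print(f"win={win_density:.3f} loss={loss_density:.3f}")
```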
Figure 3. Homophily by Strategy, Triads from Homophilous Links, Loss Network w/Error Bars
Moving on to an examination of significance, a version of Figure 3 for homophily by seed was not included because all of the Loss Network seeds in Figure 2 fall within their 95% confidence interval. The difference in significance clearly shows that seeding is an inferior tool for understanding subgraph formation in the NCAA Tournament. As further support for the claim that strategy provides improved measures of homophily and subgraph formation, Figure 4, which shows the coefficient of variation for link/triad formation by strategy in the loss network, also shows dramatic improvement over seeding. Figure 4 provides a window into what classification by evolutionary game theory type is actually doing: reclassifying teams according to their network position, much of which has to do with the variation in how they form links & triads. Essentially, seeds are being split. Although not the main consideration in classification, teams that have inconsistent link/triad formation are grouped & moved to a very high coefficient of variation (Hawks & Owls), while those with consistent formation are moved to a very low coefficient of variation (Doves & Dove-Owls).
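For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below uses the population standard deviation, which is an assumption since the post does not say which form was used, and the two lists of ratios are made up purely to show the contrast between consistent and inconsistent link/triad formation.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / m


# Hypothetical out/in ratios for two groups of teams, for illustration only.
consistent = [0.9, 1.0, 1.1, 1.0]
inconsistent = [0.2, 1.8, 0.5, 1.5]

cv_low = coefficient_of_variation(consistent)
cv_high = coefficient_of_variation(inconsistent)
```

Both lists have the same mean, so the difference between `cv_low` and `cv_high` comes entirely from how spread out the ratios are.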
Figure 4. Coefficient of Variation by Seed & Strategy, Loss Network
The win network, by contrast, very clearly has a non-random & significant (p < 0.001) preference for the creation of non-homophilous links over non-homophilous triads by both strategy and seed. As expected, Figure 1 shows that, by strategy, the doves & dove-owls have the highest growth of out-caste links & triads in the win network. Of the four strategies, doves & dove-owls are also the weakest in terms of tournament success, which should show up as a greater dependence on out-caste links & triads. In effect, this is saying what we know: doves & dove-owls are not comfortable or at home in the win network. Figure 5 supports this assertion, showing that the coefficient of variation for doves & dove-owls in the win network is the highest of the four strategy types. As with Figure 4, the coefficient of variation by seed has been split and grouped into strategies with very high levels (doves & dove-owls) and very low levels (hawks & owls). Furthermore, as expected, this split is a mirror image of the same process found in the loss network.
Figure 5. Coefficient of Variation by Seed & Strategy, Win Network
Figures 2-4 present a compelling visual case, but what do the numbers actually say? Based on 2 lectures I found online (referenced below), I created Tables 2 & 3, which attempt to show the significance of homophilous link formation independent of triads, in both networks, by strategy and seed. I include this approach here because it is an accepted model for measuring homophily in networks, even though it is not as robust as that presented by Chandrasekhar & Jackson. Table 2 & Table 3 again use the existing percentage of teams by type (strategy or seed, respectively) as p for estimating the expected Mean & Variance. According to the lecture by Feldman (cited below), these two values are estimated based on the number of independent parameters, or 'independent trials'. So, for the lecture by Donglei, who considered the binary classification of 'male' or 'female', the number of 'independent trials' is 2 (one for male; one for female), which is why the 3rd column in both Tables 2 & 3 reads 2p(1-p). However, in the case of strategy & seed, the '2' in that column will vary. So, for the portion of Table 2 that looks at all four strategies, '2' becomes '4', because there are 4 states, or 'trials', that the variable can take. Below the top row, there are three 'trials', because Owls & Hawks have been grouped together. There are two for the remaining rows in Table 2 & 16 independent trials (for the 16 seeds in the tournament) for Table 3.
Table 2. Out vs. In Link Homophily by EGT Strategy & Network vs. Expected
Table 3. Out vs. In Link Homophily by Tournament Seed & Network vs. Expected
The 95% confidence interval crosses zero in Table 2 for some of the strategies, enabling only a one way test for significance. However, in most cases, the standard deviation sits well enough above zero to enable a two way test of significance. Also, in Table 3, the Standard Deviation is not reported for the 11 or the 16 seeds. This is because, since 2011, there have been 3 additional teams in the tournament. This has resulted in enough teams at those seeds to push the expected mean above 1, resulting in a negative value for the variance. While the equation is not posted in the table, the variance is calculated as:
Variance = 16p(1-p) * (1 - 16p(1-p))
If there are 4 independent states, the 16 in the above equation becomes 4; if there are 2, it becomes 2; and so on. If you sum all of the percentages in the p column, they do indeed sum to 100%.
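The expected mean and variance described above can be sketched as follows, with k standing for the number of states (2, 3, 4, or 16). This follows the formulas as written in this post; the function names are hypothetical, and the z-score form (observed minus expected, divided by the standard deviation) is the usual definition and is assumed rather than taken from the tables. The share p = 56/934 comes from the team counts above, while p = 0.08 is a made-up value used only to show the variance turning negative.

```python
from math import sqrt


def expected_mean(p, k):
    """Expected value as used in the tables: k * p * (1 - p)."""
    return k * p * (1 - p)


def variance(p, k):
    """Variance as written in the post: mean * (1 - mean)."""
    m = expected_mean(p, k)
    return m * (1 - m)


def std_dev(p, k):
    """Standard deviation, or None when the variance is negative (mean > 1)."""
    v = variance(p, k)
    return sqrt(v) if v >= 0 else None


def z_score(observed, p, k):
    """Assumed form: (observed - expected) / SD; None if the SD is undefined."""
    sd = std_dev(p, k)
    if not sd:
        return None
    return (observed - expected_mean(p, k)) / sd


p_typical_seed = 56 / 934   # a seed with 56 teams out of 934
p_larger_seed = 0.08        # hypothetical share for a seed with extra teams

sd_typical = std_dev(p_typical_seed, 16)   # defined: the mean stays below 1
sd_larger = std_dev(p_larger_seed, 16)     # None: the mean exceeds 1
```

With 16 states, the expected mean crosses 1 once p(1-p) exceeds 1/16, which is why a modest increase in the number of teams at a seed is enough to leave the standard deviation undefined.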
Table 2 shows that only Doves, when all 4 strategies are considered, have significantly high levels of homophily. Furthermore, this significance carries through when you subtract the Z-score for Doves in the Win network from the Z-score for the Loss network, which reflects their preference for the Loss network as a strategy. Beyond the Doves, only the Dove-Owls record a Z-score greater than 1 in either network. What this reflects is that it is more important for teams in the tournament to be associated with 'strong win probability', to have the right EGT strategy, than it is to have a high level of seed cohesion.
Finally, Table 3 clearly shows an absence of homophily by seed in the tournament, which may be one of the reasons people who predict the tournament based on seed produce irregular success rates. Every single seed has 'normal' homophily; in fact, the Z-scores are almost perfectly random (near zero). It seems that seeding really doesn't say much about identity in the NCAA Tournament.
While more work needs to be done to verify these patterns, these results do point in the direction that using the network to move from Seed to Evolutionary Game Theory Strategy improves the underlying fundamental property of homophily. As for the questions asked earlier in this post:
Does homophily increase or decrease when teams are grouped based on the Evolutionary Game Theory Strategy vs. their Seed? It increases for all four strategies, as can be seen in the magnitude of the Z-Score for each, but is only significant for the Dove strategy within the Loss network.
Is homophily significantly high or low within the system? It's normal for all states, strategy & seed, except for Doves within the Loss network.
How does homophily in the Win network compare to the Loss network? The difference is muted, except for Doves within the Loss Network. Dove-Owls also have Z-Scores that are greater than 1, but are not significant.
McPherson, M, Smith-Lovin, L, and Cook, JM. 2001. "Birds of a Feather: Homophily in Social Networks."
Annual Review of Sociology, 27: 415-444.
Ballester, C, Calvo-Armengol, A, and Zenou, Y. 2006. "Who's who in Networks. Wanted: The Key Player,"
Econometrica, 74(5): 1403-1417.
Chandrasekhar, AG and Jackson, MO. 2016. "A Network Formation Model Based on Subgraphs."
Physics and Society: 1-41.
Donglei, D. "Lecture 6: Social Network Analysis: Homophily," University of New Brunswick College of
Business Administration, E3B 9Y2,
Feldman, J. 2000. "Mean and Variance of Binomial, Random Variables." University of British Columbia.