#PSBigData Better than Gold: Unlike gold, big (basketball) data can be mined repeatedly by multiple methods

Where are we? What are we going to do?

During the 1880s and 1890s, Francis Galton collected one sample of response time from each of 17,000 Britons. Clearly, he had no concept of intertrial variability, so one sample seemed to suffice. Times have changed and we live in a world where not only can we sample large numbers of people but we can also collect hundreds or thousands or more records from each of them.

The issue now isn’t as much “what are we going to do” but “how are we going to do it” and “what sorts of things can we expect to learn about human abilities if we bother?” Indeed, with our modern definition of “huge data set”, we might say of ourselves what Ronald Johnson and colleagues said in their re-analysis of Galton’s data: “He had obviously solved the problem of large-scale data acquisition. However, the problem of analyzing this huge data set was not adequately solved” (pp. 875-876).

Into this breach step Nemanja Vaci, Dijana Cocić, Bartosz Gula, and Merim Bilalić with their intriguing, if complex, paper, “Large Data and Bayesian Modeling: Aging Curves of NBA Players”, which appeared in the special issue of Behavior Research Methods on Big Data that occasioned this digital event. This article has been the subject of a previous post; here I provide my own take on it and compare it to recent work in my own laboratory.

While writing this paper, Vaci and colleagues were wearing more than one hat. The first hat was that of researchers using data collected by the National Basketball Association (NBA) to describe the skill curves of elite basketball players across their careers. In particular, Vaci and colleagues wished to describe both the acquisition of “elite skill” and its deterioration. That is, they attempted to look across the careers of NBA athletes to study the rise and fall of expert performance; this is related in detail in our earlier post here.

The other hats have something to do with being cheerleaders for Big Data, tutors of Bayesian Structural Modeling, and explorers of the host of issues surrounding the choice among methods and the statistical and theoretical assumptions underlying each choice.

As Vaci and colleagues get deeper into their topic, the number of choices they face broadens as well. It is not enough to choose among various statistical functions and parameters; they must also contend with the multitude of ways in which performance in their example domain of basketball may be quantified. The various metrics they consider include Win Shares (WS), Value Over Replacement Player (VORP), and Player Efficiency Rating (PER).

Seldom have I come across a paper with more stats and metrics per square inch than this one, or more ways in which the various statistics could be combined, transformed, or otherwise munged. However, as someone who has never played basketball (unless mandated by a high school gym teacher) and who has almost never watched basketball, I am amazed at what these data tell us about the growth and decline of human expertise.

Indeed, my slight confession here is that my most recent PhD graduate, Matt Sangster (June 2019), uses basketball data as a second domain in which to develop and study methods for determining the contributions of the roles played by individuals to the outcome of a team task.

That is, for experimental “paradigms” Sangster used two tasks: (a) NBA basketball, and (b) the Multiplayer Online Battle Arena (MOBA) game League of Legends (LoL). In contrast to Vaci and colleagues, Sangster’s work is less hypothesis-driven and more exploratory (e.g., he uses Exploratory Factor Analysis as his main statistical technique). Unlike the rise and fall of individual expertise covered by Vaci and colleagues, the contribution of individual expertise to team outcome is less explored, and much of the vast literature on teams is based on questionnaire data or on qualitative observations of a small number of teams for short periods of time. Matt Sangster’s work addresses that gap.

Clearly, the details of Matt’s study would fill a PhD thesis, and those interested in his project will have to wait for Matt’s publication of that work. However, like Vaci and colleagues, Matt has a lot of NBA data. Specifically, Matt has 507,446 records from 30 NBA teams across 21 seasons. Each record contains the box-score (end-of-game) statistics for a single player on a single team in a single game. The average number of games played across the 1,894 players is 268, and each player played for an average of 2.88 (SD = 2.0) teams. For this work, Matt used a holdout set of 5 randomly selected seasons (2005-06, 2007-08, 2011-12, 2012-13, and 2016-17). The NBA data set separates players into 7 positions, based on combinations of the three types of role on the team: Guards, Centers, and Forwards.
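To make the holdout idea concrete, here is a minimal Python sketch of a season-level split. It is not Matt’s code: the record layout, the season labels, and the `split_by_season` helper are all hypothetical, and the toy records stand in for the real box scores. The first lines simply check that the reported counts are consistent with the reported average of 268 games per player.

```python
import random

# Sanity check on the figures reported above: records per player.
n_records, n_players = 507_446, 1_894
avg_games = n_records / n_players  # about 267.9, which rounds to 268


def split_by_season(records, n_holdout, seed=0):
    """Hold out whole seasons, chosen at random, rather than single games.

    `records` is assumed to be a list of dicts with a "season" key;
    that layout is an illustration, not the format of the NBA data set.
    """
    seasons = sorted({r["season"] for r in records})
    holdout_seasons = set(random.Random(seed).sample(seasons, n_holdout))
    train = [r for r in records if r["season"] not in holdout_seasons]
    holdout = [r for r in records if r["season"] in holdout_seasons]
    return train, holdout, holdout_seasons


# Toy data: 21 hypothetical season labels with 3 made-up records each.
records = [
    {"season": f"S{s:02d}", "player": f"player_{p}", "points": 10 + p}
    for s in range(1, 22)
    for p in range(3)
]

train, holdout, holdout_seasons = split_by_season(records, n_holdout=5)
print(f"average games per player: {avg_games:.1f}")
print(f"held-out seasons: {sorted(holdout_seasons)}")
print(f"training records: {len(train)}, holdout records: {len(holdout)}")
```

Splitting by season rather than by individual record keeps every game from a held-out season out of training, which seems to be the spirit of the design described above.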

Sometimes we refer to Matt’s work as “finding the ‘I’ in team”. This term emphasizes Matt’s focus on both identifying and measuring the contribution of individual roles to the team’s success. For this work, Matt’s preferred statistical technique is Exploratory Factor Analysis (EFA). His three sets of analyses are shown in graphic form in the figure below. For the Role Performance set (left-most plot in the figure) Matt produces one model for each of the 7 roles (i.e., one for each of the 7 NBA positions). The resulting factors provide inputs for a logistic regression on match outcome.

The second EFA set (the middle set in the figure) takes factors from a single EFA as inputs to a mixed-effects model using team as a random effect. This second method results in one model, general across all positions. The right-most method uses a single EFA, applies its results to each position in the training set, and then trains a mixed-effects model (with team as a random effect) on each of the seven sets separately. This “Individual-in-Team” method results in one model for each of the seven NBA positions.
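For readers who think in code, the first of these three methods (one factor model per position, with the factor scores feeding a logistic regression on match outcome) might be sketched as follows. This is an illustration under loud assumptions, not Matt’s analysis: the data are synthetic, the position labels are placeholders, scikit-learn’s maximum-likelihood `FactorAnalysis` stands in for whatever EFA procedure he actually used, and the mixed-effects models of the other two methods are not shown.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Assumed loadings: stats 1-3 reflect one latent skill, stats 4-6 another.
LOADINGS = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])


def make_synthetic_box_scores(n_rows=600):
    """Generate made-up 'box scores' and win/loss labels for one position."""
    latent = rng.normal(size=(n_rows, 2))
    stats = latent @ LOADINGS + 0.5 * rng.normal(size=(n_rows, 6))
    # Assumption: the outcome depends on the first latent skill plus noise.
    won = (latent[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)
    return stats, won


positions = ["position_A", "position_B", "position_C"]  # placeholder labels
accuracy_by_position = {}
for position in positions:
    X, y = make_synthetic_box_scores()
    # One model per position: standardize, extract factors, then classify.
    model = make_pipeline(
        StandardScaler(),
        FactorAnalysis(n_components=2, random_state=0),
        LogisticRegression(),
    )
    model.fit(X[:400], y[:400])  # first 400 rows for training
    accuracy_by_position[position] = model.score(X[400:], y[400:])

for position, accuracy in accuracy_by_position.items():
    print(f"{position}: held-out accuracy {accuracy:.2f}")
```

The point of the sketch is the shape of the pipeline: the regression never sees the raw statistics, only the lower-dimensional factor scores, and each position gets its own fitted model.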

Where are we? What are we going to do? How are we going to do it? The true point of the ambitious paper by Vaci and colleagues is to showcase a variety of new, Bayesian or Bayesian inspired, statistical techniques by using those techniques in a variety of analyses of NBA Basketball data.

Likewise, the true point of my very limited description of Sangster’s work, also based on the NBA Basketball data, is to show that Big Data can be better than gold. Whereas gold can only be mined once, the utility of Big Data is limited only by the methods we bring to bear on its analysis and by our research questions of interest.

Psychonomics article highlighted in this post:

Vaci, N., Cocić, D., Gula, B. & Bilalić, M. (2019). Large Data and Bayesian Modeling – Aging Curves of NBA Players. Behavior Research Methods, DOI: 10.3758/s13428-018-1183-8.

Author

Wayne Gray
