SportsTribution - rambling about sport and data

Man versus Machine (Part I)

What the comparison of two annotation sources can tell us about SportVU and psychology

Preamble: I wrote this post in summer 2015. It never got published at Nylon Calculus, as we did not want to anger the NBA and SportVU gods into taking our data away. Alas, away they took it. So, while there is a discussion about tracking data on Twitter right now, here's the article... (I'm sorry, links won't work and grammar errors will remain. Also, I think there is a figure missing in the beginning. I think it died with my hard drive.) (Also, Part II was about assists iirc. That's also gone.)

When you start scraping, analyzing and visualizing nba.com's tracking data, you feel like a kid in a candy store. But once the first sugar rush wears off, you realize that some of the candies may be dirty and that you should probably gobble them a bit more carefully. Before you start to mistrust everything that has ever been written on Nylon Calculus, let me assure you that the number of bad candies we are talking about is rather small and should not affect any previously published results. We are basically talking about one booger among a thousand bonbons. But obviously it would be nice to eliminate as much discrepancy between data and reality as possible.

The most easily extracted SportVU data is shot related. For each shot, SportVU gives us the shot distance, the touch time and the number of dribbles before the shot. In the following I will first tackle shot distance and then look at problems with touch time and dribbles. I will compare the manual annotation of assists with the information given by SportVU in a later article, as assists are a more philosophical question.

The data comprises all shots of the 2014-15 regular season (mandatory hat tip to Darryl). It might be that some hiccups were introduced outside of the SportVU data. But I re-checked several entries manually and always found that the bizarre results already exist in the nba.com database.

Shot distance

Shot distance was already recorded before SportVU, as there have always been hardworking people who manually annotate every shot taken on an NBA court. If we compare the shot distance given by those worker bees with the shot distance we receive from our new electronic overlord, we get the following picture:

At first sight, this might look worse than it actually is. Be aware that the color scale is a log scale, so everything between blue and yellow is only a small fraction of the close to 200,000 shots. Almost all shots fall in the slightly skewed diagonal rectangle in the middle, which is also reflected in the histogram of the difference between manually annotated shot distance and SportVU distance.
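For anyone who wants to redo this comparison, here is a minimal sketch of the two summaries described above: binned counts on a log scale, and a histogram of the difference between the two distances. This is not the code behind the original figure; the function name, the 1-foot bins and the toy pairs are all made up for illustration.

```python
import math
from collections import Counter


def distance_histograms(shots, bin_width=1.0):
    """Summarize (manual_ft, sportvu_ft) pairs.

    Returns a dict of log10 counts per 2D bin and a Counter of the
    rounded difference (SportVU minus manual) in feet.
    """
    grid = Counter()
    diffs = Counter()
    for manual, sportvu in shots:
        cell = (int(manual // bin_width), int(sportvu // bin_width))
        grid[cell] += 1
        diffs[round(sportvu - manual)] += 1
    # log10 lets rare cells stay visible next to the crowded diagonal
    log_grid = {cell: math.log10(n) for cell, n in grid.items()}
    return log_grid, diffs


# Toy pairs (hypothetical values, not real shots)
toy_shots = [(23.0, 23.4), (23.0, 23.9), (24.0, 24.2), (0.0, 23.9), (2.0, 1.8)]
log_grid, diffs = distance_histograms(toy_shots)

for cell, value in sorted(log_grid.items()):
    print(cell, round(value, 2))
print(sorted(diffs.items()))
```

A cell that holds a single shot gets a log count of 0, which is why lonely outliers still show up at the cold end of the color scale.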

So, as a batch measurement, SportVU seems to produce useful distances (Phew!). Yet we can see two clear regions of artifacts in the 2D histogram, which I marked in red. The reason for the upper one seems obvious and a bit embarrassing: SportVU simply does not believe that somebody could shoot from their own half of the court. This explains the negative slope for shots that are manually declared as more than 50 feet away from the basket (the length of an NBA court being 94 feet).
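To see where a negative slope could come from, here is a small geometric sketch. It assumes that the tracker reports the distance to whichever basket is nearer, which is only my reading of the artifact and not confirmed SportVU behavior. The rim center sits 5.25 feet inside the baseline (backboard 4 feet in, rim center another 15 inches); the function names and sample positions are hypothetical.

```python
import math

COURT_LENGTH_FT = 94.0
HOOP_FROM_BASELINE_FT = 5.25  # 4 ft to the backboard + 15 in to the rim center


def distance_to_hoops(x_ft, y_ft):
    """Distances to the target hoop and to the opposite hoop.

    x_ft is measured along the court from the target baseline,
    y_ft sideways from the line connecting both hoops.
    """
    target = math.hypot(x_ft - HOOP_FROM_BASELINE_FT, y_ft)
    other = math.hypot(COURT_LENGTH_FT - HOOP_FROM_BASELINE_FT - x_ft, y_ft)
    return target, other


def nearest_hoop_distance(x_ft, y_ft):
    """Assumed tracker behavior: always measure to the closer hoop."""
    return min(distance_to_hoops(x_ft, y_ft))


# Walk away from the target hoop along the middle of the court
for x in (30.0, 55.0, 65.0, 75.0):
    true_distance, _ = distance_to_hoops(x, 0.0)
    print(true_distance, nearest_hoop_distance(x, 0.0))
```

Under that assumption, every extra foot of true distance beyond midcourt takes one foot off the reported distance, which is exactly a slope of minus one.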

The problem we see in the lower right corner, where manual annotations estimate a distance of less than 5 feet and SportVU goes up to 30, becomes clearer when we use additional information given by our worker bees. Because they gave every shot an action type label, we can for example look at shots that are labeled as “Jump Shot” or shots labeled as “Dunk Shot”. In the following plot, I compare all shots for which the action type contains the words “Layup” or “Dunk” with all shots containing “Pullup”, “Fadeaway” or “Stepback”:
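The keyword split can be sketched in a few lines. The group names "rim", "jumper" and "other" and the function name are my own labels for illustration, and the matching is case-insensitive because the action type labels are not consistently capitalized.

```python
RIM_WORDS = ("layup", "dunk")
JUMPER_WORDS = ("pullup", "fadeaway", "stepback")


def shot_group(action_type):
    """Assign an action type label to one of the two compared groups."""
    label = action_type.lower()
    if any(word in label for word in RIM_WORDS):
        return "rim"
    if any(word in label for word in JUMPER_WORDS):
        return "jumper"
    # Plain "Jump Shot" lands here: it contains none of the keywords
    return "other"


for action in ("Alley Oop Dunk Shot", "Pullup Jump shot", "Jump Shot"):
    print(action, "->", shot_group(action))
```

Note that a plain “Jump Shot” belongs to neither group, so the comparison only covers jumpers with a clearly labeled movement.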

As you can see, we have a clear problem for layups and dunks. In the manual annotation, none of them were declared to be further away than 5 feet (Note: Actually, 3 of 10,000 were. One example is a miss by Young, followed by a missed put-back http://stats.nba.com/cvp.html?GameID=0021400385&GameEventID=199# ; no idea what happens in the other two). In comparison, according to SportVU, 30% of all dunks alone were apparently made from outside the restricted area (the 4-foot arc). That is an interesting definition of a dunk. Looking at some of the biggest offenders in dunk distance, we can get an idea of what goes wrong: often you either have very straight drives to the basket (Noel, SportVU distance 23.9 feet: http://stats.nba.com/cvp.html?GameID=0021400043&GameEventID=388# ; Olynyk, SportVU distance 23.8 feet) or alley-oops (Jordan, SportVU distance 24.5 feet: http://stats.nba.com/cvp.html?GameID=0021400391&GameEventID=071# ). But some of them just do not make any sense (like this 25-foot dunk by Gerald Henderson, where the ball is never further away from the basket than maybe 8 feet: http://stats.nba.com/cvp.html?GameID=0021400742&GameEventID=049# ). Under these circumstances, we can of course end up inflating shooting percentages for shots that we think were taken from 3 to 5 feet.
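Flagging such shots is straightforward once the per-shot records are parsed. The sketch below uses three record strings as they appear in my notes further down; reading SHOT_DISTANCE as the manual value and SHOT_DIST as the SportVU value is inferred from the examples above (Noel: manual 0.0, SportVU 23.9). The function names and the 4-foot threshold constant are hypothetical choices.

```python
RESTRICTED_AREA_FT = 4.0  # threshold chosen for illustration


def parse_record(record):
    """Turn a 'key: value; key: value' record string into a dict."""
    fields = {}
    for part in record.split(";"):
        # Split on the first colon only, so "Game Clock: 11:02" survives
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def suspicious_rim_shot(fields, limit_ft=RESTRICTED_AREA_FT):
    """True for dunks/layups that SportVU places beyond limit_ft."""
    shot_type = fields["ShotType"]
    is_rim = "Dunk" in shot_type or "Layup" in shot_type
    return is_rim and float(fields["SHOT_DIST"]) > limit_ft


records = [
    "player ID:203457; Scoring Player:Nerlens Noel; ShotType: Dunk Shot; Touch time: 3.5; Shot clock: 21.6; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 23.9; Distance defender: 5.8; Game Clock: 11:02",
    "player ID:202685; Scoring Player:Jonas Valanciunas; ShotType: Alley Oop Layup shot; Touch time: 9.8; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 1.0; SHOT_DIST: 1.5; Distance defender: 0.8; Game Clock: 4:25",
    "player ID:2564; Scoring Player:Boris Diaw; ShotType: Pullup Jump shot; Touch time: 0.9; Shot clock: 2.6; Dribbles: 0; SHOT_DISTANCE: 31.0; SHOT_DIST: 25.3; Distance defender: 6.6; Game Clock: 2:23",
]

parsed = [parse_record(r) for r in records]
flagged = [p["Scoring Player"] for p in parsed if suspicious_rim_shot(p)]
print(flagged)
```

Dividing the number of flagged dunks by the number of all dunks would give the kind of share quoted above, provided the full season of records is loaded the same way.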

In comparison, shots for which we can expect motion that is not directed towards the basket seem to work quite well. An interesting observation is that for all jump shots, manual and SportVU distance agree well for shots from 23 to 25 feet – basically all three-pointers. For midrange jump shots, on the other hand, SportVU sees the player a little bit further away from the basket.

There are two possible explanations for this. The more likely one is that manual observations lack precision when the 3-point line is not there to guide the estimate. The ugly alternative would be that SportVU has some kind of ridge regression, pulling shots that are likely 3-pointers towards the 3-point line. Let's hope that's not the case...
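The midrange offset itself is easy to quantify: average the signed difference (SportVU minus manual) per bin of manual distance. This will not tell the two explanations apart on its own, but it shows where the disagreement sits. The sketch below runs on made-up toy pairs; the function name and the 5-foot bins are hypothetical.

```python
from collections import defaultdict


def mean_diff_by_bin(shots, bin_width=5.0):
    """Mean of (SportVU - manual) per bin of manual distance.

    Keys of the result are the lower edges of the bins in feet.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for manual, sportvu in shots:
        start = int(manual // bin_width) * bin_width
        sums[start][0] += sportvu - manual
        sums[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


# Toy pairs (hypothetical): midrange offset, agreement near the 3-point line
toy_jumpers = [(15.0, 16.5), (17.0, 18.5), (23.0, 23.2), (24.0, 23.8)]
result = mean_diff_by_bin(toy_jumpers)
for start, diff in result.items():
    print(f"{start:.0f}-{start + 5:.0f} ft: {diff:+.2f}")
```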

For the few shots I looked at where manual annotation and SportVU strongly disagreed, the manual annotation was right more often than not.

Examples where the manual annotation is correct:

SportVU says 38.4 ft distance, manual 23 http://stats.nba.com/cvp.html?GameID=0021400714&GameEventID=157#

SportVU says 37.3 ft distance, manual 24 http://stats.nba.com/cvp.html?GameID=0021400946&GameEventID=197#

SportVU says 42.9 ft distance, manual 19 http://stats.nba.com/cvp.html?GameID=0021401154&GameEventID=323#

On the other hand, I am pretty sure that this shot by Boris Diaw is closer to 25.3 feet than to 31 feet:

http://stats.nba.com/cvp.html?GameID=0021400108&GameEventID=085#

not a 31 footer

[1] http://stats.nba.com/cvp.html?GameID=0021400108&GameEventID=085#

[1] "player ID:2564; Scoring Player:Boris Diaw; ShotType: Pullup Jump shot; Touch time: 0.9; Shot clock: 2.6; Dribbles: 0; SHOT_DISTANCE: 31.0; SHOT_DIST: 25.3; Distance defender: 6.6; Game Clock: 2:23"

much closer manually

manual is right

[1] http://stats.nba.com/cvp.html?GameID=0021400714&GameEventID=157#

[1] "player ID:2045; Scoring Player:Hedo Turkoglu; ShotType: Jump Shot; Touch time: 0.0; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 23.0; SHOT_DIST: 38.4; Distance defender: 21.0; Game Clock: 8:07"

[1] http://stats.nba.com/cvp.html?GameID=0021400946&GameEventID=197#

[1] "player ID:201228; Scoring Player:CJ Watson; ShotType: Jump Shot; Touch time: 4.4; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 24.0; SHOT_DIST: 37.3; Distance defender: 7.4; Game Clock: 2:48"

[1] http://stats.nba.com/cvp.html?GameID=0021401154&GameEventID=323#

[1] "player ID:202339; Scoring Player:Eric Bledsoe; ShotType: Pullup Jump shot; Touch time: 5.6; Shot clock: 18.3; Dribbles: 6; SHOT_DISTANCE: 19.0; SHOT_DIST: 42.9; Distance defender: 20.8; Game Clock: 6:04"

closer automatic

[1] http://stats.nba.com/cvp.html?GameID=0021400058&GameEventID=313#

[1] "player ID:101112; Scoring Player:Channing Frye; ShotType: Jump Shot; Touch time: 0.0; Shot clock: 15.8; Dribbles: 0; SHOT_DISTANCE: 27.0; SHOT_DIST: 19.0; Distance defender: 9.4; Game Clock: 5:36"

[1] http://stats.nba.com/cvp.html?GameID=0021400301&GameEventID=391#

[1] "player ID:101139; Scoring Player:CJ Miles; ShotType: Jump Shot; Touch time: 0.0; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 26.0; SHOT_DIST: 14.8; Distance defender: 11.2; Game Clock: 8:12"

[1] http://stats.nba.com/cvp.html?GameID=0021400394&GameEventID=162#

[1] "player ID:201583; Scoring Player:Ryan Anderson; ShotType: Jump Shot; Touch time: 0.0; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 26.0; SHOT_DIST: 12.1; Distance defender: 4.2; Game Clock: 8:04"

[1] http://stats.nba.com/cvp.html?GameID=0021400606&GameEventID=499#

[1] "player ID:201163; Scoring Player:Wilson Chandler; ShotType: Jump Shot; Touch time: 0.0; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 27.0; SHOT_DIST: 14.9; Distance defender: 11.3; Game Clock: 1:42"

[1] http://stats.nba.com/cvp.html?GameID=0021400668&GameEventID=289#

[1] "player ID:203081; Scoring Player:Damian Lillard; ShotType: Jump Shot; Touch time: 10.2; Shot clock: 14.0; Dribbles: 12; SHOT_DISTANCE: 25.0; SHOT_DIST: 19.9; Distance defender: 6.5; Game Clock: 8:44"

[1] http://stats.nba.com/cvp.html?GameID=0021400694&GameEventID=281#

[1] "player ID:201155; Scoring Player:Rodney Stuckey; ShotType: Jump Shot; Touch time: 2.8; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 26.0; SHOT_DIST: 18.1; Distance defender: 6.1; Game Clock: 7:06"

[1] http://stats.nba.com/cvp.html?GameID=0021401137&GameEventID=207#

[1] "player ID:203496; Scoring Player:Robert Covington; ShotType: Jump Bank Shot; Touch time: 5.6; Shot clock: 17.7; Dribbles: 5; SHOT_DISTANCE: 25.0; SHOT_DIST: 17.1; Distance defender: 4.1; Game Clock: 6:01"

[1] http://stats.nba.com/cvp.html?GameID=0021401145&GameEventID=381#

[1] "player ID:204060; Scoring Player:Joe Ingles; ShotType: Jump Shot; Touch time: 2.0; Shot clock: 5.8; Dribbles: 1; SHOT_DISTANCE: 25.0; SHOT_DIST: 17.7; Distance defender: 5.9; Game Clock: 1:16"

[1] http://stats.nba.com/cvp.html?GameID=0021401158&GameEventID=087#

[1] "player ID:203897; Scoring Player:Zach LaVine; ShotType: Jump Shot; Touch time: 2.2; Shot clock: 10.1; Dribbles: 1; SHOT_DISTANCE: 25.0; SHOT_DIST: 17.0; Distance defender: 3.2; Game Clock: 2:27"

Distant dunks

Yes

[1] http://stats.nba.com/cvp.html?GameID=0021400043&GameEventID=388#

[1] "player ID:203457; Scoring Player:Nerlens Noel; ShotType: Dunk Shot; Touch time: 3.5; Shot clock: 21.6; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 23.9; Distance defender: 5.8; Game Clock: 11:02"

Tough

[1] http://stats.nba.com/cvp.html?GameID=0021400173&GameEventID=236#

[1] "player ID:203100; Scoring Player:Tony Wroten; ShotType: Dunk Shot; Touch time: 0.0; Shot clock: 14.9; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 24.9; Distance defender: 0.5; Game Clock: 2:56"

Yes

[1] http://stats.nba.com/cvp.html?GameID=0021400371&GameEventID=353#

[1] "player ID:203482; Scoring Player:Kelly Olynyk; ShotType: Driving Dunk Shot; Touch time: 0.9; Shot clock: 18.7; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 23.8; Distance defender: 4.2; Game Clock: 11:28"

Yes

[1] http://stats.nba.com/cvp.html?GameID=0021400391&GameEventID=071#

[1] "player ID:201599; Scoring Player:DeAndre Jordan; ShotType: Alley Oop Dunk Shot; Touch time: 4.0; Shot clock: 20.6; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 24.5; Distance defender: 5.0; Game Clock: 5:50"

Tough

[1] http://stats.nba.com/cvp.html?GameID=0021400406&GameEventID=143#

[1] "player ID:101123; Scoring Player:Gerald Green; ShotType: Driving Dunk Shot; Touch time: 0.0; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 24.0; Distance defender: 3.7; Game Clock: 12:00"

[1] http://stats.nba.com/cvp.html?GameID=0021400547&GameEventID=349#

[1] "player ID:203084; Scoring Player:Harrison Barnes; ShotType: Alley Oop Dunk Shot; Touch time: 4.8; Shot clock: 18.6; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 22.4; Distance defender: 4.4; Game Clock: 4:34"

[1] http://stats.nba.com/cvp.html?GameID=0021400691&GameEventID=189#

[1] "player ID:201148; Scoring Player:Brandan Wright; ShotType: Dunk Shot; Touch time: 3.4; Shot clock: 24.0; Dribbles: 1; SHOT_DISTANCE: 0.0; SHOT_DIST: 23.6; Distance defender: 7.6; Game Clock: 6:22"

[1] http://stats.nba.com/cvp.html?GameID=0021400742&GameEventID=049#

[1] "player ID:201945; Scoring Player:Gerald Henderson; ShotType: Alley Oop Dunk Shot; Touch time: 0.0; Shot clock: 9.3; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 25.0; Distance defender: 8.4; Game Clock: 6:42"

[1] http://stats.nba.com/cvp.html?GameID=0021401101&GameEventID=147#

[1] "player ID:202687; Scoring Player:Bismack Biyombo; ShotType: Dunk Shot; Touch time: 0.6; Shot clock: 9.1; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 24.6; Distance defender: 10.1; Game Clock: 7:41"

http://stats.nba.com/cvp.html?GameID=0021400009&GameEventID=486#

http://stats.nba.com/cvp.html?GameID=0021400189&GameEventID=537#

walking the dog

http://stats.nba.com/cvp.html?GameID=0021400053&GameEventID=509#

takes after shot clock reset

http://stats.nba.com/cvp.html?GameID=0021400045&GameEventID=552#

misses the skip pass

http://stats.nba.com/cvp.html?GameID=0021400234&GameEventID=015#

shot clock stops

http://stats.nba.com/cvp.html?GameID=0021400730&GameEventID=008#

momentarily unclear position

http://stats.nba.com/cvp.html?GameID=0021400047&GameEventID=498#

http://stats.nba.com/cvp.html?GameID=0021400194&GameEventID=336#

http://stats.nba.com/cvp.html?GameID=0021400908&GameEventID=088#

shot clock reset

http://stats.nba.com/cvp.html?GameID=0021400601&GameEventID=307#

no idea

Alley-oops

http://stats.nba.com/cvp.html?GameID=0021400111&GameEventID=293#

"player ID:203500; Scoring Player:Steven Adams; ShotType: Alley Oop Dunk Shot; Touch time: 14.1; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 0.0; SHOT_DIST: 5.6; Distance defender: 3.9; Game Clock: 8:10"

"player ID:202685; Scoring Player:Jonas Valanciunas; ShotType: Alley Oop Layup shot; Touch time: 9.8; Shot clock: 24.0; Dribbles: 0; SHOT_DISTANCE: 1.0; SHOT_DIST: 1.5; Distance defender: 0.8; Game Clock: 4:25"

Dribble alley-oops

[1] http://stats.nba.com/cvp.html?GameID=0021400456&GameEventID=239#

[1] "player ID:203081; Scoring Player:Damian Lillard; ShotType: Alley Oop Layup shot; Touch time: 11.3; Shot clock: 12.8; Dribbles: 9; SHOT_DISTANCE: 2.0; SHOT_DIST: 4.1; Distance defender: 3.9; Game Clock: 11:49"

[1] http://stats.nba.com/cvp.html?GameID=0021400500&GameEventID=415#

[1] "player ID:201163; Scoring Player:Wilson Chandler; ShotType: Alley Oop Dunk Shot; Touch time: 7.8; Shot clock: 16.9; Dribbles: 6; SHOT_DISTANCE: 0.0; SHOT_DIST: 3.1; Distance defender: 1.9; Game Clock: 7:52"

[1] http://stats.nba.com/cvp.html?GameID=0021400909&GameEventID=406#

[1] "player ID:201566; Scoring Player:Russell Westbrook; ShotType: Alley Oop Layup shot; Touch time: 12.2; Shot clock: 12.0; Dribbles: 10; SHOT_DISTANCE: 1.0; SHOT_DIST: 4.3; Distance defender: 3.4; Game Clock: 6:59"

[1] http://stats.nba.com/cvp.html?GameID=0021400031&GameEventID=327#

I had a few weeks where I used Twitter mostly on my phone. So I started blindly favoriting tweets that could be useful. This blog post is mostly for me to curate all these data-related links. If it comes in handy for others, all the better. I try to sort them a bit by topic... Hat tips go to all the data scientists who show an immigrant like me some interesting things (too lazy to list them right now...)

General Methods and algorithms

Data Elixir - what I am doing with this blog post, but on a bigger scale
Machine learning primer - I just skimmed it, but this series seems great for communicating very important and central concepts to people new to the field
Statistical learning overload - haven't watched the videos yet, but the Hastie book (freely available) and everything that comes with it are probably a first step for any data scientist immigrant
Statistical Data Mining Tutorials collection
Machine learning cheat sheets - a great combination with the Hastie book. Use some of the quick cheat sheet information first and then get down to the more gory details using the book and videos. Check out this sheet for example
Hyper parameter selection
A lot of data science cheat sheets (which basically forces you to read more links)
Machine Learning Visualizations (made in Python and R, hooray)

R

R introduction plus text mining course - @StatsInTheWild is probably my favorite Twitter handle
dplyr tutorial - I still live in a dplyr-less world. Which I guess I should regret every time I write df[df[,'Stat1']>0 & df[,'Stat2']>2,] - people might call that dumb, I call it old school
ggrepel - I can finally use the ggplot package for messy text plots