How-to: Analyze Fantasy Sports using Apache Spark and SQL
As part of the drumbeat for Spark Summit West in San Francisco (June 6-8), learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL.
In the United States, many diehard sports fans morph into amateur statisticians to get an edge over the competition in their fantasy sports leagues. Depending on one’s technical chops, this “edge” is usually no more sophisticated than simple spreadsheet analysis, but some particularly intense people go as far as creating their own player rankings and projection systems. Online tools can provide similar capabilities, but it’s often not transparent where the numbers come from.
Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.
Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.
In this two-part series, I’ll walk you through a common big data workflow: using Spark and Spark SQL for ETL and complex data processing, with fantasy NBA basketball as the backdrop. In particular, we’ll do a lot of data processing and then determine who our system says was the best player in the 2015-16 NBA season. (If it is anyone other than Stephen Curry, we’ll need to go back to the drawing board.) If you follow professional sports other than basketball (or don’t follow sports at all), don’t pay too much attention to the subject area itself: the dataset patterns involved are highly similar to those found in other sports as well as in real-world use cases.
All of the code (Scala) in this blog can be found over at GitHub.
Data Processing: Setup
We begin by grabbing our data from Basketball-Reference.com, which allows us to export season-based statistics by year as CSV files. We’ll grab data from the current season all the way back to the 1979-1980 season, which is when the NBA adopted the three-point shot. (Prior to the merger of the ABA and NBA in 1976, the three-point shot was used in the ABA but not the NBA. Blocks and steals are also not available for all historic seasons. Starting from the 1979-1980 season gives us the same information to use on all seasons.) This gives us 37 seasons of data (1979-1980 through 2015-2016) to use for our analysis. Note that the CSV files have some missing data, repeated headers, and so on, which we will have to account for and scrub out during our data processing. (No dataset is perfect!)
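To make the scrubbing step concrete, here is a minimal sketch in plain Scala of how repeated headers and missing values might be handled. It is not the code from the GitHub repository: the object name CsvScrub, the assumption that header rows begin with "Rk,", and the choice to fill empty fields with "0" are all hypothetical and would need to be checked against the actual exported files.

```scala
// A sketch only: names and file layout below are assumptions, not the author's code.
object CsvScrub {
  // Assumption: every header row (including repeated ones) starts with the "Rk" column.
  def isHeader(line: String): Boolean = line.startsWith("Rk,")

  // Keep a line only if it is non-blank and is not a header row.
  def isDataRow(line: String): Boolean = line.trim.nonEmpty && !isHeader(line)

  // Assumption: an empty field means "no value recorded", so substitute "0"
  // to keep later numeric parsing from failing. The -1 limit preserves
  // trailing empty fields that split would otherwise drop.
  def fillMissing(line: String): String =
    line.split(",", -1).map(f => if (f.trim.isEmpty) "0" else f).mkString(",")

  // Drop headers and blank lines, then fill in missing fields.
  def scrub(lines: Seq[String]): Seq[String] =
    lines.filter(isDataRow).map(fillMissing)
}
```

Because isDataRow and fillMissing are ordinary functions on strings, the same pair could be passed to filter and map on a Spark RDD of lines instead of a local Seq. Whether zero is an appropriate stand-in for a missing statistic depends on the column, so treat that substitution as a placeholder.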