There is a growing interest in Apache Spark, so I wanted to play with it (especially after Alexander Rubin’s Using Apache Spark post).
For this experiment, I used the recently released Apache Spark 1.6.0
and the “Airlines On-Time Performance” database
from
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. You
can find the scripts I used here: https://github.com/Percona-Lab/ontime-airline-performance. The
uncompressed dataset …