At Pinterest we’re building the world’s most comprehensive discovery engine, and part of achieving a highly personalized, relevant and fast service is running thousands of jobs on our Hadoop/Spark cluster. To feed those jobs, we need to ingest a large volume of raw data from online data sources such as MySQL, Kafka and Redis. We’ve previously covered our logging pipeline and moving Kafka data onto S3. Here we’ll share lessons learned moving data at scale from MySQL to S3, and our journey implementing Tracker, a database ingestion system that moves content at massive scale.
History
To give an idea of the challenge, let’s first look at where we were coming from. MySQL is the main data source for storing the most important objects in Pinterest: Pins, Pinners and boards. Every day we collect more than 100 terabytes …