Big Data works the same way humans do when we analyze something: We grab what we know, take some time to think about it, and produce something new. The difference between us and our systems is, for the most part, the amount of data that is being processed for the analysis or the amount of data that is being produced from the analysis. Today we are going to take a look at how we can make our systems analyze data quickly and inexpensively.
Different in every situation
Before we get started, remember, data problems are unique to every customer. While there are core concepts we can use for every customer, it is critical to approach each problem in its proper context. Not everyone will have all their data in one single repository, or have their data up to date, or have their data registered consistently across their database, etc.
All about parallel in-memory processing
It's all about running the functions that analyze your data in parallel, as fast as possible. To run things in parallel, we need a highly durable data source. To run things quickly, we need to load data into RAM and process it there. Knowing this begs the question: What are the right tools for this?
The answer to that question depends a lot on your use-case, but for the most part, an Amazon S3 + Apache Spark combination will be good enough.
Yes, you read that right, S3 as a database (or data lake) and Spark for in-memory processing. While you could use something like Amazon RedShift or Google BigQuery, we have found it's easier to deal with S3 alone. Why? It is meant for massive amounts of concurrent reads and requires little work in configuration, it’s cheap, and it'll get the job done.
Apache Spark is the relatively new and popular solution to processing data in RAM. It comes packed with a lot of features but, most importantly, does what we need it to do: quickly load data into RAM and process it.
What about MongoDB?
Using MongoDB for massive parallel processing is 100% doable, but you will need to scale accordingly. MongoDB's aggregation pipeline runs on RAM but you risk read locks after you reach a set number of concurrent connections. In the grand scheme of things, MongoDB's aggregation pipeline is meant for smaller analytical tasks — even MongoDB supports using Spark for analytical processing here.
Amazon S3 - A durable data source
First, you need to capture all the data in S3 and index it. Any format from JSON to CSV to Parquet is fine. There are 2 ways you can index these files as they land:
- You can build an elasticsearch index with file metadata.
- You can enforce a filename structure.
Building an elasticsearch index is simple via Amazon Lambda. Simply pick up on S3's s3:ObjectCreated event to bring the location and metadata over to your elasticsearch. This will make it easy for you to identify which files to use from S3 to run your calculations. While this is awesome for Data Lake's, it's not a must-have if you only want something with high durability and can enforce an index via filenames to meet your data calculation requirements.
You can easily load data into S3 via their interface (if you want to do it manually), their SDK, one of the many S3 plugins in the open-source community, or via ETL (have a look at our friends' solution: astronomer.io).
Amazon EMR - In-memory data processing
Next, you need to build your Spark Application in Scala and deploy it. We are not going to cover how to build a Spark App here, but there are plenty of resources out there. Your Spark Application is going to load S3's documents into memory and run calculations. Using Amazon EMR, you will be able to set up how much memory your Spark Application is using and the quality of memory it uses. Try different instance types to find out what works for your data needs. Faster instance types will be more expensive, but you will be able to get to near real-time data calculations the closer you get to the top or the more parallel you make your calculations.
Here is a quick tutorial for launching an Amazon EMR Spark cluster.
Amazon Lambda - The serverless developer's glue
Once you've set up your clusters, you need a way to trigger calculations. This is where Amazon Lambda comes to the rescue. With Lambdas, it is easy to setup CRON jobs and events that trigger Amazon EMR. Here are some ideas of what you could do with this glue:
Run a simulation of events to identify data that needs to be calculated and funnel it into Amazon SQS, then have an SQS Consumer Lambda function trigger calculations (read more).
- Run a CRON to calculate data in the background.
- Capture a request, calculate using Amazon EMR, and respond to the request.
Be careful! Lambdas can run out of memory and have a 3 minute maximum timeout limit. This is why developers use it mostly as glue to connect different Amazon services! Avoid working with databases in Lambdas as well; while they may pool connections with a globally scoped database connection object during a "cold start," there is no easy way to identify when a Lambda has gone cold and react to it to properly close the connection. You can read more about this issue here.
The downsides of Apache Spark + Amazon S3
First of all, you need to learn a new language: Scala. There was a project bringing Apache Spark programming into Node using eclairJS but that project has been killed :'(.
Local development is painful. As you learn, you will find yourself pushing to production (or some form of staging) to debug and, because of that, some mistakes will cost you. I'll be writing up a separate blog post to help you set up your local development environment and stage.
Dealing with security can be cumbersome to those of us who aren't comfortable in Amazon's documentation (I know when I first started programming, Amazon's documentation looked complex).
Big Data Analytics is complex and requires multiple attempts to get it right
No matter how you look at it, the nature of a Big Data problem will require multiple attempts before reaching a viable solution. It doesn't matter how advanced your technology stack is or how skilled your developers are; the nature of the data is always different and, therefore, the solutions will vary depending on what the development team knows how to use best.
At Differential, we have dealt with Big Data problems that can be resolved with simple on-demand aggregations and other Big Data problems that require massive simulations and pre-calculations to make the end-product viable. I'm certain the next Big Data problem we face will have its own set of unique hurdles that will force our team to shift gears and learn yet another tool to produce the results that our customers want. Investing in Differential to help solve these kinds of issues will speed up your solution’s time to market as we identify the best strategy for your Big Data problem.