Wednesday, September 3, 2014

Hadoop vs. Redshift

Childhood dreams do come true - in 2015 "Batman vs. Superman" will bring the world's biggest superheroes to battle on-screen, finally settling that eternal debate of who would prevail (I put my Bitcoins on Batman).
The Big Data world has its own share of epic battles. In November 2012 Amazon announced Redshift, their cutting-edge data warehouse-as-a-service that costs as little as $1,000 per terabyte per year. Apache Hadoop, created in 2005, is not the only big data superhero on the block anymore. Now that we have our own Superman vs. Batman, we gotta ask: how does Hadoop compare with Amazon Redshift? Let's get them in the ring and find out.




In the left corner wearing a black cape we have Apache Hadoop. Hadoop is an open source framework for distributed processing and storage of Big Data on commodity machines. It uses HDFS, a dedicated file system that cuts data into blocks, replicates them, and spreads them across the cluster. The data is processed in parallel on the machines via MapReduce (Hadoop 2.0 aka YARN allows for other applications as well).
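To make the MapReduce part concrete, here's a minimal word count written for Hadoop Streaming, which lets you implement the mapper and reducer as plain scripts that read stdin and write stdout. This is our own illustrative sketch - the script name and logic are assumptions, not taken from any benchmark mentioned below:

```python
#!/usr/bin/env python
# Minimal word count for Hadoop Streaming: the mapper emits (word, 1)
# pairs; Hadoop sorts them by key, so the reducer sees each word's
# pairs grouped together and just sums them up.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(n)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    # Run as: wordcount.py map  or  wordcount.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```

You'd ship it to the cluster with the hadoop-streaming jar, something along the lines of `hadoop jar hadoop-streaming.jar -input /logs -output /counts -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -file wordcount.py` (the paths here are hypothetical).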
In the right corner wearing a red cape we have Redshift. Redshift’s data warehouse-as-a-service is based on technology acquired from ParAccel. It is built on an old version of PostgreSQL with 3 major enhancements:
  1. Columnar database - this type of database stores and returns data by column rather than by whole rows. It performs much better when aggregating large sets of data, perfect for analytical querying.

  2. Sharding - Redshift supports data sharding, that is, distributing a table's rows across different nodes for better performance (see the sketch after this list).

  3. Scalability - with everything running in the cloud, Redshift clusters can easily be resized up or down as needed.
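To make points 1 and 2 concrete, here's a minimal sketch of creating a distribution-aware table through Redshift's standard PostgreSQL interface. The cluster endpoint, credentials, and `events` table are made up for illustration:

```python
# Sketch of Redshift's sharding and columnar features, using the
# plain PostgreSQL driver (psycopg2). Endpoint, credentials, and the
# events table are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret")
cur = conn.cursor()

# DISTKEY shards rows across the nodes by user_id, so joins on
# user_id stay node-local; SORTKEY(event_time) lets the columnar
# engine skip whole blocks when a query filters on a time range.
cur.execute("""
    CREATE TABLE events (
        user_id    INTEGER NOT NULL DISTKEY,
        event_time TIMESTAMP SORTKEY,
        event_type VARCHAR(64)
    );
""")
conn.commit()
```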

Traditional solutions by companies like Oracle and EMC have been around for a while, though only as $1,000,000 on-premise racks of dedicated machines. Amazon's innovation, therefore, lies in pricing and capacity. Their pay-as-you-go promise, as low as $1,000/TB/year, makes a powerful data warehouse affordable for small to medium businesses that couldn't previously afford one. And because Redshift lives in the cloud, it shrinks and grows as needed instead of sitting in the office as big, dust-gathering machines that need maintenance.
Enough said; time to battle. Are you ready? Let’s get ready to rumble!

Round 1 - Scaling

The largest Redshift node comes with 16TB of storage and a maximum of 100 nodes can be created. Therefore, if your Big Data goes beyond 1.6PB, Redshift will not do. Also, when an Amazon cluster is resized, the data needs to be reshuffled amongst the machines. That can take several days and plenty of CPU power, slowing the system down for regular operations.
Hadoop scales to as many petabytes as you want, all the more so in the cloud. Scaling Hadoop doesn't require reshuffling, since new data is simply saved on the new machines. In case you do want to balance the data, there is a rebalancer utility available, as sketched below.
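Kicking off that rebalancer is a one-liner; a minimal sketch, assuming the Hadoop 2.x `hdfs` CLI is on the path:

```python
# Minimal sketch: trigger HDFS's block rebalancer after adding nodes.
# -threshold 10 means a node's disk utilization may deviate up to 10
# percentage points from the cluster average before blocks are moved.
import subprocess

subprocess.check_call(["hdfs", "balancer", "-threshold", "10"])
```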
First round goes to Hadoop!

Round 2 - Performance

According to several performance tests by the Airbnb nerds, a 16-node Redshift cluster performed a lot faster than a 44-node Hive/Elastic MapReduce cluster. Another Hadoop vs. Amazon Redshift benchmark, by FlyData, a data synchronization solution for Redshift, confirms that Redshift performs faster at terabyte scale.
Nonetheless, there are some constraints on Redshift's super speed. Certain Redshift maintenance tasks run with limited resources, so procedures like deleting old data can take a while. And although Redshift shards data, it doesn't do so optimally on its own: unless the distribution keys are chosen carefully, queries may end up joining data across different nodes and miss out on the improved performance.
Hadoop still has some tricks up its utility belt. FlyData's benchmark concludes that while Redshift performs faster for terabytes, Hadoop performs better for petabytes. Airbnb agree, stating that Hadoop does a better job of running big joins over billions of rows. Unlike Redshift, Hadoop has no hard resource limits on maintenance tasks. As for spreading data across nodes optimally, denormalizing it into a hierarchical document format should do the trick. It may take extra work, but at least Hadoop has a solution.
We have a tie - Redshift wins for TBs, Hadoop for PBs

Round 3 - Pricing

This is a tricky one. Redshift's pricing depends on the choice of region, node size, storage type (newly introduced), and whether you work with on-demand or reserved resources. The $1,000/TB/year figure only applies to a 3-year reserved XL node with 2TB of storage in US East (North Virginia). The same node in the same region on-demand costs $3,723/TB/year - more than triple the price. Choosing an Asia Pacific region costs even more.
On-premise Hadoop is definitely more expensive. According to Accenture's "Hadoop Deployment Comparison Study", the total cost of ownership of a bare-metal Hadoop cluster with 24 nodes and 50TB of HDFS is more than $21,000 per month - about $5,040/TB/year including maintenance and everything. However, it doesn't make sense to compare pears with pineapples, so let's compare Redshift with Hadoop as a service.
Pricing for Hadoop as a service isn't as clear-cut, since it depends on how much juice you need. FlyData's benchmark claims that running Hadoop via Amazon's Elastic MapReduce is 10 times more expensive than Redshift. Using Hadoop on Amazon's EC2 is a different story: a relatively low-cost m1.xlarge machine with 1.68TB of storage, reserved for 3 years under heavy utilization billing in the US East region, comes to about $124 per month - roughly $886/TB/year. Working on-demand, using SSD-backed machines, or choosing a different region raises the price.
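The quoted figures are easy to sanity-check with a back-of-the-envelope script. The hourly and upfront rates below are our assumptions based on Amazon's US East price list at the time of writing, so verify them against the pricing page:

```python
# Back-of-the-envelope check of the $/TB/year figures quoted above.
# The rates are assumptions from Amazon's US East price list; verify
# against the current pricing page before relying on them.
HOURS_PER_YEAR = 24 * 365

# Redshift dw.hs1.xlarge: 2TB of storage per node.
on_demand = 0.850 * HOURS_PER_YEAR / 2                    # on-demand, $/TB/year
reserved = (3000 + 0.114 * HOURS_PER_YEAR * 3) / (2 * 3)  # 3-yr reserved, $/TB/year

# EC2 m1.xlarge (1.68TB of instance storage), ~$124/month effective
# under 3-year heavy utilization reserved billing.
ec2 = 124 * 12 / 1.68                                     # $/TB/year

print("Redshift on-demand: $%.0f/TB/year" % on_demand)    # ~$3,723
print("Redshift reserved:  $%.0f/TB/year" % reserved)     # ~$1,000
print("Hadoop on EC2:      $%.0f/TB/year" % ec2)          # ~$886
```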
No winner - it depends on your needs

Round 4 - Ease of Use

Redshift automates data warehouse administration tasks and backs itself up to Amazon S3. Transitioning to Redshift should be a piece of cake for PostgreSQL developers, since they can use the same queries and SQL clients that they're used to.
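For example, the very same driver and SQL your Postgres team already uses work unchanged; a minimal sketch against the hypothetical `events` table from earlier:

```python
# Querying Redshift with the plain PostgreSQL driver; nothing
# Redshift-specific is needed. Endpoint, credentials, and the events
# table are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret")
cur = conn.cursor()
cur.execute("""
    SELECT event_type, COUNT(*)
    FROM events
    GROUP BY event_type
    ORDER BY 2 DESC;
""")
for event_type, n in cur.fetchall():
    print("%s: %d" % (event_type, n))
```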
Handling Hadoop, whether in the cloud or not, is trickier: your system administrators will need to learn Hadoop architecture and tools, and your developers will need to learn coding in Pig or MapReduce. Heck, you might need to hire new staff with Hadoop expertise. There are Hadoop-as-a-Service solutions that save you from all that trouble (uh hum); however, most data warehouse devs and admins will find it easier to use Redshift.
Redshift takes the round

Round 5 - Data Format

When it comes to data format, Redshift is pretty strict. It only accepts flat text files in a fixed format such as CSV. On top of that, Redshift only supports certain data types: the serial data type, arrays, and XML are unsupported at the moment. Even newline characters inside values need to be escaped, and Redshift doesn't support multiple NULLs in your data either. This means you'll need to spend time converting your data before you can use it with Redshift, as in the sketch below.
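As a taste of that conversion work, here's a hedged sketch that flattens records into pipe-delimited text with embedded delimiters and newlines backslash-escaped - the form Redshift's COPY can ingest with its ESCAPE option. The record layout is made up:

```python
# Sketch: flatten records into pipe-delimited text for Redshift's
# COPY. Backslashes, delimiters, and newlines are backslash-escaped
# (COPY's ESCAPE option), and None becomes an empty field so that
# COPY ... NULL AS '' maps it back to NULL.
def to_redshift_row(record, delimiter="|"):
    fields = []
    for value in record:
        if value is None:
            field = ""
        else:
            field = str(value)
            for ch in ("\\", delimiter, "\n", "\r"):
                field = field.replace(ch, "\\" + ch)
        fields.append(field)
    return delimiter.join(fields)

rows = [(1, "a comment\nspanning two lines", None)]
with open("events.txt", "w") as out:
    for r in rows:
        out.write(to_redshift_row(r) + "\n")
```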
Hadoop accepts every data format and data type imaginable.
Winner: Hadoop

Round 6 - Data Storage

Redshift can only load data from Amazon S3 or DynamoDB. Not only will you have to use more of Amazon's services, but you'll also need to spend extra time preparing and uploading the data. Redshift loads a single file via a single thread, so a big load can take some time. Amazon suggests best practices to speed up the process, such as splitting the data into multiple files, compressing them, using a manifest file, etc. (see the sketch below). Moving the data to DynamoDB is of course a bigger headache, unless it's already there.
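Putting those best practices together, here's a sketch of the load path: split, gzipped files listed in a JSON manifest and loaded in parallel with a single COPY. The bucket, table, endpoint, and credentials are hypothetical:

```python
# Sketch of Amazon's suggested bulk-load path: a JSON manifest that
# lists split, gzipped files on S3, loaded in parallel by one COPY.
# Bucket, table, endpoint, and credentials are hypothetical.
import json
import psycopg2

manifest = {"entries": [
    {"url": "s3://mybucket/events/part-0000.gz", "mandatory": True},
    {"url": "s3://mybucket/events/part-0001.gz", "mandatory": True},
]}
with open("events.manifest", "w") as f:
    json.dump(manifest, f)
# Upload events.manifest to s3://mybucket/events.manifest before the COPY.

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret")
cur = conn.cursor()
cur.execute("""
    COPY events
    FROM 's3://mybucket/events.manifest'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=...'
    MANIFEST GZIP DELIMITER '|' ESCAPE NULL AS '';
""")
conn.commit()
```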
Life is more flexible with Hadoop: you can store data on local drives, in a relational database, or in the cloud (S3 included), and then import it straight into the Hadoop cluster.
Another round for Hadoop

Round 7 - General

Being a columnar database, Redshift is built for relational analytics and can't, for instance, do any text analysis; Hadoop is open to all kinds of analysis via MapReduce (remember the word-count sketch above) and to even more kinds of applications in version 2. Upon failure, say an I/O error on a file, Redshift moves on to the next piece of data without retrying, whereas Hadoop retries the task.
Hadoop wins again




Tonight’s Winner

We have a tie! Huh!? Didn't Hadoop win most of the rounds? Yes, it did, but Big Data's superheroes are better off working together as a team rather than fighting. Turn on the Hadoop-Signal when you need relatively cheap data storage, batch processing of petabytes, or processing data in non-relational formats. Call out to red-caped Redshift for analytics, fast performance for terabytes, and an easier transition for your PostgreSQL team. As Airbnb concluded in their benchmark: "We don't think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics". We agree.