Childhood dreams do come true -
in 2015 "Batman vs. Superman" will bring the world’s biggest superheroes
to battle on-screen, finally solving that eternal debate who will
prevail (I put my Bitcoins on Batman).
The Big Data world has its own share of epic battles. In November 2012 Amazon announced Redshift,
their cutting-edge data warehouse-as-a-service that scales for only
$1,000 per terabyte per year. Apache Hadoop, created in 2005, is not the
only big data superhero on the block anymore. Now that we have our own
Superman vs. Batman, we gotta ask, how does Hadoop compare with Amazon
Redshift? Let’s get them in the ring and find out.

In the left corner wearing a black cape we have Apache Hadoop. Hadoop is an open source framework for distributed processing and storage of Big Data on commodity machines. It uses HDFS, a dedicated file system that cuts data into small chunks and spreads them optimally over a cluster. The data is processed in parallel on the machines via MapReduce (Hadoop 2.0 aka YARN allows for other applications as well).
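To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets you plug plain scripts in as the mapper and reducer. The file name and the run modes are illustrative choices, not anything prescribed by Hadoop itself.

```python
#!/usr/bin/env python
# Minimal word-count sketch for Hadoop Streaming: the mapper emits
# "word<TAB>1" lines, Hadoop sorts them by key, and the reducer sums
# the counts per word. The run mode is chosen by the first argument.
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so lines for the same word are consecutive.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

You would hand this script to the Hadoop Streaming jar, roughly `hadoop jar hadoop-streaming.jar -files wordcount.py -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -input <in> -output <out>` (the jar location and flags vary by distribution), or test it locally with `cat file | ./wordcount.py map | sort | ./wordcount.py reduce`.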
In the right corner wearing a red cape we have Redshift. Redshift’s data warehouse-as-a-service is based on technology acquired from ParAccel. It is built on an old version of PostgreSQL with 3 major enhancements:
- Columnar database - this type of database returns data by columns rather than whole rows. It has better performance for aggregating large sets of data, perfect for analytical querying.
- Sharding - Redshift supports data sharding, that is, partitioning the tables across different servers for better performance (a short schema sketch showing both of these features follows this list).
- Scalability - with everything running on the cloud, Redshift clusters can easily be resized up or down as needed.
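To ground the columnar and sharding points, here is a rough sketch of what a Redshift table definition might look like, run through psycopg2, the standard PostgreSQL driver. The endpoint, credentials, table, and column names are all made up: DISTKEY chooses the column rows are sharded on, while SORTKEY and the per-column ENCODE settings play to the columnar engine.

```python
# Hypothetical sketch: defining a Redshift table that leans on its
# columnar storage (SORTKEY, ENCODE) and sharding (DISTKEY) features.
# Connection details and table/column names are made up for illustration.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="analytics", user="admin", password="secret")

ddl = """
CREATE TABLE page_views (
    view_date   DATE      ENCODE delta,   -- per-column compression
    user_id     BIGINT,
    url         VARCHAR(2048),
    duration_ms INTEGER   ENCODE mostly16
)
DISTKEY (user_id)      -- shard rows across nodes by user_id
SORTKEY (view_date);   -- keep column blocks sorted for range scans
"""

with conn.cursor() as cur:
    cur.execute(ddl)
conn.commit()
```

Picking a DISTKEY that matches your most common join key keeps joined rows on the same node, which becomes relevant again in Round 2.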
Enough said; time to battle. Are you ready? Let’s get ready to rumble!
Round 1 - Scaling
The largest Redshift node comes with 16TB of storage and a
maximum of 100 nodes can be created. Therefore, if your Big Data goes
beyond 1.6PB, Redshift will not do. Also, when resizing an Amazon cluster, the data needs to be reshuffled amongst the machines. That can take several days and plenty of CPU power, slowing down your system's regular operations.
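To be fair, triggering the resize itself is a one-line API call; it is the reshuffling afterwards that eats the days. A minimal sketch using the boto3 SDK, with a made-up cluster identifier, node type, and target size:

```python
# Hypothetical sketch: resizing an existing Redshift cluster with boto3.
# The cluster identifier, node type, and node count are placeholders;
# during the resize the cluster goes read-only while data is redistributed.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.modify_cluster(
    ClusterIdentifier="my-analytics-cluster",
    NodeType="dw1.xlarge",   # placeholder node type
    NumberOfNodes=8,         # new cluster size
)
```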
Hadoop scales to as many petabytes as you want, all the more so on
the cloud. Scaling Hadoop doesn’t require reshuffling since new data
will simply be saved on the new machines. In case you do want to balance
the data, there is a rebalancer utility available.

First round goes to Hadoop!
Round 2 - Performance
According to several performance tests made by the Airbnb nerds, a Redshift 16-node cluster performed a lot faster than a Hive/Elastic MapReduce 44-node cluster. Another Hadoop vs. Amazon Redshift benchmark, made by FlyData, a data synchronization solution for Redshift, confirms that Redshift performs faster for terabytes of data.
Nonetheless, there are some constraints to Redshift’s super speed.
Certain Redshift maintenance tasks have limited resources, so procedures
like deleting old data could take a while. Although Redshift shards
data, it doesn’t do it optimally. You might end up joining data across
different nodes and miss out on the improved performance.

Hadoop still has some tricks up its utility belt. FlyData’s benchmark concludes that while Redshift performs faster for terabytes, Hadoop performs better for petabytes. Airbnb agrees, stating that Hadoop does a better job of running big joins over billions of rows. Unlike Redshift, Hadoop doesn’t have hard resource limitations for maintenance tasks. As for spreading data across nodes optimally, saving it in a hierarchical document format should do the trick. It may take extra work, but at least Hadoop has a solution.
We have a tie - Redshift wins for TBs, Hadoop for PBs
Round 3 - Pricing
This is a tricky one. Redshift’s pricing depends on the choice of region, node size, storage type (newly introduced),
and whether you work with on-demand or reserved resources. The $1,000/TB/year price only applies to a 3-year reserved XL node with 2TB of storage in US East (Northern Virginia). Working with the same node in the same region on-demand costs $3,723/TB/year, more than triple the price. Choosing an Asia Pacific region costs even more.
On-premise Hadoop is definitely more expensive. According to Accenture’s "Hadoop Deployment Comparison Study",
the total cost of ownership of a bare-metal Hadoop cluster with 24
nodes and 50 TB of HDFS is more than $21,000 per month. That’s about
$5,040/TB/year including maintenance and everything. However, it doesn’t
make sense to compare pears with pineapples; let’s compare Redshift
with Hadoop as a service.

Pricing for Hadoop as a service isn’t that clear, since it depends on how much juice you need. FlyData’s benchmark claims that running Hadoop via Amazon’s Elastic MapReduce is 10 times more expensive than Redshift. Using Hadoop on Amazon’s EC2 is a different story. Running a relatively low-cost m1.xlarge machine with 1.68 TB of storage for 3 years (heavy reserved billing) in the US East region costs about $124 per month, so that’s about $886/TB/year. Working on-demand, using SSD-drive machines, or choosing a different region increases prices.
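Since the per-TB figures above come from a bit of arithmetic, here is the back-of-the-envelope calculation spelled out, using only the numbers quoted in this round:

```python
# Back-of-the-envelope cost-per-TB-per-year calculations using the
# figures quoted above (rounded, so results are approximate).

# Bare-metal Hadoop (Accenture study): ~$21,000/month for 50 TB of HDFS.
bare_metal = 21000 * 12 / 50      # ~= $5,040 per TB per year

# Hadoop on EC2: one m1.xlarge with 1.68 TB at ~$124/month (3-year heavy reserved).
ec2_hadoop = 124 * 12 / 1.68      # ~= $886 per TB per year

# Redshift XL node (2 TB), US East: 3-year reserved vs. on-demand.
redshift_reserved = 1000          # $/TB/year, 3-year reserved
redshift_on_demand = 3723         # $/TB/year, on-demand

print(f"Bare-metal Hadoop : ${bare_metal:,.0f}/TB/year")
print(f"Hadoop on EC2     : ${ec2_hadoop:,.0f}/TB/year")
print(f"Redshift reserved : ${redshift_reserved:,}/TB/year")
print(f"Redshift on-demand: ${redshift_on_demand:,}/TB/year")
```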
No winner - it depends on your needs
Round 4 - Ease of Use
Redshift has automated tasks for data warehouse
administration and automatic backups to Amazon S3. Transitioning to
Redshift should be a piece of cake for PostgreSQL developers since they
can use the same queries and SQL clients that they’re used to.
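The "same queries and SQL clients" point is quite literal: Redshift speaks the PostgreSQL protocol (on port 5439), so an existing driver such as psycopg2 connects unchanged. A quick sketch, reusing the hypothetical table and endpoint from earlier:

```python
# Querying Redshift with a stock PostgreSQL driver; only the endpoint
# and port differ from any other PostgreSQL database. Names are made up.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret")

with conn.cursor() as cur:
    cur.execute("""
        SELECT view_date, COUNT(*) AS views
        FROM page_views
        GROUP BY view_date
        ORDER BY view_date;
    """)
    for view_date, views in cur.fetchall():
        print(view_date, views)
```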
Handling Hadoop, whether on the cloud or not, is trickier. Your
system administrators will need to learn Hadoop architecture and tools
and your developers will need to learn coding in Pig or MapReduce. Heck,
you might need to hire new staff with Hadoop expertise. There are Hadoop as a Service
solutions which save you from all that trouble (ahem); however, most
data warehouse devs and admins will find it easier to use Redshift.

Redshift takes the round
Round 5 - Data Format
When it comes to data format, Redshift is pretty strict. It only accepts flat text files in a fixed format such as CSV. On top of that, Redshift only supports certain data types: the serial data type, arrays, and XML are unsupported at the moment. Even newline characters need to be escaped, and Redshift doesn’t support multiple NULLs in your data either. This
means you’ll need to spend time converting your data before you can use
it with Redshift.
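As an illustration of the kind of conversion work involved, here is a rough sketch that flattens record-style data into a pipe-delimited text file, escaping embedded newlines and mapping Python None to an explicit NULL marker. The field names, delimiter, and NULL convention are made up; Redshift's COPY lets you declare the marker with its NULL AS option.

```python
# Hypothetical pre-processing step: turn record-style data into a flat,
# pipe-delimited text file that Redshift's COPY can ingest.
# Field names, the delimiter, and the NULL marker are illustrative choices.
import csv

records = [
    {"user_id": 1, "comment": "great\nproduct", "score": 5},
    {"user_id": 2, "comment": None, "score": 3},
]

def to_field(value):
    if value is None:
        return r"\N"                        # NULL marker; pair with COPY ... NULL AS
    return str(value).replace("\n", "\\n")  # escape embedded newlines

with open("comments.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="|")
    for rec in records:
        writer.writerow([to_field(rec[k]) for k in ("user_id", "comment", "score")])
```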
Hadoop accepts every data format and data type imaginable.

Winner: Hadoop
Round 6 - Data Storage
Redshift can only load data from Amazon S3 or DynamoDB.
Not only will you have to use more of Amazon’s services, but you’ll need
to spend extra time preparing and uploading the data. Redshift loads data via a single thread by default, so it could take some time to load. Amazon suggests S3 best practices to speed up the process, such as
splitting the data into multiple files, compressing them, using a
manifest file, etc. Moving the data to DynamoDB is of course a bigger headache, unless it’s already there.
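A rough sketch of what those best practices look like in code: gzip the pre-split part files, upload them to S3 with boto3, write a manifest listing every part, and then point a COPY at the manifest. The bucket, prefix, table name, and credential placeholders are all made up.

```python
# Hypothetical sketch of the S3 best practices mentioned above:
# compress the split parts, upload them, write a manifest, then COPY.
import gzip
import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-data-bucket", "page_views/2014-01-01"

# 1. Compress and upload each pre-split part file.
parts = ["part-000.txt", "part-001.txt"]
for part in parts:
    with open(part, "rb") as src:
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{part}.gz",
                      Body=gzip.compress(src.read()))

# 2. Write a manifest listing every part, so COPY loads them in parallel.
manifest = {"entries": [
    {"url": f"s3://{bucket}/{prefix}/{p}.gz", "mandatory": True} for p in parts
]}
s3.put_object(Bucket=bucket, Key=f"{prefix}/manifest",
              Body=json.dumps(manifest).encode())

# 3. The COPY to run through your SQL client (credentials elided).
copy_sql = f"""
COPY page_views
FROM 's3://{bucket}/{prefix}/manifest'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER '|' GZIP MANIFEST;
"""
print(copy_sql)
```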
Life is more flexible with Hadoop. You can store data on local
drives, in a relational database, or in the cloud (S3 included), and
then import it straight into the Hadoop cluster.

Another round for Hadoop
Round 7 - General
Being a columnar database, Redshift can’t, for instance, do any text analysis. Hadoop is open to all kinds of analysis via MapReduce, and to even more applications in version 2. Upon failure, say a file I/O error, Redshift moves on to the next piece of data without retrying; Hadoop retries in that case.
Hadoop wins again