Amazon launched the much-awaited Redshift at AWS re:Invent last year. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service. It is a Massively Parallel Processing (MPP) system, and it complements and works well with other AWS products such as EMR and DynamoDB. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte, and it is simple and cost effective. The most attractive part of Amazon Redshift is its promise to cost less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions currently on the market.
The AWS team customized Amazon Redshift to deliver fast queries and good I/O performance for virtually any size of dataset by using columnar storage technology and by parallelizing and distributing queries across multiple nodes. They also made Amazon Redshift easy to use by automating most of the common administrative tasks associated with provisioning, configuring, monitoring, backing up, and securing a data warehouse. This automation in turn saves much of the costly labor involved in maintaining and operating an MPP system.
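To get an intuition for why columnar storage speeds up analytic queries, here is a minimal, hypothetical Python sketch (this is an illustration of the general technique, not Redshift's actual internals): in a column-oriented layout, an aggregate over one column only has to touch that column's values, while a row store must walk every full row.

```python
# Hypothetical illustration of row vs. columnar layout (not Redshift internals).

# Row store: each record is kept whole.
rows = [
    {"user_id": 1, "region": "us-east", "revenue": 10.0},
    {"user_id": 2, "region": "eu-west", "revenue": 20.0},
    {"user_id": 3, "region": "us-east", "revenue": 30.0},
]

# Summing one column still means scanning every complete row.
row_total = sum(r["revenue"] for r in rows)

# Column store: the same table kept as one contiguous list per column.
columns = {
    "user_id": [1, 2, 3],
    "region": ["us-east", "eu-west", "us-east"],
    "revenue": [10.0, 20.0, 30.0],
}

# The same aggregate now reads only the "revenue" values; the other
# columns never need to be fetched from disk at all.
col_total = sum(columns["revenue"])

assert row_total == col_total == 60.0
```

On disk, this is the difference between reading the whole table and reading one column's worth of (typically well-compressed) bytes, which is where much of the I/O advantage for warehouse-style queries comes from.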
Recently the AK team analysed and tested Amazon Redshift against its promise and published a detailed report on their blog. They tested a variety of Amazon Redshift nodes with data sets ranging from 80GB to 2.4TB in size, equating to 2 billion to 57 billion rows in the main fact table.
Some of the points observed by the AK team were:
- Starting an Amazon Redshift cluster, regardless of size and instance type, took between 3 and 20 minutes.
- Each Amazon Redshift xlarge node could load about 3.17MB/sec of compressed S3 data (about 78k rows/sec), and each 8xlarge node could load about 23.8MB/sec (about 584k rows/sec).
- The overall COPY speed scaled linearly with cost and with data size, measured over hundreds of chunked loads.
- For loading, the AK team observed linear scaling per dollar, at just a bit worse than the 8:1 price ratio between the small and big nodes.
- Backing up a 4-5TB Amazon Redshift cluster reliably took 2-3 hours; they were getting about 400MB/sec to S3 from their 2-node 8xlarge / 16-node xlarge clusters.
- When resizing from 2-node 8xlarge/16-node xlarge to 4-node 8xlarge/32-node xlarge clusters, the AK team saw an effective transfer rate of about 175MB/sec. (They computed this by taking the amount of space Amazon Redshift reported using on disk and dividing it by the time between hitting the resize button and being able to run the first query on the new cluster.) For their data sets, that meant 6-7 hours of degraded query performance (but not downtime!) followed by a seamless transition.
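The per-node load rates above make it easy to estimate COPY times for your own data. Here is a small back-of-the-envelope Python sketch using the AK team's measured figures; the 500GB dataset size and the `load_hours` helper are hypothetical examples, and it assumes (as the AK team observed) that throughput scales linearly with node count.

```python
# Back-of-the-envelope COPY time estimates from the rates reported above.
# The per-node MB/sec figures are the AK team's measurements; the dataset
# size below is a hypothetical example.

XLARGE_MB_PER_SEC = 3.17    # compressed S3 data per xlarge node
EIGHTXL_MB_PER_SEC = 23.8   # compressed S3 data per 8xlarge node

def load_hours(dataset_gb, nodes, mb_per_sec_per_node):
    """Estimated wall-clock hours to COPY a compressed dataset,
    assuming load throughput scales linearly with node count."""
    total_mb = dataset_gb * 1024
    seconds = total_mb / (nodes * mb_per_sec_per_node)
    return seconds / 3600

# A hypothetical 500GB compressed dataset:
print(round(load_hours(500, 16, XLARGE_MB_PER_SEC), 1))  # 16 xlarge nodes -> ~2.8
print(round(load_hours(500, 2, EIGHTXL_MB_PER_SEC), 1))  # 2 8xlarge nodes -> ~3.0
```

The near-identical estimates for the two cluster shapes line up with the roughly linear scaling per dollar the AK team reported.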
I enjoyed reading this excellent analysis, and the article is a must-read for Amazon Redshift users. You can find the original article at
- Thanks to the AK team for the detailed analysis.