AWS EMR

AWS Elastic MapReduce (EMR): You have to have been living under a rock not to have heard the term big data. It’s a deceptively simple term for an unnervingly difficult problem:

In 2010, Google chairman Eric Schmidt noted that humans now create as much information in two days as all of humanity had created up to the year 2003. Moreover, the research firm IDC projects that the digital universe will reach 40 zettabytes (ZB) by 2020, a 50-fold increase from the beginning of 2010. In other words, there’s lots and lots of data, and its growth is accelerating. The challenge that big data presents is that most established data analytics tools can’t scale to manage datasets of the size that many companies want to analyze.

For one thing, traditional business intelligence and data warehousing tools (the terms are used so interchangeably that they’re often lumped together as BI/DW) are extremely expensive; applied to very large datasets, the bills soon reach national debt-type numbers. Humor aside, the established BI/DW tools have a more serious scalability shortcoming: They’re architected around a central analytics engine that reads data from disk, performs the analysis, and spits out results.

Today, data sizes are so huge that simply sending the data across the network to be analyzed takes too long for useful work to get done. By the time the data is transferred, the insights that can be gleaned from it are obsolete. Clearly, a new BI/DW analytics architecture and problem approach were called for, and for inspiration, the industry looked to Google. Google had implemented a different approach to handling data. Its architecture, MapReduce, is based on a simple insight: With so much data, it makes sense to move the processing to the data rather than attempt to move the data to the processing.

MapReduce takes a very large data store that may be spread across hundreds or thousands of machines, structures the data for the type of analysis you want to perform (that is, it maps the data into an analyzable format), and then filters the mapped data (reduces it, in other words) to isolate the information you want to examine.
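To make the pattern concrete, here is a minimal sketch of the classic word-count example in Python. It isn’t EMR-specific, and the function names and sample data are invented for illustration; it simply shows the map step emitting key-value pairs and the reduce step collapsing them into results.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: collapse all the counts for one word into a single total.
    return (word, sum(counts))

# Hypothetical in-memory stand-in for data spread across many machines.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase, then the shuffle/sort MapReduce performs between phases.
pairs = sorted(p for line in lines for p in mapper(line))

# Reduce phase: one reducer call per distinct key.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))

In a real MapReduce system, the map and reduce calls run in parallel on the machines that already hold the data; only the small intermediate pairs move across the network.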

Google treats its MapReduce implementation as proprietary, but, based on a paper it published, Doug Cutting implemented an open-source version of MapReduce called Hadoop. It’s no exaggeration to say that Hadoop has revolutionized the BI/DW industry. In fact, an entire ecosystem of complementary products exists to make Hadoop even more useful.

Analyze data with Amazon EMR

You’ve probably already cut to the chase and recognized a familiar refrain: Hadoop is useful but complex to install, configure, and manage. Gee, wouldn’t it be great if someone created an easy-to-use, cost-effective Hadoop solution that integrates with the existing ecosystem, allowing the established tools that complement Hadoop to be used with this service?

Yes, it would, and Amazon calls its Hadoop solution Elastic MapReduce (EMR). The concept is straightforward:

1. Identify the data source you want to analyze. This data is located in S3. EMR can handle petabytes (a petabyte is 1,000 terabytes) of data with no problem.

2. Tell EMR how many instances (and of what type) you want the EMR pool to contain. EMR can use EC2 standard instances or one of the more exotic types, such as high-I/O or high-CPU.

Each instance offers a certain amount of disk storage for running the Hadoop Distributed File System (HDFS). The total amount of data you want to analyze dictates the number of instances you require.

3. Set up an EMR job flow (a minimal API sketch of this life cycle follows the list). A job flow can be either of two types:

Streaming: You supply mapper and reducer programs written in a programming language; EMR runs them across the EC2 instances in the pool and the data they hold.

Query-oriented: A higher-level data warehouse tool, such as Hive (which provides a Structured Query Language-like interface), can be used to run interactive queries against the data. The output of either type can be stored in S3 and then used for further analysis without requiring an active job flow.

4. Continue running the job flow, running MapReduce programs or higher-level query languages against the data, until you’re finished with it. A job flow can then be terminated, which terminates all instances that make up the EMR pool.
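Here is a minimal sketch of that life cycle using boto3, the AWS SDK for Python. The region, bucket, release label, and instance sizing are placeholder assumptions for illustration, not recommendations; check the current EMR documentation for supported values.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Steps 1 and 2: point EMR at data in S3 and size the instance pool.
response = emr.run_job_flow(
    Name="demo-job-flow",                     # hypothetical name
    ReleaseLabel="emr-6.15.0",                # assumed release label
    LogUri="s3://my-bucket/emr-logs/",        # placeholder bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",    # assumed instance type
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                   # sized to your dataset
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the pool up between runs
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
job_flow_id = response["JobFlowId"]

# Step 4: when you're finished, terminate the whole pool.
emr.terminate_job_flows(JobFlowIds=[job_flow_id])

The KeepJobFlowAliveWhenNoSteps flag is what separates a pool you keep warm from one that terminates as soon as its steps finish, a trade-off discussed later in this section.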

Amazon manages the instances within the EMR pool. If an instance terminates unexpectedly, Amazon starts a new instance and ensures that it has the correct data on it to replace the terminated instance. And, of course, Amazon takes care of starting the EMR pool, connecting the instances to one another, and running MapReduce programs or providing higher-level tools for you to use for analysis.

AWS EMR supports these programming languages

Java, Ruby, Perl, Python, PHP, R, Bash, and C++. As for higher-level tools, Amazon provides a wide variety. In addition to Hive (just mentioned), Amazon also offers Pig (a specialized Hadoop language). Finally, if you want, you can use EMR to output data that can then be imported into a specialized analytics tool like (the curiously named) R.
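As a sketch of how one of those languages plugs in, the following call adds a Hadoop streaming step that runs a Python mapper and reducer stored in S3, reusing the emr client and job_flow_id from the earlier sketch. The step name, script names, and bucket paths are hypothetical.

# Assumes the `emr` client and `job_flow_id` from the earlier sketch.
emr.add_job_flow_steps(
    JobFlowId=job_flow_id,
    Steps=[{
        "Name": "python-streaming-word-count",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",        # EMR's generic step runner
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",  # lands in S3, per step 3
            ],
        },
    }],
)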

EMR is one service in which Amazon’s pay-only-for-what-you-use philosophy may not be optimal, because transferring and formatting very large datasets on the EMR EC2 instances may take a long time. When you end a job flow, the instances on which the EMR pool is running are terminated and the data is discarded. The next time you want to run an analysis, you have to rebuild the EMR pool.

So you need to establish a trade-off, balancing the cost of keeping your EMR pool up and running against the cost of rebuilding it. Clearly, if you plan to run multiple analyses over time against a data pool, it probably makes sense to keep your job flow active.
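Whether to keep the pool warm comes down to simple arithmetic. Here is a rough sketch with entirely hypothetical hourly prices and rebuild times plugged in; substitute your own instance pricing and measured data-load times.

# All figures are invented for illustration; use your own numbers.
pool_cost_per_hour = 4 * 0.192  # 4 hypothetical m5.xlarge instances
rebuild_hours = 6.0             # time to reload and format the dataset
idle_gap_hours = 48.0           # expected gap until the next analysis

rebuild_cost = rebuild_hours * pool_cost_per_hour
keep_alive_cost = idle_gap_hours * pool_cost_per_hour

print(f"Rebuild: ${rebuild_cost:.2f}, keep alive: ${keep_alive_cost:.2f}")
# Under this simple model, keeping the pool alive wins only when the idle
# gap between analyses is shorter than the rebuild time itself.
print("keep warm" if keep_alive_cost < rebuild_cost else "terminate and rebuild")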

One interesting characteristic of EMR is that it differs from the other platform services I’ve already described. The others are “helper” services: useful services that help you build better applications more quickly. By contrast, EMR represents a standalone application that’s not intended to support an application the user is writing. Another example of this type of non-helper, stand-alone application is Redshift, covered next. I expect that you’ll see more of these stand-alone applications, for these reasons:

Its serious reputation: Amazon feels that AWS is now accepted as a serious IT player, and IT organizations are willing to trust it with important use cases. The company is now ready to branch out into areas that provide more direct user benefits in addition to its established infrastructure components that enable users to build their own applications.

The opportunity to expand: Amazon perceives many application domains as ripe for automation and commoditization. As it provides offerings in these domains, its users increasingly benefit, and AWS can become more useful to them, thereby cementing its place as a critical part of their IT environments.

Strategic pricing: AWS recognizes that the high price of current offerings in these application domains prevents many potential users from taking advantage of them; its offerings democratize access to these domains. I’ll let you decide whether Amazon is acting purely altruistically in this regard, or perhaps with an element of self-interest.