𝔖 Bobbio Scriptorium
✦   LIBER   ✦

Performance Modeling in MapReduce Environments. In Proceedings of the Second Joint WOSP/SIPEW International Conference on Performance Engineering (ICPE '11), Karlsruhe, Germany, March 14–16, 2011. ACM Press.

✍ Scribed by Ludmila Cherkasova


Book ID
121404160
Publisher
ACM Press
Year
2011
Weight
286 KB
Category
Article
ISBN
1450305199

No coin nor oath required. For personal study only.

✦ Synopsis


Unstructured data is the largest and fastest-growing portion of most enterprises' assets, often representing 70% to 80% of online data. This steep increase in the volume of information being produced often exceeds the capabilities of existing commercial databases. MapReduce and its open-source implementation Hadoop offer an economically compelling alternative: an efficient distributed computing platform for handling large volumes of data and mining petabytes of unstructured information. It is increasingly used across the enterprise for advanced data analytics, business intelligence, and new applications associated with data retention, regulatory compliance, e-discovery, and litigation.

However, setting up a dedicated Hadoop cluster requires a significant capital expenditure that can be difficult to justify. Cloud computing offers a compelling alternative, allowing users to rent resources in a "pay-as-you-go" fashion. For example, the list of offered Amazon Web Services includes a MapReduce environment for rent. This is an attractive and cost-efficient option for many users, because acquiring and maintaining a complex, large-scale infrastructure is a difficult and expensive undertaking.

One of the open questions in such environments is how many resources a user should lease from the service provider. Currently, there is no available methodology for easily answering this question, and the task of estimating the resources required to meet application performance goals is solely the user's responsibility. Users need to perform adequate application testing, performance evaluation, and capacity-planning estimation, and then request the appropriate amount of resources from the service provider.

To address these problems we need to understand: "What do we need to know about a MapReduce job in order to build an efficient and accurate modeling framework? Can we extract a representative job profile that reflects a set of critical performance characteristics of the underlying application during all job execution phases, i.e., the map, shuffle, sort, and reduce phases? What metrics should be included in the job profile?"

We discuss a profiling technique for MapReduce applications that aims to construct a compact job profile comprised of performance invariants that are independent of the amount of resources assigned to the job (i.e., the size of the Hadoop cluster) and the size of the input dataset. The challenge is how to accurately predict application performance in a large production environment and when processing large datasets.
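To make the idea of a "compact job profile of performance invariants" concrete, here is a minimal sketch of what such a profile and a prediction step might look like. All names (`PhaseProfile`, `job_bounds`) and the metric choices are hypothetical illustrations, not the paper's actual model; the phase names follow the abstract, and the completion-time estimate uses the classic greedy-scheduling makespan bounds (for n tasks on k slots: lower bound n·avg/k, upper bound (n−1)·avg/k + max), a standard result that a bounds-based framework could plug such invariants into.

```python
# Hypothetical sketch of a compact MapReduce job profile built from
# per-phase invariants (average and maximum task durations), plus a
# bounds-based completion-time estimate for a cluster of a given size.
# Names and metrics are illustrative assumptions, not the paper's model.
from dataclasses import dataclass


@dataclass
class PhaseProfile:
    """Performance invariants for one execution phase (map/shuffle/sort/reduce)."""
    avg_dur: float  # average task duration in this phase (seconds)
    max_dur: float  # maximum task duration in this phase (seconds)


def phase_bounds(profile: PhaseProfile, n_tasks: int, n_slots: int) -> tuple[float, float]:
    """Greedy-scheduling bounds on phase completion time.

    lower = n * avg / k          (perfectly balanced schedule)
    upper = (n - 1) * avg / k + max   (worst-case greedy assignment)
    """
    lower = n_tasks * profile.avg_dur / n_slots
    upper = (n_tasks - 1) * profile.avg_dur / n_slots + profile.max_dur
    return lower, upper


def job_bounds(profile: dict[str, PhaseProfile],
               tasks: dict[str, int],
               slots: dict[str, int]) -> tuple[float, float]:
    """Sum per-phase bounds to bound total job completion time."""
    lo = hi = 0.0
    for phase, p in profile.items():
        l, u = phase_bounds(p, tasks[phase], slots[phase])
        lo += l
        hi += u
    return lo, hi
```

Because the invariants are (by assumption) independent of cluster and input size, the same profile can be re-evaluated with different `tasks` and `slots` counts to explore how many resources to lease, e.g. `job_bounds(profile, tasks, slots)` for several candidate cluster sizes.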


📜 SIMILAR VOLUMES