Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments (Synthesis Lectures on Data Management)

✍ Authors: Daniel C. M. de Oliveira, Ji Liu, Esther Pacitti

Publisher: MORGAN & CLAYPOOL
Year: 2019
Language: English
Pages: 181
Series: Synthesis Lectures on Data Management
Category: Library

✦ Synopsis

Workflows may be defined as abstractions used to model the coherent flow of activities in the context of an in silico scientific experiment.

They are employed in many domains of science, such as bioinformatics, astronomy, and engineering. Such workflows usually comprise a considerable number of activities and activations (i.e., tasks associated with activities) and may take a long time to execute. Because of the continuous need to store and process data efficiently (which makes them data-intensive workflows), high-performance computing environments combined with parallelization techniques are used to run these workflows. At the beginning of the 2010s, cloud technologies emerged as a promising environment to run scientific workflows. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines.
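The relationship between activities and activations described above can be illustrated with a minimal Python sketch. It is not taken from the book: the `Activity` class, the `activations` and `execution_order` helpers, and the three activity names are all hypothetical, and the sketch assumes one activation per input data item.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Activity:
    """A workflow step (hypothetical model, for illustration only)."""
    name: str
    depends_on: list = field(default_factory=list)


def activations(activity, inputs):
    """Assumption: each input data item yields one activation (task)."""
    return [(activity.name, item) for item in inputs]


def execution_order(activities):
    """Order activities so that each one runs after its dependencies."""
    graph = {a.name: set(a.depends_on) for a in activities}
    return list(TopologicalSorter(graph).static_order())


# A hypothetical three-step workflow forming a linear chain.
workflow = [
    Activity("align"),
    Activity("build_tree", depends_on=["align"]),
    Activity("analyze", depends_on=["build_tree"]),
]

order = execution_order(workflow)
tasks = activations(workflow[0], ["seq1.fasta", "seq2.fasta", "seq3.fasta"])
```

Under this assumption, an activity fed with thousands of input files produces thousands of independent activations, which is what makes parallel execution on many virtual machines attractive.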

More recently, Data-Intensive Scalable Computing (DISC) frameworks (e.g., Apache Spark and Hadoop) and environments have emerged and are being used to execute data-intensive workflows. DISC environments are composed of processors and disks in large commodity computing clusters connected by high-speed communication switches and networks. The main advantage of DISC frameworks is that they support efficient in-memory data management for large-scale applications, such as data-intensive workflows. However, the execution of workflows in cloud and DISC environments raises many challenges, such as scheduling workflow activities and activations, managing produced data, and collecting provenance data.

Several existing approaches address these challenges. There is therefore a real need to understand how to manage these workflows and the various big data platforms that have been developed and introduced. This book can help researchers understand how linking workflow management with Data-Intensive Scalable Computing supports the understanding and analysis of scientific big data.

In this book, we aim to identify and distill the body of work on workflow management in clouds and DISC environments. We start by discussing the basic principles of data-intensive scientific workflows. Next, we present two workflows that are executed in single-site and multi-site clouds, taking advantage of provenance. Afterward, we turn to workflow management in DISC environments and present, in detail, solutions that enable the optimized execution of workflows using frameworks such as Apache Spark and its extensions.

✦ Table of Contents

Preface
Acknowledgments
Overview
Motivating Examples
Montage
SciEvol
The Life Cycle of Cloud and DISC Workflows
Structure of the Book
Background Knowledge
Key Concepts
Workflow Formalism
Workflow Standards
Scientific Workflow Management Systems
Distributed Execution of Workflows
A Brief on Existing SWfMSs
Distributed Environments Used for Executing Workflows
Computing Clusters
Cloud Computing
Data-Intensive Scalable Computing Clusters
Apache Spark
Conclusion
Workflow Execution in a Single-Site Cloud
Bibliographic and Historical Notes
Early Work on Single-Site Virtual Machine Provisioning for Scientific Workflows
Early Work on Single-Site Workflow Scheduling
Chapter Goals and Contributions
Multi-Objective Cost Model
Single-Site Virtual Machine Provisioning (SSVP)
Problem Definition
SSVP Algorithm
SGreedy Scheduling Algorithm
Evaluating SSVP and SGreedy
Conclusion
Workflow Execution in a Multi-Site Cloud
Overview of Workflow Execution in a Multi-Site Cloud
Workflow Execution with a Multi-Site Cloud Platform
Direct Workflow Execution
Fine-Grained Workflow Execution
Using Distributed Data Management Techniques
Activation Scheduling in a Multi-Site Cloud
Coarse-Grained Workflow Execution with Multiple Objectives
Workflow Partitioning
Fragment Scheduling Algorithms
Performance Analysis
Conclusion
Workflow Execution in DISC Environments
Bibliographic and Historical Notes
Early Work on Fine Tuning Parameters of DISC Frameworks
Early Work on Provenance Capture in DISC Frameworks
Early Work on Scheduling and Data Placement Strategies in DISC Frameworks
Chapter Goals and Contributions
Fine Tuning of Spark Parameters
Problem Definition
SpaCE: A Spark Fine-Tuning Engine
Provenance Management in Apache Spark
Retrospective and Domain Provenance Manager
Prospective Provenance Manager
SAMbA-FS: Mapping File Contents into Main-Memory
Provenance Data Server
Evaluation of SAMbA
Scheduling Spark Workflows in DISC Environments
TARDIS Architecture
TARDIS Data Placement and Scheduling
Conclusion
Conclusion
Bibliography
Authors' Biographies

📜 SIMILAR VOLUMES

📁 Cloud Computing: Data-Intensive Computing and Scheduling

✍ Frederic Magoules, Jie Pan, Fei Teng 📂 Library 📅 2012 🏛 CRC Press 🌐 English

As more and more data is generated at a faster-than-ever rate, processing large volumes of data is becoming a challenge for data analysis software. Addressing performance issues, Cloud Computing: Data-Intensive Computing and Scheduling explores the evolution of classical techniques and describes com

📁 CLOUD COMPUTING: data-intensive computing and scheduling

✍ Frederic Magoules, Jie Pan, Fei Teng 📂 Library 📅 2019 🏛 CRC Press 🌐 English

📁 Cloud Computing for Data-Intensive Applications

✍ Xiaolin Li, Judy Qiu (eds.) 📂 Library 📅 2014 🏛 Springer-Verlag New York 🌐 English

This book presents a range of cloud computing platforms for data-intensive scientific applications. It covers systems that deliver infrastructure as a service, including: HPC as a service; virtual networks as a service; scalable and reliable storage; algorithms that manage vast cloud resources an

📁 Data Analytics and Management in Data Intensive Domains

✍ Leonid Kalinichenko, Yannis Manolopoulos, Oleg Malkov, Nikolay Skvortsov, Sergey 📂 Library 📅 2018 🏛 Springer International Publishing 🌐 English

This book constitutes the refereed proceedings of the 19th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2017, held in Moscow, Russia, in October 2017. The 16 revised full papers presented together with three invited papers were carefully revie

📁 Data Analytics and Management in Data Intensive Domains

✍ Leonid Kalinichenko and Sergei O. Kuznetsov 📂 Library 📅 2017 🏛 Springer 🌐 English

This book constitutes the refereed proceedings of the 18th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2016, held in Ershovo, Moscow, Russia, in October 2016. The 16 revised full papers presented together with one invited talk and two keyno

📁 Data Intensive Storage Services for Cloud Environments

✍ Spyridon V. Gogouvitis, Dimosthenis P. Kyriazis 📂 Library 📅 2013 🏛 IGI Global 🌐 English

With the evolution of digitized data, our society has become dependent on services to extract valuable information and enhance decision making by individuals, businesses, and government in all aspects of life. Therefore, emerging cloud-based infrastructures for storage have been widely thought of