Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments (Synthesis Lectures on Data Management)

✍ Scribed by Daniel C. M. de Oliveira, Ji Liu, Esther Pacitti


Publisher: MORGAN & CLAYPOOL
Year: 2019
Tongue: English
Leaves: 181
Series: Synthesis Lectures on Data Management
Category: Library

✦ Synopsis


Workflows may be defined as abstractions used to model the coherent flow of activities in the context of an in silico scientific experiment.

They are employed in many domains of science such as bioinformatics, astronomy, and engineering. Such workflows usually present a considerable number of activities and activations (i.e., tasks associated with activities) and may need a long time for execution. Due to the continuous need to store and process data efficiently (making them data-intensive workflows), high-performance computing environments allied to parallelization techniques are used to run these workflows. At the beginning of the 2010s, cloud technologies emerged as a promising environment to run scientific workflows. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines.

More recently, Data-Intensive Scalable Computing (DISC) frameworks (e.g., Apache Spark and Hadoop) and environments have emerged and are being used to execute data-intensive workflows. DISC environments are composed of processors and disks in large commodity computing clusters connected by high-speed communication switches and networks. The main advantage of DISC frameworks is that they support efficient in-memory data management for large-scale applications, such as data-intensive workflows. However, executing workflows in cloud and DISC environments raises many challenges, such as scheduling workflow activities and activations, managing produced data, and collecting provenance data.

Several existing approaches address these challenges. There is therefore a real need to understand how to manage these workflows and the various big data platforms that have been developed. This book can help researchers understand how linking workflow management with Data-Intensive Scalable Computing supports the understanding and analysis of scientific big data.

In this book, we aim to identify and distill the body of work on workflow management in clouds and DISC environments. We start by discussing the basic principles of data-intensive scientific workflows. Next, we present two workflows that are executed in single-site and multi-site clouds, taking advantage of provenance. Afterward, we move to workflow management in DISC environments and present, in detail, solutions that enable the optimized execution of workflows using frameworks such as Apache Spark and its extensions.

✦ Table of Contents


Preface
Acknowledgments
Overview
Motivating Examples
Montage
SciEvol
The Life Cycle of Cloud and DISC Workflows
Structure of the Book
Background Knowledge
Key Concepts
Workflow Formalism
Workflow Standards
Scientific Workflow Management Systems
Distributed Execution of Workflows
A Brief on Existing SWfMSs
Distributed Environments Used for Executing Workflows
Computing Clusters
Cloud Computing
Data-Intensive Scalable Computing Clusters
Apache Spark
Conclusion
Workflow Execution in a Single-Site Cloud
Bibliographic and Historical Notes
Early Work on Single-Site Virtual Machine Provisioning for Scientific Workflows
Early Work on Single-Site Workflow Scheduling
Chapter Goals and Contributions
Multi-Objective Cost Model
Single-Site Virtual Machine Provisioning (SSVP)
Problem Definition
SSVP Algorithm
SGreedy Scheduling Algorithm
Evaluating SSVP and SGreedy
Conclusion
Workflow Execution in a Multi-Site Cloud
Overview of Workflow Execution in a Multi-Site Cloud
Workflow Execution with a Multi-Site Cloud Platform
Direct Workflow Execution
Fine-Grained Workflow Execution
Using Distributed Data Management Techniques
Activation Scheduling in a Multi-Site Cloud
Coarse-Grained Workflow Execution with Multiple Objectives
Workflow Partitioning
Fragment Scheduling Algorithms
Performance Analysis
Conclusion
Workflow Execution in DISC Environments
Bibliographic and Historical Notes
Early Work on Fine Tuning Parameters of DISC Frameworks
Early Work on Provenance Capture in DISC Frameworks
Early Work on Scheduling and Data Placement Strategies in DISC Frameworks
Chapter Goals and Contributions
Fine Tuning of Spark Parameters
Problem Definition
SpaCE: A Spark Fine-Tuning Engine
Provenance Management in Apache Spark
Retrospective and Domain Provenance Manager
Prospective Provenance Manager
SAMbA-FS–Mapping File Contents into Main-Memory
Provenance Data Server
Evaluation of SAMbA
Scheduling Spark Workflows in DISC Environments
TARDIS Architecture
TARDIS Data Placement and Scheduling
Conclusion
Conclusion
Bibliography
Authors' Biographies


📜 SIMILAR VOLUMES


Cloud Computing: Data-Intensive Computin
✍ Frederic Magoules, Jie Pan, Fei Teng 📂 Library 📅 2012 🏛 CRC Press 🌐 English

As more and more data is generated at a faster-than-ever rate, processing large volumes of data is becoming a challenge for data analysis software. Addressing performance issues, Cloud Computing: Data-Intensive Computing and Scheduling explores the evolution of classical techniques and describes com

Cloud Computing for Data-Intensive Appli
✍ Xiaolin Li, Judy Qiu (eds.) 📂 Library 📅 2014 🏛 Springer-Verlag New York 🌐 English

This book presents a range of cloud computing platforms for data-intensive scientific applications. It covers systems that deliver infrastructure as a service, including: HPC as a service; virtual networks as a service; scalable and reliable storage; algorithms that manage vast cloud resources an

Data Analytics and Management in Data In
✍ Leonid Kalinichenko, Yannis Manolopoulos, Oleg Malkov, Nikolay Skvortsov, Sergey 📂 Library 📅 2018 🏛 Springer International Publishing 🌐 English

This book constitutes the refereed proceedings of the 19th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2017, held in Moscow, Russia, in October 2017. The 16 revised full papers presented together with three invited papers were carefully revie

Data Analytics and Management in Data In
✍ Leonid Kalinichenko and Sergei O. Kuznetsov 📂 Library 📅 2017 🏛 Springer 🌐 English

This book constitutes the refereed proceedings of the 18th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2016, held in Ershovo, Moscow, Russia, in October 2016. The 16 revised full papers presented together with one invited talk and two keyno

Data Intensive Storage Services for Clou
✍ Spyridon V. Gogouvitis, Dimosthenis P. Kyriazis 📂 Library 📅 2013 🏛 IGI Global 🌐 English

With the evolution of digitized data, our society has become dependent on services to extract valuable information and enhance decision making by individuals, businesses, and government in all aspects of life. Therefore, emerging cloud-based infrastructures for storage have been widely thought of