Data Deduplication for High Performance Storage System
- Author: Dan Feng
- Publisher: Springer
- Year: 2022
- Language: English
- Pages: 170
- Category: Library
Synopsis
This book comprehensively introduces data deduplication technologies for storage systems. It first presents an overview of data deduplication, including its theoretical basis, basic workflow, application scenarios, and key technologies. The book then examines each key deduplication technology in turn, tracing its evolution over the years across chunking algorithms, indexing schemes, fragmentation reduction and rewriting algorithms, and security solutions; both state-of-the-art solutions and newly proposed ones are elaborated. At the end of the book, the author discusses the fundamental trade-offs among deduplication design choices and proposes an open-source deduplication prototype. With its fundamental theories and complete survey, the book can guide beginners, students, and practitioners working on data deduplication in storage systems, and it provides a compact reference on key data deduplication technologies for researchers developing high-performance storage solutions.
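The chunk-level deduplication workflow the synopsis summarizes (content-defined chunking, chunk fingerprinting, and duplicate lookup against a fingerprint index) can be illustrated with a minimal Python sketch. This is not the book's implementation: the Gear-style random table, the size parameters, and the use of SHA-1 fingerprints are assumptions chosen for brevity.

```python
import hashlib
import random

# Assumed 256-entry random lookup table, as used by Gear-style rolling hashes.
random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]

def gear_chunks(data, min_size=2048, avg_bits=13, max_size=65536):
    """Split `data` into variable-size chunks at content-defined boundaries.

    A boundary is declared when the low `avg_bits` bits of the rolling
    Gear hash are all zero, giving an expected chunk size of 2**avg_bits;
    min_size and max_size bound the chunk-size distribution.
    """
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # Gear rolling hash
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0  # reset hash at each chunk boundary
    if start < len(data):
        chunks.append(data[start:])  # final partial chunk
    return chunks

def deduplicate(data, index=None):
    """Fingerprint each chunk and store only chunks not yet in the index."""
    index = {} if index is None else index
    recipe = []
    for chunk in gear_chunks(data):
        fp = hashlib.sha1(chunk).hexdigest()
        index.setdefault(fp, chunk)  # store chunk only on first sight
        recipe.append(fp)            # file recipe: ordered fingerprints
    return recipe, index
```

Because chunk boundaries depend only on content, a repeated backup of the same data produces the same fingerprints and adds nothing to the chunk index; the original data can be reconstructed by replaying the recipe against the index.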
Table of Contents
Preface
Contents
Abbreviations
Chapter 1: Deduplication: Beginning from Data Backup System
1.1 Background
1.1.1 Development of Backup System
1.1.2 Features of Backup System
1.2 Deduplication in Backup Systems
1.2.1 Large-Scale Redundant Data
1.2.2 Why Deduplication?
1.3 Deduplication-Based Backup System
1.4 Concluding Remarks
Chapter 2: Overview of Data Deduplication
2.1 The Principle and Methods of Data Deduplication
2.1.1 File-Level and Chunk-Level Deduplication
2.1.2 Local and Global Deduplication
2.1.3 Online and Offline Deduplication
2.1.4 Source-Based and Target-Based Deduplication
2.2 Basic Workflow of Data Deduplication
2.2.1 Workflow of Chunk-Level Deduplication
2.2.2 Procedure of Data Deduplication
2.3 Application Scenarios of Data Deduplication
2.3.1 Deduplication for Secondary Storage
2.3.2 Deduplication for Primary Storage
2.3.3 Deduplication for Cloud Storage
2.3.4 Deduplication for Solid State Storage
2.3.5 Deduplication in Network Environments
2.3.6 Deduplication for Virtual Machines
2.4 Key Technologies of Data Deduplication
2.5 Concluding Remarks
Chapter 3: Chunking Algorithms
3.1 Existing Chunking Algorithm
3.1.1 Typical Content-Defined Chunking Algorithm
3.2 Asymmetric Extremum CDC Algorithm
3.2.1 The AE Algorithm Design
3.2.2 The Optimized AE Algorithm
3.2.3 Properties of the AE Algorithm
3.2.4 Performance Evaluation
Deduplication Efficiency
Chunking Throughput
3.3 FastCDC: A Fast and Efficient CDC Approach
3.3.1 Limitation of Gear-Based Chunking Algorithm
3.3.2 FastCDC Overview
3.3.3 Optimizing Hash Judgment
3.3.4 Cut-Point Skipping
3.3.5 Normalized Chunking
3.3.6 The FastCDC Algorithm Design
3.3.7 Performance Evaluation
3.4 Concluding Remarks
Chapter 4: Indexing Schemes
4.1 Correlated Techniques of Indexing Scheme
4.1.1 DDFS: Data Domain File System
4.1.2 Extreme Binning
4.2 Performance Bottleneck and Exploiting File Characteristics
4.2.1 Low Throughput and High RAM Usage
4.2.2 Characteristics of the File Size
4.3 Design and Implementation of SiLo
4.3.1 Similarity Algorithm
4.3.2 Locality Approach
4.3.3 SiLo Workflow
4.4 Performance Evaluation
4.4.1 Duplicate Elimination
4.4.2 RAM Usage for Deduplication Indexing
4.4.3 Deduplication Throughput
4.5 Concluding Remarks
Chapter 5: Rewriting Algorithms
5.1 Development of Defragmentation Algorithm
5.2 HAR: History-Aware Rewriting Algorithm
5.2.1 Fragmentation Classification
5.2.2 Inheritance of Sparse Containers
5.2.3 History-Aware Rewriting Algorithm
5.2.4 Optimal Restore Cache
5.2.5 A Hybrid Scheme
5.2.6 Performance Evaluation
5.3 A Causality-Based Deduplication Performance Booster
5.3.1 File Causality of Backup Datasets
5.3.2 CABdedup Architecture
5.3.3 Exploring and Exploiting Causality Information
5.3.4 Performance Evaluation
Experimental Setup
5.4 Concluding Remarks
Chapter 6: Secure Deduplication
6.1 Progress of Secure Deduplication
6.1.1 Convergent Encryption or Message-Locked Encryption
6.1.2 DupLESS and ClearBox
6.2 Privacy Risks and Performance Cost
6.3 Design of SecDep
6.3.1 System Architecture
6.3.2 UACE: User-Aware Convergent Encryption
6.3.3 MLK: Multi-Level Key Management
6.4 Security Analysis
6.4.1 Confidentiality of Data
6.4.2 Security of Keys and SecDep
6.5 Performance Evaluation
6.5.1 Experiment Setup
6.5.2 Deduplication Ratio and Backup Time
6.5.3 Space and Computation Overheads
6.6 Concluding Remarks
Chapter 7: Post-deduplication Delta Compression Schemes
7.1 Post-Deduplication Delta Compression Techniques
7.2 Ddelta: A Deduplication-Inspired Fast Delta Compression
7.2.1 Computation Overheads of Delta Compression
7.2.2 Deduplication-Inspired Delta Compression
7.2.3 Gear-Based Fast Chunking
7.2.4 Greedily Scanning the Duplicate-Adjacent Areas
7.2.5 Encoding and Decoding
7.2.6 The Detail of Ddelta Scheme
7.2.7 Performance Evaluation
7.3 A Deduplication-Aware Elimination Scheme
7.3.1 Similarity in Storage System
7.3.2 Architecture Overview
7.3.3 Duplicate-Adjacency-Based Resemblance Detection Approach
7.3.4 Improved Super-Feature Approach
7.3.5 Delta Compression
7.3.6 The Detail of DARE Scheme
7.3.7 Performance Evaluation
7.4 Concluding Remarks
Chapter 8: The Framework of Data Deduplication
8.1 In-Line Data Deduplication Space
8.2 Framework Architecture
8.2.1 Backup Pipeline
8.2.2 Restore Pipeline
8.2.3 Garbage Collection
8.3 Performance Evaluation
8.3.1 Experimental Setup
8.3.2 Metrics and Our Goal
8.3.3 Exact Deduplication
8.3.4 Near-Exact Deduplication Exploiting Physical Locality
8.3.5 Near-Exact Deduplication Exploiting Logical Locality
8.3.6 Rewriting Algorithm and Its Interplay
8.3.7 Storage Cost
8.4 Design Recommendation
8.5 Concluding Remarks
References
Similar Volumes
This book introduces fundamentals and trade-offs of data de-duplication techniques. It describes novel emerging de-duplication techniques that remove duplicate data both in storage and network in an efficient and effective manner. It explains places where duplicate data are originated, and pro…

Despite the significant ongoing work in the development of new database systems, many of the basic architectural and performance tradeoffs involved in their design have not previously been explored in a systematic manner. The designers of the various systems have adopted a wide range of strategie…

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing covers the coding, reliability and performance of popular RAID organizations: RAID1 mirrored disks and RAID5/6/7 1/2/3-disk failure tolerant (1/2/3DFT) arrays. Readers will learn about the storage of fil…

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into s…

With the massive amount of data produced and stored each year, reliable storage and retrieval of information is more crucial than ever. Robust coding and decoding techniques are critical for correcting errors and maintaining data integrity. Comprising chapters thoughtfully selected from the highly p…