Data Profiling (Synthesis Lectures on Data Management)

✍ Scribed by Ziawasch Abedjan, Lukasz Golab, Felix Naumann

Publisher: Morgan & Claypool Publishers
Year: 2018
Language: English
Pages: 156
Series: Synthesis Lectures on Data Management
Category: Library

✦ Synopsis

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More complex types of metadata are statements about multiple columns and their correlation, such as candidate keys, functional dependencies, and other types of dependencies.
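As a rough illustration of the metadata named above, the following sketch computes a few single-column statistics and checks one candidate key and one functional dependency on a tiny in-memory table. The table `ROWS`, its column names, and the helper functions `column_profile`, `is_unique`, and `holds_fd` are hypothetical and chosen only for this example; they are not taken from the book, and the naive checks shown here are not the discovery algorithms the book surveys.

```python
from collections import Counter

# Hypothetical toy table; column names and values are illustrative only.
ROWS = [
    {"id": 1, "zip": "10115", "city": "Berlin", "phone": None},
    {"id": 2, "zip": "10115", "city": "Berlin", "phone": "555-0101"},
    {"id": 3, "zip": "14467", "city": "Potsdam", "phone": ""},
    {"id": 4, "zip": "14467", "city": "Potsdam", "phone": "555-0102"},
]


def column_profile(rows, column):
    """Simple single-column metadata: counts, distinct values, types."""
    values = [row[column] for row in rows]
    # Assumption: both None and the empty string count as "missing".
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "nulls_or_empty": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        "top_values": Counter(present).most_common(3),
    }


def is_unique(rows, columns):
    """True if no two rows agree on all of the given columns.

    Assumption: None is compared like any other value (one of several
    possible null semantics).
    """
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds on the rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs key;
        # any later row with a different rhs value violates the dependency.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


if __name__ == "__main__":
    for col in ("id", "zip", "city", "phone"):
        print(col, column_profile(ROWS, col))
    print("id is a candidate key:", is_unique(ROWS, ["id"]))
    print("zip -> city holds:", holds_fd(ROWS, ["zip"], "city"))
```

Validating a single given candidate, as above, takes one pass over the data. Discovering all keys or dependencies is much harder because the number of candidate column combinations grows exponentially with the number of columns.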

This book provides a classification of the various types of profilable metadata, discusses popular data profiling tasks, and surveys state-of-the-art profiling algorithms. While most of the book focuses on tasks and algorithms for relational data profiling, we also briefly discuss systems and techniques for profiling non-relational data such as graphs and text. We conclude with a discussion of data profiling challenges and directions for future work in this area.

✦ Table of Contents

Preface
Acknowledgments
Discovering Metadata
Motivation and Overview
Data Profiling and Data Mining
Use Cases
Organization of This Book
Data Profiling Tasks
Single-Column Analysis
Dependency Discovery
Relaxed Dependencies
Single-Column Analysis
Cardinalities
Value Distributions
Data Types, Patterns, and Domains
Data Completeness
Approximate Statistics
Summary and Discussion
Dependency Discovery
Dependency Definitions
Functional Dependencies
Unique Column Combinations
Inclusion Dependencies
Search Space and Data Structures
Lattices and Search Space Sizes
Position List Indexes and Search Space Validation
Search Complexity
Null Semantics
Discovering Unique Column Combinations
Gordian
HCA
Ducc
HyUCC
Swan
Discovering Functional Dependencies
Tane
Fun
FD_Mine
Dfd
Dep-Miner
FastFDs
Fdep
HyFD
Discovering Inclusion Dependencies
SQL-Based IND Validation
B&B
DeMarchi
Binder
Spider
S-IndD
Sindy
Mind
Find2
ZigZag
Mind2
Relaxed and Other Dependencies
Relaxing the Extent of a Dependency
Partial Dependencies
Conditional Dependencies
Relaxing Attribute Comparisons
Metric and Matching Dependencies
Order and Sequential Dependencies
Approximating the Dependency Discovery
Generalizing Functional Dependencies
Denial Constraints
Multivalued Dependencies
Use Cases
Data Exploration
Schema Engineering
Data Cleaning
Query Optimization
Data Integration
Profiling Non-Relational Data
XML
RDF
Time Series
Graphs
Text
Data Profiling Tools
Research Prototypes
Commercial Tools
Data Profiling Challenges
Functional Challenges
Profiling Dynamic Data
Interactive Profiling
Profiling for Integration
Interpreting Profiling Results
Non-Functional Challenges
Efficiency and Scalability
Profiling on New Architectures
Benchmarking Profiling Methods
Conclusions
Bibliography
Authors' Biographies

📜 SIMILAR VOLUMES

📁 Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments (Synthesis Lectures on Data Management)

✍ Daniel C. M. de Oliveira, Ji Liu, Esther Pacitti 📂 Library 📅 2019 🏛 Morgan & Claypool 🌐 English

Workflows may be defined as abstractions used to model the coherent flow of activities in the context of an in silico scientific experiment. They are employed in many domains of science such as bioinformatics, astronomy, and engineering. Such workflows usually present a consid

📁 Web Page Recommendation Models: Theory and Algorithms (Synthesis Lectures on Data Management)

✍ Sule Gunduz-Oguducu 📂 Library 📅 2010 🏛 Morgan and Claypool Publishers 🌐 English

One of the application areas of data mining is the World Wide Web (WWW or Web), which serves as a huge, widely distributed, global information service for every kind of information such as news, advertisements, consumer information, financial management, education, government, e-commerce, health ser

📁 Replicated Data Management for Mobile Computing (Synthesis Lectures on Mobile and Pervasive Computing)

✍ Douglas B. Terry 📂 Library 📅 2008 🏛 Morgan and Claypool Publishers 🌐 English

Managing data in a mobile computing environment invariably involves caching or replication. In many cases, a mobile device has access only to data that is stored locally, and much of that data arrives via replication from other devices, PCs, and services. Given portable devices with limited re

📁 Ontology Engineering (Synthesis Lectures on Data, Semantics, and Knowledge)

✍ Elisa F. Kendall, Deborah L. McGuinness 📂 Library 📅 2019 🏛 Springer 🌐 English

Ontologies have become increasingly important as the use of knowledge graphs, machine learning, natural language processing (NLP), and the amount of data generated on a daily basis has exploded. As of 2014, 90% of the data in the digital universe was generated in the two years prior, and the v

📁 Modeling and Data Mining in Blogosphere (Synthesis Lectures on Data Mining and Knowledge Discovery)

✍ Huan Liu, Nitin Agarwal 📂 Library 📅 2009 🏛 Morgan and Claypool Publishers 🌐 English

This book offers a comprehensive overview of the various concepts and research issues about blogs or weblogs. It introduces techniques and approaches, tools and applications, and evaluation methodologies with examples and case studies. Blogs allow people to express their thoughts, voice their opinio

📁 Data Through Movement: Designing Embodied Human-Data Interaction for Informal Learning (Synthesis Lectures on Visualization)

📂 Library 🏛 Morgan & Claypool 🌐 English