
Data Profiling (Synthesis Lectures on Data Management)

โœ Scribed by Ziawasch Abedjan, Lukasz Golab, Felix Naumann


Publisher: Morgan & Claypool Publishers
Year: 2018
Tongue: English
Leaves: 156
Series: Synthesis Lectures on Data Management
Category: Library


✦ Synopsis


Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More complex types of metadata are statements about multiple columns and their correlation, such as candidate keys, functional dependencies, and other types of dependencies.
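The metadata types named above can be illustrated with a minimal sketch. It is not taken from the book: the toy table, its column names, and the helper functions `column_profile`, `is_unique`, and `holds_fd` are all hypothetical, and the checks are naive scans rather than any of the surveyed discovery algorithms. Null values are treated as ordinary values when checking uniqueness and dependencies, which is only one of several possible null semantics.

```python
from collections import Counter

# Hypothetical toy table; column names and values are illustrative only.
ROWS = [
    {"id": 1, "zip": "10115", "city": "Berlin", "phone": None},
    {"id": 2, "zip": "10115", "city": "Berlin", "phone": "555-0100"},
    {"id": 3, "zip": "14467", "city": "Potsdam", "phone": ""},
    {"id": 4, "zip": "14469", "city": "Potsdam", "phone": "555-0101"},
]


def column_profile(rows, column):
    """Simple single-column metadata: counts, nulls, distinct values, types."""
    values = [row[column] for row in rows]
    # Treat None and the empty string as "null or empty".
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls_or_empty": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        "value_counts": Counter(present),
    }


def is_unique(rows, columns):
    """True if no two rows agree on all given columns (a candidate key)."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds on these rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs value;
        # any later row with a different rhs value violates the dependency.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


if __name__ == "__main__":
    print(column_profile(ROWS, "phone"))
    print("id is a candidate key:", is_unique(ROWS, ["id"]))
    print("zip -> city holds:", holds_fd(ROWS, ["zip"], "city"))
    print("city -> zip holds:", holds_fd(ROWS, ["city"], "zip"))
```

On this toy table, `id` is unique and `zip` determines `city`, but `city` does not determine `zip`. A result like this describes only the given instance; whether the dependency is meant to hold in general is a separate question.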

This book provides a classification of the various types of profilable metadata, discusses popular data profiling tasks, and surveys state-of-the-art profiling algorithms. While most of the book focuses on tasks and algorithms for relational data profiling, we also briefly discuss systems and techniques for profiling non-relational data such as graphs and text. We conclude with a discussion of data profiling challenges and directions for future work in this area.

✦ Table of Contents


Preface
Acknowledgments
Discovering Metadata
Motivation and Overview
Data Profiling and Data Mining
Use Cases
Organization of This Book
Data Profiling Tasks
Single-Column Analysis
Dependency Discovery
Relaxed Dependencies
Single-Column Analysis
Cardinalities
Value Distributions
Data Types, Patterns, and Domains
Data Completeness
Approximate Statistics
Summary and Discussion
Dependency Discovery
Dependency Definitions
Functional Dependencies
Unique Column Combinations
Inclusion Dependencies
Search Space and Data Structures
Lattices and Search Space Sizes
Position List Indexes and Search Space Validation
Search Complexity
Null Semantics
Discovering Unique Column Combinations
Gordian
HCA
Ducc
HyUCC
Swan
Discovering Functional Dependencies
Tane
Fun
FD_Mine
Dfd
Dep-Miner
FastFDs
Fdep
HyFD
Discovering Inclusion Dependencies
SQL-Based IND Validation
B&B
DeMarchi
Binder
Spider
S-IndD
Sindy
Mind
Find2
ZigZag
Mind2
Relaxed and Other Dependencies
Relaxing the Extent of a Dependency
Partial Dependencies
Conditional Dependencies
Relaxing Attribute Comparisons
Metric and Matching Dependencies
Order and Sequential Dependencies
Approximating the Dependency Discovery
Generalizing Functional Dependencies
Denial Constraints
Multivalued Dependencies
Use Cases
Data Exploration
Schema Engineering
Data Cleaning
Query Optimization
Data Integration
Profiling Non-Relational Data
XML
RDF
Time Series
Graphs
Text
Data Profiling Tools
Research Prototypes
Commercial Tools
Data Profiling Challenges
Functional Challenges
Profiling Dynamic Data
Interactive Profiling
Profiling for Integration
Interpreting Profiling Results
Non-Functional Challenges
Efficiency and Scalability
Profiling on New Architectures
Benchmarking Profiling Methods
Conclusions
Bibliography
Authors' Biographies


📜 SIMILAR VOLUMES


Data-Intensive Workflow Management: For
โœ Daniel C. M. de Oliveira, Ji Liu, Esther Pacitti ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› MORGAN & CLAYPOOL ๐ŸŒ English

Workflows may be defined as abstractions used to model the coherent flow of activities in the context of an in silico scientific experiment. They are employed in many domains of science such as bioinformatics, astronomy, and engineering. Such workflows usually present a consid

Web Page Recommendation Models: Theory a
โœ Sule Gunduz-Oguducu ๐Ÿ“‚ Library ๐Ÿ“… 2010 ๐Ÿ› Morgan and Claypool Publishers ๐ŸŒ English

One of the application areas of data mining is the World Wide Web (WWW or Web), which serves as a huge, widely distributed, global information service for every kind of information such as news, advertisements, consumer information, financial management, education, government, e-commerce, health ser

Replicated Data Management for Mobile Co
โœ Douglas B. Terry ๐Ÿ“‚ Library ๐Ÿ“… 2008 ๐Ÿ› Morgan and Claypool Publishers ๐ŸŒ English

Managing data in a mobile computing environment invariably involves caching or replication. In many cases, a mobile device has access only to data that is stored locally, and much of that data arrives via replication from other devices, PCs, and services. Given portable devices with limited re

Ontology Engineering (Synthesis Lectures
โœ Elisa F. Kendall, Deborah L. McGuinness ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Springer ๐ŸŒ English

Ontologies have become increasingly important as the use of knowledge graphs, machine learning, natural language processing (NLP), and the amount of data generated on a daily basis has exploded. As of 2014, 90% of the data in the digital universe was generated in the two years prior, and the v

Modeling and Data Mining in Blogosphere
โœ Huan Liu, Nitin Agarwal ๐Ÿ“‚ Library ๐Ÿ“… 2009 ๐Ÿ› Morgan and Claypool Publishers ๐ŸŒ English

This book offers a comprehensive overview of the various concepts and research issues about blogs or weblogs. It introduces techniques and approaches, tools and applications, and evaluation methodologies with examples and case studies. Blogs allow people to express their thoughts, voice their opinio