Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

✍ Scribed by Jim Lehmer

Publisher: O'Reilly Media
Year: 2023
Tongue: English
Leaves: 285
Edition: 1
Category: Library

✦ Synopsis

If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.

DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.

Full of real-world techniques, the examples in the book contain working code. You'll learn how to:
• Identify and remove duplicates in two different datasets using SQL
• Regularize data and achieve data quality using SQL
• Extract data from XML and JSON
• Generate SQL using SQL to increase your productivity
• Prepare datasets for import, merging, and better analysis using SQL
• Report results using SQL
• Apply data quality and ETL approaches to finding similarities and differences between various expressions of the same data
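
As a rough illustration of the first two items in the list above, here is a minimal sketch that is not taken from the book. It assumes two hypothetical tables, `crm` and `billing`, with made-up rows, and uses SQLite through Python's standard `sqlite3` module only so the SQL can run anywhere; the book's own examples may target a different SQL dialect. Names and emails are regularized into comparison keys, then every cross-table pair is scored by how many keys agree.

```python
import sqlite3

# Hypothetical sample data: the table names, columns, and rows are
# assumptions made for this sketch, not examples from the book.
SETUP_SQL = """
CREATE TABLE crm     (id INTEGER, name TEXT, email TEXT);
CREATE TABLE billing (id INTEGER, name TEXT, email TEXT);
INSERT INTO crm VALUES
    (1, 'Ann  O''Neil ', 'Ann.ONeil@Example.com'),
    (2, 'Bob Smith',     'bob@example.com');
INSERT INTO billing VALUES
    (10, 'ann oneil',    'ann.oneil@example.com'),
    (11, 'Robert Smith', 'rsmith@example.com');
"""

# Regularize each side (trim, drop apostrophes and spaces, lowercase),
# then score every pair: one point per matching key.
MATCH_SQL = """
WITH c AS (
    SELECT id,
           lower(replace(replace(trim(name), '''', ''), ' ', '')) AS name_key,
           lower(trim(email)) AS email_key
    FROM crm
),
b AS (
    SELECT id,
           lower(replace(replace(trim(name), '''', ''), ' ', '')) AS name_key,
           lower(trim(email)) AS email_key
    FROM billing
),
scored AS (
    SELECT c.id AS crm_id,
           b.id AS billing_id,
           (c.name_key = b.name_key) + (c.email_key = b.email_key) AS score
    FROM c CROSS JOIN b
)
SELECT crm_id, billing_id, score
FROM scored
WHERE score > 0
ORDER BY score DESC, crm_id, billing_id;
"""


def find_matches():
    """Return (crm_id, billing_id, score) for every pair sharing a key."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP_SQL)
        return conn.execute(MATCH_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in find_matches():
        print(row)
```

With this sample data the only pair that scores is CRM row 1 against billing row 10, which agree on both keys once the case, spacing, and apostrophe differences are removed. "Bob Smith" and "Robert Smith" stay unmatched, which is the kind of gap that nickname handling would need to close.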

✦ Table of Contents

Cover
Copyright
Table of Contents
Preface
What Problems Are We Trying to Solve?
What Will We Cover?
Part I: Review
Part II: Various Data Problems
Part III: Bringing It Together
Appendix
Who Is This Book For?
Why SQL?
Warning! Opinions Ahead!
Typographical Conventions Used in This Book
Additional Information on the Book’s Conventions
The Data “Model”
Environment Layout
Customer Table
“Normalized” View
Meet the Snedleys
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Review
Chapter 1. A SELECT Review
Simple SELECT Statements
Common Table Expressions
In CASE of Emergency
Joins
A Diversion into NULL Values
OUTER JOINs
Finding the Most Current Value
Final Thoughts on SELECT
Chapter 2. Function Junction
Aggregate Functions
MAX
MIN
COUNT
SUM
AVG
Conversion Functions
CAST and CONVERT
COALESCE
TRY_CONVERT
Cryptographic Functions: HASHBYTES
Date and Time Functions
GETDATE
DATEADD
DATEDIFF
DATEPART
ISDATE
Logical Functions: IIF
String Functions
CHARINDEX and PATINDEX
LEN
LEFT, RIGHT, and SUBSTRING
LTRIM, RTRIM, and TRIM
LOWER and UPPER
REPLACE and TRANSLATE
REVERSE
STRING_AGG
System Functions
ISNULL
ISNUMERIC
Final Thoughts on Functions
Part II. Various Data Problems
Chapter 3. Names, Names, Names
What’s in a Name?
Last Names
Punctuation
Suffixes
First Names
Middle Name
Nicknames
Company Name
Full Name
“Person-Like Entities”
Final Thoughts on Names
Chapter 4. Location, Location, Location
What Makes an Address?
Street Address
Box, Suite, Lot, or Apartment Number
Don’t Overdo It!
City
County
State or State Abbreviation
ZIP or Postal Code
Country
Final Thoughts on Locations
Chapter 5. Dates, Dates, Dates
Time Is Relative
Final Thoughts on Dates
Chapter 6. Email
What Makes a Valid Email Address?
Final Thoughts on Email
Chapter 7. Phone Numbers
What Makes a “Phone Number”?
One Final Note on Tax IDs
Final Thoughts on Phone Numbers (and Tax IDs)
Chapter 8. Bad Characters
Data Representations
Invisible Whitespace
COLLATE
Cleaning Up the Input Data
Final Thoughts on Bad Characters
Chapter 9. Orthogonal Data
A Common Problem, A Common Solution, A New Common Problem
Lather, Rinse, Repeat
Final Thoughts on Orthogonal Data
Part III. Bringing It Together
Chapter 10. The Big Score
What Will We Want?
Tuning Scores
Eliminating Duplicates
Duplicate Data
Duplicated Data
Final Thoughts on Scoring
Chapter 11. Data Quality, or GIGO
Sneaking Data Quality In
Impossible Data
Simply Wrong
Semantically Wrong
ETL Your Way to Success
Final Thoughts on Data Quality
Chapter 12. Tying It All Together
Approach
What’s the Score?
First Pass: Naive Matching
Second Pass: Normalizing Relations
Impossible Data
Now Let’s Normalize
Third Pass: Score!
What About Tuning?
Final Thoughts on Practical Matters
Chapter 13. Code Is Data, Too!
Working with XML Data
Working with JSON Data
Extracting Data from HTML
Code-Generating Code
Impact Analysis: The Second Case Study
Gather Together Every Code “Artifact” You Can
Import Artifacts into SQL
And Now, for My Next Trick
Final Thoughts on Code As Data
Final Thoughts on All of It
Appendix. The Data “Model”
Customer Table
NormalizedCustomer View
PotentialMatches Table
CustomerCountByState View
PostalAbbreviations Table
Glossary
Index
About the Author
Colophon
Tech Stack

✦ Subjects

Fuzzy Logic; SQL; Best Practices

📜 SIMILAR VOLUMES

Fuzzy Data Matching with SQL: Enhancing

📁 Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

✍ Jim Lehmer 📂 Library 📅 2023 🏛 O'Reilly Media 🌐 English

If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and

Querying Databricks with Spark SQL: Leve

📁 Querying Databricks with Spark SQL: Leverage SQL to query and analyze Big Data for insights

✍ Adam Aspin 📂 Library 📅 2023 🏛 BPB Online 🌐 English

A practical guide to using Spark SQL to perform complex queries on your Databricks data. Description: Databricks stands out as a widely embraced platform dedicated to the creation of data lakes. Within its framework, it extends support to a specialized version of Structured Query Language (SQL) kn

SQL for Data Analytics: Perform fast and

📁 SQL for Data Analytics: Perform fast and efficient data analysis with the power of SQL

✍ Upom Malik, Matt Goldwasser, Benjamin Johnston 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform adv

SQL for Data Analytics: Perform Fast and

📁 SQL for Data Analytics: Perform Fast and Efficient Data Analysis with the Power of SQL

✍ Upom Malik; Matt Goldwasser; Benjamin Johnston 📂 Library 📅 2019 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform advanced analytics such as geospat