Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

โœ Scribed by Jim Lehmer


Publisher: O'Reilly Media
Year: 2023
Tongue: English
Leaves: 285
Edition: 1
Category: Library

✦ Synopsis


If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.

DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.

Full of real-world techniques, the examples in the book contain working code. You'll learn how to:
• Identify and remove duplicates in two different datasets using SQL
• Regularize data and achieve data quality using SQL
• Extract data from XML and JSON
• Generate SQL using SQL to increase your productivity
• Prepare datasets for import, merging, and better analysis using SQL
• Report results using SQL
• Apply data quality and ETL approaches to finding similarities and differences between various expressions of the same data

✦ Table of Contents


Cover
Copyright
Table of Contents
Preface
What Problems Are We Trying to Solve?
What Will We Cover?
Part I: Review
Part II: Various Data Problems
Part III: Bringing It Together
Appendix
Who Is This Book For?
Why SQL?
Warning! Opinions Ahead!
Typographical Conventions Used in This Book
Additional Information on the Book’s Conventions
The Data “Model”
Environment Layout
Customer Table
“Normalized” View
Meet the Snedleys
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Review
Chapter 1. A SELECT Review
Simple SELECT Statements
Common Table Expressions
In CASE of Emergency
Joins
A Diversion into NULL Values
OUTER JOINs
Finding the Most Current Value
Final Thoughts on SELECT
Chapter 2. Function Junction
Aggregate Functions
MAX
MIN
COUNT
SUM
AVG
Conversion Functions
CAST and CONVERT
COALESCE
TRY_CONVERT
Cryptographic Functions: HASHBYTES
Date and Time Functions
GETDATE
DATEADD
DATEDIFF
DATEPART
ISDATE
Logical Functions: IIF
String Functions
CHARINDEX and PATINDEX
LEN
LEFT, RIGHT, and SUBSTRING
LTRIM, RTRIM, and TRIM
LOWER and UPPER
REPLACE and TRANSLATE
REVERSE
STRING_AGG
System Functions
ISNULL
ISNUMERIC
Final Thoughts on Functions
Part II. Various Data Problems
Chapter 3. Names, Names, Names
What’s in a Name?
Last Names
Punctuation
Suffixes
First Names
Middle Name
Nicknames
Company Name
Full Name
“Person-Like Entities”
Final Thoughts on Names
Chapter 4. Location, Location, Location
What Makes an Address?
Street Address
Box, Suite, Lot, or Apartment Number
Don’t Overdo It!
City
County
State or State Abbreviation
ZIP or Postal Code
Country
Final Thoughts on Locations
Chapter 5. Dates, Dates, Dates
Time Is Relative
Final Thoughts on Dates
Chapter 6. Email
What Makes a Valid Email Address?
Final Thoughts on Email
Chapter 7. Phone Numbers
What Makes a “Phone Number”?
One Final Note on Tax IDs
Final Thoughts on Phone Numbers (and Tax IDs)
Chapter 8. Bad Characters
Data Representations
Invisible Whitespace
COLLATE
Cleaning Up the Input Data
Final Thoughts on Bad Characters
Chapter 9. Orthogonal Data
A Common Problem, A Common Solution, A New Common Problem
Lather, Rinse, Repeat
Final Thoughts on Orthogonal Data
Part III. Bringing It Together
Chapter 10. The Big Score
What Will We Want?
Tuning Scores
Eliminating Duplicates
Duplicate Data
Duplicated Data
Final Thoughts on Scoring
Chapter 11. Data Quality, or GIGO
Sneaking Data Quality In
Impossible Data
Simply Wrong
Semantically Wrong
ETL Your Way to Success
Final Thoughts on Data Quality
Chapter 12. Tying It All Together
Approach
What’s the Score?
First Pass: Naive Matching
Second Pass: Normalizing Relations
Impossible Data
Now Let’s Normalize
Third Pass: Score!
What About Tuning?
Final Thoughts on Practical Matters
Chapter 13. Code Is Data, Too!
Working with XML Data
Working with JSON Data
Extracting Data from HTML
Code-Generating Code
Impact Analysis: The Second Case Study
Gather Together Every Code “Artifact” You Can
Import Artifacts into SQL
And Now, for My Next Trick
Final Thoughts on Code As Data
Final Thoughts on All of It
Appendix. The Data “Model”
Customer Table
NormalizedCustomer View
PotentialMatches Table
CustomerCountByState View
PostalAbbreviations Table
Glossary
Index
About the Author
Colophon
Tech Stack

✦ Subjects


Fuzzy Logic; SQL; Best Practices


📜 SIMILAR VOLUMES


Fuzzy Data Matching with SQL: Enhancing
โœ Jim Lehmer ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› O'Reilly Media ๐ŸŒ English

<p>If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and

Fuzzy Data Matching with SQL: Enhancing
โœ Jim Lehmer ๐Ÿ“‚ Library ๐Ÿ› O'Reilly Media ๐ŸŒ English

<p><span>If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score

Querying Databricks with Spark SQL: Leve
โœ Adam Aspin ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› BPB Online ๐ŸŒ English

A practical guide to using Spark SQL to perform complex queries on your Databricks data Description Databricks stands out as a widely embraced platform dedicated to the creation of data lakes. Within its framework, it extends support to a specialized version of Structured Query Language (SQL) kn

SQL for Data Analytics: Perform fast and
โœ Upom Malik, Matt Goldwasser, Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Packt Publishing ๐ŸŒ English

<p><b>Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets.</b><p><b>Key Features</b><p><li>Explore a variety of statistical techniques to analyze your data<li>Integrate your SQL pipelines with other analytics technologies<li>Perform adv

SQL for Data Analytics: Perform fast and
โœ Upom Malik, Matt Goldwasser, Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Packt Publishing ๐ŸŒ English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features โ€ข Explore a variety of statistical techniques to analyze your data โ€ข Integrate your SQL pipelines with other analytics technologies โ€ข Perform advanced analytics suc

SQL for Data Analytics: Perform Fast and
โœ Upom Malik; Matt Goldwasser; Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐ŸŒ English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features Explore a variety of statistical techniques to analyze your data Integrate your SQL pipelines with other analytics technologies Perform advanced analytics such as geospat