๐”– Scriptorium
โœฆ   LIBER   โœฆ

๐Ÿ“

Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

โœ Scribed by Jim Lehmer


Publisher
O'Reilly Media
Year
2023
Tongue
English
Leaves
282
Category
Library

โฌ‡  Acquire This Volume

No coin nor oath required. For personal study only.

โœฆ Synopsis


If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.

DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.

Full of real-world techniques, the examples in the book contain working code. You'll learn how to:

  • Identity and remove duplicates in two different datasets using SQL
  • Regularize data and achieve data quality using SQL
  • Extract data from XML and JSON
  • Generate SQL...
  • โœฆ Table of Contents


    Preface
    What Problems Are We Trying to Solve?
    What Will We Cover?
    Part I: Review
    Part II: Various Data Problems
    Part III: Bringing It Together
    Appendix
    Who Is This Book For?
    Why SQL?
    Warning! Opinions Ahead!
    Typographical Conventions Used in This Book
    Additional Information on the Bookโ€™s Conventions
    The Data โ€œModelโ€
    Environment Layout
    Customer Table
    โ€œNormalizedโ€ View
    Meet the Snedleys
    Using Code Examples
    Oโ€™Reilly Online Learning
    How to Contact Us
    Acknowledgments
    I. Review
    1. A SELECT Review
    Simple SELECT Statements
    Common Table Expressions
    In CASE of Emergency
    Joins
    A Diversion into NULL Values
    OUTER JOINs
    Finding the Most Current Value
    Final Thoughts on SELECT
    2. Function Junction
    Aggregate Functions
    MAX
    MIN
    COUNT
    SUM
    AVG
    Conversion Functions
    CAST and CONVERT
    COALESCE
    TRY_CONVERT
    Cryptographic Functions: HASHBYTES
    Date and Time Functions
    GETDATE
    DATEADD
    DATEDIFF
    DATEPART
    ISDATE
    Logical Functions: IIF
    String Functions
    CHARINDEX and PATINDEX
    LEN
    LEFT, RIGHT, and SUBSTRING
    LTRIM, RTRIM, and TRIM
    LOWER and UPPER
    REPLACE and TRANSLATE
    REVERSE
    STRING_AGG
    System Functions
    ISNULL
    ISNUMERIC
    Final Thoughts on Functions
    II. Various Data Problems
    3. Names, Names, Names
    Whatโ€™s in a Name?
    Last Names
    Punctuation
    Suffixes
    First Names
    Middle Name
    Nicknames
    Company Name
    Full Name
    โ€œPerson-Like Entitiesโ€
    Final Thoughts on Names
    4. Location, Location, Location
    What Makes an Address?
    Street Address
    Box, Suite, Lot, or Apartment Number
    Donโ€™t Overdo It!
    City
    County
    State or State Abbreviation
    ZIP or Postal Code
    Country
    Final Thoughts on Locations
    5. Dates, Dates, Dates
    Time Is Relative
    Final Thoughts on Dates
    6. Email
    What Makes a Valid Email Address?
    Final Thoughts on Email
    7. Phone Numbers
    What Makes a โ€œPhone Numberโ€?
    One Final Note on Tax IDs
    Final Thoughts on Phone Numbers (and Tax IDs)
    8. Bad Characters
    Data Representations
    Invisible Whitespace
    COLLATE
    Cleaning Up the Input Data
    Final Thoughts on Bad Characters
    9. Orthogonal Data
    A Common Problem, A Common Solution, A New Common Problem
    Lather, Rinse, Repeat
    Final Thoughts on Orthogonal Data
    III. Bringing It Together
    10. The Big Score
    What Will We Want?
    Tuning Scores
    Eliminating Duplicates
    Duplicate Data
    Duplicated Data
    Final Thoughts on Scoring
    11. Data Quality, or GIGO
    Sneaking Data Quality In
    Impossible Data
    Simply Wrong
    Semantically Wrong
    ETL Your Way to Success
    Final Thoughts on Data Quality
    12. Tying It All Together
    Approach
    Whatโ€™s the Score?
    First Pass: Naive Matching
    Second Pass: Normalizing Relations
    Impossible Data
    Now Letโ€™s Normalize
    Third Pass: Score!
    What About Tuning?
    Final Thoughts on Practical Matters
    13. Code Is Data, Too!
    Working with XML Data
    Working with JSON Data
    Extracting Data from HTML
    Code-Generating Code
    Impact Analysis: The Second Case Study
    Gather Together Every Code โ€œArtifactโ€ You Can
    Import Artifacts into SQL
    And Now, for My Next Trick
    Final Thoughts on Code As Data
    Final Thoughts on All of It
    A. The Data โ€œModelโ€
    Customer Table
    NormalizedCustomer View
    PotentialMatches Table
    CustomerCountByState View
    PostalAbbreviations Table
    Glossary
    Index


    ๐Ÿ“œ SIMILAR VOLUMES


    Fuzzy Data Matching with SQL: Enhancing
    โœ Jim Lehmer ๐Ÿ“‚ Library ๐Ÿ› O'Reilly Media ๐ŸŒ English

    <p><span>If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score

    Fuzzy Data Matching with SQL: Enhancing
    โœ Jim Lehmer ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› O'Reilly Media ๐ŸŒ English

    If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and thi

    Querying Databricks with Spark SQL: Leve
    โœ Adam Aspin ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› BPB Online ๐ŸŒ English

    A practical guide to using Spark SQL to perform complex queries on your Databricks data Description Databricks stands out as a widely embraced platform dedicated to the creation of data lakes. Within its framework, it extends support to a specialized version of Structured Query Language (SQL) kn

    SQL for Data Analytics: Perform fast and
    โœ Upom Malik, Matt Goldwasser, Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Packt Publishing ๐ŸŒ English

    <p><b>Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets.</b><p><b>Key Features</b><p><li>Explore a variety of statistical techniques to analyze your data<li>Integrate your SQL pipelines with other analytics technologies<li>Perform adv

    SQL for Data Analytics: Perform fast and
    โœ Upom Malik, Matt Goldwasser, Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Packt Publishing ๐ŸŒ English

    Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features โ€ข Explore a variety of statistical techniques to analyze your data โ€ข Integrate your SQL pipelines with other analytics technologies โ€ข Perform advanced analytics suc

    SQL for Data Analytics: Perform Fast and
    โœ Upom Malik; Matt Goldwasser; Benjamin Johnston ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐ŸŒ English

    Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features Explore a variety of statistical techniques to analyze your data Integrate your SQL pipelines with other analytics technologies Perform advanced analytics such as geospat