Hands-On Entity Resolution: A Practical Guide to Data Matching with Python

✍ Scribed by Michael Shearer

Publisher: O'Reilly Media, Inc.
Year: 2024
Tongue: English
Leaves: 200
Category: Library

✦ Synopsis

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.

✦ Table of Contents

Preface
Who Should Read This Book
Why I Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Introduction to Entity Resolution
What Is Entity Resolution?
Why Is Entity Resolution Needed?
Main Challenges of Entity Resolution
Lack of Unique Names
Inconsistent Naming Conventions
Data Capture Inconsistencies
Worked Example
Deliberate Obfuscation
Match Permutations
Blind Matching?
The Entity Resolution Process
Data Standardization
Record Blocking
Attribute Comparison
Match Classification
Clustering
Canonicalization
Worked Example
Measuring Performance
Getting Started
2. Data Standardization
Sample Problem
Environment Setup
Acquiring Data
Wikipedia Data
TheyWorkForYou Data
Adding Facebook links
Cleansing Data
Wikipedia
TheyWorkForYou
Attribute Comparison
Constituency
Measuring Performance
Sample Calculation
Summary
3. Text Matching
Edit Distance Matching
Levenshtein Distance
Jaro Similarity
Jaro-Winkler Similarity
Phonetic Matching
Metaphone
Match Rating Approach
Comparing the Techniques
Sample Problem
Full Similarity Comparison
Measuring Performance
Summary
4. Probabilistic Matching
Sample Problem
Single Attribute Match Probability
First Name Match Probability
Last Name Match Probability
Multiple Attribute Match Probability
Probabilistic Models
Bayes’ Theorem
m Value
u Value
Lambda ( λ ) Value
Bayes Factor
Fellegi-Sunter Model
Match Weight
Expectation-Maximization Algorithm
Iteration 1
Iteration 2
Iteration 3
Introducing Splink
Configuring Splink
Splink Performance
Summary
5. Record Blocking
Sample Problem
Data Acquisition
Wikipedia Data
UK Companies House Data
Data Standardization
Wikipedia Data
UK Companies House Data
Record Blocking and Attribute Comparison
Record Blocking with Splink
Attribute Comparison
Match Classification
Measuring Performance
Summary
6. Company Matching
Sample Problem
Data Acquisition
Data Standardization
Companies House Data
Maritime and Coastguard Agency Data
Record Blocking and Attribute Comparison
Match Classification
Measuring Performance
Matching New Entities
Summary
7. Clustering
Simple Exact Match Clustering
Approximate Match Clustering
Sample Problem
Data Acquisition
Data Standardization
Record Blocking and Attribute Comparison
Data Analysis
Expectation-Maximization Blocking Rules
Match Classification and Clustering
Cluster Visualization
Cluster Analysis
Summary
8. Scaling Up on Google Cloud
Google Cloud Setup
Setting Up Project Storage
Creating a Dataproc Cluster
Configuring a Dataproc Cluster
Entity Resolution on Spark
Measuring Performance
Tidy Up!
Summary
9. Cloud Entity Resolution Services
Introduction to BigQuery
Enterprise Knowledge Graph API
Schema Mapping
Reconciliation Job
Result Processing
Entity Reconciliation Python Client
Measuring Performance
Summary
10. Privacy-Preserving Record Linkage
An Introduction to Private Set Intersection
How PSI Works
PSI Protocol Based on ECDH
Bloom Filters
Bloom filter example
Golomb-Coded Sets
GCS example
Example: Using the PSI Process
Environment Setup
Google Cloud setup
Option 1: Prebuilt PSI package
Option 2: Build PSI package
Server install
Server Code
Client Code
Using raw encrypted server values
Using Bloom filter–encoded encrypted server values
Using GCS-encoded encrypted server values
Full MCA and Companies House Sample Example
Summary
11. Further Considerations
Data Considerations
Unstructured Data
Data Quality
Temporal Equivalence
Attribute Comparison
Set Matching
Geocoding Location Matching
Aggregating Comparisons
Post Processing
Graphical Representation
Real-Time Considerations
Performance Evaluation
Pairwise Approach
Cluster-Based Approach
Future of Entity Resolution
Index

📜 SIMILAR VOLUMES

Hands-On Entity Resolution: A Practical

📁 Hands-On Entity Resolution: A Practical Guide to Data Matching With Python

✍ Michael Shearer 📂 Library 📅 2024 🏛 O'Reilly Media 🌐 English

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving data

Hands-On Entity Resolution: A Practical

📁 Hands-On Entity Resolution: A Practical Guide to Data Matching With Python

✍ Michael Shearer 📂 Library 📅 2024 🏛 Oreilly & Associates Inc 🌐 English

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolvin

A Practical Introduction to Python Progr

📁 A Practical Introduction to Python Programming : Hand-On Machine Learning With Python

✍ Engr. Michael David 📂 Library 📅 2021 🌐 English

My goal here is for something that is partly a tutorial and partly a reference book. I like how tutorials get you up and running quickly, but they can often be a little wordy and disorganized. Reference books contain a lot of good information, but they are often too terse, and they don’t often give

Data Ingestion with Python Cookbook: A p

📁 Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring

✍ Gláucia Esppenchutz 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

Deploy your data ingestion pipeline, orchestrate, and monitor efficiently to prevent loss of data and quality Key Features Harness best practices to create a Python and PySpark data ingestion pipeline Seamlessly automate and orchestrate your data pipelines using Apache Airflow Build a monitori

Modern Data Architectures with Python: A

📁 Modern Data Architectures with Python: A practical guide to building and deploying data pipelines, data warehouses

✍ Brian Lipp 📂 Library 📅 2023 🏛 Packt Publishing 🌐 English

Learn to build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka. Key Features Develop modern data skills in emerging technologies Learn pragmatic design methodologies like Data Mesh and Lake House Grow a deeper understanding of data governance Book Descript