SQL for Data Science: Data Cleaning, Wrangling and Analytics with Relational Databases

✍ Scribed by Antonio Badia

Publisher: Springer
Year: 2020
Tongue: English
Leaves: 290
Series: Data-Centric Systems and Applications
Edition: 1
Category: Library

✦ Synopsis

This textbook explains SQL within the context of data science and introduces the different parts of SQL as they are needed for the tasks usually carried out during data analysis. Using the framework of the data life cycle, it focuses on the steps that are very often given short shrift in traditional textbooks, like data loading, cleaning and pre-processing.

The book is organized as follows. Chapter 1 describes the data life cycle, i.e. the sequence of stages, from acquisition to archiving, that data goes through as it is prepared and then analyzed, together with the activities that take place at each stage. Chapter 2 gets into databases proper, explaining how relational databases organize data; non-traditional data, like XML and text, are also covered. Chapter 3 introduces SQL queries but, unlike traditional textbooks, describes queries and their parts around typical data analysis tasks like data exploration, cleaning and transformation. Chapter 4 introduces some basic techniques for data analysis and shows how SQL can be used for simple analyses without too much complication. Chapter 5 introduces additional SQL constructs that are important in a variety of situations and thus completes the coverage of SQL queries. Lastly, Chapter 6 briefly explains how to use SQL from within R and Python programs. It focuses on how these languages can interact with a database, and how what has been learned about SQL can be leveraged to make life easier when using R or Python. All chapters contain many examples and exercises along the way, and readers are encouraged to install the two open-source database systems used throughout the book (MySQL and Postgres) in order to practice and work on the exercises, because simply reading the book is much less useful than actually using it.

This book is for anyone interested in data science and/or databases. It requires only a bit of computer fluency and no specific background in databases or data analysis. All concepts are introduced intuitively and with a minimum of specialized jargon. After going through this book, readers should be able to profitably learn more about data mining, machine learning, and database management from more advanced textbooks and courses.

✦ Table of Contents

Preface
Contents
1 The Data Life Cycle
1.1 Stages and Operations in the Data Life Cycle
1.2 Types of Datasets
1.2.1 Structured Data
1.2.2 Semistructured Data
1.2.3 Unstructured Data
1.3 Types of Domains
1.3.1 Nominal/Categorical Data
1.3.2 Ordinal Data
1.3.3 Numerical Data
1.4 Metadata
1.5 The Role of Databases in the Cycle
2 Relational Data
2.1 Database Tables
2.1.1 Data Types
2.1.2 Inserting Data
2.1.3 Keys
2.1.4 Organizing Data into Tables
2.2 Database Schemas
2.2.1 Heterogeneous Data
2.2.2 Multi-valued Attributes
2.2.3 Complex Data
2.3 Other Types of Data
2.3.1 XML and JSON Data
2.3.2 Graph Data
2.3.3 Text
2.4 Getting Data In and Out of the Database
2.4.1 Importing and Loading Data
2.4.2 Updating Data
2.4.3 Exporting Data
3 Data Cleaning and Pre-processing
3.1 The Basic SQL Query
3.1.1 Joins
3.1.2 Functions
3.1.3 Grouping
3.1.4 Order
3.1.5 Complex Queries
3.2 Exploratory Data Analysis (EDA)
3.2.1 Univariate Analysis
3.2.2 Multivariate Analysis
3.2.3 Distribution Fitting
3.3 Data Cleaning
3.3.1 Attribute Transformation
3.3.1.1 Working with Numbers
3.3.1.2 Working with Strings
3.3.1.3 Working with Dates
3.3.2 Missing Data
3.3.3 Outlier Detection
3.3.4 Duplicate Detection and Removal
3.4 Data Pre-processing
3.4.1 Restructuring Data
3.5 Metadata and Implementing Workflows
3.5.1 Metadata
4 Introduction to Data Analysis
4.1 What Is Data Analysis?
4.2 Supervised Approaches
4.2.1 Classification: Naive Bayes
4.2.2 Linear Regression
4.2.3 Logistic Regression
4.3 Unsupervised Approaches
4.3.1 Distances and Clustering
4.3.1.1 K-Means Clustering
4.3.2 The kNN Algorithm
4.3.3 Association Rules
4.4 Dealing with JSON/XML
4.5 Text Analysis
4.6 Graph Analytics: Recursive Queries
4.7 Collaborative Filtering
5 More SQL
5.1 More on Joins
5.2 Complex Subqueries
5.3 Windows and Window Aggregates
5.4 Set Operations
5.5 Expressing Domain Knowledge
6 Databases and Other Tools
6.1 SQL and R
6.1.1 DBI
6.1.2 dbplyr
6.1.3 sqldf
6.1.4 Packages: Advanced Data Analysis
6.2 SQL and Python
6.2.1 Python and Databases: DB-API
6.2.2 Libraries and Further Analysis
A Getting Started
A.1 Downloading and Installing Postgres and MySQL
A.2 Getting the Server Started
A.3 User Management
B Big Data
B.1 What Is Big Data?
B.2 Data Warehouses
B.3 Cluster Databases
B.4 The Cloud
References
Index

✦ Subjects

Unsupervised Learning; Data Science; Supervised Learning; Python; Big Data; Classification; SQL; Relational Databases; MySQL; PostgreSQL; R; JSON; Linear Regression; Logistic Regression; Data Cleaning; Text Analysis; Relational Algebra; XML

📜 SIMILAR VOLUMES

📁 Practical Python Data Wrangling and Data Quality: Getting Started with Reading, Cleaning, and Analyzing Data

✍ Susan E. McGregor 📂 Library 📅 2021 🏛 O'Reilly Media 🌐 English

The world around us is full of data that holds unique insights and valuable stories, and this book will help you uncover them. Whether you already work with data or want to learn more about its possibilities, the examples and techniques in this practical book will help you more easily clean,

📁 Practical Python Data Wrangling and Data Quality: Getting Started with Reading, Cleaning, and Analyzing Data

✍ Susan McGregor 📂 Library 📅 2021 🏛 O'Reilly Media 🌐 English

The world around us is full of data that holds unique insights and valuable stories, and this book will help you uncover them. Whether you already work with data or want to learn more about its possibilities, the examples and techniques in this practical book will help you more easily clean, evaluat

📁 Data Wrangling on AWS: Clean and organize complex data for analysis

✍ Navnit Shukla | Sankar M | Sam Palani 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for

📁 SQL for Data Analytics: Perform fast and efficient data analysis with the power of SQL

✍ Upom Malik, Matt Goldwasser, Benjamin Johnston 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform adv

📁 SQL for Data Analytics: Perform fast and efficient data analysis with the power of SQL

✍ Upom Malik, Matt Goldwasser, Benjamin Johnston 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform advanced analytics suc

📁 SQL for Data Analytics: Perform Fast and Efficient Data Analysis with the Power of SQL

✍ Upom Malik; Matt Goldwasser; Benjamin Johnston 📂 Library 📅 2019 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features Explore a variety of statistical techniques to analyze your data Integrate your SQL pipelines with other analytics technologies Perform advanced analytics such as geospat