Big Data Analytics: A Guide to Data Science Practitioners Making the Transition to Big Data (Chapman & Hall/CRC Data Science)

✍ Scribed by Matter, Ulrich

Publisher: CRC Press LLC
Year: 2023
Tongue: English
Leaves: 328
Category: Library

✦ Synopsis

Successfully navigating the data-driven economy presupposes a certain understanding of the technologies and methods to gain insights from Big Data. This book aims to help data science practitioners to successfully manage the transition to Big Data. Building on familiar content from applied econometrics and business analytics, this book introduces the reader to the basic concepts of Big Data Analytics. The focus of the book is on how to productively apply econometric and machine learning techniques with large, complex data sets, as well as on all the steps involved before analysing the data (data storage, data import, data preparation). The book combines conceptual and theoretical material with the practical application of the concepts using R and SQL. The reader will thus acquire the skills to analyse large data sets, both locally and in the cloud. Various code examples and tutorials, focused on empirical economic and business research, illustrate practical techniques to handle and analyse Big Data.

Key Features:

    Includes many code examples in R and SQL, with R/SQL scripts freely provided online.

    Extensive use of real datasets from empirical economic research and business analytics, with data files freely provided online.

    Leads students and practitioners to think critically about where the bottlenecks are in practical data analysis tasks with large data sets, and how to address them.

The book is a valuable resource for data science practitioners, graduate students and researchers who aim to gain insights from big data in the context of research questions in business, economics, and the social sciences.

✦ Table of Contents

Preface

I Setting the Scene: Analyzing Big Data

Introduction

1 What is Big in “Big Data”?

2 Approaches to Analyzing Big Data

3 The Two Domains of Big Data Analytics

    3.1 A practical big P problem

        3.1.1 Simple logistic regression (naive approach)

        3.1.2 Regularization: the lasso estimator

    3.2 A practical big N problem

        3.2.1 OLS as a point of reference

        3.2.2 The Uluru algorithm as an alternative to OLS

II Platform: Software and Computing Resources

Introduction

4 Software: Programming with (Big) Data

    4.1 Domains of programming with (big) data

    4.2 Measuring R performance

    4.3 Writing efficient R code

        4.3.1 Memory allocation and growing objects

        4.3.2 Vectorization in basic R functions

        4.3.3 apply-type functions and vectorization

        4.3.4 Avoiding unnecessary copying

        4.3.5 Releasing memory

        4.3.6 Beyond R

    4.4 SQL basics

        4.4.1 First steps in SQL(ite)

        4.4.2 Joins

    4.5 With a little help from my friends: GPT and R/SQL coding

    4.6 Wrapping up

5 Hardware: Computing Resources

    5.1 Mass storage

        5.1.1 Avoiding redundancies

        5.1.2 Data compression

    5.2 Random access memory (RAM)

    5.3 Combining RAM and hard disk: Virtual memory

    5.4 CPU and parallelization

        5.4.1 Naive multi-session approach

        5.4.2 Multi-session approach with futures

        5.4.3 Multi-core and multi-node approach

    5.5 GPUs for scientific computing

        5.5.1 GPUs in R

    5.6 The road ahead: Hardware made for machine learning

    5.7 Wrapping up

    5.8 Still have insufficient computing resources?

6 Distributed Systems

    6.1 MapReduce

    6.2 Apache Hadoop

        6.2.1 Hadoop word count example

    6.3 Apache Spark

    6.4 Spark with R

        6.4.1 Data import and summary statistics

    6.5 Spark with SQL

    6.6 Spark with R + SQL

    6.7 Wrapping up

7 Cloud Computing

    7.1 Cloud computing basics and platforms

    7.2 Transitioning to the cloud

    7.3 Scaling up in the cloud: Virtual servers

        7.3.1 Parallelization with an EC2 instance

    7.4 Scaling up with GPUs

        7.4.1 GPUs on Google Colab

        7.4.2 RStudio and EC2 with GPUs on AWS

    7.5 Scaling out: MapReduce in the cloud

    7.6 Wrapping up

III Components of Big Data Analytics

Introduction

8 Data Collection and Data Storage

    8.1 Gathering and compilation of raw data

    8.2 Stack/combine raw source files

    8.3 Efficient local data storage

        8.3.1 RDBMS basics

        8.3.2 Efficient data access: Indices and joins in SQLite

    8.4 Connecting R to an RDBMS

        8.4.1 Creating a new database with RSQLite

        8.4.2 Importing data

        8.4.3 Issuing queries

    8.5 Cloud solutions for (big) data storage

        8.5.1 Easy-to-use RDBMS in the cloud: AWS RDS

    8.6 Column-based analytics databases

        8.6.1 Installation and start up

        8.6.2 First steps via Druid's GUI

        8.6.3 Query Druid from R

    8.7 Data warehouses

        8.7.1 Data warehouse for analytics: Google BigQuery example

    8.8 Data lakes and simple storage service

        8.8.1 AWS S3 with R: First steps

        8.8.2 Uploading data to S3

        8.8.3 More than just simple storage: S3 + Amazon Athena

    8.9 Wrapping up

9 Big Data Cleaning and Transformation

    9.1 Out-of-memory strategies and lazy evaluation: Practical basics

        9.1.1 Chunking data with the ff package

        9.1.2 Memory mapping with bigmemory

        9.1.3 Connecting to Apache Arrow

    9.2 Big Data preparation tutorial with ff

        9.2.1 Set up

        9.2.2 Data import

        9.2.3 Inspect imported files

        9.2.4 Data cleaning and transformation

        9.2.5 Inspect difference in in-memory operation

        9.2.6 Subsetting

        9.2.7 Save/load/export ff files

    9.3 Big Data preparation tutorial with arrow

    9.4 Wrapping up

10 Descriptive Statistics and Aggregation

    10.1 Data aggregation: The ‘split-apply-combine’ strategy

    10.2 Data aggregation with chunked data files

    10.3 High-speed in-memory data aggregation with arrow

    10.4 High-speed in-memory data aggregation with data.table

    10.5 Wrapping up

11 (Big) Data Visualization

    11.1 Challenges of Big Data visualization

    11.2 Data exploration with ggplot2

    11.3 Visualizing time and space

        11.3.1 Preparations

        11.3.2 Pick-up and drop-off locations

    11.4 Wrapping up

IV Application: Topics in Big Data Econometrics

Introduction

12 Bottlenecks in Everyday Data Analytics Tasks

    12.1 Case study: Efficient fixed effects estimation

    12.2 Case study: Loops, memory, and vectorization

        12.2.1 Naïve approach (ignorant of R)

        12.2.2 Improvement 1: Pre-allocation of memory

        12.2.3 Improvement 2: Exploit vectorization

    12.3 Case study: Bootstrapping and parallel processing

        12.3.1 Parallelization with an EC2 instance

13 Econometrics with GPUs

    13.1 OLS on GPUs

    13.2 A word of caution

    13.3 Higher-level interfaces for basic econometrics with GPUs

    13.4 TensorFlow/Keras example: Predict housing prices

        13.4.1 Data preparation

        13.4.2 Model specification

        13.4.3 Training and prediction

    13.5 Wrapping up

14 Regression Analysis and Categorization with Spark and R

    14.1 Simple linear regression analysis

    14.2 Machine learning for classification

    14.3 Building machine learning pipelines with R and Spark

        14.3.1 Set up and data import

        14.3.2 Building the pipeline

    14.4 Wrapping up

15 Large-scale Text Analysis with sparklyr

    15.1 Getting started: Import, pre-processing, and word count

    15.2 Tutorial: political slant

        15.2.1 Data download and import

        15.2.2 Cleaning speeches data

        15.2.3 Create a bigrams count per party

        15.2.4 Find “partisan” phrases

        15.2.5 Results: Most partisan phrases by congress

    15.3 Natural Language Processing at Scale

        15.3.1 Preparatory steps

        15.3.2 Sentiment annotation

    15.4 Aggregation and visualization

    15.5 sparklyr and lazy evaluation

V Appendices

Appendix A: GitHub

Appendix B: R Basics

Appendix C: Install Hadoop

VI References and Index

Bibliography

Index

📜 SIMILAR VOLUMES

📁 Big Data Analytics: A Guide to Data Science Practitioners Making the Transition to Big Data (Chapman & Hall/CRC Data Science Series)

✍ Ulrich Matter 📂 Library 📅 2023 🏛 Chapman and Hall/CRC 🌐 English

Successfully navigating the data-driven economy presupposes a certain understanding of the technologies and methods to gain insights from Big Data. This book aims to help data science practitioners to successfully manage the transition to Big Data. Building on familiar content from appl

📁 Big Data Analytics: A Guide to Data Science Practitioners Making the Transition to Big Data

✍ Ulrich Matter 📂 Library 📅 2023 🏛 CRC Press/Chapman & Hall 🌐 English

📁 Data Lake Analytics on Microsoft Azure: A Practitioner's Guide to Big Data Engineering

✍ Harsh Chawla, Pankaj Khattar 📂 Library 📅 2020 🏛 Apress 🌐 English

Get a 360-degree view of how the journey of data analytics solutions has evolved from monolithic data stores and enterprise data warehouses to data lakes and modern data warehouses. This book includes comprehensive coverage of how:

📁 Data Science and Big Data Analytics

✍ Durgesh Kumar Mishra, Xin-She Yang, Aynur Unal 📂 Library 📅 2019 🏛 Springer Singapore 🌐 English

This book presents conjectural advances in big data analysis, machine learning and computational intelligence, as well as their potential applications in scientific computing. It discusses major issues pertaining to big data analysis using computational intelligence techniques, and the conject

📁 Scalable Big Data Architecture: A practitioners guide to choosing relevant Big Data architecture

✍ Bahaaldine Azarmi 📂 Library 📅 2016 🏛 Apress 🌐 English

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of No-SQL databases to the deployment of stream analytics architecture, machine learning, and governance. Scalable Big Data Archit

📁 Scalable Big Data Architecture: A Practitioners Guide to Choosing Relevant Big Data Architecture

✍ Bahaaldine Azarmi 📂 Library 📅 2015 🏛 Apress 🌐 English