Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQ
Data Just Right: Introduction to Large-Scale Data & Analytics
✍ Scribed by Michael Manoochehri
- Publisher
- Addison-Wesley
- Year
- 0
- Tongue
- English
- Leaves
- 245
- Series
- Data & Analytics Series
- Edition
- 1st
- Category
- Library
✦ Synopsis
The array of tools for collecting, storing, and gaining insight from data is huge and
getting bigger every day. For people entering the field, that means digging through
hundreds of Web sites and dozens of books to get the basics of working with data at
scale. That’s why this book is a great addition to the Addison-Wesley Data & Analytics
series; it provides a broad overview of tools, techniques, and helpful tips for building
large data analysis systems.
Michael is the perfect author to provide this introduction to Big Data analytics. He
worked on the Cloud Platform Developer Relations team at Google, helping developers with BigQuery, Google’s hosted platform for analyzing terabytes of data quickly.
He brings his breadth of experience to this book, providing practical guidance for
anyone looking to start working with Big Data or anyone looking for additional tips,
tricks, and tools.
The introductory chapters start with guidelines for success with Big Data systems
and introductions to NoSQL, distributed computing, and the CAP theorem. An introduction to analytics at scale using Hadoop and Hive is followed by coverage of real-time analytics with BigQuery. More advanced topics include MapReduce pipelines,
Pig and Cascading, and machine learning with Mahout. Finally, you’ll see examples
of how to blend Python and R into a working Big Data tool chain. Throughout all
of this material are examples that help you work with and learn the tools. All of this
combines to create a perfect book to read for picking up a broad understanding of Big
Data analytics.
—Paul Dix, Series Editor
✦ Table of Contents
Contents
Foreword
Preface
Acknowledgments
About the Author
I: Directives in the Big Data Era
When Data Became a BIG Deal
Data and the Single Server
The Big Data Trade-Off
Anatomy of a Big Data Pipeline
Summary
II: Collecting and Sharing a Lot of Data
2 Hosting and Sharing Terabytes of Raw Data
Suffering from Files
Storage: Infrastructure as a Service
Choosing the Right Data Format
Character Encoding
Data in Motion: Data Serialization Formats
Summary
Relational Databases: Command and Control
Relational Databases versus the Internet
Nonrelational Database Models
Leaning toward Write Performance: Redis
Sharding across Many Redis Instances
NewSQL: The Return of Codd
Summary
A Warehouse Full of Jargon
Hadoop: The Elephant in the Warehouse
Data Silos Can Be Good
Convergence: The End of the Data Silo
Summary
III: Asking Questions about Your Data
What Is a Data Warehouse?
Apache Hive: Interactive Querying for Hadoop
Shark: Queries at the Speed of RAM
Data Warehousing in the Cloud
Summary
Analytical Databases
Dremel: Spreading the Wealth
BigQuery: Data Analytics as a Service
Building a Custom Big Data Dashboard
The Future of Analytical Query Engines
Summary
7 Visualization Strategies for Exploring Large Datasets
Cautionary Tales: Translating Data into Narrative
Human Scale versus Machine Scale
Building Applications for Data Interactivity
Summary
IV: Building Data Pipelines
What Is a Data Pipeline?
Data Pipelines with Hadoop Streaming
A One-Step MapReduce Transformation
Managing Complexity: Python MapReduce Frameworks for Hadoop
Summary
9 Building Data Transformation Workflows with Pig and Cascading
It’s Complicated: Multistep MapReduce Transformations
Cascading: Building Robust Data-Workflow Applications
Summary
V: Machine Learning for Large Datasets
10 Building a Data Classification System with Mahout
Challenges of Machine Learning
Apache Mahout: Scalable Machine Learning
MLBase: Distributed Machine Learning Framework
Summary
VI: Statistical Analysis for Massive Datasets
11 Using R with Large Datasets
Why Statistics Are Sexy
Strategies for Dealing with Large Datasets
Summary
The Snakes Are Loose in the Data Zoo
Python Libraries for Data Processing
Building More Complex Workflows
iPython: Completing the Scientific Computing Tool Chain
Summary
VII: Looking Ahead
Overlapping Solutions
Understanding Your Data Problem
A Playbook for the Build versus Buy Problem
My Own Private Data Center
Understand the Costs of Open-Source
Summary
14 The Future: Trends in Data Technology
Hadoop: The Disruptor and the Disrupted
Everything in the Cloud
The Rise and Fall of the Data Scientist
Convergence: The Ultimate Database
Convergence of Cultures
Summary
Index
✦ Subjects
Computer Science and Engineering; Artificial Intelligence; Data Mining
📜 SIMILAR VOLUMES
Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQ
This edited book collects state-of-the-art research related to large-scale data analytics that has been accomplished over the last few years. This is among the first books devoted to this important area based on contributions from diverse scientific areas such as databases, data mining, superc
This book presents a language integrated query framework for big data. The continuous, rapid growth of data information to volumes of up to terabytes (1,024 gigabytes) or petabytes (1,048,576 gigabytes) means that the need for a system to manage and query information from large scale data sources
Real-Time Data Analytics for Large-Scale Sensor Data covers the theory and applications of hardware platforms and architectures, the development of software methods, techniques and tools, applications, governance and adoption strategies for the use of massive sensor data in real-time data
Big Data Analytics with Spark is a step-by-step guide for learning Spark, which is an open-source fast and general-purpose cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, inter