Web scraping with Python: collecting more data from the modern web

✍ Scribed by Mitchell, Ryan E

Publisher: O'Reilly Media
Year: 2018
Tongue: English
Leaves: 392
Edition: 2nd edition
Category: Library

✦ Synopsis

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

Parse complicated HTML pages
Develop crawlers with the Scrapy framework
Learn methods to store data you scrape
Read and extract data from documents
Clean and normalize badly formatted data
Read and write natural languages
Crawl through forms and logins
Scrape JavaScript and crawl through APIs
Use and write image-to-text software
Avoid scraping traps and bot blockers
Use scrapers to test your website
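
As a rough illustration of the "parse it to extract the information you need" step mentioned in the synopsis, here is a minimal sketch. It is not taken from the book, which (per its table of contents) works with BeautifulSoup and Scrapy; this version assumes only Python's standard library so it runs without extra installs. The names `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical, and the HTML is an inline sample rather than a page fetched from a real server.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical page content. A real scraper would first request the page
# over HTTP, for example with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/page1.html">First</a>
  <a href="/page2.html">Second</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The parser is kept separate from any network code so the extraction logic can be exercised on a fixed string; fetching, error handling, and politeness toward the server are the topics the book's chapters cover in depth.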

✦ Table of Contents

Preface
Why Web Scraping?
About This Book
Conventions Used in This Book
Using Code Examples
How to Contact Us
Acknowledgments
I. Building Scrapers
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably and Handling Exceptions
You Don’t Always Need a Hammer
Another Serving of BeautifulSoup
find() and find_all() with BeautifulSoup
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Lambda Expressions
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Planning and Defining Objects
Dealing with Different Website Layouts
Crawling Sites Through Search
Crawling Sites Through Links
Crawling Multiple Page Types
Thinking About Web Crawler Models
5. Scrapy
Initializing a New Spider
Writing a Simple Scraper
Spidering with Rules
Creating Items
The Item Pipeline
Logging with Scrapy
More Resources
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email
II. Advanced Scraping
Document Encoding
Text
Text Encoding and the Global Internet
Reading CSV Files
PDF
Microsoft Word and .docx
Cleaning in Code
Data Normalization
OpenRefine
9. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems
A Brief Introduction to JavaScript
Common JavaScript Libraries
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Additional Selenium Webdrivers
Handling Redirects
A Final Note on JavaScript
A Brief Introduction to APIs
HTTP Methods and APIs
More About API Responses
Parsing JSON
Undocumented APIs
Finding Undocumented APIs
Documenting Undocumented APIs
Finding and Documenting APIs Automatically
Combining APIs with Other Data Sources
More About APIs
13. Image Processing and Text Recognition
Pillow
Tesseract
NumPy
Processing Well-Formatted Text
Adjusting Images Automatically
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions
14. Avoiding Scraping Traps
A Note on Ethics
Adjust Your Headers
Handling Cookies with JavaScript
Timing Is Everything
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
15. Testing Your Website with Scrapers
What Are Unit Tests?
Python unittest
Testing Wikipedia
Interacting with the Site
unittest or Selenium?
16. Web Crawling in Parallel
Multithreaded Crawling
Race Conditions and Queues
The threading Module
Multiprocess Crawling
Multiprocess Crawling
Communicating Between Processes
Multiprocess Crawling—Another Approach
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Running from a Website-Hosting Account
Running from the Cloud
Additional Resources
Trademarks, Copyrights, Patents, Oh My!
Copyright Law
Trespass to Chattels
The Computer Fraud and Abuse Act
robots.txt and Terms of Service
eBay versus Bidder’s Edge and Trespass to Chattels
United States v. Auernheimer and The Computer Fraud and Abuse Act
Field v. Google: Copyright and robots.txt
Moving Forward
Index

✦ Subjects

Computer Science;Programming;Science;Technology;Coding;Reference;Computers;Nonfiction;Technical;Textbooks

📜 SIMILAR VOLUMES

📁 Web Scraping with Python: Collecting More Data from the Modern Web

✍ Ryan Mitchell 📂 Library 📅 2018 🏛 O’Reilly Media 🌐 English

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also

📁 Web scraping with Python: collecting more data from the modern web

✍ Mitchell, Ryan E 📂 Library 📅 2018 🏛 O'Reilly Media, Inc. 🌐 English

📁 Web Scraping with Python : Collecting Data from the Modern Web

✍ Ryan Mitchell 📂 Library 📅 2015 🏛 O'Reilly Media 🌐 English

📁 Web Scraping with Python: Collecting Data from the Modern Web

✍ Ryan Mitchell 📂 Library 📅 2015 🏛 "O'Reilly Media, Inc." 🌐 English

Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. Ideal for programmers, security profe

📁 Web Scraping with Python: Collecting Data from the Modern Web

✍ Ryan Mitchell 📂 Library 📅 2015 🏛 O'Reilly Media 🌐 English

Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. Ideal for programmers,

📁 Web Scraping with Python: Collecting Data from the Modern Web

✍ Ryan Mitchell 📂 Library 📅 2015 🏛 O'Reilly Media 🌐 English