Web Scraping with Python: Collecting Data from the Modern Web
✍ Scribed by Ryan Mitchell
- Publisher: O'Reilly Media, Inc.
- Year: 2015
- Tongue: English
- Leaves: 255
- Category: Library
✦ Synopsis
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands, or even millions, of web pages at once. Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
- Learn how to parse complicated HTML pages
- Traverse multiple pages and sites
- Get a general overview of APIs and how they work
- Learn several methods for storing the data you scrape
- Download, read, and extract data from documents
- Use tools and techniques to clean badly formatted data
- Read and write natural languages
- Crawl through forms and logins
- Understand how to scrape JavaScript
- Learn image processing and text recognition
✦ Table of Contents
Cover
Copyright
Table of Contents
Preface
What Is Web Scraping?
Why Web Scraping?
About This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Part I. Building Scrapers
Chapter 1. Your First Web Scraper
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably
Chapter 2. Advanced HTML Parsing
You Don’t Always Need a Hammer
Another Serving of BeautifulSoup
find() and findAll() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
Beyond BeautifulSoup
Chapter 3. Starting to Crawl
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Crawling with Scrapy
Chapter 4. Using APIs
How APIs Work
Common Conventions
Methods
Authentication
Responses
API Calls
Echo Nest
A Few Examples
Twitter
Getting Started
A Few Examples
Google APIs
Getting Started
A Few Examples
Parsing JSON
Bringing It All Back Home
More About APIs
Chapter 5. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email
Chapter 6. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
CSV
Reading CSV Files
PDF
Microsoft Word and .docx
Part II. Advanced Scraping
Chapter 7. Cleaning Your Dirty Data
Cleaning in Code
Data Normalization
Cleaning After the Fact
OpenRefine
Chapter 8. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources
Chapter 9. Crawling Through Forms and Logins
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems
Chapter 10. Scraping JavaScript
A Brief Introduction to JavaScript
Common JavaScript Libraries
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Handling Redirects
Chapter 11. Image Processing and Text Recognition
Overview of Libraries
Pillow
Tesseract
NumPy
Processing Well-Formatted Text
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions
Chapter 12. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
Chapter 13. Testing Your Website with Scrapers
An Introduction to Testing
What Are Unit Tests?
Python unittest
Testing Wikipedia
Testing with Selenium
Interacting with the Site
Unittest or Selenium?
Chapter 14. Scraping Remotely
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Remote Hosting
Running from a Website Hosting Account
Running from the Cloud
Additional Resources
Moving Forward
Appendix A. Python at a Glance
Installation and “Hello, World!”
Appendix B. The Internet at a Glance
Appendix C. The Legalities and Ethics of Web Scraping
Trademarks, Copyrights, Patents, Oh My!
Copyright Law
Trespass to Chattels
The Computer Fraud and Abuse Act
robots.txt and Terms of Service
Three Web Scrapers
eBay versus Bidder’s Edge and Trespass to Chattels
United States v. Auernheimer and The Computer Fraud and Abuse Act
Field v. Google: Copyright and robots.txt
Index
About the Author
📜 SIMILAR VOLUMES
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands, or even millions, of web pages at once. Ideal for programmers,
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you'll learn how to use Python scripts and web APIs to gather and process data from thousands - or even millions - of web pages at once. Ideal for programmers, security
If programming is magic then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also