
Web Scraping with Python: Data Extraction from the Modern Web, 3rd Edition

✍ Scribed by Ryan Mitchell


Publisher: O'Reilly Media
Year: 2024
Tongue: English
Leaves: 300
Edition: 3
Category: Library


✦ Synopsis


If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

  • Parse complicated HTML pages
  • Develop crawlers with the Scrapy framework
  • Learn methods to store the data you scrape
  • Read and extract data from documents
  • Clean and normalize badly formatted data
  • Read and write natural...

✦ Table of Contents


    Preface
    What Is Web Scraping?
    Why Web Scraping?
    About This Book
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    I. Building Scrapers
    1. How the Internet Works
    Networking
    Physical Layer
    Data Link Layer
    Network Layer
    Transport Layer
    Session Layer
    Presentation Layer
    Application Layer
    HTML
    CSS
    JavaScript
    Watching Websites with Developer Tools
    2. The Legalities and Ethics of Web Scraping
    Trademarks, Copyrights, Patents, Oh My!
    Copyright Law
    Copyright and artificial intelligence
    Trespass to Chattels
    The Computer Fraud and Abuse Act
    robots.txt and Terms of Service
    Three Web Scrapers
    eBay v. Bidder’s Edge and Trespass to Chattels
    United States v. Auernheimer and the Computer Fraud and Abuse Act
    Field v. Google: Copyright and robots.txt
    3. Applications of Web Scraping
    Classifying Projects
    E-commerce
    Marketing
    Academic Research
    Product Building
    Travel
    Sales
    SERP Scraping
    4. Writing Your First Web Scraper
    Installing and Using Jupyter
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably and Handling Exceptions
    5. Advanced HTML Parsing
    Another Serving of BeautifulSoup
    find() and find_all() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Dealing with children and other descendants
    Dealing with siblings
    Dealing with parents
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions
    You Don’t Always Need a Hammer
    6. Writing Web Crawlers
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet
    7. Web Crawling Models
    Planning and Defining Objects
    Dealing with Different Website Layouts
    Structuring Crawlers
    Crawling Sites Through Search
    Crawling Sites Through Links
    Crawling Multiple Page Types
    Thinking About Web Crawler Models
    8. Scrapy
    Installing Scrapy
    Initializing a New Spider
    Writing a Simple Scraper
    Spidering with Rules
    Creating Items
    Outputting Items
    The Item Pipeline
    Logging with Scrapy
    More Resources
    9. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    “Six Degrees” in MySQL
    Email
    II. Advanced Scraping
    10. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    A history of text encoding
    Encodings in action
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx
    11. Working with Dirty Data
    Cleaning Text
    Working with Normalized Text
    Cleaning Data with Pandas
    Cleaning
    Indexing, Sorting, and Filtering
    More About Pandas
    12. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources
    13. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    HTTP Basic Access Authentication
    Other Form Problems
    14. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    jQuery
    Google Analytics
    Google Maps
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Installing and Running Selenium
    Selenium Selectors
    Waiting to Load
    XPath
    Additional Selenium WebDrivers
    Handling Redirects
    A Final Note on JavaScript
    15. Crawling Through APIs
    A Brief Introduction to APIs
    HTTP Methods and APIs
    More About API Responses
    Parsing JSON
    Undocumented APIs
    Finding Undocumented APIs
    Documenting Undocumented APIs
    Combining APIs with Other Data Sources
    More About APIs
    16. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    Installing Tesseract
    NumPy
    Processing Well-Formatted Text
    Adjusting Images Automatically
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Scraping and preparing images
    Creating box files with the Tesseract trainer project
    Training Tesseract from box files
    Using traineddata files with Tesseract
    Retrieving CAPTCHAs and Submitting Solutions
    17. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies with JavaScript
    TLS Fingerprinting
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist
    18. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    Drag and drop
    Taking screenshots
    19. Web Scraping in Parallel
    Processes Versus Threads
    Multithreaded Crawling
    Race Conditions and Queues
    More Features of the Threading Module
    Multiple Processes
    Multiprocess Crawling
    Communicating Between Processes
    Multiprocess Crawling—Another Approach
    20. Web Scraping Proxies
    Why Use Remote Servers?
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website-Hosting Account
    Running from the Cloud
    Moving Forward
    Web Scraping Proxies
    ScrapingBee
    ScraperAPI
    Oxylabs
    Zyte
    Additional Resources
    Index


    📜 SIMILAR VOLUMES


    Web Scraping with Python: Data Extractio
    ✍ Ryan Mitchell 📂 Library 📅 2024 🏛 O'Reilly Media 🌐 English

    If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also s

    Web Scraping with Python: Data Extractio
    ✍ Ryan Mitchell 📂 Library 📅 2024 🏛 O'Reilly Media 🌐 English

    If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as

    Web Scraping with Python: Collecting Dat
    ✍ Ryan Mitchell 📂 Library 📅 2015 🏛 "O'Reilly Media, Inc." 🌐 English

    Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. Ideal for programmers, security profe

    Web Scraping with Python: Collecting Dat
    ✍ Ryan Mitchell 📂 Library 📅 2015 🏛 O'Reilly Media 🌐 English

    Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once. Ideal for programmers,
