php|architect's Guide to Web Scraping with PHP

✍ Scribed by Matthew Turland


Publisher: musketeers.me, LLC
Year: 2010
Tongue: English
Leaves: 192
Category: Library

✦ Synopsis


Despite all the advancements in web APIs and interoperability, it’s inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity: for example, capturing data from an old version of a website for insertion into a modern CMS.
This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic, using a variety of technologies and frameworks:
* Understanding HTTP requests
* The PHP HTTP streams wrapper
* cURL
* pecl_http
* PEAR::HTTP_Client
* Zend_Http_Client
* Building your own scraping library
* Using Tidy
* Analyzing code with the DOM, SimpleXML and XMLReader extensions
* CSS selector libraries
* PCRE pattern matching
* Tips and Tricks
* Multiprocessing / parallel processing

✦ Table of Contents


Credits
Foreword
Intended Audience
Web Scraping Defined
Applications of Web Scraping
Topics Covered
HTTP
Requests
GET Requests
Anatomy of a URL
Query Strings
POST Requests
Responses
Cookies
Referring URLs
Persistent Connections
User Agents
Ranges
Basic HTTP Authentication
Digest HTTP Authentication
Wrap-Up
HTTP Streams Wrapper
Simple Request and Response Handling
Stream Contexts and POST Requests
Error Handling
HTTP Authentication
Wrap-Up
cURL Extension
Contrasting GET and POST
Handling Headers
Debugging
Cookies
HTTP Authentication
User Agents
DNS Caching
Request Pooling
Wrap-Up
pecl_http PECL Extension
POST Requests
Handling Headers
Content Encoding
Cookies
HTTP Authentication
Byte Ranges
Request Pooling
Wrap-Up
PEAR::HTTP_Client
Requests and Responses
Juggling Data
Wrangling Headers
Using the Client
Observing Requests
Wrap-Up
Basic Requests
Responses
Custom Headers
Configuration
Debugging
Cookies
User Agents
Wrap-Up
Sending Requests
Parsing Responses
Transfer Encoding
Content Encoding
Timing
Validation
Input
Configuration
Options
Debugging
Wrap-Up
DOM Extension
Loading Documents
Tree Terminology
Locating Nodes
XPath and DOMXPath
Absolute Addressing
Addressing Attributes
Conditions
Resources
Loading a Document
Accessing Elements
Accessing Attributes
XPath
Wrap-Up
XMLReader Extension
Loading a Document
Iteration
Elements and Attributes
Wrap-Up
Reason to Use Them
Basics
Basic Filters
Attribute Filters
Form Filters
Zend_Dom_Query
DOMQuery
Wrap-Up
PCRE Extension
Pattern Basics
Anchors
Repetition and Quantifiers
Subpatterns
Matching
Escape Sequences
Modifiers
Wrap-Up
Batch Jobs
Parallel Processing
Forms
Testing
That's All Folks
Legality of Web Scraping
Multiprocessing


📜 SIMILAR VOLUMES


php architect's Guide to Web Scraping
✍ Matthew Turland 📂 Library 📅 2010 🏛 Marco Tabini & Associates, Inc. 🌐 English

Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entire

php|architect's Guide to Web Scraping
✍ Matthew Turland 📂 Library 📅 2010 🏛 musketeers.me, LLC 🌐 English

Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entire

php|architect's Guide to PHP Security
✍ Ilia Alshanetsky, Rasmus Lerdorf 📂 Library 📅 2005 🏛 Marco Tabini & Associates, Inc. 🌐 English

Overall, an excellent resource for security. Its small size means that topics are narrow enough to be digested and acted upon individually.

php architect's Guide to PHP Security
✍ Ilia Alshanetsky 📂 Library 📅 2005 🏛 Marco Tabini & Associates, Inc. 🌐 English

With the number of security flaws and exploits discovered and released every day constantly on the rise, knowing how to write secure and reliable applications is becoming more and more important every day. Written by Ilia Alshanetsky, one of the foremost experts on PHP security in the world, php|ar

php|architect's Guide to PHP Security
✍ Ilia Alshanetsky, Rasmus Lerdorf 📂 Library 📅 2005 🌐 English

With the number of security flaws and exploits discovered and released every day constantly on the rise, knowing how to write secure and reliable applications is becoming more and more important every day. Written by Ilia Alshanetsky, one of the foremost experts on PHP security in the world, php|archi