Improved Version
Previously, the scraper returned data objects as raw dictionaries.
We want to give our data a bit more structure, so we will define two models, BooksPage and BookDetails, using the BaseDataItem class.
Also, in the previous version there was no exception handling for the HTML parsing.
One common issue when using access chains in BeautifulSoup is that the code raises an exception whenever a method is called or a property is accessed on an element that was not found.
For example:
html.find("p", {"class": "price_color"}).text
If the p tag with class “price_color” is missing, find() returns None, so accessing the text attribute raises an AttributeError.
Using try-except blocks or if conditions for each attribute is not a very elegant solution.
For this specific purpose, you can use BaseDataItem, a Pydantic model with a global pre-validator that invokes callable values
and stores any exception in the errors dictionary attribute.
For this to work, we need to wrap the calls on the html property within a callable.
Example usage:
BookDetails(
    **{
        "title": lambda: response.html.find("h1").text,
        "price": lambda: response.html.find("p", {"class": "price_color"}).text,
        "url": response.url,
    }
)
Under the hood, BaseDataItem uses a DataWrapper dict. You can import it from dataservice and use it directly if you need to handle exceptions outside of the context of a BaseDataItem.
from dataservice import DataWrapper

wrapped = DataWrapper(
    **{
        "title": lambda: response.html.find("h1").text,
        "price": lambda: response.html.find("p", {"class": "price_color"}).text,
        "url": response.url,
    }
)
if wrapped.errors:
    print(wrapped.errors)
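To make the mechanism more concrete, here is a minimal sketch of the idea in plain Python. It is not the library's implementation: resolve_callables and FakeHtml are hypothetical names introduced only for illustration, and the real DataWrapper may behave differently in detail (for instance, in what it stores for a failed field).

```python
from typing import Any


def resolve_callables(**fields: Any) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Call every callable value; keep the results and collect exceptions per key."""
    values: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    for key, value in fields.items():
        if callable(value):
            try:
                values[key] = value()
            except Exception as exc:  # broad on purpose: any parsing failure is recorded
                values[key] = None  # assumption: a failed field falls back to None
                errors[key] = exc
        else:
            values[key] = value
    return values, errors


class FakeHtml:
    """Hypothetical stand-in for a parsed page in which no tag is ever found."""

    def find(self, *args: Any, **kwargs: Any) -> None:
        return None


html = FakeHtml()
values, errors = resolve_callables(
    title=lambda: html.find("h1").text,  # raises AttributeError: None has no .text
    url="https://books.toscrape.com/index.html",
)
print(values)
print(errors)
```

Because the failing access chain is wrapped in a lambda, the exception is raised inside the helper, where it can be caught and recorded, rather than at the point where the arguments are built.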
We don’t want to hammer the server with too many concurrent requests, so we add a random delay between requests using the ServiceConfig object.
ServiceConfig is a simple class that allows custom configuration for your DataService object.
We also want to activate the cache in case we need to re-run the scraper. We can do this by setting the cache attribute to True.
By default, a file named cache.json will be created in the current working directory. You can also specify a custom path with the path key.
Furthermore, the cache file will be written periodically to disk as well as on interrupt signals. You can set the interval in seconds with the write_interval key. Default is 20 * 60 seconds, i.e. 20 minutes.
from dataservice import ServiceConfig
service_config = ServiceConfig(random_delay=1000, cache={"use": True})
DataService doesn’t enable logging out of the box. However, it provides a utility function that sets up simple console logging for you.
By default, it creates a logger named dataservice that logs to the console with the level set to DEBUG. You can also pass a custom logger name to reuse the same setup, like so:
from logging import getLogger

from dataservice import setup_logging
logger = getLogger("books_scraper")
setup_logging("books_scraper")
Please note that this utility has been coded for simplicity and may not be suitable for all use cases. For more advanced logging, you should set up your own logger.
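If you do set up your own logger, a sketch using only the standard logging module could look like the following. The helper name configure_logging is hypothetical, and the format string is an assumption chosen to resemble the console output shown later on this page. It relies on the library logging under the dataservice name, as described above, so child loggers such as dataservice.clients propagate to the handler attached here.

```python
import logging
import sys


def configure_logging(name: str = "dataservice", level: int = logging.INFO) -> logging.Logger:
    """Attach a single console handler to the named logger (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s :: %(name)s :: %(levelname)s :: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records from being emitted again by the root logger
    return logger


logger = configure_logging("dataservice", logging.INFO)
logger.info("Logging configured")
```

Raising the level to INFO here hides the DEBUG messages (such as cache misses) while keeping the request and response lines. Use this in place of setup_logging rather than alongside it, since calling both for the same logger name could attach two handlers.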
So far we haven’t done anything with the results. We will now iterate over the DataService iterator and group the results by class name, then
write them to a JSON file using the write utility method. Currently the write method supports JSON and CSV formats.
Finally, we want to be able to pass the pagination argument to the parse_books_page() function. Since we know that
callbacks are one-argument functions, we will use a lambda to pass the pagination argument. You can also, of course, use a partial function if you want.
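The two options can be compared side by side in a small standalone sketch. Here parse_page is a hypothetical stand-in for parse_books_page() that works on plain strings, so the snippet runs without the library.

```python
from functools import partial


def parse_page(response: str, pagination: bool = False) -> str:
    """Hypothetical two-argument parser standing in for parse_books_page()."""
    return f"{response} (pagination={pagination})"


# Both wrappers accept a single argument, the response, as a callback must.
with_lambda = lambda resp: parse_page(resp, pagination=True)
with_partial = partial(parse_page, pagination=True)

print(with_lambda("page-1"))
print(with_partial("page-1"))
```

One practical difference: partial binds the argument value when the wrapper is created, whereas a lambda that refers to a variable looks it up when the callback is invoked.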
Note
We previously mentioned that the Client can be any Python callable. In our code, however, we create an instance
of the HttpXClient() class, whose main method make_request() is invoked via the magic method __call__.
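The pattern described in this note can be sketched with a toy class. EchoClient is hypothetical and performs no network I/O. The real HttpXClient works with Request and Response objects, so treat this only as an illustration of how __call__ makes an instance usable wherever a callable is expected.

```python
class EchoClient:
    """Hypothetical client: calling the instance delegates to make_request()."""

    def make_request(self, url: str) -> str:
        # A real client would perform the HTTP request here.
        return f"response for {url}"

    def __call__(self, url: str) -> str:
        return self.make_request(url)


client = EchoClient()
print(callable(client))
print(client("https://books.toscrape.com/index.html"))
```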
Full code for the improved example:
"""Simple example of scraping books from a website with pagination argument."""

import argparse
import timeit
from collections import defaultdict
from logging import getLogger
from pprint import pprint
from typing import Iterator
from urllib.parse import urljoin

from dataservice import (
    BaseDataItem,
    DataService,
    HttpXClient,
    Request,
    Response,
    ServiceConfig,
    setup_logging,
)

logger = getLogger("books_scraper")
setup_logging("books_scraper")


class BooksPage(BaseDataItem):
    url: str
    title: str | None
    books: int


class BookDetails(BaseDataItem):
    url: str
    title: str | None
    price: str | None


def parse_books_page(
    response: Response, pagination: bool = False
) -> Iterator[BooksPage | Request]:
    """Parse the books page."""
    articles = response.html.find_all("article", {"class": "product_pod"})
    yield BooksPage(
        **{
            "url": response.request.url,
            "title": lambda: response.html.title.get_text(strip=True),
            "books": len(articles),
        }
    )
    for article in articles:
        href = article.h3.a["href"]
        url = urljoin(response.request.url, href)
        yield Request(url=url, callback=parse_book_details, client=response.client)
    if pagination:
        next_page = response.html.find("li", {"class": "next"})
        if next_page is not None:
            next_page_url = urljoin(response.request.url, next_page.a["href"])
            yield Request(
                url=next_page_url,
                callback=lambda resp: parse_books_page(resp, pagination=pagination),
                client=response.client,
            )


def parse_book_details(response: Response) -> BookDetails:
    """Parse the book details."""
    return BookDetails(
        **{
            "title": lambda: response.html.find("h1").text,
            "price": lambda: response.html.find("p", {"class": "price_color"}).text,
            "url": response.url,
        }
    )


def main(pagination: bool):
    httpx_client = HttpXClient()
    start_requests = [
        Request(
            url="https://books.toscrape.com/index.html",
            callback=lambda resp: parse_books_page(resp, pagination=pagination),
            client=httpx_client,
        )
    ]
    service_config = ServiceConfig(delay={"amount": 10000}, cache={"use": True})
    data_service = DataService(start_requests, service_config)
    data = defaultdict(list)
    for item in data_service:
        data[type(item).__name__].append(item)
    data_service.write("books_pages.json", data["BooksPage"])
    data_service.write("book_details.json", data["BookDetails"])


if __name__ == "__main__":
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--pagination",
        action="store_true",
        help="Enable pagination to scrape multiple pages",
    )
    args = args_parser.parse_args()
    elapsed = timeit.timeit(lambda: main(args.pagination), number=1)
    pprint("Elapsed time: {:.2f} seconds".format(elapsed))
If you now run the script with pagination enabled, you should see something like this:
$ python books_scraper_improved.py --pagination
2024-08-15 18:38:17,557 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/alice-in-wonderland-alices-adventures-in-wonderland-1_5/index.html
2024-08-15 18:38:17,634 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/ajin-demi-human-volume-1-ajin-demi-human-1_4/index.html
2024-08-15 18:38:17,654 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
2024-08-15 18:38:17,654 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
2024-08-15 18:38:17,683 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/frankenstein_20/index.html
2024-08-15 18:38:17,683 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/frankenstein_20/index.html
2024-08-15 18:38:17,744 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
2024-08-15 18:38:17,744 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
2024-08-15 18:38:17,786 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
2024-08-15 18:38:17,786 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
2024-08-15 18:38:18,027 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/emma_17/index.html
2024-08-15 18:38:18,171 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/frankenstein_20/index.html
2024-08-15 18:38:18,208 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
2024-08-15 18:38:18,247 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
2024-08-15 18:38:18,323 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
2024-08-15 18:38:18,335 :: dataservice.cache :: INFO :: Writing cache to cache.json
2024-08-15 18:38:18,790 :: dataservice.files :: INFO :: Data written to books_pages.json
2024-08-15 18:38:18,804 :: dataservice.files :: INFO :: Data written to book_details.json
'Elapsed time: 80.19 seconds'
Next, let’s check out how to write a crawler that can follow links to other pages.