Books Crawler

In this example we will build a crawler that follows every link it finds on a page, using HttpXClient to fetch the HTML content. The website we are going to crawl is, again, Books to Scrape.

Building a crawler in DataService is fairly trivial. We just need to define a callback that finds all links on the page, yields them, and yields a new Request for each link with itself as the callback, so every linked page is parsed the same way. DataService already implements a deduplication mechanism, so you don’t need to worry about making unnecessary requests. You can access the deduplication config via ServiceConfig.

By default, it uses the following Request attributes to determine if a request is unique:

url
params
method
form_data
json_data
content_type
headers
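To make the idea concrete, here is a minimal sketch of how a deduplication key could be derived from those attributes. It is not DataService's actual implementation: SketchRequest, request_key and unique_requests are hypothetical names introduced only for illustration, and hashing a JSON dump of the fields is just one possible approach.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request; NOT DataService's Request class."""

    url: str
    method: str = "GET"
    params: Optional[dict] = None
    form_data: Optional[dict] = None
    json_data: Optional[dict] = None
    content_type: str = "text"
    headers: Optional[dict] = None


# The attributes listed above, used here to decide whether two requests match.
DEDUP_FIELDS = (
    "url",
    "params",
    "method",
    "form_data",
    "json_data",
    "content_type",
    "headers",
)


def request_key(request: SketchRequest) -> str:
    """Build a stable fingerprint from the deduplication attributes."""
    payload = {name: getattr(request, name) for name in DEDUP_FIELDS}
    # sort_keys makes the key independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def unique_requests(requests: Iterable[SketchRequest]) -> Iterator[SketchRequest]:
    """Yield each request only the first time its fingerprint is seen."""
    seen = set()
    for request in requests:
        key = request_key(request)
        if key in seen:
            continue
        seen.add(key)
        yield request
```

With this sketch, two requests for the same URL collapse into one, while a request for the same URL with different params is treated as distinct. This is why a crawler that keeps rediscovering the same links still terminates.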

First we define a simple DataItem Link that will hold the link details.

class Link(BaseDataItem):
    source: str
    destination: str
    text: str

Then we proceed by defining the parse_links function that will extract all links from the page and yield a new Request object for each link.

def parse_links(response: Response):
    """Find all links on the page"""
    base_url = response.url

    # Only consider anchors that have an href; link["href"] raises KeyError otherwise.
    links = response.html.find_all("a", href=True)
    for link in links:
        if is_same_domain(base_url, link["href"]):
            link_href = urljoin(base_url, link["href"])
            yield Link(
                source=base_url, destination=link_href, text=link.get_text(strip=True)
            )
            yield Request(url=link_href, callback=parse_links, client=response.client)

A few things to note in the function above:

We yield a new Request for each link found on the page, with parse_links as its own callback, so the crawl continues recursively. Each href is resolved against the URL of the current response with urljoin, which turns relative links into absolute URLs and leaves absolute ones unchanged. Links that point to a different domain are skipped by is_same_domain, shown below, so the crawler stays on the site instead of running forever.

def is_same_domain(this_url: str, that_url: str) -> bool:
    """Check if two URLs are on the same domain."""
    these_parts, those_parts = urlparse(this_url), urlparse(that_url)
    if any(not parts.netloc for parts in (these_parts, those_parts)):
        return True
    return these_parts.netloc == those_parts.netloc
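The following standalone sketch shows how the domain check and urljoin behave together on a few sample hrefs. It uses only the standard library. The resolve_links helper and the sample URLs are assumptions made for illustration and are not part of DataService.

```python
from urllib.parse import urljoin, urlparse


def is_same_domain(this_url: str, that_url: str) -> bool:
    """Check if two URLs are on the same domain (same logic as above)."""
    these_parts, those_parts = urlparse(this_url), urlparse(that_url)
    # A URL without a netloc is relative, so it is treated as same-domain.
    if any(not parts.netloc for parts in (these_parts, those_parts)):
        return True
    return these_parts.netloc == those_parts.netloc


def resolve_links(base_url: str, hrefs: list) -> list:
    """Hypothetical helper: keep same-domain hrefs and make them absolute."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        if is_same_domain(base_url, href)
    ]


base_url = "https://books.toscrape.com/catalogue/page-2.html"
hrefs = [
    "page-3.html",  # relative to the current directory
    "../index.html",  # relative to the parent directory
    "https://books.toscrape.com/index.html",  # absolute, same domain
    "https://example.com/other.html",  # absolute, different domain
]

for url in resolve_links(base_url, hrefs):
    print(url)
```

One caveat of this check: any href without a network location counts as same-domain, so a link such as mailto:someone@example.com would also pass it. If that matters for your site, you may want to filter on the URL scheme as well.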

Full code for the crawler example:

"""Simple example of crawling all same-domain links on a website."""

import logging
from urllib.parse import urljoin, urlparse

from dataservice import (
    BaseDataItem,
    DataService,
    HttpXClient,
    Request,
    Response,
    ServiceConfig,
    setup_logging,
)

logger = logging.getLogger("books_crawler")
setup_logging("books_crawler")


class Link(BaseDataItem):
    source: str
    destination: str
    text: str


def is_same_domain(this_url: str, that_url: str) -> bool:
    """Check if two URLs are on the same domain."""
    these_parts, those_parts = urlparse(this_url), urlparse(that_url)
    if any(not parts.netloc for parts in (these_parts, those_parts)):
        return True
    return these_parts.netloc == those_parts.netloc


def parse_links(response: Response):
    """Find all links on the page"""
    base_url = response.url

    # Only consider anchors that have an href; link["href"] raises KeyError otherwise.
    links = response.html.find_all("a", href=True)
    for link in links:
        if is_same_domain(base_url, link["href"]):
            link_href = urljoin(base_url, link["href"])
            yield Link(
                source=base_url, destination=link_href, text=link.get_text(strip=True)
            )
            yield Request(url=link_href, callback=parse_links, client=response.client)


def main():
    client = HttpXClient()
    start_requests = iter(
        [
            Request(
                url="https://books.toscrape.com/index.html",
                callback=parse_links,
                client=client,
            )
        ]
    )
    data_service = DataService(
        start_requests, config=ServiceConfig(cache={"use": True})
    )
    data = tuple(data_service)
    for item in data:
        logger.info(item)
    for k, v in data_service.get_failures().items():
        logger.error(f"Error for URL: {k} - {v}")


if __name__ == "__main__":
    main()

In this example, once the crawl completes, we log every collected item to the console. We then log any errors that occurred during the crawl, which we retrieve by calling the get_failures() method.

Let’s now move on to the REST Client examples to see how we can use DataService to fetch data from REST APIs.