Improved Version
================

Previously, the scraper returned data objects as raw dictionaries.
To give the data more structure, we define two models, ``BooksPage`` and ``BookDetails``, using the ``BaseDataItem`` class.

.. literalinclude:: ../../../examples/stages/books_scraper_improved.py
   :pyobject: BooksPage

.. literalinclude:: ../../../examples/stages/books_scraper_improved.py
   :pyobject: BookDetails


The previous version also had no exception handling for the HTML parsing.
A common problem with chained access in ``BeautifulSoup`` is that a method call or attribute access on a missing element raises an exception.
For example:

.. code-block:: python

   html.find("p", {"class": "price_color"}).text

If the ``p`` tag with class ``price_color`` is missing, ``find()`` returns ``None`` and accessing the ``text`` attribute raises an ``AttributeError``.

Using try-except blocks or if conditions for each attribute is not a very elegant solution.
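
To make the problem concrete, the sketch below parses a small hand-written HTML snippet (an assumption for illustration, not part of the scraper) that has no ``price_color`` element. It shows the failing access chain and the explicit guard that every field would otherwise need.

.. code-block:: python

   from bs4 import BeautifulSoup

   # Hand-written snippet without a <p class="price_color"> element.
   html = BeautifulSoup("<html><body><h1>A Book</h1></body></html>", "html.parser")

   try:
       price = html.find("p", {"class": "price_color"}).text
   except AttributeError as exc:
       # find() returned None, so the .text access failed.
       price = None
       error = exc

   # The guarded alternative needs one explicit check per field.
   title_tag = html.find("h1")
   title = title_tag.text if title_tag is not None else None

With only two fields the guards are manageable, but they multiply quickly as the model grows.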

For this purpose you can use ``BaseDataItem``, a Pydantic model with a global pre-validator that invokes callable values
and stores any exception they raise in the ``errors`` dictionary attribute.
For this to work, the calls on the ``html`` property must be wrapped in a callable, such as a lambda.
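
The idea can be sketched in plain Python. This is only a conceptual stand-in, not the library's implementation: ``resolve_fields`` and the ``page`` dictionary are hypothetical names, and storing ``None`` for a failed field is an assumption made for the example.

.. code-block:: python

   from typing import Any


   def resolve_fields(raw: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Exception]]:
       """Call every callable value and record exceptions instead of raising them."""
       values: dict[str, Any] = {}
       errors: dict[str, Exception] = {}
       for key, value in raw.items():
           if callable(value):
               try:
                   values[key] = value()
               except Exception as exc:
                   # Assumption: a failed field is stored as None.
                   values[key] = None
                   errors[key] = exc
           else:
               values[key] = value
       return values, errors


   page = {"h1": "A Light in the Attic"}  # hypothetical stand-in for parsed HTML

   values, errors = resolve_fields(
       {
           "title": lambda: page["h1"],
           "price": lambda: page["price"],  # missing key, raises KeyError when called
           "url": "https://books.toscrape.com/",
       }
   )

Because the lambdas are only called inside the ``try`` block, a failing field ends up in ``errors`` while the other fields are still populated.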

Example usage:

.. code-block:: python

   BookDetails(
       **{
           "title": lambda: response.html.find("h1").text,
           "price": lambda: response.html.find("p", {"class": "price_color"}).text,
           "url": response.url,
       }
   )

Under the hood, ``BaseDataItem`` uses a ``DataWrapper`` dict. You can import it from ``dataservice`` and use it directly if you need to handle exceptions outside of the context of a ``BaseDataItem``.

.. code-block:: python

   from dataservice import DataWrapper

   wrapped = DataWrapper(
       **{
           "title": lambda: response.html.find("h1").text,
           "price": lambda: response.html.find("p", {"class": "price_color"}).text,
           "url": response.url,
       }
   )
   if wrapped.errors:
       print(wrapped.errors)


We don't want to hammer the server with too many concurrent requests, so we add a random delay between requests using the ``ServiceConfig`` object.
``ServiceConfig`` is a simple class that holds custom configuration for your ``DataService`` object.

We also want to activate the cache in case we need to re-run the scraper. We can do this by setting the ``use`` key of the ``cache`` argument to ``True``.
By default, a file named ``cache.json`` is created in the current working directory. You can specify a custom path with the ``path`` key.
The cache file is written to disk periodically as well as on interrupt signals. You can set the interval in seconds with the ``write_interval`` key. The default is 20 * 60 seconds, i.e. 20 minutes.

.. code-block:: python

   from dataservice import ServiceConfig

   service_config = ServiceConfig(random_delay=1000, cache={"use": True})


``DataService`` doesn't enable logging out of the box. However, it provides a utility function that sets up simple console logging for you.
By default it creates a logger named ``dataservice`` that logs to the console at the DEBUG level. You can also pass a custom logger name to apply the same setup to your own logger, like so:

.. code-block:: python

   from logging import getLogger

   from dataservice import setup_logging

   logger = getLogger("books_scraper")
   setup_logging("books_scraper")


Please note that this utility has been coded for simplicity and may not be suitable for all use cases. For more advanced logging, you should set up your own logger.
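
If you do set up your own logger, a minimal sketch with the standard ``logging`` module could look like the following. ``configure_logger`` is a hypothetical helper, and the format string is an assumption chosen to resemble the console output of the built-in utility.

.. code-block:: python

   import logging
   import sys


   def configure_logger(name: str = "dataservice", level: int = logging.INFO) -> logging.Logger:
       """Attach a single console handler to the named logger."""
       logger = logging.getLogger(name)
       logger.setLevel(level)
       if not logger.handlers:  # avoid duplicate handlers on repeated calls
           handler = logging.StreamHandler(sys.stdout)
           handler.setFormatter(
               logging.Formatter("%(asctime)s :: %(name)s :: %(levelname)s :: %(message)s")
           )
           logger.addHandler(handler)
       return logger


   logger = configure_logger("books_scraper", logging.WARNING)
   logger.warning("Custom logger configured")

Raising the level above DEBUG is one way to reduce the volume of per-request messages.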

So far we haven't done anything with the results. We will now iterate over the ``DataService`` iterator and group the results by class name, then
write them to a JSON file using the ``write`` utility method. Currently the ``write`` method supports JSON and CSV formats.
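
The grouping step can be sketched with the standard library alone. The dataclasses below are simplified stand-ins for the real models, ``group_by_class_name`` is a hypothetical helper, and ``json`` is used in place of the library's ``write`` utility, whose signature is not shown here.

.. code-block:: python

   import json
   from collections import defaultdict
   from dataclasses import asdict, dataclass
   from pathlib import Path
   from tempfile import TemporaryDirectory


   @dataclass
   class BooksPage:  # simplified stand-in for the real model
       url: str


   @dataclass
   class BookDetails:  # simplified stand-in for the real model
       title: str
       url: str


   def group_by_class_name(items):
       """Collect items into lists keyed by the name of their class."""
       grouped = defaultdict(list)
       for item in items:
           grouped[type(item).__name__].append(asdict(item))
       return dict(grouped)


   results = [
       BooksPage(url="https://books.toscrape.com/"),
       BookDetails(title="Emma", url="https://books.toscrape.com/catalogue/emma_17/index.html"),
       BookDetails(title="Frankenstein", url="https://books.toscrape.com/catalogue/frankenstein_20/index.html"),
   ]
   grouped = group_by_class_name(results)

   # Write one JSON file per class into a temporary directory.
   with TemporaryDirectory() as tmp_dir:
       for class_name, rows in grouped.items():
           path = Path(tmp_dir) / f"{class_name}.json"
           path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
       written_files = sorted(p.name for p in Path(tmp_dir).iterdir())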

Finally, we want to be able to pass the pagination argument to the ``parse_books_page()`` function. Since we know that
callbacks are one-argument functions, we will use a lambda to pass the pagination argument. You can also, of course, use a partial function if you want.
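
The two approaches are equivalent, as the following sketch illustrates. ``dispatch`` and this simplified ``parse_books_page`` are hypothetical stand-ins for the framework and the real parser; the real parser's signature may differ.

.. code-block:: python

   from functools import partial


   def parse_books_page(response: str, pagination: bool = True) -> str:
       """Stand-in parser that only reports the arguments it received."""
       return f"{response} (pagination={pagination})"


   def dispatch(callback, response):
       """Stand-in for the framework, which calls a callback with one argument."""
       return callback(response)


   # Both callables accept a single argument and fix the pagination flag.
   with_lambda = dispatch(lambda r: parse_books_page(r, pagination=False), "page-1")
   with_partial = dispatch(partial(parse_books_page, pagination=False), "page-1")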

.. note::
   We previously mentioned that the client can be any Python callable. In our code, however, we create an instance
   of the ``HttpXClient`` class, whose main method ``make_request()`` is invoked via the ``__call__`` magic method.
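
   As a rough illustration of that pattern, the hypothetical ``EchoClient`` below (not part of the library) forwards ``__call__`` to ``make_request()``, so an instance can be used wherever a plain function is expected.

   .. code-block:: python

      class EchoClient:
          """Hypothetical client used only to illustrate the __call__ pattern."""

          def make_request(self, url: str) -> str:
              return f"response for {url}"

          def __call__(self, url: str) -> str:
              # Calling the instance delegates to make_request().
              return self.make_request(url)


      client = EchoClient()
      result = client("https://books.toscrape.com/")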


Full code for the improved example:

.. literalinclude:: ../../../examples/scraper/books_scraper_improved.py


If you now run the script with pagination enabled, you should see something like this:

.. code-block:: bash

   $ python books_scraper_improved.py --pagination

.. code-block::

   2024-08-15 18:38:17,557 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/alice-in-wonderland-alices-adventures-in-wonderland-1_5/index.html
   2024-08-15 18:38:17,634 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/ajin-demi-human-volume-1-ajin-demi-human-1_4/index.html
   2024-08-15 18:38:17,654 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
   2024-08-15 18:38:17,654 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
   2024-08-15 18:38:17,683 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/frankenstein_20/index.html
   2024-08-15 18:38:17,683 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/frankenstein_20/index.html
   2024-08-15 18:38:17,744 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
   2024-08-15 18:38:17,744 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
   2024-08-15 18:38:17,786 :: dataservice.cache :: DEBUG :: Cache miss for https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
   2024-08-15 18:38:17,786 :: dataservice.clients :: INFO :: Requesting https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
   2024-08-15 18:38:18,027 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/emma_17/index.html
   2024-08-15 18:38:18,171 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/frankenstein_20/index.html
   2024-08-15 18:38:18,208 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
   2024-08-15 18:38:18,247 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/deep-under-walker-security-1_15/index.html
   2024-08-15 18:38:18,323 :: dataservice.clients :: INFO :: Received response for https://books.toscrape.com/catalogue/bleach-vol-1-strawberry-and-the-soul-reapers-bleach-1_7/index.html
   2024-08-15 18:38:18,335 :: dataservice.cache :: INFO :: Writing cache to cache.json
   2024-08-15 18:38:18,790 :: dataservice.files :: INFO :: Data written to books_pages.json
   2024-08-15 18:38:18,804 :: dataservice.files :: INFO :: Data written to book_details.json
   'Elapsed time: 80.19 seconds'

Next, let's check out how to write a crawler that can follow links to other pages.