Internal API

This part of the documentation covers the internal interfaces of reader, which are useful for plugins, or if you want to use low-level functionality without using Reader itself.

Warning

As of version 3.5, the internal API is not part of the public API; it is not stable yet and might change without any notice.

Parser

reader._parser.default_parser(feed_root=None, session_timeout=(3.05, 60))

Create a pre-configured Parser.

Parameters
Returns

The parser.

Return type

Parser

class reader._parser.Parser

Retrieve and parse feeds by delegating to retrievers and parsers.

To retrieve and parse a single feed, you can call the parser object directly.

Reader only uses the following methods:

To add retrievers and parsers:

The rest of the methods are low-level methods.

session_factory

SessionFactory used to create Requests sessions for retrieving feeds.

Plugins may add request or response hooks to this.

parallel(feeds, map=<class 'map'>, is_parallel=True)

Retrieve and parse many feeds, possibly in parallel.

Yields the parsed feeds, as soon as they are ready.

Parameters
  • feeds (iterable(FeedArgument)) – An iterable of feeds.

  • map (function) – A map()-like function; the results can be in any order.

  • is_parallel (bool) – Whether map runs the tasks in parallel.

Yields

tuple(FeedArgument, ParsedFeed or None or ParseError) – A (feed, result) pair, where result is either:

  • the parsed feed

  • None, if the feed didn’t change

  • an exception instance
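
The three kinds of result can be told apart with a plain isinstance() / None check. A minimal sketch follows; classify() is a hypothetical helper, and the pairs list is stand-in data rather than real output of parallel().

```python
def classify(pairs):
    """Sort (feed, result) pairs into parsed, unchanged, and failed buckets."""
    parsed, unchanged, failed = [], [], []
    for feed, result in pairs:
        if isinstance(result, Exception):
            # an exception instance: the feed could not be retrieved or parsed
            failed.append((feed, result))
        elif result is None:
            # None: the feed did not change since the last update
            unchanged.append(feed)
        else:
            # anything else is treated as the parsed feed
            parsed.append((feed, result))
    return parsed, unchanged, failed


# hypothetical stand-in data; real pairs would come from Parser.parallel()
pairs = [
    ("https://example.com/a.xml", {"title": "A"}),
    ("https://example.com/b.xml", None),
    ("https://example.com/c.xml", ValueError("not a feed")),
]
parsed, unchanged, failed = classify(pairs)
```

Checking for the exception first matters only if a parsed feed could itself be falsy; the order above avoids relying on truthiness at all.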

__call__(url, http_etag=None, http_last_modified=None)

Retrieve and parse one feed.

This is a convenience wrapper over parallel().

Parameters
  • url (str) – The feed URL.

  • http_etag (str or None) – The HTTP ETag header from the last update.

  • http_last_modified (str or None) – The HTTP Last-Modified header from the last update.

Returns

The parsed feed or None, if the feed didn’t change.

Return type

ParsedFeed or None

Raises

ParseError

retrieve(url, http_etag=None, http_last_modified=None, is_parallel=False)

Retrieve a feed.

Parameters
  • url (str) – The feed URL.

  • http_etag (str or None) – The HTTP ETag header from the last update.

  • http_last_modified (str or None) – The HTTP Last-Modified header from the last update.

  • is_parallel (bool) – Whether this was called from parallel() (writes the contents to a temporary file, if possible).

Returns

A context manager that has as target either the result or None, if the feed didn’t change.

Return type

contextmanager(RetrieveResult or None)

Raises

ParseError

parse(url, result)

Parse a retrieved feed.

Parameters
Returns

The feed and entry data.

Return type

ParsedFeed

Raises

ParseError

get_parser(url, mime_type)

Select an appropriate parser for a feed.

Parsers registered by URL take precedence over those registered by MIME type.

If no MIME type is given, guess it from the URL using mimetypes.guess_type(). If the MIME type can’t be guessed, default to application/octet-stream.

Parameters
  • url (str) – The feed URL.

  • mime_type (str or None) – The MIME type of the retrieved resource.

Returns

The parser, and the (possibly guessed) MIME type.

Return type

tuple(ParserType, str)

Raises

ParseError – No parser matches.
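
The MIME type fallback described above can be sketched with the standard library alone. effective_mime_type() is a hypothetical name used for illustration; it is not part of reader and only mirrors the documented behavior.

```python
import mimetypes


def effective_mime_type(url, mime_type=None):
    """Return mime_type if given, else guess it from the URL."""
    if mime_type:
        return mime_type
    guessed, _ = mimetypes.guess_type(url)
    # guess_type() returns (None, None) when it cannot guess
    return guessed or "application/octet-stream"
```

What guess_type() returns for a given extension depends on the platform's MIME type database, so the guessed value may differ between systems.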

validate_url(url)

Check if url is valid without actually retrieving it.

Raises

InvalidFeedURLError – If url is not valid.

mount_retriever(prefix, retriever)

Register a retriever to a URL prefix.

Retrievers are sorted in descending order by prefix length.

Parameters

get_retriever(url)

Get the retriever for a URL.

Parameters

url (str) – The URL.

Returns

The matching retriever.

Return type

RetrieverType

Raises

ParseError – No retriever matches the URL.
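
Sorting by descending prefix length means the most specific prefix is tried first. The toy registry below illustrates the idea, assuming plain string-prefix matching; PrefixRegistry is hypothetical, and it raises LookupError where Parser would raise ParseError.

```python
class PrefixRegistry:
    """Toy registry: the longest matching prefix wins."""

    def __init__(self):
        self._items = []  # list of (prefix, value) pairs

    def mount(self, prefix, value):
        self._items.append((prefix, value))
        # keep longer (more specific) prefixes first
        self._items.sort(key=lambda item: len(item[0]), reverse=True)

    def get(self, url):
        for prefix, value in self._items:
            if url.startswith(prefix):
                return value
        raise LookupError(f"no retriever for URL: {url!r}")


registry = PrefixRegistry()
registry.mount("https://", "generic HTTPS retriever")
registry.mount("https://example.com/", "example.com retriever")
```

With this ordering, a retriever mounted for a single site shadows the generic one for that site only.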

mount_parser_by_mime_type(parser, http_accept=None)

Register a parser to one or more MIME types.

Parameters
  • parser (ParserType) – The parser.

  • http_accept (str or None) – The content types the parser supports, as an Accept HTTP header value. If not given, use the parser’s http_accept attribute, if it has one.

Raises

TypeError – The parser does not have an http_accept attribute, and no http_accept was given.
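
An Accept header value lists MIME types with optional quality weights. As a rough illustration of how such a value breaks down into individual types, here is a standard-library sketch; parse_accept() is a hypothetical helper and is not how reader itself processes http_accept.

```python
def parse_accept(value):
    """Split an Accept header value into (mime_type, quality) pairs."""
    result = []
    for part in value.split(","):
        mime_type, *params = (p.strip() for p in part.split(";"))
        if not mime_type:
            continue
        quality = 1.0  # default when no q parameter is present
        for param in params:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                quality = float(raw)
        result.append((mime_type.lower(), quality))
    # highest quality first; the sort is stable, so ties keep header order
    return sorted(result, key=lambda pair: pair[1], reverse=True)


accept = "application/rss+xml;q=0.9, application/atom+xml, */*;q=0.1"
pairs = parse_accept(accept)
```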

get_parser_by_mime_type(mime_type)

Get a parser for a MIME type.

Parameters

mime_type (str) – The MIME type of the feed resource.

Returns

The parser.

Return type

ParserType

Raises

ParseError – No parser matches the MIME type.

mount_parser_by_url(url, parser)

Register a parser to an exact URL.

Parameters

get_parser_by_url(url)

Get a parser that was registered by URL.

Parameters

url (str) – The URL.

Returns

The parser.

Return type

ParserType

Raises

ParseError – No parser was registered for the URL.

process_feed_for_update(feed)

Change update-relevant information about a feed before it is passed to the retriever.

Delegates to process_feed_for_update() of the appropriate retriever.

Parameters

feed (FeedForUpdate) – Feed information.

Returns

The passed-in feed information, possibly modified.

Return type

FeedForUpdate

process_entry_pairs(url, mime_type, pairs)

Process entry data before it is stored.

Delegates to process_entry_pairs() of the appropriate parser.

Parameters
  • url (str) – The feed URL.

  • mime_type (str or None) – The MIME type of the feed.

  • pairs (iterable(tuple(EntryData, EntryForUpdate or None))) – (entry data, entry for update) pairs.

Returns

(entry data, entry for update) pairs, possibly modified.

Return type

iterable(tuple(EntryData, EntryForUpdate or None))

class reader._requests_utils.SessionFactory(...)

Manage the lifetime of a session.

To get a new session, call the factory directly.

request_hooks

Sequence of RequestHooks to be associated with new sessions.

response_hooks

Sequence of ResponseHooks to be associated with new sessions.

__call__()

Create a new session.

Return type

SessionWrapper

transient()

Return the current persistent() session, or a new one.

If a new session was created, it is closed once the context manager is exited.

Return type

contextmanager(SessionWrapper)

persistent()

Register a persistent session with this factory.

While the context manager returned by this method is entered, all persistent() and transient() calls will return the same session. The session is closed once the outermost persistent() context manager is exited.

Plugins should use transient().

Reentrant, but NOT threadsafe.

Return type

contextmanager(SessionWrapper)
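
The interplay between persistent() and transient() can be modeled with a few lines of contextlib code. This is a toy model of the semantics described above, not reader's implementation; ToySession and ToySessionFactory are hypothetical names, and like the real factory the sketch is reentrant but not threadsafe.

```python
from contextlib import contextmanager


class ToySession:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ToySessionFactory:
    """Toy model of the persistent()/transient() semantics."""

    def __init__(self):
        self._persistent = None  # shared while persistent() is entered

    def __call__(self):
        return ToySession()

    @contextmanager
    def transient(self):
        if self._persistent is not None:
            # reuse the persistent session; its owner closes it
            yield self._persistent
            return
        session = self()
        try:
            yield session
        finally:
            session.close()

    @contextmanager
    def persistent(self):
        if self._persistent is not None:
            # nested call: reuse, and leave closing to the outermost one
            yield self._persistent
            return
        session = self._persistent = self()
        try:
            yield session
        finally:
            self._persistent = None
            session.close()


factory = ToySessionFactory()
with factory.persistent() as outer:
    with factory.persistent() as inner, factory.transient() as temp:
        same_session = outer is inner is temp
    closed_after_inner = outer.closed
closed_after_outer = outer.closed

with factory.transient() as lone:
    pass
```

In this model, code that uses transient() works the same whether or not a caller further up has entered persistent(), which is presumably why plugins are pointed at transient().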

class reader._requests_utils.SessionWrapper(...)

Minimal wrapper over a requests.Session.

Only provides a limited get() method.

Can be used as a context manager (closes the session on exit).

session

The underlying requests.Session.

request_hooks

Sequence of RequestHooks.

response_hooks

Sequence of ResponseHooks.

get(url, headers=None, **kwargs)

Like Requests get(), but apply request_hooks and response_hooks.

Parameters
Keyword Arguments

**kwargs – Passed to send().

Return type

requests.Response

Protocols

class reader._parser.FeedArgument(*args, **kwargs)

Any FeedForUpdate-like object.

property url

The feed URL.

property http_etag

The HTTP ETag header from the last update.

property http_last_modified

The HTTP Last-Modified header from the last update.

class reader._parser.RetrieverType(*args, **kwargs)

A callable that knows how to retrieve a feed.

slow_to_read

Allow Parser to read() the result resource into a temporary file, and pass that to the parser (as an optimization). Implies the resource is a readable binary file.

__call__(url, http_etag, http_last_modified, http_accept)

Retrieve a feed.

Parameters
  • url (str) – The feed URL.

  • http_etag (str or None) – The HTTP ETag header from the last update.

  • http_last_modified (str or None) – The HTTP Last-Modified header from the last update.

  • http_accept (str or None) – Content types to be retrieved, as an HTTP Accept header.

Returns

A context manager that has as target either the result or None, if the feed didn’t change.

Return type

contextmanager(RetrieveResult or None)

Raises

ParseError

validate_url(url)

Check if url is valid for this retriever.

Raises

InvalidFeedURLError – If url is not valid.
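
To make the protocol concrete, here is a sketch of an object with the shape of a RetrieverType. Everything in it is an assumption made for illustration: the memory: URL scheme, the InMemoryRetriever class, and Result, which is a stand-in with the same field names as RetrieveResult so the example runs without reader installed. A real retriever would return RetrieveResult and raise InvalidFeedURLError / ParseError instead of ValueError / KeyError.

```python
import io
from contextlib import contextmanager
from typing import Any, NamedTuple, Optional


class Result(NamedTuple):
    """Hypothetical stand-in for RetrieveResult (same field names)."""
    resource: Any
    mime_type: Optional[str] = None
    http_etag: Optional[str] = None
    http_last_modified: Optional[str] = None
    headers: Optional[dict] = None


class InMemoryRetriever:
    """Serve feeds from a dict; URLs look like 'memory:<name>'."""

    slow_to_read = False  # the resource is already in memory

    def __init__(self, documents):
        self.documents = documents  # name -> (etag, bytes)

    @contextmanager
    def __call__(self, url, http_etag=None, http_last_modified=None, http_accept=None):
        self.validate_url(url)
        etag, data = self.documents[url.partition(":")[2]]
        if http_etag is not None and http_etag == etag:
            yield None  # the feed did not change
            return
        yield Result(io.BytesIO(data), "application/atom+xml", http_etag=etag)

    def validate_url(self, url):
        if not url.startswith("memory:"):
            # reader would raise InvalidFeedURLError here
            raise ValueError(f"unsupported URL: {url!r}")


retriever = InMemoryRetriever({"news": ("v1", b"<feed/>")})
with retriever("memory:news") as fresh:
    body = fresh.resource.read()
with retriever("memory:news", http_etag="v1") as cached:
    pass
```

The second call passes back the ETag from the first result, so the context manager's target is None, matching the "feed didn't change" case.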

class reader._parser.FeedForUpdateRetrieverType(*args, **kwargs)

Bases: RetrieverType[T_co], Protocol

A RetrieverType that can change update-relevant information.

process_feed_for_update(feed)

Change update-relevant information about a feed before it is passed to the retriever (RetrieverType.__call__()).

Parameters

feed (FeedForUpdate) – Feed information.

Returns

The passed-in feed information, possibly modified.

Return type

FeedForUpdate

class reader._parser.ParserType(*args, **kwargs)

A callable that knows how to parse a retrieved feed.

__call__(url, resource, headers)

Parse a feed.

Parameters
  • resource – The feed resource. Usually, a readable binary file.

  • headers (dict(str, str) or None) – The HTTP response headers associated with the resource.

Returns

The feed and entry data.

Return type

tuple(FeedData, collection(EntryData))

Raises

ParseError

class reader._parser.HTTPAcceptParserType(*args, **kwargs)

Bases: ParserType[T_cv], Protocol

A ParserType that knows what content it can handle.

property http_accept

The content types this parser supports, as an Accept HTTP header value.

class reader._parser.EntryPairsParserType(*args, **kwargs)

Bases: ParserType[T_cv], Protocol

A ParserType that can modify entry data before it is stored.

process_entry_pairs(url, pairs)

Process entry data before it is stored.

Parameters
Returns

(entry data, entry for update) pairs, possibly modified.

Return type

iterable(tuple(EntryData, EntryForUpdate or None))
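
The parser protocols above can be combined in a single class. The sketch below parses a made-up JSON format and trims entry titles in process_entry_pairs(). Feed and Entry are hypothetical stand-ins for FeedData and EntryData so that the example runs without reader installed, and ToyJSONParser is not part of reader; a real parser would also translate failures into ParseError.

```python
import io
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Feed:  # hypothetical stand-in for FeedData
    url: str
    title: Optional[str] = None


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for EntryData
    feed_url: str
    id: str
    title: Optional[str] = None


class ToyJSONParser:
    """Parse a made-up format: {"title": ..., "items": [{"id", "title"}]}."""

    # content types supported, as an Accept header value
    http_accept = "application/json"

    def __call__(self, url, resource, headers=None):
        data = json.load(resource)
        feed = Feed(url, data.get("title"))
        entries = [Entry(url, i["id"], i.get("title")) for i in data["items"]]
        return feed, entries

    def process_entry_pairs(self, url, pairs):
        # strip stray whitespace from titles before the entries are stored
        for new, old in pairs:
            if new.title is not None:
                new = replace(new, title=new.title.strip())
            yield new, old


raw = io.BytesIO(b'{"title": "Toy", "items": [{"id": "1", "title": " Hello "}]}')
parser = ToyJSONParser()
feed, entries = parser("https://example.com/feed.json", raw)
pairs = list(parser.process_entry_pairs(feed.url, [(e, None) for e in entries]))
```

Because the class has an http_accept attribute, it could be passed to mount_parser_by_mime_type() without an explicit http_accept argument, per the description of that method.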

class reader._requests_utils.RequestHook(*args, **kwargs)

Hook to modify a Request before it is sent.

__call__(session, request, **kwargs)

Modify a request before it is sent.

Parameters
Keyword Arguments

**kwargs – Will be passed to send().

Returns

A (possibly modified) request to be sent. If None, send the initial request.

Return type

requests.Request or None

class reader._requests_utils.ResponseHook(*args, **kwargs)

Hook to repeat a request depending on the Response.

__call__(session, response, request, **kwargs)

Repeat a request depending on the response.

Parameters
Keyword Arguments

**kwargs – Were passed to send().

Returns

A (possibly new) request to be sent, or None, to return the current response.

Return type

requests.Request or None
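
Both hook protocols are plain callables, so simple functions are enough. The sketch below shows one of each; the header names and the retry policy are made up for illustration, and the SimpleNamespace objects are stand-ins for requests.Request and requests.Response so that the example runs without Requests installed.

```python
from types import SimpleNamespace


def add_header_hook(session, request, **kwargs):
    """Request hook: set a (hypothetical) header, return the request to send."""
    request.headers["X-Example"] = "1"
    return request


def retry_once_on_503(session, response, request, **kwargs):
    """Response hook: ask for one repeat on HTTP 503, otherwise return None."""
    if response.status_code != 503:
        return None  # keep the current response
    if request.headers.get("X-Retried"):
        return None  # already repeated once; give up
    request.headers["X-Retried"] = "1"
    return request  # send this request again


# stand-ins for requests.Request / requests.Response, for illustration only
request = SimpleNamespace(headers={})
sent = add_header_hook(None, request)
first = retry_once_on_503(None, SimpleNamespace(status_code=503), request)
second = retry_once_on_503(None, SimpleNamespace(status_code=503), request)
ok = retry_once_on_503(None, SimpleNamespace(status_code=200), request)
```

A response hook that always returns a request would repeat forever, so some form of bookkeeping, like the marker header here, is needed to stop.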

Data objects

class reader._parser.RetrieveResult(resource, mime_type=None, http_etag=None, http_last_modified=None, headers=None)

The result of retrieving a feed, plus metadata.

resource

The result of retrieving a feed. Usually, a readable binary file. Passed to the parser.

mime_type = None

The MIME type of the resource. Used to select an appropriate parser.

http_etag = None

The HTTP ETag header associated with the resource. Passed back to the retriever on the next update.

http_last_modified = None

The HTTP Last-Modified header associated with the resource. Passed back to the retriever on the next update.

headers = None

The HTTP response headers associated with the resource. Passed to the parser.

class reader._types.ParsedFeed(feed, entries, http_etag=None, http_last_modified=None, mime_type=None)

A parsed feed.

feed

The feed.

entries

Iterable of entries.

http_etag

The HTTP ETag header associated with the feed resource. Passed back to the retriever on the next update.

http_last_modified

The HTTP Last-Modified header associated with the feed resource. Passed back to the retriever on the next update.

mime_type

The MIME type of the feed resource. Used by process_entry_pairs() to select an appropriate parser.

class reader._types.FeedData(url, updated=None, title=None, link=None, author=None, subtitle=None, version=None)

Feed data that comes from the feed.

Attributes are a subset of those of Feed.

url
updated = None
title = None
author = None
subtitle = None
version = None
as_feed(**kwargs)

Convert this to a feed; kwargs override attributes.

Returns

Feed.

property resource_id
property hash

class reader._types.EntryData(feed_url, id, updated=None, title=None, link=None, author=None, published=None, summary=None, content=(), enclosures=())

Entry data that comes from the feed.

Attributes are a subset of those of Entry.

feed_url
id
updated = None
title = None
author = None
published = None
summary = None
content = ()
enclosures = ()
as_entry(**kwargs)

Convert this to an entry; kwargs override attributes.

Returns

Entry.

property resource_id
property hash

class reader._types.FeedForUpdate(url, updated, http_etag, http_last_modified, stale, last_updated, last_exception, hash)

Update-relevant information about an existing feed, from Storage.

url

The feed URL.

updated

The date the feed was last updated, according to the feed.

http_etag

The HTTP ETag header from the last update.

http_last_modified

The HTTP Last-Modified header from the last update.

stale

Whether the next update should update all entries, regardless of their .updated.

last_updated

The date the feed was last updated, according to reader; None if never.

last_exception

Whether the feed had an exception at the last update.

hash

The hash of the corresponding FeedData.

class reader._types.EntryForUpdate(updated, published, hash, hash_changed)

Update-relevant information about an existing entry, from Storage.

updated

The date the entry was last updated, according to the entry.

published

The date the entry was published, according to the entry.

hash

The hash of the corresponding EntryData.

hash_changed

The number of updates due to a different hash since the last time updated changed.

Recipes

Parsing a feed retrieved with something other than reader

Example of using the reader internal API to parse a feed retrieved asynchronously with HTTPX:

$ python examples/parser_only.py
death and gravity
Has your password been pwned? Or, how I almost failed to search a 37 GB text file in under 1 millisecond (in Python)

import asyncio
import io
import httpx
from reader._parser import default_parser
from werkzeug.http import parse_options_header

url = "https://death.andgravity.com/_feed/index.xml"
meta_parser = default_parser()


async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

        # to select the parser, we need the MIME type of the response
        content_type = response.headers.get('content-type')
        if content_type:
            mime_type, _ = parse_options_header(content_type)
        else:
            mime_type = None

        # select the parser (raises ParseError if none found)
        parser, _ = meta_parser.get_parser(url, mime_type)

        # wrap the content in a readable binary file
        file = io.BytesIO(response.content)

        # parse the feed; not doing parser(url, file, response.headers) directly
        # because parsing is CPU-intensive and would block the event loop
        feed, entries = await asyncio.to_thread(parser, url, file, response.headers)

        print(feed.title)
        print(entries[0].title)


if __name__ == '__main__':
    asyncio.run(main())
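
The recipe uses Werkzeug's parse_options_header() to extract the MIME type from the Content-Type header. If Werkzeug is not available, the standard library can do a similar job; the following is a sketch, and mime_type_from_content_type() is a hypothetical helper, not part of reader.

```python
from email.message import Message


def mime_type_from_content_type(content_type):
    """Return the bare MIME type of a Content-Type value, or None if empty."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    # get_content_type() drops parameters and lowercases the result
    return message.get_content_type()
```

Note that get_content_type() falls back to text/plain for malformed values, which may not be the behavior you want when selecting a parser.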