Internal API
This part of the documentation covers the internal interfaces of reader, which are useful for plugins, or if you want to use low-level functionality without using Reader itself.
Warning
As of version 3.6, the internal API is not part of the public API; it is not stable yet and might change without any notice.
Parser
- reader._parser.default_parser(feed_root=None, session_timeout=(3.05, 60))
Create a pre-configured Parser.
- Parameters:
feed_root (str or None) – See make_reader() for details.
session_timeout (float or tuple(float, float) or None) – See make_reader() for details.
- Returns:
The parser.
- Return type:
Parser
- class reader._parser.Parser
Retrieve and parse feeds by delegating to retrievers and parsers.
To retrieve and parse a single feed, you can call the parser object directly.
Reader only uses the following methods:
To add retrievers and parsers:
The rest of the methods are low-level methods.
- session_factory
SessionFactory used to create Requests sessions for retrieving feeds.
Plugins may add request or response hooks to this.
- parallel(feeds, map=<class 'map'>, is_parallel=True)
Retrieve and parse many feeds, possibly in parallel.
Yields the parsed feeds, as soon as they are ready.
- Parameters:
feeds (iterable(FeedArgument)) – An iterable of feeds.
map (function) – A map()-like function; the results can be in any order.
is_parallel (bool) – Whether map runs the tasks in parallel.
- Yields:
tuple(FeedArgument, ParsedFeed or None or ParseError) – A (feed, result) pair, where result is either:
- the parsed feed
- None, if the feed didn’t change
- an exception instance
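The contract described above (a map()-like callable, results in any order, and (feed, result) pairs whose result may be a value, None, or an exception instance) can be illustrated with the standard library alone. This is a minimal sketch, not reader's implementation; `fetch_one`, `parallel_sketch`, and `FetchError` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class FetchError(Exception):
    """Hypothetical stand-in for ParseError."""


def fetch_one(url):
    # Hypothetical worker: returns a value, None ("not modified"), or raises.
    if url.endswith("/unchanged"):
        return None
    if url.endswith("/broken"):
        raise FetchError(url)
    return f"parsed:{url}"


def parallel_sketch(urls, map=map):
    """Yield (url, result) pairs; result is a value, None, or an exception."""

    def safe(url):
        # Exceptions are returned rather than raised, so one bad feed
        # does not stop the iteration over the others.
        try:
            return url, fetch_one(url)
        except FetchError as e:
            return url, e

    yield from map(safe, urls)


urls = ["a/ok", "a/unchanged", "a/broken"]

# Serial: the built-in map().
serial = dict(parallel_sketch(urls))

# Parallel: any map()-like callable works, for example a thread pool's map.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = dict(parallel_sketch(urls, map=pool.map))
```

Returning the exception as the result keeps the caller in control of error handling for each feed, which is the shape the Yields field describes.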
- __call__(url, http_etag=None, http_last_modified=None)
Retrieve and parse one feed.
This is a convenience wrapper over parallel().
- Parameters:
- Returns:
The parsed feed or None, if the feed didn’t change.
- Return type:
ParsedFeed or None
- Raises:
- retrieve(url, http_etag=None, http_last_modified=None, is_parallel=False)
Retrieve a feed.
- Parameters:
url (str) – The feed URL.
http_etag (str or None) – The HTTP ETag header from the last update.
http_last_modified (str or None) – The HTTP Last-Modified header from the last update.
is_parallel (bool) – Whether this was called from parallel() (writes the contents to a temporary file, if possible).
- Returns:
A context manager that has as target either the result or None, if the feed didn’t change.
- Return type:
contextmanager(RetrieveResult or None)
- Raises:
- parse(url, result)
Parse a retrieved feed.
- Parameters:
url (str) – The feed URL.
result (RetrieveResult) – A retrieve result.
- Returns:
The feed and entry data.
- Return type:
- Raises:
- get_parser(url, mime_type)
Select an appropriate parser for a feed.
Parsers registered by URL take precedence over those registered by MIME type.
If no MIME type is given, guess it from the URL using mimetypes.guess_type(). If the MIME type can’t be guessed, default to application/octet-stream.
- Parameters:
- Returns:
The parser, and the (possibly guessed) MIME type.
- Return type:
- Raises:
ParseError – No parser matches.
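The selection rules above can be sketched as a standalone function. This is an illustration under stated assumptions, not reader's code: `select_parser`, `NoParserError`, and the two plain dicts standing in for the parser registries are hypothetical, and the exact order in which the real method guesses the MIME type is assumed.

```python
import mimetypes


class NoParserError(Exception):
    """Hypothetical stand-in for ParseError."""


def select_parser(url, mime_type, by_url, by_mime_type):
    """Return (parser, mime_type), following the precedence described above."""
    # 1. A parser registered for this exact URL wins.
    if url in by_url:
        return by_url[url], mime_type
    # 2. Otherwise, guess a missing MIME type from the URL ...
    if not mime_type:
        mime_type, _ = mimetypes.guess_type(url)
    # 3. ... and fall back to a generic binary type if guessing fails.
    if not mime_type:
        mime_type = "application/octet-stream"
    try:
        return by_mime_type[mime_type], mime_type
    except KeyError:
        raise NoParserError(f"no parser for MIME type {mime_type!r}") from None


# Hypothetical registries; the string values stand in for parser objects.
by_url = {"https://example.com/special": "custom-parser"}
by_mime_type = {
    "application/json": "json-parser",
    "application/octet-stream": "fallback-parser",
}

exact = select_parser("https://example.com/special", None, by_url, by_mime_type)
guessed = select_parser("https://example.com/feed.json", None, by_url, by_mime_type)
default = select_parser("https://example.com/feed", None, by_url, by_mime_type)
```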
- validate_url(url)
Check if url is valid without actually retrieving it.
- Raises:
InvalidFeedURLError – If url is not valid.
- mount_retriever(prefix, retriever)
Register a retriever to a URL prefix.
Retrievers are sorted in descending order by prefix length.
- Parameters:
prefix (str) – A URL prefix.
retriever (RetrieverType) – The retriever.
- get_retriever(url)
Get the retriever for a URL.
- Parameters:
url (str) – The URL.
- Returns:
The matching retriever.
- Return type:
- Raises:
ParseError – No retriever matches the URL.
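Sorting by descending prefix length means the most specific prefix is tried first. A minimal sketch of that lookup, assuming simple `str.startswith()` matching; `PrefixRegistry` and `NoRetrieverError` are hypothetical names, and how the real Parser stores its retrievers is not shown in this documentation.

```python
class NoRetrieverError(Exception):
    """Hypothetical stand-in for ParseError."""


class PrefixRegistry:
    def __init__(self):
        self._items = []  # (prefix, retriever) pairs

    def mount(self, prefix, retriever):
        self._items.append((prefix, retriever))
        # Longest prefix first, so the most specific match wins.
        self._items.sort(key=lambda item: len(item[0]), reverse=True)

    def get(self, url):
        for prefix, retriever in self._items:
            if url.startswith(prefix):
                return retriever
        raise NoRetrieverError(f"no retriever for URL {url!r}")


registry = PrefixRegistry()
# The string values stand in for retriever objects.
registry.mount("https://", "generic-retriever")
registry.mount("https://example.com/", "example-retriever")

specific = registry.get("https://example.com/feed.xml")
generic = registry.get("https://other.example.org/feed.xml")
```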
- mount_parser_by_mime_type(parser, http_accept=None)
Register a parser to one or more MIME types.
- Parameters:
parser (ParserType) – The parser.
http_accept (str or None) – The content types the parser supports, as an Accept HTTP header value. If not given, use the parser’s http_accept attribute, if it has one.
- Raises:
TypeError – The parser does not have an http_accept attribute, and no http_accept was given.
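To make the http_accept fallback and the TypeError case concrete, here is a rough sketch of turning an Accept header value into (MIME type, quality) pairs. It is an assumption-laden illustration: `mime_types_from_accept` is a hypothetical helper, the hand-rolled parsing ignores parameters other than `q`, and the real method may parse the header differently.

```python
def mime_types_from_accept(parser, http_accept=None):
    """Return the (MIME type, quality) pairs a parser would be registered for."""
    if http_accept is None:
        # Fall back to the parser's own attribute, as described above.
        http_accept = getattr(parser, "http_accept", None)
    if http_accept is None:
        raise TypeError("parser has no http_accept attribute and none was given")

    pairs = []
    for item in http_accept.split(","):
        mime_type, *params = (part.strip() for part in item.split(";"))
        if not mime_type:
            continue
        quality = 1.0  # default quality when no q parameter is present
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        pairs.append((mime_type, quality))
    return pairs


class AtomParser:
    """Hypothetical parser that advertises what it can handle."""

    http_accept = "application/atom+xml, application/rss+xml;q=0.9"


from_attribute = mime_types_from_accept(AtomParser())
from_argument = mime_types_from_accept(object(), "text/xml;q=0.5")
```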
- get_parser_by_mime_type(mime_type)
Get a parser for a MIME type.
- Parameters:
mime_type (str) – The MIME type of the feed resource.
- Returns:
The parser.
- Return type:
- Raises:
ParseError – No parser matches the MIME type.
- mount_parser_by_url(url, parser)
Register a parser to an exact URL.
- Parameters:
url (str) – A URL.
parser (ParserType) – The parser.
- get_parser_by_url(url)
Get a parser that was registered by URL.
- Parameters:
url (str) – The URL.
- Returns:
The parser.
- Return type:
- Raises:
ParseError – No parser was registered for the URL.
- process_feed_for_update(feed)
Change update-relevant information about a feed before it is passed to the retriever.
Delegates to process_feed_for_update() of the appropriate retriever.
- Parameters:
feed (FeedForUpdate) – Feed information.
- Returns:
The passed-in feed information, possibly modified.
- Return type:
- process_entry_pairs(url, mime_type, pairs)
Process entry data before being stored.
Delegates to process_entry_pairs() of the appropriate parser.
- Parameters:
url (str) – The feed URL.
mime_type (str or None) – The MIME type of the feed.
pairs (iterable(tuple(EntryData, EntryForUpdate or None))) – (entry data, entry for update) pairs.
- Returns:
(entry data, entry for update) pairs, possibly modified.
- Return type:
iterable(tuple(EntryData, EntryForUpdate or None))
- class reader._requests_utils.SessionFactory(...)
Manage the lifetime of a session.
To get a new session, call the factory directly.
- request_hooks
Sequence of RequestHooks to be associated with new sessions.
- response_hooks
Sequence of ResponseHooks to be associated with new sessions.
- __call__()
Create a new session.
- Return type:
- transient()
Return the current persistent() session, or a new one.
If a new session was created, it is closed once the context manager is exited.
- Return type:
contextmanager(SessionWrapper)
- persistent()
Register a persistent session with this factory.
While the context manager returned by this method is entered, all persistent() and transient() calls will return the same session. The session is closed once the outermost persistent() context manager is exited.
Plugins should use transient().
Reentrant, but NOT threadsafe.
- Return type:
contextmanager(SessionWrapper)
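The interplay between persistent() and transient() described above can be modeled with two small context managers. This is a sketch of the documented behavior only, not the SessionFactory source; `FactorySketch` and `FakeSession` are hypothetical, and like the real factory the sketch is reentrant but not threadsafe.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical stand-in for a Requests session."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FactorySketch:
    def __init__(self):
        self._persistent = None  # the shared session, while one is active

    @contextmanager
    def persistent(self):
        if self._persistent is not None:
            # Nested call: reuse the session; the outermost call closes it.
            yield self._persistent
            return
        self._persistent = FakeSession()
        try:
            yield self._persistent
        finally:
            self._persistent.close()
            self._persistent = None

    @contextmanager
    def transient(self):
        if self._persistent is not None:
            # Reuse the persistent session and leave closing to its owner.
            yield self._persistent
            return
        session = FakeSession()
        try:
            yield session
        finally:
            session.close()


factory = FactorySketch()

with factory.persistent() as outer:
    with factory.transient() as inner:
        same_session = inner is outer
    still_open_after_inner = not outer.closed
closed_after_outer = outer.closed

with factory.transient() as temporary:
    pass
temporary_closed = temporary.closed
```

A plugin that only ever calls transient() works in both situations: it shares the session when one is already open and gets a short-lived one otherwise.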
- class reader._requests_utils.SessionWrapper(...)
Minimal wrapper over a requests.Session.
Only provides a limited get() method.
Can be used as a context manager (closes the session on exit).
- session
The underlying requests.Session.
- request_hooks
Sequence of RequestHooks.
- response_hooks
Sequence of ResponseHooks.
- get(url, headers=None, **kwargs)
Like Requests get(), but apply request_hooks and response_hooks.
Protocols
- class reader._parser.FeedArgument(*args, **kwargs)
Any FeedForUpdate-like object.
- property url
The feed URL.
- property http_etag
The HTTP ETag header from the last update.
- property http_last_modified
The HTTP Last-Modified header from the last update.
- class reader._parser.RetrieverType(*args, **kwargs)
A callable that knows how to retrieve a feed.
- slow_to_read
Allow Parser to read() the result resource into a temporary file, and pass that to the parser (as an optimization). Implies the resource is a readable binary file.
- __call__(url, http_etag, http_last_modified, http_accept)
Retrieve a feed.
- Parameters:
- Returns:
A context manager that has as target either the result or None, if the feed didn’t change.
- Return type:
contextmanager(RetrieveResult or None)
- Raises:
- validate_url(url)
Check if url is valid for this retriever.
- Raises:
InvalidFeedURLError – If url is not valid.
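As an illustration of the protocol's shape, the sketch below implements a retriever that serves feeds from an in-memory dict. Everything in it is hypothetical (`MemoryRetriever`, the `memory:` URL scheme, and the `Result` and `InvalidURL` stand-ins for RetrieveResult and InvalidFeedURLError); it only mirrors the signatures listed above and uses no reader imports.

```python
import io
from collections import namedtuple
from contextlib import contextmanager

# Hypothetical stand-in for RetrieveResult (a subset of its fields).
Result = namedtuple("Result", "resource mime_type http_etag")


class InvalidURL(ValueError):
    """Hypothetical stand-in for InvalidFeedURLError."""


class MemoryRetriever:
    """Serve feeds from a dict, keyed by 'memory:' URLs."""

    slow_to_read = False  # the content is already in memory

    def __init__(self, documents):
        self.documents = documents  # url -> (etag, bytes)

    @contextmanager
    def __call__(self, url, http_etag, http_last_modified, http_accept):
        etag, data = self.documents[url]
        if http_etag is not None and http_etag == etag:
            # Unchanged since the last update: the target is None.
            yield None
            return
        with io.BytesIO(data) as resource:
            # The resource is closed when the caller leaves the with block.
            yield Result(resource, "application/xml", etag)

    def validate_url(self, url):
        if not url.startswith("memory:"):
            raise InvalidURL(url)


retriever = MemoryRetriever({"memory:feed": ("v1", b"<rss/>")})

with retriever("memory:feed", None, None, None) as result:
    body = result.resource.read()
    etag = result.http_etag

with retriever("memory:feed", "v1", None, None) as result:
    unchanged = result is None
```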
- class reader._parser.FeedForUpdateRetrieverType(*args, **kwargs)
Bases: RetrieverType[T_co], Protocol
A RetrieverType that can change update-relevant information.
- process_feed_for_update(feed)
Change update-relevant information about a feed before it is passed to the retriever (RetrieverType.__call__()).
- Parameters:
feed (FeedForUpdate) – Feed information.
- Returns:
The passed-in feed information, possibly modified.
- Return type:
- class reader._parser.ParserType(*args, **kwargs)
A callable that knows how to parse a retrieved feed.
- __call__(url, resource, headers)
Parse a feed.
- class reader._parser.HTTPAcceptParserType(*args, **kwargs)
Bases: ParserType[T_cv], Protocol
A ParserType that knows what content it can handle.
- property http_accept
The content types this parser supports, as an Accept HTTP header value.
- class reader._parser.EntryPairsParserType(*args, **kwargs)
Bases: ParserType[T_cv], Protocol
A ParserType that can modify entry data before being stored.
- process_entry_pairs(url, pairs)
Process entry data before being stored.
- Parameters:
url (str) – The feed URL.
pairs (iterable(tuple(EntryData, EntryForUpdate or None))) – (entry data, entry for update) pairs.
- Returns:
(entry data, entry for update) pairs, possibly modified.
- Return type:
iterable(tuple(EntryData, EntryForUpdate or None))
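A process_entry_pairs() implementation is typically a generator that passes pairs through, changing the new entry data where needed. The sketch below assumes a namedtuple stand-in for EntryData; `Entry` and `TitleStrippingParser` are hypothetical names, and the whitespace cleanup is only an example transformation.

```python
from collections import namedtuple

# Hypothetical stand-in for EntryData, with only the fields used here.
Entry = namedtuple("Entry", "feed_url id title")


class TitleStrippingParser:
    """A parser-like object that tidies entry titles before storage."""

    def __call__(self, url, resource, headers):
        raise NotImplementedError("parsing itself is out of scope for this sketch")

    def process_entry_pairs(self, url, pairs):
        # Lazily rewrite each (new entry data, stored entry or None) pair.
        for new, old in pairs:
            if new.title is not None:
                new = new._replace(title=new.title.strip())
            yield new, old


parser = TitleStrippingParser()
pairs = [
    (Entry("memory:feed", "1", "  Hello  "), None),  # entry not stored yet
    (Entry("memory:feed", "2", None), "stored-entry"),  # entry already stored
]
processed = list(parser.process_entry_pairs("memory:feed", pairs))
```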
- class reader._requests_utils.RequestHook(*args, **kwargs)
Hook to modify a Request before it is sent.
- __call__(session, request, **kwargs)
Modify a request before it is sent.
- Parameters:
session (requests.Session) – The session that will send the request.
request (requests.Request) – The request to be sent.
- Keyword Arguments:
**kwargs – Will be passed to send().
- Returns:
A (possibly modified) request to be sent. If None, send the initial request.
- Return type:
requests.Request or None
- class reader._requests_utils.ResponseHook(*args, **kwargs)
Hook to repeat a request depending on the Response.
- __call__(session, response, request, **kwargs)
Repeat a request depending on the response.
- Parameters:
session (requests.Session) – The session that sent the request.
request (requests.Request) – The sent request.
response (requests.Response) – The received response.
- Keyword Arguments:
**kwargs – Were passed to send().
- Returns:
A (possibly new) request to be sent, or None, to return the current response.
- Return type:
requests.Request or None
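How the two hook protocols might fit together in a single get() call can be sketched without Requests at all. The loop below is an assumption about the control flow implied by the descriptions above, not the SessionWrapper source; `send_with_hooks`, `FakeSession`, and the dict-based requests are hypothetical.

```python
class FakeSession:
    """Hypothetical stand-in for requests.Session; send() returns a status code."""

    def send(self, request, **kwargs):
        return 200 if request.get("token") == "fresh" else 401


def send_with_hooks(session, request, request_hooks, response_hooks, **kwargs):
    # Request hooks run first; a None return keeps the current request.
    for hook in request_hooks:
        modified = hook(session, request, **kwargs)
        if modified is not None:
            request = modified
    response = session.send(request, **kwargs)
    # Each response hook may ask for a repeat by returning a new request.
    for hook in response_hooks:
        new_request = hook(session, response, request, **kwargs)
        if new_request is None:
            continue
        request = new_request
        response = session.send(request, **kwargs)
    return response


def add_token(session, request, **kwargs):
    # Request hook: attach a (stale) credential to every request.
    return {**request, "token": "stale"}


def refresh_on_401(session, response, request, **kwargs):
    # Response hook: retry once with a fresh credential after a 401.
    if response == 401:
        return {**request, "token": "fresh"}
    return None


request = {"url": "https://example.com/feed"}
with_hooks = send_with_hooks(FakeSession(), request, [add_token], [refresh_on_401])
without_hooks = send_with_hooks(FakeSession(), request, [], [])
```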
Data objects
- class reader._parser.RetrieveResult(resource, mime_type=None, http_etag=None, http_last_modified=None, headers=None)
The result of retrieving a feed, plus metadata.
- resource
The result of retrieving a feed. Usually, a readable binary file. Passed to the parser.
- mime_type = None
The MIME type of the resource. Used to select an appropriate parser.
- http_etag = None
The HTTP ETag header associated with the resource. Passed back to the retriever on the next update.
- http_last_modified = None
The HTTP Last-Modified header associated with the resource. Passed back to the retriever on the next update.
- headers = None
The HTTP response headers associated with the resource. Passed to the parser.
- class reader._types.ParsedFeed(feed, entries, http_etag=None, http_last_modified=None, mime_type=None)
A parsed feed.
- feed
The feed.
- entries
Iterable of entries.
- http_etag
The HTTP ETag header associated with the feed resource. Passed back to the retriever on the next update.
- http_last_modified
The HTTP Last-Modified header associated with the feed resource. Passed back to the retriever on the next update.
- mime_type
The MIME type of the feed resource. Used by process_entry_pairs() to select an appropriate parser.
- class reader._types.FeedData(url, updated=None, title=None, link=None, author=None, subtitle=None, version=None)
Feed data that comes from the feed.
Attributes are a subset of those of Feed.
- url
- updated = None
- title = None
- link = None
- author = None
- subtitle = None
- version = None
- property resource_id
- property hash
- class reader._types.EntryData(feed_url, id, updated=None, title=None, link=None, author=None, published=None, summary=None, content=(), enclosures=())
Entry data that comes from the feed.
Attributes are a subset of those of Entry.
- feed_url
- id
- updated = None
- title = None
- link = None
- author = None
- published = None
- summary = None
- content = ()
- enclosures = ()
- property resource_id
- property hash
- class reader._types.FeedForUpdate(url, updated, http_etag, http_last_modified, stale, last_updated, last_exception, hash)
Update-relevant information about an existing feed, from Storage.
- url
The feed URL.
- updated
The date the feed was last updated, according to the feed.
- http_etag
The HTTP ETag header from the last update.
- http_last_modified
The HTTP Last-Modified header from the last update.
- stale
Whether the next update should update all entries, regardless of their .updated.
- last_updated
The date the feed was last updated, according to reader; None if never.
- last_exception
Whether the feed had an exception at the last update.
- class reader._types.EntryForUpdate(updated, published, hash, hash_changed)
Update-relevant information about an existing entry, from Storage.
- updated
The date the entry was last updated, according to the entry.
- published
The date the entry was published, according to the entry.
- hash_changed
The number of updates due to a different hash since the last time updated changed.
Storage
- class reader._storage.Storage
Data access object used for all storage except search.
- delete_entries(entries, *, added_by=None)
Delete a list of entries.
- Parameters:
- Raises:
EntryNotFoundError – An entry does not exist.
EntryError – An entry has a different added_by from the given one.
Recipes
Parsing a feed retrieved with something other than reader
Example of using the reader internal API to parse a feed retrieved asynchronously with HTTPX:
$ python examples/parser_only.py
death and gravity
Has your password been pwned? Or, how I almost failed to search a 37 GB text file in under 1 millisecond (in Python)
import asyncio
import io

import httpx
from reader._parser import default_parser
from werkzeug.http import parse_options_header

url = "https://death.andgravity.com/_feed/index.xml"
meta_parser = default_parser()


async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    # to select the parser, we need the MIME type of the response
    content_type = response.headers.get('content-type')
    if content_type:
        mime_type, _ = parse_options_header(content_type)
    else:
        mime_type = None

    # select the parser (raises ParseError if none found)
    parser, _ = meta_parser.get_parser(url, mime_type)

    # wrap the content in a readable binary file
    file = io.BytesIO(response.content)

    # parse the feed; not doing parser(url, file, response.headers) directly
    # because parsing is CPU-intensive and would block the event loop
    feed, entries = await asyncio.to_thread(parser, url, file, response.headers)

    print(feed.title)
    print(entries[0].title)


if __name__ == '__main__':
    asyncio.run(main())