User guide
This page gives a tour of reader’s features, and a few examples of how to use them.
Note
Before starting, make sure that reader is installed and up-to-date.
The Reader object
The Reader
object persists feed and entry state
and provides operations on them.
To create a new Reader,
call make_reader()
with the path to a database file:
>>> from reader import make_reader
>>> reader = make_reader("db.sqlite")
The default (and currently only) storage uses SQLite,
so the path behaves like the database
argument of sqlite3.connect()
:
If the database does not exist, it will be created automatically.
You can pass
":memory:"
to use a temporary in-memory database; the data will disappear when the reader is closed.
Lifecycle
In order to perform maintenance tasks and release underlying resources in a predictable manner, you should use the reader as a context manager:
with make_reader('db.sqlite') as reader:
    ...  # do stuff with reader
For convenience, you can also use the reader directly.
In this case, maintenance tasks may sometimes (rarely) be performed
before arbitrary method calls return.
You can still release the underlying resources
by calling close()
.
with reader
is roughly equivalent to with contextlib.closing(reader)
,
but the former suspends regular maintenance tasks
for the duration of the with block.
In either case, you can reuse the reader object after closing it; database connections will be re-created automatically.
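As a rough illustration of the contextlib.closing() comparison above, here is a minimal sketch that uses a hypothetical StandInReader class instead of a real Reader (so it runs without a database). It only shows that close() is called when the with block exits; it does not model the suspension of maintenance tasks.

```python
import contextlib


class StandInReader:
    """Hypothetical stand-in for a Reader; it only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


stand_in = StandInReader()

# closing() calls stand_in.close() when the block exits, even on error.
with contextlib.closing(stand_in) as r:
    closed_inside_block = r.closed

print(closed_inside_block)  # False
print(stand_in.closed)      # True
```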
Threading
You can use the same reader instance from multiple threads:
>>> from threading import Thread
>>> Thread(target=reader.update_feeds).start()
You should use the reader as a context manager
or call its close()
method
from each thread where it is used.
It is not always possible to close the reader from your code,
especially when you do not control how threads are shut down
– for example, if you want
to use a reader across requests in a Flask web application,
or with a ThreadPoolExecutor
.
If you do not close the reader, it will attempt
to call close()
before the thread ends.
Currently, this does not work on PyPy,
or if the thread was not created through the threading
module
(but note that database connections will eventually be closed anyway
when garbage-collected).
Temporary databases
In order to maximize the usefulness of temporary databases,
the database connection is closed (and the data discarded)
only when calling close()
,
not when using the reader as a context manager.
The reader cannot be reused after calling close()
.
>>> reader = make_reader(':memory:')
>>> with reader:
...     reader.set_tag((), 'tag')
...
>>> list(reader.get_tag_keys(()))
['tag']
>>> reader.close()
>>> list(reader.get_tag_keys(()))
Traceback (most recent call last):
...
reader.exceptions.StorageError: usage error: cannot reuse a private database after close()
It is not possible to use a private, temporary SQLite database from other threads, since each connection would be to a different database:
>>> Thread(target=reader.update_feeds).start()
Exception in thread Thread-1 (update_feeds):
Traceback (most recent call last):
...
reader.exceptions.StorageError: usage error: cannot use a private database from threads other than the creating thread
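The "each connection would be to a different database" behavior comes from SQLite itself and can be seen with the standard sqlite3 module alone. The sketch below does not use reader; the tags table and the table_names() helper are made up for illustration.

```python
import sqlite3

# Two connections to ":memory:" get two separate, private databases.
first = sqlite3.connect(":memory:")
second = sqlite3.connect(":memory:")

first.execute("CREATE TABLE tags (key TEXT)")
first.execute("INSERT INTO tags VALUES ('tag')")


def table_names(connection):
    """Return the names of the tables visible through this connection."""
    rows = connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]


first_tables = table_names(first)
second_tables = table_names(second)
print(first_tables)   # ['tags']
print(second_tables)  # []

# Closing a connection discards its in-memory database.
first.close()
second.close()
```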
Adding feeds
To add a feed, call the add_feed()
method with the feed URL:
>>> reader.add_feed("https://www.relay.fm/cortex/feed")
>>> reader.add_feed("http://www.hellointernet.fm/podcast?format=rss")
Most of the attributes of a new feed are empty (to populate them, the feed must be updated):
>>> feed = reader.get_feed("http://www.hellointernet.fm/podcast?format=rss")
>>> print(feed)
Feed(url='http://www.hellointernet.fm/podcast?format=rss', updated=None, title=None, ...)
File-system access
reader supports http(s):// and local (file:) feeds.
For security reasons, local feeds are disabled by default.
You can allow full file-system access or restrict it to a single directory
by using the feed_root
make_reader()
argument:
>>> # all local feed paths allowed
>>> reader = make_reader("db.sqlite", feed_root='')
>>> # local feed paths are relative to /feeds
>>> reader = make_reader("db.sqlite", feed_root='/feeds')
>>> # ok, resolves to /feeds/feed.xml
>>> reader.add_feed("feed.xml")
>>> # ok, resolves to /feeds/also/feed.xml
>>> reader.add_feed("file:also/feed.xml")
>>> # error, resolves to /feed.xml, which is above /feeds
>>> reader.add_feed("file:../feed.xml")
Traceback (most recent call last):
...
ValueError: path cannot be outside root: '/feed.xml'
Note that it is possible to add invalid feeds; updating them will still fail, though:
>>> reader.add_feed("file:../feed.xml", allow_invalid_url=True)
>>> reader.update_feed("file:../feed.xml")
Traceback (most recent call last):
...
reader.exceptions.ParseError: path cannot be outside root: '/feed.xml': 'file:../feed.xml'
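To make the "resolves to" comments above concrete, here is a minimal sketch of how a relative path can be resolved against a root and rejected if it escapes it. This is not reader's implementation; resolve_under_root() is a hypothetical helper that assumes POSIX-style paths and an absolute root without a trailing slash, and it does not model feed_root=''.

```python
import posixpath


def resolve_under_root(root, url):
    """Resolve a local feed path against root; reject anything outside it."""
    # Accept both bare paths and "file:" URLs with a relative path.
    path = url[len("file:"):] if url.startswith("file:") else url
    # normpath() collapses ".." segments so the comparison below is meaningful.
    resolved = posixpath.normpath(posixpath.join(root, path))
    if posixpath.commonpath([root, resolved]) != root:
        raise ValueError(f"path cannot be outside root: {resolved!r}")
    return resolved


print(resolve_under_root("/feeds", "feed.xml"))            # /feeds/feed.xml
print(resolve_under_root("/feeds", "file:also/feed.xml"))  # /feeds/also/feed.xml
```

Calling resolve_under_root("/feeds", "file:../feed.xml") raises ValueError, since the path resolves to /feed.xml.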
Deleting feeds
To delete a feed and all the data associated with it,
use delete_feed()
:
>>> reader.delete_feed("https://www.example.com/feed.xml")
Updating feeds
To retrieve the latest version of a feed, along with any new entries,
it must be updated.
You can update all the feeds by using the update_feeds()
method:
>>> reader.update_feeds()
>>> reader.get_feed(feed)
Feed(url='http://www.hellointernet.fm/podcast?format=rss', updated=datetime.datetime(2020, 2, 28, 9, 34, 2, tzinfo=datetime.timezone.utc), title='Hello Internet', ...)
To retrieve feeds in parallel, use the workers
flag:
>>> reader.update_feeds(workers=10)
You can also update a specific feed using update_feed()
:
>>> reader.update_feed("http://www.hellointernet.fm/podcast?format=rss")
If supported by the server, reader uses the ETag and Last-Modified headers to only retrieve feeds if they changed (details). Even so, you should not update feeds too often, to avoid wasting the feed publisher’s resources, and potentially getting banned; every 30 minutes seems reasonable.
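For background, ETag and Last-Modified work through HTTP conditional requests: the client sends back the values it saw last time, and a server that supports them can answer 304 Not Modified instead of sending the whole feed again. The sketch below shows only the general mechanism with the standard library; it is not how reader implements it, and conditional_headers() and the header values are made up for illustration.

```python
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the values seen in a previous response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


headers = conditional_headers(
    etag='"abc123"',  # hypothetical ETag from an earlier response
    last_modified="Fri, 28 Feb 2020 09:34:02 GMT",
)

# Only the request object is built here; nothing is sent over the network.
request = urllib.request.Request("https://www.example.com/feed.xml", headers=headers)
print(headers)
```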
To support updating newly-added feeds off the regular update schedule,
you can use the new_only
flag;
you can call this more often (e.g. every minute):
>>> reader.update_feeds(new_only=True)
If you need the status of each feed as it gets updated
(for instance, to update a progress bar),
you can use update_feeds_iter()
instead,
and get a (url, value) pair for each feed, where the value is the update result, None if the feed was not modified, or an exception:
>>> for url, value in reader.update_feeds_iter():
...     if value is None:
...         print(url, "not modified")
...     elif isinstance(value, Exception):
...         print(url, "error:", value)
...     else:
...         print(url, value.new, "new,", value.updated, "updated")
...
http://www.hellointernet.fm/podcast?format=rss 100 new, 0 updated
https://www.relay.fm/cortex/feed not modified
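If you want totals rather than per-feed output, the same pairs can be folded into a summary. The sketch below is one possible approach; it runs on made-up data, FakeResult is a hypothetical stand-in that models only the new and updated attributes used above, and summarize() is not part of reader.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the per-feed result object.
FakeResult = namedtuple("FakeResult", "new updated")


def summarize(pairs):
    """Count outcomes from an iterable of (url, value) pairs."""
    summary = Counter()
    for url, value in pairs:
        if value is None:
            summary["not modified"] += 1
        elif isinstance(value, Exception):
            summary["failed"] += 1
        else:
            summary["updated"] += 1
            summary["new entries"] += value.new
    return dict(summary)


fake_pairs = [
    ("http://www.hellointernet.fm/podcast?format=rss", FakeResult(new=100, updated=0)),
    ("https://www.relay.fm/cortex/feed", None),
    ("https://www.example.com/broken.xml", ValueError("stand-in error")),
]
summary = summarize(fake_pairs)
print(summary)
```

With a real reader, you would presumably pass reader.update_feeds_iter() instead of fake_pairs.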
Disabling feed updates
Sometimes, it is useful to skip a feed when using update_feeds()
;
for example, the feed does not exist anymore,
and you want to stop requesting it unnecessarily during regular updates,
but still want to keep its entries (so you cannot remove it).
disable_feed_updates()
allows you to do exactly that:
>>> reader.disable_feed_updates(feed)
You can check if updates are enabled for a feed by looking at its
updates_enabled
attribute:
>>> reader.get_feed(feed).updates_enabled
False
Getting feeds
As seen in the previous sections,
get_feed()
returns a Feed
object
with more information about a feed:
>>> from prettyprinter import pprint, install_extras
>>> install_extras(include=['dataclasses'])
>>> feed = reader.get_feed(feed)
>>> pprint(feed)
reader.types.Feed(
    url='http://www.hellointernet.fm/podcast?format=rss',
    updated=datetime.datetime(
        year=2020,
        month=2,
        day=28,
        hour=9,
        minute=34,
        second=2,
        tzinfo=datetime.timezone.utc
    ),
    title='Hello Internet',
    link='http://www.hellointernet.fm/',
    author='CGP Grey',
    added=datetime.datetime(2020, 10, 12, tzinfo=datetime.timezone.utc),
    last_updated=datetime.datetime(2020, 10, 12, tzinfo=datetime.timezone.utc)
)
To get all the feeds, use the get_feeds()
method:
>>> for feed in reader.get_feeds():
...     print(
...         feed.title or feed.url,
...         f"by {feed.author or 'unknown author'},",
...         f"updated on {feed.updated or 'never'}",
...     )
...
Cortex by Relay FM, updated on 2020-09-14 12:15:00+00:00
Hello Internet by CGP Grey, updated on 2020-02-28 09:34:02+00:00
get_feeds()
also allows
filtering feeds by their tags, by whether the last update succeeded,
or by whether updates are enabled, and changing the feed sort order.
Changing feed URLs
Sometimes, feeds move from one URL to another.
This can be handled naively by removing the old feed and adding the new URL; however, all the data associated with the old feed would get lost, including any old entries (some feeds only have the last X entries).
To change the URL of a feed in-place, use change_feed_url()
:
>>> reader.change_feed_url(
...     "https://www.example.com/old.xml",
...     "https://www.example.com/new.xml"
... )
Sometimes, the id of the entries changes as well;
you can handle duplicates by using
the entry_dedupe
plugin.
Getting entries
You can get all the entries, most-recent first,
by using get_entries()
,
which generates Entry
objects:
>>> for entry, _ in zip(reader.get_entries(), range(10)):
...     print(entry.feed.title, '-', entry.title)
...
Cortex - 106: Clear and Boring
...
Hello Internet - H.I. #136: Dog Bingo
get_entries()
allows filtering entries by their feed,
flags, feed tags, or enclosures,
and changing the entry sort order.
Here is an example of getting entries for a single feed:
>>> feed.title
'Hello Internet'
>>> entries = list(reader.get_entries(feed=feed))
>>> for entry in entries[:2]:
...     print(entry.feed.title, '-', entry.title)
...
Hello Internet - H.I. #136: Dog Bingo
Hello Internet - H.I. #135: Place Your Bets
Entry flags
Entries can be marked as read
or as important
.
The flags can be used for filtering:
>>> reader.mark_entry_as_read(entries[0])
>>> entries = list(reader.get_entries(feed=feed, read=False))
>>> for entry in entries[:2]:
...     print(entry.title)
...
H.I. #135: Place Your Bets
# H.I. 134: Boxing Day
The time when a flag was last modified is recorded, and is available via
read_modified
and important_modified
:
>>> for entry in reader.get_entries(feed=feed, limit=2):
...     print(entry.title, '-', entry.read, entry.read_modified)
...
H.I. #136: Dog Bingo - True 2021-10-08 08:00:00+00:00
H.I. #135: Place Your Bets - False None
Full-text search
reader supports full-text searches over the entries’ content
through the search_entries()
method.
>>> reader.update_search()
>>> for result in reader.search_entries('mars'):
...     print(result.metadata['.title'].apply('*', '*'))
...
H.I. #106: Water on *Mars*
search_entries()
generates EntrySearchResult
objects
containing snippets of relevant entry/feed fields,
with the parts that matched highlighted.
By default, results are sorted by relevance;
you can sort them most-recent first by passing sort='recent'
.
Also, you can filter them just as with get_entries()
.
The search index is not updated automatically;
to keep it in sync, you need to call update_search()
when entries change (e.g. after updating/deleting feeds).
update_search()
only updates
the entries that changed since the last call,
so it is OK to call it relatively often.
Because search adds minor overhead to other Reader
methods
and can almost double the size of the database,
it can be turned on/off through the
enable_search()
/ disable_search()
methods.
This is persistent across instances using the same database,
and only needs to be done once.
You can also use the search_enabled
make_reader()
argument
for the same purpose.
By default, search is disabled,
and enabled automatically on the first update_search()
call.
Counting things
You can get aggregated feed and entry counts by using one of the
get_feed_counts()
,
get_entry_counts()
, or
search_entry_counts()
methods:
>>> reader.get_feed_counts()
FeedCounts(total=156, broken=5, updates_enabled=154)
>>> reader.get_entry_counts()
EntryCounts(total=12494, read=10127, important=115, has_enclosures=2823, averages=...)
>>> reader.search_entry_counts('feed: death and gravity')
EntrySearchCounts(total=16, read=16, important=0, has_enclosures=0, averages=...)
The _counts
methods support the same filtering arguments
as their non-_counts
counterparts.
The following example shows how to get counts only for feeds/entries
with a specific tag:
>>> import itertools
>>> for tag in itertools.chain(reader.get_tag_keys((None,)), [False]):
...     feeds = reader.get_feed_counts(tags=[tag])
...     entries = reader.get_entry_counts(feed_tags=[tag])
...     print(f"{tag or '<no tag>'}: {feeds.total} feeds, {entries.total} entries ")
...
podcast: 27 feeds, 2838 entries
python: 39 feeds, 1929 entries
self: 5 feeds, 240 entries
tech: 90 feeds, 7075 entries
webcomic: 6 feeds, 1865 entries
<no tag>: 23 feeds, 1281 entries
For entry counts, the averages
attribute
is the average number of entries per day during the last 1, 3, 12 months,
as a 3-tuple (e.g. to get an idea of how often a feed gets updated):
>>> reader.get_entry_counts().averages
(8.066666666666666, 8.054945054945055, 8.446575342465753)
>>> reader.search_entry_counts('feed: death and gravity').averages
(0.03333333333333333, 0.06593406593406594, 0.043835616438356165)
This example shows how to convert them to monthly statistics:
>>> periods = [(30, 1, 'month'), (91, 3, '3 months'), (365, 12, 'year')]
>>> counts = reader.search_entry_counts('feed: death and gravity')
>>> for avg, (days, months, label) in zip(counts.averages, periods):
...     entries = round(avg * days / months, 1)
...     print(f"{entries} entries/month (past {label})")
...
1.0 entries/month (past month)
2.0 entries/month (past 3 months)
1.3 entries/month (past year)
Pagination
get_feeds()
, get_entries()
,
and search_entries()
can be used in a paginated fashion.
The limit
argument allows limiting the number of results returned;
the starting_after
argument allows skipping results until after
a specific one.
To get the first page, use only limit
:
>>> for entry in reader.get_entries(limit=2):
...     print(entry.title)
...
H.I. #136: Dog Bingo
H.I. #135: Place Your Bets
To get the next page, use the last result from a call as
starting_after
in the next call:
>>> for entry in reader.get_entries(limit=2, starting_after=entry):
...     print(entry.title)
...
# H.I. 134: Boxing Day
Star Wars: The Rise of Skywalker, Hello Internet Christmas Special
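The two steps above can be wrapped in a loop that walks all pages. Here is a minimal sketch; iter_pages() is a hypothetical helper, not part of reader, and fake_get_entries() stands in for a paginated method so the example runs without a database.

```python
def iter_pages(get_page, limit):
    """Yield lists of at most `limit` results until a page comes back empty."""
    last = None
    while True:
        if last is None:
            # First page: use only limit.
            page = list(get_page(limit=limit))
        else:
            # Next pages: continue after the last result seen so far.
            page = list(get_page(limit=limit, starting_after=last))
        if not page:
            return
        yield page
        last = page[-1]


TITLES = ["entry 1", "entry 2", "entry 3", "entry 4", "entry 5"]


def fake_get_entries(limit, starting_after=None):
    """Stand-in for a paginated method; results are plain strings."""
    start = 0 if starting_after is None else TITLES.index(starting_after) + 1
    return iter(TITLES[start:start + limit])


pages = list(iter_pages(fake_get_entries, limit=2))
print(pages)  # [['entry 1', 'entry 2'], ['entry 3', 'entry 4'], ['entry 5']]
```

Since get_entries() accepts limit and starting_after as shown above, iter_pages(reader.get_entries, limit=2) should work the same way with a real reader.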
Plugins
reader supports plugins as a way to extend its default behavior.
To use a built-in plugin, pass the plugin name to make_reader()
:
>>> reader = make_reader("db.sqlite", plugins=[
...     "reader.enclosure_dedupe",
...     "reader.entry_dedupe",
... ])
You can find the full list of built-in plugins here,
and the list of plugins used by default in reader.plugins.DEFAULT_PLUGINS
.
Custom plugins
In addition to built-in plugins, reader also supports custom plugins.
A custom plugin is any callable that takes a Reader
instance
and potentially modifies it in some (useful) way.
To use custom plugins, pass them to make_reader()
:
>>> def function_plugin(reader):
...     print(f"got {reader}")
...
>>> class ClassPlugin:
...     def __init__(self, **options):
...         self.options = options
...     def __call__(self, reader):
...         print(f"got options {self.options} and {reader}")
...
>>> reader = make_reader("db.sqlite", plugins=[
...     function_plugin,
...     ClassPlugin(option=1),
... ])
got <reader.core.Reader object at 0x7f8897824a00>
got options {'option': 1} and <reader.core.Reader object at 0x7f8897824a00>
For a real-world example, see the implementation of the enclosure_dedupe built-in plugin. Using it as a custom plugin looks like this:
>>> from reader.plugins import enclosure_dedupe
>>> reader = make_reader("db.sqlite", plugins=[enclosure_dedupe.init_reader])
Feed and entry arguments
As you may have noticed in the examples above,
feed URLs and Feed
objects can be used interchangeably
as method arguments.
This is by design.
Likewise, wherever an entry argument is expected,
you can either pass a (feed URL, entry id) tuple
or an Entry
(or EntrySearchResult
) object.
You can get this unique identifier in a uniform way by using
the resource_id
property.
This is useful when you need to refer to a reader object in a generic way
from outside Python (e.g. to make a link to the next page
of feeds/entries in a web application).
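As an example of referring to an object from outside Python, the sketch below turns an identifier into a URL-safe token and back, e.g. for a "next page" link. It assumes the identifier is a tuple of strings, like the (feed URL, entry id) tuple mentioned above; the helper names and the entry id are made up, and none of this is part of reader.

```python
import base64
import json


def encode_resource_id(resource_id):
    """Serialize a tuple of strings into a URL-safe token."""
    raw = json.dumps(list(resource_id)).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_resource_id(token):
    """Reverse encode_resource_id()."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return tuple(json.loads(raw))


# Hypothetical entry identifier: (feed URL, entry id).
entry_id = ("http://www.hellointernet.fm/podcast?format=rss", "hypothetical-entry-id")
token = encode_resource_id(entry_id)
print(decode_resource_id(token) == entry_id)  # True
```

The decoded tuple could then be passed wherever an entry argument is expected, for instance as starting_after.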
Streaming methods
All methods that return iterators
(get_feeds()
, get_entries()
etc.)
generate the results lazily.
Some examples of how this is useful:
Consuming the first 100 entries should take roughly the same amount of time, whether you have 1000 or 100000 entries.
Likewise, if you don’t keep the entries around (e.g. append them to a list), memory usage should remain relatively constant regardless of the total number of entries returned.
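The effect of lazy generation can be seen with a plain generator. In this sketch, fake_entries() is a stand-in for a streaming method and counts how many results it actually produced; taking the first 100 with itertools.islice() produces only those 100.

```python
import itertools

stats = {"produced": 0}


def fake_entries(total):
    """Stand-in for a streaming method; counts the results it generates."""
    for number in range(total):
        stats["produced"] += 1
        yield f"entry {number}"


# Only the first 100 results are ever generated, not all 100000.
first_100 = list(itertools.islice(fake_entries(100_000), 100))
print(len(first_100), stats["produced"])  # 100 100
```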
Reserved names
In order to expose reader and plugin functionality directly to the end user,
names starting with .reader.
and .plugin.
are reserved.
This applies to the following names:
tag keys
the top-level keys of dict tag values
Currently, there are no reader-reserved names; new ones will be documented here.
The prefixes can be changed using
reserved_name_scheme
.
Note that changing reserved_name_scheme
does not rename the actual entities,
it just controls how new reserved names are built.
Because of this, I recommend choosing a scheme
before setting up a new reader database,
and sticking with that scheme for its lifetime.
To change the scheme of an existing database,
you must rename the entities listed above yourself.
When choosing a reserved_name_scheme
,
the reader_prefix
and plugin_prefix
should not overlap,
otherwise the reader core and various plugins may interfere with each other.
(For example, if both prefixes are set to .
,
reader-reserved key user_title
and a plugin named user_title
that uses just the plugin name (with no key)
will both end up using the .user_title
tag.)
That said, reader will ensure names reserved by the core and built-in plugin names will never collide, so this is a concern only if you plan to use third-party plugins.
Reserved names can be built programmatically using
make_reader_reserved_name()
and make_plugin_reserved_name()
.
Code that wishes to work with any scheme
should always use these methods to construct reserved names
(especially third-party plugins).
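To see why overlapping prefixes are a problem, here is a minimal sketch of how reserved names could be built from a scheme. It is a stand-in written for this guide, not reader's implementation; the scheme is modeled as a dict, and the separator key is an assumption. Real code should call make_reader_reserved_name() and make_plugin_reserved_name() instead.

```python
# Assumed shape of a scheme; the prefixes match the defaults described above.
DEFAULT_SCHEME = {"reader_prefix": ".reader.", "plugin_prefix": ".plugin.", "separator": "."}
OVERLAPPING_SCHEME = {"reader_prefix": ".", "plugin_prefix": ".", "separator": "."}


def reader_name(scheme, key):
    """Build a reader-reserved name (hypothetical helper)."""
    return scheme["reader_prefix"] + key


def plugin_name(scheme, plugin, key=None):
    """Build a plugin-reserved name, with an optional key (hypothetical helper)."""
    name = scheme["plugin_prefix"] + plugin
    if key is not None:
        name += scheme["separator"] + key
    return name


print(reader_name(DEFAULT_SCHEME, "user_title"))      # .reader.user_title
print(plugin_name(DEFAULT_SCHEME, "user_title"))      # .plugin.user_title
# With both prefixes set to ".", the two names collide:
print(reader_name(OVERLAPPING_SCHEME, "user_title"))  # .user_title
print(plugin_name(OVERLAPPING_SCHEME, "user_title"))  # .user_title
```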
Advanced feedparser features
reader uses feedparser (“Universal Feed Parser”) to parse feeds. It comes with a number of advanced features, most of which reader uses transparently.
Two of these features are worth mentioning separately, since they change the content of the feed. They are always enabled at the moment (disabling them is not currently possible), but may become optional in the future.
Sanitization
Quoting:
Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.
You can find more details about which markup and elements are sanitized in the feedparser documentation.
The following corresponding reader attributes are sanitized:
Relative link resolution
Quoting:
Many feed elements and attributes are URIs. Universal Feed Parser resolves relative URIs according to the XML:Base specification. […]
In addition [to elements treated as URIs], several feed elements may contain HTML or XHTML markup. Certain elements and attributes in HTML can be relative URIs, and Universal Feed Parser will resolve these URIs according to the same rules as the feed elements listed above.
You can find more details about which elements are treated as URIs and HTML markup in the feedparser documentation.
The following corresponding reader attributes are treated as URIs:
The following corresponding reader attributes may be treated as HTML markup, depending on their type attribute or feedparser defaults:
Errors and exceptions
All exceptions that Reader
explicitly raises inherit from
ReaderError
.
If there’s an issue retrieving or parsing the feed,
update_feed()
will raise a ParseError
with the original exception (if any) as cause.
update_feeds()
will just log the exception and move on.
In both cases, information about the cause will be stored on the feed in
last_exception
.
Any unexpected exception raised by the underlying storage implementation
will be reraised as a StorageError
,
with the original exception as cause.
Search methods will raise a SearchError
.
Any unexpected exception raised by the underlying search implementation
will also be reraised as a SearchError
,
with the original exception as cause.
When trying to create a feed, entry, or tag that already exists,
or to operate on one that does not exist,
a corresponding *ExistsError
or *NotFoundError
will be raised.
All functions and methods may raise
ValueError
or TypeError
implicitly or explicitly
if passed invalid arguments.
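"With the original exception as cause" refers to Python's exception chaining: the wrapped exception is available as __cause__ on the one you catch. The sketch below shows the general pattern with the standard library only; StandInStorageError and run_query() are hypothetical and are not reader's classes or code.

```python
import sqlite3


class StandInStorageError(Exception):
    """Hypothetical stand-in for a wrapper error such as StorageError."""


def run_query(connection, sql):
    """Run a query, re-raising low-level errors as the wrapper error."""
    try:
        return connection.execute(sql).fetchall()
    except sqlite3.Error as e:
        # "from e" records the original exception as the cause.
        raise StandInStorageError("unexpected storage error") from e


cause = None
connection = sqlite3.connect(":memory:")
try:
    run_query(connection, "SELECT * FROM missing_table")
except StandInStorageError as e:
    cause = e.__cause__
finally:
    connection.close()

print(type(cause).__name__)  # OperationalError
```

When handling a ParseError, StorageError, or SearchError, inspecting __cause__ the same way should tell you what went wrong underneath.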