cisticola.scraper.base module

exception cisticola.scraper.base.ChannelDoesNotExistError

Bases: Exception

The specified channel does not exist or has been deleted.

class cisticola.scraper.base.Scraper

Bases: object

Base class for platform-specific scrapers. Each subclass scrapes all posts from a given channel on its own platform.

archive_blob(blob: bytes, content_type: str, key: str) → str

Upload raw bytes of a media file to the storage archive.

Parameters:

blob (bytes) – Raw bytes of the media file to be archived.
content_type (str) – Content-Type of media. e.g. "video/mp4".
key (str) – Unique identifier for the media file.

Returns:

archived_url – URL specifying the file on the storage archive.

Return type:

str

archive_files(result: ScraperResult) → ScraperResult

Archive files corresponding to archived_url dict keys, if the files have not previously been archived.

Parameters: result (ScraperResult) – Previously scraped ScraperResult.
Returns: Same ScraperResult as result, but with all URLs in archived_url dict archived.
Return type: ScraperResult
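
The behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `FakeResult`, its `archived_urls` mapping (original URL to archived URL, or `None` when not yet archived), and the `fetch` and `upload` callables are hypothetical stand-ins for ScraperResult, url_to_blob and archive_blob.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class FakeResult:
    """Hypothetical stand-in for ScraperResult."""
    archived_urls: Dict[str, Optional[str]] = field(default_factory=dict)


def archive_files(
    result: FakeResult,
    fetch: Callable[[str], Tuple[bytes, str, str]],
    upload: Callable[[bytes, str, str], str],
) -> FakeResult:
    """Archive every URL whose archived value is still None."""
    updated = dict(result.archived_urls)
    for url, archived in result.archived_urls.items():
        if archived is None:  # skip files that were archived previously
            blob, content_type, key = fetch(url)
            updated[url] = upload(blob, content_type, key)
    return replace(result, archived_urls=updated)


# Fake fetch/upload functions so the sketch runs without network access.
def fake_fetch(url):
    return b"bytes", "image/jpeg", "key-" + url.rsplit("/", 1)[-1]


def fake_upload(blob, content_type, key):
    return "https://archive.example/" + key


before = FakeResult({
    "https://example.com/a": None,
    "https://example.com/b": "https://archive.example/old",
})
after = archive_files(before, fake_fetch, fake_upload)
```

In this sketch only the entry for `https://example.com/a` is fetched and uploaded; the entry that already has an archived URL is left untouched.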

can_handle(channel: Channel) → bool

Whether or not the scraper can scrape the specified channel.

Parameters: channel (Channel) – Channel to be scraped.
Returns: True if the scraper is capable of scraping channel, False if not.
Return type: bool

cookiefilename = 'cookiefile.txt'

get_posts(channel: Channel, since: ScraperResult | None = None) → Generator[ScraperResult, None, None]

Scrape all posts from the specified Channel.

Parameters:

channel (Channel) – Channel to be scraped.
since (ScraperResult or None) – Most recently scraped ScraperResult from a previous scrape, or None if scraper has not run before.

Yields:

ScraperResult – Scraper result from a single post/comment from the specified Channel.
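
One way a subclass might honour the since argument is sketched below. It is an illustration only: `FakeResult` and its `date` and `platform_id` fields are assumed stand-ins for ScraperResult, and the feed is assumed to arrive newest first, which may not hold on every platform.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Generator, Iterable, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for ScraperResult; `date` is an assumed field."""
    platform_id: str
    date: datetime


def get_posts(
    newest_first: Iterable[FakeResult],
    since: Optional[FakeResult] = None,
) -> Generator[FakeResult, None, None]:
    """Yield posts newest first, stopping at the previously scraped one."""
    for post in newest_first:
        if since is not None and post.date <= since.date:
            break  # everything from here on was seen in an earlier scrape
        yield post


feed = [
    FakeResult("3", datetime(2022, 3, 17)),
    FakeResult("2", datetime(2022, 3, 16)),
    FakeResult("1", datetime(2022, 3, 15)),
]
all_ids = [p.platform_id for p in get_posts(feed)]
new_ids = [p.platform_id for p in get_posts(feed, since=feed[1])]
```

With `since=None` every post in the feed is yielded; with a previous result supplied, only posts strictly newer than it are yielded.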

get_username_from_url(url: str) → str

Extract a channel’s username from its URL.

Parameters: url (str) – URL of the channel on a given platform, e.g. "https://twitter.com/EliotHiggins"
Returns: username – Extracted username of the channel, e.g. "EliotHiggins"
Return type: str
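
For platforms where the username is the first path segment of the channel URL, the extraction could look like the sketch below. This is an assumption about URL layout; platforms that structure URLs differently would need their own override.

```python
from urllib.parse import urlparse


def get_username_from_url(url: str) -> str:
    """Return the first non-empty path segment of a channel URL."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no username found in URL: {url!r}")
    return segments[0]


username = get_username_from_url("https://twitter.com/EliotHiggins")
```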

m3u8_url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]

Download a media file from a specified media URL that points to an m3u8 playlist, and decode the playlist to an mp4 file.

Parameters:

url (str) – URL of m3u8 playlist file from original post. e.g. "https://media.gettr.com/group47/origin/2022/03/15/01/cbc436c1-1a1a-4b97-671d-c42109f3ec9b/out.m3u8"
key (str or None) – Pre-defined unique identifier for the media file.

Returns:

blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media. e.g. "video/mp4".
key (str) – Unique identifier for the media file.

url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]

Download media file from a specified media file URL.

Parameters:

url (str) – URL of media file from original post. e.g. "https://pbs.twimg.com/media/FN0j0dYWUAcQxfK?format=png&name=medium"
key (str or None) – Pre-defined unique identifier for the media file.

Returns:

blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media. e.g. "image/jpeg".
key (str) – Unique identifier for the media file.

url_to_key(url: str, content_type: str) → str

Generate a unique identifier for media from a specified post.

Parameters:

url (str) – URL of original post. e.g. "https://twitter.com/bellingcat/status/1503397267675533313"
content_type (str) – Content-Type of media. e.g. "image/jpeg"

Returns:

key – Unique identifier for the media file from a specified post based on the original post URL and the media’s Content-Type.

Return type:

str
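
A key with these properties could be derived as in the sketch below. The hashing scheme and the extension rule are assumptions chosen for illustration; the source does not say how the library builds its keys.

```python
import hashlib


def url_to_key(url: str, content_type: str) -> str:
    """Build a deterministic key from a post URL and a Content-Type."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    extension = content_type.split("/")[-1]  # "image/jpeg" -> "jpeg"
    return f"{digest}.{extension}"


key = url_to_key(
    "https://twitter.com/bellingcat/status/1503397267675533313",
    "image/jpeg",
)
```

Because the key depends only on its two inputs, repeating the call for the same post and Content-Type produces the same key, which lets an archiver detect media it has already stored.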

ytdlp_url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]

Download media file from a specified media URL, using yt-dlp, a fork of youtube-dl that enables faster downloading.

Parameters:

url (str) – URL of media file from original post. e.g. "https://rumble.com/embed/vgt7gh/"
key (str or None) – Pre-defined unique identifier for the media file.

Returns:

blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media. e.g. "video/mp4".
key (str) – Unique identifier for the media file.

class cisticola.scraper.base.ScraperController

Bases: object

Registers scrapers, uses them to generate ScraperResults. Synchronizes everything with database via ORM.

archive_unarchived_media(chronological=False)

Archive previously unarchived media URLs from all raw_post rows.

Parameters: chronological (bool) – If True, media attachments are archived starting with the oldest post. If False, media attachments are archived in random order.

archive_unarchived_media_batch(session=None, chronological=False)

Archive previously unarchived media URLs from a batch of raw_post rows.

Parameters:

session (sqlalchemy.orm.Session or None) – SQLAlchemy Session that interfaces with the database
chronological (bool) – If True, media attachments are archived starting with the oldest post. If False, media attachments are archived in random order.
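
The effect of the chronological flag on processing order can be illustrated as below. This is a sketch only: posts are modelled as hypothetical `(post_id, date)` tuples, and `order_for_archiving` and its `rng` parameter are invented for the example; the library selects rows through the database rather than from an in-memory list.

```python
import random
from datetime import date
from typing import List, Optional, Tuple

Post = Tuple[str, date]  # hypothetical (post_id, post_date) pair


def order_for_archiving(
    posts: List[Post],
    chronological: bool = False,
    rng: Optional[random.Random] = None,
) -> List[Post]:
    """Return posts oldest first, or shuffled when chronological is False."""
    if chronological:
        return sorted(posts, key=lambda post: post[1])
    shuffled = list(posts)  # copy so the caller's list is not modified
    (rng or random).shuffle(shuffled)
    return shuffled


posts = [
    ("b", date(2022, 3, 16)),
    ("c", date(2022, 3, 17)),
    ("a", date(2022, 3, 15)),
]
oldest_first = order_for_archiving(posts, chronological=True)
any_order = order_for_archiving(posts, rng=random.Random(0))
```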

connect_to_db(engine)

Connect the specified SQLAlchemy engine to the controller.

Parameters: engine (sqlalchemy.engine.Engine) – Instance of SQLAlchemy engine to connect to

register_scraper(scraper: Scraper)

Add a single Scraper instance to the list of available Scrapers.

Parameters: scraper (cisticola.scraper.Scraper) – Instance of platform-specific scraper to be controlled by the ScraperController

register_scrapers(scrapers: List[Scraper])

Add a list of Scraper instances to the list of available Scrapers.

Parameters: scrapers (list[cisticola.scraper.Scraper]) – List of instances of platform-specific scrapers to be controlled by the ScraperController

remove_all_scrapers(): Reset the ScraperController so that it doesn’t control any scrapers

reset_db(): Drop all data from the connected SQLAlchemy database.

scrape_all_channel_info(): Scrape profile information from all channels in the database.

scrape_all_channels(fetch_old: bool = False)

Scrape posts from all channels in the database that satisfy researcher-specified criteria.

Parameters: fetch_old (bool) – If True, scrape all posts from channels, regardless of when channel was last scraped. If False, scrape only posts that are more recent than the previous scrape of each channel.

scrape_channel_info(channels: List[Channel])

Scrape channel info for specified channels.

Parameters: channels (list[Channel]) – List of Channel instances to be scraped

scrape_channels(channels: List[Channel], fetch_old: bool = False)

Scrape all posts from a specified list of channels.

Parameters:

channels (list[Channel]) – List of Channel instances to be scraped
fetch_old (bool) – If True, scrape all posts from channels, regardless of when channel was last scraped. If False, scrape only posts that are more recent than the previous scrape of each channel.
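
The fetch_old flag can be read as deciding which since value is handed to Scraper.get_posts. The sketch below illustrates that reading; `choose_since` and the dictionary of latest results are hypothetical, and string identifiers stand in for ScraperResult objects.

```python
from typing import Dict, Optional


def choose_since(
    latest_by_channel: Dict[str, str],
    channel: str,
    fetch_old: bool = False,
) -> Optional[str]:
    """Pick the `since` marker to pass on for one channel."""
    if fetch_old:
        return None  # ignore earlier scrapes and fetch everything
    return latest_by_channel.get(channel)  # None if never scraped


latest = {"bellingcat": "post-41"}
incremental = choose_since(latest, "bellingcat")
full = choose_since(latest, "bellingcat", fetch_old=True)
first_run = choose_since(latest, "new-channel")
```

A channel that has never been scraped ends up with `None` either way, so its first scrape always covers the full history.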