cisticola.scraper.base module
- exception cisticola.scraper.base.ChannelDoesNotExistError
Bases: Exception
The specified channel does not exist or has been deleted.
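A minimal sketch of how calling code might tolerate missing channels. The exception class below is a local stand-in with the same name, and `fetch_profile` and `scrape_existing` are hypothetical helpers that are not part of cisticola.

```python
from typing import Dict, List, Tuple


class ChannelDoesNotExistError(Exception):
    """Local stand-in mirroring the documented exception."""


def fetch_profile(username: str, known_channels: Dict[str, dict]) -> dict:
    """Hypothetical lookup that raises when the channel is missing."""
    try:
        return known_channels[username]
    except KeyError:
        raise ChannelDoesNotExistError(f"Channel {username!r} does not exist") from None


def scrape_existing(
    usernames: List[str], known_channels: Dict[str, dict]
) -> Tuple[List[dict], List[str]]:
    """Collect profiles, recording channels that no longer exist instead of failing."""
    profiles, missing = [], []
    for username in usernames:
        try:
            profiles.append(fetch_profile(username, known_channels))
        except ChannelDoesNotExistError:
            missing.append(username)
    return profiles, missing
```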
- class cisticola.scraper.base.Scraper
Bases: object
Base class for defining platform-specific scrapers, each of which scrapes all posts from a given channel on its platform.
- archive_blob(blob: bytes, content_type: str, key: str) → str
Upload raw bytes of a media file to the storage archive.
- Parameters:
blob (bytes) – Raw bytes of the media file to be archived.
content_type (str) – Content-Type of media, e.g. "video/mp4".
key (str) – Unique identifier for the media file.
- Returns:
archived_url – URL specifying the file on the storage archive.
- Return type:
str
- archive_files(result: ScraperResult) → ScraperResult
Archive files corresponding to archived_url dict keys, if the files have not previously been archived.
- Parameters:
result (ScraperResult) – Previously scraped ScraperResult.
- Returns:
Same ScraperResult as result, but with all URLs in the archived_url dict archived.
- Return type:
ScraperResult
- can_handle(channel: Channel) → bool
Whether or not the scraper can scrape the specified channel.
- Parameters:
channel (Channel) – Channel to be scraped.
- Returns:
True if the scraper is capable of scraping channel, False if not.
- Return type:
bool
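As an illustration, a platform-specific scraper might implement can_handle by comparing the channel's platform with its own. This is only a sketch: the `Channel` dataclass below is a stand-in with assumed fields, and the `platform` attribute name is an assumption rather than something taken from the library.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical stand-in for cisticola's Channel; only the fields used here."""
    url: str
    platform: str


class ExampleScraper:
    """Illustrative scraper that only claims channels on one platform."""

    platform = "Twitter"  # assumed attribute name, not taken from the library

    def can_handle(self, channel: Channel) -> bool:
        # Return True only when the channel belongs to this scraper's platform.
        return channel.platform == self.platform


scraper = ExampleScraper()
twitter_channel = Channel(url="https://twitter.com/EliotHiggins", platform="Twitter")
other_channel = Channel(url="https://rumble.com/embed/vgt7gh/", platform="Rumble")
```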
- cookiefilename = 'cookiefile.txt'
- get_posts(channel: Channel, since: ScraperResult | None = None) → Generator[ScraperResult, None, None]
Scrape all posts from the specified Channel.
- Parameters:
channel (Channel) – Channel to be scraped.
since (ScraperResult or None) – Most recently scraped ScraperResult from a previous scrape, or None if the scraper has not run before.
- Yields:
ScraperResult – Scraper result from a single post/comment from the specified Channel.
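One way a subclass could honor the since argument is to walk posts from newest to oldest and stop at the first post that is not newer than the previous result. The sketch below assumes a simplified `ScraperResult` with only `platform_id` and `date` fields, and takes a pre-fetched list in place of real platform requests.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Generator, List, Optional


@dataclass
class ScraperResult:
    """Hypothetical stand-in; the real ScraperResult has more fields."""
    platform_id: str
    date: datetime


def get_posts_sketch(
    posts_newest_first: List[ScraperResult],
    since: Optional[ScraperResult] = None,
) -> Generator[ScraperResult, None, None]:
    """Yield posts newest-first, stopping at the post recorded in `since`."""
    for post in posts_newest_first:
        # Stop once we reach a post that is not newer than the last scraped one.
        if since is not None and post.date <= since.date:
            break
        yield post


posts = [
    ScraperResult("3", datetime(2022, 3, 15)),
    ScraperResult("2", datetime(2022, 3, 14)),
    ScraperResult("1", datetime(2022, 3, 13)),
]
new_ids = [p.platform_id for p in get_posts_sketch(posts, since=posts[1])]
all_ids = [p.platform_id for p in get_posts_sketch(posts)]
```

Because the function is a generator, a caller can store each result as it arrives instead of waiting for the whole channel to finish.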
- get_username_from_url(url: str) → str
Extract a channel’s username from its URL.
- Parameters:
url (str) – URL of the channel on a given platform, e.g. "https://twitter.com/EliotHiggins".
- Returns:
username – Extracted username of the channel, e.g. "EliotHiggins".
- Return type:
str
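For platforms whose channel URLs carry the username as the first path segment, the extraction could look like the sketch below. That URL convention is an assumption; each platform-specific scraper may need its own parsing rule.

```python
from urllib.parse import urlparse


def get_username_from_url(url: str) -> str:
    """Return the first path segment of a channel URL (an assumed convention)."""
    path = urlparse(url).path
    # Drop empty segments produced by leading or trailing slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"No username found in URL: {url}")
    return segments[0]


username = get_username_from_url("https://twitter.com/EliotHiggins")
```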
- m3u8_url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]
Download a media file from a URL that points to an m3u8 playlist, then decode the playlist to an mp4 file.
- Parameters:
url (str) – URL of the m3u8 playlist file from the original post, e.g. "https://media.gettr.com/group47/origin/2022/03/15/01/cbc436c1-1a1a-4b97-671d-c42109f3ec9b/out.m3u8".
key (str or None) – Pre-defined unique identifier for the media file.
- Returns:
blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media, e.g. "video/mp4".
key (str) – Unique identifier for the media file.
- url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]
Download media file from a specified media file URL.
- Parameters:
url (str) – URL of the media file from the original post, e.g. "https://pbs.twimg.com/media/FN0j0dYWUAcQxfK?format=png&name=medium".
key (str or None) – Pre-defined unique identifier for the media file.
- Returns:
blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media, e.g. "image/jpeg".
key (str) – Unique identifier for the media file.
- url_to_key(url: str, content_type: str) → str
Generate a unique identifier for media from a specified post.
- Parameters:
url (str) – URL of the original post, e.g. "https://twitter.com/bellingcat/status/1503397267675533313".
content_type (str) – Content-Type of media, e.g. "image/jpeg".
- Returns:
key – Unique identifier for the media file from a specified post based on the original post URL and the media’s Content-Type.
- Return type:
str
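The documentation states only that the key is based on the post URL and the media's Content-Type, not how the two are combined. The sketch below shows one possible scheme, hashing the URL and appending a file extension guessed from the Content-Type. The function name `url_to_key_sketch` and the hashing scheme are assumptions, not the library's implementation.

```python
import hashlib
import mimetypes


def url_to_key_sketch(url: str, content_type: str) -> str:
    """Build a deterministic key from the post URL and the media Content-Type."""
    # Hash the URL so the key is storage-safe and has a fixed-length prefix.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Map the Content-Type to a file extension; fall back to no extension.
    extension = mimetypes.guess_extension(content_type) or ""
    return f"{digest}{extension}"


key = url_to_key_sketch(
    "https://twitter.com/bellingcat/status/1503397267675533313", "image/jpeg"
)
```

A deterministic key means the same post URL and Content-Type always yield the same identifier, which makes it possible to detect files that were archived before.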
- ytdlp_url_to_blob(url: str, key: str | None = None) → Tuple[bytes, str, str]
Download a media file from a specified media URL using yt-dlp, a fork of youtube-dl that enables faster downloading.
- Parameters:
url (str) – URL of the media file from the original post, e.g. "https://rumble.com/embed/vgt7gh/".
key (str or None) – Pre-defined unique identifier for the media file.
- Returns:
blob (bytes) – Raw bytes of the downloaded media file.
content_type (str) – Content-Type of media, e.g. "video/mp4".
key (str) – Unique identifier for the media file.
- class cisticola.scraper.base.ScraperController
Bases: object
Registers scrapers and uses them to generate ScraperResults. Synchronizes everything with the database via the ORM.
- archive_unarchived_media(chronological=False)
Archive previously unarchived media URLs from all raw_post rows.
- Parameters:
chronological (bool) – If True, media attachments are archived starting with the oldest post. If False, media attachments are archived in random order.
- archive_unarchived_media_batch(session=None, chronological=False)
Archive previously unarchived media URLs from a batch of raw_post rows.
- Parameters:
session (sqlalchemy.orm.Session or None) – SQLAlchemy Session that interfaces with the database
chronological (bool) – If True, media attachments are archived starting with the oldest post. If False, media attachments are archived in random order.
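The effect of the chronological flag can be sketched on an in-memory list. The real methods work on raw_post rows through a SQLAlchemy session, so the `(post_id, date)` tuples and the `order_for_archiving` helper below are illustrative assumptions only.

```python
import random
from datetime import datetime
from typing import List, Tuple

Post = Tuple[str, datetime]  # (post_id, date), a simplified stand-in for a raw_post row


def order_for_archiving(posts: List[Post], chronological: bool = False) -> List[Post]:
    """Return posts oldest-first when chronological, otherwise in random order."""
    if chronological:
        return sorted(posts, key=lambda post: post[1])
    shuffled = list(posts)  # copy so the caller's list is left untouched
    random.shuffle(shuffled)
    return shuffled


posts = [
    ("b", datetime(2022, 3, 15)),
    ("a", datetime(2022, 3, 13)),
    ("c", datetime(2022, 3, 14)),
]
oldest_first = [post_id for post_id, _ in order_for_archiving(posts, chronological=True)]
```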
- connect_to_db(engine)
Connect the specified SQLAlchemy engine to the controller.
- Parameters:
engine (sqlalchemy.engine.Engine) – Instance of SQLAlchemy engine to connect to
- register_scraper(scraper: Scraper)
Add a single Scraper instance to the list of available Scrapers.
- Parameters:
scraper (cisticola.scraper.Scraper) – Instance of platform-specific scraper to be controlled by the ScraperController
- register_scrapers(scrapers: List[Scraper])
Add a list of Scraper instances to the list of available Scrapers.
- Parameters:
scrapers (list[cisticola.scraper.Scraper]) – List of instances of platform-specific scrapers to be controlled by the ScraperController
- remove_all_scrapers()
Reset the ScraperController so that it no longer controls any scrapers.
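The registration methods above can be pictured as operations on a plain list of scrapers, with can_handle used to pick a scraper for a channel. The sketch below is an assumption about that pattern rather than the controller's actual code: `ControllerSketch`, `PrefixScraper`, and `find_scraper` are hypothetical names, and channels are reduced to URL strings.

```python
from typing import List, Optional


class Scraper:
    """Hypothetical minimal scraper interface for this sketch."""

    def can_handle(self, channel: str) -> bool:
        raise NotImplementedError


class PrefixScraper(Scraper):
    """Illustrative scraper that handles channel URLs with a given prefix."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def can_handle(self, channel: str) -> bool:
        return channel.startswith(self.prefix)


class ControllerSketch:
    """Keeps a list of scrapers and finds one able to handle a channel."""

    def __init__(self) -> None:
        self.scrapers: List[Scraper] = []

    def register_scraper(self, scraper: Scraper) -> None:
        self.scrapers.append(scraper)

    def register_scrapers(self, scrapers: List[Scraper]) -> None:
        self.scrapers.extend(scrapers)

    def remove_all_scrapers(self) -> None:
        self.scrapers = []

    def find_scraper(self, channel: str) -> Optional[Scraper]:
        # Hypothetical helper: first registered scraper that accepts the channel.
        for scraper in self.scrapers:
            if scraper.can_handle(channel):
                return scraper
        return None
```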
- reset_db()
Drop all data from the connected SQLAlchemy database.
- scrape_all_channel_info()
Scrape profile information from all channels in the database.
- scrape_all_channels(fetch_old: bool = False)
Scrape posts from all channels in the database that satisfy researcher-specified criteria.
- Parameters:
fetch_old (bool) – If True, scrape all posts from channels, regardless of when each channel was last scraped. If False, scrape only posts that are more recent than the previous scrape of each channel.
- scrape_channel_info(channels: List[Channel])
Scrape channel info for specified channels.
- Parameters:
channels (list[Channel]) – List of Channel instances to be scraped
- scrape_channels(channels: List[Channel], fetch_old: bool = False)
Scrape all posts from a specified list of channels.
- Parameters:
channels (list[Channel]) – List of Channel instances to be scraped
fetch_old (bool) – If True, scrape all posts from channels, regardless of when each channel was last scraped. If False, scrape only posts that are more recent than the previous scrape of each channel.
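The fetch_old flag can be read as a choice of what to pass to a scraper as the starting point. In the sketch below, `select_since` is a hypothetical helper and the scrape history is a plain dict keyed by channel identifier. The real controller reads previous results from the database.

```python
from typing import Dict, Optional


def select_since(
    last_results: Dict[str, str], channel_id: str, fetch_old: bool = False
) -> Optional[str]:
    """Return the marker to pass as `since`, or None for a full scrape."""
    if fetch_old:
        # Ignore scrape history so every post is fetched again.
        return None
    # None is also returned when the channel has never been scraped.
    return last_results.get(channel_id)


history = {"bellingcat": "post-42"}
```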