cisticola.base module

class cisticola.base.Audio(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: Media

Class for organizing information about an audio file.

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime: Datetime (UTC) that the scraped post was transformed at.

exif: str | None: JSON dump of the dict containing metadata information for the media file.

id: Unique numerical ID of the media file.

ocr: Text contents of the media file, extracted using optical character recognition.

original_url: str: Original URL of the media from the the original post.

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

post: int: ID number of the media’s corresponging scraped post in the analysis table.

raw_id: int: ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

type

url: str: URL of the original post.

class cisticola.base.Channel(name, platform_id, category, platform, url, screenname, country=None, influencer=None, public=None, chat=None, notes='', source=None)

Bases: object

Information about a specific channel to be scraped.

category: str: User-specified category for the channel, e.g. "explicit_qanon".

chat: bool | None: Whether or not the channel is a chat (i.e. allows users who are not the channel creator to post/message)

country: str | None: 2 digit country code for the country of origin for the channel, e.g. "RU".

hydrate()

id: Unique numerical ID of the channel.

influencer: str | None: Name of influencer, if channel belongs to an influencer that operates on multiple platforms.

name: str | None: Name of channel (different from username because it can be non-unique and contain emojis), e.g. T🕊Редакция Президент Гордон🕊".

notes: str: Any other additional notes about the channel.

platform: str: Name of platform the given channel is on, e.g. "Telegram".

platform_id: str | None: String that uniquely identifies the channel on the given platform, e.g. "-1001101170442".

public: bool | None: Whether or not the channel is publicly-accessible.

screenname: str: Screen name/username of channel.

source: str | None: Did the channel come from a researcher or a scraping process?

url: str | None: URL for the given channel on the platform, e.g. "https://t.me/prezidentgordonteam"

class cisticola.base.ChannelInfo(raw_channel_info_id, channel, platform_id, platform, scraper, transformer, screenname, name, description, description_url, description_location, followers, following, verified, date_created, date_archived, date_transformed)

Bases: object

A processed set of information about a channel.

channel: int: Primary key of the channels table corresponding to the channel whose information was scraped and processed

date_archived: datetime: Datetime (relative to UTC) that the scraped channel info was archived at.

date_created: datetime | None: Datetime at which the channel was created.

date_transformed: datetime: Datetime (UTC) that the scraped channel info was transformed at.

description: str: Channel’s description/bio included in their profile.

description_location: str: Channel’s profile location specified in the channel description.

description_url: str: Channel’s profile website linked in the channel description.

followers: int: Number of followers/subscribers.

following: int: Number of accounts the channel follows/is subscribed to.

hydrate()

id: Unique numerical ID of the processed channel information.

name: str: Name of channel (different from username because it can be non-unique and contain emojis), e.g. T🕊Редакция Президент Гордон🕊".

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str: String that uniquely identifies the channel on the given platform, e.g. "-1001101170442".

raw_channel_info_id: int: Primary key of the raw_channel_info table this object was transformed from.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

screenname: str: Screen name/username of channel.

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

verified: bool: Whether or not the channel is “verified” on the given platform.

class cisticola.base.Image(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None, ocr=None)

Bases: Media

Class for organizing information about an image file.

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime: Datetime (UTC) that the scraped post was transformed at.

exif: str | None: JSON dump of the dict containing metadata information for the media file.

hydrate(blob=None): Download image file as bytes blob and extract Exif and OCR content from the image.

hydrate_ocr(blob): Extract OCR (optical character recognition) data from image bytes blob.

id: Unique numerical ID of the media file.

ocr: str: Extracted OCR content from image

original_url: str: Original URL of the media from the the original post.

post: int: ID number of the media’s corresponging scraped post in the analysis table.

raw_id: int: ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

type

url: str: URL of the original post.

class cisticola.base.Media(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: object

Base class for organizing information about a media file.

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime: Datetime (UTC) that the scraped post was transformed at.

exif: str | None: JSON dump of the dict containing metadata information for the media file.

get_blob(): Download media file as bytes blob.

hydrate(blob=None): Download media file as bytes blob and extract data from content.

hydrate_exif(blob): Extract Exif metadata from bytes blob.

id: Unique numerical ID of the media file.

ocr: Text contents of the media file, extracted using optical character recognition.

original_url: str: Original URL of the media from the the original post.

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

post: int: ID number of the media’s corresponging scraped post in the analysis table.

raw_id: int: ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

type

url: str: URL of the original post.

class cisticola.base.Post(raw_id, platform_id, scraper, transformer, platform, channel, date, date_archived, date_transformed, url, author_id, author_username, content, named_entities=<factory>, cryptocurrency_addresses=<factory>, hashtags=<factory>, outlinks=<factory>, detected_language='', normalized_content='', forwarded_from=None, reply_to=None, mentions=<factory>, likes=None, forwards=None, views=None, video_title=None, video_duration=None)

Bases: object

An object with fields for columns in the analysis table.

author_id: str: String that uniquely identifies the author on the given platform, e.g. "-1001101170442".

author_username: str: Username of author who made post.

channel: int: User-specified integer that uniquely identifies a channel, e.g. 15.

content: str: Text of the original post

cryptocurrency_addresses: list: Any cryptocurrency addresses in post

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime: Datetime (UTC) that the scraped post was transformed at.

detected_language: str: Detected language of post

forwarded_from: int | None: The ID of the Channel that the post was forwarded or quoted from

forwards: int | None: Number of times the post was forwarded/retweeted/shared

hashtags: list: Hashtags in post

hydrate(): Populate additional fields from processed data, including language detection, named entity recognition, and extraction of outlinks, hashtags, and cryptocurrency addresses.

hydrate_spacy(): Extract named entities and normalize text content.

id: Unique numerical ID of processed post.

likes: int | None: Number of positive post reactions (e.g. likes, favorites, rumbles, upvotes, etc.)

mentions: list: Other users mentioned in the post

named_entities: list: Named entities detected in post

normalized_content: str: Normalized post content

outlinks: list: Links to any other websites

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str: Platform specific post ID

raw_id: int: ID number of the scraped post in the raw_posts table

reply_to: int | None: The ID of the Post that this Post is a reply to

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

url: str: URL of the original post

video_duration: int | None: Video duration in seconds, if post is a video

video_title: str | None: Video title, if post is a video

views: int | None: Number of times the post was viewed

class cisticola.base.RawChannelInfo(scraper, platform, channel, raw_data, date_archived)

Bases: object

Minimally processed set of information from a scraper about one channel

channel: int: Foreign key of channel ID that this was scraped from

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

id: Unique numerical ID of the raw scraped channel information.

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

raw_data: str: JSON dump of dict that contains all data scraped for the post.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

class cisticola.base.ScraperResult(scraper, platform, channel, platform_id, date, raw_data, date_archived, archived_urls, media_archived)

Bases: object

Minimally processed set of information from a scraper about one post

archived_urls: dict: Dict in which the keys are the original media URLs from the post, and the corresponding values are the URLs of the archived media files.

channel: int: Foreign key of channel ID that this was scraped from

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

id: Unique numerical ID of the raw scraped post.

media_archived: datetime | None: What date was the media archived? (None if not archived)

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str: String that uniquely identifies the scraped post on the given platform, e.g. "1503397267675533313"

raw_data: str: JSON dump of dict that contains all data scraped for the post.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

class cisticola.base.Video(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: Media

Class for organizing information about a video file.

date: datetime: Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime: Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime: Datetime (UTC) that the scraped post was transformed at.

exif: str | None: JSON dump of the dict containing metadata information for the media file.

id: Unique numerical ID of the media file.

ocr: Text contents of the media file, extracted using optical character recognition.

original_url: str: Original URL of the media from the the original post.

platform: str: Name of platform from which result was scraped, e.g. "Twitter".

post: int: ID number of the media’s corresponging scraped post in the analysis table.

raw_id: int: ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str: String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str: String specifying name and version of transformer used to tranform result, e.g. "TwitterTransformer 0.0.1".

type

url: str: URL of the original post.