cisticola.base module

class cisticola.base.Audio(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: Media

Class for organizing information about an audio file.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime

Datetime (UTC) that the scraped post was transformed at.

exif: str | None

JSON dump of the dict containing metadata information for the media file.

id

Unique numerical ID of the media file.

ocr

Text contents of the media file, extracted using optical character recognition.

original_url: str

Original URL of the media from the original post.

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

post: int

ID number of the media’s corresponding scraped post in the analysis table.

raw_id: int

ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

type

url: str

URL of the original post.

class cisticola.base.Channel(name, platform_id, category, platform, url, screenname, country=None, influencer=None, public=None, chat=None, notes='', source=None)

Bases: object

Information about a specific channel to be scraped.

category: str

User-specified category for the channel, e.g. "explicit_qanon".

chat: bool | None

Whether or not the channel is a chat (i.e. allows users who are not the channel creator to post/message).

country: str | None

Two-letter country code for the channel's country of origin, e.g. "RU".

hydrate()

id

Unique numerical ID of the channel.

influencer: str | None

Name of influencer, if channel belongs to an influencer that operates on multiple platforms.

name: str | None

Name of channel (different from username because it can be non-unique and contain emojis), e.g. "🕊Редакция Президент Гордон🕊".

notes: str

Any other additional notes about the channel.

platform: str

Name of platform the given channel is on, e.g. "Telegram".

platform_id: str | None

String that uniquely identifies the channel on the given platform, e.g. "-1001101170442".

public: bool | None

Whether or not the channel is publicly accessible.

screenname: str

Screen name/username of channel.

source: str | None

Origin of the channel: whether it was added by a researcher or found by a scraping process.

url: str | None

URL for the given channel on the platform, e.g. "https://t.me/prezidentgordonteam".
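To show how the fields above fit together, the following sketch builds a channel record with a stand-in dataclass. ChannelSketch is a hypothetical name that only mirrors the documented constructor signature; it is not the cisticola class itself. The category, screenname, public, and source values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelSketch:
    """Hypothetical stand-in that mirrors the documented Channel signature."""
    name: Optional[str]
    platform_id: Optional[str]
    category: str
    platform: str
    url: Optional[str]
    screenname: str
    country: Optional[str] = None
    influencer: Optional[str] = None
    public: Optional[bool] = None
    chat: Optional[bool] = None
    notes: str = ""
    source: Optional[str] = None


channel = ChannelSketch(
    name="🕊Редакция Президент Гордон🕊",
    platform_id="-1001101170442",
    category="example_category",       # assumed value, chosen by the user
    platform="Telegram",
    url="https://t.me/prezidentgordonteam",
    screenname="prezidentgordonteam",  # assumed to match the URL path
    country="RU",
    public=True,                       # assumed value
    source="researcher",               # assumed value; accepted values are not documented here
)
```

Fields left out of the call (influencer, chat, notes) take the defaults shown in the signature.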

class cisticola.base.ChannelInfo(raw_channel_info_id, channel, platform_id, platform, scraper, transformer, screenname, name, description, description_url, description_location, followers, following, verified, date_created, date_archived, date_transformed)

Bases: object

A processed set of information about a channel.

channel: int

Primary key in the channels table of the channel whose information was scraped and processed.

date_archived: datetime

Datetime (relative to UTC) that the scraped channel info was archived at.

date_created: datetime | None

Datetime at which the channel was created.

date_transformed: datetime

Datetime (UTC) that the scraped channel info was transformed at.

description: str

Channel’s description/bio included in their profile.

description_location: str

Channel’s profile location specified in the channel description.

description_url: str

Channel’s profile website linked in the channel description.

followers: int

Number of followers/subscribers.

following: int

Number of accounts the channel follows/is subscribed to.

hydrate()

id

Unique numerical ID of the processed channel information.

name: str

Name of channel (different from username because it can be non-unique and contain emojis), e.g. "🕊Редакция Президент Гордон🕊".

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str

raw_channel_info_id: int

Primary key of the raw_channel_info table this object was transformed from.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

screenname: str

Screen name/username of channel.

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

verified: bool

Whether or not the channel is “verified” on the given platform.
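The date fields above are described as UTC datetimes. The sketch below shows one way to produce such values with the standard library. Whether cisticola stores timezone-aware or naive datetimes is not stated here, so the use of aware datetimes is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Timezone-aware "now" in UTC, suitable for a field like date_archived.
date_archived = datetime.now(timezone.utc)

# A timestamp reported with a +03:00 offset, converted to UTC before storage.
local = datetime(2022, 3, 14, 18, 30, tzinfo=timezone(timedelta(hours=3)))
as_utc = local.astimezone(timezone.utc)
```

Converting to UTC before storing keeps values from different platforms comparable.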

class cisticola.base.Image(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None, ocr=None)

Bases: Media

Class for organizing information about an image file.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime

Datetime (UTC) that the scraped post was transformed at.

exif: str | None

JSON dump of the dict containing metadata information for the media file.

hydrate(blob=None)

Download image file as bytes blob and extract Exif and OCR content from the image.

hydrate_ocr(blob)

Extract OCR (optical character recognition) data from image bytes blob.

id

Unique numerical ID of the media file.

ocr: str

OCR text extracted from the image.

original_url: str

Original URL of the media from the original post.

post: int

ID number of the media’s corresponding scraped post in the analysis table.

raw_id: int

ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

type

url: str

URL of the original post.

class cisticola.base.Media(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: object

Base class for organizing information about a media file.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime

Datetime (UTC) that the scraped post was transformed at.

exif: str | None

JSON dump of the dict containing metadata information for the media file.

get_blob()

Download media file as bytes blob.

hydrate(blob=None)

Download media file as bytes blob and extract data from content.

hydrate_exif(blob)

Extract Exif metadata from bytes blob.

id

Unique numerical ID of the media file.

ocr

Text contents of the media file, extracted using optical character recognition.

original_url: str

Original URL of the media from the original post.

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

post: int

ID number of the media’s corresponding scraped post in the analysis table.

raw_id: int

ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

type

url: str

URL of the original post.
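The exif field above is described as a JSON dump of a metadata dict. A minimal sketch of that round trip, using only the standard library, might look as follows. The metadata keys and the parse_exif helper are hypothetical; real keys depend on the file and on the extraction tool.

```python
import json

# Hypothetical metadata dict for illustration only.
metadata = {"ImageWidth": 1280, "ImageHeight": 720, "Make": "ExampleCam"}

# The string form that a field like exif would hold.
exif = json.dumps(metadata)

# Back to a dict for analysis.
restored = json.loads(exif)


def parse_exif(exif):
    """Return the metadata dict, or an empty dict when exif is None."""
    return json.loads(exif) if exif is not None else {}
```

Because exif defaults to None in the signature, a reader of this field should handle the missing case, as parse_exif does.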

class cisticola.base.Post(raw_id, platform_id, scraper, transformer, platform, channel, date, date_archived, date_transformed, url, author_id, author_username, content, named_entities=<factory>, cryptocurrency_addresses=<factory>, hashtags=<factory>, outlinks=<factory>, detected_language='', normalized_content='', forwarded_from=None, reply_to=None, mentions=<factory>, likes=None, forwards=None, views=None, video_title=None, video_duration=None)

Bases: object

An object with fields for the columns in the analysis table.

author_id: str

String that uniquely identifies the channel on the given platform, e.g. "-1001101170442".

author_username: str

Username of author who made post.

channel: int

User-specified integer that uniquely identifies a channel, e.g. 15.

content: str

Text of the original post.

cryptocurrency_addresses: list

Any cryptocurrency addresses found in the post.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime

Datetime (UTC) that the scraped post was transformed at.

detected_language: str

Detected language of the post.

forwarded_from: int | None

ID of the Channel that the post was forwarded or quoted from.

forwards: int | None

Number of times the post was forwarded, retweeted, or shared.

hashtags: list

Hashtags found in the post.

hydrate()

Populate additional fields from processed data, including language detection, named entity recognition, and extraction of outlinks, hashtags, and cryptocurrency addresses.

hydrate_spacy()

Extract named entities and normalize text content.

id

Unique numerical ID of processed post.

likes: int | None

Number of positive reactions to the post (e.g. likes, favorites, rumbles, upvotes).

mentions: list

Other users mentioned in the post.

named_entities: list

Named entities detected in the post.

normalized_content: str

Normalized post content.

outlinks: list

Links to any other websites.

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str

Platform-specific post ID.

raw_id: int

ID number of the scraped post in the raw_posts table.

reply_to: int | None

ID of the Post that this Post is a reply to.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

url: str

URL of the original post.

video_duration: int | None

Video duration in seconds, if the post is a video.

video_title: str | None

Video title, if the post is a video.

views: int | None

Number of times the post was viewed.
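The hydrate() description above mentions extracting outlinks and hashtags from the post content. The sketch below illustrates that kind of extraction with regular expressions. It is not cisticola's implementation: the patterns and the extract_fields helper are assumptions, and language detection, named entity recognition, and cryptocurrency address extraction are left out.

```python
import re

# Simplified, assumed patterns; real-world extraction needs more care.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_fields(content):
    """Return hashtags, mentions, and outlinks found in the post text."""
    # Trim trailing punctuation that usually belongs to the sentence.
    outlinks = [u.rstrip(".,;:!?)") for u in URL_RE.findall(content)]
    # Remove URLs first so that "#fragment" parts are not read as hashtags.
    text = URL_RE.sub(" ", content)
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "outlinks": outlinks,
    }


fields = extract_fields(
    "Read this @example_user: https://example.com/a#top. #news #Новости"
)
```

Since \w matches Unicode word characters for Python str patterns, hashtags in non-Latin scripts are picked up as well.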

class cisticola.base.RawChannelInfo(scraper, platform, channel, raw_data, date_archived)

Bases: object

Minimally processed set of information from a scraper about one channel.

channel: int

Foreign key referencing the ID of the channel that this information was scraped from.

date_archived: datetime

Datetime (relative to UTC) that the scraped channel info was archived at.

id

Unique numerical ID of the raw scraped channel information.

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

raw_data: str

JSON dump of dict that contains all data scraped for the channel.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".
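The scraper and transformer fields combine a name and a version in one string, e.g. "TwitterScraper 0.0.1". If the two parts are needed separately, they can be split on the last space, as in the sketch below. The split_label helper is hypothetical, and the assumption that the version is always the final space-separated token comes only from the examples shown here.

```python
def split_label(label):
    """Split "Name 1.2.3" into (name, version); version is None if absent."""
    name, sep, version = label.rpartition(" ")
    if not sep:
        # No space found: treat the whole string as the name.
        return label, None
    return name, version


name, version = split_label("TwitterScraper 0.0.1")

# Numeric tuple form, convenient for comparing versions.
version_tuple = tuple(int(part) for part in version.split("."))
```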

class cisticola.base.ScraperResult(scraper, platform, channel, platform_id, date, raw_data, date_archived, archived_urls, media_archived)

Bases: object

Minimally processed set of information from a scraper about one post.

archived_urls: dict

Dict in which the keys are the original media URLs from the post, and the corresponding values are the URLs of the archived media files.

channel: int

Foreign key referencing the ID of the channel that this post was scraped from.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

id

Unique numerical ID of the raw scraped post.

media_archived: datetime | None

Datetime at which the media was archived (None if not archived).

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

platform_id: str

String that uniquely identifies the scraped post on the given platform, e.g. "1503397267675533313".

raw_data: str

JSON dump of dict that contains all data scraped for the post.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".
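As an illustration of the raw_data and archived_urls fields described above, the sketch below serializes a scraped post to JSON and looks up where a media file was archived. All URLs, the contents of raw_post, and the archived_location helper are hypothetical.

```python
import json

# Hypothetical scraped payload; the real structure depends on the platform.
raw_post = {"id": "1503397267675533313", "text": "example"}
raw_data = json.dumps(raw_post)

# Keys are original media URLs, values are URLs of the archived copies.
archived_urls = {
    "https://example.com/media/photo.jpg": "https://archive.example.org/abc123.jpg",
}


def archived_location(original_url, archived_urls):
    """Return the archived URL for a media file, or None if it is not listed."""
    return archived_urls.get(original_url)


location = archived_location("https://example.com/media/photo.jpg", archived_urls)
```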

class cisticola.base.Video(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)

Bases: Media

Class for organizing information about a video file.

date: datetime

Datetime (relative to UTC) that the scraped post was created at.

date_archived: datetime

Datetime (relative to UTC) that the scraped post was archived at.

date_transformed: datetime

Datetime (UTC) that the scraped post was transformed at.

exif: str | None

JSON dump of the dict containing metadata information for the media file.

id

Unique numerical ID of the media file.

ocr

Text contents of the media file, extracted using optical character recognition.

original_url: str

Original URL of the media from the original post.

platform: str

Name of platform from which result was scraped, e.g. "Twitter".

post: int

ID number of the media’s corresponding scraped post in the analysis table.

raw_id: int

ID number of the media’s corresponding scraped post in the raw_posts table.

scraper: str

String specifying name and version of scraper used to generate result, e.g. "TwitterScraper 0.0.1".

transformer: str

String specifying name and version of transformer used to transform result, e.g. "TwitterTransformer 0.0.1".

type

url: str

URL of the original post.