cisticola.base module
- class cisticola.base.Audio(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)
Bases: Media
Class for organizing information about an audio file.
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- date_transformed: datetime
Datetime (UTC) that the scraped post was transformed at.
- exif: str | None
JSON dump of the dict containing metadata information for the media file.
- id
Unique numerical ID of the media file.
- ocr
Text contents of the media file, extracted using optical character recognition.
- original_url: str
Original URL of the media from the original post.
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- post: int
ID number of the media’s corresponding scraped post in the
analysis table.
- raw_id: int
ID number of the media’s corresponding scraped post in the
raw_posts table.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- type
- url: str
URL of the original post.
- class cisticola.base.Channel(name, platform_id, category, platform, url, screenname, country=None, influencer=None, public=None, chat=None, notes='', source=None)
Bases: object
Information about a specific channel to be scraped.
- category: str
User-specified category for the channel, e.g.
"explicit_qanon".
- chat: bool | None
Whether or not the channel is a chat (i.e., allows users other than the channel creator to post or send messages).
- country: str | None
Two-letter country code for the country of origin for the channel, e.g.
"RU".
- hydrate()
- id
Unique numerical ID of the channel.
- influencer: str | None
Name of influencer, if channel belongs to an influencer that operates on multiple platforms.
- name: str | None
Name of channel (different from username because it can be non-unique and contain emojis), e.g.
"🕊Редакция Президент Гордон🕊".
- notes: str
Any other additional notes about the channel.
- platform: str
Name of platform the given channel is on, e.g.
"Telegram".
- platform_id: str | None
String that uniquely identifies the channel on the given platform, e.g.
"-1001101170442".
- public: bool | None
Whether or not the channel is publicly-accessible.
- screenname: str
Screen name/username of channel.
- source: str | None
Origin of the channel, i.e. whether it came from a researcher or from a scraping process.
- url: str | None
URL for the given channel on the platform, e.g.
"https://t.me/prezidentgordonteam"
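The constructor signature above can be illustrated with a small stand-in. This is a minimal sketch, not the library's class: `ChannelSketch` is a hypothetical dataclass that only mirrors the documented field names and defaults, and all values other than the platform name and platform ID format are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelSketch:
    """Hypothetical stand-in that mirrors the documented Channel fields."""

    name: Optional[str]
    platform_id: Optional[str]
    category: str
    platform: str
    url: Optional[str]
    screenname: str
    # Optional fields, with the defaults shown in the documented signature.
    country: Optional[str] = None
    influencer: Optional[str] = None
    public: Optional[bool] = None
    chat: Optional[bool] = None
    notes: str = ""
    source: Optional[str] = None


# Placeholder values; only the required fields are supplied.
channel = ChannelSketch(
    name="Example Channel",
    platform_id="-1001101170442",
    category="example_category",
    platform="Telegram",
    url="https://t.me/example_channel",
    screenname="example_channel",
)
```

Fields that are not passed fall back to their defaults, so `channel.country` is `None` and `channel.notes` is an empty string here.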
- class cisticola.base.ChannelInfo(raw_channel_info_id, channel, platform_id, platform, scraper, transformer, screenname, name, description, description_url, description_location, followers, following, verified, date_created, date_archived, date_transformed)
Bases: object
A processed set of information about a channel.
- channel: int
Primary key of the
channels table corresponding to the channel whose information was scraped and processed.
- date_archived: datetime
Datetime (relative to UTC) that the scraped channel info was archived at.
- date_created: datetime | None
Datetime at which the channel was created.
- date_transformed: datetime
Datetime (UTC) that the scraped channel info was transformed at.
- description: str
Channel’s description/bio included in their profile.
- description_location: str
Channel’s profile location specified in the channel description.
- description_url: str
Channel’s profile website linked in the channel description.
- followers: int
Number of followers/subscribers.
- following: int
Number of accounts the channel follows/is subscribed to.
- hydrate()
- id
Unique numerical ID of the processed channel information.
- name: str
Name of channel (different from username because it can be non-unique and contain emojis), e.g.
"🕊Редакция Президент Гордон🕊".
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- platform_id: str
- raw_channel_info_id: int
Primary key of the
raw_channel_info table this object was transformed from.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- screenname: str
Screen name/username of channel.
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- verified: bool
Whether or not the channel is “verified” on the given platform.
- class cisticola.base.Image(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None, ocr=None)
Bases: Media
Class for organizing information about an image file.
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- date_transformed: datetime
Datetime (UTC) that the scraped post was transformed at.
- exif: str | None
JSON dump of the dict containing metadata information for the media file.
- hydrate(blob=None)
Download image file as bytes blob and extract Exif and OCR content from the image.
- hydrate_ocr(blob)
Extract OCR (optical character recognition) data from image bytes blob.
- id
Unique numerical ID of the media file.
- ocr: str
Extracted OCR content from the image.
- original_url: str
Original URL of the media from the original post.
- post: int
ID number of the media’s corresponding scraped post in the
analysis table.
- raw_id: int
ID number of the media’s corresponding scraped post in the
raw_posts table.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- type
- url: str
URL of the original post.
- class cisticola.base.Media(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)
Bases: object
Base class for organizing information about a media file.
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- date_transformed: datetime
Datetime (UTC) that the scraped post was transformed at.
- exif: str | None
JSON dump of the dict containing metadata information for the media file.
- get_blob()
Download media file as bytes blob.
- hydrate(blob=None)
Download media file as bytes blob and extract data from content.
- hydrate_exif(blob)
Extract Exif metadata from bytes blob.
- id
Unique numerical ID of the media file.
- ocr
Text contents of the media file, extracted using optical character recognition.
- original_url: str
Original URL of the media from the original post.
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- post: int
ID number of the media’s corresponding scraped post in the
analysis table.
- raw_id: int
ID number of the media’s corresponding scraped post in the
raw_posts table.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- type
- url: str
URL of the original post.
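The relationship between `Media.hydrate()` and `Image.hydrate()` described in this module can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `MediaSketch` and `ImageSketch` are hypothetical stand-ins, `get_blob()` returns placeholder bytes instead of downloading anything, and the Exif and OCR steps are stubs that only show where the extracted values would be stored.

```python
import json
from typing import Optional


class MediaSketch:
    """Hypothetical stand-in for the documented Media base class."""

    def __init__(self, url: str):
        self.url = url
        self.exif: Optional[str] = None

    def get_blob(self) -> bytes:
        # Assumption: the real method downloads the file; this stub does not.
        return b"placeholder bytes"

    def hydrate_exif(self, blob: bytes) -> None:
        # Stub: store a JSON dump of a metadata dict, as the exif field expects.
        self.exif = json.dumps({"size_bytes": len(blob)})

    def hydrate(self, blob: Optional[bytes] = None) -> None:
        # Fetch the blob only when the caller did not supply one.
        if blob is None:
            blob = self.get_blob()
        self.hydrate_exif(blob)


class ImageSketch(MediaSketch):
    """Hypothetical stand-in for Image, which adds OCR on top of Exif."""

    def __init__(self, url: str):
        super().__init__(url)
        self.ocr: Optional[str] = None

    def hydrate_ocr(self, blob: bytes) -> None:
        # Stub: a real implementation would run an OCR engine on the bytes.
        self.ocr = ""

    def hydrate(self, blob: Optional[bytes] = None) -> None:
        if blob is None:
            blob = self.get_blob()
        self.hydrate_exif(blob)
        self.hydrate_ocr(blob)


image = ImageSketch("https://example.com/a.jpg")
image.hydrate(b"abc")
```

Passing a blob explicitly avoids a second download when the caller already holds the bytes, which is presumably why `hydrate` accepts an optional `blob` argument.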
- class cisticola.base.Post(raw_id, platform_id, scraper, transformer, platform, channel, date, date_archived, date_transformed, url, author_id, author_username, content, named_entities=<factory>, cryptocurrency_addresses=<factory>, hashtags=<factory>, outlinks=<factory>, detected_language='', normalized_content='', forwarded_from=None, reply_to=None, mentions=<factory>, likes=None, forwards=None, views=None, video_title=None, video_duration=None)
Bases: object
An object with fields for columns in the analysis table.
- author_id: str
String that uniquely identifies the channel on the given platform, e.g.
"-1001101170442".
- author_username: str
Username of author who made post.
- channel: int
User-specified integer that uniquely identifies a channel, e.g.
15.
- content: str
Text of the original post
- cryptocurrency_addresses: list
Any cryptocurrency addresses in post
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- date_transformed: datetime
Datetime (UTC) that the scraped post was transformed at.
- detected_language: str
Detected language of post
- forwarded_from: int | None
The ID of the Channel that the post was forwarded or quoted from
- forwards: int | None
Number of times the post was forwarded/retweeted/shared
- hashtags: list
Hashtags in post
- hydrate()
Populate additional fields from processed data, including language detection, named entity recognition, and extraction of outlinks, hashtags, and cryptocurrency addresses.
- hydrate_spacy()
Extract named entities and normalize text content.
- id
Unique numerical ID of processed post.
- likes: int | None
Number of positive post reactions (e.g. likes, favorites, rumbles, upvotes, etc.)
- mentions: list
Other users mentioned in the post
- named_entities: list
Named entities detected in post
- normalized_content: str
Normalized post content
- outlinks: list
Links to any other websites
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- platform_id: str
Platform specific post ID
- raw_id: int
ID number of the scraped post in the
raw_posts table.
- reply_to: int | None
The ID of the Post that this Post is a reply to
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- url: str
URL of the original post
- video_duration: int | None
Video duration in seconds, if post is a video
- video_title: str | None
Video title, if post is a video
- views: int | None
Number of times the post was viewed
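To make the `hashtags` and `outlinks` fields concrete, the following is a rough sketch of the kind of extraction that `hydrate()` is described as performing. The regular expressions and the helper names `extract_hashtags` and `extract_outlinks` are assumptions made for illustration; the library's actual extraction logic may differ.

```python
import re

# Simplified patterns (assumptions): real-world extraction needs more care,
# for example to avoid treating URL fragments as hashtags.
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_hashtags(content: str) -> list:
    """Return hashtag names without the leading '#'."""
    return HASHTAG_RE.findall(content)


def extract_outlinks(content: str) -> list:
    """Return URLs found in the text, minus trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(content)]


content = "Read more at https://example.com/story. #news #Breaking"
hashtags = extract_hashtags(content)
outlinks = extract_outlinks(content)
```

For this sample text, `hashtags` is `["news", "Breaking"]` and `outlinks` is `["https://example.com/story"]`.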
- class cisticola.base.RawChannelInfo(scraper, platform, channel, raw_data, date_archived)
Bases: object
Minimally processed set of information from a scraper about one channel.
- channel: int
Foreign key of channel ID that this was scraped from
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- id
Unique numerical ID of the raw scraped channel information.
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- raw_data: str
JSON dump of dict that contains all data scraped for the post.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- class cisticola.base.ScraperResult(scraper, platform, channel, platform_id, date, raw_data, date_archived, archived_urls, media_archived)
Bases: object
Minimally processed set of information from a scraper about one post.
- archived_urls: dict
Dict in which the keys are the original media URLs from the post, and the corresponding values are the URLs of the archived media files.
- channel: int
Foreign key of channel ID that this was scraped from
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- id
Unique numerical ID of the raw scraped post.
- media_archived: datetime | None
Datetime at which the media was archived (None if not archived).
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- platform_id: str
String that uniquely identifies the scraped post on the given platform, e.g.
"1503397267675533313"
- raw_data: str
JSON dump of dict that contains all data scraped for the post.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
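The `raw_data`, `archived_urls`, and `date_archived` fields can be illustrated with the standard library alone. This is a minimal sketch with placeholder data: the `scraped` dict, the example URLs, and the helper `archived_url_for` are hypothetical and not part of the module.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Placeholder payload standing in for whatever a scraper returns.
scraped = {
    "id": "1503397267675533313",
    "text": "example post",
    "media": ["https://example.com/a.jpg"],
}

# raw_data is documented as a JSON dump of the scraped dict.
raw_data = json.dumps(scraped)

# archived_urls maps each original media URL to its archived copy.
archived_urls = {"https://example.com/a.jpg": "https://archive.example.org/a.jpg"}

# A timezone-aware UTC timestamp, matching the UTC convention of the date fields.
date_archived = datetime.now(timezone.utc)


def archived_url_for(original_url: str, mapping: dict) -> Optional[str]:
    """Return the archived URL for a media URL, or None if it was not archived."""
    return mapping.get(original_url)


restored = json.loads(raw_data)
```

Because `raw_data` is a plain JSON string, `json.loads` recovers the original dict for any later transformation step.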
- class cisticola.base.Video(raw_id, post, url, original_url, scraper, transformer, platform, date, date_archived, date_transformed, exif=None)
Bases: Media
Class for organizing information about a video file.
- date: datetime
Datetime (relative to UTC) that the scraped post was created at.
- date_archived: datetime
Datetime (relative to UTC) that the scraped post was archived at.
- date_transformed: datetime
Datetime (UTC) that the scraped post was transformed at.
- exif: str | None
JSON dump of the dict containing metadata information for the media file.
- id
Unique numerical ID of the media file.
- ocr
Text contents of the media file, extracted using optical character recognition.
- original_url: str
Original URL of the media from the original post.
- platform: str
Name of platform from which result was scraped, e.g.
"Twitter".
- post: int
ID number of the media’s corresponding scraped post in the
analysis table.
- raw_id: int
ID number of the media’s corresponding scraped post in the
raw_posts table.
- scraper: str
String specifying name and version of scraper used to generate result, e.g.
"TwitterScraper 0.0.1".
- transformer: str
String specifying name and version of transformer used to transform result, e.g.
"TwitterTransformer 0.0.1".
- type
- url: str
URL of the original post.