cisticola.transformer.base module

class cisticola.transformer.base.ETLController

Bases: object

An ETLController transforms raw scraped data (ScraperResult objects) into a more detailed format for analysis, using the Transformer objects that have been registered with the controller.

connect_to_db(engine: Engine)

Connect the ETLController to a SQLAlchemy engine.

Parameters:

engine (sqlalchemy.engine.Engine) – Instance of SQLAlchemy Engine object to connect to

flush_posts(session)

Save all outstanding posts to the database. For efficiency, instead of saving posts one at a time, the ETLController maintains a list of posts (posts_to_insert) and saves them in bulk.

Parameters:

session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database
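The buffering behaviour described above can be sketched as follows. This is a minimal illustration only, not cisticola's implementation: it uses the standard-library sqlite3 module instead of a SQLAlchemy Session, and the BufferedInserter class and posts table are hypothetical stand-ins.

```python
import sqlite3


class BufferedInserter:
    """Hypothetical stand-in for the controller's buffering behaviour."""

    def __init__(self):
        # Plays the role of posts_to_insert: rows waiting to be saved.
        self.posts_to_insert = []

    def insert_post(self, content):
        # Queue the row instead of writing it immediately.
        self.posts_to_insert.append((content,))

    def flush_posts(self, conn):
        # Write every queued row in one batch, then clear the buffer.
        if not self.posts_to_insert:
            return 0
        conn.executemany(
            "INSERT INTO posts (content) VALUES (?)", self.posts_to_insert
        )
        conn.commit()
        flushed = len(self.posts_to_insert)
        self.posts_to_insert = []
        return flushed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT)")

inserter = BufferedInserter()
for text in ["first", "second", "third"]:
    inserter.insert_post(text)

flushed_count = inserter.flush_posts(conn)
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Batching trades immediacy for fewer round trips: queued rows have no primary key until the flush happens.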

insert_or_select(obj, session, hydrate: bool = True)

Insert an object into the database or return an existing object from the database.

Parameters:
  • obj – Instance of ORM-mapped class in the cisticola.base module to be inserted into the database

  • session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database

  • hydrate (bool) – If True, additional data fields are extracted from the object and populated in the given database table

Return type:

Object that has been inserted into the database, or existing object in the database, or None.
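The general insert-or-select pattern can be sketched as below. This is an illustration under stated assumptions rather than the library's code: it uses the standard-library sqlite3 module, a hypothetical channels table with a unique name column, and a hypothetical return shape of (row_id, created). The case where None is returned is not modeled, since the documentation above does not say when it occurs.

```python
import sqlite3


def insert_or_select(conn, name):
    """Return (row_id, created) for a channel with the given unique name."""
    row = conn.execute(
        "SELECT id FROM channels WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        # An equivalent row already exists, so reuse it.
        return row[0], False
    cursor = conn.execute("INSERT INTO channels (name) VALUES (?)", (name,))
    conn.commit()
    return cursor.lastrowid, True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE channels (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

first_id, first_created = insert_or_select(conn, "example_channel")
second_id, second_created = insert_or_select(conn, "example_channel")
channel_count = conn.execute("SELECT COUNT(*) FROM channels").fetchone()[0]
```

The second call finds the row created by the first, so both calls return the same ID and the table keeps a single row.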

insert_post(obj, session, hydrate: bool = True, flush: bool = False)

Insert an object into the connected database.

Parameters:
  • obj – Instance of ORM-mapped class in the cisticola.base module to be inserted into the database

  • session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database

  • hydrate (bool) – If True, additional data fields are extracted from the object and populated in the given database table

  • flush (bool) – If True, the object is returned with additional populated data fields (such as a primary key ID). If False, the object is added to posts_to_insert and nothing is returned

Return type:

None, or instance of ORM-mapped class from cisticola.base that has been inserted into the database, with additional data fields if flush argument is True.

posts_to_insert = []

register_transformer(transformer: Transformer)

Add a single Transformer instance to the list of available Transformers.

Parameters:

transformer (Transformer) – Instance of platform-specific Transformer to be controlled by the ETLController

register_transformers(transformers)

Add a list of Transformer instances to the list of available Transformers.

Parameters:

transformers (List[Transformer]) – List of instances of platform-specific Transformers to be controlled by the ETLController

transform_all_untransformed(hydrate: bool = True, min_date=datetime.datetime(2010, 1, 1, 0, 0))

Transform all ScraperResult objects in the database that do not have an equivalent Post object stored.

Parameters:
  • hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.

  • min_date (datetime.datetime) – Posts made before this date are not transformed.
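Selecting "raw results with no equivalent post, made on or after min_date" is commonly expressed as an anti-join. The sketch below shows that idea with the standard-library sqlite3 module. The raw_results and posts tables, their columns, and the untransformed_ids helper are hypothetical; the actual schema and query used by cisticola are not described here.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_results (id INTEGER PRIMARY KEY, date TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, raw_id INTEGER)")
conn.executemany(
    "INSERT INTO raw_results (id, date) VALUES (?, ?)",
    [
        (1, "2009-06-01T00:00:00"),  # older than min_date
        (2, "2015-03-10T00:00:00"),  # already has a post
        (3, "2020-08-20T00:00:00"),  # recent and untransformed
    ],
)
conn.execute("INSERT INTO posts (raw_id) VALUES (2)")


def untransformed_ids(conn, min_date):
    """Return IDs of raw results with no matching post, dated on or after min_date."""
    query = """
        SELECT r.id FROM raw_results AS r
        LEFT JOIN posts AS p ON p.raw_id = r.id
        WHERE p.id IS NULL AND r.date >= ?
        ORDER BY r.id
    """
    # ISO 8601 strings sort chronologically, so a text comparison is enough here.
    return [row[0] for row in conn.execute(query, (min_date.isoformat(),))]


pending = untransformed_ids(conn, datetime(2010, 1, 1))
```

With the sample rows, only the third raw result is both recent enough and still lacking a post.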

transform_all_untransformed_info()

Transform all RawChannelInfo objects in the database that do not have an equivalent ChannelInfo object stored.

transform_all_untransformed_media(hydrate=True)

Transform all ScraperResult objects in the database that do not have an equivalent Post object stored.

Parameters:

hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.

transform_info(results: List[ChannelInfo])

Transform RawChannelInfo objects into ChannelInfo objects.

Parameters:

results (List[ChannelInfo]) – A list of ChannelInfo objects to be transformed

transform_media(results: List, hydrate: bool = True)

Transform raw ScraperResult objects into Post objects and Media objects, then add them to the database.

Parameters:
  • results (List[ScraperResult]) – A list of ScraperResult objects to be transformed

  • hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.

transform_results(results: List[ScraperResult], hydrate: bool = True)

Transform raw ScraperResult objects into Post objects and Media objects, then add them to the database.

Parameters:
  • results (List[ScraperResult]) – A list of ScraperResult objects to be transformed

  • hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.

class cisticola.transformer.base.Transformer

Bases: object

Interface class for transformers.

can_handle(data: ScraperResult) → bool

Specifies whether or not a Transformer is capable of handling a particular piece of scraped data.

Parameters:

data (ScraperResult) – The ScraperResult object to check.

Returns:

True if the data can be handled by this Transformer, False otherwise.

Return type:

bool
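One plausible way a controller might use can_handle is to pick the first registered transformer that accepts a given result. The sketch below illustrates this with self-contained stand-ins. FakeResult, its platform attribute, the two example subclasses, and pick_transformer are all hypothetical and are not taken from cisticola.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for a ScraperResult; only 'platform' is modeled."""

    platform: str


class Transformer:
    """Minimal interface mirroring the documented can_handle method."""

    def can_handle(self, data):
        raise NotImplementedError


class PlatformATransformer(Transformer):
    def can_handle(self, data):
        return data.platform == "platform_a"


class PlatformBTransformer(Transformer):
    def can_handle(self, data):
        return data.platform == "platform_b"


def pick_transformer(transformers, data):
    """Return the first registered transformer that accepts the data, else None."""
    for transformer in transformers:
        if transformer.can_handle(data):
            return transformer
    return None


registered = [PlatformATransformer(), PlatformBTransformer()]
chosen = pick_transformer(registered, FakeResult(platform="platform_b"))
unhandled = pick_transformer(registered, FakeResult(platform="unknown"))
```

Keeping the capability check on each transformer means a new platform can be supported by registering one more instance, without changing the dispatch loop.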

transform(data: ScraperResult, insert: Callable, session: Session, flush_posts: Callable)

Transform a ScraperResult into objects with additional parameters for analysis. This function can yield multiple objects, as it will find references to quoted/replied posts, media objects, and Channel objects and provide all of these to be inserted into the database.

Parameters:
  • data (ScraperResult) – The ScraperResult object to process.

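  • insert (Callable) – A function that either inserts the object into a database or finds an object with the relevant unique constraints if applicable.

  • session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database

  • flush_posts (Callable) – A function that saves outstanding posts to the database (presumably ETLController.flush_posts)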
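The idea of one raw result producing several inserted objects (a channel, a referenced post, and the post itself) can be sketched as follows. This is an illustration only: raw data is modeled as plain dictionaries, the insert callback is a hypothetical in-memory version that deduplicates on a (kind, key) pair, and the session and flush_posts arguments from the documented signature are omitted.

```python
def make_insert(store):
    """Build a hypothetical insert callback that deduplicates on (kind, key)."""

    def insert(kind, key, payload):
        # Return the existing record when one matches, otherwise store a new one.
        return store.setdefault((kind, key), {"kind": kind, "key": key, **payload})

    return insert


def transform(data, insert):
    """Insert the channel, any replied-to post, and then the post itself."""
    channel = insert("channel", data["channel"], {})
    reply = data.get("reply_to")
    if reply is not None:
        # Referenced posts are handled first so the reply can point at them.
        transform(reply, insert)
    return insert(
        "post",
        data["id"],
        {"channel": channel["key"], "reply_to": reply["id"] if reply else None},
    )


store = {}
raw = {
    "id": "p2",
    "channel": "chan_a",
    "reply_to": {"id": "p1", "channel": "chan_a"},
}
post = transform(raw, make_insert(store))
```

Because the callback returns an existing record when one matches, the shared channel is stored once even though both posts refer to it.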

transform_media(data: ScraperResult, transformed: Post, insert: Callable)

Transform a post’s media attachment into a standard form and insert it into the database.

Parameters:
  • data (cisticola.base.ScraperResult) – Raw post data of post that media file was attached to

  • transformed (cisticola.base.Post) – Transformed post data of post that media file was attached to

  • insert (Callable) – A function that either inserts the object into a database or finds an object with the relevant unique constraints if applicable.