cisticola.transformer.base module
- class cisticola.transformer.base.ETLController
Bases: object
An ETLController transforms raw scraped data (ScraperResult objects) into a more detailed format for analysis, using the Transformer objects that have been registered with the controller.
- connect_to_db(engine: Engine)
Connect the ETLController to a SQLAlchemy engine.
- Parameters:
engine (sqlalchemy.engine.Engine) – Instance of SQLAlchemy Engine object to connect to
- flush_posts(session)
Save all outstanding posts to the database. For efficiency, instead of saving posts one at a time, the ETLController maintains a list of posts (posts_to_insert) and saves them in bulk.
- Parameters:
session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database
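The buffering described for flush_posts can be sketched as follows. This is a minimal illustration of the buffer-then-bulk-save pattern, not cisticola's implementation: `FakeSession`, `BufferedWriter`, and `save_all` are hypothetical stand-ins, and the real method receives a SQLAlchemy Session.

```python
class FakeSession:
    """Hypothetical stand-in for a database session; records each bulk write."""

    def __init__(self):
        self.bulk_writes = []

    def save_all(self, objects):
        # One call here represents one round trip to the database.
        self.bulk_writes.append(list(objects))


class BufferedWriter:
    """Hypothetical sketch of the buffer-then-flush pattern."""

    def __init__(self):
        self.posts_to_insert = []

    def insert_post(self, post):
        # Queue the post instead of writing it immediately.
        self.posts_to_insert.append(post)

    def flush_posts(self, session):
        # Write every queued post in a single bulk call, then clear the queue.
        if self.posts_to_insert:
            session.save_all(self.posts_to_insert)
            self.posts_to_insert = []


session = FakeSession()
writer = BufferedWriter()
for post in ["post-1", "post-2", "post-3"]:
    writer.insert_post(post)
writer.flush_posts(session)
```

Three queued posts result in one bulk write rather than three separate ones, which is the efficiency gain the description refers to.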
- insert_or_select(obj, session, hydrate: bool = True)
Insert an object into the database or return an existing object from the database.
- Parameters:
obj – Instance of ORM-mapped class in the cisticola.base module to be inserted into the database
session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database
hydrate (bool) – If True, additional data fields are extracted from the object and populated in the given database table
- Return type:
Object that has been inserted into the database, or existing object in the database, or None.
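The insert-or-select behavior can be illustrated with the standard library's sqlite3 module. This is a sketch of the general pattern only: the `channel` table, its columns, and this standalone `insert_or_select` function are assumptions made for the example, and the documented method works on ORM-mapped objects through a SQLAlchemy Session instead.

```python
import sqlite3


def insert_or_select(conn, name):
    """Return the id of the row with this name, inserting it first if absent."""
    # INSERT OR IGNORE does nothing when the UNIQUE constraint on name is hit.
    conn.execute("INSERT OR IGNORE INTO channel (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM channel WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE channel (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

# The second call finds the row created by the first instead of duplicating it.
first = insert_or_select(conn, "example_channel")
second = insert_or_select(conn, "example_channel")
row_count = conn.execute("SELECT COUNT(*) FROM channel").fetchone()[0]
```

Unlike the documented method, this sketch never returns None; the conditions under which the real method does so are not stated here.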
- insert_post(obj, session, hydrate: bool = True, flush: bool = False)
Insert an object into the connected database.
- Parameters:
obj – Instance of ORM-mapped class in the cisticola.base module to be inserted into the database
session (sqlalchemy.orm.Session) – SQLAlchemy Session that interfaces with the database
hydrate (bool) – If True, additional data fields are extracted from the object and populated in the given database table
flush (bool) – If True, the object is returned with additional populated data fields (such as a primary key ID). If False, the object is added to posts_to_insert and nothing is returned
- Return type:
None, or instance of ORM-mapped class from cisticola.base that has been inserted into the database, with additional data fields if the flush argument is True.
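The effect of the flush argument can be sketched with sqlite3. The `Post` dataclass, the `post` table, and this module-level `insert_post` function are hypothetical and only mirror the documented contract: an immediate write returns the object with its primary key populated, while a deferred write queues the object and returns nothing.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    """Hypothetical stand-in for an ORM-mapped post."""
    content: str
    id: Optional[int] = None


posts_to_insert = []


def insert_post(conn, post, flush=False):
    if flush:
        # Write now so the database-assigned primary key can be read back.
        cursor = conn.execute(
            "INSERT INTO post (content) VALUES (?)", (post.content,)
        )
        post.id = cursor.lastrowid
        return post
    # Defer the write; the caller gets nothing back.
    posts_to_insert.append(post)
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT)")

saved = insert_post(conn, Post("written immediately"), flush=True)
deferred = insert_post(conn, Post("written later"), flush=False)
```

A caller that needs the primary key right away, for example to reference the post from another row, would pass flush=True; otherwise the deferred path keeps writes batched.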
- posts_to_insert = []
- register_transformer(transformer: Transformer)
Add a single Transformer instance to the list of available Transformers.
- Parameters:
transformer (Transformer) – Instance of platform-specific Transformer to be controlled by the ETLController
- register_transformers(transformers)
Add a list of Transformer instances to the list of available Transformers.
- Parameters:
transformers (List[Transformer]) – List of instances of platform-specific Transformers to be controlled by the ETLController
- transform_all_untransformed(hydrate: bool = True, min_date=datetime.datetime(2010, 1, 1, 0, 0))
Transform all ScraperResult objects in the database that do not have an equivalent Post object stored.
- Parameters:
hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.
min_date (datetime.datetime) – Posts made before this date are not transformed.
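The selection described for transform_all_untransformed combines two conditions: the raw record has no transformed counterpart, and the post is not older than min_date. A minimal in-memory sketch, assuming hypothetical `(raw_id, posted_at)` tuples in place of database rows:

```python
from datetime import datetime

MIN_DATE = datetime(2010, 1, 1)

# Hypothetical raw records: (raw_id, date the post was made).
raw_results = [
    (1, datetime(2009, 6, 1)),
    (2, datetime(2015, 3, 10)),
    (3, datetime(2021, 8, 5)),
]
# Hypothetical set of raw ids that already have a transformed Post.
already_transformed = {3}


def select_untransformed(raw, done_ids, min_date=MIN_DATE):
    """Keep records without a transformed counterpart and not older than min_date."""
    return [
        raw_id
        for raw_id, posted_at in raw
        if raw_id not in done_ids and posted_at >= min_date
    ]


pending = select_untransformed(raw_results, already_transformed)
```

Record 1 is skipped because it predates min_date and record 3 because it has already been transformed, leaving only record 2. How the real method expresses this against the database is not specified here.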
- transform_all_untransformed_info()
Transform all RawChannelInfo objects in the database that do not have an equivalent ChannelInfo object stored.
- transform_all_untransformed_media(hydrate=True)
Transform all ScraperResult objects in the database that do not have an equivalent Post object stored.
- Parameters:
hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.
- transform_info(results: List[ChannelInfo])
Transform raw RawChannelInfo objects into ChannelInfo objects.
- Parameters:
results (List[ChannelInfo]) – A list of ChannelInfo objects to be transformed
- transform_media(results: List, hydrate: bool = True)
Transform raw ScraperResult objects into Post objects and Media objects, then add them to the database.
- Parameters:
results (List[ScraperResult]) – A list of ScraperResult objects to be transformed
hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.
- transform_results(results: List[ScraperResult], hydrate: bool = True)
Transform raw ScraperResult objects into Post objects and Media objects, then add them to the database.
- Parameters:
results (List[ScraperResult]) – A list of ScraperResult objects to be transformed
hydrate (bool) – Whether or not to fully hydrate transformed media. Default True.
- class cisticola.transformer.base.Transformer
Bases: object
Interface class for transformers.
- can_handle(data: ScraperResult) bool
Specifies whether or not a Transformer is capable of handling a particular piece of scraped data.
- Parameters:
data (ScraperResult) – The ScraperResult object to check for ability to handle.
- Returns:
True if it can be handled by this Transformer, False otherwise.
- Return type:
bool
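One way a controller could use can_handle is to ask each registered transformer in turn and pick the first that accepts the data. The sketch below assumes exactly that; `AlphaTransformer`, `BetaTransformer`, `find_transformer`, and the `platform` key are hypothetical, and the actual selection rule used by the ETLController is not documented here.

```python
class Transformer:
    """Minimal stand-in for the documented interface."""

    def can_handle(self, data) -> bool:
        raise NotImplementedError


class AlphaTransformer(Transformer):
    """Hypothetical platform-specific transformer."""

    def can_handle(self, data) -> bool:
        # Assumes scraped data exposes a `platform` key (illustrative only).
        return data.get("platform") == "alpha"


class BetaTransformer(Transformer):
    """Second hypothetical platform-specific transformer."""

    def can_handle(self, data) -> bool:
        return data.get("platform") == "beta"


def find_transformer(transformers, data):
    """Return the first registered transformer that accepts the data, else None."""
    for transformer in transformers:
        if transformer.can_handle(data):
            return transformer
    return None


registered = [AlphaTransformer(), BetaTransformer()]
chosen = find_transformer(registered, {"platform": "beta"})
unhandled = find_transformer(registered, {"platform": "gamma"})
```

Keeping the check on the transformer itself means a new platform can be supported by registering another transformer, without changing the dispatch loop.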
- transform(data: ScraperResult, insert: Callable, session: Session, flush_posts: Callable)
Transform a ScraperResult into objects with additional parameters for analysis. This function can yield multiple objects, as it will find references to quoted/replied posts, media objects, and Channel objects and provide all of these to be inserted into the database.
- Parameters:
data (ScraperResult) – The ScraperResult object to process.
insert (Callable) – A function that either inserts the object into a database or finds an object with the relevant unique constraints if applicable.
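The callback-based shape of transform can be sketched as below. Everything specific is an assumption made for illustration: `EchoTransformer`, the dictionary-shaped data, the `channel` and `text` keys, and the recording `insert` and `flush_posts` callbacks. The sketch only shows how one scraped item can lead to several inserted objects through the supplied insert callable.

```python
class EchoTransformer:
    """Hypothetical transformer; the field names on `data` are assumptions."""

    def transform(self, data, insert, session, flush_posts):
        # Insert the referenced channel first so the post can point at it.
        channel = insert({"type": "channel", "name": data["channel"]}, session)
        # Then insert the post itself, linked to the channel just returned.
        insert(
            {"type": "post", "channel": channel["name"], "text": data["text"]},
            session,
        )
        # Ask the caller to write out anything it has buffered.
        flush_posts(session)


inserted = []
flush_calls = []


def insert(obj, session):
    # Recording stand-in for the controller's insert callable.
    inserted.append(obj)
    return obj


def flush_posts(session):
    # Recording stand-in for the controller's flush callable.
    flush_calls.append(session)


EchoTransformer().transform(
    {"channel": "example_channel", "text": "hello"},
    insert,
    session=None,
    flush_posts=flush_posts,
)
```

Passing insert and flush_posts in as callables keeps the transformer independent of how the controller stores or batches objects.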
- transform_media(data: ScraperResult, transformed: Post, insert: Callable)
Transform a post’s media attachment to standard form and insert into database.
- Parameters:
data (cisticola.base.ScraperResult) – Raw post data of post that media file was attached to
transformed (cisticola.base.Post) – Transformed post data of post that media file was attached to
insert (Callable) – A function that either inserts the object into a database or finds an object with the relevant unique constraints if applicable.