# Usage

Labrea exposes a declarative way to define the data that your application uses, the relationships between different datasets, and the user inputs necessary for those datasets.

## Motivation

Imagine that you have a csv file containing a list of stores, their regions, and their total sales across multiple days, where each day is a different row. You read the csv in as a Pandas dataframe, and then in different parts of your program you need the set of distinct store codes, the distinct regions, and the date range covered by the file. You might write functions like these to read in the data and derive those values.

```python
from typing import Set, Tuple
import datetime

import pandas as pd


def read_input(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def get_distinct_stores(data: pd.DataFrame) -> Set[str]:
    return set(data['store_id'])

def get_distinct_regions(data: pd.DataFrame) -> Set[str]:
    return set(data['region_id'])

def get_min_date(data: pd.DataFrame) -> datetime.date:
    return min(data['date'])

def get_max_date(data: pd.DataFrame) -> datetime.date:
    return max(data['date'])

def get_date_range(data: pd.DataFrame) -> Tuple[datetime.date, datetime.date]:
    return get_min_date(data), get_max_date(data)
```

This works, but there is a subtle problem with the way it is written. Each of our `get_*` functions is designed to use the output of `read_input`, but that dependency is never declared explicitly anywhere. The only way to know that `get_distinct_stores` should take the output of `read_input` is through comments. As an application grows, these implicit dependencies hurt maintainability and lead to bugs whose causes are hard to track down.

One option would be to move the `read_input` call inside each of our `get_*` functions, like so:

```python
from typing import Set

import pandas as pd


def read_input(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def get_distinct_stores(path: str) -> Set[str]:
    return set(read_input(path)['store_id'])

def get_distinct_regions(path: str) -> Set[str]:
    return set(read_input(path)['region_id'])

...
```

This makes the dependencies clearer, but at the cost of much tighter coupling. Imagine we change `read_input` to take a `fmt` parameter specifying whether the file is a csv or an excel file. We would have to add that `fmt` parameter to the signature of *every* one of our `get_*` functions, which greatly hurts maintainability. Performance also takes a hit, because the input file is re-read every time `read_input` is called, even though each `get_*` call uses the same data.

## Datasets

Labrea handles this with `Dataset`s. A `Dataset` is defined as a function that takes some input parameters (called `Option`s) and declares explicit dependencies on other datasets. When the dataset is evaluated (by passing a dictionary of config values), all of its dependencies are resolved automatically for you.
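Before the full example, here is a minimal sketch of how an `Option` on its own pulls a value out of a config dictionary (the `INPUT.PATH` key is purely illustrative). Dotted keys index into nested dictionaries, and this same lookup is used by all of the examples that follow.

```python
from labrea import Option

path = Option('INPUT.PATH')

# Dotted keys index into nested config dictionaries
path({'INPUT': {'PATH': '/path/to/input.csv'}})  ## '/path/to/input.csv'
```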
This is how we would handle our problem from before using Labrea.

```python
from typing import Set, Tuple
import datetime

import pandas as pd
from labrea import dataset, Option


@dataset
def input_data(path: str = Option('INPUT.PATH')) -> pd.DataFrame:
    return pd.read_csv(path)

@dataset
def distinct_stores(data: pd.DataFrame = input_data) -> Set[str]:
    return set(data['store_id'])

@dataset
def distinct_regions(data: pd.DataFrame = input_data) -> Set[str]:
    return set(data['region_id'])

@dataset
def min_date(data: pd.DataFrame = input_data) -> datetime.date:
    return min(data['date'])

@dataset
def max_date(data: pd.DataFrame = input_data) -> datetime.date:
    return max(data['date'])

@dataset
def date_range(
    min_date: datetime.date = min_date,
    max_date: datetime.date = max_date
) -> Tuple[datetime.date, datetime.date]:
    return min_date, max_date


options = {
    'INPUT': {
        'PATH': '/path/to/input.csv'
    }
}

distinct_regions(options) == {'014', '620', '706'}
```

All of our functions have been converted into `Dataset`s using the `@dataset` decorator. The inputs to our functions all have defaults which are either `Option`s or other `Dataset`s. These `Dataset`s take a dictionary of options as their input, extract the necessary values using the `Option`s, and recursively calculate any dependent datasets before calling the body of the function. This lets us decouple the implementations of our functions while still explicitly declaring the dependencies from one piece of data to the next.

It also resolves our issue of upstream dependencies taking new arguments. If we wanted `input_data` to take that `fmt` argument, we could add it like so, with no impact on the rest of our datasets.

```python
from typing import Set

import pandas as pd
from labrea import dataset, Option


@dataset
def input_data(
    path: str = Option('INPUT.PATH'),
    fmt: str = Option('INPUT.FMT', 'csv')
) -> pd.DataFrame:
    if fmt == 'csv':
        return pd.read_csv(path)
    elif fmt == 'excel':
        return pd.read_excel(path)
    else:
        raise ValueError('Only csv and excel files are accepted.')

@dataset
def distinct_stores(data: pd.DataFrame = input_data) -> Set[str]:
    return set(data['store_id'])

@dataset
def distinct_regions(data: pd.DataFrame = input_data) -> Set[str]:
    return set(data['region_id'])


options = {
    'INPUT': {
        'PATH': '/path/to/input.xlsx',
        'FMT': 'excel'
    }
}

distinct_regions(options) == {'014', '620', '706'}
```

## Additional Features

### Caching Results

By default, when a dataset is evaluated with some inputs, the result is cached in memory so that it does not need to be recalculated if the same inputs are provided again. This might be undesirable (for example, if your dataset should return new random data on each call); you can disable it using the `@dataset.nocache` decorator.

```python
import random

from labrea import dataset, Option


@dataset
def same_random_number_every_time(
    minimum: float = Option('MIN'),
    maximum: float = Option('MAX')
):
    return random.random() * (maximum - minimum) + minimum

@dataset.nocache
def new_random_number_every_time(
    minimum: float = Option('MIN'),
    maximum: float = Option('MAX')
):
    return random.random() * (maximum - minimum) + minimum


config = {
    'MIN': 1,
    'MAX': 2
}

same_random_number_every_time(config)  ## 1.8286543357828648
same_random_number_every_time(config)  ## 1.8286543357828648

new_random_number_every_time(config)  ## 1.226523659380299
new_random_number_every_time(config)  ## 1.907915105351007
```
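The cache is tied to the inputs that were used, so (as a sketch of the behaviour described above, with illustrative output values) evaluating the cached dataset with a different config produces a freshly computed number, while the original config keeps returning its cached result.

```python
other_config = {
    'MIN': 10,
    'MAX': 20
}

same_random_number_every_time(other_config)  ## e.g. 13.5294733591 (new inputs, computed fresh)
same_random_number_every_time(config)        ## 1.8286543357828648 (still cached)
```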
### Default Values for Options

Options can take a default value. If the default is a string, you can use [confectioner](https://github.com/8451/confectioner)-style templating syntax to build the default from other config entries.

```python
from labrea import Option

config = {
    'A': 'a',
    'V': 'b'
}

Option('X', 1)(config)          ## 1
Option('Y', '{A}/{V}')(config)  ## 'a/b'
```

### Switches

Sometimes a dataset has different dependencies depending on an input parameter or some other condition. We can express this using `switch`es. In this example, we have different logic for cloud vs. on-prem environments. `switch` takes a string naming the option we want to switch over, a dictionary mapping config values to the corresponding datasets, and (optionally) a default to use if the config value is missing or does not appear in the mapping.

```python
from labrea import dataset, switch


@dataset
def cloud_inputs():
    ...

@dataset
def onprem_inputs():
    ...

@dataset
def final_data(
    inputs = switch(
        'ENVIRONMENT',
        {
            'CLOUD': cloud_inputs,
            'ONPREM': onprem_inputs
        }
    )
):
    ...


final_data({'ENVIRONMENT': 'CLOUD'})   ## uses cloud_inputs as inputs arg
final_data({'ENVIRONMENT': 'ONPREM'})  ## uses onprem_inputs as inputs arg
```

The first argument to `switch` can also be another dataset. In this example, we determine the environment automatically in another dataset rather than passing it explicitly in the config.

```python
from labrea import dataset, switch


@dataset
def inferred_environment():
    ...

@dataset
def final_data(
    inputs = switch(
        inferred_environment,
        {
            'CLOUD': cloud_inputs,
            'ONPREM': onprem_inputs
        }
    )
):
    ...
```

### Coalesce

`Coalesce` lets you provide a sequence of `Dataset`s (or `Option`s, `Switch`es, etc.) and use the first one that can be evaluated.

```python
from labrea import Coalesce, Option

x = Coalesce(Option('A'), Option('V'), Option('C'))

x({'A': 1}) == 1
x({'V': 2}) == 2
x({'C': 3}) == 3
x({'A': 1, 'V': 2}) == 1
x({'V': 2, 'C': 3}) == 2
x()  ## EvaluationError

y = Coalesce(Option('A'), Option('V'), None)

y({'A': 1}) == 1
y({'V': 2}) == 2
y({'A': 1, 'V': 2}) == 1
y() is None
```

### Overloads

You can write multiple implementations of the same dataset using the dataset's `.overload` method. For example, if you want to write a unit test that mocks reading in some external data, you can write an overload that provides mock data.

```python
from typing import List

from labrea import dataset, Option


@dataset(dispatch='INPUT.SOURCE')
def input_data(
    path: str = Option('INPUT.PATH')
) -> List[str]:
    with open(path) as file:
        return file.readlines()

@input_data.overload(alias='MOCK')
def mock_input_data() -> List[str]:
    return ['a', 'b', 'c']
```

Now we can control which implementation is used by setting the `INPUT.SOURCE` option in our config. If nothing is provided, the default implementation in the body of `input_data` is used.

```python
input_data({'INPUT': {'PATH': '/input/data/path'}})  ## Uses the default implementation
input_data({'INPUT': {'SOURCE': 'MOCK'}}) == ['a', 'b', 'c']
input_data({'INPUT': {'SOURCE': 'UNKNOWN_SOURCE'}})  ## Error
```

#### Abstract Datasets

We can also define datasets that have no default implementation, called abstract datasets.

```python
from typing import List

from labrea import abstractdataset, Option


@abstractdataset(dispatch='INPUT.SOURCE')
def input_data() -> List[str]:
    ...
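# With no default implementation in the body above, an implementation must be
# selected via the INPUT.SOURCE dispatch option; each concrete implementation
# is registered with .overload, as shown below.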
@input_data.overload(alias='FLAT_FILE')
def flat_file_input_data(
    path: str = Option('INPUT.PATH')
) -> List[str]:
    with open(path) as file:
        return file.readlines()

@input_data.overload(alias='MOCK')
def mock_input_data() -> List[str]:
    return ['a', 'b', 'c']
```

### Interfaces

We can write collections of multiple datasets whose implementations are connected using interfaces. For example, you may want your application to pull from a SQL database in production but from a CSV file in development. You can define an interface that specifies the datasets that need to be implemented, and then provide a different implementation for each environment.

#### Interface Definition

```python
import pandas as pd

from labrea import abstractdataset, dataset, interface, Option


@interface(dispatch='ENVIRONMENT')
class DataSource:
    @staticmethod  # Adding staticmethod appeases linters/IDEs that don't understand interfaces
    @abstractdataset
    def store() -> pd.DataFrame:
        """Returns a dataframe of store data."""

    @staticmethod
    @abstractdataset
    def region() -> pd.DataFrame:
        """Returns a dataframe of region data."""

    @staticmethod
    @dataset
    def store_ids(
        store_: pd.DataFrame = store.__func__  # Use .__func__ to refer to the abstract dataset itself
    ) -> set[str]:
        """Derives the set of store ids from the store dataframe.

        This implementation is shared across all environments by default,
        but can be overridden if necessary."""
        return set(store_['store_id'])

    @staticmethod
    @dataset
    def region_ids(
        region_: pd.DataFrame = region.__func__
    ) -> set[str]:
        return set(region_['region_id'])
```

#### Development Implementation

```python
@DataSource.implementation(alias='DEVELOPMENT')
class DevDataSource:
    @staticmethod
    @dataset
    def store(
        path: str = Option('DEV.STORE.PATH')
    ) -> pd.DataFrame:
        return pd.read_csv(path)

    @staticmethod
    @dataset
    def region(
        path: str = Option('DEV.REGION.PATH')
    ) -> pd.DataFrame:
        return pd.read_csv(path)
```

#### Production Implementation

```python
def open_connection(connection_string: str):
    ...


@DataSource.implementation(alias='PRODUCTION')
class ProdDataSource:
    @staticmethod
    @dataset
    def store(
        connection_string: str = Option('PROD.CONNECTION_STRING')
    ) -> pd.DataFrame:
        with open_connection(connection_string) as conn:
            return pd.read_sql('SELECT * FROM stores', conn)

    @staticmethod
    @dataset
    def region(
        connection_string: str = Option('PROD.CONNECTION_STRING')
    ) -> pd.DataFrame:
        with open_connection(connection_string) as conn:
            return pd.read_sql('SELECT * FROM regions', conn)
```

Now, in your code, you can use `DataSource.store` and `DataSource.region` like normal datasets, and the implementation will be chosen based on the `ENVIRONMENT` option in your config.

```python
@dataset
def num_stores(
    store_ids: set[str] = DataSource.store_ids
) -> int:
    return len(store_ids)
```

### Collections

Labrea provides a few helper functions for creating collections of datasets. For example, you might have a list of datasets that you want to provide as a single input to another dataset. You can use the `evaluatable_list` function to accomplish this.

```python
from labrea import evaluatable_list, dataset


@dataset
def x() -> int:
    return 1

@dataset
def y() -> int:
    return 2

@dataset
def z(
    x_and_y: list[int] = evaluatable_list(x, y)
) -> list[int]:
    return x_and_y


z() == [1, 2]
```

The `evaluatable_tuple` and `evaluatable_set` functions work similarly.
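For instance, here is a quick sketch of `evaluatable_set` used the same way, reusing the `x` and `y` datasets from above (assuming it is imported from `labrea` and mirrors the signature of `evaluatable_list`).

```python
from labrea import evaluatable_set, dataset


@dataset
def xy_set(
    values: set[int] = evaluatable_set(x, y)
) -> set[int]:
    return values


xy_set() == {1, 2}
```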
There is also an `evaluatable_dict` function that takes a dictionary mapping (static) keys to Labrea objects.

```python
from labrea import evaluatable_dict


@dataset
def z(
    xy_dict: dict[str, int] = evaluatable_dict({'x': x, 'y': y})
) -> dict[str, int]:
    return xy_dict


z() == {'x': 1, 'y': 2}
```

### Map

Sometimes you want to use a dataset multiple times with different options. This can be accomplished using the `Map` type. `Map` takes a dataset and a dictionary mapping option keys to Labrea objects that return lists (or other iterables) of values. When the `Map` object is evaluated, it calls the dataset once for each of those values and returns an iterable of tuples, where the first element is the options set on that iteration and the second element is the resulting value. Like the built-in `map`, this iterable is lazy and is not a list.

```python
from labrea import dataset, Option, Map


@dataset
def x_plus_y(
    x: int = Option('X'),
    y: int = Option('Y')
) -> int:
    return x + y


mapped = Map(x_plus_y, {'X': Option('X_LIST')})

for keys, value in mapped({'X_LIST': [1, 2, 3], 'Y': 10}):
    print(keys, value)

## {'X': 1} 11
## {'X': 2} 12
## {'X': 3} 13
```

`Map` objects have a `.values` property that can be used to get only the values.

```python
for value in mapped.values({'X_LIST': [1, 2, 3], 'Y': 10}):
    print(value)

## 11
## 12
## 13
```

### In-Line Transformations

Sometimes you want to perform a transformation on a dataset (or other object) that is not worth creating a new dataset for. A common example is an `Option` that you want to parse as a date. You can use the `>>` operator (or the equivalent `.apply()` method) to perform this transformation in-line. The `>>` operator is shared by all Labrea objects.

```python
import datetime as dt

from labrea import Option


start_date = Option('START_DATE') >> dt.datetime.fromisoformat

start_date({'START_DATE': '2022-01-01'}) == dt.datetime(2022, 1, 1)
```

### Pipelines

Labrea datasets are most useful when you know the dependency tree in advance. Sometimes, however, you want to write code that performs some transformation on an arbitrary input, and perhaps perform a series of these transformations in an arbitrary order. For this, Labrea exposes a `Pipeline` class, where each step is created using the `@pipeline_step` decorator. Pipeline steps look similar to datasets, except that their first argument is always the input to the pipeline and should not have a default value. To combine pipeline steps into a pipeline, use the `+` operator. A pipeline can be evaluated like a dataset, in which case it returns a function of one variable; to run the pipeline on a value directly, use the `.transform(input, options)` method.
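As a minimal sketch of that calling convention, here are two hypothetical steps combined with `+` (assuming the steps are applied left to right).

```python
from labrea import pipeline_step, Option


@pipeline_step
def add_n(x: int, n: int = Option('N')) -> int:
    return x + n

@pipeline_step
def double(x: int) -> int:
    return 2 * x


add_then_double = add_n + double

## Evaluating the pipeline with options returns a function of one variable...
fn = add_then_double({'N': 1})
fn(10)  ## presumably 22, assuming add_n runs before double

## ...while .transform runs the pipeline on a value directly
add_then_double.transform(10, {'N': 1})  ## presumably 22
```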
For example, if you were building a feature engineering pipeline, you could write a series of functions that take a dataframe and add new columns, and then chain different subsets of these functions together to create different feature sets.

```python
import pandas as pd

from labrea import pipeline_step, Option, dataset


@dataset
def store_sales(path: str = Option('PATH.STORE_SALES')) -> pd.DataFrame:
    return pd.read_csv(path)

@dataset
def store_square_footage(path: str = Option('PATH.STORE_SQFT')) -> pd.DataFrame:
    return pd.read_csv(path)

@pipeline_step
def add_sales(
    df: pd.DataFrame,
    sales: pd.DataFrame = store_sales
) -> pd.DataFrame:
    return pd.merge(df, sales, on='store_id', how='left')

@pipeline_step
def add_square_footage(
    df: pd.DataFrame,
    sqft: pd.DataFrame = store_square_footage
) -> pd.DataFrame:
    return pd.merge(df, sqft, on='store_id', how='left')

@pipeline_step
def add_sales_per_sqft(
    df: pd.DataFrame,
    sales: pd.DataFrame = store_sales,
    sqft: pd.DataFrame = store_square_footage
) -> pd.DataFrame:
    df = pd.merge(df, sales, on='store_id', how='left')
    df = pd.merge(df, sqft, on='store_id', how='left')
    df['sales_per_sqft'] = df['sales'] / df['sqft']
    return df.drop(columns=['sales', 'sqft'])


basic_features = add_sales + add_square_footage
derived_features = add_sales_per_sqft
all_features = basic_features + derived_features

stores = pd.read_csv('/path/to/stores.csv')
options = {
    'PATH.STORE_SALES': '/path/to/store_sales.csv',
    'PATH.STORE_SQFT': '/path/to/store_sqft.csv'
}

basic_features.transform(stores, options)    ## Returns a dataframe with columns from stores and new columns sales and sqft
derived_features.transform(stores, options)  ## Returns a dataframe with columns from stores and a new column sales_per_sqft
all_features.transform(stores, options)      ## Returns a dataframe with columns from stores and new columns sales, sqft, and sales_per_sqft
```

Pipelines can also be used as inline transformations on other Labrea objects.

```python
from labrea import dataset, Option, pipeline_step


@dataset
def letters() -> list[str]:
    return ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

@pipeline_step
def take_first_n(
    lst: list[str],
    n: int = Option('N')
) -> list[str]:
    return lst[:n]


first_n_letters = letters >> take_first_n

first_n_letters({'N': 3}) == ['a', 'b', 'c']
```

### Helper Pipelines

Labrea provides a few helper functions for creating common pipeline steps. These are `map`, `filter`, and `reduce`, all found in the `labrea.functions` module.

```python
from labrea import dataset
import labrea.functions as lf


@dataset
def numbers() -> list[int]:
    return [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


sum_squared_evens = (
    numbers
    >> lf.filter(lambda x: x % 2 == 0)
    >> lf.map(lambda x: x**2)
    >> lf.reduce(lambda x, y: x + y)
)

sum_squared_evens() == 220
```

### Templates

Similar to the built-in f-strings, Labrea provides a `Template` type for string interpolation. This can be useful for creating strings that depend on config values, or that depend on the results of other datasets.

```python
from labrea import dataset, Option, Template


@dataset
def b_dataset(
    b: str = Option('B')
) -> str:
    return b


template = Template(
    '{A} {:b:}',
    b=b_dataset
)

template({'A': 'Hello', 'B': 'World!'})  ## 'Hello World!'
```

### Dataset Classes

You may want to write classes with more complex behavior that use datasets and options as their inputs. Similar to the built-in dataclasses, we can use the `@datasetclass` decorator to create a class whose `__init__` method takes an options dictionary and automatically evaluates dependencies like a dataset.
```python
import pandas as pd

from labrea import dataset, datasetclass, Option


@dataset
def input_data(
    path: str = Option('INPUT.PATH'),
    fmt: str = Option('INPUT.FMT', 'csv')
) -> pd.DataFrame:
    if fmt == 'csv':
        return pd.read_csv(path)
    elif fmt == 'excel':
        return pd.read_excel(path)
    else:
        raise ValueError('Only csv and excel files are accepted.')

@dataset
def distinct_stores(data: pd.DataFrame = input_data) -> set[str]:
    return set(data['store_id'])

@dataset
def distinct_regions(data: pd.DataFrame = input_data) -> set[str]:
    return set(data['region_id'])


@datasetclass
class MyClass:
    data: pd.DataFrame = input_data
    stores: set[str] = distinct_stores
    regions: set[str] = distinct_regions

    def lookup_store(self, store_id: str):
        if store_id not in self.stores:
            return None
        return self.data[self.data['store_id'] == store_id]


options = {
    'INPUT.PATH': '/path/to/input.xlsx',
    'INPUT.FMT': 'excel'
}

my_data = MyClass(options)

my_data.regions == {'region_1', 'region_2', 'region_3'}
my_data.lookup_store('store_1') == pd.DataFrame(...)
```

### Typing

By default, Labrea code will not pass a type checker (like MyPy), since the default arguments to datasets do not match the type annotations. For example:

```python
from labrea import dataset, Option


@dataset
def add(
    x: int = Option('X'),
    y: int = Option('Y')
) -> int:
    return x + y
```

This fails a type check because `Option('X')` is not an `int`; it is an `Option` object. However, when defining datasets we can appease the type checker by adding `.result` to the end of each of our `Option`s, like so:

```python
from labrea import dataset, Option


@dataset
def add(
    x: int = Option('X').result,
    y: int = Option('Y').result
) -> int:
    return x + y
```

This signals to the type checker that `Option('X').result` should be treated as the value the `Option` resolves to, rather than the `Option` object itself. The `.result` property is shared by all Labrea types (`Option`, `Dataset`, etc.). Whenever you define a dataset or dataset class, use the `.result` suffix on all of your dependencies to pass type checking.

#### Subscripting

For `Option`s, you can tell the type checker explicitly what the resulting type should be by subscripting, e.g. `Option[int]('X').result`. This can be useful if you want to share options across datasets and ensure that the types all match.

```python
from labrea import dataset, Option

X = Option[int]('X')


## PASSES
@dataset
def double(
    x: int = X.result
) -> int:
    return 2 * x

## PASSES
@dataset
def halve(
    x: int = X.result
) -> float:
    return x / 2.0

## FAILS
@dataset
def first_char(
    x: str = X.result
) -> str:
    return x[0]
```

#### MyPy Plugin

MyPy has trouble understanding how the decorators for interfaces and dataset classes work, so a MyPy plugin is provided to help. To use it, add the following to your `mypy.ini` file:

```
[mypy]
plugins = labrea.mypy.plugin
```