Metadata-Version: 2.1
Name: ml-tracking
Version: 0.0.3
Summary: A tracking infrastructure for PatternAg machine learning projects and experiments.
Author-email: Deandra Alvear <deandra@pattern.ag>
Project-URL: Homepage, https://github.com/pttrnag/ds-automation/tree/ml-tracking/ml-tracking
Project-URL: Bug Tracker, https://github.com/pttrnag/ds-automation/tree/ml-tracking/ml-tracking/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Machine Learning Tracking Infrastructure
This codebase contains an importable module that anyone can use during model development to track model performance and experiments. The module records all pertinent information for each experiment in project-specific CSV files: each time an experiment runs, a row is appended to the CSV, creating a running log of all experiments within a project. All experiment information is displayed in a Looker Data Studio dashboard to make ML experimentation more accessible for viewing.

In addition to tracking information, the dashboard displays performance visuals such as confusion matrices and regression plots for assessing model performance.

- [GCP bucket](https://console.cloud.google.com/storage/browser/production-patternag-ds-ml-tracking)
- [Looker Dashboard](https://lookerstudio.google.com/s/uLeEtSVV8hU)

## Installation
[Pypi project page](https://pypi.org/project/ml-tracking/)
~~~shell
pip3 install ml-tracking
~~~

## How to Use
Import the `ml_tracking` module and run `log_experiment()` with the arguments outlined below:
~~~python
from ml_tracking.ml_tracking import log_experiment

log_experiment(
    kind=str,              # 'classification' or 'regression'
    project=str,           # the name of the project
    parameters=dict,       # required information for tracking
    y_true=list,           # the actual y values
    y_pred=list,           # the predictions generated by the model
    extra_parameters=dict,  # optional; additional information to track
    new_csv=bool            # optional; whether to create a new csv
)
~~~
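For instance, the arguments for a classification run might be assembled as follows. All values here are hypothetical placeholders, and the `model` entry would normally be a fitted scikit-learn (or similar) estimator object:

~~~python
# Hypothetical example values for illustration only.
parameters = {
    'dataset_uri': 'gs://my-bucket/datasets/example.csv',  # hypothetical URI
    'target_column': 'label',
    'test_set': ['uuid-001', 'uuid-002', 'uuid-003'],
    'model': None,  # substitute the trained model object here
}
y_true = [1, 0, 1]
y_pred = [1, 0, 0]

# With the package installed, the experiment would then be logged via:
# log_experiment(kind='classification', project='demo_project',
#                parameters=parameters, y_true=y_true, y_pred=y_pred)
print(sorted(parameters))
~~~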

### Required Arguments:
- `kind`: The type of modeling being done, passed in as a string; should be either 'regression' or 'classification'.
- `project`: The overarching project identifier or name; trackers will be named using the convention `{project_name}_{date}_{i}.csv`. __Project names containing slashes or underscores will have these characters replaced with dashes in the GCP bucket and dashboard.__

    _Example:_ `"tracker_testing_1_/first_run"` -> `"tracker-testing-1-first-run"`
- `parameters`: The required information needed from the user:

    ~~~python
    parameters = {
        'dataset_uri': str,
        'target_column': str,
        'test_set': list,
        'model': object  # an sklearn or other model object
    }
    ~~~
    - `dataset_uri`: The bucket path to the finalized dataset being used, as a string. If the dataset is not currently in a bucket, it should be uploaded to one; this is required for replicating experiments.
    - `target_column`: A string containing the name of the column used as the dependent variable.
    - `test_set`: A list of the sample_uuids or other IDs that can be used to identify which samples are in the test set.
    - `model`: The model object; __must be the model object__, not the model name as a string.

- `y_true`: A list containing the true y-values in the test set.
- `y_pred`: A list containing the model predictions; must be in the same order as `y_true`.
- `extra_parameters`: A dictionary with any additional information the user wishes to track.
    ~~~python
    extra_parameters = {
        'scaler':'MinMaxScaler', 
        'data_cleaning':'removed features > 50% nulls'
        }
    ~~~
- `new_csv`: Setting this to True will create a new tracking csv; if set to False (or omitted), new experiments will be appended to the most recently created tracking file.
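The exact sanitization logic for project names lives inside the library, but a minimal sketch consistent with the documented convention (slashes and underscores replaced with dashes) might look like this; `sanitize_project_name` is a hypothetical helper name, not part of the `ml_tracking` API:

~~~python
import re

def sanitize_project_name(name: str) -> str:
    # Replace slashes and underscores with dashes, then collapse repeated
    # dashes and strip any leading/trailing dash, matching the example above.
    dashed = re.sub(r"[/_]", "-", name)
    collapsed = re.sub(r"-+", "-", dashed)
    return collapsed.strip("-")

print(sanitize_project_name("tracker_testing_1_/first_run"))
# tracker-testing-1-first-run
~~~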

## Viewing results in the dashboard
1. Use the project drop-down menu to select a project
2. Look in the tracking table to identify the correct prediction id
    - If many experiments were logged under the same project name, there will be many prediction ids
3. Select the prediction id of interest in the prediction id drop-down menu
4. Alternatively, skip steps 2 and 3 by looking in the model scoring table and selecting the prediction id with the best performance.

![img](files/result_select.png)

