Metadata-Version: 2.1
Name: project-data-catalog
Version: 0.3.1
Summary: A catalog to define, create, store, and access datasets
Project-URL: Homepage, https://github.com/numerical-io/data_catalog
License-Expression: MIT
Requires-Python: >=3.7
Requires-Dist: dask>=0.2.0
Requires-Dist: pandas>=0.19
Requires-Dist: pytz>=2014
Requires-Dist: s3fs>=0.2.0
Description-Content-Type: text/markdown

# Data Catalog

A catalog to define, create, store, and access datasets.

This Python package aims to streamline data engineering and data analysis in data science projects:

- organize all datasets used or created by your project,
- define datasets as a transformation of others,
- easily propagate updates when datasets are updated,
- avoid boilerplate code,
- access datasets from anywhere in your code, without having to remember file paths,
- share dataset definitions within a project team,
- document datasets,
- and enable smooth transitions from exploration to deployment.

Many data cataloging Python packages exist (kedro, prefect, ...) for slightly different use cases. This package is tailored for managing datasets during data science projects. The emphasis is on minimal boilerplate, easy data access, and no-effort updates.


## Installation

Use a Python environment compatible with this project, e.g. with conda:
```
conda create -n my_env python "pandas>=0.19" "dask>=0.2.0" "s3fs>=0.2.0" "pytz>=2011k" pytest pyarrow
```

Install this package:
```
pip install git+https://github.com/numerical-io/data_catalog.git@main
```

## Example 1: a catalog of datasets

A data catalog is defined in a Python module or package. The following catalog defines three classes, each representing a dataset:

- DatasetA, defined by code (`create` function),
- DatasetB, defined as a transformation of DatasetA (`create` function and `parents` attribute),
- DatasetC, defined and read from a CSV file (`relative_path` attribute).

Each class inherits from a class defining the data format on disk. This example uses CSV and parquet files.

```python
# example_catalog.py

import pandas as pd
from data_catalog.datasets import CsvDataset, ParquetDataset


class DatasetA(CsvDataset):
    """A dataset defined in code, and saved as a CSV file.
    """

    def create(self):
        df_a = pd.DataFrame({"col1": [1, 2, 3]})
        return df_a

    read_kwargs = {"index_col": 0}


class DatasetB(ParquetDataset):
    """A dataset defined from another dataset, and saved as a Parquet file.
    """

    parents = [DatasetA]

    def create(self, df_a):
        df_b = 2 * df_a
        return df_b


class DatasetC(CsvDataset):
    """A dataset defined in a CSV file.
    """

    relative_path = "dataset_c.csv"

```

This catalog definition contains all that is needed. The datasets defined in code are generated by a succession of tasks, which we encode in a task graph. The task graph follows the Dask DAG format, and we execute it with Dask. (The graph itself is otherwise independent of Dask, and could be run by an engine of your choice.)

```python
from data_catalog.taskgraph import create_task_graph
from dask.threaded import get

from example_catalog import DatasetA, DatasetB

# Define a context. The context is necessary to instantiate datasets.
# It contains a URI indicating where to save all the catalog's datasets.
context = {
    "catalog_uri": "file:///path/to/data/folder"
}

# Generate a task graph to create datasets, resolving dependencies between them.
datasets = [DatasetA, DatasetB] # leave out DatasetC unless you provide a file dataset_c.csv
taskgraph, targets = create_task_graph(datasets, context)

# Let Dask generate all datasets on disk
_ = get(taskgraph, targets)
```
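
For readers unfamiliar with the Dask graph format: a graph is a plain dict mapping keys to tasks, where a task is a tuple whose first element is a callable and whose remaining elements are its arguments (possibly other keys of the graph). The sketch below only illustrates that format. The key names and the functions `make_a`, `double`, and `run` are hypothetical; this is not what `create_task_graph` actually produces. The small `run` function stands in for an execution engine and shows that such a graph can be evaluated without Dask.

```python
# Illustrative sketch only: hypothetical keys and functions,
# not the graph generated by data_catalog.


def make_a():
    """Stand-in for a task that creates a first dataset."""
    return [1, 2, 3]


def double(values):
    """Stand-in for a task that derives a dataset from its parent."""
    return [2 * v for v in values]


# A graph in the Dask format: key -> (callable, *arguments).
# String arguments that are keys of the graph denote dependencies.
graph = {
    "dataset_a": (make_a,),
    "dataset_b": (double, "dataset_a"),
}


def run(graph, key, cache=None):
    """Evaluate `key`, computing the keys it depends on first."""
    if cache is None:
        cache = {}
    if key not in cache:
        task = graph[key]
        if isinstance(task, tuple) and task and callable(task[0]):
            func, *args = task
            resolved = [
                run(graph, arg, cache)
                if isinstance(arg, str) and arg in graph
                else arg
                for arg in args
            ]
            cache[key] = func(*resolved)
        else:
            # Anything that is not a task tuple is a literal value.
            cache[key] = task
    return cache[key]


result = run(graph, "dataset_b")
print(result)  # [2, 4, 6]
```

With Dask installed, `dask.threaded.get(graph, "dataset_b")` should evaluate the same dict.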

Once the files are created, you can access datasets from anywhere in your project.

```python
dataset_b = DatasetB(context)

# Load into a dataframe
df = dataset_b.read()

# View its description
dataset_b.description()

# View the file path
dataset_b.path()

```

## Example 2: a catalog with collections of datasets

Sometimes data is available as a collection of identically formatted files. Collections of datasets are available to handle this case.

Collections can be defined in a catalog as follows:

```python
# example_catalog.py

import pandas as pd
from data_catalog.datasets import ParquetDataset
from data_catalog.collections import FileCollection, same_key_in


class CollectionA(FileCollection):
    """A collection of datasets saved as Parquet files.
    """

    def keys(self):
        return ["file_1", "file_2", "file_3"]

    class Item(ParquetDataset):
        def create(self):
            df = pd.DataFrame({"col1": [1, 2, 3]})
            return df


class CollectionB(FileCollection):
    """A collection defined from CollectionA.

    Each item corresponds to one item in CollectionA.
    """

    def keys(self):
        return ["file_1", "file_2", "file_3"]

    class Item(ParquetDataset):
        parents = [same_key_in(CollectionA)]

        def create(self, df):
            return 2 * df


class DatasetD(ParquetDataset):
    """A dataset concatenating all items from CollectionA.
    """

    parents = [CollectionA]

    def create(self, collection_a):
        df = pd.concat(collection_a)
        return df
```

Files are generated in the same way as in the previous example:

```python
from data_catalog.taskgraph import create_task_graph
from dask.threaded import get

from example_catalog import CollectionA, CollectionB, DatasetD

# Define the catalog's context
context = {
    "catalog_uri": "file:///path/to/data/folder"
}

# Generate the task graph and run it with Dask
taskgraph, targets = create_task_graph(
    [CollectionA, CollectionB, DatasetD], context
)
_ = get(taskgraph, targets)
```

You can then access data anywhere in your project.

```python
# Load a collection
CollectionA(context).read()

# Get a single dataset from a collection.
item_2 = CollectionA.get("file_2")

# item_2 is a usual dataset object
df = item_2(context).read()
```

The task graph only includes necessary updates. If all files exist and parents have older update times than their children, no task is executed. If, however, you modify a file, the task graph contains tasks to update all its descendants. Code changes are not detected this way: after modifying the code of a dataset, remove the corresponding file to trigger its re-creation and the update of all its descendants.
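
The update rule can be sketched with file modification times. This is a minimal illustration under stated assumptions, not the package's implementation: the helper `needs_update` is hypothetical, it only handles local files, it looks at direct parents only, and it treats equal timestamps as up to date.

```python
import os
import tempfile


def needs_update(child_path, parent_paths):
    """Return True if the child file is missing or older than a parent.

    Hypothetical helper, for illustration only.
    """
    if not os.path.exists(child_path):
        return True
    child_mtime = os.path.getmtime(child_path)
    return any(os.path.getmtime(p) > child_mtime for p in parent_paths)


with tempfile.TemporaryDirectory() as folder:
    parent = os.path.join(folder, "DatasetA.csv")
    child = os.path.join(folder, "DatasetB.parquet")
    for path in (parent, child):
        open(path, "w").close()

    # Parent older than child: nothing to do.
    os.utime(parent, (1000, 1000))
    os.utime(child, (2000, 2000))
    when_up_to_date = needs_update(child, [parent])

    # Parent modified after the child was written: the child is stale.
    os.utime(parent, (3000, 3000))
    after_parent_change = needs_update(child, [parent])

    # Child file removed (e.g. after editing its code): re-create it.
    os.remove(child)
    after_child_removed = needs_update(child, [parent])

print(when_up_to_date, after_parent_change, after_child_removed)
# False True True
```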


## Dataset attributes

When defining a dataset class, you can set the following attributes:

- `parents`: A list of dataset/collection classes from which this dataset is defined.
- `create`: A method to create the dataset. It takes as inputs, aside from `self`, the data loaded from all classes in `parents`. The number of input arguments (not counting `self`) must therefore be equal to the length of `parents`. The method must return the created data.
- `relative_path`: The file path, relative to the catalog URI.
- `file_extension`: The file extension.
- `is_binary_file`: A boolean indicating whether the file is binary (as opposed to text).
- `read_kwargs`: A dict of keyword arguments for reading the dataset.
- `write_kwargs`: A dict of keyword arguments for writing the dataset.

All these attributes are optional, and have default values if omitted.
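
The rule that `create` takes one argument per entry in `parents` can be checked with a few lines of introspection. The sketch below uses plain stand-in classes rather than the catalog's base classes, and the helper `create_matches_parents` is hypothetical; it is not part of the package.

```python
import inspect


def create_matches_parents(dataset_class):
    """Check that `create` takes one argument per parent, besides `self`.

    Hypothetical helper, for illustration only.
    """
    parents = getattr(dataset_class, "parents", [])
    parameters = inspect.signature(dataset_class.create).parameters
    return len(parameters) - 1 == len(parents)


class ParentData:
    """Stand-in dataset without parents."""

    def create(self):
        return [1, 2, 3]


class ChildData:
    """Stand-in dataset with one parent and one matching argument."""

    parents = [ParentData]

    def create(self, parent_values):
        return [2 * v for v in parent_values]


class BrokenData:
    """Stand-in dataset whose `create` ignores its declared parent."""

    parents = [ParentData]

    def create(self):
        return []


print(create_matches_parents(ParentData))  # True
print(create_matches_parents(ChildData))   # True
print(create_matches_parents(BrokenData))  # False
```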

When `relative_path` is missing, it is inferred from the class name and path in the package. For instance, a CSV dataset `SomeDataset` defined in the submodule `example_catalog.part_one` will have a relative path set to `part_one/SomeDataset.csv`.
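
The inference rule can be written down as a small function. This is a sketch of the naming convention as described here, not the package's code: the function `infer_relative_path` is hypothetical, and its behavior for a class defined directly in the top-level module is an assumption.

```python
import posixpath


def infer_relative_path(module_name, class_name, file_extension):
    """Build a relative path from a module path and a class name.

    Hypothetical helper: drops the top-level package name, turns the
    remaining submodules into directories, and names the file after the
    class.
    """
    submodules = module_name.split(".")[1:]
    file_name = f"{class_name}.{file_extension}"
    return posixpath.join(*submodules, file_name)


print(infer_relative_path("example_catalog.part_one", "SomeDataset", "csv"))
# part_one/SomeDataset.csv

# Assumption: a class in the top-level module gets no directory prefix.
print(infer_relative_path("example_catalog", "DatasetA", "csv"))
# DatasetA.csv
```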

If a docstring is set, it becomes the dataset description available through the `description()` method.

Datasets must inherit from a subclass of `AbstractDataset`. The data catalog provides a few such classes for common cases: `CsvDataset`, `ParquetDataset`, `PickleDataset`, `ExcelDataset`, and `YamlDataset`.


## Collection attributes

A collection can have the following attributes:

- `Item`: A nested class defining a dataset in the collection. It is a template for each item in the collection.
- `keys`: A method returning a list of keys. Each key maps to a collection item. Files in the collection are named after keys, and conversely.
- `relative_path`: If set, this path refers to the directory containing collection data files. This value is used to define the `relative_path` for each `Item`.

Collections inherit from `FileCollection`.

Collections have a class method `get` that returns dataset classes for given keys.
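
To make the relationship between `Item`, keys, and `get` more concrete, here is a toy model written with plain Python classes. It is only a sketch of the idea that `Item` acts as a template and that files are named after keys. The classes `ToyDataset` and `ToyCollection`, the generated class names, and the `key` attribute are all hypothetical, and `FileCollection` may well work differently internally.

```python
class ToyDataset:
    """Stand-in for a dataset base class (not the library's AbstractDataset)."""

    relative_path = None
    file_extension = "parquet"


class ToyCollection:
    """Stand-in for a collection whose files live in one directory."""

    relative_path = "CollectionA"

    class Item(ToyDataset):
        """Template shared by all items of the collection."""

    @classmethod
    def keys(cls):
        return ["file_1", "file_2", "file_3"]

    @classmethod
    def get(cls, key):
        """Return a dataset class for `key`, derived from the Item template."""
        path = f"{cls.relative_path}/{key}.{cls.Item.file_extension}"
        return type(
            f"{cls.__name__}_{key}",
            (cls.Item,),
            {"relative_path": path, "key": key},
        )


item_2 = ToyCollection.get("file_2")
print(item_2.relative_path)  # CollectionA/file_2.parquet
print(issubclass(item_2, ToyCollection.Item))  # True
```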


## Managing the catalog

The data files reside at the URI set in the `context` variable, which is used for instantiating all objects. The catalog currently supports URIs pointing to local files (`file://`) or to S3 (`s3://`). Note that the catalog itself is defined independently of its location; only data instances depend on the context. This facilitates the creation of several copies, e.g. for sharing between different users or versioning datasets.
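
As an illustration of keeping several copies of one catalog, the sketch below defines two contexts and inspects their URIs with the standard library. The bucket name `my-bucket` and the helper `describe_location` are hypothetical; how the package itself parses the URI is not shown here.

```python
from urllib.parse import urlparse

# Two contexts for the same catalog definition: one local, one on S3.
local_context = {"catalog_uri": "file:///path/to/data/folder"}
s3_context = {"catalog_uri": "s3://my-bucket/data/folder"}  # hypothetical bucket


def describe_location(context):
    """Split a context's catalog URI into scheme, bucket/host, and path."""
    parts = urlparse(context["catalog_uri"])
    return parts.scheme, parts.netloc, parts.path


print(describe_location(local_context))  # ('file', '', '/path/to/data/folder')
print(describe_location(s3_context))     # ('s3', 'my-bucket', '/data/folder')
```

Under this setup, instantiating the same dataset class with `local_context` or with `s3_context` would refer to two separate copies of the data.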

To view all datasets and collections defined in a catalog, use the following functions:
```python
from data_catalog.utils import describe_catalog, list_catalog

import example_catalog

# Get dataset names and descriptions, in a dict.
describe_catalog(example_catalog)

# List all classes representing datasets and collections.
# If example_catalog is a package, the list will contain
# classes from all _imported_ submodules.
list_catalog(example_catalog)

```

When running the task graph, each task logs messages to a logger named `data_catalog`. The following logger configuration shows these messages on `sys.stderr`:

```python
import logging

logger = logging.getLogger("data_catalog")

logger.setLevel(logging.INFO)
log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_handler = logging.StreamHandler()
log_handler.setLevel(logging.INFO)
log_handler.setFormatter(log_formatter)
logger.addHandler(log_handler)
```
