Metadata-Version: 2.1
Name: kappadata
Version: 0.0.8
Summary: pytorch dataset wrappers for in-memory caching
Home-page: https://github.com/BenediktAlkin/KappaData
Project-URL: Source Code, https://github.com/BenediktAlkin/KappaData
Project-URL: Bug Tracker, https://github.com/BenediktAlkin/KappaData/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# KappaData

[![publish](https://github.com/BenediktAlkin/KappaData/actions/workflows/publish.yaml/badge.svg)](https://github.com/BenediktAlkin/KappaData/actions/workflows/publish.yaml)

Utilities for [datasets and dataloading](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) with [pytorch](https://pytorch.org/)
- modular datasets
- caching datasets in-memory
- various dataset filters and other manipulation (filter by class, limit size to a %, ...)

### Modular datasets
[pytorch datasets](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) load all data in the `__getitem__`.
KappaData decouples the `__getitem__` such that single properties of the dataset can be loaded independently.
#### Image classification dataset example
Let's take an image classification dataset as an example. A sample consists of an image with an associated class label. 
```
import torch

class ImageClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = load_image(self.image_paths[idx])
        class_label = image_path_to_class_label(self.image_paths[idx])
        return img, class_label
```

If your training process contains a step that only requires the class labels, the dataset still has to load
all the images, which can take a long time (whereas loading only the labels is very fast).
With KappaData the `__getitem__` method is split into subparts:
```
import kappadata

class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def getitem_x(self, idx, ctx=None):
        return load_image(self.image_paths[idx])

    def getitem_y(self, idx, ctx=None):
        return image_path_to_class_label(self.image_paths[idx])
```
Now each subpart of the dataset can be retrieved by wrapping the dataset into a `ModeWrapper`:
```
ds = ImageClassificationDataset(image_paths=...)
for y in kappadata.ModeWrapper(ds, mode="y"):
  ...
```

#### Augmentation parameters
With KappaData you can also retrieve various properties of your data preprocessing (e.g. augmentation parameters).
The following example shows how to retrieve the parameters of [torchvision.transforms.RandomResizedCrop](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html).
```
import torchvision
import torchvision.transforms.functional as F

class MyRandomResizedCrop(torchvision.transforms.RandomResizedCrop):
    def forward(self, img, ctx=None):
        # make random resized crop
        i, j, h, w = self.get_params(img, self.scale, self.ratio)
        cropped = F.resized_crop(img, i, j, h, w, self.size, self.interpolation)
        # store parameters
        if ctx is not None:
            ctx["crop_parameters"] = (i, j, h, w)
        return cropped

class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, ...):
        ...
        self.random_resized_crop = MyRandomResizedCrop()
    ...
    def getitem_x(self, idx, ctx=None):
        img = load_image(self.image_paths[idx])
        return self.random_resized_crop(img, ctx=ctx)
```

When you want to access the parameters, simply pass `return_ctx=True` to the `ModeWrapper`:
```
ds = ImageClassificationDataset(image_paths=...)
for x, ctx in kappadata.ModeWrapper(ds, mode="x", return_ctx=True):
  print(ctx["crop_parameters"])
```

## Caching datasets in-memory
### SharedDictDataset
`kappadata.SharedDictDataset` provides a wrapper to store arbitrary datasets in-memory via a dictionary shared between all 
worker processes ([python multiprocessing](https://docs.python.org/3/library/multiprocessing.html)).
The shared memory part is important for [dataloading](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) 
with `num_workers > 0`. Small and medium-sized datasets can be cached in-memory to avoid bottlenecks when loading data
from disk. For example, even the full [ImageNet](https://www.image-net.org/) (~130GB) can be cached on many servers,
as it is not uncommon for GPU servers to have more RAM than that.

`cached_ds = kappadata.SharedDictDataset(dataset=ImageClassificationDataset(...))`
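The idea behind such a wrapper can be sketched with the stdlib alone. The class below is a hypothetical, minimal illustration (not kappadata's actual implementation): a `multiprocessing.Manager` dictionary lives in a separate server process, so every dataloader worker sees the same cache.

```python
from multiprocessing import Manager

class SharedDictCache:
    """Minimal sketch of dict-based in-memory caching (illustrative only)."""

    def __init__(self, dataset):
        self.dataset = dataset
        # Manager().dict() lives in a server process, so all workers share one cache
        self.cache = Manager().dict()

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            # first access: load from the wrapped dataset and store the result
            self.cache[idx] = self.dataset[idx]
        # subsequent accesses (from any worker) hit the shared cache
        return self.cache[idx]
```

A proxied dict pays inter-process communication costs on every access, which is why caching only pays off when the wrapped dataset's own loading (e.g. disk reads) is slower than that overhead.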

### RedisDataset [EXPERIMENTAL]
`kappadata.RedisDataset` provides an in-memory cache via the [redis](https://redis.io/) in-memory database.
This enables sharing data between multiple GPU processes (not only worker processes) for multi-GPU training.

### Caching image datasets
Naively caching image datasets can lead to high memory consumption because image data is usually stored in a compressed
format on disk and decompressed by the dataset, so the cache would hold the much larger decompressed data.
To reduce memory, the raw compressed data needs to be cached and decompressed only on access.
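The effect can be illustrated with the stdlib's `zlib` standing in for image compression (a sketch of the principle, not kappadata code): the cache stores the small compressed bytes, and decompression happens on each access.

```python
import zlib

# stand-in for a compressible image: a flat gray 64x64 RGB image as raw bytes
raw = bytes([128]) * (64 * 64 * 3)

# what gets cached: the compressed representation (analogous to the JPEG on disk)
compressed = zlib.compress(raw)

# on access: decompress back to the raw pixels (analogous to decoding to a PIL image)
restored = zlib.decompress(compressed)

assert restored == raw
assert len(compressed) < len(raw)  # the cache holds far fewer bytes
```

This trades a bit of CPU time per access for a large reduction in cache size, which is usually the right trade-off when RAM is the limiting resource.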

Example: caching a [torchvision.datasets.ImageFolder](https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html):
```
import torchvision
import kappadata
from kappadata.loading.image_folder import raw_image_loader, raw_image_folder_sample_to_pil_sample

class CachedImageFolder(kappadata.KDDataset):
    def __init__(self, ...):
        # modify ImageFolder to load raw samples (NOTE: can't apply transforms onto raw data)
        self.ds = torchvision.datasets.ImageFolder(..., transform=None, loader=raw_image_loader)
        # initialize cached dataset that decompresses the raw data into a PIL image
        self.cached_ds = kappadata.SharedDictDataset(self.ds, transform=raw_image_folder_sample_to_pil_sample)
        # store transforms to apply after decompression
        self.transform = ...

    def getitem_x(self, idx, ctx=None):
        x, _ = self.cached_ds[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x
```


## Automatically copy datasets to a local (fast) disk
Datasets are often stored on a global (slow) storage and moved to a local (fast) disk before training.
`kappadata.copy_folder_from_global_to_local` provides a utility function to do this automatically:
- local path doesn't exist -> automatically copy from global to local
- local path exists -> do nothing
- local path exists but is incomplete -> clear directory and copy again

```
from pathlib import Path
from kappadata import copy_folder_from_global_to_local
global_path = Path("/system/data/ImageNet")
local_path = Path("/local/data")
# /system/data/ImageNet contains a 'train' and a 'val' folder -> copy whole dataset
copy_folder_from_global_to_local(global_path, local_path)
# copy only "train"
copy_folder_from_global_to_local(global_path, local_path, relative_path="train")
```
The above code will also work (without modification) if `/system/data/ImageNet` contains only two zip files,
`train.zip` and `val.zip`.
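The three cases listed above can be sketched with the stdlib. The helper below is hypothetical (not kappadata's implementation): a marker file written only after a successful copy is one common way to tell a complete local copy apart from one left behind by an interrupted copy.

```python
import shutil
from pathlib import Path

def copy_with_marker(global_path: Path, local_path: Path) -> Path:
    """Sketch of copy-to-local-disk logic with a completeness marker (illustrative only)."""
    dst = local_path / global_path.name
    marker = dst / ".copy_complete"
    if dst.exists():
        if marker.exists():
            return dst  # complete local copy already present -> do nothing
        shutil.rmtree(dst)  # incomplete copy (e.g. interrupted) -> clear and copy again
    shutil.copytree(global_path, dst)
    marker.touch()  # mark the copy as complete only after everything arrived
    return dst
```

Because the marker is created last, a crash mid-copy leaves the destination without a marker, and the next run falls into the "clear and copy again" branch instead of training on partial data.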


## Dataset manipulation/filters
- Filter by class 
  - `kappadata.ClassFilterWrapper(ds, valid_classes=[0, 1])`
  - `kappadata.ClassFilterWrapper(ds, invalid_classes=[0, 1])`
- Balance data by oversampling underrepresented classes: `kappadata.OversamplingWrapper(ds)`
- Subset by specifying percentages 
  - `kappadata.PercentFilterWrapper(ds, from_percent=0.25)`
  - `kappadata.PercentFilterWrapper(ds, to_percent=0.75)`
  - `kappadata.PercentFilterWrapper(ds, from_percent=0.25, to_percent=0.75)`
- Repeat the whole dataset
  - repeat twice: `kappadata.RepeatWrapper(ds, repetitions=2)`
  - repeat until size is > 100: `kappadata.RepeatWrapper(ds, min_size=100)`
- Shuffle dataset
  - `kappadata.ShuffleWrapper(ds, seed=5)`
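
Wrappers like these typically work by remapping indices rather than copying data. A minimal, hypothetical sketch of the idea behind a percent-based subset (illustrative only, not kappadata's implementation):

```python
class PercentSubset:
    """Expose only the slice [from_percent, to_percent) of a dataset via index remapping."""

    def __init__(self, dataset, from_percent=0.0, to_percent=1.0):
        self.dataset = dataset
        self.start = int(len(dataset) * from_percent)
        self.end = int(len(dataset) * to_percent)

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, idx):
        # shift the requested index into the kept range; no data is copied
        return self.dataset[self.start + idx]
```

Because only indices are remapped, such wrappers are cheap to stack (filter, then repeat, then shuffle) without duplicating the underlying data.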
