Metadata-Version: 2.1
Name: rdt
Version: 0.5.0.dev0
Summary: Reversible Data Transforms
Home-page: https://github.com/sdv-dev/RDT
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Description: <p align="left">
          <a href="https://dai.lids.mit.edu">
            <img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
          </a>
          <i>An Open Source Project from the <a href="https://dai.lids.mit.edu">Data to AI Lab, at MIT</a></i>
        </p>
        
        [![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        [![PyPi Shield](https://img.shields.io/pypi/v/RDT.svg)](https://pypi.python.org/pypi/RDT)
        [![Tests](https://github.com/sdv-dev/RDT/workflows/Run%20Tests/badge.svg)](https://github.com/sdv-dev/RDT/actions?query=workflow%3A%22Run+Tests%22+branch%3Amaster)
        [![Downloads](https://pepy.tech/badge/rdt)](https://pepy.tech/project/rdt)
        [![Coverage Status](https://codecov.io/gh/sdv-dev/RDT/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/RDT)
        
        <img align="center" width=40% src="docs/images/rdt-logo.png">
        
        * Website: https://sdv.dev
        * Documentation: https://sdv.dev/SDV
        * Repository: https://github.com/sdv-dev/RDT
        * License: [MIT](https://github.com/sdv-dev/RDT/blob/master/LICENSE)
        * Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        
        # Overview
        
        **RDT** is a Python library used to transform data for data science libraries and preserve
        the transformations in order to revert them as needed.
        
        # Install
        
        **RDT** is part of the **SDV** project and is automatically installed alongside it. For
        details about this process please visit the
        [SDV Installation Guide](https://sdv.dev/SDV/getting_started/install.html).
        
        Optionally, **RDT** can also be installed as a standalone library using the following commands:
        
        **Using `pip`:**
        
        ```bash
        pip install rdt
        ```
        
        **Using `conda`:**
        
        ```bash
        conda install -c sdv-dev -c conda-forge rdt
        ```
        
        For more installation options please visit the [RDT Installation Guide](INSTALL.md).
        
        
        # Quickstart
        
        In this short series of tutorials we will guide you through the steps that will
        help you get started using **RDT** to transform columns, tables and datasets.
        
        ## Transforming a column
        
        In this first guide, you will learn how to use **RDT** in its simplest form, transforming
        a single column of a `pandas.DataFrame` object.
        
        ### 1. Load the demo data
        
        You can load some demo data using the `rdt.get_demo` function, which will return some random
        data for you to play with.
        
        ```python3
        from rdt import get_demo
        
        data = get_demo()
        ```
        
        This will return a `pandas.DataFrame` with 10 rows and 4 columns, one of each data type supported:
        
        ```
           0_int    1_float 2_str          3_datetime
        0   38.0  46.872441     b 2021-02-10 21:50:00
        1   77.0  13.150228   NaN 2021-07-19 21:14:00
        2   21.0        NaN     b                 NaT
        3   10.0  37.128869     c 2019-10-15 21:39:00
        4   91.0  41.341214     a 2020-10-31 11:57:00
        5   67.0  92.237335     a                 NaT
        6    NaN  51.598682   NaN 2020-04-01 01:56:00
        7    NaN  42.204396     c 2020-03-12 22:12:00
        8   68.0        NaN     c 2021-02-25 16:04:00
        9    7.0  31.542918     a 2020-07-12 03:12:00
        ```
        
        Notice how the data is random, so your output might look a bit different. Also notice how
        RDT introduced some null values randomly.
        
        ### 2. Load the transformer
        
        In this example we will use the datetime column, so let's load a `DatetimeTransformer`.
        
        ```python3
        from rdt.transformers import DatetimeTransformer
        
        transformer = DatetimeTransformer()
        ```
        
        ### 3. Fit the Transformer
        
        Before being able to transform the data, we need the transformer to learn from it.
        
        We will do this by calling its `fit` method passing the column that we want to transform.
        
        ```python3
        transformer.fit(data['3_datetime'])
        ```
        
        ### 4. Transform the data
        
        Once the transformer is fitted, we can pass the data again to its `transform` method in order
        to get the transformed version of the data.
        
        ```python3
        transformed = transformer.transform(data['3_datetime'])
        ```
        
        The output will be a `numpy.ndarray` with two columns, one with the datetimes transformed
        to integer timestamps, and another one indicating with 1s which values were null in the
        original data.
        
        ```
        array([[1.61299380e+18, 0.00000000e+00],
               [1.62672924e+18, 0.00000000e+00],
               [1.59919923e+18, 1.00000000e+00],
               [1.57117554e+18, 0.00000000e+00],
               [1.60414542e+18, 0.00000000e+00],
               [1.59919923e+18, 1.00000000e+00],
               [1.58570616e+18, 0.00000000e+00],
               [1.58405112e+18, 0.00000000e+00],
               [1.61426904e+18, 0.00000000e+00],
               [1.59452352e+18, 0.00000000e+00]])
        ```
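        
        The sample output above is consistent with each datetime being converted to an integer
        number of nanoseconds since the Unix epoch, and with null values being filled with the
        mean of the non-null timestamps. The following plain-Python sketch illustrates that idea
        without using **RDT**. The `to_timestamp_ns` and `sketch_transform` helpers are hypothetical
        names introduced for illustration and are not part of the library.
        
        ```python3
        from datetime import datetime
        
        EPOCH = datetime(1970, 1, 1)
        
        
        def to_timestamp_ns(value):
            """Convert a naive datetime to integer nanoseconds since the Unix epoch."""
            delta = value - EPOCH
            seconds = delta.days * 86_400 + delta.seconds
            return seconds * 10**9 + delta.microseconds * 1_000
        
        
        def sketch_transform(values):
            """Return [timestamp, null_flag] rows, filling nulls with the mean timestamp."""
            non_null = [to_timestamp_ns(value) for value in values if value is not None]
            # Assumption: nulls are filled with the mean, as the sample output suggests.
            fill_value = sum(non_null) / len(non_null)
            rows = []
            for value in values:
                if value is None:
                    rows.append([fill_value, 1.0])
                else:
                    rows.append([float(to_timestamp_ns(value)), 0.0])
            return rows
        
        
        # Rows 0 and 3 of the demo data, with a null value in between.
        column = [datetime(2021, 2, 10, 21, 50), None, datetime(2019, 10, 15, 21, 39)]
        sketch_rows = sketch_transform(column)
        ```
        
        Here the first row becomes `[1.6129938e+18, 0.0]`, matching the first row of the sample
        output, while the null entry is flagged with `1.0` in the second column.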
        
        ### 5. Revert the column transformation
        
        In order to revert the previous transformation, the transformed data can be passed to
        the `reverse_transform` method of the transformer:
        
        ```python3
        reversed_data = transformer.reverse_transform(transformed)
        ```
        
        The output will be a `pandas.Series` containing the reverted values, which should be exactly
        like the original ones.
        
        ```
        0   2021-02-10 21:50:00
        1   2021-07-19 21:14:00
        2                   NaT
        3   2019-10-15 21:39:00
        4   2020-10-31 11:57:00
        5                   NaT
        6   2020-04-01 01:56:00
        7   2020-03-12 22:12:00
        8   2021-02-25 16:04:00
        9   2020-07-12 03:12:00
        dtype: datetime64[ns]
        ```
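        
        Reverting works because the second column of the transformed array records which values
        were null, so the fill value can be discarded again. A minimal plain-Python sketch of this
        step, assuming nanosecond timestamps and using the hypothetical helper name
        `sketch_reverse` (not part of **RDT**), could look like this:
        
        ```python3
        from datetime import datetime, timedelta
        
        EPOCH = datetime(1970, 1, 1)
        
        
        def sketch_reverse(rows):
            """Turn [timestamp_ns, null_flag] rows back into datetimes or None."""
            restored = []
            for timestamp_ns, null_flag in rows:
                if null_flag >= 0.5:
                    # The timestamp of a flagged row is only a fill value, so drop it.
                    restored.append(None)
                else:
                    microseconds = int(timestamp_ns) // 1_000
                    restored.append(EPOCH + timedelta(microseconds=microseconds))
            return restored
        
        
        # Rows 0, 2 and 3 of the transformed sample shown earlier.
        rows = [[1.6129938e18, 0.0], [1.59919923e18, 1.0], [1.57117554e18, 0.0]]
        restored = sketch_reverse(rows)
        ```
        
        With these rows, `restored` holds `2021-02-10 21:50:00`, `None` and `2019-10-15 21:39:00`.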
        
        ## Transforming a table
        
        Once we know how to transform a single column, we can try to go to the next level and
        transform a table with multiple columns.
        
        ### 1. Load the HyperTransformer
        
        In order to manipulate a complete table we will need to load a `rdt.HyperTransformer`.
        
        ```python3
        from rdt import HyperTransformer
        
        ht = HyperTransformer()
        ```
        
        ### 2. Fit the HyperTransformer
        
        Just like the transformer, the HyperTransformer needs to be fitted before it can transform
        data.
        
        This is done by calling its `fit` method passing the `data` DataFrame.
        
        ```python3
        ht.fit(data)
        ```
        
        ### 3. Transform the table data
        
        Once the HyperTransformer is fitted, we can pass the data again to its `transform` method in order
        to get the transformed version of the data.
        
        ```python3
        transformed = ht.transform(data)
        ```
        
        The output will now be another `pandas.DataFrame` with the numerical representation of our
        data.
        
        ```
            0_int  0_int#1    1_float  1_float#1  2_str    3_datetime  3_datetime#1
        0  38.000      0.0  46.872441        0.0   0.70  1.612994e+18           0.0
        1  77.000      0.0  13.150228        0.0   0.90  1.626729e+18           0.0
        2  21.000      0.0  44.509511        1.0   0.70  1.599199e+18           1.0
        3  10.000      0.0  37.128869        0.0   0.15  1.571176e+18           0.0
        4  91.000      0.0  41.341214        0.0   0.45  1.604145e+18           0.0
        5  67.000      0.0  92.237335        0.0   0.45  1.599199e+18           1.0
        6  47.375      1.0  51.598682        0.0   0.90  1.585706e+18           0.0
        7  47.375      1.0  42.204396        0.0   0.15  1.584051e+18           0.0
        8  68.000      0.0  44.509511        1.0   0.15  1.614269e+18           0.0
        9   7.000      0.0  31.542918        0.0   0.45  1.594524e+18           0.0
        ```
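        
        The `2_str` values in the sample output above are consistent with each category, including
        the null one, being assigned a slice of the `[0, 1]` interval proportional to its frequency
        and then being represented by the midpoint of that slice. The plain-Python sketch below
        illustrates this idea under that assumption. It is not the **RDT** implementation, and the
        names `fit_intervals`, `sketch_encode` and `sketch_decode` are hypothetical.
        
        ```python3
        from collections import Counter
        
        
        def fit_intervals(values):
            """Assign each category a slice of [0, 1] sized by its frequency."""
            intervals = {}
            start = 0.0
            # most_common() orders categories from most to least frequent.
            for category, count in Counter(values).most_common():
                end = start + count / len(values)
                intervals[category] = (start, end)
                start = end
            return intervals
        
        
        def sketch_encode(values, intervals):
            """Replace each category with the midpoint of its interval."""
            return [sum(intervals[value]) / 2 for value in values]
        
        
        def sketch_decode(numbers, intervals):
            """Map each number back to the category whose interval contains it."""
            categories = list(intervals)
            restored = []
            for number in numbers:
                match = categories[-1]  # values at the upper bound belong to the last interval
                for category in categories:
                    if number < intervals[category][1]:
                        match = category
                        break
                restored.append(match)
            return restored
        
        
        # The 2_str column of the demo data, with None standing in for NaN.
        values = ['b', None, 'b', 'c', 'a', 'a', None, 'c', 'c', 'a']
        intervals = fit_intervals(values)
        encoded = sketch_encode(values, intervals)
        decoded = sketch_decode(encoded, intervals)
        ```
        
        Up to floating point rounding, `encoded` reproduces the `2_str` column of the sample output,
        and `decoded` is equal to the original `values` list.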
        
        ### 4. Revert the table transformation
        
        In order to revert the transformation and recover the original data from the transformed one,
        we need to call the `reverse_transform` method of the `HyperTransformer` instance, passing it
        the transformed data.
        
        ```python3
        reversed_data = ht.reverse_transform(transformed)
        ```
        
        This should output, again, a table that looks exactly like the original one.
        
        ```
           0_int    1_float 2_str          3_datetime
        0   38.0  46.872441     b 2021-02-10 21:50:00
        1   77.0  13.150228   NaN 2021-07-19 21:14:00
        2   21.0        NaN     b                 NaT
        3   10.0  37.128869     c 2019-10-15 21:39:00
        4   91.0  41.341214     a 2020-10-31 11:57:00
        5   67.0  92.237335     a                 NaT
        6    NaN  51.598682   NaN 2020-04-01 01:56:00
        7    NaN  42.204396     c 2020-03-12 22:12:00
        8   68.0        NaN     c 2021-02-25 16:04:00
        9    7.0  31.542918     a 2020-07-12 03:12:00
        ```
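        
        Recovering the null values is possible because the transformed table keeps, next to each
        filled column, a `#1` column flagging the rows that were originally null. The following
        plain-Python sketch shows this round trip for the `0_int` column, assuming that nulls are
        filled with the column mean as the sample output suggests. The function names
        `sketch_transform_table` and `sketch_reverse_table` are hypothetical and not part of **RDT**.
        
        ```python3
        import math
        
        
        def sketch_transform_table(table):
            """Fill nulls with the column mean and add a '#1' null indicator column."""
            transformed = {}
            for name, values in table.items():
                present = [value for value in values if not math.isnan(value)]
                mean = sum(present) / len(present)
                transformed[name] = [mean if math.isnan(value) else value for value in values]
                transformed[name + '#1'] = [1.0 if math.isnan(value) else 0.0 for value in values]
            return transformed
        
        
        def sketch_reverse_table(transformed):
            """Use each '#1' indicator column to put the nulls back."""
            table = {}
            for name, values in transformed.items():
                if name.endswith('#1'):
                    continue  # indicator columns are consumed together with their data column
                flags = transformed[name + '#1']
                table[name] = [
                    math.nan if flag >= 0.5 else value
                    for value, flag in zip(values, flags)
                ]
            return table
        
        
        table = {'0_int': [38.0, 77.0, 21.0, 10.0, 91.0, 67.0, math.nan, math.nan, 68.0, 7.0]}
        transformed_table = sketch_transform_table(table)
        restored_table = sketch_reverse_table(transformed_table)
        ```
        
        In this sketch the two null rows are filled with `47.375`, the mean of the remaining values,
        which matches the `0_int` column of the transformed sample.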
        
        # The Synthetic Data Vault
        
        <p>
          <a href="https://sdv.dev">
            <img width=30% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDV-Logo-Color-Tagline.png?raw=true">
          </a>
          <p><i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a></i></p>
        </p>
        
        * Website: https://sdv.dev
        * Documentation: https://sdv.dev/SDV
        
        
        # History
        
        ## 0.4.2 - 2021-06-08
        
        This release adds a new method to the `CategoricalTransformer` to solve a bug where
        the transformer becomes unusable after being pickled and unpickled if it had `NaN`
        values in the data which it was fit on.
        
        It also fixes some grammar mistakes in the documentation.
        
        ### Issues closed
        
        * CategoricalTransformer with NaN values cannot be pickled bug - Issue [#164](https://github.com/sdv-dev/RDT/issues/164) by @pvk-developer and @csala
        
        ### Documentation changes
        
        * docs: fix typo - PR [#163](https://github.com/sdv-dev/RDT/issues/163) by @sbrugman
        
        ## 0.4.1 - 2021-03-29
        
        This release improves the `HyperTransformer` memory usage when working with a
        high number of columns or a high number of categorical values when using one hot encoding.
        
        ### Issues closed
        
        * `Boolean`, `Datetime` and `LabelEncoding` transformers fail with 2D `ndarray` - Issue [#160](https://github.com/sdv-dev/RDT/issues/160) by @pvk-developer
        * `HyperTransformer`: Memory usage increase when `reverse_transform` is called - Issue [#156](https://github.com/sdv-dev/RDT/issues/156) by @pvk-developer and @AnupamaGangadhar
        
        ## 0.4.0 - 2021-02-24
        
        In this release a change in the HyperTransformer allows using it to transform and
        reverse transform a subset of the columns seen during training.
        
        The anonymization functionality which was deprecated and not being used has also
        been removed along with the Faker dependency.
        
        ### Issues closed
        
        * Allow the HyperTransformer to be used on a subset of the columns - Issue [#152](https://github.com/sdv-dev/RDT/issues/152) by @csala
        * Remove faker - Issue [#150](https://github.com/sdv-dev/RDT/issues/150) by @csala
        
        ## 0.3.0 - 2021-01-27
        
        This release changes the behavior of the `HyperTransformer` to prevent it from
        modifying any column in the given `DataFrame` if the `transformers` dictionary
        is passed empty.
        
        ### Issues closed
        
        * If transformers is an empty dict, do nothing - Issue [#149](https://github.com/sdv-dev/RDT/issues/149) by @csala
        
        ## 0.2.10 - 2020-12-18
        
        This release adds a new argument to the `HyperTransformer` which gives control over
        which transformers to use by default for each `dtype` if no specific transformer
        has been specified for the field.
        
        This is also the first version to be officially released on conda.
        
        ### Issues closed
        
        * Add `dtype_transformers` argument to HyperTransformer - Issue [#148](https://github.com/sdv-dev/RDT/issues/148) by @csala
        * Makes Copulas an optional dependency - Issue [#144](https://github.com/sdv-dev/RDT/issues/144) by @fealho
        
        ## 0.2.9 - 2020-11-27
        
        This release fixes a bug that prevented the `CategoricalTransformer` from working properly
        when being passed data that contained numerical data only, without any strings, but also
        contained `None` or `NaN` values.
        
        ### Issues closed
        
        * KeyError: nan - CategoricalTransformer fails on numerical + nan data only - Issue [#142](https://github.com/sdv-dev/RDT/issues/142) by @csala
        
        ## 0.2.8 - 2020-11-20
        
        This release fixes a few minor bugs, including some which prevented RDT from fully working
        on Windows systems.
        
        Thanks to these fixes, as well as a new testing infrastructure that has been set up, from now
        on RDT is officially supported on Windows systems, as well as on the Linux and macOS systems
        which were previously supported.
        
        ### Issues closed
        
        * TypeError: unsupported operand type(s) for: 'NoneType' and 'int' - Issue [#132](https://github.com/sdv-dev/RDT/issues/132) by @csala
        * Example does not work on Windows - Issue [#114](https://github.com/sdv-dev/RDT/issues/114) by @csala
        * OneHotEncodingTransformer producing all zeros - Issue [#135](https://github.com/sdv-dev/RDT/issues/135) by @fealho
        * OneHotEncodingTransformer support for lists and lists of lists - Issue [#137](https://github.com/sdv-dev/RDT/issues/137) by @fealho
        
        ## 0.2.7 - 2020-10-16
        
        In this release we drop the support for the now officially dead Python 3.5
        and introduce a new feature in the DatetimeTransformer which reduces the dimensionality
        of the generated numerical values while also ensuring that the reverted datetimes
        maintain the same level of time unit precision as the original ones.
        
        * Drop Py35 support - Issue [#129](https://github.com/sdv-dev/RDT/issues/129) by @csala
        * Add option to drop constant parts of the datetimes - Issue [#130](https://github.com/sdv-dev/RDT/issues/130) by @csala
        
        ## 0.2.6 - 2020-10-05
        
        * Add GaussianCopulaTransformer - Issue [#125](https://github.com/sdv-dev/RDT/issues/125) by @csala
        * dtype category error - Issue [#124](https://github.com/sdv-dev/RDT/issues/124) by @csala
        
        ## 0.2.5 - 2020-09-18
        
        Minor bugfixing release.
        
        ### Bugs Fixed
        
        * Handle NaNs in OneHotEncodingTransformer - Issue [#118](https://github.com/sdv-dev/RDT/issues/118) by @csala
        * OneHotEncodingTransformer fails if there is only one category - Issue [#119](https://github.com/sdv-dev/RDT/issues/119) by @csala
        * All NaN column produces NaN values enhancement - Issue [#121](https://github.com/sdv-dev/RDT/issues/121) by @csala
        * Make the CategoricalTransformer learn the column dtype and restore it back - Issue [#122](https://github.com/sdv-dev/RDT/issues/122) by @csala
        
        ## 0.2.4 - 2020-08-08
        
        ### General Improvements
        
        * Support Python 3.8 - Issue [#117](https://github.com/sdv-dev/RDT/issues/117) by @csala
        * Support pandas >1 - Issue [#116](https://github.com/sdv-dev/RDT/issues/116) by @csala
        
        ## 0.2.3 - 2020-07-09
        
        * Implement OneHot and Label encoding as transformers - Issue [#112](https://github.com/sdv-dev/RDT/issues/112) by @csala
        
        ## 0.2.2 - 2020-06-26
        
        ### Bugs Fixed
        
        * Escape `column_name` in hypertransformer - Issue [#110](https://github.com/sdv-dev/RDT/issues/110) by @csala
        
        ## 0.2.1 - 2020-01-17
        
        ### Bugs Fixed
        
        * Boolean Transformer fails to revert when there are NO nulls - Issue [#103](https://github.com/sdv-dev/RDT/issues/103) by @JDTheRipperPC
        
        ## 0.2.0 - 2019-10-15
        
        This version comes with a brand new API and internal implementation, removing the old
        metadata JSON from the user provided arguments, and making each transformer work only
        with `pandas.Series` of their corresponding data type.
        
        As part of this change, several transformer names have been changed and a new BooleanTransformer
        and a feature to automatically decide which transformers to use based on dtypes have been added.
        
        Unit test coverage has also been increased to 100%.
        
        Special thanks to @JDTheRipperPC and @csala for the big efforts put in making this
        release possible.
        
        ### Issues
        
        * Drop the usage of meta - Issue [#72](https://github.com/sdv-dev/RDT/issues/72) by @JDTheRipperPC
        * Make CatTransformer.probability_map deterministic - Issue [#25](https://github.com/sdv-dev/RDT/issues/25) by @csala
        
        ## 0.1.3 - 2019-09-24
        
        ### New Features
        
        * Add attributes NullTransformer and col_meta - Issue [#30](https://github.com/sdv-dev/RDT/issues/30) by @ManuelAlvarezC
        
        ### General Improvements
        
        * Integrate with CodeCov - Issue [#89](https://github.com/sdv-dev/RDT/issues/89) by @csala
        * Remake Sphinx Documentation - Issue [#96](https://github.com/sdv-dev/RDT/issues/96) by @JDTheRipperPC
        * Improve README - Issue [#92](https://github.com/sdv-dev/RDT/issues/92) by @JDTheRipperPC
        * Document RELEASE workflow - Issue [#93](https://github.com/sdv-dev/RDT/issues/93) by @JDTheRipperPC
        * Add support to Python 3.7 - Issue [#38](https://github.com/sdv-dev/RDT/issues/38) by @ManuelAlvarezC
        * Create way to pass HyperTransformer table dict - Issue [#45](https://github.com/sdv-dev/RDT/issues/45) by @ManuelAlvarezC
        
        ## 0.1.2
        
        * Add a numerical transformer for positive numbers.
        * Add option to anonymize data on categorical transformer.
        * Move the `col_meta` argument from method-level to class-level.
        * Move the logic for missing values from the transformers into the `HyperTransformer`.
        * Removed unreachable lines in `NullTransformer`.
        * `NumberTransformer` to set default value to 0 when the column is null.
        * Add a CLA for collaborators.
        * Refactor performance-wise the transformers.
        
        ## 0.1.1
        
        * Improve handling of NaN in NumberTransformer and CatTransformer.
        * Add unittests for HyperTransformer.
        * Remove unused methods `get_types` and `impute_table` from HyperTransformer.
        * Make NumberTransformer enforce dtype int on integer data.
        * Make DTTransformer check data format before transforming.
        * Add minimal API Reference.
        * Merge `rdt.utils` into `HyperTransformer` class. 
        
        ## 0.1.0
        
        * First release on PyPI.
        
Keywords: rdt
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6,<3.9
Description-Content-Type: text/markdown
Provides-Extra: copulas
Provides-Extra: dev
Provides-Extra: test
