Metadata-Version: 2.1
Name: rdt
Version: 0.2.9
Summary: Reversible Data Transforms
Home-page: https://github.com/sdv-dev/RDT
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Description: <p align="left">
        <img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
        <i>An open source project from Data to AI Lab at MIT.</i>
        </p>
        
        [![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        [![PyPi Shield](https://img.shields.io/pypi/v/RDT.svg)](https://pypi.python.org/pypi/RDT)
        [![Travis CI Shield](https://travis-ci.com/sdv-dev/RDT.svg?branch=master)](https://travis-ci.com/sdv-dev/RDT)
        [![Coverage Status](https://codecov.io/gh/sdv-dev/RDT/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/RDT)
        [![Downloads](https://pepy.tech/badge/rdt)](https://pepy.tech/project/rdt)
        
        # RDT: Reversible Data Transforms
        
        * License: [MIT](https://github.com/sdv-dev/RDT/blob/master/LICENSE)
        * Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        * Homepage: https://github.com/sdv-dev/RDT
        
        ## Overview
        
        **RDT** is a Python library used to transform data for data science libraries and preserve
        the transformations in order to revert them as needed.
        
        # Install
        
        ## Requirements
        
        **RDT** has been developed and tested on [Python 3.6, 3.7 and 3.8](https://www.python.org/downloads/)
        on GNU/Linux, macOS and Windows systems.
        
        Also, although it is not strictly required, the usage of a [virtualenv](
        https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
        interfering with other software installed in the system where **RDT** is run.
        
        ## Install with pip
        
        The easiest and recommended way to install **RDT** is using [pip](
        https://pip.pypa.io/en/stable/):
        
        ```bash
        pip install rdt
        ```
        
        This will pull and install the latest stable release from [PyPI](https://pypi.org/).
        
        If you want to install from source or contribute to the project please read the
        [Contributing Guide](CONTRIBUTING.rst).
        
        
        # Quickstart
        
        This short series of tutorials will guide you through the steps needed to
        get started using **RDT** to transform columns, tables and datasets.
        
        ## Transforming a column
        
        In this first guide, you will learn how to use **RDT** in its simplest form, transforming
        a single column taken from a `pandas.DataFrame` object.
        
        ### 1. Load the demo data
        
        You can load some demo data using the `rdt.get_demo` function, which will return some random
        data for you to play with.
        
        ```python3
        from rdt import get_demo
        
        data = get_demo()
        ```
        
        This will return a `pandas.DataFrame` with 10 rows and 4 columns, one of each data type supported:
        
        ```
           0_int    1_float 2_str          3_datetime
        0   38.0  46.872441     b 2021-02-10 21:50:00
        1   77.0  13.150228   NaN 2021-07-19 21:14:00
        2   21.0        NaN     b                 NaT
        3   10.0  37.128869     c 2019-10-15 21:39:00
        4   91.0  41.341214     a 2020-10-31 11:57:00
        5   67.0  92.237335     a                 NaT
        6    NaN  51.598682   NaN 2020-04-01 01:56:00
        7    NaN  42.204396     c 2020-03-12 22:12:00
        8   68.0        NaN     c 2021-02-25 16:04:00
        9    7.0  31.542918     a 2020-07-12 03:12:00
        ```
        
        Notice that the data is random, so your output might look a bit different. Also notice that
        RDT introduced some null values at random.
        
        ### 2. Load the transformer
        
        In this example we will use the datetime column, so let's load a `DatetimeTransformer`.
        
        ```python3
        from rdt.transformers import DatetimeTransformer
        
        transformer = DatetimeTransformer()
        ```
        
        ### 3. Fit the Transformer
        
        Before being able to transform the data, we need the transformer to learn from it.
        
        We will do this by calling its `fit` method passing the column that we want to transform.
        
        ```python3
        transformer.fit(data['3_datetime'])
        ```
        
        ### 4. Transform the data
        
        Once the transformer is fitted, we can pass the data again to its `transform` method in order
        to get the transformed version of the data.
        
        ```python3
        transformed = transformer.transform(data['3_datetime'])
        ```
        
        The output will be a `numpy.ndarray` with two columns, one with the datetimes transformed
        to integer timestamps, and another one indicating with 1s which values were null in the
        original data.
        
        ```
        array([[1.61299380e+18, 0.00000000e+00],
               [1.62672924e+18, 0.00000000e+00],
               [1.59919923e+18, 1.00000000e+00],
               [1.57117554e+18, 0.00000000e+00],
               [1.60414542e+18, 0.00000000e+00],
               [1.59919923e+18, 1.00000000e+00],
               [1.58570616e+18, 0.00000000e+00],
               [1.58405112e+18, 0.00000000e+00],
               [1.61426904e+18, 0.00000000e+00],
               [1.59452352e+18, 0.00000000e+00]])
        ```
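        
        The numbers in the sample output above can be reproduced with the standard library alone.
        The sketch below is not RDT's implementation: `to_nanoseconds` and `transform_datetimes`
        are hypothetical helpers. It assumes that naive datetimes are counted in nanoseconds since
        the Unix epoch and that nulls (modeled here as `None`) are filled with the mean of the
        non-null timestamps, which is what the values shown above are consistent with. It also
        assumes that the column has at least one non-null value.
        
        ```python3
        from datetime import datetime
        
        EPOCH = datetime(1970, 1, 1)
        
        
        def to_nanoseconds(value):
            """Nanoseconds elapsed between the Unix epoch and a naive datetime."""
            delta = value - EPOCH
            seconds = delta.days * 86_400 + delta.seconds
            return seconds * 1_000_000_000 + delta.microseconds * 1_000
        
        
        def transform_datetimes(values):
            """Return one [timestamp, null_flag] row per value (None marks a null)."""
            known = [to_nanoseconds(value) for value in values if value is not None]
            fill = sum(known) / len(known)  # assumed fill value: mean of the non-nulls
            return [
                [fill, 1.0] if value is None else [float(to_nanoseconds(value)), 0.0]
                for value in values
            ]
        
        
        rows = transform_datetimes([
            datetime(2021, 2, 10, 21, 50),
            None,
            datetime(2019, 10, 15, 21, 39),
        ])
        print(rows[0])  # [1.6129938e+18, 0.0]
        ```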
        
        ### 5. Revert the column transformation
        
        In order to revert the previous transformation, the transformed data can be passed to
        the `reverse_transform` method of the transformer:
        
        ```python3
        reversed_data = transformer.reverse_transform(transformed)
        ```
        
        The output will be a `pandas.Series` containing the reverted values, which should be exactly
        like the original ones.
        
        ```
        0   2021-02-10 21:50:00
        1   2021-07-19 21:14:00
        2                   NaT
        3   2019-10-15 21:39:00
        4   2020-10-31 11:57:00
        5                   NaT
        6   2020-04-01 01:56:00
        7   2020-03-12 22:12:00
        8   2021-02-25 16:04:00
        9   2020-07-12 03:12:00
        dtype: datetime64[ns]
        ```
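        
        The reversal can be pictured as reading each row back: the null flag decides whether a
        value existed at all, and the timestamp is converted back to a datetime otherwise. The
        following is a minimal standard-library sketch under the same assumptions as the sample
        output (nanoseconds since the Unix epoch, a flag of 1 for nulls). `reverse_rows` is a
        hypothetical helper, not part of RDT, and it returns `None` where pandas would show `NaT`.
        
        ```python3
        from datetime import datetime, timedelta
        
        EPOCH = datetime(1970, 1, 1)
        
        
        def reverse_rows(rows):
            """Rebuild datetimes from (timestamp_ns, null_flag) pairs."""
            restored = []
            for timestamp_ns, is_null in rows:
                if is_null >= 0.5:
                    # The flag says the original value was null: ignore the fill value.
                    restored.append(None)
                else:
                    # Python datetimes only resolve microseconds, so drop the last 3 digits.
                    restored.append(EPOCH + timedelta(microseconds=int(timestamp_ns) // 1_000))
            return restored
        
        
        restored = reverse_rows([
            [1612993800000000000, 0],
            [1592084670000000000, 1],
        ])
        print(restored)  # [datetime.datetime(2021, 2, 10, 21, 50), None]
        ```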
        
        ## Transforming a table
        
        Once we know how to transform a single column, we can go to the next level and transform
        a table with multiple columns.
        
        ### 1. Load the HyperTransformer
        
        In order to manipulate a complete table we will need to load an `rdt.HyperTransformer`.
        
        ```python3
        from rdt import HyperTransformer
        
        ht = HyperTransformer()
        ```
        
        ### 2. Fit the HyperTransformer
        
        Just like the transformer, the HyperTransformer needs to be fitted before it can transform
        data.
        
        This is done by calling its `fit` method passing the `data` DataFrame.
        
        ```python3
        ht.fit(data)
        ```
        
        ### 3. Transform the table data
        
        Once the HyperTransformer is fitted, we can pass the data again to its `transform` method in order
        to get the transformed version of the data.
        
        ```python3
        transformed = ht.transform(data)
        ```
        
        The output will now be another `pandas.DataFrame` with the numerical representation of our
        data.
        
        ```
            0_int  0_int#1    1_float  1_float#1  2_str    3_datetime  3_datetime#1
        0  38.000      0.0  46.872441        0.0   0.70  1.612994e+18           0.0
        1  77.000      0.0  13.150228        0.0   0.90  1.626729e+18           0.0
        2  21.000      0.0  44.509511        1.0   0.70  1.599199e+18           1.0
        3  10.000      0.0  37.128869        0.0   0.15  1.571176e+18           0.0
        4  91.000      0.0  41.341214        0.0   0.45  1.604145e+18           0.0
        5  67.000      0.0  92.237335        0.0   0.45  1.599199e+18           1.0
        6  47.375      1.0  51.598682        0.0   0.90  1.585706e+18           0.0
        7  47.375      1.0  42.204396        0.0   0.15  1.584051e+18           0.0
        8  68.000      0.0  44.509511        1.0   0.15  1.614269e+18           0.0
        9   7.000      0.0  31.542918        0.0   0.45  1.594524e+18           0.0
        ```
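        
        The `2_str` values in the sample output are consistent with a frequency-based encoding:
        each category, including null, gets a slice of the `[0, 1]` range proportional to how
        often it appears, and is represented by the midpoint of that slice. The sketch below
        illustrates that idea with the standard library only. It is not RDT's code, the function
        names are hypothetical, and null values are modeled as `None`.
        
        ```python3
        from collections import Counter
        from fractions import Fraction
        
        
        def fit_intervals(values):
            """Give each category a slice of [0, 1] proportional to its frequency."""
            total = len(values)
            intervals = {}
            start = 0
            # most_common() sorts by count; ties keep first-seen order (Python 3.7+).
            for category, count in Counter(values).most_common():
                end = start + count
                intervals[category] = (Fraction(start, total), Fraction(end, total))
                start = end
            return intervals
        
        
        def transform(values, intervals):
            """Represent each category by the midpoint of its interval."""
            return [float(sum(intervals[value]) / 2) for value in values]
        
        
        def reverse_transform(numbers, intervals):
            """Return the category whose interval contains each number."""
            return [
                next(cat for cat, (_, end) in intervals.items() if number <= end)
                for number in numbers
            ]
        
        
        column = ["b", None, "b", "c", "a", "a", None, "c", "c", "a"]
        intervals = fit_intervals(column)
        encoded = transform(column, intervals)
        print(encoded)  # [0.7, 0.9, 0.7, 0.15, 0.45, 0.45, 0.9, 0.15, 0.15, 0.45]
        ```
        
        Because every number inside a slice maps back to the same category, calling
        `reverse_transform(encoded, intervals)` in this sketch returns the original column.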
        
        ### 4. Revert the table transformation
        
        In order to revert the transformation and recover the original data from the transformed one,
        we need to call the `reverse_transform` method of the `HyperTransformer` instance, passing it
        the transformed data.
        
        ```python3
        reversed_data = ht.reverse_transform(transformed)
        ```
        
        This should output, again, a table that looks exactly like the original one.
        
        ```
           0_int    1_float 2_str          3_datetime
        0   38.0  46.872441     b 2021-02-10 21:50:00
        1   77.0  13.150228   NaN 2021-07-19 21:14:00
        2   21.0        NaN     b                 NaT
        3   10.0  37.128869     c 2019-10-15 21:39:00
        4   91.0  41.341214     a 2020-10-31 11:57:00
        5   67.0  92.237335     a                 NaT
        6    NaN  51.598682   NaN 2020-04-01 01:56:00
        7    NaN  42.204396     c 2020-03-12 22:12:00
        8   68.0        NaN     c 2021-02-25 16:04:00
        9    7.0  31.542918     a 2020-07-12 03:12:00
        ```
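        
        The nulls come back because the `#1` columns of the transformed table record where they
        were. As a rough illustration using the `0_int` values from the tables above, the sketch
        below fills nulls with the column mean, keeps a flag next to each value, and uses that flag
        to restore the nulls. The helper names are hypothetical, `None` stands in for `NaN`, and the
        mean fill is an assumption inferred from the `47.375` shown in the transformed table.
        
        ```python3
        def transform_numbers(values):
            """Return (filled_value, null_flag) pairs; None marks a null."""
            known = [value for value in values if value is not None]
            fill = sum(known) / len(known)  # assumed fill value: mean of the non-nulls
            return [
                (fill, 1.0) if value is None else (float(value), 0.0)
                for value in values
            ]
        
        
        def reverse_numbers(pairs):
            """Drop the fill value wherever the flag marks a null."""
            return [None if flag >= 0.5 else value for value, flag in pairs]
        
        
        column = [38.0, 77.0, 21.0, 10.0, 91.0, 67.0, None, None, 68.0, 7.0]
        pairs = transform_numbers(column)
        print(pairs[6])                          # (47.375, 1.0)
        print(reverse_numbers(pairs) == column)  # True
        ```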
        
        
        # History
        
        ## 0.2.9 - 2020-11-27
        
        This release fixes a bug that prevented the `CategoricalTransformer` from working properly
        when being passed data that contained numerical data only, without any strings, but also
        contained `None` or `NaN` values.
        
        ### Issues closed
        
        * KeyError: nan - CategoricalTransformer fails on numerical + nan data only - Issue [#142](https://github.com/sdv-dev/RDT/issues/142) by @csala
        
        ## 0.2.8 - 2020-11-20
        
        This release fixes a few minor bugs, including some which prevented RDT from fully working
        on Windows systems.
        
        Thanks to these fixes, as well as a new testing infrastructure that has been set up, from now
        on RDT is officially supported on Windows systems, as well as on the Linux and macOS systems
        which were previously supported.
        
        ### Issues closed
        
        * TypeError: unsupported operand type(s) for: 'NoneType' and 'int' - Issue [#132](https://github.com/sdv-dev/RDT/issues/132) by @csala
        * Example does not work on Windows - Issue [#114](https://github.com/sdv-dev/RDT/issues/114) by @csala
        * OneHotEncodingTransformer producing all zeros - Issue [#135](https://github.com/sdv-dev/RDT/issues/135) by @fealho
        * OneHotEncodingTransformer support for lists and lists of lists - Issue [#137](https://github.com/sdv-dev/RDT/issues/137) by @fealho
        
        ## 0.2.7 - 2020-10-16
        
        In this release we drop the support for the now officially dead Python 3.5
        and introduce a new feature in the DatetimeTransformer which reduces the dimensionality
        of the generated numerical values while also ensuring that the reverted datetimes
        maintain the same level of time unit precision as the original ones.
        
        * Drop Py35 support - Issue [#129](https://github.com/sdv-dev/RDT/issues/129) by @csala
        * Add option to drop constant parts of the datetimes - Issue [#130](https://github.com/sdv-dev/RDT/issues/130) by @csala
        
        ## 0.2.6 - 2020-10-05
        
        * Add GaussianCopulaTransformer - Issue [#125](https://github.com/sdv-dev/RDT/issues/125) by @csala
        * dtype category error - Issue [#124](https://github.com/sdv-dev/RDT/issues/124) by @csala
        
        ## 0.2.5 - 2020-09-18
        
        Minor bugfix release.
        
        ### Bugs Fixed
        
        * Handle NaNs in OneHotEncodingTransformer - Issue [#118](https://github.com/sdv-dev/RDT/issues/118) by @csala
        * OneHotEncodingTransformer fails if there is only one category - Issue [#119](https://github.com/sdv-dev/RDT/issues/119) by @csala
        * All NaN column produces NaN values enhancement - Issue [#121](https://github.com/sdv-dev/RDT/issues/121) by @csala
        * Make the CategoricalTransformer learn the column dtype and restore it back - Issue [#122](https://github.com/sdv-dev/RDT/issues/122) by @csala
        
        ## 0.2.4 - 2020-08-08
        
        ### General Improvements
        
        * Support Python 3.8 - Issue [#117](https://github.com/sdv-dev/RDT/issues/117) by @csala
        * Support pandas >1 - Issue [#116](https://github.com/sdv-dev/RDT/issues/116) by @csala
        
        ## 0.2.3 - 2020-07-09
        
        * Implement OneHot and Label encoding as transformers - Issue [#112](https://github.com/sdv-dev/RDT/issues/112) by @csala
        
        ## 0.2.2 - 2020-06-26
        
        ### Bugs Fixed
        
        * Escape `column_name` in hypertransformer - Issue [#110](https://github.com/sdv-dev/RDT/issues/110) by @csala
        
        ## 0.2.1 - 2020-01-17
        
        ### Bugs Fixed
        
        * Boolean Transformer fails to revert when there are NO nulls - Issue [#103](https://github.com/sdv-dev/RDT/issues/103) by @JDTheRipperPC
        
        ## 0.2.0 - 2019-10-15
        
        This version comes with a brand new API and internal implementation, removing the old
        metadata JSON from the user provided arguments, and making each transformer work only
        with `pandas.Series` of their corresponding data type.
        
        As part of this change, several transformer names have been changed and a new BooleanTransformer
        and a feature to automatically decide which transformers to use based on dtypes have been added.
        
        Unit test coverage has also been increased to 100%.
        
        Special thanks to @JDTheRipperPC and @csala for the big efforts put in making this
        release possible.
        
        ### Issues
        
        * Drop the usage of meta - Issue [#72](https://github.com/sdv-dev/RDT/issues/72) by @JDTheRipperPC
        * Make CatTransformer.probability_map deterministic - Issue [#25](https://github.com/sdv-dev/RDT/issues/25) by @csala
        
        ## 0.1.3 - 2019-09-24
        
        ### New Features
        
        * Add attributes NullTransformer and col_meta - Issue [#30](https://github.com/sdv-dev/RDT/issues/30) by @ManuelAlvarezC
        
        ### General Improvements
        
        * Integrate with CodeCov - Issue [#89](https://github.com/sdv-dev/RDT/issues/89) by @csala
        * Remake Sphinx Documentation - Issue [#96](https://github.com/sdv-dev/RDT/issues/96) by @JDTheRipperPC
        * Improve README - Issue [#92](https://github.com/sdv-dev/RDT/issues/92) by @JDTheRipperPC
        * Document RELEASE workflow - Issue [#93](https://github.com/sdv-dev/RDT/issues/93) by @JDTheRipperPC
        * Add support to Python 3.7 - Issue [#38](https://github.com/sdv-dev/RDT/issues/38) by @ManuelAlvarezC
        * Create way to pass HyperTransformer table dict - Issue [#45](https://github.com/sdv-dev/RDT/issues/45) by @ManuelAlvarezC
        
        ## 0.1.2
        
        * Add a numerical transformer for positive numbers.
        * Add option to anonymize data on categorical transformer.
        * Move the `col_meta` argument from method-level to class-level.
        * Move the logic for missing values from the transformers into the `HyperTransformer`.
        * Remove unreachable lines in `NullTransformer`.
        * Make `NumberTransformer` set the default value to 0 when the column is null.
        * Add a CLA for collaborators.
        * Refactor the transformers to improve performance.
        
        ## 0.1.1
        
        * Improve handling of NaN in NumberTransformer and CatTransformer.
        * Add unittests for HyperTransformer.
        * Remove unused methods `get_types` and `impute_table` from HyperTransformer.
        * Make NumberTransformer enforce dtype int on integer data.
        * Make DTTransformer check data format before transforming.
        * Add minimal API Reference.
        * Merge `rdt.utils` into `HyperTransformer` class.
        
        ## 0.1.0
        
        * First release on PyPI.
        
Keywords: rdt
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6,<3.9
Description-Content-Type: text/markdown
Provides-Extra: test
Provides-Extra: dev
