synthcity.plugins.privacy.plugin_pategan module

Reference: James Jordon, Jinsung Yoon, Mihaela van der Schaar, “PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees,” International Conference on Learning Representations (ICLR), 2019. Paper link: https://openreview.net/forum?id=S1zk9iRqF7

class PATEGAN(max_iter: int = 1000, generator_n_layers_hidden: int = 2, generator_n_units_hidden: int = 100, generator_nonlin: str = 'relu', generator_n_iter: int = 10, generator_dropout: float = 0, discriminator_n_layers_hidden: int = 2, discriminator_n_units_hidden: int = 100, discriminator_nonlin: str = 'leaky_relu', discriminator_n_iter: int = 1, discriminator_dropout: float = 0.1, lr: float = 0.0001, weight_decay: float = 0.001, batch_size: int = 200, random_state: int = 0, clipping_value: int = 1, encoder_max_clusters: int = 5, device: Any = device(type='cpu'), n_teachers: int = 10, teacher_template: str = 'linear', epsilon: float = 1.0, delta: Optional[float] = None, lamda: float = 0.001, alpha: int = 100, encoder: Any = None)

Bases: synthcity.plugins.core.serializable.Serializable

Basic PATE-GAN framework.

fit(X_train: pandas.core.frame.DataFrame) Any
static load(buff: bytes) Any
static load_dict(representation: dict) Any
sample(count: int) numpy.ndarray
save() bytes
save_dict() dict
save_to_file(path: pathlib.Path) bytes
static version() str

API version

class PATEGANPlugin(n_iter: int = 200, generator_n_iter: int = 10, generator_n_layers_hidden: int = 2, generator_n_units_hidden: int = 500, generator_nonlin: str = 'relu', generator_dropout: float = 0, discriminator_n_layers_hidden: int = 2, discriminator_n_units_hidden: int = 500, discriminator_nonlin: str = 'leaky_relu', discriminator_n_iter: int = 1, discriminator_dropout: float = 0.1, lr: float = 0.001, weight_decay: float = 0.001, batch_size: int = 200, random_state: int = 0, clipping_value: int = 1, encoder_max_clusters: int = 5, n_teachers: int = 10, teacher_template: str = 'xgboost', epsilon: float = 1.0, delta: Optional[float] = None, lamda: float = 0.001, alpha: int = 100, encoder: Any = None, device: Any = device(type='cpu'), workspace: pathlib.Path = PosixPath('workspace'), compress_dataset: bool = False, sampling_patience: int = 500, **kwargs: Any)

Bases: synthcity.plugins.core.plugin.Plugin

PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees.

Parameters
  • generator_n_layers_hidden – int Number of hidden layers in the generator

  • generator_n_units_hidden – int Number of hidden units in each layer of the Generator

  • generator_nonlin – string, default ‘relu’ Nonlinearity to use in the generator. Can be ‘elu’, ‘relu’, ‘selu’ or ‘leaky_relu’.

  • n_iter – int Maximum number of iterations in the Generator.

  • generator_dropout – float Dropout value. If 0, the dropout is not used.

  • discriminator_n_layers_hidden – int Number of hidden layers in the discriminator

  • discriminator_n_units_hidden – int Number of hidden units in each layer of the discriminator

  • discriminator_nonlin – string, default ‘leaky_relu’ Nonlinearity to use in the discriminator. Can be ‘elu’, ‘relu’, ‘selu’ or ‘leaky_relu’.

  • discriminator_n_iter – int Maximum number of iterations in the discriminator.

  • discriminator_dropout – float Dropout value for the discriminator. If 0, the dropout is not used.

  • lr – float Learning rate for the optimizer.

  • weight_decay – float L2 (ridge) penalty for the weights.

  • batch_size – int Batch size.

  • random_state – int Random seed to use.

  • clipping_value – int, default 1 Gradient clipping value. Zero disables the feature.

  • teacher_template – str Model to use for the teachers. Can be ‘linear’ or ‘xgboost’.

  • epsilon – float Differential privacy parameter

  • delta – float Differential privacy parameter

  • lamda – float Noise size (the lambda of PATE_lambda).

  • encoder_max_clusters – int The max number of clusters to create for continuous columns when encoding

  # Core Plugin arguments:

  • workspace – Path. Optional Path for caching intermediary results.

  • compress_dataset – bool. Default = False. Drop redundant features before training the generator.

  • sampling_patience – int. Max inference iterations to wait for the generated data to match the training schema.
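
For reference, epsilon and delta are the two parameters of the standard (ε, δ)-differential privacy definition. A randomized mechanism M satisfies it if the inequality below holds for every pair of datasets D and D′ that differ in a single record and for every set S of possible outputs. The symbols M, D, D′ and S are notation introduced here for illustration only; they are not part of the plugin's API. Smaller values of either parameter correspond to a stronger guarantee.

```latex
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\epsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr] + \delta
```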

Example

>>> from sklearn.datasets import load_iris
>>> from synthcity.plugins import Plugins
>>>
>>> X, y = load_iris(as_frame = True, return_X_y = True)
>>> X["target"] = y
>>>
>>> plugin = Plugins().get("pategan", n_iter = 100)
>>> plugin.fit(X)
>>>
>>> plugin.generate(50)

class Config

Bases: object

arbitrary_types_allowed = True
validate_assignment = True
fit(X: Union[synthcity.plugins.core.dataloader.DataLoader, pandas.core.frame.DataFrame], *args: Any, **kwargs: Any) Any

Training method of the synthetic data plugin.

Parameters
  • X – DataLoader. The reference dataset.

  • cond

    Optional, Union[pd.DataFrame, pd.Series, np.ndarray] Optional Training Conditional. The training conditional can be used to control the output of some models, like GANs or VAEs. The content can be anything, as long as it maps to the training dataset X. Usage example:

    >>> import numpy as np
    >>> from sklearn.datasets import load_iris
    >>> from synthcity.plugins.core.dataloader import GenericDataLoader
    >>> from synthcity.plugins.core.constraints import Constraints
    >>>
    >>> # Load in `test_plugin` the generative model of choice
    >>> # ....
    >>>
    >>> X, y = load_iris(as_frame=True, return_X_y=True)
    >>> X["target"] = y
    >>>
    >>> X = GenericDataLoader(X)
    >>> test_plugin.fit(X, cond=y)
    >>>
    >>> count = 10
    >>> X_gen = test_plugin.generate(count, cond=np.ones(count))
    >>>
    >>> # The Conditional only optimizes the output generation
    >>> # for GANs and VAEs, but does NOT guarantee the samples
    >>> # are only from that condition.
    >>> # If you want to guarantee that output contains only
    >>> # "target" == 1 samples, use Constraints.
    >>>
    >>> constraints = Constraints(
    ...     rules=[
    ...         ("target", "==", 1),
    ...     ]
    ... )
    >>> X_gen = test_plugin.generate(
    ...     count,
    ...     cond=np.ones(count),
    ...     constraints=constraints,
    ... )
    >>> assert (X_gen["target"] == 1).all()
    

Returns

self

classmethod fqdn() str

The Fully-Qualified name of the plugin.

generate(count: Optional[int] = None, constraints: Optional[synthcity.plugins.core.constraints.Constraints] = None, random_state: Optional[int] = None, **kwargs: Any) synthcity.plugins.core.dataloader.DataLoader

Synthetic data generation method.

Parameters
  • count – optional int. The number of samples to generate. If None, it generates len(reference_dataset) samples.

  • cond – Optional, Union[pd.DataFrame, pd.Series, np.ndarray]. Optional Generation Conditional. The conditional can be used only if the model was trained using a conditional too. If provided, it must have count length. Not all models support conditionals. The conditionals can be used in VAEs or GANs to speed up the generation under some constraints. For model-agnostic solutions, check out the constraints parameter.

  • constraints

    optional Constraints. Optional constraints to apply on the generated data. If none, the reference schema constraints are applied. The constraints are model agnostic, and will filter the output of the generative model. The constraints are a list of rules. Each rule is a tuple of the form (<feature>, <operation>, <value>).

    Valid Operations:
    • "<", "lt": less than <value>

    • "<=", "le": less than or equal to <value>

    • ">", "gt": greater than <value>

    • ">=", "ge": greater than or equal to <value>

    • "==", "eq": equal to <value>

    • "in": valid for categorical features; <value> must be an array. For example, ("target", "in", [0, 1])

    • "dtype": <value> can be a data type. For example, ("target", "dtype", "int")

    Usage example:
    >>> from synthcity.plugins.core.constraints import Constraints
    >>> # `syn_model` (a fitted plugin) and `count` are assumed to be defined already
    >>> constraints = Constraints(
    ...     rules=[
    ...         ("InterestingFeature", "==", 0),
    ...     ]
    ... )
    >>>
    >>> syn_data = syn_model.generate(
    ...     count=count,
    ...     constraints=constraints
    ... ).dataframe()
    >>>
    >>> assert (syn_data["InterestingFeature"] == 0).all()
    

  • random_state – optional int. Optional random seed to use.
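
The model-agnostic filtering described for the constraints parameter can be pictured with a minimal plain-Python sketch. This is not synthcity's implementation: the names `OPS` and `filter_rows` are hypothetical, rows are modelled as plain dicts rather than a DataLoader, and the "dtype" rule is left out for brevity.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# A rule has the form (<feature>, <operation>, <value>), as in the list above.
Rule = Tuple[str, str, Any]

# Hypothetical lookup table from operation name to a comparison function.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt, "lt": operator.lt,
    "<=": operator.le, "le": operator.le,
    ">": operator.gt, "gt": operator.gt,
    ">=": operator.ge, "ge": operator.ge,
    "==": operator.eq, "eq": operator.eq,
    "in": lambda value, allowed: value in allowed,
}


def filter_rows(rows: List[Dict[str, Any]], rules: List[Rule]) -> List[Dict[str, Any]]:
    """Keep only the rows that satisfy every rule."""
    return [
        row
        for row in rows
        if all(OPS[op](row[feature], value) for feature, op, value in rules)
    ]


rows = [
    {"target": 0, "age": 31},
    {"target": 1, "age": 45},
    {"target": 1, "age": 17},
]
kept = filter_rows(rows, [("target", "==", 1), ("age", ">=", 18)])
print(kept)
```

Because the rules only inspect generated rows, a filter of this kind works with any generative model, which is the sense in which the constraints are model agnostic.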

Returns

<count> synthetic samples

static hyperparameter_space(**kwargs: Any) List[synthcity.plugins.core.distribution.Distribution]

Returns the hyperparameter space for the derived plugin.

static load(buff: bytes) Any
static load_dict(representation: dict) Any
static name() str

The name of the plugin.

plot(plt: Any, X: synthcity.plugins.core.dataloader.DataLoader, count: Optional[int] = None, plots: list = ['marginal', 'associations', 'tsne'], **kwargs: Any) Any

Plot the real-synthetic distributions.

Parameters
  • plt – The plotting object the output is drawn on, e.g. matplotlib.pyplot.

  • X – DataLoader. The reference dataset.

Returns

self

classmethod sample_hyperparameters(*args: Any, **kwargs: Any) Dict[str, Any]

Sample value from the hyperparameter space for the current plugin.

classmethod sample_hyperparameters_optuna(trial: Any, *args: Any, **kwargs: Any) Dict[str, Any]
save() bytes
save_dict() dict
save_to_file(path: pathlib.Path) bytes
schema() synthcity.plugins.core.schema.Schema

The reference schema

schema_includes(other: Union[synthcity.plugins.core.dataloader.DataLoader, pandas.core.frame.DataFrame]) bool

Helper method to test if the reference schema includes a Dataset

Parameters

other – DataLoader. The dataset to test

Returns

bool, if the schema includes the dataset or not.

training_schema() synthcity.plugins.core.schema.Schema

The internal schema

static type() str

The type of the plugin.

static version() str

API version

class Teachers(n_teachers: int, samples_per_teacher: int, lamda: float = 0.001, template: str = 'xgboost')

Bases: synthcity.plugins.core.serializable.Serializable
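
As a rough illustration of how n_teachers and samples_per_teacher relate, the sketch below deals a dataset into disjoint partitions, one per teacher, as the PATE-GAN paper describes. It is an assumption that the Teachers class partitions its data this way; the helper `split_among_teachers` is hypothetical and not part of synthcity.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_among_teachers(rows: Sequence[T], n_teachers: int, seed: int = 0) -> List[List[T]]:
    """Shuffle the rows and deal them into n_teachers disjoint, equal-sized partitions."""
    if n_teachers < 1:
        raise ValueError("n_teachers must be at least 1")
    samples_per_teacher = len(rows) // n_teachers
    if samples_per_teacher == 0:
        raise ValueError("not enough rows to give every teacher a sample")

    # Shuffle indices with a seeded generator so the split is reproducible.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)

    # Leftover rows (len(rows) % n_teachers) are simply not assigned.
    return [
        [rows[i] for i in order[t * samples_per_teacher:(t + 1) * samples_per_teacher]]
        for t in range(n_teachers)
    ]


partitions = split_among_teachers(list(range(25)), n_teachers=10)
print([len(p) for p in partitions])
```

Keeping the partitions disjoint means any single training record influences at most one teacher, which is what lets the noisy vote bound that record's effect on the aggregated label.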

fit(X: numpy.ndarray, generator: Any) Any
static load(buff: bytes) Any
static load_dict(representation: dict) Any
pate_lamda(x: numpy.ndarray) Tuple[int, int, int]

Returns PATE_lambda(x).

Parameters

x – feature vector

Returns

  • n0, n1 – the number of label 0 and label 1 votes, respectively

  • out – the label after adding Laplace noise

Return type

Tuple[int, int, int]
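
The noisy aggregation that pate_lamda documents can be sketched in plain Python as follows. This is a hedged illustration, not the library's code: `noisy_vote` and `laplace` are hypothetical helpers, the teachers' predictions are passed in as a list of 0/1 labels, and how `noise_scale` maps onto the lamda parameter is an assumption left open here.

```python
import random
from typing import Sequence, Tuple


def laplace(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(
    teacher_labels: Sequence[int], noise_scale: float, rng: random.Random
) -> Tuple[int, int, int]:
    """Return (n0, n1, out): the raw vote counts and the noisy majority label."""
    n1 = sum(1 for label in teacher_labels if label == 1)
    n0 = len(teacher_labels) - n1

    # Add independent Laplace noise to each count, then take the larger one.
    noisy_n0 = n0 + laplace(noise_scale, rng)
    noisy_n1 = n1 + laplace(noise_scale, rng)
    out = int(noisy_n1 > noisy_n0)
    return n0, n1, out


rng = random.Random(0)
votes = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]  # one 0/1 prediction per teacher
print(noisy_vote(votes, noise_scale=1.0, rng=rng))
```

With a large vote margin the noise rarely flips the label; when the teachers are split nearly evenly, the output becomes close to a coin toss, which is where the privacy protection comes from.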

save() bytes
save_dict() dict
save_to_file(path: pathlib.Path) bytes
static version() str

API version

plugin

alias of synthcity.plugins.privacy.plugin_pategan.PATEGANPlugin