synthcity.plugins.core.models.tabular_encoder module

TabularEncoder module.

class BinEncoder(*args: Any, **kwargs: Any)

Bases: synthcity.plugins.core.models.tabular_encoder.TabularEncoder

Binary encoder (for SurvivalGAN).

Continuous columns are modeled with a Bayesian GMM and normalized to a scalar in [0, 1] plus a vector of cluster indicators. Discrete columns are encoded using a scikit-learn OneHotEncoder.
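
A minimal usage sketch, assuming the default constructor arguments; the DataFrame and column names below are illustrative, not part of the API:

  import pandas as pd

  from synthcity.plugins.core.models.tabular_encoder import BinEncoder

  # Illustrative data: one continuous and one discrete column.
  raw = pd.DataFrame(
      {
          "age": [23, 45, 31, 62, 50, 29],
          "sex": ["M", "F", "F", "M", "F", "M"],
      }
  )

  encoder = BinEncoder()
  encoder.fit(raw, discrete_columns=["sex"])
  encoded = encoder.transform(raw)
  print(encoded.head())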

activation_layout(discrete_activation: str, continuous_activation: str) Sequence[Tuple[str, int]]

Get the layout of the activations.

Returns a list of tuples, describing each column as either:
  • continuous, with length 1 + the number of GMM clusters, or

  • discrete, with length <N>, the length of the one-hot encoding.
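
Continuing the sketch above, an example of the call; the exact tuples returned depend on the fitted GMM components and category counts:

  # The activation names are caller-supplied strings; "tanh"/"softmax" are
  # common choices, but any labels understood downstream can be passed.
  act_layout = encoder.activation_layout(
      discrete_activation="softmax",
      continuous_activation="tanh",
  )
  print(act_layout)  # sequence of (activation, length) pairs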

cat_encoder_params: dict = {}
categorical_encoder: Union[str, type] = 'passthrough'
cont_encoder_params: dict = {'n_components': 2}
continuous_encoder: Union[str, type] = 'bayesian_gmm'
fit(raw_data: pandas.core.frame.DataFrame, discrete_columns: Optional[List] = None) Any

Fit the TabularEncoder.

This step also computes the number of columns in the encoded matrix and the span information for each feature.
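
If discrete_columns is omitted, the encoder is expected to determine the discrete columns itself; a sketch, continuing the example above:

  encoder = BinEncoder()
  encoder.fit(raw)                                            # discrete columns detected from the data
  encoder = BinEncoder().fit(raw, discrete_columns=["sex"])   # or stated explicitly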

get_column_info(name: str) synthcity.plugins.core.models.tabular_encoder.FeatureInfo
inverse_transform(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Take encoded matrix data and output raw data.

The output uses the same column types as the input to the transform function.
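
A round-trip sketch, continuing the example above; continuous values are reconstructed from their GMM representation, so they may differ slightly from the originals:

  encoded = encoder.transform(raw)
  decoded = encoder.inverse_transform(encoded)
  # The decoded frame is expected to match the original schema.
  print(list(raw.columns), "==", list(decoded.columns))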

layout() Sequence[synthcity.plugins.core.models.tabular_encoder.FeatureInfo]

Get the layout of the encoded dataset.

Returns a list of tuples, describing each column as either:
  • continuous, with length 1 + the number of GMM clusters, or

  • discrete, with length <N>, the length of the one-hot encoding.
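
For example, the per-feature layout of a fitted encoder can be inspected through the returned FeatureInfo objects (continuing the sketch above):

  for info in encoder.layout():
      print(info.name, info.feature_type, info.output_dimensions)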

n_features() int
transform(raw_data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Take raw data and output the encoded matrix data.
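
Continuing the sketch, the encoded frame is wider than the raw one, since each feature expands into its encoded block:

  encoded = encoder.transform(raw)
  print(raw.shape, "->", encoded.shape)
  print(encoded.columns.tolist())  # names of the transformed features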

class FeatureInfo(*, name: str, feature_type: str, transform: Any = None, output_dimensions: int, transformed_features: List[str], trans_feature_types: List[str])

Bases: pydantic.main.BaseModel
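
FeatureInfo instances are normally produced by a fitted encoder rather than constructed by hand; a sketch of reading one, continuing the BinEncoder example above (the column name is illustrative):

  info = encoder.get_column_info("age")
  print(info.name)                  # original column name
  print(info.feature_type)          # e.g. "continuous" or "discrete"
  print(info.output_dimensions)     # width of this feature's encoded block
  print(info.transformed_features)  # names of the encoded columns
  print(info.trans_feature_types)   # per-encoded-column feature types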

Config

alias of pydantic.config.BaseConfig

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choosing which fields to include, exclude, and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from the new model; as with values, this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model; you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

feature_type: str
classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(); other arguments are as per json.dumps().

name: str
output_dimensions: int
classmethod parse_file(path: Union[str, pathlib.Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: pydantic.parse.Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: pydantic.parse.Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
trans_feature_types: List[str]
transform: Any
transformed_features: List[str]
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class TabularEncoder(*args: Any, **kwargs: Any)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Tabular encoder.

Continuous columns are modeled with a Bayesian GMM and normalized to a scalar in [0, 1] plus a vector of cluster indicators. Discrete columns are encoded using a scikit-learn OneHotEncoder.
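
An end-to-end sketch with the default settings; the data and column names are illustrative:

  import pandas as pd

  from synthcity.plugins.core.models.tabular_encoder import TabularEncoder

  raw = pd.DataFrame(
      {
          "age": [23, 45, 31, 62, 50, 29],
          "bmi": [21.4, 30.1, 25.3, 27.8, 24.9, 23.0],
          "smoker": ["yes", "no", "no", "yes", "no", "no"],
      }
  )

  encoder = TabularEncoder()
  encoder.fit(raw, discrete_columns=["smoker"])

  encoded = encoder.transform(raw)              # wide, fully numeric representation
  decoded = encoder.inverse_transform(encoded)  # back to the original schema

  print(encoder.n_features(), encoded.shape)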

activation_layout(discrete_activation: str, continuous_activation: str) Sequence[Tuple[str, int]]

Get the layout of the activations.

Returns a list of tuples, describing each column as either:
  • continuous, with length 1 + the number of GMM clusters, or

  • discrete, with length <N>, the length of the one-hot encoding.
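
The (activation, length) pairs can be used to build the output head of a generator network; a sketch that only groups the encoded column indices per activation, continuing the example above:

  act_layout = encoder.activation_layout(
      discrete_activation="softmax",
      continuous_activation="tanh",
  )

  start = 0
  for activation, length in act_layout:
      print(f"encoded columns [{start}:{start + length}) -> {activation}")
      start += length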

cat_encoder_params: dict = {'handle_unknown': 'ignore', 'sparse_output': False}
categorical_encoder: Union[str, type] = 'onehot'
cont_encoder_params: dict = {'n_components': 10}
continuous_encoder: Union[str, type] = 'bayesian_gmm'
fit(raw_data: pandas.core.frame.DataFrame, discrete_columns: Optional[List] = None) Any

Fit the TabularEncoder.

This step also computes the number of columns in the encoded matrix and the span information for each feature.

get_column_info(name: str) synthcity.plugins.core.models.tabular_encoder.FeatureInfo
inverse_transform(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Take encoded matrix data and output raw data.

The output uses the same column types as the input to the transform function.

layout() Sequence[synthcity.plugins.core.models.tabular_encoder.FeatureInfo]

Get the layout of the encoded dataset.

Returns a list of tuples, describing each column as either:
  • continuous, with length 1 + the number of GMM clusters, or

  • discrete, with length <N>, the length of the one-hot encoding.

n_features() int
transform(raw_data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Take raw data and output the encoded matrix data.

class TimeSeriesBinEncoder(*args: Any, **kwargs: Any)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Time series Bin encoder.

Continuous columns are modeled with a Bayesian GMM and normalized to a scalar in [0, 1] plus a vector of cluster indicators. Discrete columns are encoded using a scikit-learn OneHotEncoder.
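
A usage sketch, assuming the default constructor; the static frame, temporal frames, and observation times below are illustrative:

  import pandas as pd

  from synthcity.plugins.core.models.tabular_encoder import TimeSeriesBinEncoder

  static = pd.DataFrame({"sex": ["M", "F"], "age": [60, 42]})
  temporal = [
      pd.DataFrame({"heart_rate": [80, 82, 79]}),
      pd.DataFrame({"heart_rate": [65, 70, 68]}),
  ]
  observation_times = [[0, 1, 2], [0, 1, 2]]

  enc = TimeSeriesBinEncoder()
  bins = enc.fit_transform(static, temporal, observation_times)
  print(bins.shape)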

fit(static_data: pandas.core.frame.DataFrame, temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, discrete_columns: Optional[List] = None) synthcity.plugins.core.models.tabular_encoder.TimeSeriesBinEncoder

Fit the TimeSeriesBinEncoder.

fit_transform(static: pandas.core.frame.DataFrame, temporal: List[pandas.core.frame.DataFrame], observation_times: List) pandas.core.frame.DataFrame
transform(static_data: pandas.core.frame.DataFrame, temporal_data: List[pandas.core.frame.DataFrame], observation_times: List) pandas.core.frame.DataFrame

Take raw data and output the encoded matrix data.
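
Fitting and transforming can also be separated, e.g. to reuse a fitted encoder on further sequences; a sketch continuing the example above:

  enc = TimeSeriesBinEncoder().fit(static, temporal, observation_times)
  bins = enc.transform(static, temporal, observation_times)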

class TimeSeriesTabularEncoder(*args: Any, **kwargs: Any)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

TimeSeries Tabular encoder.

Continuous columns are modeled with a Bayesian GMM and normalized to a scalar in [0, 1] plus a vector of cluster indicators. Discrete columns are encoded using a scikit-learn OneHotEncoder.
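
A usage sketch with illustrative data; the encoder returns separate encodings for the static frame, the temporal frames, and the observation times:

  import pandas as pd

  from synthcity.plugins.core.models.tabular_encoder import TimeSeriesTabularEncoder

  static = pd.DataFrame({"sex": ["M", "F"], "age": [60, 42]})
  temporal = [
      pd.DataFrame({"heart_rate": [80, 82, 79], "on_meds": [0, 0, 1]}),
      pd.DataFrame({"heart_rate": [65, 70, 68], "on_meds": [1, 1, 1]}),
  ]
  observation_times = [[0.0, 1.0, 2.0], [0.0, 0.5, 1.5]]

  enc = TimeSeriesTabularEncoder()
  static_enc, temporal_enc, times_enc = enc.fit_transform(
      static, temporal, observation_times
  )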

activation_layout(discrete_activation: str, continuous_activation: str) Tuple
activation_layout_temporal(discrete_activation: str, continuous_activation: str) Any
fit(static_data: pandas.core.frame.DataFrame, temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, discrete_columns: Optional[List] = None) synthcity.plugins.core.models.tabular_encoder.TimeSeriesTabularEncoder
fit_temporal(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, discrete_columns: Optional[List] = None) synthcity.plugins.core.models.tabular_encoder.TimeSeriesTabularEncoder
fit_transform(static_data: pandas.core.frame.DataFrame, temporal_data: List[pandas.core.frame.DataFrame], observation_times: List) Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, List]
fit_transform_temporal(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List) Tuple[pandas.core.frame.DataFrame, List]
inverse_transform(static_encoded: pandas.core.frame.DataFrame, temporal_encoded: List[pandas.core.frame.DataFrame], observation_times: List) pandas.core.frame.DataFrame
inverse_transform_observation_times(observation_times: List) pandas.core.frame.DataFrame
inverse_transform_static(static_encoded: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
inverse_transform_temporal(temporal_encoded: List[pandas.core.frame.DataFrame], observation_times: List) pandas.core.frame.DataFrame
layout() Tuple[List, List]
n_features() Tuple
transform(static_data: pandas.core.frame.DataFrame, temporal_data: List[pandas.core.frame.DataFrame], observation_times: List) Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, List]
transform_observation_times(observation_times: List) List
transform_static(static_data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
transform_temporal(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List) Tuple[pandas.core.frame.DataFrame, List]
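
Continuing the sketch above, the static and temporal parts can also be encoded independently once the encoder is fitted:

  static_only = enc.transform_static(static)
  temporal_only, times_only = enc.transform_temporal(temporal, observation_times)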