synthcity.plugins.core.dataloader module

class DataLoader(data_type: str, data: Any, static_features: List[str] = [], temporal_features: List[str] = [], sensitive_features: List[str] = [], important_features: List[str] = [], outcome_features: List[str] = [], train_size: float = 0.8, random_state: int = 0, **kwargs: Any)

Bases: object

Base class for all data loaders.

Each derived class must implement the following methods:

unpack() - a method that unpacks the columns and returns features and labels (X, y).
decorate() - a method that creates a new instance of DataLoader by decorating the input data with the same DataLoader properties (e.g. sensitive features, target column, etc.).
dataframe() - a method that returns the pandas dataframe that contains all features and samples.
numpy() - a method that returns the numpy array that contains all features and samples.
info() - a method that returns a dictionary of DataLoader information.
__len__() - a method that returns the number of samples in the DataLoader.
satisfies() - a method that tests if the current DataLoader satisfies the constraint provided.
match() - a method that returns a new DataLoader where the provided constraints are met.
from_info() - a static method that creates a DataLoader from the data and the information dictionary.
sample() - returns a new DataLoader that contains a random subset of N samples.
drop() - returns a new DataLoader with a list of columns dropped.
__getitem__() - getting features by names.
__setitem__() - setting features by names.
train() - returns a DataLoader containing the training set.
test() - returns a DataLoader containing the testing set.
fillna() - returns a DataLoader with NaN filled by the provided number(s).

If any method implementation is missing, the class constructor will fail.
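
The failure mode described above can be illustrated with Python's standard abc module. This is a minimal sketch, assuming the base class relies on the usual abstract-method machinery (as the "abstract" markers in the signatures suggest). BaseLoader, IncompleteLoader, CompleteLoader and try_build are hypothetical names used only for illustration; they are not part of synthcity.

```python
from abc import ABC, abstractmethod


class BaseLoader(ABC):
    """Hypothetical stand-in for an abstract loader (not synthcity's DataLoader)."""

    @abstractmethod
    def dataframe(self):
        """Return all samples."""

    @abstractmethod
    def __len__(self):
        """Return the number of samples."""


class IncompleteLoader(BaseLoader):
    # __len__ is deliberately missing, so this class stays abstract.
    def dataframe(self):
        return []


class CompleteLoader(BaseLoader):
    def __init__(self, rows):
        self._rows = list(rows)

    def dataframe(self):
        return self._rows

    def __len__(self):
        return len(self._rows)


def try_build(cls, *args):
    """Instantiate cls and return the instance, or the TypeError it raised."""
    try:
        return cls(*args)
    except TypeError as exc:
        return exc


failed = try_build(IncompleteLoader)           # a TypeError instance
loader = try_build(CompleteLoader, [1, 2, 3])  # a working loader
```

In this sketch the error surfaces at instantiation time rather than at the first call of the missing method, which is the behaviour the sentence above describes.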

Constructor Args:

data_type: str: The type of DataLoader. Currently supported values are “generic”, “time_series” and “survival”.
data: Any: The object that contains the data.
static_features: List[str]: List of feature names that are static features (as opposed to temporal features).
temporal_features: List[str]: List of feature names that are temporal features, i.e. observed over time.
sensitive_features: List[str]: Names of sensitive features.
important_features: List[str]: Defaults to an empty list. Only relevant for the SurvivalGAN method.
outcome_features: List[str]: The feature names that provide labels for downstream tasks.

abstract property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

abstract compression_protected_features() → list

abstract dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

abstract decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

abstract drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

abstract fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

abstract static from_info(data: pandas.core.frame.DataFrame, info: dict) → synthcity.plugins.core.dataloader.DataLoader

abstract get_fairness_column() → Union[str, Any]

hash() → str

abstract info() → dict

abstract is_tabular() → bool

abstract match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

abstract numpy() → numpy.ndarray

raw() → Any

abstract sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

abstract satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

abstract property shape: tuple

abstract test() → synthcity.plugins.core.dataloader.DataLoader

abstract train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

abstract unpack(as_numpy: bool = False, pad: bool = False) → Any

property values: numpy.ndarray

class GenericDataLoader(data: Union[pandas.core.frame.DataFrame, list, numpy.ndarray], sensitive_features: List[str] = [], important_features: List[str] = [], target_column: Optional[str] = None, fairness_column: Optional[str] = None, domain_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Data loader for generic tabular data.

Constructor Args:

data: Union[pd.DataFrame, list, np.ndarray]: The dataset: a pandas DataFrame, a list, or a NumPy array.
sensitive_features: List[str]: Names of sensitive features.
important_features: List[str]: Defaults to an empty list. Only relevant for the SurvivalGAN method.
target_column: Optional[str]: The feature name that provides labels for downstream tasks.
fairness_column: Optional[str]: Optional fairness column label, used for fairness benchmarking.
domain_column: Optional[str]: Optional domain label, used for domain adaptation algorithms.
random_state: int: Defaults to zero.

Example

>>> from sklearn.datasets import load_diabetes
>>> from synthcity.plugins.core.dataloader import GenericDataLoader
>>> X, y = load_diabetes(return_X_y=True, as_frame=True)
>>> X["target"] = y
>>> # Important note: preprocessing data with OneHotEncoder or StandardScaler is not needed or recommended.
>>> # Synthcity handles feature encoding and standardization internally.
>>> loader = GenericDataLoader(X, target_column="target", sensitive_features=["sex"])

property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

compression_protected_features() → list

dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

static from_info(data: pandas.core.frame.DataFrame, info: dict) → synthcity.plugins.core.dataloader.GenericDataLoader

get_fairness_column() → Union[str, Any]

hash() → str

info() → dict

is_tabular() → bool

match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

numpy() → numpy.ndarray

raw() → Any

sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

property shape: tuple

test() → synthcity.plugins.core.dataloader.DataLoader

train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

unpack(as_numpy: bool = False, pad: bool = False) → Any

property values: numpy.ndarray

class ImageDataLoader(data: Union[torch.utils.data.dataset.Dataset, Tuple[torch.Tensor, torch.Tensor]], height: int = 32, width: Optional[int] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Data loader for generic image data.

Constructor Args:

data: Union[torch.utils.data.Dataset, Tuple[torch.Tensor, torch.Tensor]]: The image dataset, or a tuple of (image tensor, label tensor).
random_state: int: Defaults to zero.
height: int: Height to use internally. Defaults to 32.
width: Optional[int]: Width to use internally. If None, the value of height is used.
train_size: float: Train dataset ratio. Defaults to 0.8.

Example

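>>> from torchvision import datasets  # assumed source of `datasets`
>>> from synthcity.plugins.core.dataloader import ImageDataLoader
>>>
>>> dataset = datasets.MNIST(".", download=True)
>>>
>>> loader = ImageDataLoader(
...     data=dataset,
...     train_size=0.8,
...     height=32,
...     width=32,
... )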

property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

compression_protected_features() → list

dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

static from_info(data: torch.utils.data.dataset.Dataset, info: dict) → synthcity.plugins.core.dataloader.ImageDataLoader

get_fairness_column() → None: Not implemented for ImageDataLoader

hash() → str

info() → dict

is_tabular() → bool

match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

numpy() → numpy.ndarray

raw() → Any

sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

property shape: tuple

test() → synthcity.plugins.core.dataloader.DataLoader

train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

unpack(as_numpy: bool = False, pad: bool = False) → Any

property values: numpy.ndarray

class SurvivalAnalysisDataLoader(data: pandas.core.frame.DataFrame, time_to_event_column: str, target_column: str, time_horizons: list = [], sensitive_features: List[str] = [], important_features: List[str] = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Data loader for survival analysis data.

Constructor Args:

data: pd.DataFrame: The dataset, as a pandas DataFrame.
time_to_event_column: str: Name of the time-to-event feature, specific to survival analysis.
target_column: str: The outcome column (event or censoring). It also provides the labels for downstream tasks.
sensitive_features: List[str]: Names of sensitive features.
important_features: List[str]: Defaults to an empty list. Only relevant for the SurvivalGAN method.
fairness_column: Optional[str]: Optional fairness column label, used for fairness benchmarking.
domain_column: Optional[str]: Optional domain label, used for domain adaptation algorithms.
random_state: int: Defaults to zero.
train_size: float: The ratio to use for train splits.

Example

>>> TODO
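
As a hedged illustration of the inputs described above, the sketch below builds a small survival DataFrame with pandas and runs a few sanity checks on it. The column names ("duration", "event"), the toy values, the 1 = event / 0 = censored encoding, and the positivity check on durations are all assumptions made for this example. The constructor call is left as a comment and only uses keyword arguments that appear in the signature above.

```python
import pandas as pd

# Hypothetical toy cohort; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [61, 54, 70, 48],
        "treatment": [1, 0, 1, 0],
        "duration": [12.0, 30.5, 7.2, 44.0],  # time-to-event feature
        "event": [1, 0, 1, 0],  # assumed encoding: 1 = event, 0 = censored
    }
)

time_to_event_column = "duration"
target_column = "event"

# Sanity checks chosen for this sketch before handing the frame to a loader.
assert {time_to_event_column, target_column} <= set(df.columns)
assert (df[time_to_event_column] > 0).all()
assert df[target_column].isin([0, 1]).all()

n_events = int(df[target_column].sum())

# With synthcity installed, the frame could then be wrapped like this:
# from synthcity.plugins.core.dataloader import SurvivalAnalysisDataLoader
# loader = SurvivalAnalysisDataLoader(
#     df,
#     time_to_event_column=time_to_event_column,
#     target_column=target_column,
# )
```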

property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

compression_protected_features() → list

dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

static from_info(data: pandas.core.frame.DataFrame, info: dict) → synthcity.plugins.core.dataloader.DataLoader

get_fairness_column() → Union[str, Any]

hash() → str

info() → dict

is_tabular() → bool

match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

numpy() → numpy.ndarray

raw() → Any

sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

property shape: tuple

test() → synthcity.plugins.core.dataloader.DataLoader

train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

unpack(as_numpy: bool = False, pad: bool = False) → Any

property values: numpy.ndarray

class TimeSeriesDataLoader(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame] = None, static_data: Optional[pandas.core.frame.DataFrame] = None, sensitive_features: List[str] = [], important_features: List[str] = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, seq_offset: int = 0, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Data loader for time series data.

Constructor Args:

temporal_data: List[pd.DataFrame]: The temporal data. A list of pandas DataFrames.
observation_times: List: List of arrays mapping directly to the index of each DataFrame in temporal_data.
outcome: Optional[pd.DataFrame] = None: pandas DataFrame that can be anything (e.g. labels, regression outcome).
static_data: Optional[pd.DataFrame] = None: pandas DataFrame mapping directly to the index of each DataFrame in temporal_data.
sensitive_features: List[str]: Names of sensitive features.
important_features: List[str]: Defaults to an empty list. Only relevant for the SurvivalGAN method.
fairness_column: Optional[str]: Optional fairness column label, used for fairness benchmarking.
random_state: int: Defaults to zero.

Example

>>> TODO
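
The argument layout described above can be sketched with pandas and NumPy. This is an illustration under stated assumptions, not a reference example: the feature names ("heart_rate", "spo2", "age", "sex", "label") and values are invented, and the sketch assumes one temporal DataFrame, one array of observation times, one static row and one outcome row per subject. The constructor call is left as a comment and only uses parameters from the signature above.

```python
import numpy as np
import pandas as pd

# Three hypothetical subjects with sequences of different lengths.
observation_times = [
    np.array([0.0, 1.0, 2.0]),
    np.array([0.0, 0.5]),
    np.array([0.0, 1.0, 2.0, 3.0]),
]

rng = np.random.default_rng(0)

# One DataFrame per subject, with one row per observation time.
temporal_data = [
    pd.DataFrame(
        {
            "heart_rate": rng.normal(70, 5, size=len(times)),
            "spo2": rng.normal(97, 1, size=len(times)),
        }
    )
    for times in observation_times
]

# Assumed layout: one row per subject for static features and outcomes.
static_data = pd.DataFrame({"age": [61, 54, 70], "sex": [0, 1, 1]})
outcome = pd.DataFrame({"label": [1, 0, 1]})

# Alignment checks chosen for this sketch.
n_subjects = len(temporal_data)
assert len(observation_times) == n_subjects
assert len(static_data) == n_subjects and len(outcome) == n_subjects
for frame, times in zip(temporal_data, observation_times):
    assert len(frame) == len(times)

# With synthcity installed, the pieces could then be wrapped like this:
# from synthcity.plugins.core.dataloader import TimeSeriesDataLoader
# loader = TimeSeriesDataLoader(
#     temporal_data=temporal_data,
#     observation_times=observation_times,
#     outcome=outcome,
#     static_data=static_data,
# )
```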

property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

compression_protected_features() → list

dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

static extract_masked_features(full_temporal_features: list) → tuple

fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

filter_ids(ids_list: list) → pandas.core.frame.DataFrame

static from_info(data: pandas.core.frame.DataFrame, info: dict) → synthcity.plugins.core.dataloader.DataLoader

get_fairness_column() → Union[str, Any]

hash() → str

ids() → list

info() → dict

is_tabular() → bool

static mask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = 0) → Any

match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

numpy() → numpy.ndarray

static pack_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], fill: Any = nan, seq_offset: int = 0) → pandas.core.frame.DataFrame

static pad_and_mask(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], only_features: Any = False, fill: Any = 0) → Any

static pad_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) → Any

static pad_raw_features(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) → Any

raw() → Any

property raw_columns: list

sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

static sequential_view(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], id_col: str = 'seq_id', time_id_col: str = 'seq_time_id', seq_offset: int = 0) → Tuple[pandas.core.frame.DataFrame, dict]

property shape: tuple

test() → synthcity.plugins.core.dataloader.DataLoader

train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

static unique_temporal_features(temporal_data: List[pandas.core.frame.DataFrame]) → List

static unmask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = nan) → Any

unpack(as_numpy: bool = False, pad: bool = False) → Any

unpack_and_decorate(data: pandas.core.frame.DataFrame) → synthcity.plugins.core.dataloader.DataLoader

static unpack_raw_data(data: pandas.core.frame.DataFrame, info: dict) → Tuple[Optional[pandas.core.frame.DataFrame], List[pandas.core.frame.DataFrame], List, Optional[pandas.core.frame.DataFrame]]

property values: numpy.ndarray

class TimeSeriesSurvivalDataLoader(temporal_data: List[pandas.core.frame.DataFrame], observation_times: Union[List, numpy.ndarray, pandas.core.series.Series], T: Union[pandas.core.series.Series, numpy.ndarray], E: Union[pandas.core.series.Series, numpy.ndarray], static_data: Optional[pandas.core.frame.DataFrame] = None, sensitive_features: List[str] = [], important_features: List[str] = [], time_horizons: list = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, seq_offset: int = 0, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.TimeSeriesDataLoader

Data loader for time series survival data.

Constructor Args:

temporal_data: List[pd.DataFrame]: The temporal data. A list of pandas DataFrames.
observation_times: Union[List, np.ndarray, pd.Series]: List of arrays mapping directly to the index of each DataFrame in temporal_data.
T: Union[pd.Series, np.ndarray]: Time-to-event data.
E: Union[pd.Series, np.ndarray]: Censoring/event indicator data.
static_data: Optional[pd.DataFrame] = None: pandas DataFrame of static features for each subject.
sensitive_features: List[str]: Names of sensitive features.
important_features: List[str]: Defaults to an empty list. Only relevant for the SurvivalGAN method.
fairness_column: Optional[str]: Optional fairness column label, used for fairness benchmarking.
random_state: int: Defaults to zero.

Example

>>> TODO
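
A short sketch of how T and E might sit alongside the temporal inputs is given here. It assumes one time-to-event value and one event flag per subject, with E == 1 meaning the event was observed and E == 0 meaning the subject was censored. That encoding, the "marker" feature and all values are assumptions made for illustration. The constructor call is left as a comment and only uses parameters from the signature above.

```python
import numpy as np
import pandas as pd

# Two hypothetical subjects; names and values are illustrative only.
observation_times = [np.array([0.0, 1.0]), np.array([0.0, 1.0, 2.0])]
temporal_data = [
    pd.DataFrame({"marker": [0.3, 0.4]}),
    pd.DataFrame({"marker": [0.9, 0.8, 0.7]}),
]

# One time-to-event value and one event flag per subject.
# Assumed encoding: E == 1 is an observed event, E == 0 is censored.
T = pd.Series([5.0, 2.5], name="time_to_event")
E = pd.Series([0, 1], name="event")

# Alignment checks chosen for this sketch.
assert len(T) == len(E) == len(temporal_data) == len(observation_times)
assert E.isin([0, 1]).all()

n_events = int(E.sum())
n_censored = int((E == 0).sum())

# With synthcity installed, the pieces could then be wrapped like this:
# from synthcity.plugins.core.dataloader import TimeSeriesSurvivalDataLoader
# loader = TimeSeriesSurvivalDataLoader(
#     temporal_data=temporal_data,
#     observation_times=observation_times,
#     T=T,
#     E=E,
# )
```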

property columns: list

compress() → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

compression_protected_features() → list

dataframe() → pandas.core.frame.DataFrame

decode(encoders: Dict[str, Any]) → synthcity.plugins.core.dataloader.DataLoader

decompress(context: Dict) → synthcity.plugins.core.dataloader.DataLoader

decorate(data: Any) → synthcity.plugins.core.dataloader.DataLoader

domain() → Optional[str]

drop(columns: list = []) → synthcity.plugins.core.dataloader.DataLoader

encode(encoders: Optional[Dict[str, Any]] = None) → Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]

static extract_masked_features(full_temporal_features: list) → tuple

fillna(value: Any) → synthcity.plugins.core.dataloader.DataLoader

filter_ids(ids_list: list) → pandas.core.frame.DataFrame

static from_info(data: pandas.core.frame.DataFrame, info: dict) → synthcity.plugins.core.dataloader.DataLoader

get_fairness_column() → Union[str, Any]

hash() → str

ids() → list

info() → dict

is_tabular() → bool

static mask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = 0) → Any

match(constraints: synthcity.plugins.core.constraints.Constraints) → synthcity.plugins.core.dataloader.DataLoader

numpy() → numpy.ndarray

static pack_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], fill: Any = nan, seq_offset: int = 0) → pandas.core.frame.DataFrame

static pad_and_mask(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], only_features: Any = False, fill: Any = 0) → Any

static pad_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) → Any

static pad_raw_features(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) → Any

raw() → Any

property raw_columns: list

sample(count: int, random_state: int = 0) → synthcity.plugins.core.dataloader.DataLoader

satisfies(constraints: synthcity.plugins.core.constraints.Constraints) → bool

static sequential_view(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], id_col: str = 'seq_id', time_id_col: str = 'seq_time_id', seq_offset: int = 0) → Tuple[pandas.core.frame.DataFrame, dict]

property shape: tuple

test() → synthcity.plugins.core.dataloader.DataLoader

train() → synthcity.plugins.core.dataloader.DataLoader

type() → str

static unique_temporal_features(temporal_data: List[pandas.core.frame.DataFrame]) → List

static unmask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = nan) → Any

unpack(as_numpy: bool = False, pad: bool = False) → Any

unpack_and_decorate(data: pandas.core.frame.DataFrame) → synthcity.plugins.core.dataloader.DataLoader

static unpack_raw_data(data: pandas.core.frame.DataFrame, info: dict) → Tuple[Optional[pandas.core.frame.DataFrame], List[pandas.core.frame.DataFrame], List, Optional[pandas.core.frame.DataFrame]]

property values: numpy.ndarray

create_from_info(data: Union[pandas.core.frame.DataFrame, torch.utils.data.dataset.Dataset], info: dict) → synthcity.plugins.core.dataloader.DataLoader: Helper for creating a DataLoader from existing information.