synthcity.plugins.core.dataloader module

class DataLoader(data_type: str, data: Any, static_features: List[str] = [], temporal_features: List[str] = [], sensitive_features: List[str] = [], important_features: List[str] = [], outcome_features: List[str] = [], train_size: float = 0.8, random_state: int = 0, **kwargs: Any)

Bases: object

Inheritance diagram of synthcity.plugins.core.dataloader.DataLoader

Base class for all data loaders.

Each derived class must implement the following methods:

unpack() - a method that unpacks the columns and returns features and labels (X, y).
decorate() - a method that creates a new instance of DataLoader by decorating the input data with the same DataLoader properties (e.g. sensitive features, target column, etc.)
dataframe() - a method that returns the pandas dataframe that contains all features and samples
numpy() - a method that returns the numpy array that contains all features and samples
info() - a method that returns a dictionary of DataLoader information
__len__() - a method that returns the number of samples in the DataLoader
satisfies() - a method that tests if the current DataLoader satisfies the constraint provided
match() - a method that returns a new DataLoader where the provided constraints are met
from_info() - a static method that creates a DataLoader from the data and the information dictionary
sample() - returns a new DataLoader that contains a random subset of N samples
drop() - returns a new DataLoader with a list of columns dropped
__getitem__() - getting features by names
__setitem__() - setting features by names
train() - returns a DataLoader containing the training set
test() - returns a DataLoader containing the testing set
fillna() - returns a DataLoader with NaN filled by the provided number(s)

If any method implementation is missing, the class constructor will fail.
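
The constructor failure described above matches the behaviour of Python's standard abc machinery. The following is a minimal sketch, assuming the base class is built on abc.ABC (the "abstract" markers in the listing suggest this, but it is not confirmed here). MiniLoader, IncompleteLoader, CompleteLoader and try_build are hypothetical stand-ins, not synthcity names.

```python
from abc import ABC, abstractmethod


class MiniLoader(ABC):
    """Hypothetical stand-in for an abstract loader base class."""

    @abstractmethod
    def unpack(self):
        ...

    @abstractmethod
    def __len__(self):
        ...


class IncompleteLoader(MiniLoader):
    # Implements only one of the two abstract methods.
    def unpack(self):
        return [], []


class CompleteLoader(MiniLoader):
    def __init__(self, rows):
        self.rows = rows

    def unpack(self):
        # Split each row into features (all but last) and label (last).
        return [r[:-1] for r in self.rows], [r[-1] for r in self.rows]

    def __len__(self):
        return len(self.rows)


def try_build(cls, *args):
    """Return an instance, or the TypeError raised by the abstract check."""
    try:
        return cls(*args)
    except TypeError as exc:
        return exc


failed = try_build(IncompleteLoader)
loader = try_build(CompleteLoader, [(1.0, 2.0, 0), (3.0, 4.0, 1)])
```

In this sketch the incomplete subclass cannot be instantiated at all, so the error surfaces at construction time rather than on the first call to the missing method.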

Constructor Args:
data_type: str

The type of DataLoader, currently supports “generic”, “time_series” and “survival”.

data: Any

The object that contains the data

static_features: List[str]

List of feature names that are static features (as opposed to temporal features).

temporal_features: List[str]

List of feature names that are temporal features, i.e. observed over time.

sensitive_features: List[str]

Names of sensitive features.

important_features: List[str]

Defaults to an empty list. Only relevant for the SurvivalGAN method.

outcome_features: List[str]

Names of the features that provide labels for downstream tasks.

abstract property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
abstract compression_protected_features() list
abstract dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
abstract decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
abstract drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
abstract fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
abstract static from_info(data: pandas.core.frame.DataFrame, info: dict) synthcity.plugins.core.dataloader.DataLoader
abstract get_fairness_column() Union[str, Any]
hash() str
abstract info() dict
abstract is_tabular() bool
abstract match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
abstract numpy() numpy.ndarray
raw() Any
abstract sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
abstract satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
abstract property shape: tuple
abstract test() synthcity.plugins.core.dataloader.DataLoader
abstract train() synthcity.plugins.core.dataloader.DataLoader
type() str
abstract unpack(as_numpy: bool = False, pad: bool = False) Any
property values: numpy.ndarray
class GenericDataLoader(data: Union[pandas.core.frame.DataFrame, list, numpy.ndarray], sensitive_features: List[str] = [], important_features: List[str] = [], target_column: Optional[str] = None, fairness_column: Optional[str] = None, domain_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Inheritance diagram of synthcity.plugins.core.dataloader.GenericDataLoader

Data loader for generic tabular data.

Constructor Args:
data: Union[pd.DataFrame, list, np.ndarray]

The dataset. A pandas DataFrame, a list, or a NumPy array.

sensitive_features: List[str]

Names of sensitive features.

important_features: List[str]

Defaults to an empty list. Only relevant for the SurvivalGAN method.

target_column: Optional[str]

The feature name that provides labels for downstream tasks.

fairness_column: Optional[str]

Optional fairness column label, used for fairness benchmarking.

domain_column: Optional[str]

Optional domain label, used for domain adaptation algorithms.

random_state: int

Defaults to zero.

Example

>>> from sklearn.datasets import load_diabetes
>>> from synthcity.plugins.core.dataloader import GenericDataLoader
>>> X, y = load_diabetes(return_X_y=True, as_frame=True)
>>> X["target"] = y
>>> # Important note: preprocessing data with OneHotEncoder or StandardScaler is not needed or recommended.
>>> # Synthcity handles feature encoding and standardization internally.
>>> loader = GenericDataLoader(X, target_column="target", sensitive_features=["sex"])
property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
compression_protected_features() list
dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
static from_info(data: pandas.core.frame.DataFrame, info: dict) synthcity.plugins.core.dataloader.GenericDataLoader
get_fairness_column() Union[str, Any]
hash() str
info() dict
is_tabular() bool
match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
numpy() numpy.ndarray
raw() Any
sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
property shape: tuple
test() synthcity.plugins.core.dataloader.DataLoader
train() synthcity.plugins.core.dataloader.DataLoader
type() str
unpack(as_numpy: bool = False, pad: bool = False) Any
property values: numpy.ndarray
class ImageDataLoader(data: Union[torch.utils.data.dataset.Dataset, Tuple[torch.Tensor, torch.Tensor]], height: int = 32, width: Optional[int] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Inheritance diagram of synthcity.plugins.core.dataloader.ImageDataLoader

Data loader for generic image data.

Constructor Args:
data: Union[torch.utils.data.Dataset, Tuple[torch.Tensor, torch.Tensor]]

The image dataset, or a tuple of (image tensor, label tensor).

random_state: int

Defaults to zero.

height: int. Default = 32

Height to use internally

width: Optional[int]

Optional width to use internally. If None, the value of height is used.

train_size: float = 0.8

Train dataset ratio.
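
To make the roles of train_size and random_state concrete, here is a minimal sketch of a reproducible ratio-based split. It is an illustration only: split_indices is a hypothetical helper, and the shuffling strategy shown is an assumption, not the splitting logic synthcity actually uses.

```python
import random


def split_indices(n_samples, train_size=0.8, random_state=0):
    """Shuffle sample indices reproducibly and cut them at the train ratio."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_state).shuffle(indices)
    cut = int(n_samples * train_size)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(10, train_size=0.8, random_state=0)
```

With the same random_state the two index lists are identical across calls, which is the property that makes the train() and test() subsets of a loader repeatable.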

Example

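>>> from torchvision import datasets
>>> from synthcity.plugins.core.dataloader import ImageDataLoader
>>> dataset = datasets.MNIST(".", download=True)
>>>
>>> loader = ImageDataLoader(
...     data=dataset,
...     train_size=0.8,
...     height=32,
...     width=32,
... )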
property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
compression_protected_features() list
dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
static from_info(data: torch.utils.data.dataset.Dataset, info: dict) synthcity.plugins.core.dataloader.ImageDataLoader
get_fairness_column() None

Not implemented for ImageDataLoader

hash() str
info() dict
is_tabular() bool
match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
numpy() numpy.ndarray
raw() Any
sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
property shape: tuple
test() synthcity.plugins.core.dataloader.DataLoader
train() synthcity.plugins.core.dataloader.DataLoader
type() str
unpack(as_numpy: bool = False, pad: bool = False) Any
property values: numpy.ndarray
class SurvivalAnalysisDataLoader(data: pandas.core.frame.DataFrame, time_to_event_column: str, target_column: str, time_horizons: list = [], sensitive_features: List[str] = [], important_features: List[str] = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Inheritance diagram of synthcity.plugins.core.dataloader.SurvivalAnalysisDataLoader

Data Loader for Survival Analysis Data

Constructor Args:
data: pd.DataFrame

The dataset, as a pandas DataFrame.

time_to_event_column: str

Survival Analysis specific time-to-event feature

target_column: str

The outcome: event or censoring. This column provides the labels for downstream tasks.

sensitive_features: List[str]

Names of sensitive features.

important_features: List[str]

Defaults to an empty list. Only relevant for the SurvivalGAN method.

fairness_column: Optional[str]

Optional fairness column label, used for fairness benchmarking.

domain_column: Optional[str]

Optional domain label, used for domain adaptation algorithms.

random_state: int

Defaults to zero.

train_size: float

The ratio to use for train splits.

Example

>>> TODO
property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
compression_protected_features() list
dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
static from_info(data: pandas.core.frame.DataFrame, info: dict) synthcity.plugins.core.dataloader.DataLoader
get_fairness_column() Union[str, Any]
hash() str
info() dict
is_tabular() bool
match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
numpy() numpy.ndarray
raw() Any
sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
property shape: tuple
test() synthcity.plugins.core.dataloader.DataLoader
train() synthcity.plugins.core.dataloader.DataLoader
type() str
unpack(as_numpy: bool = False, pad: bool = False) Any
property values: numpy.ndarray
class TimeSeriesDataLoader(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame] = None, static_data: Optional[pandas.core.frame.DataFrame] = None, sensitive_features: List[str] = [], important_features: List[str] = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, seq_offset: int = 0, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.DataLoader

Inheritance diagram of synthcity.plugins.core.dataloader.TimeSeriesDataLoader

Data Loader for Time Series Data

Constructor Args:
temporal_data: List[pd.DataFrame]

The temporal data. A list of pandas DataFrames

observation_times: List

List of arrays mapping directly to index of each dataframe in temporal_data

outcome: Optional[pd.DataFrame] = None

pandas DataFrame that can hold any outcome (e.g. labels, regression outcome)

static_data: Optional[pd.DataFrame] = None

pandas DataFrame mapping directly to index of each dataframe in temporal_data

sensitive_features: List[str]

Names of sensitive features.

important_features: List[str]

Defaults to an empty list. Only relevant for the SurvivalGAN method.

fairness_column: Optional[str]

Optional fairness column label, used for fairness benchmarking.

random_state: int

Defaults to zero.
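
The relationship between temporal_data and observation_times can be sketched without any dependencies. The snippet below reads "mapping directly to index of each dataframe" as one time stamp per row of each subject's table, which is an interpretation rather than a documented guarantee. Plain lists of dicts stand in for the pandas DataFrames, and check_alignment is a hypothetical helper, not part of synthcity.

```python
def check_alignment(temporal_data, observation_times):
    """Verify there is one time stamp per row for every subject."""
    if len(temporal_data) != len(observation_times):
        raise ValueError("need one list of observation times per subject")
    for i, (rows, times) in enumerate(zip(temporal_data, observation_times)):
        if len(rows) != len(times):
            raise ValueError(
                f"subject {i}: {len(rows)} rows but {len(times)} time stamps"
            )
    # Sequence length per subject; subjects may differ in length.
    return [len(rows) for rows in temporal_data]


# Two subjects with different sequence lengths; each dict is one observation.
temporal_data = [
    [{"hr": 72, "bp": 118}, {"hr": 75, "bp": 121}, {"hr": 71, "bp": 117}],
    [{"hr": 88, "bp": 130}, {"hr": 85, "bp": 127}],
]
observation_times = [[0.0, 1.5, 3.0], [0.0, 2.0]]

seq_lengths = check_alignment(temporal_data, observation_times)
```

Under this reading, static_data and outcome would each contribute one row per subject, in the same subject order as temporal_data.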

Example

>>> TODO
property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
compression_protected_features() list
dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
static extract_masked_features(full_temporal_features: list) tuple
fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
filter_ids(ids_list: list) pandas.core.frame.DataFrame
static from_info(data: pandas.core.frame.DataFrame, info: dict) synthcity.plugins.core.dataloader.DataLoader
get_fairness_column() Union[str, Any]
hash() str
ids() list
info() dict
is_tabular() bool
static mask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = 0) Any
match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
numpy() numpy.ndarray
static pack_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], fill: Any = nan, seq_offset: int = 0) pandas.core.frame.DataFrame
static pad_and_mask(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], only_features: Any = False, fill: Any = 0) Any
static pad_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) Any
static pad_raw_features(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) Any
raw() Any
property raw_columns: list
sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
static sequential_view(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], id_col: str = 'seq_id', time_id_col: str = 'seq_time_id', seq_offset: int = 0) Tuple[pandas.core.frame.DataFrame, dict]
property shape: tuple
test() synthcity.plugins.core.dataloader.DataLoader
train() synthcity.plugins.core.dataloader.DataLoader
type() str
static unique_temporal_features(temporal_data: List[pandas.core.frame.DataFrame]) List
static unmask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = nan) Any
unpack(as_numpy: bool = False, pad: bool = False) Any
unpack_and_decorate(data: pandas.core.frame.DataFrame) synthcity.plugins.core.dataloader.DataLoader
static unpack_raw_data(data: pandas.core.frame.DataFrame, info: dict) Tuple[Optional[pandas.core.frame.DataFrame], List[pandas.core.frame.DataFrame], List, Optional[pandas.core.frame.DataFrame]]
property values: numpy.ndarray
class TimeSeriesSurvivalDataLoader(temporal_data: List[pandas.core.frame.DataFrame], observation_times: Union[List, numpy.ndarray, pandas.core.series.Series], T: Union[pandas.core.series.Series, numpy.ndarray], E: Union[pandas.core.series.Series, numpy.ndarray], static_data: Optional[pandas.core.frame.DataFrame] = None, sensitive_features: List[str] = [], important_features: List[str] = [], time_horizons: list = [], fairness_column: Optional[str] = None, random_state: int = 0, train_size: float = 0.8, seq_offset: int = 0, **kwargs: Any)

Bases: synthcity.plugins.core.dataloader.TimeSeriesDataLoader

Inheritance diagram of synthcity.plugins.core.dataloader.TimeSeriesSurvivalDataLoader

Data loader for Time series survival data

Constructor Args:
temporal_data: List[pd.DataFrame]

The temporal data. A list of pandas DataFrames.

observation_times: Union[List, np.ndarray, pd.Series]

List of arrays mapping directly to index of each dataframe in temporal_data

T: Union[pd.Series, np.ndarray]

Time-to-event data.

E: Union[pd.Series, np.ndarray]

Censoring/event indicator data.

static_data: Optional[pd.DataFrame] = None

pandas DataFrame of static features for each subject

sensitive_features: List[str]

Names of sensitive features.

important_features: List[str]

Defaults to an empty list. Only relevant for the SurvivalGAN method.

fairness_column: Optional[str]

Optional fairness column label, used for fairness benchmarking.

random_state: int

Defaults to zero.
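
The T and E arguments are easier to read with a small worked example. The sketch below assumes the conventional survival-analysis encoding, where E = 1 means the event was observed at time T and E = 0 means follow-up was censored at time T; the encoding synthcity expects should be checked against its own documentation. events_by_horizon is a hypothetical helper, and its link to the time_horizons parameter is an assumption.

```python
def events_by_horizon(T, E, horizon):
    """Classify each subject at a given time horizon.

    "event":    the event was observed at or before the horizon
    "censored": follow-up ended at or before the horizon without an event
    "at_risk":  the subject was still being followed at the horizon
    """
    status = []
    for t, e in zip(T, E):
        if t <= horizon:
            status.append("event" if e == 1 else "censored")
        else:
            status.append("at_risk")
    return status


T = [5.0, 12.0, 7.5, 20.0]  # time of event or of censoring, per subject
E = [1, 0, 0, 1]            # 1 = event observed, 0 = censored (assumed encoding)

status = events_by_horizon(T, E, horizon=10.0)
```

Note that the fourth subject does experience the event, but only after the horizon, so at time 10.0 that subject still counts as at risk.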

Example

>>> TODO
property columns: list
compress() Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
compression_protected_features() list
dataframe() pandas.core.frame.DataFrame
decode(encoders: Dict[str, Any]) synthcity.plugins.core.dataloader.DataLoader
decompress(context: Dict) synthcity.plugins.core.dataloader.DataLoader
decorate(data: Any) synthcity.plugins.core.dataloader.DataLoader
domain() Optional[str]
drop(columns: list = []) synthcity.plugins.core.dataloader.DataLoader
encode(encoders: Optional[Dict[str, Any]] = None) Tuple[synthcity.plugins.core.dataloader.DataLoader, Dict]
static extract_masked_features(full_temporal_features: list) tuple
fillna(value: Any) synthcity.plugins.core.dataloader.DataLoader
filter_ids(ids_list: list) pandas.core.frame.DataFrame
static from_info(data: pandas.core.frame.DataFrame, info: dict) synthcity.plugins.core.dataloader.DataLoader
get_fairness_column() Union[str, Any]
hash() str
ids() list
info() dict
is_tabular() bool
static mask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = 0) Any
match(constraints: synthcity.plugins.core.constraints.Constraints) synthcity.plugins.core.dataloader.DataLoader
numpy() numpy.ndarray
static pack_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], fill: Any = nan, seq_offset: int = 0) pandas.core.frame.DataFrame
static pad_and_mask(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], only_features: Any = False, fill: Any = 0) Any
static pad_raw_data(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) Any
static pad_raw_features(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame]) Any
raw() Any
property raw_columns: list
sample(count: int, random_state: int = 0) synthcity.plugins.core.dataloader.DataLoader
satisfies(constraints: synthcity.plugins.core.constraints.Constraints) bool
static sequential_view(static_data: Optional[pandas.core.frame.DataFrame], temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, outcome: Optional[pandas.core.frame.DataFrame], id_col: str = 'seq_id', time_id_col: str = 'seq_time_id', seq_offset: int = 0) Tuple[pandas.core.frame.DataFrame, dict]
property shape: tuple
test() synthcity.plugins.core.dataloader.DataLoader
train() synthcity.plugins.core.dataloader.DataLoader
type() str
static unique_temporal_features(temporal_data: List[pandas.core.frame.DataFrame]) List
static unmask_temporal_data(temporal_data: List[pandas.core.frame.DataFrame], observation_times: List, fill: Any = nan) Any
unpack(as_numpy: bool = False, pad: bool = False) Any
unpack_and_decorate(data: pandas.core.frame.DataFrame) synthcity.plugins.core.dataloader.DataLoader
static unpack_raw_data(data: pandas.core.frame.DataFrame, info: dict) Tuple[Optional[pandas.core.frame.DataFrame], List[pandas.core.frame.DataFrame], List, Optional[pandas.core.frame.DataFrame]]
property values: numpy.ndarray
create_from_info(data: Union[pandas.core.frame.DataFrame, torch.utils.data.dataset.Dataset], info: dict) synthcity.plugins.core.dataloader.DataLoader

Helper for creating a DataLoader from existing information.
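
One plausible way such a helper could work is to dispatch on a type tag stored in the info dictionary and hand the data to the matching from_info() factory. The sketch below illustrates that pattern only. TinyLoader, FACTORIES, create_from_info_sketch and the "data_type" and "target_column" keys are all hypothetical, and nothing here is taken from the synthcity implementation.

```python
class TinyLoader:
    """Hypothetical loader holding rows plus one piece of metadata."""

    def __init__(self, rows, target_column=None):
        self.rows = rows
        self.target_column = target_column

    def info(self):
        # Metadata only; the rows themselves travel separately.
        return {"data_type": "generic", "target_column": self.target_column}

    @staticmethod
    def from_info(rows, info):
        return TinyLoader(rows, target_column=info["target_column"])


# Hypothetical registry: one from_info factory per data_type string.
FACTORIES = {"generic": TinyLoader.from_info}


def create_from_info_sketch(rows, info):
    """Pick the factory named by info["data_type"] and rebuild a loader."""
    data_type = info.get("data_type")
    if data_type not in FACTORIES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return FACTORIES[data_type](rows, info)


original = TinyLoader([{"age": 41, "label": 1}], target_column="label")
rebuilt = create_from_info_sketch(original.rows, original.info())
```

The round trip shows the intent of the info()/from_info() pair: the data and its metadata can be stored or transformed separately and then recombined into an equivalent loader.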