Synthetic data generation for tabular data
OTHER License
FixedIncrements Fails with New Numerical Data Types - Issue #2157 by @R-Palazzo
Published by amontanez24 about 2 months ago
This release enables the HMASynthesizer and other utility functions to work with null foreign key values! It also adds an anonymization method to the metadata classes. Additionally, it patches a bug so that SDV works with more Pandas data types.
PARSynthesizer I cannot pass in datetime context (InvalidDataError during fitting) - Issue #1485 by @lajohn4747
learn_rounding_digits from RDT - Issue #2164 by @R-Palazzo
is_faker_function to speed up the unit tests - Issue #2163 by @R-Palazzo
Published by amontanez24 3 months ago
This release adds a new utils function called get_random_sequence_subset that allows users to get a subset of sequential data.
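The idea of taking a subset of sequential data can be sketched in plain Python. This is only an illustration under stated assumptions, not the SDV implementation: the helper name random_sequence_subset, its parameters, and the patient_id column are hypothetical, and rows are modeled as a list of dicts rather than a DataFrame.

```python
import random
from collections import defaultdict


def random_sequence_subset(rows, sequence_key, num_sequences,
                           max_sequence_length=None, seed=None):
    """Keep `num_sequences` randomly chosen sequences, optionally truncated.

    Rows that share the same `sequence_key` value form one sequence.
    The original row order inside each sequence is preserved.
    (Hypothetical helper, not part of SDV.)
    """
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[sequence_key]].append(row)

    rng = random.Random(seed)
    chosen = rng.sample(sorted(sequences), min(num_sequences, len(sequences)))

    subset = []
    for key in chosen:
        sequence = sequences[key]
        if max_sequence_length is not None:
            # Keep only the first rows of a long sequence.
            sequence = sequence[:max_sequence_length]
        subset.extend(sequence)
    return subset


# Four sequences of five steps each.
rows = [
    {'patient_id': pid, 'step': step}
    for pid in ('A', 'B', 'C', 'D')
    for step in range(5)
]
subset = random_sequence_subset(
    rows, 'patient_id', num_sequences=2, max_sequence_length=3, seed=0
)
```

Whole sequences are selected rather than individual rows, so the ordering within each kept sequence stays intact.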
get_random_sequence_subset - Issue #2085 by @amontanez24
FixedCombinations constraint on a child table with multiple parents in HMASynthesizer - Issue #2087 by @pvk-developer
fit if sequence_index is numerical sdtype - Issue #2079 by @lajohn4747
file_name parameter to filepath parameter in ExcelHandler - Issue #2065 by @lajohn4747
scale parameter doesn't work for small values - Issue #2045 by @lajohn4747
Published by amontanez24 4 months ago
This release provides a number of new features. A big one is that it adds the ability to fit the HMASynthesizer on disconnected schemas! It also enables the PARSynthesizer to work with constraints in certain conditions. More specifically, the PARSynthesizer can now handle constraints as long as the columns involved in the constraints are either exclusively all context columns or exclusively all non-context columns.
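The condition on constraint columns can be expressed as a small set check. The sketch below is illustrative only: constraint_is_supported and the column names are hypothetical and do not come from the SDV codebase.

```python
def constraint_is_supported(constraint_columns, context_columns):
    """Return True if the columns are all context or all non-context columns.

    (Hypothetical helper that mirrors the rule described above.)
    """
    constraint_columns = set(constraint_columns)
    context_columns = set(context_columns)
    all_context = constraint_columns <= context_columns
    all_non_context = constraint_columns.isdisjoint(context_columns)
    return all_context or all_non_context


context = ['age', 'country']

# Every column is a context column: supported.
supported_context = constraint_is_supported(['age', 'country'], context)

# No column is a context column: supported.
supported_non_context = constraint_is_supported(
    ['heart_rate', 'blood_pressure'], context
)

# A mix of context and non-context columns: not supported.
mixed = constraint_is_supported(['age', 'heart_rate'], context)
```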
Additionally, a verbose parameter was added to the TVAESynthesizer to get a more detailed progress bar. Also, a bug was fixed by renaming the file_path parameter of the ExcelHandler.read() method to filepath, as specified in the official SDV docs.
sequence_length is higher than real data - Issue #2031 by @lajohn4747
DataProcessor never gets assigned a table_name. - Issue #1964 by @fealho
file_path to filepath parameter in ExcelHandler - Issue #2055 by @amontanez24
sample - Issue #2042 by @lajohn4747
TVAESynthesizer - Issue #1990 by @fealho
auto_assign_transformers - Issue #1509 by @lajohn4747
Published by amontanez24 5 months ago
This release fixes the ModuleNotFoundError that was causing the 1.13.0 release to fail.
Published by amontanez24 5 months ago
This release adds a utility function called get_random_subset that helps users get a subset of their multi-table data so that modeling can be done more quickly. Given a dictionary of table names mapped to DataFrames, metadata, a main table and a desired number of rows to use for the main table, it will subsample the data in a way that maintains referential integrity.
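To make "maintains referential integrity" concrete, here is a minimal sketch for one parent table and one child table. It is not the SDV algorithm: subsample_with_integrity, its parameters, and the users/sessions tables are hypothetical, and tables are plain lists of dicts.

```python
import random


def subsample_with_integrity(parent_rows, child_rows, primary_key,
                             foreign_key, num_rows, seed=None):
    """Sample `num_rows` parent rows and keep only the children that reference them.

    (Hypothetical helper; a real multi-table schema needs to repeat this
    step along every relationship.)
    """
    rng = random.Random(seed)
    sampled_parents = rng.sample(parent_rows, min(num_rows, len(parent_rows)))
    kept_keys = {row[primary_key] for row in sampled_parents}
    sampled_children = [
        row for row in child_rows if row[foreign_key] in kept_keys
    ]
    return sampled_parents, sampled_children


# 10 users, each with 4 sessions.
users = [{'user_id': i} for i in range(10)]
sessions = [{'session_id': i, 'user_id': i % 10} for i in range(40)]

sub_users, sub_sessions = subsample_with_integrity(
    users, sessions, 'user_id', 'user_id', num_rows=3, seed=1
)
```

Every remaining session still points at a user that exists in the subsample, which is the property the release note describes.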
This release also adds two new local file handlers: the CSVHandler and the ExcelHandler. This enables users to easily load from and save synthetic data to these file types. These handlers return data and metadata in the multi-table format, so we also added the function get_table_metadata to get a SingleTableMetadata object from a MultiTableMetadata object.
Finally, this release fixes some bugs that prevented synthesizers from working with data that had numerical column names.
get_random_subset poc utility function - Issue #1877 by @R-Palazzo
drop_unknown_references from poc to be directly under utils - Issue #1947 by @R-Palazzo
AttributeError: 'int' object has no attribute 'lower') - Issue #1933 by @lajohn4747
TypeError: unsupported operand) - Issue #1935 by @lajohn4747
FutureWarning related to 'enforce_uniqueness' parameter - Issue #1995 by @pvk-developer
Published by amontanez24 6 months ago
This release makes a number of changes to how id columns are generated. By default, id columns with a regex will now have their values scrambled in the output. Id columns without a regex that are numeric will be created randomly. If they're not numeric, they will have a random suffix.
Additionally, improvements were made to the visibility of the get_loss_values_plot.
Published by amontanez24 6 months ago
This release adds support for Python 3.12! It also adds a number of feature improvements. It adds a simplify_schema utility function to the sdv.utils.poc module which simplifies multi-table schemas so they can be run using HMASynthesizer. Multi-table data dictionaries can now be saved directly to CSVs using the sdv.datasets.local.save_csvs utility function. Additionally, generator-discriminator loss values can now be plotted directly from CTGAN using the get_loss_values_plot method. This release also adds error messages when trying to load an SDV synthesizer on an older version of the SDV, or when trying to re-fit a synthesizer from an older version of the SDV.
This release also fixes a number of bugs. Metadata auto-detection now validates that all primary keys are unique, and the metadata correctly validates sdtypes in a column relationship. Bugs in the HMASynthesizer that would cause the diagnostic score to not be equal to 1.0 for cardinality and data validity were fixed. Finally, errors in constraints now correctly raise a ConstraintsNotMetError instead of an InvalidData error.
SingleTablePreset (including FastML Preset) - Issue #1855 by @lajohn4747
sequence_key when using PARSynthesizer - Issue #1883 by @frances-h
'truncnorm' distribution - Issue #1831 by @frances-h
IDGenerator for Primary Key columns - Issue #1862 by @lajohn4747
Published by frances-h 7 months ago
This release adds the poc utility submodule to help users more easily create a proof-of-concept with multi-table datasets. The poc submodule includes the drop_unknown_references utility function to automatically drop unknown references in a multi-table dataset. Additionally, multiple columns in the metadata can now be updated at once using the update_columns and update_columns_metadata methods. The SDV now also warns users when a synthesizer is loaded that was fitted on a different version of the SDV.
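What "dropping unknown references" means can be shown with a short plain-Python sketch. It is an illustration only and not the SDV function: drop_unknown_refs and the hotels/guests tables are hypothetical, and in this sketch a missing foreign key (None) is treated as unknown and dropped as well.

```python
def drop_unknown_refs(parent_rows, child_rows, primary_key, foreign_key):
    """Keep only child rows whose foreign key matches an existing primary key.

    (Hypothetical helper. Rows with a missing foreign key are also dropped,
    which is an assumption of this sketch.)
    """
    known = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] in known]


hotels = [{'hotel_id': 1}, {'hotel_id': 2}]
guests = [
    {'guest': 'a', 'hotel_id': 1},
    {'guest': 'b', 'hotel_id': 2},
    {'guest': 'c', 'hotel_id': 3},     # references a hotel that does not exist
    {'guest': 'd', 'hotel_id': None},  # missing reference
]

clean_guests = drop_unknown_refs(hotels, guests, 'hotel_id', 'hotel_id')
```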
get_parameters function consistent between synthesizers - Issue #1756 by @fealho
get_table_parameters for the multi-table synthesizers - Issue #1757 by @fealho
update_columns and update_columns_metadata methods to metadata - Issue #1804 by @R-Palazzo
get_column_names method to metadata - Issue #1805 by @frances-h
drop_unknown_references - Issue #1845 by @R-Palazzo
poc module for utilities that help with proof-of-concept - Issue #1846 by @pvk-developer
utils module: Make internal functions private - Issue #1793 by @R-Palazzo
Published by frances-h 8 months ago
This release adds multiple improvements to handling premium transformers and column relationships, including using premium transformers even if the PII flag is set to true. Additionally, the SDV now warns users to save the metadata after auto-detection has been used. Semantic sdtype detection has also been improved to tokenize column names to prevent unexpected substring matches.
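The difference between substring matching and token matching on column names can be sketched as follows. This is an assumption-laden illustration, not SDV's detection code: tokenize, matches_keyword, and the example column names are hypothetical.

```python
import re


def tokenize(column_name):
    """Split a column name into lowercase tokens.

    Splits on camelCase boundaries and on any non-alphanumeric character.
    (Hypothetical helper.)
    """
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', column_name)
    return [
        token.lower()
        for token in re.split(r'[^A-Za-z0-9]+', spaced)
        if token
    ]


def matches_keyword(column_name, keyword):
    """Return True only if `keyword` is a whole token of the column name."""
    return keyword in tokenize(column_name)


# Naive substring search: 'city' is found inside 'capacity'.
substring_hit = 'city' in 'max_capacity'

# Token-based search avoids that unexpected match.
token_hit = matches_keyword('max_capacity', 'city')

# Genuine matches still work, including camelCase names.
snake_case_hit = matches_keyword('home_city', 'city')
camel_case_hit = matches_keyword('homeCity', 'city')
```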
This release also fixes a few warning bugs and fixes an issue that would cause metadata.to_dict to fail for metadata loaded from older versions of the SDV. A few synthesizer bugs were also resolved. The quality of the sequence_index for the PARSynthesizer has been improved, and an issue that would cause CTGANSynthesizer, TVAESynthesizer, and CopulaGANSynthesizer to crash if all columns were to be generated from scratch has been fixed.
ScalarRange constraint - Issue #1737 by @fealho
sequence_index: Move the start dates into the context model - Issue #1760 by @frances-h
'category' (CTGAN, TVAE) - Issue #1735 by @frances-h
version module to align with SDV Enterprise - Issue #1761 by @R-Palazzo
Published by amontanez24 9 months ago
This release makes a number of improvements. It introduces a new concept to the metadata known as column relationships! Column relationships can be used to define when certain groups of columns in a table should be treated as a special concept (e.g. address). You can add a column relationship by using the new add_column_relationship method. The metadata detection was also improved by allowing semantic sdtypes (e.g. 'email', 'phone_number') to be detected as primary keys.
This release also patches some bugs. An issue that interfered with the likelihood matching in the HMASynthesizer was resolved. The CTGANSynthesizer no longer fails when using the FixedCombinations constraint. The Inequality constraint was also patched to handle datetimes better.
set_address_columns method is deprecated in favor of add_column_relationship.
BaseIndependentSampler crashes because it tries to cast id columns - Issue #1712 by @pvk-developer
CTGANSynthesizer when applying FixedCombinations constraint - Issue #1717 by @pvk-developer
Published by amontanez24 11 months ago
This release adds support for the new Diagnostic Report from SDMetrics. This report calculates scores for three basic but important properties of your data: data validity, data structure and in the multi table case, relationship validity. Data validity checks that the columns of your data are valid (eg. correct range or values). Data structure makes sure the synthetic data has the correct columns. Relationship validity checks to make sure key references are correct and the cardinality is within ranges seen in the real data.
Additionally, a few bugs were fixed and functionality was improved around synthesizers. It is now possible to access the loss values for the TVAESynthesizer and CTGANSynthesizer by using the get_loss_values method. The get_parameters method is now more detailed and returns all the parameters used to make a synthesizer. The metadata is now capable of detecting some common pii sdtypes. Finally, a bug that made every parent row generated by the HMASynthesizer have at least one child row was patched. This should improve cardinality.
SettingWithCopyWarning (HMASynthesizer) - Issue #1557 by @pvk-developer
get_parameters method for all multi-table synthesizers - Issue #1674 by @frances-h
Published by amontanez24 11 months ago
This release adds an alert to the CTGANSynthesizer during preprocessing. The alert informs the user if the fitting of the synthesizer is likely to be slow on their schema. Additionally, it is now possible to enforce that sampled datetime values stay within the range of the fitted data!
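One simple way to keep sampled datetimes inside the fitted range is to record the minimum and maximum at fit time and clip at sample time. The sketch below assumes that clipping approach purely for illustration; the release note does not say how SDV enforces the range, and learn_range and enforce_range are hypothetical names.

```python
from datetime import datetime


def learn_range(values):
    """Record the earliest and latest datetime seen in the fitted data."""
    return min(values), max(values)


def enforce_range(sampled, bounds):
    """Clip each sampled datetime into the learned [low, high] range."""
    low, high = bounds
    return [min(max(value, low), high) for value in sampled]


fitted = [datetime(2020, 1, 1), datetime(2021, 6, 15), datetime(2022, 12, 31)]
bounds = learn_range(fitted)

# The first and last sampled values fall outside the fitted range.
sampled = [datetime(2019, 5, 1), datetime(2021, 1, 1), datetime(2023, 3, 1)]
clipped = enforce_range(sampled, bounds)
```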
This release also makes internal changes to support address data in SDV Enterprise.
Published by amontanez24 12 months ago
This release improves user messaging in multiple ways. The most notable is that users will now see an alert if the HMASynthesizer is likely to be slow for their data's schema. Additionally, the logger messaging for constraints and the error messaging when setting distributions on non-parametric models was made more detailed.
The visualization plots in the sdv.evaluation sub-package all got a new parameter called plot_type, allowing users to specify the plot type to use if the one being inferred is not useful. The sdv.datasets.local.load_csvs method now has a parameter called read_csv_parameters that allows users to specify how the CSVs should be read during loading. The same change was also made to the sdv.metadata.multi_table.detect_table_from_csv, sdv.metadata.multi_table.detect_from_csvs and sdv.metadata.single_table.detect_from_csv methods.
Multiple bugs were resolved including one that caused new categories to be created during the sample step of CTGANSynthesizer.
Published by amontanez24 about 1 year ago
Several improvements and bug fixes were made in this release. Most notably, the metadata detection was substantially improved. Support for the 'unknown' sdtype was added, providing more flexibility in data representation. The software now attempts to intelligently detect primary keys and identify parent-child relationships in the metadata, streamlining the metadata creation process.
Additionally, issues related to conditional sampling with negative float values, the inability to update transformers for columns created by constraints, and compatibility with numpy version 1.25 and higher were addressed. The default branch was also switched from 'master' to 'main' for better development practices. Various bugs and errors, including those involving HMA and datetime format detection, were also resolved.
id (leave others as unknown) - Issue #1598 by @amontanez24
'gaussian_kde' with HMA - Issue #1604 by @frances-h
KeyError) - Issue #1454 by @frances-h
ValueError: Invalid distribution specification when setting numerical_distributions on child table (HMA) - Issue #1605 by @fealho
Published by amontanez24 about 1 year ago
This release makes multiple improvements to the metadata. Both the single and multi table metadata classes now have a validate_data method. This method runs checks to validate the data against the current specifications in the metadata. The SingleTableMetadata.visualize method is also improved. The sequence index is now shown in the same section as the sequence key. It also now shows all key and index information (e.g. sequence key, primary key, sequence index) in one section.
The CTGANSynthesizer has been made more efficient in the following ways:
preprocess like categorical columns are.
CTGAN skip the one-hot encoding step.
Additional changes include that the columns labeled with the sdtype id will now go through the IDGenerator transformer by default and constraint transformations that were being overwritten during sampling will now be respected.
Published by amontanez24 about 1 year ago
This release adds two new methods to the MultiTableMetadata: detect_from_csvs and detect_from_dataframes. These methods allow you to detect metadata for a whole dataset at once by either loading them from a folder or a dictionary mapping table names to the pandas.DataFrames. The SingleTableMetadata can now be visualized! Additionally, there is now a summarized option in the show_table_details parameter of the visualize methods. This will print each sdtype in the table and the number of columns that have that sdtype.
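The summarized view boils down to counting columns per sdtype. A minimal sketch of that count, assuming a metadata-like dictionary of column definitions (the column names and the printed layout are hypothetical, not the actual visualization output):

```python
from collections import Counter

# Hypothetical column definitions for one table.
columns = {
    'user_id': {'sdtype': 'id'},
    'age': {'sdtype': 'numerical'},
    'income': {'sdtype': 'numerical'},
    'country': {'sdtype': 'categorical'},
    'signup_date': {'sdtype': 'datetime'},
}

# Count how many columns use each sdtype.
summary = Counter(column['sdtype'] for column in columns.values())

for sdtype, count in sorted(summary.items()):
    print(f'{sdtype}: {count}')
```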
Additionally, this release patches a bug that prevented custom constraints from working on columns that were primary or alternate keys. It also adds support for Python 3.11!
Published by amontanez24 over 1 year ago
Published by amontanez24 over 1 year ago
This release adds a parameter called verbose to the HMASynthesizer's initialization. Setting it to True will show progress bars during the fitting steps. Additionally, performance optimizations were made to the modeling and initialization of the HMASynthesizer.
Multiple changes were made to enhance constraints. The Range constraint was improved to be able to generate more accurate data when null values are provided. Constraints are also now validated against the data when running validate() on any synthesizer.
Finally, some warnings were resolved.
Published by amontanez24 over 1 year ago
This release adds a new initialization parameter to synthesizers called locales that allows users to set the locales to use for all columns that have a locale-based sdtype (e.g. address or phone_number). Additionally, it adds support for Pandas 2.0!
Multiple enhancements were made to improve the performance of data and metadata validation in synthesizers. The Inequality
constraint was improved to be able to generate more scenarios of data concerning the presence of NaNs. Finally, many warnings have been resolved.