data-validation

Library for exploring and validating machine learning data

APACHE-2.0 License

data-validation - TensorFlow Data Validation 1.15.1 Latest Release

Published by rtg0795 6 months ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Depends on tensorflow>=2.15,<2.16.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.15.0

Published by vkarampudi 6 months ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • When computing cross feature statistics, skip configured crosses that
    include features of unsupported types (i.e., are not univalent numeric
    features).
  • Update the minimum Bazel version required to build TFDV to 6.1.0.
  • Modifies get_statistics_html() utility function to return a value indicating
    a dataset has no examples.
  • Outputs both a standard and a quantiles histogram for level N value list
    length statistics.
  • Add a macos_arm64 config setting to the TFDV build file. NOTE: At this
    time, any M1 support for TFDV is experimental and untested.
  • Bumps the pybind11 version to 2.11.1.
  • Depends on tensorflow~=2.15.0.
  • Depends on apache-beam[gcp]>=2.53.0,<3 for Python 3.11 and on
    apache-beam[gcp]>=2.47.0,<3 for 3.9 and 3.10.
  • Depends on protobuf>=4.25.2,<5 for Python 3.11 and on protobuf>3.20.3,<5
    for 3.9 and 3.10.
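
The TensorFlow pins tensorflow>=2.15,<2.16 (1.15.1) and tensorflow~=2.15.0 (1.15.0) admit the same final releases. The following is a minimal sketch of the PEP 440 compatible-release rule for plain dotted versions only. The helper names are hypothetical, and real installers also handle pre-releases, post-releases, and epochs, which this sketch ignores.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '2.15.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate: str, spec: str) -> bool:
    """Return True if `candidate` satisfies `~=spec` (plain releases only).

    `~=2.15.0` means: at least 2.15.0, and the leading segments (2.15)
    must match exactly, i.e. `>=2.15.0, ==2.15.*`.
    """
    want = parse_release(spec)
    have = parse_release(candidate)
    if len(want) < 2:
        raise ValueError("~= needs at least two release segments")
    prefix = want[:-1]
    # Pad with zeros so that '2.15' compares equal to '2.15.0'.
    width = max(len(have), len(want))
    have_padded = have + (0,) * (width - len(have))
    want_padded = want + (0,) * (width - len(want))
    return have_padded >= want_padded and have_padded[: len(prefix)] == prefix


if __name__ == "__main__":
    for version in ("2.14.9", "2.15.0", "2.15.1", "2.16.0"):
        print(version, compatible_release(version, "2.15.0"))
```

Under this rule a 2.15.x patch release is accepted while 2.16.0 is not, which is the same effect as the explicit >=2.15,<2.16 range.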

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • Deprecated Python 3.8 support.
  • Deprecated Windows support.
data-validation - TensorFlow Data Validation 1.14.0

Published by rtg0795 about 1 year ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Bumped the Ubuntu version on which TFX-BSL is tested to 20.04 (previously
    16.04).
  • Use @platforms instead of @bazel_tools//platforms to specify constraints in
    OSS build.
  • Depends on pyarrow>=10,<11.
  • Depends on apache-beam>=2.47,<3.
  • Depends on numpy>=1.22.0.
  • Depends on tensorflow>=2.13.0,<3.

Known Issues

  • N/A

Breaking Changes

  • Moves some non-public arrow_util functions to TFX-BSL.
  • Changes SkewPair proto to store tf.Examples in serialized format.

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.13.0

Published by rtg0795 over 1 year ago

Major Features and Improvements

  • Introduces a Schema option HistogramSelection to allow numeric drift/skew
    calculations to use QUANTILES histograms, which are more robust to outliers.
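
To see why a quantiles histogram is less sensitive to outliers than an equal-width one, the sketch below builds both over the same data. It is an illustration in plain Python, not TFDV's implementation; the function names, the bucket count, and the sample data are assumptions.

```python
def equal_width_counts(values, num_buckets):
    """Count values in `num_buckets` equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    if width == 0:
        # All values are identical: everything lands in the first bucket.
        counts[0] = len(values)
        return counts
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def quantile_boundaries(values, num_buckets):
    """Return bucket boundaries so each bucket holds roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // num_buckets)]
            for i in range(num_buckets + 1)]


if __name__ == "__main__":
    # 99 well-behaved values plus a single extreme outlier.
    data = list(range(1, 100)) + [1_000_000]
    print(equal_width_counts(data, 4))   # [99, 0, 0, 1]
    print(quantile_boundaries(data, 4))  # [1, 26, 51, 76, 1000000]
```

With equal-width buckets the single outlier stretches the range so far that almost all mass collapses into one bucket, whereas the quantile boundaries still resolve the bulk of the distribution.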

Bug Fixes and Other Changes

  • Rename statistics_io_impl and default_record_sink (not part of public API).
  • Update the minimum Bazel version required to build TFDV to 5.3.0.
  • Depends on numpy~=1.22.0.
  • Depends on pyfarmhash>=0.2.2,<0.4.
  • Depends on tensorflow>=2.12.0,<2.13.
  • Depends on protobuf>=3.20.3,<5.
  • Depends on tfx-bsl>=1.13.0,<1.14.0.
  • Depends on tensorflow-metadata>=1.13.1,<1.14.0.

Known Issues

  • N/A

Breaking Changes

  • Jensen-Shannon divergence now treats NaN values as always contributing to
    higher drift score.

Deprecations

  • Deprecated Python 3.7 support.
data-validation - TensorFlow Data Validation 1.12.0

Published by venkat2469 almost 2 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • TFDV is now tested against macOS 12.5 (Monterey).

Known Issues

  • N/A

Breaking Changes

  • Depends on tensorflow>=2.11,<3.
  • Depends on tfx-bsl>=1.12.0,<1.13.0.
  • Depends on tensorflow-metadata>=1.12.0,<1.13.0.

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.11.0

Published by venkat2469 almost 2 years ago

Major Features and Improvements

  • This is the last version that supports TensorFlow 1.15.x. TF 1.15.x support
    will be removed in the next version. Please check the TF2 migration guide
    to migrate to TF2.

  • Add a custom_validate_statistics function to the validation API, and
    support passing custom validations to validate_statistics. Note that
    custom validation is not supported on Windows.

Bug Fixes and Other Changes

  • Fix bug in implementation of semantic_domain_stats_sample_rate.

  • Add Beam metrics on string length.

  • Determine whether to calculate string statistics based on the
    is_categorical field in the schema string domain.

  • Histogram counts should now be more accurate for distributions with few
    distinct values, or frequent individual values.

  • Nested list length histogram counts are no longer based on the number of
    values one up in the nested list hierarchy.

  • Support using Jensen-Shannon divergence to detect drift and skew for string
    and categorical features.

  • get_drift_skew_dataframe now includes a threshold column.

  • Adds support for NormalizedAbsoluteDifference comparator.

  • Depends on tensorflow>=1.15.5,<2 or tensorflow>=2.10,<3.

  • Depends on joblib>=1.2.0.
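
For string and categorical features, Jensen-Shannon divergence compares the relative frequencies of each value in two datasets. The sketch below is a plain-Python illustration of the standard formula, not TFDV's implementation; the function name, the base-2 logarithm, and the use of raw value lists as input are assumptions.

```python
import math
from collections import Counter


def js_divergence(base_values, test_values):
    """Jensen-Shannon divergence (base 2) between two categorical samples.

    Both inputs must be non-empty iterables of hashable values.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with no values in common.
    """
    p_counts, q_counts = Counter(base_values), Counter(test_values)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    score = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / p_total
        q = q_counts[value] / q_total
        m = (p + q) / 2  # mixture distribution
        # Terms with zero probability contribute nothing (0 * log 0 := 0).
        if p > 0:
            score += 0.5 * p * math.log2(p / m)
        if q > 0:
            score += 0.5 * q * math.log2(q / m)
    return score


if __name__ == "__main__":
    train = ["red", "red", "green", "blue"]
    serving = ["red", "green", "green", "blue"]
    print(js_divergence(train, serving))
```

A drift or skew check would then compare this score against a configured threshold.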

Known Issues

  • N/A

Breaking Changes

  • Histogram semantics are slightly changed, so that buckets include their
    upper bound instead of their lower bound. STANDARD histograms will no longer
    generate buckets that contain infinite and finite endpoints together.
  • Introduces StatsOptions.use_sketch_based_topk_uniques replacing
    experimental_use_sketch_based_topk_uniques. The latter option can still be
    written, but not read.
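
The change in bucket semantics matters for values that fall exactly on a boundary. The sketch below contrasts the two conventions with the standard-library bisect module. It is an illustration only: the function names are hypothetical, and the handling of the outermost boundaries is an assumption rather than TFDV's documented behavior.

```python
import bisect


def bucket_index_upper_inclusive(boundaries, value):
    """Index of the bucket (b[i], b[i+1]] that contains `value`.

    Assumes `boundaries` is sorted and `value` lies within its range.
    The minimum boundary is assigned to the first bucket.
    """
    idx = bisect.bisect_left(boundaries, value) - 1
    return max(idx, 0)


def bucket_index_lower_inclusive(boundaries, value):
    """Index of the bucket [b[i], b[i+1]) that contains `value`.

    Assumes `boundaries` is sorted and `value` lies within its range.
    The maximum boundary is assigned to the last bucket.
    """
    idx = bisect.bisect_right(boundaries, value) - 1
    return min(idx, len(boundaries) - 2)


if __name__ == "__main__":
    edges = [0, 10, 20, 30]
    # A value sitting exactly on an interior boundary changes bucket.
    print(bucket_index_lower_inclusive(edges, 10))  # 1
    print(bucket_index_upper_inclusive(edges, 10))  # 0
```

Values strictly inside a bucket are unaffected; only counts for values equal to a boundary shift by one bucket.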

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.10.0

Published by venkat2469 about 2 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Skew pipeline supports counting pairs of feature values in base/test.
  • Depends on apache-beam[gcp]>=2.40,<3.
  • Depends on pyarrow>=6,<7.
  • Depends on tfx-bsl>=1.10.1,<1.11.0.
  • Depends on tensorflow-metadata>=1.10.0,<1.11.0.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.9.0

Published by venkat2469 over 2 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Depends on tensorflow>=1.15.5,<2 or tensorflow>=2.9,<3.
  • Depends on tfx-bsl>=1.9.0,<1.10.0.
  • Depends on tensorflow-metadata>=1.9.0,<1.10.0.

Known Issues

  • N/A

Breaking Changes

  • Some fields in feature skew results proto changed names to be more generic.

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.8.0

Published by rtg0795 over 2 years ago

Major Features and Improvements

  • Starting with this version, Python 3.9 wheels are released.

Bug Fixes and Other Changes

  • Adds get_statistics_html to the public API.
  • Fixes several incorrect type annotations.
  • Schema inference handles derived features.
  • StatsOptions.to_json now raises an error if it encounters unsupported
    options.
  • Depends on apache-beam[gcp]>=2.38,<3.
  • Depends on
    tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.
  • Depends on tensorflow-metadata>=1.8.0,<1.9.0.
  • Depends on tfx-bsl>=1.8.0,<1.9.0.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.7.0

Published by rtg0795 over 2 years ago

Major Features and Improvements

  • Adds the DetectFeatureSkew PTransform to the public API, which can be used
    to detect feature skew between training and serving examples.
  • Uses sketch-based top-k/uniques in TFDV in-memory mode.

Bug Fixes and Other Changes

  • Fixes a bug in load_statistics that would cause failure when reading binary
    protos.
  • Depends on pyfarmhash>=0.2,<0.4.
  • Depends on
    tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.
  • Depends on tensorflow-metadata>=1.7.0,<1.8.0.
  • Depends on tfx-bsl>=1.7.0,<1.8.0.
  • Depends on apache-beam[gcp]>=2.36,<3.
  • Updated the documentation for CombinerStatsGenerator to clarify that the
    first accumulator passed to merge_accumulators may be modified.
  • Added compression type detection when reading csv header.
  • Detection of invalid utf8 strings now works regardless of relative frequency.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.6.0

Published by rtg0795 over 2 years ago

Major Features and Improvements

  • Introduces a convenience wrapper for handling indexed access to statistics
    protos.
  • String features are checked for UTF-8 validity, and the number of invalid
    strings is reported as invalid_utf8_count.
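
The idea behind an invalid-UTF-8 count can be sketched in a few lines of plain Python. This is an illustration of the check, not TFDV's implementation; the function name is hypothetical and the input is assumed to be an iterable of bytes values.

```python
def count_invalid_utf8(values):
    """Count the byte strings in `values` that are not valid UTF-8."""
    invalid = 0
    for raw in values:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            invalid += 1
    return invalid


if __name__ == "__main__":
    feature_values = [
        b"ok",
        "héllo".encode("utf-8"),  # valid multi-byte sequence
        b"\xff\xfe",              # 0xFF never appears in valid UTF-8
        b"\xc3\x28",              # lead byte followed by a non-continuation byte
    ]
    print(count_invalid_utf8(feature_values))  # 2
```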

Bug Fixes and Other Changes

  • Depends on numpy>=1.16,<2.
  • Depends on absl-py>=0.9,<2.0.0.
  • Depends on
    tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<3.
  • Depends on tensorflow-metadata>=1.6.0,<1.7.0.
  • Depends on tfx-bsl>=1.6.0,<1.7.0.
  • Depends on apache-beam[gcp]>=2.35,<3.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.5.0

Published by rtg0795 almost 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • BasicStatsGenerator is now responsible for setting the global num_examples.
    This field will no longer be populated at the DatasetFeatureStatistics level
    if default generators are disabled.
  • Depends on apache-beam[gcp]>=2.34,<3.
  • Depends on
    tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<3.
  • Depends on tensorflow-metadata>=1.5.0,<1.6.0.
  • Depends on tfx-bsl>=1.5.0,<1.6.0.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.4.0

Published by jay90099 almost 3 years ago

Major Features and Improvements

  • Float features can now be analyzed as categorical for the purposes of top-k
    and unique count using experimental sketch based generators.
  • Supports SQL-based slicing in TFDV, which enables slicing (using SQL) in
    TFX OSS and Dataflow environments. SQL-based slicing is currently not
    supported on Windows.

Bug Fixes and Other Changes

  • Variance calculations have been updated to be more numerically stable for
    large datasets or large magnitude numeric data.
  • When running per-example validation against a schema,
    validate_examples_in_tfrecord and validate_examples_in_csv can now
    optionally return samples of anomalous examples.
  • Changes to the source code ensure that it now works with pyarrow>=3.
  • Add load_anomalies_binary utility function.
  • Merge two accumulators at a time instead of batching.
  • BasicStatsGenerator is now responsible for setting
    FeatureNameStatistics.Type. Previously it was possible for a top-k generator
    and BasicStatsGenerator to set different types for categorical numeric
    features with physical type STRING.
  • Depends on pyarrow>=1,<6.
  • Depends on tensorflow-metadata>=1.4,<1.5.
  • Depends on tfx-bsl>=1.4,<1.5.
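
The release notes do not say which variance algorithm was adopted. As one well-known way to get a numerically stable variance that can also be merged two accumulators at a time, the sketch below uses Welford's update and the pairwise merge formula of Chan et al. The class and method names are hypothetical and are not part of the TFDV API.

```python
from dataclasses import dataclass


@dataclass
class MomentsAccumulator:
    """Running count, mean, and sum of squared deviations from the mean."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        """Welford's single-value update."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "MomentsAccumulator") -> "MomentsAccumulator":
        """Merge `other` into this accumulator (pairwise formula) and return self."""
        if other.count == 0:
            return self
        if self.count == 0:
            self.count, self.mean, self.m2 = other.count, other.mean, other.m2
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    def variance(self) -> float:
        """Population variance; 0.0 for an empty accumulator."""
        return self.m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    # Large-magnitude values with a small spread, split over two partitions.
    left, right = MomentsAccumulator(), MomentsAccumulator()
    for x in (1e9 + 4, 1e9 + 7):
        left.add(x)
    for x in (1e9 + 13, 1e9 + 16):
        right.add(x)
    print(left.merge(right).variance())  # 22.5
```

Because the update works with deviations from the running mean rather than with sums of squares, it avoids subtracting two very large, nearly equal numbers.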

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • Deprecated Python 3.6 support.
data-validation - TensorFlow Data Validation 1.3.0

Published by dhruvesh09 about 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Fixed bug in JensenShannonDivergence calculation affecting comparisons of
    histograms that each contain a single value.
  • Fixed bug in dataset constraints validation that caused failures with very
    large numbers of examples.
  • Fixed a bug wherein slicing on a feature missing from some batches could
    produce slice keys derived from a different feature.
  • Depends on apache-beam[gcp]>=2.32,<3.
  • Depends on
    tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,<3.
  • Depends on tfx-bsl>=1.3,<1.4.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.2.0

Published by jay90099 about 3 years ago

Major Features and Improvements

  • Added statistics/generators/mutual_information.py. It estimates AMI using a
    knn estimation. It differs from sklearn_mutual_information.py in that this
    supports multivalent features/labels (by encoding) and multivariate
    features/labels. The plan is to deprecate sklearn_mutual_information.py in
    the future.
  • Fixed NonStreamingCustomStatsGenerator to respect max_batches_per_partition.

Bug Fixes and Other Changes

  • Depends on 'scikit-learn>=0.23,<0.24' ("mutual-information" extra only)
  • Depends on 'scipy>=1.5,<2' ("mutual-information" extra only)
  • Depends on apache-beam[gcp]>=2.31,<3.
  • Depends on tensorflow-metadata>=1.2,<1.3.
  • Depends on tfx-bsl>=1.2,<1.3.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.1.1

Published by jay90099 about 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Depends on google-cloud-bigquery>=1.28.0,<2.21.
  • Depends on tfx-bsl>=1.1.1,<1.2.
  • Fixes error when using tfdv.experimental_get_feature_value_slicer with
    pandas==1.3.0.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.1.0

Published by jay90099 over 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Optimized certain stats generators that need to materialize the input
    RecordBatches.
  • Depends on protobuf>=3.13,<4.
  • Depends on tensorflow-metadata>=1.1,<1.2.
  • Depends on tfx-bsl>=1.1,<1.2.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 1.0.0

Published by jay90099 over 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Increased the threshold beyond which a string feature value is considered
    "large" by the experimental sketch-based top-k/unique generator to 1024.
  • Added normalized AMI to sklearn mutual information generator.
  • Depends on apache-beam[gcp]>=2.29,<3.
  • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3.
  • Depends on tensorflow-metadata>=1.0,<1.1.
  • Depends on tfx-bsl>=1.0,<1.1.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • Removed the following deprecated symbols. Their deprecation was announced
    in 0.30.0:
      • tfdv.validate_instance
      • tfdv.lift_stats_generator
      • tfdv.partitioned_stats_generator
      • tfdv.get_feature_value_slicer
  • Removed parameter compression_type in
    tfdv.generate_statistics_from_tfrecord.
data-validation - TensorFlow Data Validation 0.26.1

Published by dhruvesh09 over 3 years ago

Major Features and Improvements

  • N/A

Bug Fixes and Other Changes

  • Depends on apache-beam[gcp]>=2.25,!=2.26.*,<2.29.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • N/A
data-validation - TensorFlow Data Validation 0.30.0

Published by jay90099 over 3 years ago

Major Features and Improvements

  • This is the last version before TFDV 1.0. Starting with 1.0, all TFDV
    public APIs (i.e. symbols in the root __init__.py) will be subject to
    semantic versioning. We are deprecating some public APIs in this version
    and they will be removed in 1.0.

  • The sketch-based top-k/unique stats generator can now detect invalid
    UTF-8 sequences and large texts and replace them with a placeholder, so it
    does not suffer from the memory issues usually caused by image or large
    text features in the data. Note that this generator is not yet used by
    default.

  • Added StatsOptions.experimental_use_sketch_based_topk_uniques which
    enables the sketch-based top-k/unique stats generator.

Bug Fixes and Other Changes

  • Fixed bug in display_schema that caused domains not to be displayed.
  • Modified how get_schema_dataframe outputs numeric domains.
  • Anomalies previously (un)classified as UNKNOWN_TYPE now trigger more
    specific anomaly types: INVALID_DOMAIN_SPECIFICATION and MULTIPLE_REASONS.
  • Depends on tensorflow-metadata>=0.30,<0.31.
  • Depends on tfx-bsl>=0.30,<0.31.

Known Issues

  • N/A

Breaking Changes

  • N/A

Deprecations

  • tfdv.LiftStatsGenerator is going to be removed in the next version from
    the public API. To enable that generator, supply
    StatsOptions.label_feature.
  • tfdv.NonStreamingCustomStatsGenerator is going to be removed in the next
    version from the public API. You may continue to import it from TFDV
    but it will not be subject to compatibility guarantees.
  • tfdv.validate_instance is going to be removed in the next
    version from the public API. You may continue to import it from TFDV
    but it will not be subject to compatibility guarantees.
  • Removed tfdv.DecodeCSV, tfdv.DecodeTFExample (deprecated in 0.27).
  • Removed feature_whitelist in tfdv.StatsOptions (deprecated in 0.28).
    Use feature_allowlist instead.
  • tfdv.get_feature_value_slicer is deprecated.
    tfdv.experimental_get_feature_value_slicer is introduced as a replacement.
    TFDV is likely to have a different slicing functionality post 1.0, which
    may not be compatible with the current slicers.
  • StatsOptions.slicing_functions is deprecated.
    StatsOptions.experimental_slicing_functions is introduced as a
    replacement.
  • tfdv.WriteStatisticsToText is removed (deprecated in 0.25.0).
  • Parameter compression_type in tfdv.generate_statistics_from_tfrecord
    is deprecated. The compression type is now determined automatically.