text

Making text a first-class citizen in TensorFlow.

Apache-2.0 License

Downloads: 7.4M · Stars: 1.2K · Committers: 118


text - v2.9.0-rc1

Published by broken over 2 years ago

Release 2.9.0-rc1

Major Features and Improvements

  • New FastBertNormalizer that improves speed for BERT normalization and is convertible to TF Lite.
  • New FastBertTokenizer that combines FastBertNormalizer and FastWordpieceTokenizer.
  • New ngrams kernel for handling STRING_JOIN reductions.
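
The STRING_JOIN reduction builds n-grams by joining each window of adjacent string tokens with a separator. A rough pure-Python sketch of the behavior (illustrative only, not the library kernel, which operates on string tensors; the helper name here is made up):

```python
def ngram_string_join(tokens, width, separator=" "):
    # Slide a window of `width` over the token list and join each window
    # with the separator, mirroring a STRING_JOIN n-gram reduction.
    return [separator.join(tokens[i:i + width])
            for i in range(len(tokens) - width + 1)]

print(ngram_string_join(["TF", "Text", "makes", "ngrams"], 2))
# ['TF Text', 'Text makes', 'makes ngrams']
```

The library exposes this through its ngrams op with a STRING_JOIN reduction type.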

Bug Fixes and Other Changes

  • Fixed a bug in setup.py that required the wrong version.
  • Updated the package with the correct Python versions we release for.
  • Update documentation on TF Lite convertible ops.
  • Transition to use TF's version of bazel.
  • Transition to use TF's bazel configuration.
  • Add missing symbols for tokenization layers
  • Fix typo in text_generation.ipynb
  • Fix grammar typo
  • Allow fast wordpiece tokenizer to take in external wordpiece model.
  • Internal change
  • Improvement to guide where mean call is redundant. See https://github.com/tensorflow/text/issues/810 for more info.
  • Update broken link and fix typo in BERT-SNGP demo notebook
  • Consolidate disparate test-related files into a single testing_infra folder.
  • Pin tf-text version to guides & tutorials.
  • Fix bug in constrained sequence op: added a check for the edge case where num_steps = 0 should do nothing, preventing SIGSEGV crashes.
  • Remove outdated Keras tests, as the testing utilities they relied on are no longer made available.
  • Update bert preprocessing by padding correct tensors
  • Update tensorflow-text notebooks from 2.7 to 2.8
  • Optimize FastWordPiece to only generate requested outputs.
  • Add a note about byte-indexing vs character indexing.
  • Add a MAX_TOKENS to the transformer tutorial.
  • Only export tensorflow symbols from shared libs.
  • (Generated change) Update tf.Text versions and/or docs.
  • Do not run the prepare_tf_dep script for Apple M1 Macs.
  • Update text_classification_rnn.ipynb
  • Fix the exported symbols for the linker test. Adding them to the shared objects instead of the C++ code allows the code to be compiled together into one large shared lib.
  • Implement FastBertNormalizer based on codepoint-wise mappings.
  • Add pybind for fast_bert_normalizer_model_builder.
  • Remove unused comments related to Python 2 compatibility.
  • update transformer.ipynb
  • Update toolchain & temporarily disable tf lite tests.
  • Define manylinux2014 for the new toolchain target, and have presubmits use it.
  • Move tflite build deps to custom target.
  • Add FastBertTokenizer.
  • Update bazel version to 5.1.0
  • Update TF Text to use new Ngrams kernel.
  • Don't try to set dimension if shape is unknown for ngrams.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aflah, Connor Brinton, devnev39, Janak Ramakrishnan, Martin, Nathan Luehr, Pierre Dulac, Rabin Adhikari

text - v2.9.0-rc0

Published by broken over 2 years ago

Release 2.9.0-rc0

Major Features and Improvements

  • New FastBertNormalizer that improves speed for BERT normalization and is convertible to TF Lite.
  • New FastBertTokenizer that combines FastBertNormalizer and FastWordpieceTokenizer.
  • New ngrams kernel for handling STRING_JOIN reductions.

Bug Fixes and Other Changes

  • Add missing symbols for tokenization layers
  • Fix typo in text_generation.ipynb
  • Fix grammar typo
  • Allow fast wordpiece tokenizer to take in external wordpiece model.
  • Internal change
  • Improvement to guide where mean call is redundant. See https://github.com/tensorflow/text/issues/810 for more info.
  • Update broken link and fix typo in BERT-SNGP demo notebook
  • Consolidate disparate test-related files into a single testing_infra folder.
  • Pin tf-text version to guides & tutorials.
  • Fix bug in constrained sequence op: added a check for the edge case where num_steps = 0 should do nothing, preventing SIGSEGV crashes.
  • Remove outdated Keras tests, as the testing utilities they relied on are no longer made available.
  • Update bert preprocessing by padding correct tensors
  • Update tensorflow-text notebooks from 2.7 to 2.8
  • Optimize FastWordPiece to only generate requested outputs.
  • Add a note about byte-indexing vs character indexing.
  • Add a MAX_TOKENS to the transformer tutorial.
  • Only export tensorflow symbols from shared libs.
  • (Generated change) Update tf.Text versions and/or docs.
  • Do not run the prepare_tf_dep script for Apple M1 Macs.
  • Update text_classification_rnn.ipynb
  • Fix the exported symbols for the linker test. Adding them to the shared objects instead of the C++ code allows the code to be compiled together into one large shared lib.
  • Implement FastBertNormalizer based on codepoint-wise mappings.
  • Add pybind for fast_bert_normalizer_model_builder.
  • Remove unused comments related to Python 2 compatibility.
  • update transformer.ipynb
  • Update toolchain & temporarily disable tf lite tests.
  • Define manylinux2014 for the new toolchain target, and have presubmits use it.
  • Move tflite build deps to custom target.
  • Add FastBertTokenizer.
  • Update bazel version to 5.1.0
  • Update TF Text to use new Ngrams kernel.
  • Don't try to set dimension if shape is unknown for ngrams.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aflah, Connor Brinton, devnev39, Janak Ramakrishnan, Martin, Nathan Luehr, Pierre Dulac, Rabin Adhikari

text - v2.8.1

Published by broken over 2 years ago

Release 2.8.1

Major Features and Improvements

  • Upgrade Sentencepiece to v0.1.96
  • Adds new trimmer ShrinkLongestTrimmer
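
The trimmer's name describes its budgeting strategy: when the combined segments exceed a token budget, repeatedly drop a token from whichever segment is currently longest. A minimal pure-Python sketch of that strategy (illustrative only; the real trimmer works on batched RaggedTensors, and its tie-breaking and truncation details may differ):

```python
def shrink_longest(segments, max_tokens):
    # Repeatedly shorten the currently-longest segment until the total
    # number of tokens across all segments fits the budget.
    segments = [list(s) for s in segments]
    while sum(len(s) for s in segments) > max_tokens:
        longest = max(segments, key=len)
        longest.pop()  # drop the trailing token of the longest segment
    return segments

print(shrink_longest([["a", "b", "c", "d"], ["x", "y"]], 4))
# [['a', 'b'], ['x', 'y']]
```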

Bug Fixes and Other Changes

  • Upgrade bazel to 4.2.2
  • Create .bazelversion file to guarantee using correct version
  • Update tf.Text versions and docs.
  • Add Apple Silicon support for manual builds.
  • Update configure.sh
  • Only Apple Silicon will be installed with tensorflow-macos
  • Fix merge error & add SP patch for building on Windows
  • Fix inclusion of missing libraries for Mac & Windows
  • Update word_embeddings.ipynb
  • Update classify_text_with_bert.ipynb
  • Update tensorflow_text tutorials to new preprocessing layer symbol path
  • Fixes typo in guide
  • Update Apple Silicon's requires.
  • release script to use tf nightly
  • Fix typo in ragged tensor link.
  • Update requires for setup. It wasn't catching non-M1 Macs.
  • Add missing symbols for tokenization layers
  • Fix typo in text_generation.ipynb
  • Fix grammar typo
  • Allow fast word piece tokenizer to take in external word piece model.
  • Update guide with redundant mean call.
  • Update broken link and fix typo in BERT-SNGP demo notebook.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Abhijeet Manhas, chunduriv, Dean Wyatte, Feiteng, jaymessina3, Mao, Olivier Bacs, RenuPatelGoogle, Steve R. Sun, Stonepia, sun1638650145, Tharaka De Silva, thuang513, Xiaoquan Kong, devnev39, Janak Ramakrishnan, Pierre Dulac

text - v2.8.0-rc0

Published by mms4devops over 2 years ago

Release 2.8.0-rc0

Major Features and Improvements

  • Upgrade Sentencepiece to v0.1.96
  • Adds new trimmer ShrinkLongestTrimmer

Bug Fixes and Other Changes

  • Upgrade bazel to 4.2.2
  • Create .bazelversion file to guarantee using correct version
  • (Generated change) Update tf.Text versions and/or docs.
  • Add Apple Silicon support for manual builds.
  • Update configure.sh
  • Only Apple Silicon will be installed with tensorflow-macos
  • Fix merge error & add SP patch for building on Windows
  • Fix inclusion of missing libraries for Mac & Windows
  • Update word_embeddings.ipynb
  • Update classify_text_with_bert.ipynb
  • Update tensorflow_text tutorials to new preprocessing layer symbol path
  • Fixes typo in guide
  • Update Apple Silicon's requires.
  • release script to use tf nightly
  • Fix typo in ragged tensor link.
  • Update requires for setup. It wasn't catching non-M1 Macs.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Abhijeet Manhas, chunduriv, Dean Wyatte, Feiteng, jaymessina3, Mao, Olivier Bacs, RenuPatelGoogle, Steve R. Sun, Stonepia, sun1638650145, Tharaka De Silva, thuang513, Xiaoquan Kong

text - v2.7.3

Published by mms4devops almost 3 years ago

Bug Fixes and Other Changes

  • Fixed broken packages for macOS & Windows

text - v2.7.0

Published by mms4devops almost 3 years ago

Release 2.7.0

Major Features and Improvements

  • Added new tokenizer: FastWordpieceTokenizer that is considerably faster than the original WordpieceTokenizer
  • WhitespaceTokenizer was rewritten to increase speed and reduce kernel size
  • Ability to convert WhitespaceTokenizer & FastWordpieceTokenizer to TF Lite
  • Added Keras layers for tokenizers: UnicodeScript, Whitespace, & Wordpiece
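
For context on FastWordpieceTokenizer: WordPiece greedily takes the longest vocabulary entry that matches a prefix of the remaining word, marking non-initial pieces with "##". A pure-Python sketch of that greedy longest-match loop (the vocab, [UNK] handling, and names here are illustrative; the fast tokenizer gets its speedup from a precompiled trie rather than this quadratic scan):

```python
def wordpiece(word, vocab, suffix_marker="##"):
    # Greedy longest-match: at each position, try the longest remaining
    # substring first and shrink until a vocab entry is found.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix_marker + piece  # mark non-initial pieces
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no vocab entry matched this position
        start = end
    return pieces

print(wordpiece("unaffable", {"un", "##aff", "##able"}))
# ['un', '##aff', '##able']
```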

Bug Fixes and Other Changes

  • (Generated change) Update tf.Text versions and/or docs.
  • tiny change for variable name in transformer tutorial
  • Update nmt_with_attention.ipynb
  • Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.
  • This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.
  • This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.
  • C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.
  • Adds pywrap for WhitespaceTokenizer config builder.
  • Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.
  • Remove most of the default oss deps for kernels as they are no longer required for building.
  • Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).
  • Adds kernels for TF & TFLite for the new WhitespaceTokenizer.
  • Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.
  • fix: masked_ids -> masked_lm_ids
  • Save the transformer.
  • Remove the sentencepiece patch in OSS
  • fix vocab_table arg is not used in bert_pretrain_preprocess()
  • Disable TSAN for one more tutorial test that may run for >900sec when TSAN is enabled.
  • Remove the sentencepiece patch in OSS
  • internal
  • (Generated change) Update tf.Text versions and/or docs.
  • Update deps to fix broken build.
  • Remove --gen_report flag.
  • Small typo fixed
  • Explain that all heads are handled with a single Dense layer
  • internal change, should be a noop in github.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Creates tf Lite registrar and adds TF Lite tests for mobile ops.
  • Fix nmt_with_attention start_index
  • Export LD_LIBRARY_PATH when configuring for build.
  • Update tf lite test to use the function directly, rather than globally sharing the linked library symbols so the interpreter can find the name, since that approach is only available on Linux.
  • Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.
  • Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.
  • Remove temporary code and set back to using the op shim macro.
  • Updated import statement
  • Internal change
  • pushed back forward compatibility date for tf_text.WhitespaceTokenizer.
  • Add .gitignore
  • The --keep_going flag will make bazel run all tests instead of stopping at the first failure
  • Add missing blank line between test and doctest.
  • Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.
  • Fix the build by adding a new dependency required by TF to kernel targets.
  • Add sentencepiece detokenize op to stateful allowlist.
  • Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).
  • Clean up code now that the build horizon has passed.
  • Add pywrap dependency for tflite ops.
  • Update TextVectorization layer
  • Allows overridden get_selectable to be used.
  • fix: masked_input_ids is not used in bert_pretrain_preprocess()
  • Update word_embeddings.ipynb
  • Fixed a value where the training accuracy was shown instead of the validation accuracy
  • Mark old SP targets
  • Create a single SELECT_TFTEXT_OPS for registering all of the TF Text ops with TF Lite interpreter. Also adds a single target for building to them.
  • Add TF Lite op for RaggedTensorToTensor.
  • Adds a new guide for using select TF Text ops in TF Lite models for mobile.
  • Switch FastWordpieceTokenizer to default to running pre-tokenization, and rename the end_to_end parameter to no_pretokenization. This should be a no-op. The flatbuffer is not changed so as to not affect any models already using FWP currently. Only the python API is updated.
  • Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Abhijeet Manhas, Dominik Schlösser, jaymessina3, Mao, Xiaoquan Kong, Yasir Modak, Olivier Bacs, Tharaka De Silva

text - v2.7.0-rc1

Published by broken almost 3 years ago

Release 2.7.0-rc1

Major Features and Improvements

  • Added new tokenizer: FastWordpieceTokenizer that is considerably faster than the original WordpieceTokenizer
  • WhitespaceTokenizer was rewritten to increase speed and reduce kernel size
  • Ability to convert WhitespaceTokenizer & FastWordpieceTokenizer to TF Lite
  • Added Keras layers for tokenizers: UnicodeScript, Whitespace, & Wordpiece

Bug Fixes and Other Changes

  • (Generated change) Update tf.Text versions and/or docs.
  • tiny change for variable name in transformer tutorial
  • Update nmt_with_attention.ipynb
  • Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.
  • This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.
  • This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.
  • C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.
  • Adds pywrap for WhitespaceTokenizer config builder.
  • Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.
  • Remove most of the default oss deps for kernels as they are no longer required for building.
  • Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).
  • Adds kernels for TF & TFLite for the new WhitespaceTokenizer.
  • Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.
  • fix: masked_ids -> masked_lm_ids
  • Save the transformer.
  • Remove the sentencepiece patch in OSS
  • fix vocab_table arg is not used in bert_pretrain_preprocess()
  • Disable TSAN for one more tutorial test that may run for >900sec when TSAN is enabled.
  • Remove the sentencepiece patch in OSS
  • internal
  • (Generated change) Update tf.Text versions and/or docs.
  • Update deps to fix broken build.
  • Remove --gen_report flag.
  • Small typo fixed
  • Explain that all heads are handled with a single Dense layer
  • internal change, should be a noop in github.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Creates tf Lite registrar and adds TF Lite tests for mobile ops.
  • Fix nmt_with_attention start_index
  • Export LD_LIBRARY_PATH when configuring for build.
  • Update tf lite test to use the function directly, rather than globally sharing the linked library symbols so the interpreter can find the name, since that approach is only available on Linux.
  • Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.
  • Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.
  • Remove temporary code and set back to using the op shim macro.
  • Updated import statement
  • Internal change
  • pushed back forward compatibility date for tf_text.WhitespaceTokenizer.
  • Add .gitignore
  • The --keep_going flag will make bazel run all tests instead of stopping at the first failure
  • Add missing blank line between test and doctest.
  • Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.
  • Fix the build by adding a new dependency required by TF to kernel targets.
  • Add sentencepiece detokenize op to stateful allowlist.
  • Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).
  • Clean up code now that the build horizon has passed.
  • Add pywrap dependency for tflite ops.
  • Update TextVectorization layer
  • Allows overridden get_selectable to be used.
  • Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Abhijeet Manhas, Dominik Schlösser, jaymessina3, Mao, Xiaoquan Kong, Yasir Modak

text - v2.7.0-rc0

Published by broken about 3 years ago

Release 2.7.0-rc0

Major Features and Improvements

  • WhitespaceTokenizer was rewritten to increase speed and reduce kernel size
  • Ability to convert some ops to TF Lite

Bug Fixes and Other Changes

  • (Generated change) Update tf.Text versions and/or docs.
  • tiny change for variable name in transformer tutorial
  • Update nmt_with_attention.ipynb
  • Add vocab_size for wordpiece tokenizer to have consistency with sentence piece.
  • This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.
  • This adds the builder for the new WhitespaceTokenizer config cache. This is the first in a series of changes to update the WST for mobile.
  • C++ API for new WhitespaceTokenizer. The updated API is more useful (accepts strings instead of ints), faster, and smaller in size.
  • Adds pywrap for WhitespaceTokenizer config builder.
  • Simplify the configure.bzl. Since for each platform we build with C++14, let's just make it easier to default to it across the board. This should be easier to understand and maintain.
  • Remove most of the default oss deps for kernels as they are no longer required for building.
  • Updating this BERT tutorial to use model subclassing (easier for students to hack on it this way).
  • Adds kernels for TF & TFLite for the new WhitespaceTokenizer.
  • Fix a problem with the WST template that was causing members to be exported as undefined symbols. After this change they become a unique global symbol in the shared object file.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Convert the TFLite kernel for ngram with STRING_JOIN mode to use tfshim so the same code is now used for TF and TFLite kernels.
  • fix: masked_ids -> masked_lm_ids
  • Save the transformer.
  • Remove the sentencepiece patch in OSS
  • fix vocab_table arg is not used in bert_pretrain_preprocess()
  • Disable TSAN for one more tutorial test that may run for >900sec when TSAN is enabled.
  • Remove the sentencepiece patch in OSS
  • internal
  • (Generated change) Update tf.Text versions and/or docs.
  • Update deps to fix broken build.
  • Remove --gen_report flag.
  • Small typo fixed
  • Explain that all heads are handled with a single Dense layer
  • internal change, should be a noop in github.
  • Update whitespace op to use new kernel. This change still allows for building the old kernel as well so current users can continue to use it, even though we cannot make new calls to it.
  • Creates tf Lite registrar and adds TF Lite tests for mobile ops.
  • Fix nmt_with_attention start_index
  • Export LD_LIBRARY_PATH when configuring for build.
  • Update tf lite test to use the function directly, rather than globally sharing the linked library symbols so the interpreter can find the name, since that approach is only available on Linux.
  • Temporarily switch to the definition of REGISTER_TF_OP_SHIM while it updates.
  • Update REGISTER_TF_OP_SHIM macro to remove unnecessary parameter.
  • Remove temporary code and set back to using the op shim macro.
  • Updated import statement
  • Internal change
  • pushed back forward compatibility date for tf_text.WhitespaceTokenizer.
  • Add .gitignore
  • The --keep_going flag will make bazel run all tests instead of stopping at the first failure
  • Add missing blank line between test and doctest.
  • Adds a regression test for model server for the replaced WST op. This ensures that current models using the old kernel will continue to work.
  • Fix the build by adding a new dependency required by TF to kernel targets.
  • Add sentencepiece detokenize op to stateful allowlist.
  • Fix broken build. This occurred because of a change on TF that updated the compiler infra version (https://github.com/tensorflow/tensorflow/commit/e0940f269a10f409466b6fef4ef531aec81f9afa).
  • Clean up code now that the build horizon has passed.
  • Add pywrap dependency for tflite ops.
  • Update TextVectorization layer
  • Allows overridden get_selectable to be used.
  • Update version

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Aaron Siddhartha Mondal, Dominik Schlösser, Xiaoquan Kong, Yasir Modak

text - v2.6.0

Published by broken about 3 years ago

Release 2.6.0

Bug Fixes and Other Changes

  • Update __init__.py: Added a __version__ variable
  • Fixes the benchmark suite for graph mode. While using tf.function prevented caching, it also caused the graph under test to rebuild each time; using a placeholder instead fixes this.
  • Pin nightly version.
  • Remove TF patch as it is not needed anymore. The code is in core TF.
  • Typos
  • Format and lint NBs, add images
  • Add a couple notes to the BertTokenizer docs.
  • Narrative docs migration: TF Core -> TF Text
  • Update nmt_with_attention
  • Moved examples of a few API docs above the args sections to better match other formats.
  • Fix NBs
  • Update Installation from source instruction.
  • Add SplitterWithOffsets as an exported symbol.
  • Fix a note to the BertTokenizer docs.
  • Remove unused index.md
  • Convert tensorflow_text to use public TF if possible.
  • Fix failing notebooks.
  • Create user_ops BUILD file.
  • Remove unnecessary METADATA.
  • Replace tf.compat.v2.xxx with tf.xxx, since tf_text is using tf2 only.
  • Fix load_data function in nmt tutorial
  • Update tf.data.AUTOTUNE in Fine-tuning a BERT model
  • Switch TF to OSS keras (1/N).
  • added subspaces
  • Disable TSAN for tutorial tests that may run for >900sec when TSAN is enabled.
  • Adds a short description to the main landing page of our GitHub repo to point users to the tf.org subsite.
  • Phrasing fix to TF Transformer tutorial.
  • Disable RTTI when building TF Text kernels for mobile
  • Migrate the references in third_party/toolchains directory as it is going to be deleted soon.
  • Fix bug in RoundRobinTrimmer. Previously the stopping condition was merged across different batch elements; now it is first determined within each batch element, then aggregated.
  • Set mask_token='' to make it work with TF 2.6.0
  • Builds TF Text with C++14 by default. This is already done by TensorFlow, and the TF Lite shim has C++14 features used within; thus, this is needed to build kernels against it.
  • This is a general clean up to the build files. The previous tf_deps paradigm was confusing. By encapsulating everything into a single call lib, I'm hoping this makes it easier to understand and follow.
  • Update the WORKSPACE to not use the same "workspace" name when initializing TensorFlow.
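
For context on the RoundRobinTrimmer fix above: the trimmer takes one token at a time from each segment in turn until a budget is reached, and the stopping condition should be evaluated per example rather than merged across the batch. A single-example pure-Python sketch (illustrative only; the library version is batched and works on RaggedTensors):

```python
def round_robin_trim(segments, max_tokens):
    # Take tokens one at a time from each segment in turn until the
    # budget is used up or every segment is exhausted.
    kept = [[] for _ in segments]
    i = 0
    while sum(len(k) for k in kept) < max_tokens:
        seg = i % len(segments)
        if len(kept[seg]) < len(segments[seg]):
            kept[seg].append(segments[seg][len(kept[seg])])
        elif all(len(kept[s]) >= len(segments[s]) for s in range(len(segments))):
            break  # all segments exhausted before reaching the budget
        i += 1
    return kept

print(round_robin_trim([["a", "b", "c"], ["x"]], 3))
# [['a', 'b'], ['x']]
```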

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

8bitmp3, akiprasad, bongbonglemon, Jules Gagnon-Marchand, Stonepia

text - v2.6.0-rc0

Published by broken over 3 years ago

Release 2.6.0-rc0

Bug Fixes and Other Changes

  • Update __init__.py: Added a __version__ variable
  • Fixes the benchmark suite for graph mode. While using tf.function prevented caching, it also caused the graph under test to rebuild each time; using a placeholder instead fixes this.
  • Pin nightly version.
  • Remove TF patch as it is not needed anymore. The code is in core TF.
  • Typos
  • Format and lint NBs, add images
  • Add a couple notes to the BertTokenizer docs.
  • Narrative docs migration: TF Core -> TF Text
  • Update nmt_with_attention
  • Moved examples of a few API docs above the args sections to better match other formats.
  • Fix NBs
  • Update Installation from source instruction.
  • Add SplitterWithOffsets as an exported symbol.
  • Fix a note to the BertTokenizer docs.
  • Remove unused index.md
  • Convert tensorflow_text to use public TF if possible.
  • Fix failing notebooks.
  • Create user_ops BUILD file.
  • Remove unnecessary METADATA.
  • Replace tf.compat.v2.xxx with tf.xxx, since tf_text is using tf2 only.
  • Fix load_data function in nmt tutorial
  • Update tf.data.AUTOTUNE in Fine-tuning a BERT model
  • Switch TF to OSS keras (1/N).
  • added subspaces
  • Disable TSAN for tutorial tests that may run for >900sec when TSAN is enabled.
  • Adds a short description to the main landing page of our GitHub repo to point users to the tf.org subsite.
  • Phrasing fix to TF Transformer tutorial.
  • Disable RTTI when building TF Text kernels for mobile

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

8bitmp3, akiprasad, bongbonglemon, Jules Gagnon-Marchand, Stonepia

text - v2.5.0

Published by broken over 3 years ago

Release 2.5

We want to particularly point out that guides, tutorials, and API docs are currently being published to http://tensorflow.org/text! This should make it easier for users to find our documentation. We worked hard on improving docs across the board, so feel free to let us know if further clarification is needed.

Major Features and Improvements

  • API docs, guides, & tutorial are now available on http://tensorflow.org/text
  • New guides & tutorials including: tokenizers, subwords tokenizer, and BERT text preprocessing guide.
  • Add RoundRobinTrimmer
  • Add a function to generate a BERT vocab from a tf.data.Dataset.
  • Add detokenize methods for BertTokenizer and WordpieceTokenizer.
  • Enable NFD and NFKD in NormalizeWithOffset op
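
The new detokenize methods map token sequences back to readable strings. For WordPiece-style output that amounts to gluing "##"-prefixed pieces onto the preceding piece and joining words with spaces, roughly as in this pure-Python sketch (illustrative only; the real methods also map ids through the vocabulary and handle ragged batches):

```python
def wordpiece_detokenize(pieces, suffix_marker="##"):
    # Merge suffix pieces into the preceding word, then join words.
    words = []
    for p in pieces:
        if p.startswith(suffix_marker) and words:
            words[-1] += p[len(suffix_marker):]
        else:
            words.append(p)
    return " ".join(words)

print(wordpiece_detokenize(["they", "##'", "##re", "the", "great", "##est"]))
# they're the greatest
```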

Bug Fixes and Other Changes

  • Many API updates (e.g. adding descriptions & examples) to various ops.
  • Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
  • Fix a bug in split mode tokenizers that caused tests to fail on Windows.
  • Fix broadcasting bugs in RoundRobinTrimmer
  • Add WordpieceTokenizeWithOffsets with ALLOW_STATEFUL_OP_FOR_DATASET_FUNCTIONS for tf.data
  • Remove PersistentTensor from sentencepiece_kernels.cc
  • Document examples are now tested.
  • Fix benchmarking of graph mode ops through use of tf.function.
  • Set the default for mask_token for StringLookup and IntegerLookup to None
  • Update the sentence_breaking_ops docstring to indicate that it's deprecated.
  • Adding an i18n-friendly BasicTokenizer that can preserve accents
  • For Windows, always include ICU data files since they need to be built in statically.
  • Rename documentation file WordShape.md to WordShape_cls.md. Fix #361.
  • Convert input to tensor to allow for numpy inputs to state based sentence breaker.
  • Add classifiers to py packages and fix header image.
  • Fix for the model server test.
  • Update regression test for break_sentences_with_offsets.
  • Add a shape attribute to the ToDense Keras layer.
  • Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker
  • Fix for the model server test.
  • Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems.
  • Add regression test for Find Source Offsets
  • Fix unselectable_ids shape check in ItemSelector.
  • Switch out architecture image in tf.Text documentation.
  • Fix regression test for state_based_sentence_breaker_v2
  • Update run_build with enable_runfiles flag.
  • Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
  • Simplify tf-text WORKSPACE, by relying on tf_workspace().
  • Update transformer.ipynb to use a saved text.BertTokenizer
  • Update mobile targets to use :mobile rather than separate :android & :ios targets.
  • Make tools part of the tensorflow_text pip package.
  • Import tools from the tf-text package, instead of cloning the git repo.
  • Minor cleanups to make some code compile on the android build system.
  • Fix pip install command in readme
  • Fix tools pip package inclusion.
  • A tensorfow.org compatible docs generator for tf-text.
  • Sample random tokens correctly during MLM.
  • Treat Sentencepiece ops as stateful in tf.data pipelines.
  • Replacing use of TFT's deprecated dataset_schema.from_feature_spec with its replacement schema_utils.schema_from_feature_spec.

text - 2.5.0-rc0

Published by gregbillock over 3 years ago

Release 2.5.0-rc0

Major Features and Improvements

  • Add a subwords tokenizer tutorial to text/examples.
  • Add a function to generate a BERT vocab from a tf.data.Dataset.
  • Add detokenize methods for BertTokenizer and WordpieceTokenizer.
  • Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
  • Enable NFD and NFKD in NormalizeWithOffset op
  • Add an i18n-friendly BasicTokenizer that can preserve accents.
  • Create guide for tokenizers.
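
The new detokenize methods above enable round-tripping tokenized text. As a rough illustration of what WordpieceTokenizer does, here is a pure-Python sketch of greedy longest-match-first WordPiece splitting and its inverse (illustrative only; the real tokenizer additionally handles unknown-token configuration, ragged batches, and offsets):

```python
def wordpiece_tokenize(word, vocab, suffix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix + piece  # continuation pieces carry the suffix marker
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no prefix of the remainder is in the vocab
        start = end
    return pieces

def wordpiece_detokenize(pieces, suffix="##"):
    """Inverse: join pieces, stripping the continuation marker."""
    return "".join(p[len(suffix):] if p.startswith(suffix) else p for p in pieces)
```

With a toy vocabulary {"un", "##aff", "##able"}, "unaffable" splits into ["un", "##aff", "##able"], and detokenizing that list reproduces the original word.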

Breaking Changes

Bug Fixes and Other Changes

  • Other:
    • For Windows, always include ICU data files since they need to be built in statically.
    • Patches TF to fix windows builds to not look for a python3 executable.
    • Rename documentation file WordShape.md to WordShape_cls.md. On macOS (and possibly Windows) this filename collides with wordshape.md, because the filesystem does not differentiate case. This is purely a QOL change for anybody checking out the library on a non-Linux platform. Fixes #361.
    • Convert input to tensor to allow for numpy inputs to state based sentence breaker.
    • Add classifiers to py packages and fix header image.
    • Fix bad rendering of the add_eos and add_bos descriptions in SentencepieceTokenizer.md.
    • Fix for the model server test; make sure our test tensors are as expected.
    • Update regression test for break_sentences_with_offsets.
    • Add a shape attribute to the ToDense Keras layer.
    • Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker
    • Fix for the model server test involving the result of the tokenize() method.
    • Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.
    • Update documentation for SplitMergeFromLogitsTokenizer
    • Add regression test for Find Source Offsets
    • Fix unselectable_ids shape check in ItemSelector.
    • Change two tests to debug a failure on the Kokoro Windows build.
    • Switch out architecture image in tf.Text documentation.
    • Fix regression test for state_based_sentence_breaker_v2
    • Update run_build with enable_runfiles flag.
    • Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
    • Simplify tf-text WORKSPACE, by relying on tf_workspace().
    • Update transformer.ipynb to use a saved text.BertTokenizer
    • Fix typos.
    • Update mobile targets to use :mobile rather than separate :android & :ios targets.
    • Make tools part of the tensorflow_text pip package.
    • Import tools from the tf-text package, instead of cloning the git repo.
    • Minor cleanups to make some code compile on the android build system.
    • Fix pip install command in readme
    • Fix tools pip package inclusion.
    • Clear outputs
    • A tensorflow.org-compatible docs generator for tf-text.
    • Formatting fixes for tensorflow.org
    • Sample random tokens correctly during MLM.
    • Internal repo change
    • Treat Sentencepiece ops as stateful in tf.data pipelines.
    • Reduce the critical section range.
    • Replace use of TFT's deprecated dataset_schema.from_feature_spec with its replacement, schema_utils.schema_from_feature_spec.
    • Update guide with new template.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Rens, Samuel Marks, thuang513

text - v2.4.3

Published by broken almost 4 years ago

Release 2.4.3

Bug Fixes and Other Changes

  • Fix export as saved model of hub_module_splitter
  • Fix bug in regex_split_with_offsets when input.ragged_rank > 1
  • Convert input to tensor to allow for numpy inputs in state based sentence breaker.
  • Add more classifiers to py packages.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

fsx950223

text - v2.4.2

Published by broken almost 4 years ago

Release 2.4.2

Major Features and Improvements

  • We are now building a nightly package - tensorflow-text-nightly. This is available for Linux immediately, with other platforms to be added soon.

Bug Fixes and Other Changes

  • Fixes a bug which prevented the sentence_fragmenter from being able to process tensors with a rank > 1.
  • Update documentation filenames to prevent collisions when checking out the code on filesystems that do not have case sensitivity.
text - 2.4.1

Published by thuang513 almost 4 years ago

Release 2.4.1

Major Features and Improvements

  • Splitter
    • RegexSplitter
    • StateBasedSentenceBreaker
  • Trimmer
    • WaterfallTrimmer
    • RoundRobinTrimmer
  • ItemSelector
    • RandomItemSelector
    • FirstNItemSelector
  • MaskValuesChooser
  • mask_language_model()
  • combine_segments()
  • pad_model_inputs()
  • Windows support!
  • Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.
  • Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc.).
  • With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
  • Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
  • Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
  • Added normalize_utf8_with_offsets and find_source_offsets ops.
  • Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
  • Added string_to_id to SentencepieceTokenizer.
  • Support Android build.
  • RegexSplit op now caches regular expressions between calls.
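
combine_segments() and pad_model_inputs() above cover the standard BERT input-packing step: join segments with special tokens, emit segment ids, then pad to a fixed length with an attention mask. A minimal single-example sketch under assumed token ids (illustrative only; the real ops operate on ragged batches and configurable special-token ids):

```python
CLS, SEP, PAD = 101, 102, 0  # assumed BERT-style special token ids

def combine_two_segments(seg_a, seg_b):
    """[CLS] a... [SEP] b... [SEP], plus matching segment ids (illustrative)."""
    ids = [CLS] + seg_a + [SEP] + seg_b + [SEP]
    seg = [0] * (len(seg_a) + 2) + [1] * (len(seg_b) + 1)
    return ids, seg

def pad_inputs(ids, max_len):
    """Truncate/pad to max_len; return padded ids and a 1/0 attention mask."""
    ids = ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    return ids + [PAD] * (max_len - len(ids)), mask
```

Packing [7, 8] and [9] this way yields ids [101, 7, 8, 102, 9, 102] with segment ids [0, 0, 0, 0, 1, 1]; padding to length 8 appends two PAD ids and two mask zeros.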

Bug Fixes and Other Changes

  • Add a minimal count_words function to wordpiece_vocabulary_learner.
  • Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
  • Add dep on tensorflow_hub in pip_package/setup.py
  • Add filegroup BUILD target for test_data segmentation Hub module.
  • Extend documentation for class HubModuleSplitter.
  • Read SP model file in bytes mode in tests.
  • Update intro.ipynb colab.
  • Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
  • Update StateBasedSentenceBreaker handling of text input tensors.
  • Reduce over-broad dependencies in regex_split library.
  • Fix broken builds.
  • Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
  • Update README regarding versions.
  • Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
  • Convert non-tensor inputs in pad along dimension op.
  • Add a note to the build instructions that coreutils must be installed when building on macOS.
  • Add long and long long overloads for RegexSplit so the C++ API is TF-version agnostic.
  • Add Splitter / SplitterWithOffsets abstract base classes.
  • Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.
  • Change variable names for token offsets: "limit" -> "end".
  • Fix presubmit failures on macOS.
  • Allow dense tensor inputs for RegexSplit.
  • Fix imports in tools/.
  • BertTokenizer: Error out if the user passes a normalization_form that will be ignored.
  • Update documentation for Sentencepiece.tokenize_with_offsets.
  • Let WordpieceTokenizer read vocabulary files.
  • Numerous build improvements / adjustments (mostly to support Windows):
    • Patch out googletest & glog dependencies from Sentencepiece.
    • Switch to using Bazel's internal patching.
    • ICU data is built statically for Windows.
    • Remove reliance on tf_kernel_library.
    • Patch TF to fix problematic Python executable searching.
    • Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

text - v2.4.0-rc1

Published by broken almost 4 years ago

Release 2.4.0-rc1

Major Features and Improvements

  • Windows support!
  • Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.
  • Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc.).
  • With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
  • Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
  • Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
  • Added normalize_utf8_with_offsets and find_source_offsets ops.
  • Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
  • Added string_to_id to SentencepieceTokenizer.
  • Support Android build.
  • RegexSplit op now caches regular expressions between calls.

Bug Fixes and Other Changes

  • Add a minimal count_words function to wordpiece_vocabulary_learner.
  • Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
  • Add dep on tensorflow_hub in pip_package/setup.py
  • Add filegroup BUILD target for test_data segmentation Hub module.
  • Extend documentation for class HubModuleSplitter.
  • Read SP model file in bytes mode in tests.
  • Update intro.ipynb colab.
  • Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
  • Update StateBasedSentenceBreaker handling of text input tensors.
  • Reduce over-broad dependencies in regex_split library.
  • Fix broken builds.
  • Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
  • Update README regarding versions.
  • Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
  • Convert non-tensor inputs in pad along dimension op.
  • Add a note to the build instructions that coreutils must be installed when building on macOS.
  • Add long and long long overloads for RegexSplit so the C++ API is TF-version agnostic.
  • Add Splitter / SplitterWithOffsets abstract base classes.
  • Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.
  • Change variable names for token offsets: "limit" -> "end".
  • Fix presubmit failures on macOS.
  • Allow dense tensor inputs for RegexSplit.
  • Fix imports in tools/.
  • BertTokenizer: Error out if the user passes a normalization_form that will be ignored.
  • Update documentation for Sentencepiece.tokenize_with_offsets.
  • Let WordpieceTokenizer read vocabulary files.
  • Numerous build improvements / adjustments (mostly to support Windows):
    • Patch out googletest & glog dependencies from Sentencepiece.
    • Switch to using Bazel's internal patching.
    • ICU data is built statically for Windows.
    • Remove reliance on tf_kernel_library.
    • Patch TF to fix problematic Python executable searching.
    • Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

text - v2.4.0-rc0

Published by broken almost 4 years ago

Release 2.4.0-rc0

Major Features and Improvements

  • Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.
  • Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc.).
  • With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
  • Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
  • Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
  • Added normalize_utf8_with_offsets and find_source_offsets ops.
  • Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
  • Added string_to_id to SentencepieceTokenizer.
  • Support Android build.
  • Support Windows build (Py3.6 & Py3.7 this release).
  • RegexSplit op now caches regular expressions between calls.

Bug Fixes and Other Changes

  • Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
  • Add dep on tensorflow_hub in pip_package/setup.py
  • Add filegroup BUILD target for test_data segmentation Hub module.
  • Extend documentation for class HubModuleSplitter.
  • Read SP model file in bytes mode in tests.
  • Update intro.ipynb colab.
  • Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
  • Update StateBasedSentenceBreaker handling of text input tensors.
  • Reduce over-broad dependencies in regex_split library.
  • Fix broken builds.
  • Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
  • Update README regarding versions.
  • Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
  • Convert non-tensor inputs in pad along dimension op.
  • Add a note to the build instructions that coreutils must be installed when building on macOS.
  • Add long and long long overloads for RegexSplit so the C++ API is TF-version agnostic.
  • Add Splitter / SplitterWithOffsets abstract base classes.
  • Update setup.py. TensorFlow has switched to the default package being GPU, and having users explicitly call out when wanting just CPU.
  • Change variable names for token offsets: "limit" -> "end".
  • Fix presubmit failures on macOS.
  • Allow dense tensor inputs for RegexSplit.
  • Fix imports in tools/.
  • BertTokenizer: Error out if the user passes a normalization_form that will be ignored.
  • Update documentation for Sentencepiece.tokenize_with_offsets.
  • Let WordpieceTokenizer read vocabulary files.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

text - 2.4.0-b0

Published by broken almost 4 years ago

Release 2.4.0-b0

Please note that this is a pre-release and meant to run with TF v2.3.x. We wanted to give access to some of the features we were adding to 2.4.x, but did not want to wait for the TF release.

Major Features and Improvements

  • Released our first TF Hub module for Chinese segmentation! Please visit the hub module page here for more info including instructions on how to use the model.
  • Added Splitter / SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer / TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc.).
  • With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
  • Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
  • Added new SplitMergeFromLogitsTokenizer which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.

Bug Fixes and Other Changes

  • Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
  • Add dep on tensorflow_hub in pip_package/setup.py
  • Add filegroup BUILD target for test_data segmentation Hub module.
  • Extend documentation for class HubModuleSplitter.
  • Read SP model file in bytes mode in tests.

Thanks to our Contributors

text - 2.3.0

Published by broken about 4 years ago

Release 2.3.0

Major Features and Improvements

  • Added UnicodeCharacterTokenizer
  • Tokenizers are now tf.Modules and can be saved from within Keras layers.
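
UnicodeCharacterTokenizer splits a string into individual code points, and, as with the other tokenizers, offsets are byte offsets into the UTF-8 string rather than character indices. A pure-Python sketch of that offset behavior (illustrative only; the real op returns code point ids and ragged offset tensors):

```python
def unicode_char_tokenize(text):
    """Split text into code points with UTF-8 byte start/end offsets."""
    tokens, starts, ends = [], [], []
    pos = 0
    for ch in text:
        nbytes = len(ch.encode("utf-8"))  # multi-byte characters advance by >1
        tokens.append(ch)
        starts.append(pos)
        ends.append(pos + nbytes)
        pos += nbytes
    return tokens, starts, ends
```

For "a€b", the euro sign occupies three UTF-8 bytes, so the byte offsets are [0, 1, 4] / [1, 4, 5] even though the character indices are just 0, 1, 2.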

Bug Fixes and Other Changes

  • Allow wordpiece_tokenizer to output int32 tokens natively.
  • Tracks the Sentencepiece model resource via a TrackableResource.
  • oss-segmenter:
    • fix end-offset error in split_merge_tokenizer_kernel.
  • TensorFlow text python ops wordshape:
    • More comprehensive emoji handling
  • Other:
    • Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
    • Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
    • Add normalize kernels test.
    • Fix Sentencepiece tests.
    • Add some metric logs to tokenizers.
    • Fix documentation formatting for SplitMergeTokenizer
    • Bug fix: make sure tokenize() method does not ignore itself.
    • Improve logging efficiency.
    • Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by TensorFlow. Also added a tf.unicode_script test to ensure that ICU works correctly from within model server.
    • Add the ability to define a user-defined destination directory to make testing easier.
    • Fix typo in documentation of BertTokenizer
    • Clarify docstring of UnicodeScriptTokenizer about splitting on space
    • Add executable flag to the run_build.sh script.
    • Clarify docstring of WordpieceTokenizer on unknown_token:
    • Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors

text - 2.3.0-rc1

Published by broken over 4 years ago

Release 2.3.0-rc1

Major Features and Improvements

  • Added UnicodeCharacterTokenizer

Bug Fixes and Other Changes

  • oss-segmenter:
    • fix end-offset error in split_merge_tokenizer_kernel.
  • TensorFlow text python ops wordshape:
    • More comprehensive emoji handling
  • Other:
    • Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
    • Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
    • Add normalize kernels test.
    • Add some metric logs to tokenizers.
    • Fix documentation formatting for SplitMergeTokenizer
    • Bug fix: make sure tokenize() method does not ignore itself.
    • Improve logging efficiency.
    • Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by TensorFlow. Also added a tf.unicode_script test to ensure that ICU works correctly from within model server.
    • Add the ability to define a user-defined destination directory to make testing easier.
    • Fix typo in documentation of BertTokenizer
    • Clarify docstring of UnicodeScriptTokenizer about splitting on space
    • Add executable flag to the run_build.sh script.
    • Clarify docstring of WordpieceTokenizer on unknown_token:
    • Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors
