ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models

APACHE-2.0 License

Downloads
4.6K
Stars
10.9K
Committers
158
ludwig - v0.3.3: New datasets, dependency fixes

Published by w4nderlust over 3 years ago

Changelog

Additions

  • Added Irony dataset and AGNews dataset (#1073)

Improvements

  • Updated hyperopt sampling functions to handle list values (#1082)

Bugfixes

  • Fix compatibility issues with Transformers 4.2.1 (#1077)
  • Fixed SST dataset link (#1085)
  • Fix hyperopt batch sampling (#1086)
  • Bumped skimage dependency version (#1087)
ludwig - v0.3.2: New datasets, better processing of binary and numerical, minor fixes

Published by w4nderlust almost 4 years ago

Changelog

Additions

  • Added feature identification logic (#957)
  • Added Backend interface for abstracting DataFrame preprocessing steps (#1014)
  • Add support for transforming numeric predictions that were normalized (#1015)
  • Added Kaggle API integration and Titanic dataset (#1021)
  • Add Korean translation for the README (#1022)
  • Added cast_columns function to preprocessing and cast_column function to all feature mixin classes (#1027)
  • Added custom encoder / decoder registration decorator (#1017)
  • Add titles to Hyperopt Report visualization (#1026)
  • Added label-wise probability to binary feature predictions (#1033)
  • Add support for num_layers in sequence generator decoder (#1050)
  • Added Flickr8k dataset (#1053)

Improvements

  • Improved triggering of cache re-creation (it now also depends on changes in feature types)
  • Improved legend and add tight_layout param to compare predictions plot (#1037)
  • Improved postprocessing for binary features so prediction vocab matches inputs (#1038)
  • Bump TensorFlow and tfa-nightly for 2.4.0 release (#1058)
  • Updated Dockerfiles to TensorFlow 2.4.0 (#1059)

Bugfixes

  • Fix missing yaml files for datasets in pip package
  • Fix hdf5 preprocessing error
  • Fix calculation of the metric score for hyperopt (#1031)
  • Fix wrong argument in visualize.py from -f to -ofn (#1032)
  • Fix fill NaN by adding selected conversion of columns to string when computing metadata (#1042)
  • Fix: inconsistent seq length for probabilities (#1043)
  • Fix issues with changes in xlrd package (#1056)
ludwig - v0.3.1: Datasets, cache checksum, improvements for text and visualization

Published by w4nderlust almost 4 years ago

Additions

  • Added dataset module (#949) containing MNIST, SST-2, SST-5, REUTERS, OHSUMED, FEVER and GoEmotions datasets (see the sketch after this list)
  • Add Ludwig Model Serve Example (#947)
  • Add checksum mechanism for HDF5 and Meta JSON cache file (#1006)
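
As a hedged sketch of how the new dataset module might be used: the module-level load() helper and its return value below are assumptions based on the dataset list above and may not match this exact release.

```python
# Assumed usage of the new dataset module; mnist.load() and its DataFrame
# return value are assumptions, not confirmed by these release notes.
from ludwig.datasets import mnist

dataset_df = mnist.load()  # assumed to download on first use and return a DataFrame
print(dataset_df.shape)
```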

Improvements

  • Updated run_experiment to use new skip parameters and returns (#955)
  • Several improvements to testing (more coverage, with faster tests)
  • Changed default value of HF encoder trainable parameter to True (for performance reasons) (#996)
  • Improved and slightly modified the visualization functions API

Bugfixes

  • Changed not to is None in dataset checks in hyperopt.run.hyperopt() (#956)
  • Fix LudwigModel.predict() when skip_save_predictions = False (#962)
  • Fix #963: Convert materialized tensors to numpy arrays up front to avoid repeated conversion
  • Fix errors with DataFrame truth checks in hyperopt (#956)
  • Added truncation to HF tokenizer (#978)
  • Reimplemented Jaccard Metric for the Set Feature (#979)
  • Fix learning rate computation with decay and warmup (#982)
  • Fix CLI logger typos (#998, #999)
  • Fix loading of split from hdf5 (#1003)
  • Fix visualization unit tests (#981)
  • Fix concatenate_csv to work with arbitrary read functions and renamed it to concatenate_datasets
  • Fix compatibility issue with matplotlib 3.3.3
  • Limit numpy and h5py max versions due to tensorflow 2.3.1 max supported versions (#990)
  • Fixed usage of model_load_path with Horovod (#1011)

Improvements

  • Full porting to TensorFlow 2.
  • New hyperparameter optimization functionality through the hyperopt command.
  • Integration with HuggingFace Transformers for pre-trained text encoders.
  • Refactored preprocessing with new supported data formats: auto, csv, df, dict, excel, feather, fwf, hdf5 (cache file produced during previous training), html (file containing a single HTML <table>), json, jsonl, parquet, pickle (pickled Pandas DataFrame), sas, spss, stata, tsv (see the sketch after this list).
  • Improved validation logic.
  • New Transformer encoders for sequential data types (sequence, text, audio, timeseries).
  • New batch_predict functionality in the REST API.
  • New export command to export to SavedModel and Neuropod.
  • New collect_summary command to print out a model summary with layers names.
  • Modified the predict command and split it into predict and evaluate: the first only produces predictions, the second evaluates those predictions against ground truth.
  • Two new hyperopt-related visualizations: hyperopt_report and hyperopt_hiplot.
  • Improved tracking of metrics in the TensorBoard.
  • Greatly improved test suite.
  • Various documentation improvements.
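
As a rough illustration of training directly from one of the new in-memory data formats, here is a minimal sketch assuming a toy pandas DataFrame; the feature names and config values are illustrative only, not taken from the release.

```python
# Minimal sketch, assuming a toy two-column dataset; feature names and
# config values are made up for illustration.
import pandas as pd
from ludwig.api import LudwigModel

df = pd.DataFrame({
    "review": ["great movie", "terrible plot", "loved it"],
    "sentiment": ["positive", "negative", "positive"],
})

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(config)
# The df format is detected automatically; dataset= also accepts csv, dict,
# parquet and the other formats listed above.
train_stats, preprocessed_data, output_dir = model.train(dataset=df)
```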

Bugfixes

This release includes a fundamental rewrite of the internals, so many bugs were fixed during the rewrite.
This list includes only the ones that have a specific issue associated with them, but many others were addressed.

  • Fix #649: Replaced SPLIT with 'split' in example code.
  • Fix documentation, wrong parameter name (#684)
  • Fix #702: Fixed setting defaults in binary output feature.
  • Fix #729: Reduce output was not passed to the sequence encoder inside the sequence combiner.
  • Fix #742: Renamed self._learning_rate in Progresstracker.
  • Fix #799: Added tf_version to description.json.
  • Fix #840: Better messaging for plateau logic.
  • Fix #850: Switch from ValueError to Warning to make stratify work on non-output features.
  • Fix #844: Load LudwigModel in test_savedmodel before creating saved model.
  • Fix #833: loads the model after training and before predicting if the model was saved on disk.
  • Fix #933: Added NumpyDecoder before returning JSON response from server.
  • Fix #935: Multiple categorical features with different vocabs now work.

Breaking changes

Because of the change in the underlying tensor computation library (TensorFlow 1 to TensorFlow 2) and the internal reworking it required, models trained with v0.2 don't work on v0.3.
We suggest retraining such models; in most cases the same model definition can be used, although one impactful breaking change is that the model_definition is now called config, because it no longer contains only information about the model, but also training, preprocessing, and a newly added hyperopt section.

There have been some changes in the parameters inside the config too.
In particular, one main change is dropout, which is now a float value that can be specified for each encoder / combiner / decoder / layer, while before it was a boolean parameter.
As a consequence, the dropout_rate parameter in the training section has been removed.
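
As a hedged illustration (feature names and values are made up), a v0.3-style config groups model, training, preprocessing and hyperopt information together, with dropout as a per-encoder float:

```python
# Illustrative v0.3-style config: what used to be model_definition is now a
# single config that can also carry training, preprocessing and hyperopt
# sections. Note dropout as a float on the encoder instead of the old
# boolean plus a global dropout_rate.
config = {
    "input_features": [
        {"name": "review", "type": "text", "encoder": "parallel_cnn", "dropout": 0.2},
    ],
    "output_features": [
        {"name": "sentiment", "type": "category"},
    ],
    "training": {"epochs": 10, "optimizer": {"type": "adam"}},
    "preprocessing": {},
    # an optional "hyperopt" section is only used by the hyperopt command
}
```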

Another change in the training parameters concerns the available optimizers.
TensorFlow 2 doesn't ship with some of the ones that were exposed in Ludwig (adagradda, proximalgd, proximaladagrad), and the momentum optimizer has been removed because momentum is now a parameter of the sgd optimizer.
Newly added optimizers are nadam and adamax.
Note that the accuracy metric for the combined feature has been removed because it was misleading in some scenarios when multiple features of different types were trained.

In most cases, encoders, combiners and decoders now have an increased number of exposed parameters to play with for increased flexibility.
One notable change is that the previous BERT encoder has been replaced by a HuggingFace-based one with different parameters, and it is now available only for text features.
Please refer to the User Guide for details for each encoder.

Tokenizers also changed substantially, with new parameters supported; refer to the User Guide for more details.

Other major changes are related to the CLI interface.
The predict command has been replaced in functionality with a simplified predict and a new evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.
Some parameters of all CLI commands changed.
All the different data_* parameters have been replaced by generic dataset, training_set, validation_set and test_set parameters; the data format is determined automatically, but can also be set manually with the data_format argument. There is no gpu_fraction anymore; users can now specify gpu_limit to manage VRAM usage.
For all additional minor changes to the CLI please refer to the User Guide.

The programmatic API changed too, as a consequence.
Now all the parameters closely match those of the CLI interface, including the new dataset and gpu parameters.
Also in this case the predict function has been split into predict and evaluate.
Finally, the returned values of most functions changed to include intermediate processing values, such as the preprocessed and split data when calling train, the output experiment directory, and so on.
Notably, now there is an experiment function in the API too, together with a new hyperopt one.
For more details, refer to the API reference.
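
A minimal sketch of the reworked programmatic API described above; file paths are placeholders and the exact contents of the returned tuples may differ slightly.

```python
# Sketch of the v0.3 programmatic API; file paths are placeholders and the
# returned tuples are abbreviated.
from ludwig.api import LudwigModel

model = LudwigModel(config)  # config as in the example above

# train() now also returns intermediate artifacts, such as the preprocessed
# (and split) data and the output directory.
train_stats, preprocessed_data, output_dir = model.train(dataset="train.csv")

# predict() only produces predictions...
predictions, _ = model.predict(dataset="test.csv")

# ...while evaluate() scores those predictions against the ground truth.
eval_stats, predictions, _ = model.evaluate(dataset="test.csv")
```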

Contributors

@jimthompson5802 @tgaddair @kaushikb11 @ANarayan @calio @dme65 @ydudin3 @carlogrisetti @ifokeev @flozi00 @soovam123 @KushalP1 @JiByungKyu @stremlau @adiov @martinremy @dsblank @jakobt @vkuzmin-uber @mbzhu1 @moritzebeling @lnxpy

Improvements

Added integration with Weights and Biases.
Added K-Fold cross validation.
Added 4 examples with their respective code and Jupyter Notebooks: Hyper-parameter optimization, K-Fold Cross Validation, MNIST, Titanic.
Greatly improved the measures tracked on the TensorBoard.
Added auto-detect function for field separator when reading CSVs.
Added CI tooling.
Class weights can be specified as a dictionary #615.
Removed deprecation warning from h5py.
Removed most deprecation warnings from TensorFlow.
Bypass multiprocessing.Pool.map for faster execution.
Updated TensorFlow dependency to 1.15.2.
Various documentation improvements.

Bugfixes

Fix cudnn error on RTX GPUs.
Fix inverted confusion_matrix axis.
Fix #201: Removed whitespace as a separator option.
Fix #540: Fixed default text parameters for sampled loss.
Fix #541: Docker image improvements (removed libgmp and spacy model download).
Fix #554: Fix audio input test case in docker container.
Fix #570: Temporary solution for in_memory flag usage in API.
Fix #574: Setting intra and inter op parallelism to 0 so that TF determines them automatically.
Fix #329 and #575: Fixed use of SavedModel and added an integration test.
Fix #609: When predicting, if a split is in the CSV, data is split correctly.
Fix #616: Change preprocessing in siamese network example.
Fix #620: Failure in unit tests for 1 vs all calibration plots.
Fix #632: Setting minimum version requirements for six.
Fix #636: CLI output table column ordering preserved when resuming.
Fix #641: Added multi-task learning section specifying the weight for each output feature in the User Guide.
Fix #642: Fixing horovod use when loading a model as initialization.

Contributors

@jimthompson5802 @calz1 @pingsutw @vanpelt @carlogrisetti @anttisaukko @dsblank @borisdayma @flozi00 @jshah02

ludwig - v0.2.1: Vector features, Norwegian and Lithuanian tokenizers, many bugfixes.

Published by w4nderlust about 5 years ago

Improvements

Add Filter Bank features for audio.
Added two more parameters skip_save_test_predictions and skip_save_test_statistics to train and experiment CLI commands and API.
Updated to spaCy 2.2 with support for Norwegian and Lithuanian tokenizers.
Reorganized dependencies, now the defaults are barebone and there are several extra ones.
Added fc_layers to H3 embed encoder.
Added get_preprocessing_params in preprocessing.
Refactored image features preprocessing to use multiprocessing.
Refactored preprocessing with strategy pattern.

Bugfixes

Fix #452: Removed dependency on gmpy.
Fix #465: Adds capability to set the vocabulary from a Glove file.
Fix #480: Adds a health check to ludwig serve.
Fix #481: Added some examples of visualization commands.
Fix #491: Improved skip parameters, now no directories are created if not needed.
Fix #492: Adds skip saving unprocessed output in api.py.
Fix #493: Added parameters for the vocabulary file and the UNK and PAD symbols in sequence feature call to create_vocabulary in the calculation of metadata.
Fix #500: Fixed learning_curves() when the training statistics file does not contain validation.
Fix #509: Fixes in_memory issues in image features.
Fix #525: Adding check is_on_master() before creating save_path directory.
Fix #510: Fixed version of pydantic.
Fix #532: Improved speed of add_sequence_feature_column().

Potentially breaking changes

Fix #520: Renamed field parameter in visualization to output_feature_name for clarity and improved documentation. Please make sure to rename your function calls if you were using this parameter by name (the argument order stays the same).
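
For example, a call to learning_curves() (the function mentioned in #500 above) would change roughly as follows; the output feature name is a placeholder.

```python
# Sketch of the renamed keyword argument; "sentiment" is a placeholder
# output feature name.
from ludwig.visualize import learning_curves

train_stats_per_model = [train_stats]  # train_stats as returned by training

# Before: learning_curves(train_stats_per_model, field="sentiment")
# After the rename:
learning_curves(train_stats_per_model, output_feature_name="sentiment")
```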

Contributors

@sriki18 @carlogrisetti @areeves87 @naresh-bhandari @revolunet @patrickvonplaten @Athanaziz @dsblank @tgaddair @Mechachleopteryx @AlexeyGy @yu-iskw

Improvements

New BERT encoder with its BPE tokenizer
Added Audio features that can also be used for speech data (with appropriate preprocessing feature extraction)
Added H3 feature, together with 3 encoders to deal with spatial information
Added Date feature and two encoders to deal with temporal information
Improved Comet.ml integration
Refactored visualization.py to make individual functions usable from API
Added capability of saving visualization graph in the visualization command and visualizations_utils.py
Added a serve command that allows for spawning a prediction server using FastAPI
Added a test command (that requires output columns in the data) to avoid confusion with predict (which does not require output columns)
Added pixel normalization and pixel standardization scaling options for image features
Added greyscaling of images when the specified channels is 1 and the image has 3 or 4 channels
Added normalization strategies for numerical features (#367)
Added experiment name parameter in the API (#357)
Refactored text tokenizers
Several improvements in logging
Added a method for saving models as SavedModel in model.py and exposed it in the API with a save_for_serving() function (#329, #425)
Upgraded to the latest version of TensorFlow 1.14 (#429)
Added learning rate warmup for non distributed settings

Bugfixes

Fix #321: Removed the 6n+2 check for ResNet size
Fix #328: adds missing UPDATE_OPS to the optimization operation
Fix #336: GloVe embeddings loading now reads utf-8 encoded files
Fix #336: Addresses the malformed lines issue in embeddings loading
Fix #346: added a parameter indicating if the session should be closed after training in full_train
Fix #351: values in categorical columns are now stripped before being compared to the vocabulary
Fix #364: associate the right function with non-English text format functions
Fix #372: set evaluate performance parameter to false in predict.py
Fix #394: Improved error explanation when image dimensions don't match and improved documentation accordingly
Fix #411: Images in HDF5 are now correctly saved as uint8 instead of int8
Fix #431: missing libgmp3-dev dependency in docker (#428)
Fix fixed image resizing
Fix model load path (#424)
Fix batch norm in convolutional layers (now uses tf internal layer and not the one in contrib)
Several additional minor fixes

Contributors

@carlogrisetti @jaipradeesh @glongh @dsblank @danicattaneob @gogasca @lordeddard @IgorWilbert @patrickvonplaten @ojus1 @jimthompson5802 @johnwahba @revolunet

Improvements

  • Improved import speed by ~50%
  • Improved Comet.ml integration
  • Replaced only_predict with evaluate_performance (and flipped the logic) in all predict commands and functions
  • Refactored preprocessing functions for improved testability, understandability and extensibility
  • Added data_dict to the train method in LudwigModel (see the sketch after this list)
  • Improved tests speed
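
A rough sketch of the new data_dict argument; the feature names, values and model definition below are illustrative only.

```python
# Illustrative use of the new data_dict argument on LudwigModel.train();
# feature names and values are made up.
from ludwig.api import LudwigModel

model_definition = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "class", "type": "category"}],
}

model = LudwigModel(model_definition)
# data_dict maps feature names to lists of values.
train_stats = model.train(data_dict={
    "text": ["good stuff", "bad stuff", "okay stuff"],
    "class": ["pos", "neg", "neutral"],
})
```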

Bugfixes

  • Fix issue #283: word_format in text features is now properly used
  • Fix issue #286: avoid using signal when not on main thread
  • Fix issue where the order of the resizing and channel-changing operations when preprocessing images was inverted
  • Fix safety issues: now using yaml.safe_load instead of yaml.load and replaced pickling of the progress tracker with a JSON equivalent
  • Fix minor bug with missing tied_weights key in some features
  • Fixed a few minor issues discovered with deepsource.io

Other Changes

  • LudwigModel was previously imported from ludwig; now it should be imported from ludwig.api (see the snippet below). This change was needed to speed up imports
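
```python
# Old (before v0.1.2): from ludwig import LudwigModel
# New:
from ludwig.api import LudwigModel
```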

Contributors

@dsblank @Ignisor @bertyhell @jaipradeesh

ludwig - v0.1.1: Bug fixes, new parameters and Comet.ml integration

Published by w4nderlust over 5 years ago

New features and improvements

  • Updated to tensorflow 1.13.1 and spacy 2.1 (this also makes Ludwig compatible with Python 3.7)
  • Added an initial integration with Comet.ml
  • Added support for text preprocessing of additional languages: Italian, Spanish, German, French, Portuguese, Dutch, Greek and Multi-language (Feature Request #251).
  • Added skip_save_progress, skip_save_model and skip_save_log parameters
  • Improved the default parameters of the image feature (this may make previously trained models that include image features incompatible; if that is the case, retrain your model)
  • Added PassthroughEncoder
  • Added eval_batch_size parameter
  • Added sanity checks for model definitions, with improved error messages
  • Add Dockerfile for running Ludwig on a CPU
  • Added clip parameter to numerical output features
  • Added a full MNIST training example, a fraud detection example and a more complex regression example on fuel consumption

Bug fixes

  • Fix issue #56: removing just keys that exist in the dataset when replacing text feature names concatenating their level
  • Fix issue #46 #144: Solved Mac OS X mpl.use('TkAgg') use
  • Fix issue #74: Call subprocess within try except
  • Fix issue #81: Opens a file before calling yaml.load()
  • Fix issue #90: Forcing csv writer to write utf-8 encoded files
  • Fix issue #120: Missing sgd (and synonyms) key in optimizers default
  • Fix issue #64: Fix for files with capitalized extensions
  • Fix issue #121: Typo bucketin_field to bucketing_field
  • Fix training when validation or test CSVs are provided separately
  • Fix issue #112: dataframe_df may not have a csv attribute
  • Fix missing checks if dataset is None in preprocessing.py and api.py
  • Fix error measure aggregation and default value
  • Fix image interpolation
  • Fix preprocessing_defaults error in bag_feature.py
  • Fix text output features populate_defaults() and update_model_definition_with_metadata()
  • Fix timeseries placeholder datatype
  • Moved image preprocessing params to preprocessing section (this may make previously trained models that include image features incompatible; if that is the case, retrain your model)
  • Fix warmup learning rate function for distributed training
  • Fix issue #214: replace_text_feature_level usage in api.py
  • Fix issue #214: replaced SPACE_PUNCTUATION_REGEX
  • Fix issue #229 #100: solved missing hdf5 / csv file reference
  • Fix issue #222: incorrect logging in read_csv
  • Fix issue #194: Renaming class_distance to class_similarities and several bugfixes regarding class_similarities, class_weights and their interaction at model building time
  • Fix issue #100 #225: solves image prediction issues
  • Fix issue #98: solves dealing with images with different numbers of channels, including transparencies
  • Fix unwanted creation of hdf5 files when running ludwig.predict on images
  • And few more minor fixes

Contributors

Thanks to all our amazing contributors (some of your PRs were not merged, but we used some of your code in our commits, so thank you anyway!):
@dsblank @MariusDanner @BenMacKenzie @Barathwaja @gabefair @kevinqz @yantsey @jontonsoup4 @Praneet460 @DakshMiglani @syeef @Tejaf @rolisz @JakeConnors376W @AndyZZH @us @0xflotus @laserbeam3 @krychu @dettmering @bbrodsky @c-m-hunt @C0deFxxker @hemchander23 @Shivam-Beeyani @yashrajbharti @rbramwell @emushtaq @EBazarov @graytowne @jovilius @ivanhe @philippgille @floscha

ludwig - v0.1.0: First release

Published by w4nderlust over 5 years ago

This is the first public release of Ludwig

Package Rankings
Top 4.38% on Proxy.golang.org
Top 2.29% on Pypi.org