deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

BSD-3-CLAUSE License

Stars
3.2K
Committers
31

deepvariant - DeepVariant 1.6.1 Latest Release

Published by kishwarshafin 7 months ago

In this release:

deepvariant - DeepVariant 1.6.0

Published by kishwarshafin 12 months ago

  • Improved support for haploid regions, chrX and chrY. Users can specify haploid regions with a flag. Updated case studies show usage and metrics.
  • Added pangenome workflow (FASTQ-to-VCF mapping with VG and DeepVariant calling). Case study demonstrates improved accuracy.
  • Substantial improvements to DeepTrio de novo accuracy by specifically training DeepTrio for this use case (for chr20 at 30x HG002-HG003-HG004, false negatives reduced from 8 to 0 with DeepTrio v1.4, false positives reduced from 5 to 0).
  • We have added multi-processing ability in postprocess_variants, which reduces runtime from 48 minutes to 30 minutes for Illumina WGS and from 56 minutes to 33 minutes for PacBio.
  • We have added new models trained with Complete Genomics data, and added case studies.
  • We have added NovaSeqX to the training data for the WGS model.
  • We have migrated our training and inference platform from Slim to Keras.
  • Force calling with approximate phasing is now available.
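
The postprocess_variants runtimes quoted above can be turned into relative reductions. The following is a small arithmetic sketch; the function name is illustrative and not part of DeepVariant.

```python
def percent_reduction(before_min: float, after_min: float) -> float:
    """Return the relative runtime reduction as a percentage."""
    return (before_min - after_min) / before_min * 100.0


# Runtimes in minutes, taken from the release note above.
illumina = percent_reduction(48, 30)
pacbio = percent_reduction(56, 33)

print(f"Illumina WGS: {illumina:.1f}% shorter")
print(f"PacBio: {pacbio:.1f}% shorter")
```

In other words, the multi-processing change shortens this step by roughly 38% for Illumina WGS and roughly 41% for PacBio.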

We are sincerely grateful to

  • @wkwan and @paulinesho for their contributions to the Keras migration.
  • @lucasbrambrink for enabling multiprocessing in postprocess_variants.
  • @msamman, @akiraly1 for their contributions.
  • PacBio: William Rowell (@williamrowell), Nathaniel Echols for their feedback and testing.
  • UCSC: Benedict Paten (@benedictpaten), Shloka Negi (@shlokanegi), Jimin Park (@jimin001), Mobin Asri (@mobinasri) for the feedback.
deepvariant - DeepVariant 1.5.0

Published by pichuan over 1 year ago

  • New model datatype: --model_type ONT_R104 is a new option. Starting from v1.5, DeepVariant natively supports ONT R10.4 simplex and duplex data.
  • Incorporated PacBio Revio training data in DeepVariant PacBio model. In our evaluations this single model performs well on both Sequel II and Revio datatypes. Please use DeepVariant v1.5 and later for Revio data.
  • Incorporated Element Biosciences data in WGS models. We found that we could jointly train a short-read WGS model with both Illumina and Element data. Inclusion of Element data improves accuracy on Element without negative effect on Illumina. Please use the WGS model for best results on either Illumina or Element data.
  • Added vg/Giraffe-mapped BAMs to DeepVariant WGS training data (alongside existing BWA). We observed that a single model can be trained for strong results with both BWA and vg/Giraffe.
  • Improved DeepVariant WES model for 100bps exome sequencing thanks to user-reported issues (including https://github.com/google/deepvariant/issues/586 and https://github.com/google/deepvariant/issues/592).
  • Thanks to Tong Zhu from Nvidia for his suggestion to improve the logic for shuffling reads.
  • Thanks to Doron Shem-Tov (@doron-st) and Ilya Soifer (@ilyasoifer) from Ultima Genomics for adding new functionalities enabled by flags --enable_joint_realignment and --p_error.
  • Thanks to Dennis Yelizarov for improving Google-internal infrastructure for running make_examples.
  • Updated TensorFlow version to 2.11.0. Updated htslib version to 1.13.
deepvariant - DeepVariant 1.4.0

Published by pichuan over 2 years ago

  • Simplified DeepVariant PacBio by introducing direct phasing. This means PacBio users who run DeepVariant no longer need to run DeepVariant+WhatsHap+DeepVariant. See PacBio case study for more information.
  • For Illumina WGS and WES, we added a feature for read insert size (insert_size). This reduces errors by 4-10% for the Illumina WGS and WES models. Thanks @lucasbrambrink for implementing this feature.
  • Reduced the runtime of the postprocess_variants step by 10-30%. Thanks @moshewagner for optimizing the code.
  • Included experimental code which explores use of Keras for model architecture. This is not used in production methods, but may be informative to developers seeking examples of Keras applied to similar problems. Thanks @wkwan and @paulinesho for their contributions.
  • We did not include OpenVINO by default in the Docker images we released. Users can still build their own Docker images with the option turned on as needed.
  • Updated 2022-10-17: We have released an Illumina RNA-seq model and added an RNA-seq case study.
deepvariant - DeepVariant 1.3.0

Published by pichuan almost 3 years ago

  • Improved the DeepTrio PacBio models on PacBio Sequel II Chemistry v2.2 by including this data in the training dataset.
  • Improved call_variants speed for PacBio models (both DeepVariant and DeepTrio) by reducing the default window width from 221 to 199, without tradeoff on accuracy. Thanks to @lucasbrambrink for conducting the experiments to find a better window width for PacBio.
  • Introduced a new flag --normalize_reads in make_examples, which normalizes Indel candidates at the reads level. This flag is useful to reduce rare cases where an indel variant is not left-normalized. This feature is mainly relevant to joint calling of large cohorts, or cases where read mappings have been surjected from one reference to another. It is currently set to False by default. To enable it, add --normalize_reads=true directly to the make_examples binary. If you’re using the run_deepvariant one-step approach, add --make_examples_extra_args="normalize_reads=true". Currently we don’t recommend turning this flag on for long reads due to potential runtime increase.
  • Added an --aux_fields_to_keep flag to the make_examples step, and set the default to only the auxiliary fields that DeepVariant currently uses. This reduces memory use for input BAM files that have large auxiliary fields that aren’t used in variant calling. Thanks to @williamrowell and @rhallPB for reporting this issue.
  • Reduced the frequency of logging in make_examples as well as call_variants to address the issue reported in https://github.com/google/deepvariant/issues/491.
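
As a sketch of how the --make_examples_extra_args value shown above might be assembled programmatically, the helper below formats a dictionary of make_examples flags as key=value pairs. The helper name is hypothetical, joining several flags with commas is an assumption of this sketch, and a real run_deepvariant invocation needs its other required flags as well.

```python
def format_extra_args(flags: dict) -> str:
    """Format flag settings as key=value pairs joined by commas.

    Booleans are written as lowercase true/false, matching the
    normalize_reads=true form used in the release note.
    """
    parts = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return ",".join(parts)


extra = format_extra_args({"normalize_reads": True})

# Only the extra-args portion is shown; the remaining required flags
# of run_deepvariant are deliberately omitted from this sketch.
command = [
    "/opt/deepvariant/bin/run_deepvariant",
    f"--make_examples_extra_args={extra}",
]
print(" ".join(command))
```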
deepvariant - DeepVariant 1.2.0

Published by pichuan about 3 years ago

The DeepVariant v1.2 release contains the following major improvements:

  • A major code refactor for make_examples better modularizes common components between DeepVariant, DeepTrio, and potential future applications. This enables DeepTrio to inherit improvements such as --add_hp_channel (introduced to the DeepVariant PacBio model in v1.1; see blog), improving DeepTrio’s PacBio accuracy.
  • The DeepVariant PacBio model has substantially improved accuracy for PacBio Sequel II Chemistry v2.2, achieved by including this data in the training dataset.
  • We updated several dependencies: Python version to 3.8, TensorFlow version to 2.5.0, and GPU support version to CUDA 11.3 and cuDNN 8.2. The greater computational efficiency of these dependencies results in improvements to speed.
  • In the "training" mode for make_examples, we committed a change (https://github.com/google/deepvariant/commit/4a11046de0ad86e36d2514af9f035c9cb34414bf) that fixed an issue introduced in an earlier commit (https://github.com/google/deepvariant/commit/a4a654769f1454ea487ebf0a32d45a9f8779617b) where make_examples might generate fewer REF (class0) examples than expected.
  • Improvements to accuracy for Illumina WGS models for various, shorter read lengths. Thanks to the following contributors and their teams for the idea:
    • Dr. Masaru Koido (The University of Tokyo and RIKEN)
    • Dr. Yoichiro Kamatani (The University of Tokyo and RIKEN)
    • Mr. Kohei Tomizuka (RIKEN)
    • Dr. Chikashi Terao (RIKEN)

Additional detail for improvements in DeepVariant v1.2:

Improvements for training:

  • We augmented the training data for Illumina WGS model by adding BAMs with trimmed reads (125bps and 100bps) to improve our model’s robustness on different read lengths.

Improvements for make_examples:
For more details on flags, run /opt/deepvariant/bin/make_examples --help.

  • Major refactoring to ensure useful features (such as --add_hp_channel) can be shared between DeepVariant and DeepTrio make_examples.
  • Add MED_DP (median of DP) in the gVCF output. See this section for more details.
  • New --split_skip_reads flag: if True, make_examples will split reads with large SKIP cigar operations into individual reads. Resulting read parts that are less than 15 bp are filtered out.
  • We now sort the realigned BAM output mentioned in this section when you use --emit_realigned_reads=true --realigner_diagnostics=/output/realigned_reads for make_examples. You will still need to run samtools index to get the index file, but no longer need to sort the BAM.
  • Added an experimental prototype for multi-sample make_examples.
    • This is an experimental prototype for working with multiple samples in DeepVariant, a proof of concept enabled by the refactoring to join together DeepVariant and DeepTrio, generalizing the functionality of make_examples to work with multiple samples. Usage information is in multisample_make_examples.py, but note that this is experimental.
  • Improved logic for read allele counts calculation for sites with low base quality indels, which resulted in Indel accuracy improvement for PacBio models.
  • Improvements to the realigner code to fix certain uncommon edge cases.
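
The --split_skip_reads behaviour described above can be illustrated with a simplified sketch. This is not DeepVariant's implementation: the function name, the coordinate convention, and the min_skip_len parameter (the release note does not say what counts as a "large" SKIP) are assumptions made for illustration. Only the 15 bp minimum part length comes from the note.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume read bases
REF_OPS = set("MDN=X")    # operations that consume reference bases


def split_on_skips(start, cigar, min_skip_len=1, min_part_len=15):
    """Split one alignment at SKIP (N) operations.

    Returns (reference_start, cigar) tuples for each part whose read
    length is at least min_part_len. min_skip_len is a hypothetical
    threshold for which N operations trigger a split.
    """
    parts = []
    ops = []
    ref_pos = part_start = start
    tokens = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    tokens.append((0, None))  # sentinel that closes the final part
    for n, op in tokens:
        is_split = op is None or (op == "N" and n >= min_skip_len)
        if not is_split:
            ops.append((n, op))
            if op in REF_OPS:
                ref_pos += n
            continue
        read_len = sum(k for k, o in ops if o in QUERY_OPS)
        if read_len >= min_part_len:
            parts.append((part_start, "".join(f"{k}{o}" for k, o in ops)))
        ops = []
        ref_pos += n  # the skipped bases still advance the reference
        part_start = ref_pos
    return parts


# A spliced read: the trailing 10 bp part is shorter than 15 bp and is dropped.
print(split_on_skips(100, "50M1000N40M200N10M"))
```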

Improvements for the one-step run_deepvariant:
For more details on flags, run /opt/deepvariant/bin/run_deepvariant --help.

  • New --runtime_report flag, which enables runtime report output to --logging_dir. This makes it easier for users to get the runtime-by-region report for make_examples.
  • New --dry_run flag, which prints all commands to be executed without running them. This is mentioned in the Quick Start section.
deepvariant - DeepVariant 1.1.0

Published by akolesnikov almost 4 years ago

The v1.1 release introduces DeepTrio, which uses a model specifically trained to call a mother-father-child trio or parent-child duo. DeepTrio has superior accuracy compared to DeepVariant. Pre-trained models are available for Illumina WGS, Illumina exome, and PacBio HiFi.

In addition, DeepVariant v1.1 contains the following improvements:

  • Accuracy improvements on PacBio, reducing Indel errors by ~21% on the case study. This is achieved by adding an input channel which specifically encodes haplotype information, as opposed to only sorting by haplotype in v1.0. The flag is --add_hp_channel which is enabled by default for PacBio.
  • Speed improvements for long read data by more efficient handling of long CIGAR strings.
  • New functionality to add detailed logs for runtime of make_examples by genomic region, viewable in an interactive visualization.
  • We now fully withhold HG003 from all training, and report all accuracy evaluations on HG003. We continue to withhold chromosome20 from training in all samples.

New optional flags to increase speed:

A team at Intel has adapted DeepVariant to use the OpenVINO toolkit, which further accelerates TensorFlow applications. This speeds up the call_variants stage by ~25% for any model when run in CPU mode on an Intel machine. Runs with OpenVINO have the same accuracy as, and nearly identical output to, runs without it. Runs with OpenVINO are fully reproducible on OpenVINO.

To use OpenVINO, add the following flag to the DeepVariant command:

--call_variants_extra_args "use_openvino=true"

We thank Intel for their contribution, and acknowledge the extensive work their team put in, captured in https://github.com/google/deepvariant/pull/363.

deepvariant - DeepVariant 1.0.0

Published by pichuan about 4 years ago

DeepVariant v1.0 releases new features and accuracy improvements sufficiently substantial to indicate a major version of v1.0. Compared to DeepVariant v0.10, these changes reduce Illumina WGS errors by 24%, exome errors by 19%, and PacBio errors by 52%.

deepvariant - DeepVariant 0.10.0

Published by pichuan over 4 years ago

  • Update to Python3 and TensorFlow2: We use Python3.6, and pin to TensorFlow 2.0.0.
  • Improved PacBio model for amplified libraries: the PacBio HiFi training data now includes amplified libraries at both standard and high coverages. This provides a substantial accuracy boost to variant detection from amplified HiFi data.
  • Turned off ws_use_window_selector_model by default: This flag was turned on by default in v0.7.0. After the discussion in issue #272, we decided to turn this off to improve consistency and accuracy, at the trade-off of a 7% increase in runtime of the make_examples step.
    Users may add --make_examples_extra_args "ws_use_window_selector_model=true" to save some runtime at the expense of accuracy.
deepvariant - DeepVariant 0.9.0

Published by sgoe1 almost 5 years ago

  • In the v0.9.0 release, we introduce best practices for merging DeepVariant samples.
  • Added visualizations of variant output for visual QC and inspection.
  • Improved Indel accuracy for WGS and WES (error reduction of 36% on the WGS case study) by reducing Indel candidate generation threshold to 0.06.
  • Improved WES model accuracy by expanding training regions with a 100bp buffer around capture regions and additional training at lower exome coverages.
  • Improved performance for new PacBio Sequel II chemistry and CCS v4 algorithm by training on additional data.

Full release notes:

New documentation:

Changes to Docker images, code, and models:

  • Docker images now live in Docker Hub google/deepvariant in addition to gcr.io/deepvariant-docker/deepvariant.
  • For WES, added 100bps buffer to the capture regions when creating training examples.
  • For WES, increased training examples with lower coverage exomes, down to 30x.
  • For PACBIO, added training data for Sequel II v2 chemistry and samples processed with CCS v4 algorithm.
  • Loosened the restriction that the BAM files need to have exactly one sample_name. Now if there are multiple samples in the header, the first one is used. If there is none, a default is used.
  • Changes in realigner code. Realigner aligns reads to haplotypes first and then realigns them to the reference. With this change some of the haplotypes (with not enough read support) are now discarded. This results in fewer reads needing to be realigned. Theoretically, this fix should improve FP rate. It also helps to resolve a GitHub issue.

Changes to flags:

  • Added --sample_name flag to run_deepvariant.py.
  • Reduced default for vsc_min_fraction_indels to 0.06 for Illumina data (WGS and WES mode) which increases sensitivity.
  • Expanded the use of --reads to take multiple BAMs in a comma-separated list.
  • Use --ref for CRAM by default. (Set --use_ref_for_cram to true by default)
  • Added support for BAM output for realigner debugging. See --realigner_diagnostics and --emit_realigned_reads flags in realigner.py.
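
To illustrate why lowering vsc_min_fraction_indels to 0.06 increases sensitivity, here is a deliberately simplified sketch of a fraction-based candidate test. The function is hypothetical; the real candidate generation in DeepVariant applies further conditions that are not modelled here, and the 0.10 value is only an illustrative stricter threshold.

```python
def is_indel_candidate(alt_support: int, total_reads: int,
                       min_fraction: float = 0.06) -> bool:
    """Return True if the fraction of reads supporting the indel
    reaches min_fraction (0.06 is the default named in the release note)."""
    if total_reads <= 0:
        return False
    return alt_support / total_reads >= min_fraction


# 2 of 30 reads support the indel (about 6.7%).
print(is_indel_candidate(2, 30))                     # passes at 0.06
print(is_indel_candidate(2, 30, min_fraction=0.10))  # fails at a stricter, illustrative 0.10
```

A lower threshold lets weakly supported indels reach the model for classification instead of being discarded before it sees them.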
deepvariant - DeepVariant 0.8.0

Published by gunjanbaid over 5 years ago

With the v0.8.0 release, we introduce a new DeepVariant model for PacBio CCS data. This model can be run in the same manner as the Illumina WGS and WES models. For more details, see our manuscript with PacBio and our blog post.

This release also includes general improvements to DeepVariant and the Illumina WGS and WES models. These include:

  • New script that lets the users run DeepVariant in one command. See Quick Start.
  • Improved accuracy for NovaSeq samples, especially PCR-Free ones, achieved by adding NovaSeq samples to the training data. See DeepVariant training data.
  • Improved accuracy for low coverage (30x and below), achieved by training on a broader mix of downsampled data. See DeepVariant training data.
  • Overall speed improvements which reduce runtime by ~24% on WGS case study:
    • Speed improvements in querying SAM files and doing calculations with Reads and Ranges.
    • Fewer unnecessary copies when constructing De Bruijn graphs.
    • Less memory usage when writing BED, FASTQ, GFF, SAM, and VCF files.
    • Speed improvements in postprocess_variants when creating gVCFs - achieved by combining writing and merging for both VCF and gVCF.
  • Improved support for CRAM files, allowing the use of a provided reference file instead of the embedded reference. See the use_ref_for_cram flag below.

New optional flags:

  • make_examples.py
    • use_ref_for_cram:
      Default is False (using the embedded reference in the CRAM file). If set to True, --ref will be used as the reference instead. See CRAM support section for more details.
    • parse_sam_aux_fields and use_original_quality_scores:
      Option to read base quality scores from OQ tag. To use this option, set both flags to true.
      Standard GATK process includes a score re-calibration stage where base quality scores are re-calibrated using special software. DeepVariant produces slightly better accuracy when the original scores are used. Original scores are usually stored in a BAM file under the optional OQ tag. This feature allows reading quality scores from the OQ tag instead of the QUAL field.
    • min_base_quality:
      Allowed users to try different thresholds for minimum base quality score.
    • min_mapping_quality:
      Allowed users to try different thresholds for minimum mapping quality score.
  • call_variants.py
    • config_string:
      Allowed users to specify estimator session configuration through a flag when running on CPU and GPU, thanks to the contribution of @A-Tsai from ATGENOMIX in #159.
    • num_mappers:
      Allowed users to modify the number of dataset mappers through a flag, thanks to the contribution of @fo40225 from National Taiwan University Hospital in #152.
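
The OQ-versus-QUAL choice described for use_original_quality_scores can be sketched on a plain-text SAM record. This is an illustration only, not how make_examples reads BAM files; the function name and the sample record are made up.

```python
def base_qualities(sam_line: str, use_original: bool = True) -> str:
    """Return the quality string of one tab-separated SAM record.

    If use_original is True and an OQ:Z: optional field is present,
    its value is returned; otherwise the QUAL column (11th field) is used.
    """
    fields = sam_line.rstrip("\n").split("\t")
    qual = fields[10]
    if use_original:
        for tag in fields[11:]:
            if tag.startswith("OQ:Z:"):
                return tag[len("OQ:Z:"):]
    return qual


# Hypothetical record: recalibrated scores in QUAL, original scores in OQ.
record = "read1\t0\tchr20\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tOQ:Z:IIII"
print(base_qualities(record))                      # original scores from OQ
print(base_qualities(record, use_original=False))  # recalibrated scores from QUAL
```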
deepvariant - DeepVariant 0.7.2

Published by akolesnikov almost 6 years ago

  • Htslib updated to v1.9, fixing an outstanding CRAM issue.
  • Fix for the issue of non-deterministic output caused by changing number of shards in the make_example process.
  • Upgrade to TensorFlow v1.12.
  • Speed improvements in make_examples via the use of a flat_hash_map.
  • Speed improvements in call_variants.
  • The genotypes of low-quality (GQ < 20) homozygous reference calls are set to ./. instead of 0/0. The threshold is configurable via --cnn_homref_call_min_gq flag in postprocess_variants.py. This improves downstream cohort merging performance based on our internal investigation in a "Improved non-human variant calling using species-specific DeepVariant models" blog.
  • Google Cloud Runner:
    • Localize BED region files (given via --region flag), fixing an outstanding issue.
    • Make worker logs available in case of a failure inside DeepVariant.
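
The homozygous-reference rule above reduces to a simple threshold check. The sketch below uses a hypothetical function name; the default of 20 mirrors the GQ threshold stated in the note, which is configurable via --cnn_homref_call_min_gq.

```python
def homref_genotype(gq: int, min_gq: int = 20) -> str:
    """Genotype string to emit for a homozygous reference call.

    Calls with GQ below min_gq are written as no-calls (./.)
    instead of 0/0.
    """
    return "0/0" if gq >= min_gq else "./."


for gq in (5, 19, 20, 45):
    print(gq, homref_genotype(gq))
```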
deepvariant - DeepVariant 0.7.1

Published by pichuan almost 6 years ago

  • Fix for postprocess_variants - the previous version crashes if the first shard contains no records.
  • Update the TensorFlow version dependency to 1.11.
  • Added support to build on Ubuntu 18.04.
  • Documentation changes: Move the commands in WGS and WES Case Studies into scripts under scripts/ to make it easy to run.
  • Google Cloud runner:
    • Added batch_size in case the users need to change it for the call_variants step.
    • Added logging_interval_sec to control how often worker logs are written into Google Cloud Storage.
    • Improved the use of call_variants: only one call_variants is run on each machine for better performance. This improved the GPU cost and speed.
deepvariant - DeepVariant 0.7.0

Published by pichuan about 6 years ago

This release includes numerous performance improvements that collectively reduce the runtime of DeepVariant by about 65%.

A few highlighted changes in this release:

  • Update TensorFlow version to 1.9 built by default with Intel MKL support, speeding up call_variants runtime by more than 3x compared to v0.6.
  • The components that use TensorFlow (both inference and training) can now be run on Cloud TPUs.
  • Extensive optimizations in make_examples which result in significant runtime improvements. For example, make_examples now runs more than 3 times faster in the WGS case study than v0.6.
    • New realigner implementation (fast_pass_aligner.cc) with parameters re-tuned using Vizier for better accuracy and performance.
    • Changed window selector to use a linear decision model for choosing realignment candidates. This can be controlled by the flag --ws_use_window_selector_model, which is now on by default.
    • Many micro-optimizations throughout the codebase.
  • Added a new training case study showing how to train and fine-tune DeepVariant models.
  • Added support for CRAM files.
deepvariant - DeepVariant 0.6.1

Published by pichuan over 6 years ago

  • Update the build scripts and header files so that it builds successfully on Debian.
  • Include a script that demonstrates how to build the CLIF binary we released.
  • Update GCP runner's default #cores.
  • Small code fix: Fix the call_variants issue of crashing on empty shards.
deepvariant - DeepVariant 0.6.0

Published by pichuan over 6 years ago

This release has a new WGS model that has major accuracy improvement on PCR+ data. We also released a new WES model that has some minor accuracy improvement.

A few important changes in this release:

  1. Changes in the training data for the WGS model:
    • Addition:
      • 3 replicates of HG001 (PCR+, HiSeqX) provided by DNAnexus
      • 2 replicates of HG001 (PCR+, NovaSeq) from BaseSpace public data.
    • Removal:
      • WES data
        (In v0.5.0, we trained our WGS model with WGS+WES data. This time we found that it didn’t help with WGS accuracy, so we removed it.)
  2. Improved training data labels. See haplotype_labeler.py
  3. For direct inputs/outputs from cloud storage, we no longer support direct file I/O (like gs://deepvariant) due to bugs in htslib. Instead we recommend using gcsfuse to read/write data directly on GCS buckets. See “Inputs and Outputs” in DeepVariant user guide.
deepvariant - DeepVariant 0.5.2

Published by cmclean over 6 years ago

This release is a bugfix release for gVCF creation. See https://github.com/google/deepvariant/issues/58 for details.

deepvariant - DeepVariant v0.5.1

Published by cmclean over 6 years ago

This release fixes issue #27 and adds support for creating the MIN_DP field in gVCF records.

deepvariant - DeepVariant 0.5.0

Published by pichuan over 6 years ago

  1. Release two separate models for calling genome and exome sequencing data. Significant improvement of Indel F1 on exome data.

    • On exome sequencing data (HG002):
      • Indel F1 0.936959 --> 0.961724; SNP F1 0.998636 --> 0.998962
    • On whole genome sequencing data (HG002):
      • Indel F1 0.996632 --> 0.996684; SNP F1 0.999495 --> 0.999542
  2. Provide capability to produce gVCF files as output from DeepVariant [doc]:
    gVCF files are required as input for analyses that create a set of variants in a cohort of individuals, such as cohort merging or joint genotyping.

  3. Training data:
    All models are trained with a benchmarking-compatible strategy: That is, we never train on any data from the HG002 sample, or from chromosome 20 from any sample.

    • Whole genome sequencing model:
      We used training data from both genome sequencing data as well as exome sequencing data.

      • WGS data:
        • HG001: 1 from PrecisionFDA, and 8 replicates from Verily.
        • HG005: 2 from Verily.
      • WES data:
        • HG001: 11 HiSeq2500, 17 HiSeq4000, 50 NovaSeq.
        • HG005: 1 from Oslo University.

      In order to increase diversity of training data, we also used the downsample_fraction flag when making training examples.

    • Whole exome sequencing model:
      We started from a trained WGS model as a checkpoint, then continued training only on the WES data above. We also used various downsample fractions for the training data.

  4. DeepVariant now provides deterministic output by rounding QUAL field to one digit past the decimal when writing to VCF.

  5. Update the model input data representation from 7 channels to 6.

    • Removal of "Op-Len" (CIGAR operation length) as a model feature. In our tests this makes the model more robust to input that has different read lengths.
    • Added an example for visualizing examples.
  6. Add a post-processing step to variant calls to eliminate rare inconsistent haplotypes [description].

  7. Expand the excluded contigs list to include common problematic contigs on GRCh38 [GitHub issue].

  8. It is now possible to run DeepVariant workflows on GCP with pre-emptible GPUs.
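
Item 4 above (rounding QUAL to one digit past the decimal) can be illustrated with a short sketch. The function name and the sample values are hypothetical; the point is that tiny floating-point differences between runs no longer show up in the written VCF, although values that straddle a rounding boundary could still differ.

```python
def format_qual(qual: float) -> str:
    """Format a QUAL value with one digit past the decimal point."""
    return f"{qual:.1f}"


# Two hypothetical QUAL values that differ only by floating-point noise.
run_a = 52.30000001
run_b = 52.29999998
print(format_qual(run_a), format_qual(run_b))
```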

deepvariant - DeepVariant 0.4.1

Published by scott7z almost 7 years ago

This fixes a problem with htslib_gcp_oauth when network access is unavailable.