luntergroup/octopus
Bayesian haplotype-based mutation calling
Bug fixes
- Restores bug fixes from [0.7.2](https://github.com/luntergroup/octopus/releases/tag/v0.7.2) that were accidentally reverted.
Improvements and modifications
- `QUAL` scores less than `1` are now reported to 2 significant figures.
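As a rough illustration of the two-significant-figure rule for small `QUAL` values, here is a minimal Python sketch. It is not Octopus's own formatting code: the function name `format_qual` and the two-decimal formatting used for values of 1 or more are assumptions made for the example.

```python
def format_qual(qual: float) -> str:
    """Format a QUAL value, keeping two significant figures below 1.

    Hypothetical helper: only the "< 1" rule comes from the release note;
    the formatting of values >= 1 is an assumption for illustration.
    """
    if qual <= 0:
        return "0"
    if qual < 1:
        # The 'g' presentation type rounds to a number of significant figures.
        return f"{qual:.2g}"
    return f"{qual:.2f}"


if __name__ == "__main__":
    for value in (0.0456, 0.00123, 0.5, 12.3456):
        print(value, "->", format_qual(value))
```

Significant-figure formatting keeps very small scores distinguishable, whereas a fixed number of decimal places would round them all to zero.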
Changes
- Replaces the old random forest training procedure with a Snakemake version. [e0c492251d85bc9926c2dba2f28c87159da909ad]
- All annotations can now be requested even if they are not active. [9bb584cd7f92a3dcdeb93b009f7a82770dc64eb8]
- Annotations are no longer aggregated by default when `--disable-call-filtering` is used. To aggregate annotations for forest training, use the new `--aggregate-annotations` option. [42f424870307861e6eb9cb8f53245f03357973f6]
- Big runtime improvement to the `cell` calling model. [bc062d24c1e344a09ba42240e206f604ad8ab771]
- Change the default `--max-genotype-combinations` to `100,000` for `trio` and `population` calling. This improves runtime considerably but has little impact on accuracy. [18fcbcbc354479ebf0abe6a2d08e048041eb0918]
- Trains a new germline random forest using much less training data overall but more trio data.
Changes
- Fixes a segmentation fault in the cancer caller caused by adjacent phase blocks with different ploidies. [9bc21a34f980075a9746014a42eddd177c32f325]
- Default to UTC time when no tz database found. [3dbd8cc33616129ad356e99a4dae82e4f6702250]
- Prints annotations when `--annotations` specified with `--help`. [366fe044327504487e50f51224582e82d3fdda1e]
- Fixes a bug causing some reads to be dropped when filtering long haplotype regions. [8f2cc87a731c48c38d71eab25a5a20f6fe82b84b]
- Update the link in the README to the Nature Biotechnology paper! [b8baf136ba9c579c22fc8db8b69b66d350379ae1]
Changes
- Fixes underreporting of *de novo* mutations in trio mode [7d195dee033d178680b5d12b89c54ec5b4b4978e].
- Improves QUAL precision for trio calls [b68c7cf73bbc9500e740d12c4e91783fc0420773].
- Resolves some issues with read counting (e.g. `AD`) on `*` alleles [e0023adf0f1bf0f97af474a0a5016ea213f1cc62].
- Fixes underreporting of more than one somatic haplotype in cancer mode [0c7d06bca5ea072e6b53eb72238624fc5c7bc103].
- Improves coalescent model to allow 2 indel heterozygosity parameters along a haplotype [58db9346ab06216bc4478fa011ccf372f557565e].
- Adds timestamp to VCF output [36ca0b908726486476292308a867dc6bc4e67edb].
- Adds `--architecture` option to `install.py` that sets compiler `march` option [87faea93b08e465c74dd28c7452a66a74a644ed2].
- Default minimum mapping quality filter set to 5.
Changes
- The pair HMM used for the core haplotype likelihood model has been completely re-written to support AVX2 and AVX-512 instruction sets. This can result in some nice performance improvements on machines supporting these instructions. Also, the HMM now supports variable band-widths and 32-bit integer scores, which is necessary to evaluate long reads.
- [Evidence BAMs](https://github.com/luntergroup/octopus/wiki/How-to:-Make-evidence-BAMs) are now annotated with supporting haplotype(s) and other information. Automatic 'splitting' by haplotype is gone but there is a [script] provided to do this.
- Octopus is now paired and linked read aware! Reads are assumed paired by default, but can be assumed unpaired or linked with the [`--read-linkage`](https://github.com/luntergroup/octopus/wiki/Command-line-reference#option---read-linkage) option. This improves accuracy and phasing for most analyses.
- [Random forests](https://github.com/luntergroup/octopus/wiki/Variant-filtering:-Random-Forest) now store the annotations used for training as meta information in the forest file, allowing different annotations to be used for different forests. Note that this change makes previous forest versions incompatible with this version. It also means that a [modified ranger](https://github.com/dancooke/ranger) must be used for training (the main [ranger](https://github.com/imbs-hl/ranger) package does not store variable names in the meta info).
- Allele-level annotations (e.g. `AD`) are now supported; they can be requested with the [`--annotations`](https://github.com/luntergroup/octopus/wiki/Command-line-reference#option---annotations) option.
- The phasing algorithm has been completely re-written to improve accuracy and to allow discontiguous phase sets, which can frequently occur in some analyses (e.g. linked reads, or somatic phasing).
- Calling from PacBio CCS reads is now supported - although improvements are still needed, especially regarding runtime. See the [PacBio CCS config](https://github.com/luntergroup/octopus/blob/develop/resources/configs/PacBioCCS.config).
- The haplotype generator now supports 'backtracking', where a block of partially resolved haplotypes is buffered and then restored when downstream haplotypes have also been partially resolved. This can produce long haplotypes much faster than keeping all haplotypes in the tree simultaneously. Backtracking is turned off by default, but can be enabled with the [`--backtrack-level`](https://github.com/luntergroup/octopus/wiki/Command-line-reference#option---backtrack-level) option.
New features / interface changes
- Adds command line options `--mask-inverted-soft-clipping` and `--mask-3prime-shifted-soft-clipped-heads` for masking 10X Genomics sequencing artefacts. [0b8fb935d93154b624f644940e0375f8c92b62c0, 6566fb2432cdac01fa43e0b217ae985c105993e9]
Improvements
- Reduces runtime in the `VariationalBayesMixtureMixtureModel` used in the `cancer` and `polyclone` calling models by ~20-25% [d9cbcec3c24a9460d709b30ff0ea006d08e55491]
- Switches multi-precision floating point arithmetic in the `cancer` calling model to use the GMP library, resulting in a small speedup. This change adds a dependency on GMP. [e59be9dbf5cd768846b3de1170a185d67d3a06d3]
Bug fixes
- Fixes the read deduplication algorithm so that reads with multiple duplicates are recognised. [01eb88d0acd3e07b33cea4210a1126cef0e0e407]
- Fixes a bug in the `NC` and `SMQ` measures that could cause an exception to be thrown. [a1262e2a14d6c2efa7c7045a470cfe4b3dd7a209, 226b40a272cbd57721c29645b882637ac15474ef]
Bug fixes
- Fixes a bug that results in the cancer calling model throwing an exception when provided with a single candidate haplotype. [59bceefe7acdcfb610a12ea818e1354a2bb1cc42]
- Fixes a bug that could cause a segmentation fault due to the haplotype leaf list becoming corrupted when removing regions from the tree. [ec7ee85224513968dd4ecf53ad5929fd3365a346]
Changes
- The git branch and commit, and some system information, are now logged during compilation. This information is available with the `--version` command. [242dd00549cc27f6629619d5d3bf7c0866ff0c29, 36a6a82c28d99cd17f0e169fa28e876f28b0c82f, c6c397d1ca5d3c50a372c83d6a75ca491ef4e678]
- Adds measure `ADP` for assigned sequence depth (i.e. reads assigned to a unique called allele). [702109ee85f362c6cce739aa0ff3a1be19e436a8]
- Adds measures `ADP` and `VL` to default random forest measures. [a0359530077794cadb3723253f5fb783d9fca975]
- Adds support for gzipped region files (for options `--regions-file` and `--skip-regions-file`) [ec41af47350067808c3c27a434563a948bf652dc]
- Reads that cannot be assigned to a unique haplotype are assigned randomly to any of the supporting haplotypes for bam realignment (rather than always assigning to one of them). [cb3faf950a02f6577e21ae8f42b7b72cf6b4694b]
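To illustrate what accepting gzipped region files can look like from a reader's point of view, here is a minimal Python sketch. It is not Octopus's C++ implementation: the helper name `read_region_lines`, the one-region-per-line layout, and detection via the gzip magic bytes are all assumptions made for the example.

```python
import gzip
from typing import Iterator

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_region_lines(path: str) -> Iterator[str]:
    """Yield non-empty lines from a plain-text or gzipped regions file.

    Hypothetical helper: compression is detected from the file contents
    rather than from the file extension.
    """
    with open(path, "rb") as raw:
        is_gzipped = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzipped else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line
```

Checking the magic bytes means a compressed file is still read correctly if it has been renamed without its `.gz` suffix.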
Bug fixes
- Corrects the `AD` and `AF` measure calculations. [2a6a1065319a4cb0416f2162def18b362d9e09d2, 66e4466350edb5c7c17c2eab218e46f28699293b]
- Adds check for overflow in SIMD pHMM method that could result in segfaults. [08c231049c40f26b6667e3027b37fd7a610cbfe5]
Interface changes
- The `--training-annotations` option is replaced with `--annotations`, which has slightly different behaviour (see below).
- The `--split-bamout` option is removed as `--bamout` realignments now include tags.
- Adds the option `--full-bamout`. [1147e8f72f0fe3613ace63580ee592677f2f8466]
- Adds the option `--refcall-block-merge-threshold` for controlling refcall blocks.
- Renames `--extract-filtered-source-candidates` to `--use-filtered-source-candidates`. [6972ffaf0f5ca88fb56e2b5e5f7a066462f64a37]
Improvements
- Indel error models now include variable gap extension penalties and account for tetra-nucleotide tandem repeats. [8f40fc3d3e8feef5c078d45ec8e17b3ec1955946]
- More built-in sequence error models to choose from, and custom error models (see [wiki](https://github.com/luntergroup/octopus/wiki/How-to:-use-sequence-error-models)). [8f40fc3d3e8feef5c078d45ec8e17b3ec1955946]
- Annotations can now be requested for filtered VCF files using the new `--annotations` option. [c75cbac60cdc27864b29097f1d608d89d31cbb68]
- Reference calling now outputs calls in adaptive blocks using the new `--refcall-block-merge-threshold` option. [9127cf3f6f5a11f0268e14b9e51f177b9d1825d3]
- Better handling of temporary BCF files in multithreaded mode helps prevent system errors due to too many open files (addresses issue #52). [42fa364aa3ca64ea25613458c8f6a45dbab5f34f]
- Adds annotations to realigned evidence BAMs (see [wiki](https://github.com/luntergroup/octopus/wiki/How-to:-Make-evidence-BAMs)). [c047e96978d40234314cf82bcf35f12d835c3b6f]
Interface changes
- Renames option `--download` to `--download-forests` in Python installation script. [166b6bea998242a821a23074f04253266976d297]
- Adds command line option `--temp-directory-prefix` for setting name of temporary directory. [69ea9e73c41b73ffd6f93d9c1df9ba13588d21de]
- Adds `cell` caller prototype (undocumented). [9f496d30e01e12146e5face4a010f42ee0407c23]
Improvements
- The assembler now adjusts its support threshold depending on depth, which should prevent too many false candidates from high-depth samples. [d582cbc4b3880822b87dac74a81bd5ae5f75851c, 91930cb7231981a33524b75dcecd9aff582b8755]
- Adds `--install-dependencies` to the Python installation script, which results in all dependencies being installed locally. [81c6535e47a4ba3d772e05b5afdfea83f884cd0d, 5ac69585f0da395baee1a73d9c7bd621bfd149a3]
- Allows candidates only seen on one strand if there are only reads seen in that direction overlapping the region. Addresses #45. [44c5c2268d4802b6dda5669752e6f91292b6c659]
Bug fixes
- Resolves exception when merging temporary files for contigs containing `:` (e.g. HLA-A*01:01:01:01). Resolves #44. [https://github.com/luntergroup/octopus/commit/37a3329239518a07b2b9ca62e28a7570d9773667]
- Stops output of IUPAC ambiguity symbols, which are not permitted by VCF specification. Resolves #46. [da055138540dc06be2d66960d3cc0f812788ff70]
- Should prevent exception being thrown during filtering caused by short haplotype for realignment (see issue raised in #41). [0942ddeb10e2a271d528da5e9867ce43637bace1]
Improvements
- Makes RFQUAL a FORMAT field rather than an INFO field, so each sample gets an RFQUAL. [81b75ea3af852e809c789a8a924b3aa0f9791264]
- Installation can now be to any location. Resolves #36. [18b36eaaf789d294ddfe3514fbb0d4e03c4eeccd]
- Installation script can now be given htslib root location. Resolves #38. [93ba0000b56cad3c490d3e8ce5b400180a6a0e46]
- Installation script now tries both `cmake3` and `cmake`. Resolves #37. [30a3ffb11ad216e108efefcd70549130939655bb]
- Installation script now properly downloads provided random forests. [3599999dbedce18520389f656a352edf866fd1c7]
Bug fixes
- Fixes bug in htslib float field extraction that could corrupt FORMAT and INFO values. [0078c5a1e2fbd1abd43a65315f4de216fbf4fa9b]
- Fixes bug in the *de novo* mutation model that could lead to segmentation faults. [7d945f80a7ae4db3a5b107619801e76ec93e0033, ad996a28631fc3ce58912e20db38e7b424fa61b1]
- Fixes bug that causes conflicting call exception due to calling variants in skipped regions. [01477d9a89111b01bb81f741d8f8f334e66f0a0b]
- Fixes bug in *de novo* contamination measure that could cause segmentation faults. [f2e6610f76d4d6b5be68c9c278be0a16a7bbb162]
Improvements
- Adds support for allosomes in the trio calling model. [41b72b22663d91e08b63121ba2a7624485a99003]
- Moves the `RFQUAL` random forest score to the `FORMAT` field, so there is now one score for each sample. [81b75ea3af852e809c789a8a924b3aa0f9791264]
- Adds new measures: `RTB`, `REB`, `BMC`, `BMF`. [f1000d4410fe62e8b0b8bd0080d4720b81024710, 1a937f2f48aad986ea76b799f666155da4fccc08]
- Improves temp directory cleanup on failed runs. [95016f138b27f8b683c511025d83ec539dc2cc0f]
- Makes random forest training a little easier by adding default measure lists to training scripts and by allowing the argument `forest` to the `--training-annotations` option (renamed from `--csr-train`). [519be06afbc7ddc3c70b4a5da899a22d18391b5c, 106e3443c4ed5319d84f621b5b0eaf50c46db179]
- Changes some of the UMI config settings to reduce runtimes (at a minor expense of accuracy). [36e47295df27ab19f51e2207eaee4842ab88ca32]
Interface changes
- Renames `--csr-train` option to `--training-annotations`. [a1f8c45878ac1ca05c496f4b6b6c344c21a1ab10]
- Adds version numbers to provided random forests. [3599999dbedce18520389f656a352edf866fd1c7]
- Renames the `RPB` measure to `RSB`. [c52c6e8cb220e5db1171aa857141617b8aedf7c4]
Bug fixes
- Resolves a libc++ bug where subnormal `double`s are not parsed properly, causing errors when using random forest filtering. [dc137542403b7c9af73257151472936ccd5a0844]
- Fixes a possible segmentation fault when using the `MQD` measure. [45b9b742d09cb037ffa605c719695ae22d94a066]
- Fixes a VCF reading bug that could mangle `INFO` and `FORMAT` fields with multiple values. [0078c5a1e2fbd1abd43a65315f4de216fbf4fa9b]
General
- Overhaul of the indel mutation model which controls priors on germline, somatic, and *de novo* mutations. Gap open and extensions conditional on local repeat context and current gap length are modelled. [bd0eb24bfd09efacbadb306af3a0af15827b7015, 20f5d9ff1facdfee90b2c88b8b603986b0e01fce]
- A brand new candidate variant generator! Named *RepeatScanner*, this generator looks for likely misaligned SNV runs in microsatellites and proposes indels. This can result in more biologically realistic calls in these regions. This generator is controlled with the `--repeat-candidate-generator` command line option. [2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd, 2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd]
- Evidence BAMs for multi-sample input, including 'split' evidence BAMs. [face5fb7d7627154b1628f11a4aed64cd25a51ad, e56641c75c92cd104463fd3435a0fea0d3807793]
- The way `QUAL` is calculated in the cancer and trio models has been improved. Previously `QUAL` was the posterior probability the called alt allele segregated and is classified correctly. This could lead to low `QUAL` scores if the classification was uncertain (e.g. in tumour-only samples). `QUAL` is now simply the posterior probability the allele segregates. There is also a new annotation for all cancer caller calls, and `DENOVO` trio calls, `PP`, that is equivalent to the old `QUAL`. [905c96b7362ba2513c920e33d896751490cc32f0, 3b28e9fe85af4aef4408cb3b31c959408a0ba129, 0d1537b9012326d4e8e3d98d718e0f81ff73219e]
- Candidate variant generators are now more sensitive to very low frequency variation (<1% VAF). [d3e36316c47c7d736fde48611baaef408f7078c9]
- `SOMATIC` calls have a new annotation, `MAP_VAF`, which reports the maximum a posteriori VAF estimate.
- New measures to use for threshold and random forest filtering. [11ff14faaa141ddb290dd31f6a2686adf5f51269]
- Complete refactor of the core cancer caller genotype models results in some runtime improvements. [d3e5a5a0fc11e3462b63de8e7cc6c3c36080c006]
Bug fixes
- Fixes a bug that could lead to segmentation faults during haplotype generation. [1ecd74e7a45f2337426728e90bf5a3c90f52592a]
- Fixes a problem reading lists of floats from VCF files that could result in garbage output (e.g. for `VAF_CR`) [e361f5065da83a9d1febabf4dcac9c7578dc3e8e].
- Fix GCC 8 warning which caused compile error. [58b51fd14b73bf5dbcd8f50a4d9704f39acf985f, 3733b09e643de92226010dd866006786fd609375]
- Fixes some instances of compiler based non-determinism that could result in different results between compilers. [d01819396161e76a14cc1605d63da2abf35901aa, e66169e5724ce3251fd3071a01a5d5e8e1db1599]
Interface changes
- Adds command line option `--max-vb-seeds` which controls the maximum number of seeds the Variational Bayes based genotype model algorithms can use. [95c66a2ec89fe37adb8a4707d15b69bf17f25563]
- Adds `--split-bamout` for split realigned BAMs. Split BAMs are no longer requested by specifying a prefix to `--bamout`. [34d8a89748cd363e967cea89774531efa73a9dbb]
- The measure `SC` has been renamed to `NC` (Normal Contamination). [23497c3aaf0c93c9ca633f96778f8f74c4a5a4b3]
- Adds `--mask-tails` for unconditionally masking bases of all read tails. [acfddaf1b5e910496b737f3dd6cab2667dadae4b]
- Adds `--tumour-germline-concentration` which may be used to control shape of prior distribution on haplotype mixture frequency of tumour samples. Only really relevant to high depth tumour-only calling. [9f83ca6fce24ced6ea901845f3c474ecfc6a1867]
- Renames `--snv-denovo-mutation-rate` to `--denovo-snv-mutation-rate` and `--indel-denovo-mutation-rate` to `--denovo-indel-mutation-rate`. [4b9d95f448ef1f8d2375947a58d664850a868c18]
- Adds `--repeat-candidate-generator` to control new repeat candidate generator. [2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd]
Miscellaneous
- There is now a `configs` directory in the main project directory that contains pre-written configs for calling certain types of data. [9da036416ff2bd7a36f5f734aebbd391df7c48f4]
Bug fixes
- Fixes a bug in v0.4.0-alpha where germline calls may be hard filtered when using threshold filtering.
New features
- New [polyclone](https://github.com/luntergroup/octopus/wiki/Calling-models:-Polyclone) calling model for bacterial and viral data.
- New [population](https://github.com/luntergroup/octopus/wiki/Calling-models:-Population) calling model with Hardy-Weinberg priors.
- [Random forest filtering](https://github.com/luntergroup/octopus/wiki/Variant-filtering:-Random-Forest) for germline and somatic variants using [ranger](https://github.com/imbs-hl/ranger).
- Generate an 'evidence' BAM for single sample calling with the `--bamout` option. See the [wiki page](https://github.com/luntergroup/octopus/wiki/How-to:-Make-evidence-BAMs) for details.
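For readers unfamiliar with the Hardy-Weinberg priors mentioned for the population calling model, the following Python sketch computes the textbook Hardy-Weinberg genotype probabilities from allele frequencies. It shows the general principle only and makes no claim about how Octopus implements its priors; the function name `hardy_weinberg_priors` and the example frequencies are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod
from typing import Dict, Tuple


def hardy_weinberg_priors(
    allele_freqs: Dict[str, float], ploidy: int = 2
) -> Dict[Tuple[str, ...], float]:
    """Return the Hardy-Weinberg probability of every unordered genotype."""
    priors = {}
    for genotype in combinations_with_replacement(sorted(allele_freqs), ploidy):
        counts = Counter(genotype)
        # Multinomial coefficient: the number of distinct orderings of the
        # alleles that make up this unordered genotype.
        orderings = factorial(ploidy)
        for count in counts.values():
            orderings //= factorial(count)
        priors[genotype] = orderings * prod(
            allele_freqs[allele] ** count for allele, count in counts.items()
        )
    return priors


if __name__ == "__main__":
    # Diploid, biallelic site: expect p^2, 2pq, q^2.
    for genotype, prior in hardy_weinberg_priors({"A": 0.9, "T": 0.1}).items():
        print("/".join(genotype), round(prior, 4))
```

For a diploid biallelic site this reduces to the familiar p², 2pq, and q² terms, and the probabilities sum to one for any ploidy.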
Calling improvements
- The cancer caller can now model more than one somatic haplotype which improves calling sensitivity, and also allows somatic phasing. See [cancer calling model wiki](https://github.com/luntergroup/octopus/wiki/Calling-models:-Cancer) for more details.
- Optimisation of the cancer model improves sensitivity for low frequency mutations.
- New unified indel mutation model used for germline, de-novo, and somatic indel calling.
- New filter measures. See [wiki](https://github.com/luntergroup/octopus/wiki/Variant-filtering) for full list.
- Tumour-only calling now much faster and more accurate.
- Uses variant prior model to deduplicate haplotypes for all models, resulting in more biologically realistic calls.
- `DENOVO` and `SOMATIC` calls now get different filtering treatment to regular germline variants using threshold filters.
Interface changes
- Added `--forest-file` and `--somatic-forest-file` for random forest filtering.
- Added `--somatics-only` to report only `SOMATIC` variants.
- Added `--denovos-only` to report only `DENOVO` variants.
- Added `--max-somatic-haplotypes` which limits the number of somatic haplotypes that may be used by the `cancer` calling model.
- `--consider-reads-with-unmapped-segments` --> `--no-reads-with-unmapped-segments` and `--consider-reads-with-distant-segments` --> `--no-reads-with-distant-segments`. These filters are now off by default.
- `--max-cancer-genotypes` removed and replaced with `--max-genotypes`, which is also used by the `polyclone` calling model.
- Added `--max-clones` option for specifying the maximum number of clones for the `polyclone` calling model.
- Added `--somatic-filter-expression`, `--denovo-filter-expression`, and `--refcall-filter-expression` which may be used for hard filtering `DENOVO` and `SOMATIC` calls.
New features
- CSR filtering can be run on a user supplied octopus VCF file, without running calling (`--filter-vcf` command line option).
- Micro-inversions and complex rearrangements are callable.
Calling improvements
- Better handling of variants in tandem repeat regions. In particular, many cases that would previously have been called as a series of SNVs are now called as an insertion-deletion pair, which is more biologically plausible.
- Improved the SNV error model to stop some true heterozygous SNVs being called as homozygous.
Runtime improvements
- CSR filtering is fully parallelised. Like for calling, this is activated with the `--threads` command. This resolves #13.
Bug fixes
- Various fixes to the way haplotypes are reconstructed from VCF, which previously led to some edge cases being misclassified.
Interface changes
- The helper Python install script `install.py` is now supplied with both a C++ and C compiler with the `cxx_compiler` and `c_compiler` commands respectively.
- Supplementary alignments are now filtered by default (`--no-supplementary-alignments` changes to `--allow-supplementary-alignments`).
- Secondary alignments are now filtered by default (`--no-secondary-alignments` changes to `--allow-secondary-alignments`).
Other changes
- htslib is now linked dynamically by default, which means its requirements do not need to be explicitly linked also. This resolves #16. Be sure to clean any CMake caches before rebuilding (`--clean` with Python install script).
- `.vcf.gz` index files are now in the `.tbi` format, rather than `.csi`.
Bug fixes
- Fixes issue #11 where octopus hangs after calling variants.
- Fixes issue #17 where contig names containing a colon could not be parsed.
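Contig names that contain a colon are awkward because the colon also separates the contig from the interval in a region string. The Python sketch below shows one way to resolve the ambiguity by consulting the set of known contig names. It is an illustration under stated assumptions, not the parser Octopus uses: `parse_region`, the `contig:begin-end` syntax handled here, and the example contig set are all hypothetical.

```python
from typing import Optional, Set, Tuple

Region = Tuple[str, Optional[int], Optional[int]]


def parse_region(text: str, contigs: Set[str]) -> Region:
    """Parse 'contig' or 'contig:begin[-end]' where contig may contain ':'."""
    # A string that exactly matches a known contig denotes the whole contig,
    # even if its last colon-separated field looks like a position.
    if text in contigs:
        return (text, None, None)
    # Otherwise only the LAST colon can separate the contig from the interval.
    contig, separator, interval = text.rpartition(":")
    if not separator or contig not in contigs:
        raise ValueError(f"unknown contig in region: {text!r}")
    begin_text, _, end_text = interval.replace(",", "").partition("-")
    begin = int(begin_text)
    end = int(end_text) if end_text else None
    return (contig, begin, end)


if __name__ == "__main__":
    known = {"chr1", "HLA-A*01:01:01:01"}
    print(parse_region("HLA-A*01:01:01:01", known))
    print(parse_region("HLA-A*01:01:01:01:100-200", known))
```

Splitting on the first colon, or on every colon, would misread the HLA-style name as a contig followed by coordinates.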
Performance improvements
- Gap open penalties are now more consistent in tandem repeats, which can improve calling performance in some cases.
- Decreased the minimum probability cap for the *de novo* mutation model, which seems to result in more sensitive *de novo* and somatic mutation calls.
Interface changes
- Somatic SNV and INDEL mutation rates are now specified separately via the command line.
Requirement changes
- Updates CMake requirement to 3.9 so that IPO checks can be used.
- Updates Boost requirement to 1.65 for bug fixes and better program option formatting.
- Updates GCC requirement to 6.3 to avoid bug in 6.2.
Performance improvements
- Significantly improves runtime performance of tumour calling model.
- Improves masking of noisy regions which can slow down calling.
- Slightly improves CSR runtime performance.
Other changes
- Fixes various warnings from new Clang and GCC compilers.
- Can now build with compiler sanitizer flags.
- Adds a Dockerfile.
New features
- Variant filtering: Octopus now has simple threshold based filtering which is turned on by default. This can dramatically reduce the false positive rate in some datasets (e.g. Platinum genomes).
- The population model now uses an independence-based genotype model. Although this doesn't offer true joint calling, it at least offers consistent output until such time as a proper model is implemented.
- Somatic mutation calling is now significantly faster and more accurate due to model optimisation.
Bug fixes
- Fixed a bug with haplotype filtering that could cause haplotypes not to be filtered, and also result in inconsistent results between runs.
Other changes
- VCF records now include AC and AN INFO fields.
- Added an official logo!
- Protect called haplotypes from filtering when using holdouts.
- Octopus will now always emit a call if the variant posterior is above the given threshold, even if the homozygous reference genotype is MAP.
- The max QUAL is now 10000.
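The `AC` and `AN` INFO fields mentioned above have standard meanings in the VCF specification: `AN` is the total number of called alleles across samples, and `AC` is the count of each ALT allele. The Python sketch below computes both from `GT` strings. It is a generic illustration, not Octopus code, and the function name `allele_counts` is hypothetical.

```python
import re
from typing import List, Tuple


def allele_counts(genotypes: List[str], num_alts: int) -> Tuple[List[int], int]:
    """Compute (AC, AN) from VCF GT strings such as '0/1' or '1|1'."""
    ac = [0] * num_alts  # one counter per ALT allele, in ALT order
    an = 0               # number of called (non-missing) alleles
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing alleles contribute to neither AC nor AN
            an += 1
            index = int(allele)
            if index > 0:  # index 0 is the reference allele
                ac[index - 1] += 1
    return ac, an


if __name__ == "__main__":
    print(allele_counts(["0/1", "1|1", "./.", "0|2"], num_alts=2))
```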
New features
- A new de novo mutation model that includes context dependent indel gap open and extension penalties, calculated using an exponential model. There are now two options that parametrise the model: `snv-denovo-mutation-rate` and `indel-denovo-mutation-rate`. Gap open and extension penalties are weighted based on context.
Bug fixes
- Fixes a bug that could prevent a legacy VCF being made.
- Corrects a region difference method that sometimes resulted in incorrect 'skip region' deduction, which could lead to an exception being thrown.
- Fixes a bug that resulted in an incorrect trio model posterior probability.
- Fixes some numerical overflow/underflow bugs that resulted in undefined behaviour.
Other changes
- Increases `max-joint-genotypes` to 1,000,00.
The first 0.2 series release, marking the increased stability of the core octopus algorithm, and the near completion of the trio calling model. A few bugs in 0.1.13-alpha (the last 0.1 series release) have been fixed, including a minor memory leak.
Bug fixes
- Fixes a bug which could cause very large regions to be considered, resulting in memory overflow due to too many reads being read.
- A bug causing incorrect de novo haplotype priors due to overflow in the simd alignment routine has been resolved by capping the de novo mutation probability.
- Fixes a few subtle bugs in haplotype generation.
Improvements
- The assembler now removes more false positive bubbles due to cycles.
- De novo mutations which cause reversion to the reference are flagged as REVERSION in the INFO field.
- Candidates from raw cigar alignments are now filtered depending on a calculated probability of read misalignment.
Other changes
- The assembler is now on by default.
- The penalty for matching to an 'N' has been set to 2.
- More informative debug logging.
- Unmapped reference contigs can now be ignored with the `--ignore-unmapped-contigs` option.
Improvements
- Thread management has been improved; there is now a dedicated VCF writing thread, and tasks can be run as soon as they are created, rather than having to wait for all tasks on a contig to be made.
- A new de novo mutation model has been introduced which reduces false positive indels.
- Haplotype generation now recognises regions which are likely to contain 'interacting' variation - basically non-independent indels - and groups them together when possible.
- Some minor efficiency improvements to the trio model should see it run slightly faster.
- Regions which contain massive amounts of variation are now automatically excluded from lagging. This can stop runtimes exploding in very difficult regions.
Bug fixes
- Resolves a bug which could cause VCF records to contain a single alt '*' allele.
- Closed an infinite loop in the assembler.
- Fixes a bug in VCF reading which could drop a record.
Other changes
- There is now a `--very-fast` option in addition to `--fast`.
- The default read memory footprint has been increased to 6GB.
Improvements
- The assembler has been overhauled. It now only tries to assemble regions that look difficult (i.e. those where the alignments are bad). Sensitivity and specificity have also been greatly improved with better pruning and bubble extraction algorithms. The assembler will now also be run in parallel if `--threads` is used.
- The trio model has been significantly improved; it is now slower, but offers far greater sensitivity and specificity. Overall performance can be adjusted with the `--max-joint-genotypes` option.
- The holdout mechanism has been improved, and can now deal with much larger holdout depths.
- Haplotype generation has been improved; regions interacting with large indels will be called together if possible.
Bug fixes
- Some bugs which resulted in incorrect VCF output have been fixed.
- Fixes a bug in read extraction which could cause segmentation faults.
- Fixes some bugs in haplotype generation that could cause segmentation faults or exceptions.
- Fixes a bug where a raw VCF source variant file resulted in an exception.
Other changes
- Htslib version 1.4 is now a requirement as this version fixes a bug that was causing octopus to stall.
- The default number of haplotypes (`--max-haplotypes`) has been increased to 200.
- The default value for `--haplotype-extension-threshold` has been reduced to 100.
- Mapping quality modelling can be turned off with the `--model-mapping-quality` option.
Bug fixes
- Fixes a bug introduced in v0.1.9-alpha that causes segmentation faults or undefined behaviour, triggered by read pairs that overlap sufficiently to contain adapter sequence and also contain indel variation in the overlapping subregion.
- Corrects a calculation in soft clip masking.
Improvements
- The runtime performance of the assembler has been slightly improved.
- When using the assembler, soft clipped read tails with low average quality will be masked.
Other changes
- The default value for `--assembler-mask-base-quality` has been changed to 5.
Improvements
- The way octopus deals with overlapping read templates (pairs) has been overhauled, resulting in significant calling improvements.
- Read filtering is now less strict, with more emphasis on recalibrating reads with non-modelled errors.
Other changes
- The assembler is now switched off by default. It will be re-activated at some point in the future.
New features
- Adds more options for `--phasing-level` which control the length of haplotypes.
Improvements
- The phasing algorithm has been overhauled; it is now approximately 5x faster.
- The downsampling procedure has been significantly improved; it is now orders of magnitude faster.
- The read likelihood calculation flank score adjustment has been improved, which gives minor calling improvements.
- Minor improvements to BAM/CRAM reading.
Bug fixes
- A bug introduced in the phasing algorithm in v0.1.6-alpha, which resulted in wrong phasing in many cases, has been fixed.
- Fixes a floating point overflow error in HaplotypeGenerator which could result in huge memory spikes.
- Fixes a bug in CoverageTracker which could cause segmentation faults.
Bug fixes
- Fixes a bug which in rare cases could lead to an infinite loop.
- Fixes an error in the trio caller genotype posterior calculation, meaning samples would often get assigned a GQ of zero.
- Fixes a performance bug in the trio model.
New features
- A pedigree (.ped) file can now be specified with the option `--pedigree`. This can be used to invoke the trio calling model instead of specifying maternal and paternal samples. If exactly three samples are given, and the given pedigree contains those samples as a trio, then the trio calling model will be used.
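The trio detection rule described above (exactly three samples that form a child, father, and mother in the pedigree) can be sketched in Python as follows. This is an illustration only, not Octopus's implementation. It assumes the conventional whitespace-separated PED columns (family, individual, father, mother, ...) with `0` for an unknown parent; the names `parse_ped`, `find_trio`, and `Trio` are hypothetical, and the sample IDs in the example are purely illustrative.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Trio(NamedTuple):
    child: str
    father: str
    mother: str


def parse_ped(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each individual to its (father, mother) IDs from PED lines."""
    parents = {}
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 4:
            continue  # skip comments and malformed lines
        _family, individual, father, mother = fields[:4]
        parents[individual] = (father, mother)
    return parents


def find_trio(samples: List[str], parents: Dict[str, Tuple[str, str]]) -> Optional[Trio]:
    """Return the trio formed by exactly three samples, if the pedigree has one."""
    if len(samples) != 3:
        return None
    for child in samples:
        father, mother = parents.get(child, ("0", "0"))
        # The child plus both parents must be exactly the three given samples.
        if {child, father, mother} == set(samples):
            return Trio(child, father, mother)
    return None


if __name__ == "__main__":
    ped = [
        "FAM1 NA12878 NA12891 NA12892 2 0",
        "FAM1 NA12891 0 0 1 0",
        "FAM1 NA12892 0 0 2 0",
    ]
    print(find_trio(["NA12891", "NA12878", "NA12892"], parse_ped(ped)))
```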
Other changes
- The default target read memory footprint limit (`--target-read-buffer-footprint`) has been increased to 4GB.
- Static library linking is now possible by adding the `--static` option to the python installer.
- Compiler warnings have been eliminated for Clang.
New features
- A prototype population model has been implemented.
- Uniform genotype priors are available with the `--use-uniform-genotype-priors` command line option.
- Switchable sequence error model (only X10 alternative at the moment).
- Multiple source variant files can now be supplied.
Enhancements
- The holdout mechanism has been improved, which results in far fewer skipped regions due to haplotype overflow.
- The runtime performance of the trio caller has been significantly improved.
- The runtime performance of the phaser has been significantly improved.
- Genotype posteriors are now capped at 255 rather than 99.
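Assuming the cap on genotype posteriors applies to Phred-scaled values as reported in VCF fields such as `GQ` (an interpretation, not something the note states), the effect of raising the cap from 99 to 255 can be sketched in Python. The function name `phred_scaled` is hypothetical and this is not Octopus's code.

```python
import math


def phred_scaled(error_prob: float, cap: float) -> float:
    """Convert an error probability to a Phred-scaled score, limited to cap."""
    if error_prob <= 0.0:
        return cap  # a zero error probability would otherwise be infinite
    return min(-10.0 * math.log10(error_prob), cap)


if __name__ == "__main__":
    # The same very confident genotype under the old and new caps.
    print(phred_scaled(1e-30, cap=99))
    print(phred_scaled(1e-30, cap=255))
```

With the lower cap, every error probability below about 1e-9.9 collapses to the same score, so a higher cap preserves more of the distinction between confident and very confident calls.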
Bug fixes
- Fixes an issue with active region boundary insertions which could cause missing genotype calls.
- Corrects somatic prior calculation which resulted in many false positive somatic calls.
- Fixes some issues which could result in HaplotypeTree exceptions.
Bug fixes
- Fixes bug in assembler which could cause segmentation faults.
- Fixes bug in haplotype generation which could cause exceptions.
- Fixes bad VCF genotype output for somatic variants (e.g. 2|0).
- Prevents spurious "Not a BGZF file" warning.
- Building fixes.
Key improvements
- Many improvements have been made to the cancer calling model. In particular specificity has been significantly improved. A model posterior is now reported by default which can be used to hard filter calls. Runtime has been improved, but still needs much work.
- The assembler can now selectively extract bubbles of a minimum weight, increasing both specificity and sensitivity.
Other changes
- Command line options have been added for the new features.
- The command line options `--debug` and `--trace` can now take arguments (file paths) to write the respective logs to.
- Downsampling limits have been increased.
- Some other command line option names and default values have changed.
