GitPedia

Somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq

From bioinform·Updated June 6, 2026·View on GitHub·

SomaticSeq is an ensemble somatic SNV/indel caller that can use machine learning to filter out false positives from other callers. It also comes with a suite of [genomic utilities](somaticseq/utilities/README.md). The detailed documentation is located in [docs/Manual.pdf](docs/Manual.pdf "User Manual"). The project is written primarily in Python, is distributed under the BSD 2-Clause "Simplified" License, and was first published in 2015. Key topics include cancer-genomics and somatic-variants.

Latest release: v3.12.0 (April 25, 2026)

SomaticSeq


Training data for benchmarking and/or model building

In 2021, the FDA-led MAQC-IV/SEQC2 Consortium produced multi-center,
multi-platform whole-genome and whole-exome sequencing data sets for a pair of
tumor-normal reference samples (HCC1395 and HCC1395BL), along with a
high-confidence somatic mutation call set. This work was published in
Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151-1160 (2021)
(PMID:34504347; Free Read-Only Link).
The following are some of the use cases for these resources:


Recommendation of how to use SEQC2 data to create SomaticSeq classifiers.

<hr> <table style="width: 100%;"> <tr> <td>Briefly explaining SomaticSeq v1.0</td> <td>SEQC2 somatic mutation reference data and call sets</td> <td>How to run <a href="https://precision.fda.gov/home/apps/app-G7XVKQQ02v051q5PK3yQYJKJ-1">SomaticSeq v3.6.3</a> on precisionFDA</td> </tr> <tr> <td><a href="https://youtu.be/MnJdTQWWN6w"><img src="docs/SomaticSeqYoutube.png" width="400" /></a></td> <td><a href="https://youtu.be/nn0BOAONRe8"><img src="docs/workflow400.png" width="400" /></a></td> <td><a href="https://youtu.be/fLKokuMGTvk"><img src="docs/precisionfda.png" width="400" /></a></td> </tr> <tr> <td></td> <td></td> <td>Run in <a href="https://youtu.be/F6TSdg0OffM">train or prediction mode</a></td> </tr> </table> <hr>

Installation

Dependencies

This Dockerfile lists the dependencies:

  • Python 3, plus pysam, numpy, scipy, pandas, and xgboost libraries.
  • BEDTools: required when parallel
    processing is invoked, and/or when any bed files are used as input files.
  • Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
  • Optional: R and the ada package are required only for AdaBoost, whereas
    XGBoost (the default) is implemented in Python.
  • To install SomaticSeq, clone this repo, cd somaticseq, and then run
    pip install . (To install extra packages for development:
    pip install '.[dev]'). A number of commands prefixed with somaticseq_ will
    be placed into the PATH.

To install using pip

Make sure to install bedtools separately.

pip install somaticseq

To install the bioconda version

SomaticSeq is also available on bioconda. Installing from bioconda also
automatically installs a number of third-party somatic mutation callers:

conda install -c bioconda somaticseq

To install from GitHub source with conda

conda create --name venv -c bioconda python bedtools
conda activate venv
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
pip install -e .

Test your installation

If installed successfully, you will be able to run somaticseq --help in the
terminal. Also make sure bedtools is executable. There are some toy data sets
and test scripts in example that should finish in <1 minute
if installed properly.
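
The check described above can also be scripted. The following is a minimal sketch, not part of SomaticSeq: the helper name check_executables is hypothetical, and it only verifies that the somaticseq and bedtools executables can be found on the PATH, not that they run correctly.

```python
import shutil


def check_executables(names=("somaticseq", "bedtools")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_executables().items():
        # shutil.which returns None when the executable cannot be found.
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

If either entry is reported as not found, revisit the installation steps before running the toy examples.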

Run SomaticSeq with an example command

  • At minimum, given the results of the individual mutation caller(s), SomaticSeq
    will extract sequencing features for the combined call set. Required inputs
    for command somaticseq are:

    • --output-directory and --genome-reference, then
    • Either paired or single to invoke paired or single sample mode,
      • if paired: --tumor-bam-file, and --normal-bam-file are both
        required.
      • if single: --bam-file is required.

    Everything else is optional (though without a single VCF file from at least
    one caller, SomaticSeq does nothing).

  • The following four files will be created in the output directory:

    • Consensus.sSNV.vcf, Consensus.sINDEL.vcf, Ensemble.sSNV.tsv, and
      Ensemble.sINDEL.tsv.
  • If you're searching for pipelines to run those individual somatic mutation
    callers, feel free to take advantage of our
    Dockerized Somatic Mutation Workflow
    as a start.

    • Important note: multi-argument options (e.g., --extra-hyperparameters or
      --features-excluded) cannot be placed immediately before paired or
      single, because those options would try to "grab" paired or single
      as an additional argument.
# Merge caller results and extract SomaticSeq features
somaticseq \
  --output-directory  $OUTPUT_DIR \
  --genome-reference  GRCh38.fa \
  --inclusion-region  genome.bed \
  --exclusion-region  blacklist.bed \
  --threads           24 \
paired \
  --tumor-bam-file    tumor.bam \
  --normal-bam-file   matched_normal.bam \
  --mutect2-vcf       MuTect2/variants.vcf \
  --varscan-snv       VarScan2/variants.snp.vcf \
  --varscan-indel     VarScan2/variants.indel.vcf \
  --jsm-vcf           JointSNVMix2/variants.snp.vcf \
  --somaticsniper-vcf SomaticSniper/variants.snp.vcf \
  --vardict-vcf       VarDict/variants.vcf \
  --muse-vcf          MuSE/variants.snp.vcf \
  --lofreq-snv        LoFreq/variants.snp.vcf \
  --lofreq-indel      LoFreq/variants.indel.vcf \
  --scalpel-vcf       Scalpel/variants.indel.vcf \
  --strelka-snv       Strelka/variants.snv.vcf \
  --strelka-indel     Strelka/variants.indel.vcf \
  --arbitrary-snvs    additional_snv_calls_1.vcf.gz additional_snv_calls_2.vcf.gz ... \
  --arbitrary-indels  additional_indel_calls_1.vcf.gz additional_indel_calls_2.vcf.gz ...
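
When a command like the one above is assembled programmatically, the ordering rule from the important note (multi-argument options must not sit immediately before paired or single) is easy to break by accident. The sketch below is a hypothetical helper, not part of SomaticSeq. It assumes the only multi-argument global options are the two named in the note, and it places them ahead of the single-value options so that the mode keyword is never swallowed.

```python
import shlex

# Multi-argument options named in the note above; treat this set as an assumption.
MULTI_ARG_FLAGS = {"--extra-hyperparameters", "--features-excluded"}


def build_command(global_opts, mode, mode_opts):
    """Return argv as: somaticseq <global options> <mode> <mode options>.

    global_opts and mode_opts are lists of (flag, [values]) pairs.
    """
    if mode not in ("paired", "single"):
        raise ValueError("mode must be 'paired' or 'single'")
    multi = [opt for opt in global_opts if opt[0] in MULTI_ARG_FLAGS]
    single = [opt for opt in global_opts if opt[0] not in MULTI_ARG_FLAGS]
    if multi and not single:
        raise ValueError("need a single-value option between multi-argument options and the mode")

    argv = ["somaticseq"]
    # Multi-argument options first, so a single-value option precedes the mode.
    for flag, values in multi + single:
        argv.append(flag)
        argv.extend(values)
    argv.append(mode)
    for flag, values in mode_opts:
        argv.append(flag)
        argv.extend(values)
    return argv


cmd = build_command(
    global_opts=[
        ("--output-directory", ["out"]),
        ("--genome-reference", ["GRCh38.fa"]),
        ("--extra-hyperparameters", ["scale_pos_weight:0.1", "max_leaves:12"]),
    ],
    mode="paired",
    mode_opts=[
        ("--tumor-bam-file", ["tumor.bam"]),
        ("--normal-bam-file", ["matched_normal.bam"]),
    ],
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed directly to subprocess.run.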

  • For all of those input VCF files, both .vcf and .vcf.gz are acceptable.
    For alignment files, SomaticSeq also accepts .cram, but some callers may
    only take .bam.

  • --arbitrary-snvs and --arbitrary-indels were added in v3.7.0. They allow
    users to input any arbitrary VCF file(s) from caller(s) that we did not
    explicitly incorporate. SNVs and indels have to be separated.

    • If your caller puts SNVs and indels in the same output VCF file, you may
      split it using a SomaticSeq utility script, e.g.,
      somaticseq_split_vcf -infile small_variants.vcf -snv snvs.vcf -indel indels.vcf.
      As usual, input can be either .vcf or .vcf.gz, but output will be
      .vcf.
    • For those VCF file(s), any calls not labeled REJECT or LowQual will be
      considered bona fide somatic mutation calls. REJECT calls will be
      skipped. LowQual calls will be considered, but will not have a value of
      1 in the if_Caller machine learning feature.
  • --inclusion-region or --exclusion-region will require bedtools in your
    path.

  • --algorithm defaults to xgboost as of v3.6.0, but can also be set to ada
    (AdaBoost in R). XGBoost supports multi-threading, can be orders of
    magnitude faster than AdaBoost, and appears to be about as accurate, so we
    changed the default from ada to xgboost in v3.6.0 and recommend it now.

  • To split the job into multiple threads, place --threads X before the
    paired option to indicate X threads. SomaticSeq then creates X BED files
    (each covering 1/X of the total base pairs), runs on each of those sub-BED
    files in parallel, and merges the results. This requires bedtools in your
    path.
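
To illustrate what "each covering 1/X of the total base pairs" means for --threads, here is a minimal pure-Python sketch of the idea. It is not SomaticSeq's implementation (which relies on bedtools). The function name split_regions and the tuple representation of BED intervals are assumptions made for illustration.

```python
def split_regions(regions, n_parts):
    """Split (chrom, start, end) regions into n_parts lists of near-equal total length.

    Regions are consumed in order and cut wherever a part reaches its quota.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be >= 1")
    total = sum(end - start for _, start, end in regions)
    # The first (total % n_parts) parts get one extra base so quotas sum to total.
    quotas = [total // n_parts + (1 if i < total % n_parts else 0) for i in range(n_parts)]
    parts = [[] for _ in range(n_parts)]

    i = 0  # index of the part currently being filled
    remaining = quotas[0]
    for chrom, start, end in regions:
        while start < end:
            # Move on to the next part once the current quota is used up.
            while remaining == 0 and i < n_parts - 1:
                i += 1
                remaining = quotas[i]
            take = min(remaining, end - start)
            parts[i].append((chrom, start, start + take))
            start += take
            remaining -= take
    return parts


parts = split_regions([("chr1", 0, 100), ("chr2", 0, 50)], 3)
for idx, part in enumerate(parts, 1):
    print(idx, part)
```

A region that straddles a quota boundary is cut in two, which is why a sub-BED file can contain partial intervals from the original BED file.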

Additional parameters, to be specified before the paired option, invoke
training mode. In addition to the four files specified above, two classifiers
(SNV and indel) will be created.

  • --somaticseq-train: flag (takes no argument) that invokes training mode,
    which also requires ground truth VCF files.
    • --extra-hyperparameters: add hyperparameters for xgboost, e.g.,
      --extra-hyperparameters scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12.
  • --truth-snv: if you have a ground truth VCF file for SNV
  • --truth-indel: if you have a ground truth VCF file for INDEL

Additional input files, to be specified before the paired option, invoke
prediction mode (using classifiers to score variants). Four additional files
will be created: SSeq.Classified.sSNV.vcf, SSeq.Classified.sSNV.tsv,
SSeq.Classified.sINDEL.vcf, and SSeq.Classified.sINDEL.tsv.

  • --classifier-snv: classifier previously built for SNV
  • --classifier-indel: classifier previously built for INDEL

Without the parameters above to invoke training or prediction mode,
SomaticSeq will default to majority-vote consensus mode.
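
The relationship between these options and the three modes can be summarized in a small decision function. This is a sketch of the rules as described in this section, not SomaticSeq's code. The function name select_mode is hypothetical, and giving training precedence when both training and classifier options are supplied is an assumption that the text does not confirm.

```python
def select_mode(train=False, truth_snv=None, truth_indel=None,
                classifier_snv=None, classifier_indel=None):
    """Return 'training', 'prediction', or 'consensus' for a set of options."""
    if train:
        # --somaticseq-train requires ground truth VCF file(s).
        if not (truth_snv or truth_indel):
            raise ValueError("training mode needs --truth-snv and/or --truth-indel")
        # Assumption: training wins if classifiers are also given.
        return "training"
    if classifier_snv or classifier_indel:
        return "prediction"
    # No training flag and no classifier: majority-vote consensus.
    return "consensus"


print(select_mode())
print(select_mode(classifier_snv="snv.classifier"))
print(select_mode(train=True, truth_snv="truth.snv.vcf"))
```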

To train SomaticSeq classifiers with multiple data sets combined

Run somaticseq_xgboost train --help to see the options. It is recommended that
SNV and INDEL models be trained separately, but it is up to you to experiment,
e.g.,

somaticseq_xgboost train \
  -tsvs SAMPLE_1/Ensemble.sSNV.tsv SAMPLE_2/Ensemble.sSNV.tsv ... SAMPLE_N/Ensemble.sSNV.tsv \
  -out multiSample.SNV.classifier \
  -threads 8 -depth 12 -seed 42 -method hist -iter 250 \
  --extra-params scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12
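
The key:value tokens passed to --extra-params (and to --extra-hyperparameters) map naturally onto a parameter dictionary. The sketch below shows one way such tokens could be parsed. It is an illustration only: the function name parse_extra_params is hypothetical, and the type coercion rule (int, then float, then string) is an assumption rather than SomaticSeq's documented behavior.

```python
def parse_extra_params(tokens):
    """Turn ['key:value', ...] into a dict, coercing numeric values."""
    params = {}
    for token in tokens:
        key, sep, raw = token.partition(":")
        if not sep or not key or not raw:
            raise ValueError(f"expected key:value, got {token!r}")
        # Try int first, then float; otherwise keep the raw string.
        for cast in (int, float):
            try:
                params[key] = cast(raw)
                break
            except ValueError:
                continue
        else:
            params[key] = raw
    return params


print(parse_extra_params(["scale_pos_weight:0.1", "grow_policy:lossguide", "max_leaves:12"]))
```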

Run SomaticSeq modules separately

Most SomaticSeq modules can be run on their own. They may be useful for
debugging or for your own purposes. See this page for your options.

Dockerized workflows and pipelines

To run somatic mutation callers and then SomaticSeq

We have created a module (i.e., somaticseq_make_somatic_scripts) that can run
all the dockerized somatic mutation callers and then SomaticSeq, described at
somaticseq/utilities/dockered_pipelines.
There is also an alignment workflow described there. You need
docker to run these workflows. Singularity is also
supported, but is not optimized. Let me know if you find bugs.

To create training data to create SomaticSeq classifiers

Dockerized alignment pipeline based on GATK's best practices

Described at
somaticseq/utilities/dockered_pipelines.
The module is somaticseq_make_alignment_scripts.

Utilities

We have some generally useful scripts in utilities. Some
of the more useful tools are:

  • somaticseq_loci_counter finds overlapping regions among multiple bed files.
  • somaticseq_run_workflows is a rudimentary workflow manager that executes
    multiple scripts at once.
  • somaticseq_split_bed_into_equal_regions splits one bed file into a number of
    output bed files, where each output bed file will have the same total length.
  • somaticseq_linguistic_sequence_complexity calculates sequence complexity
    given a nucleotide sequence (e.g., GCCAGAC) based on
    Troyanskaya OG et al. Bioinformatics 2002.
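
As a rough illustration of sequence complexity, the sketch below computes one common formulation of linguistic complexity: the number of distinct substrings observed in the sequence divided by the maximum number possible for a sequence of that length over a 4-letter alphabet. This is an assumption about the general idea, not a reimplementation of somaticseq_linguistic_sequence_complexity, which may use a different variant (for example, a cap on word length), so the values are not guaranteed to match.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Ratio of observed distinct substrings to the maximum possible number."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    observed = 0
    possible = 0
    for k in range(1, n + 1):
        # Distinct k-mers actually present in the sequence.
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k different k-mers, and at most n-k+1 positions.
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible


print(linguistic_complexity("GCCAGAC"))
```

For GCCAGAC, only three of the four nucleotides occur, so 24 of a possible 25 distinct substrings are observed under this formulation.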


This article is auto-generated from bioinform/somaticseq via the GitHub API. Last fetched: 6/29/2026