
Feature request: empirical sequencing-error-rate estimation #688

Description

@jdidion

Background

I'm the maintainer of atropos, another adapter-trimming tool that started as a cutadapt fork. I'm winding atropos down in favor of actively-maintained tools like fastp and wanted to surface a few capabilities that users might miss, in case they're interesting for fastp.

Proposal

Add a pre-processing pass that estimates the empirical per-base sequencing error rate from the FASTQ input and surfaces that estimate to the user (and optionally feeds it into downstream thresholds such as -n/--n_base_limit, quality cutoffs, or adapter-match error tolerance).

Two methods are worth considering:

  1. Quality-based: sum 10^(-Q/10) over every base and divide by the base count. This is cheap, streams in a single pass, and needs no calibration. It is useful as a sanity check, but any quality-score miscalibration biases it directly.
  2. Wang et al. 2012 "shadow regression": regress the number of mismatching reads against the number of unique reads across a range of read-length prefixes, then solve for the per-base error rate. Works on any set of reads without requiring alignment to a reference.
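To make method 1 concrete, here is a minimal Python sketch. It is not fastp or atropos code: the function names `iter_quality_strings` and `quality_based_error_rate` are hypothetical, and it assumes plain 4-line FASTQ records with Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, TextIO


def iter_quality_strings(handle: TextIO) -> Iterator[str]:
    """Yield the quality line of each record (assumes 4 lines per record)."""
    for index, line in enumerate(handle):
        if index % 4 == 3:
            yield line.rstrip("\r\n")


def quality_based_error_rate(quals: Iterable[str], phred_offset: int = 33) -> float:
    """Mean per-base error probability implied by the quality scores.

    Each quality character encodes Q = ord(char) - phred_offset, and the
    implied error probability for that base is 10^(-Q/10).
    Phred+33 is an assumption here; older data may use a different offset.
    """
    total_error = 0.0
    total_bases = 0
    for qual in quals:
        for char in qual:
            total_error += 10 ** (-(ord(char) - phred_offset) / 10)
        total_bases += len(qual)
    # No bases means the rate is undefined, so report NaN rather than 0.
    return total_error / total_bases if total_bases else float("nan")
```

Passing an open FASTQ file handle through `iter_quality_strings` and into `quality_based_error_rate` gives the estimate in one streaming pass. The result is only as trustworthy as the quality scores themselves, which is the motivation for the second method.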

Why this is useful

Users tuning adapter-match stringency (--adapter_fasta tolerance, insert-match diff_limit, etc.) currently guess. An empirical baseline lets them pick thresholds that sit a defined distance above the platform's actual error floor — and flags obviously-degraded runs that would otherwise appear as "clean" just because Qs are high.
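One way to turn an estimated error rate into a threshold, sketched in Python. This is only an illustration under a simplifying assumption that errors are independent across bases (a binomial model). The function name `min_mismatch_allowance` and the `miss_prob` parameter are hypothetical and do not correspond to any existing fastp option.

```python
from math import comb


def min_mismatch_allowance(error_rate: float, match_length: int,
                           miss_prob: float = 0.01) -> int:
    """Smallest mismatch count k such that a true match of `match_length`
    bases has more than k sequencing errors with probability <= miss_prob.

    Assumes independent per-base errors, i.e. a Binomial(match_length,
    error_rate) error count. Real error profiles are position-dependent.
    """
    cumulative = 0.0
    for k in range(match_length + 1):
        # P(exactly k errors) under the binomial model.
        cumulative += (comb(match_length, k)
                       * error_rate ** k
                       * (1 - error_rate) ** (match_length - k))
        if 1.0 - cumulative <= miss_prob:
            return k
    return match_length
```

For example, with an estimated error rate of 1% and a 30-base match, allowing 2 mismatches keeps the chance of rejecting a true match below 1% under this model. At an error rate of 0.1% the same calculation gives 1.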

Prior art

Happy to help with tests/data if you decide to pursue this.
