Package midsv


midsv

midsv is a Python module that converts SAM files to MIDSV format.

MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents differences between a reference and a query. A MIDSV string has one element per reference position, so it has the same length as the reference.

[!CAUTION] MIDSV is intended for targeted amplicon sequences (10-100 kbp).
Using whole chromosomes as references may exhaust memory and crash.

[!IMPORTANT] MIDSV requires long-format cstag tags in the SAM file.
Use minimap2 with the --cs=long option, or use the cstag tool to append a long-format cs tag.

The output includes MIDSV and, optionally, QSCORE.

  • MIDSV preserves original nucleotides while annotating mutations.
  • QSCORE provides Phred quality scores for each nucleotide.

Details of MIDSV (formerly MIDS) are described in our paper.

🛠️Installation

From Bioconda (recommended):

conda install -c bioconda midsv

From PyPI:

pip install midsv

📜Specifications

MIDSV

Op   Regex             Description
=    [ACGTN]           Identical sequence
+    [ACGTN]           Insertion to the reference
-    [ACGTN]           Deletion from the reference
*    [ACGTN][ACGTN]    Substitution
     [acgtn]           Inversion
|                      Separator for insertion sites

MIDSV uses | to separate nucleotides in insertion sites so +A|+C|+G|+T|=A can be easily split into [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|").
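
The splitting described above extends naturally to a whole MIDSV string. The following is a minimal sketch using only the Python standard library; the helper name split_midsv is hypothetical and is not part of midsv.

```python
def split_midsv(midsv_tag: str) -> list[list[str]]:
    """Split a MIDSV string into one list of operations per reference position."""
    # Commas separate reference positions; "|" separates inserted bases
    # from each other and from the operation they are attached to.
    return [position.split("|") for position in midsv_tag.split(",")]


positions = split_midsv("=A,+A|+C|+G|+T|=A,*AG")
print(positions)
# [['=A'], ['+A', '+C', '+G', '+T', '=A'], ['*AG']]
```

With this layout, the last item of each inner list is the operation at the reference position, and any preceding items are inserted bases.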

QSCORE

Op   Description
-1   Unknown
|    Separator for insertion sites

QSCORE uses -1 for deletions or unknown nucleotides.

As with MIDSV, QSCORE uses | to separate quality scores in insertion sites.
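
A QSCORE string can be parsed into integers in the same way. This is a minimal sketch with a hypothetical helper, split_qscore, which is not part of midsv; the input string is made up for illustration.

```python
def split_qscore(qscore: str) -> list[list[int]]:
    """Split a QSCORE string into one list of integer scores per reference position."""
    # Commas separate reference positions; "|" separates the scores of
    # inserted bases from the score at the reference position itself.
    return [[int(q) for q in position.split("|")] for position in qscore.split(",")]


scores = split_qscore("15,0|0|0|20,-1,21")
print(scores)
# [[15], [0, 0, 0, 20], [-1], [21]]

# Indices of positions whose own score is -1 (deletion or unknown nucleotide).
unknown = [i for i, per_position in enumerate(scores) if per_position[-1] == -1]
print(unknown)
# [2]
```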

📘Usage

midsv.transform(
    path_sam: str | Path,
    qscore: bool = False,
    keep: str | list[str] = None
) -> list[dict[str, str | int]]
  • path_sam: Path to a SAM file on disk.
  • qscore (bool, optional): Output QSCORE. Defaults to False.
  • keep: Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to include from the SAM file. Defaults to None.

  • transform() returns a list of dictionaries containing QNAME, RNAME, MIDSV, and optionally QSCORE, plus any fields specified by keep.

  • MIDSV and QSCORE are comma-separated strings, each containing one element per reference position, so both have the same length as the reference sequence.
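
Because both strings have one element per reference position, they can be walked in parallel. The sketch below uses a hard-coded record copied from the insertion, deletion, and substitution example in this document (reference length 10), so it runs without midsv installed; the variable names are illustrative.

```python
# One record as returned by transform(..., qscore=True), copied from the
# insertion/deletion/substitution example rather than computed here.
record = {
    "QNAME": "indel_sub",
    "RNAME": "example",
    "MIDSV": "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T",
    "QSCORE": "15,16,17,18,19,0|0|0|20,-1,-1,21,22",
}

midsv_ops = record["MIDSV"].split(",")
qscores = record["QSCORE"].split(",")

# Both strings have one element per reference position.
print(len(midsv_ops), len(qscores))
# 10 10

# Pair each operation with its quality score, using 1-based positions.
paired = list(zip(range(1, len(midsv_ops) + 1), midsv_ops, qscores))
print(paired[5])
# (6, '+T|+T|+T|=C', '0|0|0|20')
```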

🖍️Examples

Perfect match

import midsv
from midsv.io import read_sam

# Perfect match

path_sam = "examples/example_match.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [{
#   'QNAME': 'match',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
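
The QSCORE values above are consistent with the Phred+33 encoding that the SAM format uses for its QUAL field. The sketch below decodes the QUAL string from this example by hand; phred33_to_scores is a hypothetical helper, not a midsv function, and it is not a claim about how midsv computes QSCORE internally.

```python
def phred33_to_scores(qual: str) -> list[int]:
    """Decode a SAM QUAL string (Phred+33 ASCII) into integer quality scores."""
    return [ord(char) - 33 for char in qual]


scores = phred33_to_scores("0123456789")
print(",".join(map(str, scores)))
# 15,16,17,18,19,20,21,22,23,24
```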

Insertion, deletion, and substitution

import midsv
from midsv.io import read_sam

path_sam = "examples/example_indels.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
# ]

print(midsv.transform(path_sam, qscore=True))
# [{
#   'QNAME': 'indel_sub',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
#   'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]

Large deletion

import midsv
from midsv.io import read_sam

path_sam = "examples/example_large_deletion.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
#     ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [
#   {'QNAME': 'large-deletion',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=N,=N,=N,=N,=N,=N,=A,=C',
#   'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]

Inversion

import midsv
from midsv.io import read_sam

path_sam = "examples/example_inversion.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
#     ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
#     ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [
#   {'QNAME': 'inversion',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]

🧩Helper functions

Read SAM file

midsv.io.read_sam(path_sam: str | Path) -> Iterator[list[str]]

read_sam() reads a local SAM file into an iterator of string lists.
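
To illustrate the shape of the data that read_sam() yields, the following standard-library sketch splits tab-separated SAM lines into lists of strings. It is an assumption about the output shape based on the examples in this document, not the implementation used by midsv; iter_sam_fields is a hypothetical name.

```python
from collections.abc import Iterator
from io import StringIO
from typing import TextIO


def iter_sam_fields(handle: TextIO) -> Iterator[list[str]]:
    """Yield each non-empty SAM line as a list of its tab-separated fields."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t")


# In-memory stand-in for a SAM file with one header and one alignment.
sam_text = (
    "@SQ\tSN:example\tLN:10\n"
    "match\t0\texample\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t0123456789\tcs:Z:=ACGTACGTAC\n"
)
records = list(iter_sam_fields(StringIO(sam_text)))
print(records[0])
# ['@SQ', 'SN:example', 'LN:10']
print(len(records[1]))
# 12
```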

Read/Write JSON Line (JSONL)

midsv.io.write_jsonl(dicts: list[dict[str, str]], path_output: str | Path)

Since transform() returns a list of dictionaries, write_jsonl() outputs it to a file in JSONL format.

midsv.io.read_jsonl(path_input: str | Path) -> Iterator[dict[str, str]]

Conversely, read_jsonl() reads JSONL as an iterator of dictionaries.
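
For readers unfamiliar with JSONL, the format stores one JSON object per line. The sketch below shows a round trip using only the standard library; it illustrates the file format and is not the implementation of write_jsonl() or read_jsonl(). The record and file name are made up.

```python
import json
import tempfile
from pathlib import Path

records = [{"QNAME": "match", "RNAME": "example", "MIDSV": "=A,=C"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"

    # Write one JSON object per line.
    with path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back, parsing each line independently.
    with path.open() as f:
        loaded = [json.loads(line) for line in f]

print(loaded == records)
# True
```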

Reverse complement MIDSV

from midsv import formatter

midsv_tag = "=A,=A,-G,+T|+C|=A,=A,*AG,=C"
revcomp_tag = formatter.revcomp(midsv_tag)
print(revcomp_tag)
# =G,*TC,=T,=T,+G|+A|-C,=T,=T

revcomp() returns the reverse complement of a MIDSV string. Insertions are reversed and complemented with their anchor moved to the new position, following the MIDSV specification.
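
The anchor move can be made concrete with a small standard-library sketch. It reproduces the example above but is only an illustration under simplifying assumptions, not the implementation in midsv.formatter: revcomp_sketch and complement_op are hypothetical names, and an insertion attached to the first position is rejected because it would have no anchor after reversal.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement_op(op: str) -> str:
    """Keep the operator character and complement the bases that follow it."""
    return op[0] + op[1:].translate(COMPLEMENT)


def revcomp_sketch(midsv_tag: str) -> str:
    # Split every position into (inserted bases, anchor operation).
    positions = []
    for token in midsv_tag.split(","):
        *inserted, anchor = token.split("|")
        positions.append((inserted, anchor))

    if positions[0][0]:
        raise ValueError("this sketch does not handle an insertion at the first position")

    result = []
    for i in range(len(positions) - 1, -1, -1):
        anchor = positions[i][1]
        # An insertion that preceded position i + 1 now precedes position i.
        inserted = positions[i + 1][0] if i + 1 < len(positions) else []
        ops = [complement_op(op) for op in reversed(inserted)]
        ops.append(complement_op(anchor))
        result.append("|".join(ops))
    return ",".join(result)


print(revcomp_sketch("=A,=A,-G,+T|+C|=A,=A,*AG,=C"))
# =G,*TC,=T,=T,+G|+A|-C,=T,=T
```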

Export VCF

from midsv import transform
from midsv.io import write_vcf

alignments = transform("examples/example_indels.sam", qscore=False)
write_vcf(alignments, "variants.vcf", large_sv_threshold=50)

write_vcf() writes MIDSV output to VCF and supports insertion, deletion, substitution, large insertion, large deletion, and inversion. Insertions longer than large_sv_threshold are emitted as symbolic <INS>, large deletions (or =N padding) use <DEL>, and inversions use <INV>. The INFO field includes TYPE or SVTYPE, SVLEN, SEQ, and QNAME.

Sub-modules

midsv.converter
midsv.formatter
midsv.io
midsv.main
midsv.polisher
midsv.validator

Functions

def transform(path_sam: Path | str, qscore: bool = False, keep: str | list[str] = None) ‑> list[dict[str, str | int]]
def transform(
    path_sam: Path | str,
    qscore: bool = False,
    keep: str | list[str] = None,
) -> list[dict[str, str | int]]:
    """Integrated function to perform MIDSV conversion.

    Args:
        path_sam (str | Path): Path of a SAM file.
        qscore (bool, optional): Output QSCORE. Defaults to False.
        keep (str | list[str], optional): Subset of 'FLAG', 'POS', 'CIGAR', 'SEQ', 'QUAL', 'CSTAG' to keep. Defaults to None.

    Returns:
        list[dict[str, str | int]]: List of dictionaries containing QNAME, RNAME, MIDSV, QSCORE, and fields specified by the keep argument.
    """
    # Validation
    keep = validator.keep_argument(keep)
    validator.validate_sam(path_sam, qscore)

    # Formatting
    sqheaders: dict[str, int] = formatter.extract_sqheaders(io.read_sam(path_sam))
    alignments: list[dict[str, str | int]] = formatter.organize_alignments_to_dict(io.read_sam(path_sam))

    # Conversion to MIDSV
    alignments = converter.convert(alignments, qscore)

    # Polishing
    alignments = polisher.polish(alignments, sqheaders, keep)

    return alignments

Integrated function to perform MIDSV conversion.

Args

path_sam : str | Path
Path of a SAM file.
qscore : bool, optional
Output QSCORE. Defaults to False.
keep : str | list[str], optional
Subset of 'FLAG', 'POS', 'CIGAR', 'SEQ', 'QUAL', 'CSTAG' to keep. Defaults to None.

Returns

list[dict[str, str | int]]
List of dictionaries containing QNAME, RNAME, MIDSV, QSCORE, and fields specified by the keep argument.