Package `midsv`

midsv

midsv is a Python module to convert SAM to MIDSV format.

MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format representing the difference between a reference and a query with the same length as the reference.

⚠️ MIDSV is for the target amplicon sequence (10-100 kbp). It may crash when whole chromosomes are used as reference due to running out of memory.

MIDSV provides MIDSV, CSSPLIT, and QSCORE.

MIDSV is a simple representation focusing on mutations
CSSPLIT keeps original nucleotides
QSCORE provides Phred quality score on each nucleotide

MIDSV (formerly named MIDS) details are described in our paper.

Installation

From PyPI:

pip install midsv

From Bioconda:

conda install -c bioconda midsv

Usage

midsv.transform(
    sam: list[list],
    midsv: bool = True,
    cssplit: bool = True,
    qscore: bool = True) -> list[dict]

transform() returns a list of dictionaries incuding QNAME, RNAME, MIDSV, CSSPLIT, and QSCORE.
MIDSV, CSSPLIT, and QSCORE are comma-separated and have the same reference sequence length.

import midsv

# Perfect match

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'control',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]

# Insertion, deletion and substitution

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'indel_sub',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
#   'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]

# Large deletion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
    ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'large-deletion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
#   'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
#   'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]

# Inversion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
    ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
    ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'inversion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]

Operators

MIDSV

Op	Description
M	Identical sequence
[1-9][0-9]+	Insertion to the reference
D	Deletion from the reference
S	Substitution
N	Unknown
[mdsn]	Inversion

MIDSV represents insertion as an integer and appends the following operators.

If five insertions follow three matches, MIDSV returns 5M,M,M (not 5,M,M,M) since 5M,M,M keeps reference sequence length in a comma-separated field.

CSSPLIT

Op	Regex	Description
=	[ACGTN]	Identical sequence
+	[ACGTN]	Insertion to the reference
-	[ACGTN]	Deletion from the reference
*	[ACGTN][ACGTN]	Substitution
	[acgtn]	Inversion
\|		Separater of insertion sites

CSSPLIT uses | to separate nucleotides in insertion sites.

Therefore, +A|+C|+G|+T|=A can be easily splited to [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|") in Python.

QSCORE

Op	Description
-1	Unknown
\|	Separator at insertion sites

QSCORE uses -1 at deletion or unknown nucleotides.

As with CSSPLIT, QSCORE uses | to separate quality scores in insertion sites.

Helper functions

Read SAM file

midsv.read_sam(path_of_sam: str | Path) -> list[list]

midsv.read_sam read SAM file into a list of lists.

Read/Write JSON Line (JSONL)

midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)

midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]

Since midsv returns a list of dictionaries, midsv.write_jsonl outputs it to a file in JSONL format.

Conversely, midsv.read_jsonl reads JSONL as a list of dictionaries.

Sub-modules

midsv.convert
midsv.format
midsv.io
midsv.main
midsv.proofread
midsv.validate

Functions

def transform(sam: list[list] | Iterator[list], midsv: bool = True, cssplit: bool = True, qscore: bool = True, keep: list[str] = None) ‑> list[dict]

Expand source code

def transform(
    sam: list[list] | Iterator[list],
    midsv: bool = True,
    cssplit: bool = True,
    qscore: bool = True,
    keep: list[str] = None,
) -> list[dict]:
    """Integrated function to perform MIDSV conversion

    Args:
        sam (list[list]): Lists ot SAM format
        midsv (bool, optional): Output MIDSV. Defaults to True.
        cssplit (bool, optional): Output CSSPLIT. Defaults to True.
        qscore (bool, optional): Output QSCORE. Require `midsv == True` or `cssplit == True`. Defaults to True.
        keep (set(str), optional): Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to keep. Defaults to set().
    Returns:
        list[dict]: Dictionary containing QNAME, RNAME, MIDSV, and QSCORE
    """

    if midsv or cssplit:
        pass
    else:
        raise ValueError("Either midsv or cssplit must be True")

    keep = set(keep) if keep else set()
    if keep != set() and not keep.issubset({"FLAG", "POS", "SEQ", "QUAL", "CIGAR", "CSTAG"}):
        raise ValueError("'keep' must be a subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'}")

    sam = list(sam)
    validate.sam_headers(sam)
    validate.sam_alignments(sam)

    sqheaders = format.extract_sqheaders(sam)
    samdict = format.dictionarize_sam(sam)

    if qscore and any(s["QUAL"] == "*" for s in samdict):
        raise ValueError("qscore must be False because the input does not contain QUAL")

    samdict = format.remove_softclips(samdict)
    samdict = format.remove_resequence(samdict)

    for alignment in samdict:
        if midsv:
            alignment["MIDSV"] = convert.cstag_to_midsv(alignment["CSTAG"])
        if cssplit:
            alignment["CSSPLIT"] = convert.cstag_to_cssplit(alignment["CSTAG"])
        if midsv and qscore:
            alignment["QSCORE"] = convert.qual_to_qscore_midsv(alignment["QUAL"], alignment["MIDSV"])
        elif cssplit and qscore:
            alignment["QSCORE"] = convert.qual_to_qscore_cssplit(alignment["QUAL"], alignment["CSSPLIT"])

    samdict_polished = proofread.join(samdict)
    samdict_polished = proofread.pad(samdict_polished, sqheaders)
    samdict_polished = proofread.remove_different_length(samdict_polished, sqheaders)
    samdict_polished = proofread.select(samdict_polished, keep)
    return samdict_polished

Integrated function to perform MIDSV conversion

Args

sam : list[list]: Lists ot SAM format
midsv : bool, optional: Output MIDSV. Defaults to True.
cssplit : bool, optional: Output CSSPLIT. Defaults to True.
qscore : bool, optional: Output QSCORE. Require midsv == True or cssplit == True. Defaults to True.

keep (set(str), optional): Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to keep. Defaults to set().

Returns

list[dict]: Dictionary containing QNAME, RNAME, MIDSV, and QSCORE