Package midsv
midsv
midsv
is a Python module to convert SAM to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format representing the difference between a reference and a query with the same length as the reference.
⚠️ MIDSV is for the target amplicon sequence (10-100 kbp). It may crash when whole chromosomes are used as reference due to running out of memory.
MIDSV provides MIDSV
, CSSPLIT
, and QSCORE
.
MIDSV
is a simple representation focusing on mutationsCSSPLIT
keeps original nucleotidesQSCORE
provides Phred quality score on each nucleotide
MIDSV (formerly named MIDS) details are described in our paper.
Installation
From PyPI:
pip install midsv
From Bioconda:
conda install -c bioconda midsv
Usage
midsv.transform(
sam: list[list],
midsv: bool = True,
cssplit: bool = True,
qscore: bool = True) -> list[dict]
transform()
returns a list of dictionaries incudingQNAME
,RNAME
,MIDSV
,CSSPLIT
, andQSCORE
.MIDSV
,CSSPLIT
, andQSCORE
are comma-separated and have the same reference sequence length.
import midsv
# Perfect match
sam = [
['@SQ', 'SN:example', 'LN:10'],
['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
]
midsv.transform(sam)
# [{
# 'QNAME': 'control',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
# Insertion, deletion and substitution
sam = [
['@SQ', 'SN:example', 'LN:10'],
['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
]
midsv.transform(sam)
# [{
# 'QNAME': 'indel_sub',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]
# Large deletion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'large-deletion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
# 'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]
# Inversion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'inversion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
Operators
MIDSV
Op | Description |
---|---|
M | Identical sequence |
[1-9][0-9]+ | Insertion to the reference |
D | Deletion from the reference |
S | Substitution |
N | Unknown |
[mdsn] | Inversion |
MIDSV
represents insertion as an integer and appends the following operators.
If five insertions follow three matches, MIDSV returns 5M,M,M
(not 5,M,M,M
) since 5M,M,M
keeps reference sequence length in a comma-separated field.
CSSPLIT
Op | Regex | Description |
---|---|---|
= | [ACGTN] | Identical sequence |
+ | [ACGTN] | Insertion to the reference |
- | [ACGTN] | Deletion from the reference |
* | [ACGTN][ACGTN] | Substitution |
[acgtn] | Inversion | |
| | Separater of insertion sites |
CSSPLIT
uses |
to separate nucleotides in insertion sites.
Therefore, +A|+C|+G|+T|=A
can be easily splited to [+A, +C, +G, +T, =A]
by "+A|+C|+G|+T|=A".split("|")
in Python.
QSCORE
Op | Description |
---|---|
-1 | Unknown |
| | Separator at insertion sites |
QSCORE
uses -1
at deletion or unknown nucleotides.
As with CSSPLIT
, QSCORE
uses |
to separate quality scores in insertion sites.
Helper functions
Read SAM file
midsv.read_sam(path_of_sam: str | Path) -> list[list]
midsv.read_sam
read SAM file into a list of lists.
Read/Write JSON Line (JSONL)
midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]
Since midsv
returns a list of dictionaries, midsv.write_jsonl
outputs it to a file in JSONL format.
Conversely, midsv.read_jsonl
reads JSONL as a list of dictionaries.
Sub-modules
midsv.convert
midsv.format
midsv.io
midsv.main
midsv.proofread
midsv.validate
Functions
def transform(sam: list[list] | Iterator[list],
midsv: bool = True,
cssplit: bool = True,
qscore: bool = True,
keep: list[str] = None) ‑> list[dict]-
Expand source code
def transform( sam: list[list] | Iterator[list], midsv: bool = True, cssplit: bool = True, qscore: bool = True, keep: list[str] = None, ) -> list[dict]: """Integrated function to perform MIDSV conversion Args: sam (list[list]): Lists ot SAM format midsv (bool, optional): Output MIDSV. Defaults to True. cssplit (bool, optional): Output CSSPLIT. Defaults to True. qscore (bool, optional): Output QSCORE. Require `midsv == True` or `cssplit == True`. Defaults to True. keep (set(str), optional): Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to keep. Defaults to set(). Returns: list[dict]: Dictionary containing QNAME, RNAME, MIDSV, and QSCORE """ if midsv or cssplit: pass else: raise ValueError("Either midsv or cssplit must be True") keep = set(keep) if keep else set() if keep != set() and not keep.issubset({"FLAG", "POS", "SEQ", "QUAL", "CIGAR", "CSTAG"}): raise ValueError("'keep' must be a subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'}") sam = list(sam) validate.sam_headers(sam) validate.sam_alignments(sam) sqheaders = format.extract_sqheaders(sam) samdict = format.dictionarize_sam(sam) if qscore and any(s["QUAL"] == "*" for s in samdict): raise ValueError("qscore must be False because the input does not contain QUAL") samdict = format.remove_softclips(samdict) samdict = format.remove_resequence(samdict) for alignment in samdict: if midsv: alignment["MIDSV"] = convert.cstag_to_midsv(alignment["CSTAG"]) if cssplit: alignment["CSSPLIT"] = convert.cstag_to_cssplit(alignment["CSTAG"]) if midsv and qscore: alignment["QSCORE"] = convert.qual_to_qscore_midsv(alignment["QUAL"], alignment["MIDSV"]) elif cssplit and qscore: alignment["QSCORE"] = convert.qual_to_qscore_cssplit(alignment["QUAL"], alignment["CSSPLIT"]) samdict_polished = proofread.join(samdict) samdict_polished = proofread.pad(samdict_polished, sqheaders) samdict_polished = proofread.remove_different_length(samdict_polished, sqheaders) samdict_polished = proofread.select(samdict_polished, keep) return samdict_polished
Integrated function to perform MIDSV conversion
Args
sam
:list[list]
- Lists ot SAM format
midsv
:bool
, optional- Output MIDSV. Defaults to True.
cssplit
:bool
, optional- Output CSSPLIT. Defaults to True.
qscore
:bool
, optional- Output QSCORE. Require
midsv == True
orcssplit == True
. Defaults to True.
keep (set(str), optional): Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to keep. Defaults to set().
Returns
list[dict]
- Dictionary containing QNAME, RNAME, MIDSV, and QSCORE