Package midsv
midsv
midsv is a Python module that converts SAM files to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents differences between a reference and a query, with the same length as the reference.
[!CAUTION] MIDSV is intended for targeted amplicon sequences (10-100 kbp). Using whole chromosomes as references may exhaust memory and crash.
[!IMPORTANT] MIDSV requires long-format cs tags in the SAM file. Use minimap2 with the --cs=long option, or use the cstag tool to append long-format cs tags.
The output includes MIDSV and, optionally, QSCORE.
- MIDSV preserves original nucleotides while annotating mutations.
- QSCORE provides Phred quality scores for each nucleotide.
Details of MIDSV (formerly MIDS) are described in our paper.
🛠️ Installation
From Bioconda (recommended):
conda install -c bioconda midsv
From PyPI:
pip install midsv
📜 Specifications
MIDSV
| Op | Regex | Description |
|---|---|---|
| = | [ACGTN] | Identical sequence |
| + | [ACGTN] | Insertion to the reference |
| - | [ACGTN] | Deletion from the reference |
| * | [ACGTN][ACGTN] | Substitution |
| | [acgtn] | Inversion |
| \| | | Separator for insertion sites |
MIDSV uses | to separate nucleotides in insertion sites so +A|+C|+G|+T|=A can be easily split into [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|").
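The two-level structure described above (commas between reference positions, `|` within an insertion site) can be sketched with plain string operations. The helper name `split_midsv` is hypothetical and not part of the midsv package.

```python
def split_midsv(midsv: str) -> list[list[str]]:
    """Split a MIDSV string into one list of operations per reference position."""
    return [position.split("|") for position in midsv.split(",")]


positions = split_midsv("=A,+A|+C|+G|+T|=A,=C")
print(positions)
# [['=A'], ['+A', '+C', '+G', '+T', '=A'], ['=C']]

# The last element at each position is the operation on the reference base;
# anything before it is an inserted nucleotide.
inserted = ["".join(op[1] for op in ops[:-1]) for ops in positions]
print(inserted)
# ['', 'ACGT', '']
```

Splitting on `,` first keeps the outer list aligned with the reference, so its length equals the reference length even when insertions are present.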
QSCORE
| Op | Description |
|---|---|
| -1 | Unknown |
| \| | Separator for insertion sites |
QSCORE uses -1 for deletions or unknown nucleotides.
As with MIDSV, QSCORE uses | to separate quality scores in insertion sites.
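A QSCORE string can be parsed the same way as MIDSV. The sketch below uses a hypothetical helper, `parse_qscore`, which is not part of the midsv package. It converts the scores to integers and shows one way to skip the `-1` placeholders when summarizing quality.

```python
def parse_qscore(qscore: str) -> list[list[int]]:
    """Parse a QSCORE string into one list of integer scores per reference position."""
    return [[int(q) for q in position.split("|")] for position in qscore.split(",")]


scores = parse_qscore("15,16,17,18,19,0|0|0|20,-1,-1,21,22")
print(scores[5])
# [0, 0, 0, 20]
print(scores[6])
# [-1]

# Mean quality over sequenced reference bases: take the last score at each
# position (the reference base) and ignore the -1 placeholders.
base_scores = [s[-1] for s in scores if s[-1] != -1]
print(sum(base_scores) / len(base_scores))
# 18.5
```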
📘 Usage
midsv.transform(
path_sam: str | Path,
qscore: bool = False,
keep: str | list[str] = None
) -> list[dict[str, str | int]]
- path_sam: Path to a SAM file on disk.
- qscore (bool, optional): Output QSCORE. Defaults to False.
- keep: Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to include from the SAM file. Defaults to None.
- transform() returns a list of dictionaries containing QNAME, RNAME, MIDSV, and optionally QSCORE, plus any fields specified by keep.
- MIDSV and QSCORE are comma-separated strings, each with one element per position of the reference sequence.
🖍️ Examples
Perfect match
import midsv
from midsv.io import read_sam
# Perfect match
path_sam = "examples/example_match.sam"
print(list(read_sam(path_sam)))
# sam = [
# ['@SQ', 'SN:example', 'LN:10'],
# ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [{
# 'QNAME': 'match',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
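The QSCORE values in this example are consistent with the standard Phred+33 encoding of the SAM QUAL field: each character's ASCII code minus 33. The sketch below only illustrates that arithmetic. The helper name `phred_scores` is hypothetical, and the code is not midsv's own implementation.

```python
def phred_scores(qual: str, offset: int = 33) -> list[int]:
    """Decode a SAM QUAL string into Phred scores, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in qual]


print(phred_scores("0123456789"))
# [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

# '!' is ASCII 33, so it decodes to a score of 0.
print(phred_scores("!!!"))
# [0, 0, 0]
```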
Insertion, deletion, and substitution
import midsv
from midsv.io import read_sam
path_sam = "examples/example_indels.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
# ]
print(midsv.transform(path_sam, qscore=True))
# [{
# 'QNAME': 'indel_sub',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]
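Because MIDSV has one element per reference position, a mutation's index in the string is also its reference coordinate. The sketch below lists the non-match events in the MIDSV string from this example. The helper `list_mutations` is hypothetical, and it assumes that a substitution `*AG` means reference A replaced by query G, as the cs tag `*ag` in this example suggests.

```python
def list_mutations(midsv: str) -> list[tuple[int, str, str]]:
    """Return (1-based reference position, kind, detail) for each non-match event."""
    mutations = []
    for pos, token in enumerate(midsv.split(","), start=1):
        # Inserted nucleotides precede the operation on the reference base.
        *insertions, base_op = token.split("|")
        if insertions:
            mutations.append((pos, "insertion", "".join(op[1:] for op in insertions)))
        if base_op.startswith("-"):
            mutations.append((pos, "deletion", base_op[1:]))
        elif base_op.startswith("*"):
            # Assumed order: reference base first, query base second.
            mutations.append((pos, "substitution", f"{base_op[1]}>{base_op[2]}"))
    return mutations


midsv_tag = "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T"
for mutation in list_mutations(midsv_tag):
    print(mutation)
# (5, 'substitution', 'A>G')
# (6, 'insertion', 'TTT')
# (7, 'deletion', 'A')
# (8, 'deletion', 'A')
```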
Large deletion
import midsv
from midsv.io import read_sam
path_sam = "examples/example_large_deletion.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
# ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [
# {'QNAME': 'large-deletion',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=N,=N,=N,=N,=N,=N,=A,=C',
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]
Inversion
import midsv
from midsv.io import read_sam
path_sam = "examples/example_inversion.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
# ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
# ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [
# {'QNAME': 'inversion',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
🧩 Helper functions
Read SAM file
midsv.io.read_sam(path_sam: str | Path) -> Iterator[list[str]]
read_sam() reads a local SAM file into an iterator of string lists.
Read/Write JSON Lines (JSONL)
midsv.io.write_jsonl(dicts: list[dict[str, str]], path_output: str | Path)
Since transform() returns a list of dictionaries, write_jsonl() outputs it to a file in JSONL format.
midsv.io.read_jsonl(path_input: str | Path) -> Iterator[dict[str, str]]
Conversely, read_jsonl() reads JSONL as an iterator of dictionaries.
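For readers unfamiliar with JSONL, the format stores one JSON object per line. The sketch below illustrates a round trip using only the standard library. It is not the implementation of midsv.io, and the record shown is made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# A made-up record shaped like one element of transform()'s output.
records = [{"QNAME": "read1", "RNAME": "example", "MIDSV": "=A,=C,=G"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"

    # Write: one JSON object per line.
    with path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read: parse each line independently.
    with path.open() as f:
        restored = [json.loads(line) for line in f]

print(restored == records)
# True
```

Since each line is parsed independently, a JSONL file can be processed one record at a time, which is why a reader can return an iterator rather than a full list.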
Reverse complement MIDSV
from midsv import formatter
midsv_tag = "=A,=A,-G,+T|+C|=A,=A,*AG,=C"
revcomp_tag = formatter.revcomp(midsv_tag)
print(revcomp_tag)
# =G,*TC,=T,=T,+G|+A|-C,=T,=T
revcomp() returns the reverse complement of a MIDSV string. Insertions are reversed and complemented with their anchor moved to the new position, following the MIDSV specification.
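The anchor movement described above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the source of formatter.revcomp(). The name `revcomp_sketch` is hypothetical. The sketch raises an error when an insertion at the first reference position would have no anchor after reversal, since the text above does not say how midsv handles that case.

```python
# Complement table; lowercase letters (inversions) stay lowercase.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp_sketch(midsv: str) -> str:
    """Reverse complement a MIDSV string, re-anchoring insertions."""
    result = []
    pending = []  # complemented insertions waiting for their new anchor
    for token in reversed(midsv.split(",")):
        *insertions, base_op = token.split("|")
        # Insertions carried over from the previous token now precede this base.
        result.append("|".join(pending + [base_op.translate(COMPLEMENT)]))
        # This token's insertions sat before its base; after reversal they sit
        # after it, so they attach to the next token in the reversed order.
        pending = [op.translate(COMPLEMENT) for op in reversed(insertions)]
    if pending:
        raise ValueError("insertion at the first position has no anchor after reversal")
    return ",".join(result)


print(revcomp_sketch("=A,=A,-G,+T|+C|=A,=A,*AG,=C"))
# =G,*TC,=T,=T,+G|+A|-C,=T,=T
```

The operator symbols (`=`, `+`, `-`, `*`) are not in the translation table, so only the nucleotide letters change.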
Export VCF
from midsv import transform
from midsv.io import write_vcf
alignments = transform("examples/example_indels.sam", qscore=False)
write_vcf(alignments, "variants.vcf", large_sv_threshold=50)
write_vcf() writes MIDSV output to VCF and supports insertion, deletion, substitution, large insertion, large deletion, and inversion. Insertions longer than large_sv_threshold are emitted as symbolic <INS>, large deletions (or =N padding) use <DEL>, and inversions use <INV>. The INFO field includes TYPE or SVTYPE, SVLEN, SEQ, and QNAME.
Sub-modules
- midsv.converter
- midsv.formatter
- midsv.io
- midsv.main
- midsv.polisher
- midsv.validator
Functions
def transform(path_sam: Path | str, qscore: bool = False, keep: str | list[str] = None) -> list[dict[str, str | int]]
def transform(
    path_sam: Path | str,
    qscore: bool = False,
    keep: str | list[str] = None,
) -> list[dict[str, str | int]]:
    """Integrated function to perform MIDSV conversion.

    Args:
        path_sam (str | Path): Path of a SAM file.
        qscore (bool, optional): Output QSCORE. Defaults to False.
        keep (str | list[str], optional): Subset of 'FLAG', 'POS', 'CIGAR', 'SEQ', 'QUAL', 'CSTAG' to keep. Defaults to None.

    Returns:
        list[dict[str, str]]: Dictionary containing QNAME, RNAME, MIDSV, QSCORE, and fields specified by the keep argument.
    """
    # Validation
    keep = validator.keep_argument(keep)
    validator.validate_sam(path_sam, qscore)

    # Formatting
    sqheaders: dict[str, int] = formatter.extract_sqheaders(io.read_sam(path_sam))
    alignments: list[dict[str, str | int]] = formatter.organize_alignments_to_dict(io.read_sam(path_sam))

    # Conversion to MIDSV
    alignments = converter.convert(alignments, qscore)

    # Polishing
    alignments = polisher.polish(alignments, sqheaders, keep)

    return alignments

Integrated function to perform MIDSV conversion.
Args
- path_sam (str | Path): Path of a SAM file.
- qscore (bool, optional): Output QSCORE. Defaults to False.
- keep (str | list[str], optional): Subset of 'FLAG', 'POS', 'CIGAR', 'SEQ', 'QUAL', 'CSTAG' to keep. Defaults to None.
Returns
- list[dict[str, str | int]]: List of dictionaries, each containing QNAME, RNAME, MIDSV, optionally QSCORE, and fields specified by the keep argument.