FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework
arXiv (2024)
Abstract
Storing and archiving data produced by next-generation sequencing (NGS) is a
huge burden for research institutions. Reference-based compression algorithms
are effective in dealing with these data. Our work focuses on compressing FASTQ
format files with an improved reference-based compression algorithm to achieve
a higher compression ratio than other state-of-the-art algorithms. We propose
FastqZip, which maps sequences to a reference with a new method, supports read
reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm for
the final lossless compression stage, yielding a higher compression ratio at
relatively fast speed. Our method ensures that the sequence can be losslessly
reconstructed while allowing either lossless or lossy compression of the
quality scores. Reads are reordered to further improve the compression ratio. We
evaluate our algorithms on five datasets and show that FastqZip can outperform
the SOTA algorithm Genozip by around 10% in compression ratio while
having an acceptable slowdown.
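
The pipeline described above (map reads to a reference, reorder them, optionally quantize quality scores, then apply a general-purpose lossless compressor) can be illustrated with a toy sketch. This is not FastqZip's implementation. The function names, the brute-force ungapped mapping, the quality bin width of 8, and the tab-separated record layout are all assumptions made for illustration, and Python's standard `bz2` module stands in for BSC or ZPAQ, which are not in the standard library.

```python
import bz2

def encode_read(read, reference):
    """Brute-force ungapped mapping: return (position, mismatches) with the
    fewest mismatches. Assumes the read is no longer than the reference."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mism = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if best is None or len(mism) < len(best[1]):
            best = (pos, mism)
    return best

def decode_read(pos, mism, length, reference):
    """Rebuild a read from its reference window plus the stored mismatches."""
    bases = list(reference[pos:pos + length])
    for i, b in mism:
        bases[i] = b
    return "".join(bases)

def bin_quality(qual, step=8):
    """Lossy step (assumed scheme): round each Phred+33 score down to a
    multiple of `step`, shrinking the quality alphabet."""
    return "".join(chr(33 + ((ord(c) - 33) // step) * step) for c in qual)

def compress(records, reference, lossy=True):
    """records: list of (sequence, quality) pairs. Returns compressed bytes."""
    encoded = []
    for seq, qual in records:
        pos, mism = encode_read(seq, reference)
        encoded.append((pos, len(seq), mism, bin_quality(qual) if lossy else qual))
    # Reorder reads by mapped position so neighbouring records look alike.
    encoded.sort(key=lambda e: e[0])
    lines = []
    for pos, length, mism, qual in encoded:
        mism_field = ",".join(f"{i}{b}" for i, b in mism)
        lines.append(f"{pos}\t{length}\t{mism_field}\t{qual}")
    # bz2 is a stand-in for the BSC/ZPAQ back end named in the abstract.
    return bz2.compress("\n".join(lines).encode("ascii"))

def decompress(blob, reference):
    """Inverse of compress; reads come back in position order."""
    text = bz2.decompress(blob).decode("ascii")
    records = []
    for line in text.split("\n") if text else []:
        pos, length, mism_field, qual = line.split("\t")
        mism = [(int(m[:-1]), m[-1]) for m in mism_field.split(",")] if mism_field else []
        records.append((decode_read(int(pos), mism, int(length), reference), qual))
    return records
```

In this sketch the sequences always round-trip exactly, while the quality strings round-trip only when `lossy=False`. The original read order is not preserved, which mirrors the trade-off that read reordering implies.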