SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAM file may optionally specify the version being used via the @HD VN tag. For full version history see Appendix B.
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8. Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.
Where it makes a difference, SAM file contents should be read and written using the POSIX / C locale. For example, floating-point values in SAM always use '.' for the decimal-point character.
The regular expressions in this specification are written using the POSIX / IEEE Std 1003.1 extended syntax.
1.1 An example
Suppose we have the following alignment with bases in lowercase clipped from the alignment. Reads r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment.
1.2 Terminologies and Concepts
Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
Segment: A contiguous sequence or subsequence.
Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e., one portion of the alignment on the forward strand and another portion on the reverse strand). A linear alignment can be represented in a single SAM record.
Chimeric alignment: An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the "representative" alignment, and the others are called "supplementary" and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for the 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.
Read alignment: A linear alignment or a chimeric alignment that is the complete representation of the alignment of the read.
Multiple mapping: The correct placement of a read may be ambiguous, e.g., due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for the 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.
1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats use the 0-based coordinate system.
Phred scale: Given a probability $0 < p \le 1$, the phred scale of $p$ equals $-10\log_{10} p$, rounded to the closest integer.
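The two coordinate systems and the phred scale defined above can be illustrated with a minimal Python sketch. The function names are hypothetical, chosen only for this illustration, and Python's built-in round() stands in for "rounded to the closest integer" (it resolves exact .5 ties to the even integer, which the text above does not address).

```python
import math

def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed interval [start, end] into the
    equivalent 0-based half-open interval [start - 1, end)."""
    return start - 1, end

def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) into the
    equivalent 1-based closed interval [start + 1, end]."""
    return start + 1, end

def phred(p: float) -> int:
    """Phred scale of a probability p, with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    return round(-10.0 * math.log10(p))
```

For example, one_based_to_zero_based(3, 7) gives (2, 7), matching the region between the 3rd and 7th bases described above.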
1.2.1 Character set restrictions
Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF. To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
Query or read names may contain any printable ASCII characters in the range [!-~] apart from '@', so that SAM alignment lines can be easily distinguished from header lines. (They are also limited in length.)
Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets (i.e., apart from the characters \ , " ` ' ( ) [ ] { } < >), and may not start with '*' or '='.
Thus they match the following regular expression:
    [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
For clarity, elsewhere in this specification we write this set of allowed characters as a character class [:rname:] and extend the POSIX regular expression notation to use ^*= to indicate the omission of '*' and '=' from the character class. Thus this regular expression can be written more clearly as [:rname:^*=][:rname:]*.
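As an illustration, the reference sequence name rule can be checked with Python's re module. This is a minimal sketch: the pattern is the expanded form of [:rname:^*=][:rname:]* (printable ASCII minus backslash, comma, quotation marks and brackets, with '*' and '=' additionally excluded from the first character), and the function name is_valid_rname is hypothetical.

```python
import re

# First character: allowed set without '*' and '='; remaining characters: full allowed set.
_RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)

def is_valid_rname(name: str) -> bool:
    """Return True if the whole string is an acceptable reference sequence name."""
    return _RNAME_RE.fullmatch(name) is not None
```

Note that '*' and '=' are rejected only in the first position, so a name such as HLA-A*01:01 is accepted while a bare '*' is not.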
1.3 The header section
Each header line begins with the character '@' followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format 'TAG:VALUE' where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.
The following table describes the header record types that may be used and their predefined tags. Tags listed with '*' are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.
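The header line structure described above (a record type code followed by TAB-delimited TAG:VALUE fields, with no tag repeated) can be sketched as a small parser. This is an illustrative assumption-laden sketch, not a reference implementation: the function name parse_header_line is hypothetical, the pattern accepts only printable ASCII values (so it would reject the UTF-8 text that some fields permit), and required-tag checks are left out.

```python
import re

# Typed header lines: record type, then one or more TAB-prefixed TAG:VALUE fields.
_HEADER_RE = re.compile(r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+")

def parse_header_line(line: str):
    """Return (record_type, fields). fields is a dict of TAG -> VALUE,
    or the comment text (a str) for @CO lines."""
    if line.startswith("@CO\t"):
        return "CO", line[4:]
    if _HEADER_RE.fullmatch(line) is None:
        raise ValueError(f"malformed header line: {line!r}")
    record_type, *raw_fields = line.split("\t")
    fields = {}
    for raw in raw_fields:
        tag, value = raw.split(":", 1)  # VALUE may itself contain ':'
        if tag in fields:
            raise ValueError(f"tag {tag} appears more than once")
        fields[tag] = value
    return record_type[1:], fields
```

Returning a dict reflects the rule that field order within a header line is not significant.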
Tag     Description
@HD     File-level metadata. Optional. If present, there must be only one @HD line and it must be the first line of the file.
  VN*   Format version. Accepted format: /^[0-9]+\.[0-9]+$/.
[...]
        [...] sciences), SINGULAR, SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.
  PM    Platform model. Free-form text providing further details of the platform/technology used.
  PU    Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.
  SM    Sample. Use pool name where a pool is being sequenced.
@PG     Program.
  ID*   Program record identifier. Each @PG line must have a unique ID. The value of ID is used in the alignment PG tag and PP tags of other @PG lines. PG IDs may be modified when merging SAM files in order to handle collisions.
  PN    Program name
  CL    Command line. UTF-8 encoding may be used.
  PP    Previous @PG-ID. Must match another @PG header's ID tag. @PG records may be chained using the PP tag, with the last record in the chain having no PP tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e., the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. The PG ID on a SAM record is not required to refer to the newest PG record in a chain. It may refer to any PG record in a chain, implying that the SAM record has been operated on by the program in that PG record, and the program(s) referred to via the PP tag.
  DS    Description. UTF-8 encoding may be used.
  VN    Program version
@CO     One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be used.
1.3.1 Defined sub-sort terms
While the SS sub-sort field allows implementation-defined keywords, some terms are predefined with specific meanings.
lexicographical sort order is defined as a character-based dictionary sort with the character order as defined by the POSIX C locale. For example "abc", "abc17", "abc5", "abc59" and "abcd" are in lexicographical order.
natural sort order is similar to lexicographical order except that runs of adjacent digits are considered to be numbers embedded within the text string, ordered numerically when compared to each other and ordered as single digits when compared to the surrounding non-digit characters. Runs that differ only in the number of leading zeros (thus are numerically tied) are ordered by more-zeros coming before fewer-zeros. The characters '-' and '.' are considered as ordinary characters, so apparently negative or fractional values are not treated as part of an embedded number. For example, "abc", "abc+5", "abc-5", "abc.d", "abc03", "abc5", "abc008", "abc08", "abc8", "abc17", "abc17.+", "abc17.2", "abc17.d", "abc59" and "abcd" are in natural order.
umi is a lexicographical sort by the UMI tag. The MI tag should be used for comparing UMIs. The RX tag may be used in its absence but is not guaranteed to be unique across multiple libraries.
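The natural sort order defined above can be expressed as a Python sort key. The sketch below is one possible reading of the definition, not the behaviour of any particular tool: each digit run becomes a tuple that compares like the character '0' against non-digit characters, then numerically against other runs, then by length so that more leading zeros sort first. The name natural_key is hypothetical, and ASCII input is assumed.

```python
import re

# Split a string into maximal digit runs and single non-digit characters.
_TOKEN_RE = re.compile(r"[0-9]+|[^0-9]")

def natural_key(text: str) -> list[tuple[int, int, int]]:
    """Sort key implementing the natural order described in the text."""
    key = []
    for token in _TOKEN_RE.findall(text):
        if "0" <= token[0] <= "9":
            # Digit run: sits in the digit range relative to other characters,
            # then orders by value, then longer runs (more zeros) first.
            key.append((ord("0"), int(token), -len(token)))
        else:
            key.append((ord(token), 0, 0))
    return key
```

Usage: sorted(names, key=natural_key). Plain sorted(names) already gives the lexicographical order for ASCII strings, because Python compares strings by code point.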
1.3.2 Reference MD5 calculation
The M5 tag on @SQ lines allows reference sequences to be uniquely identified through the MD5 digest of the sequence itself. As the digest is based on the sequence and nothing else, it can help resolve ambiguities with reference naming. For example, it allows a quick way of checking that references named '1', 'Chr1' and 'chr1' in different files are in fact the same.
The reference sequence must be in the 7-bit US-ASCII character set. All valid reference bases can be represented in this set, and it avoids the problem of determining exactly which 8-bit representation may have been used. Padding characters (see Section 3.2) must be represented only using the '*' character.
The digest is calculated as follows:
1. All characters outside of the inclusive range 33 ('!') to 126 ('~') are stripped out. This removes all unprintable and whitespace characters including spaces and new lines. Everything else is retained, even if not a legal nucleotide code.
2. All lowercase characters are converted to uppercase. This operation is equivalent to calling toupper() on characters in the POSIX locale.
3. The MD5 digest is calculated as described in RFC 1321 and presented as a 32 character lowercase hexadecimal number.
As an example, if the reference contains the following characters (including spaces):
    ACGT ACGT ACGT
    acgt acgt acgt
    ... 12345 !!!
then the digest is that of the string ACGTACGTACGTACGTACGTACGT...12345!!! and the resulting tag would be M5:dfabdbb36e239a6da88957841f32b8e4.
In padded SAM files, the padding bases should be inserted into the reference as '*' characters. Taking the example in Section 3.2, the padded version of the reference is
    AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
and the corresponding tag is M5:caad65b937c4bc0b33c08f62a9fb5411.
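The three digest steps can be sketched with Python's standard hashlib module. The function names normalize_reference and reference_md5 are hypothetical, and the sketch assumes the reference is already available as a Python string.

```python
import hashlib

def normalize_reference(seq: str) -> str:
    """Strip characters outside the printable range 33..126, then uppercase the rest."""
    kept = "".join(c for c in seq if 33 <= ord(c) <= 126)
    return kept.upper()  # only ASCII remains, so this matches toupper() in the POSIX locale

def reference_md5(seq: str) -> str:
    """Return the 32-character lowercase hexadecimal MD5 digest for an M5 tag."""
    return hashlib.md5(normalize_reference(seq).encode("ascii")).hexdigest()

raw = "ACGT ACGT ACGT\nacgt acgt acgt\n... 12345 !!!"
print(normalize_reference(raw))   # the string that is actually hashed
print("M5:" + reference_md5(raw))
```

Because whitespace and case are normalized before hashing, FASTA line wrapping and soft-masking do not change the digest.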
1.4 The alignment section: mandatory fields
In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line consists of 11 or more TAB-separated fields. The first eleven fields are always present and in the order shown below; if the information represented by any of these fields is unavailable, that field's value will be a placeholder, either '0' or '*' as determined by the field's type. The following table gives an overview of these mandatory fields in the SAM format:
Col  Field  Type    Regexp/Range                 Brief description
1    QNAME  String  [!-?A-~]{1,254}              Query template NAME
2    FLAG   Int     [0, 2^16 - 1]                bitwise FLAG
3    RNAME  String  \*|[:rname:^*=][:rname:]*    Reference sequence NAME
4    POS    Int     [0, 2^31 - 1]                1-based leftmost mapping POSition
5    MAPQ   Int     [0, 2^8 - 1]                 MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+      CIGAR string
7    RNEXT  String  \*|=|[:rname:^*=][:rname:]*  Reference name of the mate/next read
8    PNEXT  Int     [0, 2^31 - 1]                Position of the mate/next read
9    TLEN   Int     [-2^31 + 1, 2^31 - 1]        observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+               segment SEQuence
11   QUAL   String  [!-~]+                       ASCII of Phred-scaled base QUALity+33
All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.
1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME '*' indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
2. FLAG: Combination of bitwise FLAGs. Each bit is explained in the following table:
Bit           Description
1     0x1     template having multiple segments in sequencing
2     0x2     each segment properly aligned according to the aligner
4     0x4     segment unmapped
8     0x8     next segment in the template unmapped
16    0x10    SEQ being reverse complemented
32    0x20    SEQ of the next segment in the template being reverse complemented
64    0x40    the first segment in the template
128   0x80    the last segment in the template
256   0x100   secondary alignment
512   0x200   not passing filters, such as platform/vendor quality controls
1024  0x400   PCR or optical duplicate
2048  0x800   supplementary alignment
For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies 'FLAG & 0x900 == 0'. This line is called the primary line of the read.
Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called a supplementary line.
Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.
Bit 0x10 indicates whether SEQ has been reverse complemented and QUAL reversed. When bit 0x4 is unset, this corresponds to the strand to which the segment has been mapped: bit 0x10 unset indicates the forward strand, while set indicates the reverse strand. When 0x4 is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine.
Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used. If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or when this information is lost during data processing.
If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.
Bits that are not listed in the table are reserved for future use. They should not be set when writing and should be ignored on reading by current software.
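The FLAG bits can be decoded with a small Python enum. This is a minimal sketch: the member names are hypothetical labels chosen for readability, and only the numeric bit values (0x1 through 0x800) and the primary-line test FLAG & 0x900 == 0 come from the SAM format itself.

```python
from enum import IntFlag

class SamFlag(IntFlag):
    PAIRED = 0x1           # template having multiple segments in sequencing
    PROPER_PAIR = 0x2      # each segment properly aligned according to the aligner
    UNMAPPED = 0x4         # segment unmapped
    MATE_UNMAPPED = 0x8    # next segment in the template unmapped
    REVERSE = 0x10         # SEQ being reverse complemented
    MATE_REVERSE = 0x20    # SEQ of the next segment being reverse complemented
    FIRST = 0x40           # the first segment in the template
    LAST = 0x80            # the last segment in the template
    SECONDARY = 0x100      # secondary alignment
    QC_FAIL = 0x200        # not passing filters
    DUPLICATE = 0x400      # PCR or optical duplicate
    SUPPLEMENTARY = 0x800  # supplementary alignment

def flag_names(flag: int) -> list[str]:
    """List the names of the defined bits set in flag; reserved bits are ignored."""
    return [member.name for member in SamFlag if flag & member]

def is_primary_line(flag: int) -> bool:
    """A line is primary when neither the secondary nor the supplementary bit is set."""
    return (flag & 0x900) == 0
```

For instance, flag_names(99) returns ['PAIRED', 'PROPER_PAIR', 'MATE_REVERSE', 'FIRST'], since 99 = 0x1 + 0x2 + 0x20 + 0x40.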
3. RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not '*') must be present in one of the SQ-SN tags. An unmapped segment without coordinate has a '*' at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is '*', no assumptions can be made about POS and CIGAR.
4. POS: 1-based leftmost mapping POSition of the first CIGAR operation that "consumes" a reference base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.
5. MAPQ: MAPping Quality. It equals $-10\log_{10}\Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
6. CIGAR: CIGAR string. The CIGAR operations are given in the following table (set '*' if unavailable):
Op  BAM  Description                                            Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)  yes             yes
I   1    insertion to the reference                             yes             no
D   2    deletion from the reference                            no              yes
N   3    skipped region from the reference                      no              yes
S   4    soft clipping (clipped sequences present in SEQ)       yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)   no              no
P   6    padding (silent deletion from padded reference)        no              no
=   7    sequence match                                         yes             yes
X   8    sequence mismatch                                      yes             yes
"Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
H can only be present as the first and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
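The "consumes query" and "consumes reference" columns translate directly into length calculations. The sketch below parses a CIGAR string and sums the operations in each category (M, I, S, = and X consume the query; M, D, N, = and X consume the reference). Function and constant names are hypothetical, and the placement rules for H and S are not checked.

```python
import re

_CIGAR_RE = re.compile(r"([0-9]+[MIDNSHPX=])+")
_OP_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")

CONSUMES_QUERY = frozenset("MIS=X")
CONSUMES_REFERENCE = frozenset("MDN=X")

def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, op) pairs; '*' yields an empty list."""
    if cigar == "*":
        return []
    if _CIGAR_RE.fullmatch(cigar) is None:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in _OP_RE.findall(cigar)]

def query_length(cigar: str) -> int:
    """Number of SEQ bases implied by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)

def reference_length(cigar: str) -> int:
    """Number of reference bases spanned, starting at POS."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
```

With these helpers, a record whose SEQ is not '*' can be checked with len(seq) == query_length(cigar), and the last aligned reference position is POS + reference_length(cigar) - 1.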
7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the next read is the first read in the template. If @SQ header lines are present, RNEXT (if not '*' or '=') must be present in one of the SQ-SN tags. This field is set as '*' when the information is unavailable, and set as '=' if RNEXT is identical to RNAME. If not '=' and the next read in the template has one primary mapping (see also bit 0x100 in FLAG), this field is identical to RNAME at the primary line of the next read. If RNEXT is '*', no assumptions can be made on PNEXT and bit 0x20.
8. PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
9. TLEN: signed observed Template LENgth. For primary reads where the primary alignments of all reads in the template are mapped to the same reference sequence, the absolute value of TLEN equals the distance between the mapped end of the template and the mapped start of the template, inclusively (i.e., end - start + 1). Note that a mapped base is defined to be one that aligns to the reference as described by CIGAR, and hence excludes soft-clipped bases. The TLEN field is positive for the leftmost segment of the template, negative for the rightmost, and the sign for any middle segment is undefined. If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing signs. It is set as 0 for a single-segment template or when the information is unavailable (e.g., when the first or last segment of a multi-segment template is unmapped or when the two are mapped to different reference sequences).
The intention of this field is to indicate where the other end of the template has been aligned without needing to read the remainder of the SAM file. Unfortunately there has been no clear consensus on the definitions of the template mapped start and end. Thus the exact definitions are implementation-defined.
10. SEQ: segment SEQuence. This field can be a '*' when the sequence is not stored. If not a '*', the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR. An '=' denotes the base is identical to the reference base. No assumptions can be made on the letter cases.
11. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals $-10\log_{10}\Pr\{\text{base is wrong}\}$. This field can be a '*' when quality is not stored. If not a '*', SEQ must not be a '*' and the length of the quality string ought to equal the length of SEQ.
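The QUAL encoding (phred-scaled quality plus 33, written as an ASCII character) can be sketched as follows. The function names are hypothetical, and the '*' placeholder is mapped to None as an illustrative choice.

```python
from typing import Optional

def decode_qual(qual: str) -> Optional[list[int]]:
    """Convert a QUAL string into phred-scaled base qualities; '*' means not stored."""
    if qual == "*":
        return None
    return [ord(ch) - 33 for ch in qual]

def encode_qual(qualities: list[int]) -> str:
    """Convert phred-scaled base qualities (0..93) back into a QUAL string."""
    if any(not 0 <= q <= 93 for q in qualities):
        raise ValueError("qualities must lie in 0..93 to stay within '!'..'~'")
    return "".join(chr(q + 33) for q in qualities)

def error_probability(quality: int) -> float:
    """Invert the phred scale: the error probability implied by a base quality."""
    return 10.0 ** (-quality / 10.0)
```

For example, the characters '!', '5' and 'I' decode to qualities 0, 20 and 40, that is, error probabilities of 1, 0.01 and 0.0001.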
1.5 The alignment section: optional fields
All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string that matches /[A-Za-z][A-Za-z0-9]/. Within each alignment line, no TAG may appear more than once and the order in which the optional fields appear is not significant. A TAG containing lowercase letters is reserved for end users. In an optional field, TYPE is a single case-sensitive letter which defines the format of VALUE:
Type  Regexp matching VALUE                                  Description
A     [!-~]                                                  Printable character
i     [-+]?[0-9]+                                            Signed integer
f     [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?                 Single-precision floating number
Z     [ !-~]*                                                Printable string, including space
H     ([0-9A-F][0-9A-F])*                                    Byte array in the Hex format
B     [cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)*    Integer or numeric array
For an integer or numeric array (type 'B'), the first letter indicates the type of numbers in the following comma-separated array. The letter can be one of 'cCsSiIf', corresponding to int8_t (signed 8-bit integer), uint8_t (unsigned 8-bit integer), int16_t, uint16_t, int32_t, uint32_t and float, respectively. During import/export, the element type may be changed if the new type is also compatible with the array.
Predefined tags are described in the separate Sequence Alignment/Map Optional Fields Specification. See that document for details of existing standard tag fields and conventions around creating new tags that may be of general interest. Tags starting with 'X', 'Y' or 'Z' and tags containing lowercase letters in either position are reserved for local use and will not be formally defined in any future version of these specifications.
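The TAG:TYPE:VALUE layout can be sketched as a small parser that converts VALUE according to TYPE. This is an illustrative sketch under stated assumptions: the function name is hypothetical, 'B' arrays are returned as a (subtype, list) pair, and value checking is more lenient than the format requires (for instance, bytes.fromhex also accepts lowercase hex digits, and integer ranges for the 'B' subtypes are not enforced).

```python
import re

_TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]")

def parse_optional_field(field: str):
    """Parse one optional field into (tag, type, converted value)."""
    tag, type_code, value = field.split(":", 2)  # VALUE may itself contain ':'
    if _TAG_RE.fullmatch(tag) is None:
        raise ValueError(f"invalid tag: {tag!r}")
    if type_code == "A":
        if len(value) != 1:
            raise ValueError("type A holds exactly one character")
        return tag, type_code, value
    if type_code == "i":
        return tag, type_code, int(value)
    if type_code == "f":
        return tag, type_code, float(value)
    if type_code == "Z":
        return tag, type_code, value
    if type_code == "H":
        return tag, type_code, bytes.fromhex(value)
    if type_code == "B":
        subtype, *items = value.split(",")
        if len(subtype) != 1 or subtype not in "cCsSiIf":
            raise ValueError(f"invalid array subtype: {subtype!r}")
        convert = float if subtype == "f" else int
        return tag, type_code, (subtype, [convert(item) for item in items])
    raise ValueError(f"unknown type: {type_code!r}")
```

For example, parse_optional_field("NM:i:3") returns ("NM", "i", 3).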
2 Recommended Practice for the SAM Format 2 SAM 格式的建议做法
This section describes the best practice for representing data in the SAM format. They are not required in general, but may be required by a specific software package for it to function properly. 本节介绍以 SAM 格式表示数据的最佳做法。一般情况下,它们不是必需的,但特定软件包可能需要它们才能正常运行。
The header section 页眉部分
1 The @HD line should be present, with either the SO tag or the GO tag (but not both) specified. 1 应该有 @HD 行,并指定 SO 标记或 GO 标记(但不能同时指定)。
2 The @SQ lines should be present if reads have been mapped. 2 如果已对读数进行了映射,则应出现 @SQ 行。
3 When a RG tag appears anywhere in the alignment section, there should be a single corresponding @RG line with matching ID tag in the header. 3 当 RG 标记出现在对齐部分的任何位置时,页眉中应出现一个与 ID 标记匹配的 @RG 行。
4 When a PG tag appears anywhere in the alignment section, there should be a single corresponding @PG line with matching ID tag in the header. 4 当 PG 标记出现在对齐部分的任何位置时,页眉中应出现一个与 ID 标记匹配的 @PG 行。
Adjacent CIGAR operations should be different. 相邻的 CIGAR 行动应有所不同。
No alignments should be assigned mapping quality 255. 不应将任何排列分配给制图质量为 255 的排列。
Unmapped reads 未映射读数
1 For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate. 1 对于伴侣已被映射的未映射成对末端或伴侣对读数,未映射读数的 RNAME 和 POS 应与其伴侣相同。
2 If all segments in a template are unmapped, their RNAME should be set as '*, and POS as 0. 2 如果模板中的所有线段都未映射,则其 RNAME 应设置为'*',POS 应设置为 0。
3 If POS plus the sum of lengths of operations in CIGAR exceeds the length specified in the LN field of the @SQ header line (if exists) with an SN equal to RNAME, the alignment should be unmapped, unless the reference sequence is circular (see below). 3 如果 POS 加上 CIGAR 中 操作的长度总和超过 @SQ 头行(如果存在)LN 字段中指定的长度(SN 等于 RNAME),除非参考序列是环状的(见下文),否则应取消对齐。
4 Unmapped reads should be stored in the orientation in which they came off the sequencing machine and have their reverse flag bit ( ) correspondingly unset. 4 未映射读数应以它们从测序机上下来时的方向存储,并相应地取消反向标志位( )的设置。
Multiple mapping 多重映射
1 When one segment is present in multiple lines to represent a multiple mapping of the segment, only one of these records should have the secondary alignment flag bit ( ) unset. RNEXT and PNEXT point to the primary line of the next read in the template. 1 当一个数据段出现在多行中以表示该数据段的多重映射时,只有其中一条记录的辅助对齐标志位 ( ) 未被设置。RNEXT 和 PNEXT 指向模板中下一次读取的主行。
2 SEQ and QUAL of secondary alignments should be set to ' to reduce the file size. 2 SEQ 和次级排列的 QUAL 应设置为 ' 以减小文件大小。
Optional tags: 可选标签:
1 If the template has more than 2 segments, the TC tag should be present. 1 如果模板有 2 个以上分段,则应出现 TC 标记。
2 The NM tag should be present. 2 应该有 NM 标签。
Circular reference sequences 循环参考序列
Mappings that cross the coordinate 'join' in circular reference sequences (i.e., those whose @SQ headers specify TP:circular) may be represented as follows:
1 (Preferred) As usual POS should be between 1 and the @SQ header's LN value, but POS plus the sum of the lengths of M/=/X/D/N operations may exceed LN. Coordinates greater than LN are interpreted by subtracting LN so that bases at LN+1, LN+2, LN+3, ... are considered to be mapped at positions 1, 2, 3, ...; thus each (1-based) position p is interpreted as ((p - 1) mod LN) + 1.
2 Alternatively, such alignments may be split across several records: one record representing the initial portion of the segment ending at LN, one representing the final portion starting from 1, and any other records representing additional portions in between spanning the entire reference sequence. One record (chosen arbitrarily) is considered primary and the remainder have their supplementary flag bit (0x800) set.
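The coordinate wrapping used by the preferred representation can be sketched as a one-line helper. This is a minimal illustration, not part of the specification; the function name circular_pos is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 1-based coordinate p on a circular reference of length ln to its
 * canonical position in [1, ln], i.e. ((p - 1) mod ln) + 1.
 * circular_pos is a hypothetical name used only for this sketch. */
int64_t circular_pos(int64_t p, int64_t ln)
{
    assert(p >= 1 && ln >= 1);
    return (p - 1) % ln + 1;
}
```

For a reference with LN = 100, positions 1 to 100 are returned unchanged, while position 101 wraps round to 1.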
8. Annotation dummy reads: These have SEQ set to *, FLAG bits 0x100 and 0x200 set (secondary and filtered), and a CT tag.
1 If you wish to store free text in a CT tag, use the key value Note (uppercase N) to match GFF3.
2 Multi-segment annotation (e.g., a gene with introns) should be described with multiple lines in SAM (like a multi-segment read). Where there is a clear biological direction (e.g., a gene), the first segment (FLAG bit 0x40) is used for the first section (e.g., the 5' end of the gene). Thus a GenBank entry location like complement(join(85052..85354, 85441..85621, 86097..86284)) would have three lines in SAM with a common QNAME:
                  FLAG   POS     CIGAR   Optional fields
The 5' fragment          86097   188M    FI:i:1 TC:i:3
Middle fragment          85441   181M    FI:i:2 TC:i:3
The 3' fragment          85052   303M    FI:i:3 TC:i:3
3 If converting GFF3 to SAM, store any key/value pairs from column 9 in the CT tag, except for the unique ID, which is used for the QNAME. GFF3 columns 1 (seqid), 4 (start) and 5 (end) are encoded using SAM columns RNAME, POS and CIGAR (which holds the length). GFF3 columns 3 (type) and 7 (strand) are stored explicitly in the CT tag. Remaining GFF3 columns 2 (source), 6 (score), and 8 (phase) are stored in the CT tag using key values FSource, FScore and FPhase (uppercase keys are restricted in GFF3, so these names avoid clashes). Split location features are described with multiple lines in GFF3, and similarly become multi-segment dummy reads in SAM, with the RNEXT and PNEXT columns filled in appropriately. In the absence of a convention in SAM/BAM for reads wrapping the origin of a circular genome, any GFF3 feature line wrapping the origin must be split into two segments in SAM.
3 Guide for Describing Assembly Sequences in SAM
3.1 Unpadded versus padded representation
To describe alignments, we can regard the reference sequence with no respect to other alignments against it. Such a reference sequence is called an unpadded reference. A position on an unpadded reference, referred to as an unpadded position, is not affected by any alignments. When we use unpadded references and positions to describe alignments, we say we are using the unpadded representation.
Alternatively, to describe the same alignments, we can modify the reference sequence to contain pads that make room for sequences inserted relative to the reference. A pad is effectively a gap and is conventionally represented by an asterisk '*'. A reference sequence containing pads is called a padded reference. A position which counts the *'s is referred to as a padded position. A padded reference sequence may be affected by the query alignments and, because of gap insertions, is typically longer than the unpadded reference. The padded position of one query alignment may be affected by other query alignments.
Unpadded and padded are different representations of the same alignments. They are convertible to each other with no loss of any information. The unpadded representation is more common due to the convenience of a fixed coordinate system, while the padded representation has the advantage that alignments can be simply described by the start and end coordinates without using complex CIGAR strings. SAM traditionally uses the unpadded representation, whereas the ACE assembly format uses the padded representation exclusively.
3.2 Padded SAM
The SAM format is typically used to describe alignments against an unpadded reference sequence, but it is also able to describe alignments against a padded reference. In the latter case, we say we are using a padded SAM. A padded SAM is a valid SAM, but with the difference that the reference and positions in use are padded. There may be more than one way to describe the padded representation. We recommend the following; see also the discussion in Cock et al.
In a padded SAM, alignments and coordinates are described with respect to the padded reference sequence. Unlike traditional padded representations like the ACE file format where pads/gaps are recorded in reads using *'s, we do not write *'s in the SEQ field of the SAM format. Instead, we describe pads in the query sequences as deletions from the padded reference using the CIGAR 'D' operation. In a padded SAM, the insertion and padding CIGAR operations ('I' and 'P') are not used because the padded reference already considers all the insertions.
The following shows the padded SAM for the example alignment in Section 1.1. Notably, the length of ref is 47 instead of 45. POS of the last three alignments are all shifted by 2. CIGAR of alignments bridging the 2 bp insertion are also changed.
Here we also exemplify the recommended practice for storing the reference sequence and the reference annotations in SAM when necessary. For a reference sequence in SAM, QNAME should be identical to RNAME, POS set to 1 and FLAG to 516 (filtered and unmapped); for an annotation, FLAG should be set to 768 (filtered and secondary) with no restriction to QNAME. Dummy reads for annotation would typically have a CT tag to hold the annotation information; see the discussion of dummy reads in Section 2. See also the separate Optional Fields Specification for full details of the CT and PT annotation tags.
4 The BAM Format Specification
4.1 The BGZF compression format
BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is 'gunzip compatible', in the sense that a compliant gunzip utility can decompress a BGZF compressed file.
A BGZF file is a series of concatenated BGZF blocks, each no larger than 64 KiB before or after compression. Each BGZF block is itself a spec-compliant gzip archive which contains an "extra field" in the format described in RFC1952. The gzip file format allows the inclusion of application-specific extra fields and these are ignored by compliant decompression implementations. The gzip specification also allows gzip files to be concatenated. The result of decompressing concatenated gzip files is the concatenation of the uncompressed data.
Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer gives the size of the containing BGZF block minus one.
On disk, a complete BGZF file is a series of blocks as shown in the following table. (All integers are little endian as is required by RFC1952.)
Field      Description                                  Type                     Value
List of compression blocks (until the end of the file)
  ID1      gzip IDentifier1                             uint8_t                  31
  ID2      gzip IDentifier2                             uint8_t                  139
  CM       gzip Compression Method                      uint8_t                  8
  FLG      gzip FLaGs                                   uint8_t                  4
  MTIME    gzip Modification TIME                       uint32_t
  XFL      gzip eXtra FLags                             uint8_t
  OS       gzip Operating System                        uint8_t
  XLEN     gzip eXtra LENgth                            uint16_t
  Extra subfield(s) (total size = XLEN)
    Additional RFC1952 extra subfields if present
    SI1    Subfield Identifier1                         uint8_t                  66
    SI2    Subfield Identifier2                         uint8_t                  67
    SLEN   Subfield LENgth                              uint16_t                 2
    BSIZE  total Block SIZE minus 1                     uint16_t
    Additional RFC1952 extra subfields if present
  CDATA    Compressed DATA by zlib::deflate()           uint8_t[BSIZE-XLEN-19]
  CRC32    CRC-32                                       uint32_t
  ISIZE    Input SIZE (length of uncompressed data)     uint32_t
The random access method to be described next limits the uncompressed contents of each BGZF block to a maximum of 2^16 bytes of data. Thus while ISIZE is stored as a uint32_t as per the gzip format, in BGZF it is limited to the range [0, 65536]. BSIZE can represent BGZF block sizes in the range [1, 65536], though typically BSIZE will be rather less than ISIZE due to compression.
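As an illustration of the block layout, the following sketch reads the total block size from the start of a BGZF block by scanning the gzip extra field for the 'BC' subfield. It assumes the 12-byte fixed gzip header implied by the table above (ID1 to XLEN); the function name bgzf_block_size is hypothetical and this is not a reference implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the total size in bytes (BSIZE + 1) of the BGZF block whose first
 * `avail` bytes are at hdr, or 0 if no valid BGZF header is found.
 * bgzf_block_size is a hypothetical helper written for this sketch. */
size_t bgzf_block_size(const uint8_t *hdr, size_t avail)
{
    assert(hdr != NULL);
    if (avail < 12) return 0;                       /* fixed gzip header is 12 bytes */
    if (hdr[0] != 31 || hdr[1] != 139 || hdr[2] != 8) return 0; /* ID1, ID2, CM */
    if ((hdr[3] & 4) == 0) return 0;                /* extra-field flag must be set */

    size_t xlen = hdr[10] | ((size_t)hdr[11] << 8); /* XLEN, little endian */
    size_t off = 12, end = 12 + xlen;
    if (avail < end) return 0;

    /* Walk the RFC1952 subfields: SI1, SI2, SLEN (2 bytes), then SLEN bytes. */
    while (off + 4 <= end) {
        size_t slen = hdr[off + 2] | ((size_t)hdr[off + 3] << 8);
        if (off + 4 + slen > end) return 0;         /* subfield overruns the extra field */
        if (hdr[off] == 66 && hdr[off + 1] == 67 && slen == 2)
            return (size_t)(hdr[off + 4] | (hdr[off + 5] << 8)) + 1;
        off += 4 + slen;
    }
    return 0;                                       /* no 'BC' subfield present */
}
```

Scanning the subfields rather than reading BSIZE at a fixed offset allows for additional RFC1952 subfields appearing before the BGZF one.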
4.1.1 Random access
BGZF files support random access through the BAM file index. To achieve this, the BAM file index uses virtual file offsets into the BGZF file. Each virtual file offset is an unsigned 64-bit integer, defined as: coffset<<16|uoffset, where coffset is an unsigned byte offset into the BGZF file to the beginning of a BGZF block, and uoffset is an unsigned byte offset into the uncompressed data stream represented by that BGZF block. Virtual file offsets can be compared, but subtraction between virtual file offsets and addition between a virtual offset and an integer are both disallowed.
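The packing of virtual file offsets can be sketched as follows. The helper names are hypothetical; the only facts taken from the text are the coffset<<16|uoffset layout and the 64-bit width.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a compressed-file offset and an in-block offset into a virtual offset.
 * All three helper names are hypothetical, chosen for this sketch. */
uint64_t bgzf_make_voffset(uint64_t coffset, uint32_t uoffset)
{
    assert(coffset < (UINT64_C(1) << 48)); /* coffset must fit in the upper 48 bits */
    assert(uoffset < (UINT32_C(1) << 16)); /* uoffset must fit in the lower 16 bits */
    return (coffset << 16) | uoffset;
}

/* Byte offset of the start of the BGZF block within the compressed file. */
uint64_t bgzf_voffset_coffset(uint64_t voffset) { return voffset >> 16; }

/* Byte offset within that block's uncompressed data. */
uint32_t bgzf_voffset_uoffset(uint64_t voffset) { return (uint32_t)(voffset & 0xFFFF); }
```

Because coffset occupies the high-order bits, comparing two virtual offsets as plain integers orders them by file position, which is why comparison is meaningful while arithmetic on them is not.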
4.1.2 End-of-file marker
An end-of-file (EOF) trailer or marker block should be written at the end of BGZF files, so that unintended file truncation can be easily detected. The EOF marker block is a particular empty BGZF block encoded with the default zlib compression level settings, and consists of the following 28 bytes, shown in hexadecimal:
1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00
The presence of this EOF marker at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it. Empty BGZF blocks are not otherwise special; in particular, the presence of an EOF marker block does not by itself signal end of file.
The absence of this final EOF marker should trigger a warning or error soon after opening a BGZF file where random access is available. When reading a BGZF file in sequential streaming fashion, ideally this EOF check should be performed when the end of the stream is reached. Checking that the final BGZF block in the file decompresses to empty or checking that the last 28 bytes of the file are exactly the bytes above are both sufficient tests; each is likely more convenient in different circumstances.
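The second of the two tests, comparing the last 28 bytes of the file, can be sketched as below. The byte values are those of the standard BGZF EOF marker block (1f 8b 08 04 ... 1b 00 03 00 followed by eight zero bytes); the names BGZF_EOF and bgzf_has_eof_marker are hypothetical, and the caller is assumed to have the tail of the file in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 28-byte empty BGZF block used as the end-of-file marker. */
const uint8_t BGZF_EOF[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
    0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Return 1 if the buffer (the whole file, or at least its tail) ends with
 * the EOF marker block, 0 otherwise. Hypothetical helper for this sketch. */
int bgzf_has_eof_marker(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len < sizeof BGZF_EOF) return 0;
    return memcmp(data + len - sizeof BGZF_EOF, BGZF_EOF, sizeof BGZF_EOF) == 0;
}
```

A file that has lost even one trailing byte fails this check, which is the truncation case the marker is designed to catch.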
4.2 The BAM format
BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness. The format is formally described in the following table where values in brackets are the default when the corresponding information is not available; an underlined word in uppercase denotes a field in the SAM format.
Field        Description                                               Type           Value
magic        BAM magic string                                          char[4]        BAM\1
l_text       Length of the header text, including any NUL padding      uint32_t
text         Plain header text in SAM; not necessarily NUL-terminated  char[l_text]
n_ref        # reference sequences                                     uint32_t
List of reference information (n = n_ref)
  l_name     Length of the reference name plus 1 (including NUL)
...
  qual       Phred-scaled base qualities. See Section 4.2.3            char[l_seq]
  List of auxiliary data (until the end of the alignment block)
    tag      Two-character tag                                         char[2]
    val_type Value type: AcCsSiIfZHB, see Section 4.2.4                char
    value    Tag value                                                 (by val_type)
Most length and count fields described as uint32_t have additional constraints on their range: l_text due to implementation limits; n_ref because refID and next_refID are signed; l_ref because tlen is signed; those marked "limited" are limited by available memory and the practical size of the data represented well before they are limited by, e.g., Java's signed 32-bit integer maximum array size.
4.2.1 BIN field calculation
BIN is calculated using the reg2bin() function in Section 5.3. For mapped reads this uses POS-1 (i.e., 0-based left position) and the alignment end point using the alignment length from the CIGAR string. For unmapped reads (e.g., paired-end reads where only one part is mapped, see Section 2) and reads whose CIGAR strings consume no reference bases at all, the alignment is treated as being of length one. Note unmapped reads with POS 0 (which becomes -1 in BAM) therefore use reg2bin(-1, 0), which is computed as 4680.
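The region passed to reg2bin() for a given record can be sketched as follows. Two things here are assumptions not stated in this passage: that each BAM CIGAR element is encoded as op_len<<4|op with operations MIDNSHP=X numbered 0 to 8, and that the reference-consuming operations are M, D, N, = and X. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of reference bases consumed by a BAM-encoded CIGAR.
 * Assumes each element is op_len<<4|op with M=0, D=2, N=3, '='=7, X=8
 * being the reference-consuming operations. */
int32_t cigar_ref_len(const uint32_t *cigar, size_t n_cigar_op)
{
    int32_t len = 0;
    for (size_t i = 0; i < n_cigar_op; i++) {
        uint32_t op = cigar[i] & 0xF;
        uint32_t op_len = cigar[i] >> 4;
        if (op == 0 || op == 2 || op == 3 || op == 7 || op == 8)
            len += (int32_t)op_len;
    }
    return len;
}

/* Compute the zero-based half-open region [*beg, *end) to hand to reg2bin().
 * pos0 is the 0-based position stored in BAM (POS-1, so -1 when POS is 0).
 * Alignments consuming no reference bases are treated as length one. */
void bin_region(int32_t pos0, const uint32_t *cigar, size_t n_cigar_op,
                int32_t *beg, int32_t *end)
{
    int32_t len = cigar_ref_len(cigar, n_cigar_op);
    assert(beg != NULL && end != NULL);
    if (len == 0) len = 1;
    *beg = pos0;
    *end = pos0 + len;
}
```

For an unplaced unmapped read (pos0 of -1 and no CIGAR) this yields the region [-1, 0), matching the reg2bin(-1, 0) call mentioned above.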
4.2.2 N_CIGAR_OP field
With 16 bits, n_cigar_op can keep at most 65535 CIGAR operations in BAM files. For an alignment with more CIGAR operations, BAM stores the real CIGAR, encoded the same way as the cigar field in BAM, in the CG optional tag of type 'B,I', and sets CIGAR to 'kSmN' as a placeholder, where 'k' equals l_seq, 'm' is the reference sequence length in the alignment, and 'S' and 'N' are the soft-clipping and reference-clip CIGAR operators, respectively; i.e., in the binary form, n_cigar_op=2 and cigar=[k<<4|4, m<<4|3]. If tag CG is present and the first CIGAR operation clips the entire read, a BAM parsing library is expected to update n_cigar_op and cigar with the real CIGAR stored in the CG tag and remove the now-redundant CG tag.
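A minimal sketch of writing and recognising the two-operation placeholder follows. It assumes the BAM CIGAR element encoding op_len<<4|op with S=4 and N=3; the function and enum names are hypothetical, and checking for the presence of the CG tag is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

enum { CIGAR_OP_N = 3, CIGAR_OP_S = 4 }; /* BAM operation codes for 'N' and 'S' */

/* Fill cigar[0..1] with the placeholder "kSmN", where k is l_seq and m is the
 * reference length of the alignment. Hypothetical helper for this sketch. */
void long_cigar_placeholder(uint32_t l_seq, uint32_t ref_len, uint32_t cigar[2])
{
    assert(l_seq < (1u << 28) && ref_len < (1u << 28)); /* lengths occupy 28 bits */
    cigar[0] = (l_seq << 4) | CIGAR_OP_S;
    cigar[1] = (ref_len << 4) | CIGAR_OP_N;
}

/* Return 1 if the stored CIGAR has the placeholder shape: exactly two
 * operations, the first soft-clipping the entire read, the second an 'N'. */
int is_long_cigar_placeholder(const uint32_t *cigar, uint32_t n_cigar_op,
                              uint32_t l_seq)
{
    return n_cigar_op == 2
        && (cigar[0] & 0xF) == CIGAR_OP_S && (cigar[0] >> 4) == l_seq
        && (cigar[1] & 0xF) == CIGAR_OP_N;
}
```

A reader would only replace the placeholder with the contents of CG when both this shape check and the tag lookup succeed.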
4.2.3 SEQ and QUAL encoding
Sequence is encoded in 4-bit values, with adjacent bases packed into the same byte starting with the highest 4 bits first. When l_seq is odd the bottom 4 bits of the last byte are undefined, but we recommend writing these as zero. The case-insensitive base codes '=ACMGRSVTWYHKDBN' are mapped to [0, 15] respectively with all other characters mapping to 'N' (value 15).
Omitted sequence, represented in SAM as '*', is represented by l_seq being 0 and seq and qual zero-length.
Base qualities are stored as bytes in the range [0,93], without any +33 conversion to printable ASCII. When base qualities are omitted but the sequence is not, qual is filled with 0xFF bytes (to length l_seq).
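The packing rules above can be sketched as two small encoders. The function names are hypothetical. The quality encoder assumes its input is the SAM QUAL text (Phred+33) and treats the string "*" as "qualities omitted", which is a simplification for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack l_seq bases into (l_seq + 1) / 2 bytes, high nibble first.
 * Unrecognised characters map to 'N' (15); unused low bits are zero. */
void bam_pack_seq(const char *seq, size_t l_seq, uint8_t *out)
{
    static const char codes[] = "=ACMGRSVTWYHKDBN";
    assert(seq != NULL && out != NULL);
    memset(out, 0, (l_seq + 1) / 2);
    for (size_t i = 0; i < l_seq; i++) {
        const char *p = strchr(codes, toupper((unsigned char)seq[i]));
        uint8_t v = (p != NULL && *p != '\0') ? (uint8_t)(p - codes) : 15;
        out[i / 2] |= (i % 2 == 0) ? (uint8_t)(v << 4) : v;
    }
}

/* Convert SAM QUAL text to BAM qual bytes: subtract 33 from each character,
 * or fill with 0xFF when the qualities are omitted ("*"). */
void bam_encode_qual(const char *qual, size_t l_seq, uint8_t *out)
{
    assert(qual != NULL && out != NULL);
    if (strcmp(qual, "*") == 0) {
        memset(out, 0xFF, l_seq);
        return;
    }
    for (size_t i = 0; i < l_seq; i++)
        out[i] = (uint8_t)(qual[i] - 33);
}
```

For example, the sequence ACg packs to the two bytes 0x12 0x40: A=1 and C=2 share the first byte, and G=4 occupies the high nibble of the second with the low nibble written as zero.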
4.2.4 Auxiliary data encoding
Optional alignment fields are stored one after another immediately following the qual field, and are included in block_size. Each field is represented as a two-character tag followed by a single type character and then its value, whose length is determined by the field's type.
Single character 'A' fields have a total length of 4 bytes, with the value represented as a single byte:
    tag  A  char
While all single (i.e., non-array) integer types are stored in SAM as 'i', in BAM any of 'cCsSiI' may be used together with the correspondingly-sized binary integer value, chosen according to the field value's magnitude. Similarly floating point 'f' fields are represented as IEEE binary32 values. Thus BAM numeric fields have a total length of 4, 5, or 7 bytes:
    tag  c  int8_t
    tag  C  uint8_t
    tag  s  int16_t
    tag  S  uint16_t
    tag  i  int32_t
    tag  I  uint32_t
    tag  f  float
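One possible way of choosing the integer type code from a value's magnitude is sketched below. The policy shown (unsigned types for non-negative values, the narrowest type that fits) is just one valid choice, since any sufficiently wide type may be used; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pick a BAM integer type code for v, preferring the narrowest type.
 * Non-negative values use the unsigned codes in this sketch. */
char bam_int_type(int64_t v)
{
    assert(v >= INT32_MIN && v <= UINT32_MAX); /* must fit one of the six types */
    if (v >= 0) {
        if (v <= UINT8_MAX) return 'C';
        if (v <= UINT16_MAX) return 'S';
        return 'I';
    }
    if (v >= INT8_MIN) return 'c';
    if (v >= INT16_MIN) return 's';
    return 'i';
}

/* Total encoded length of a numeric field: 2-byte tag + 1 type byte + value.
 * Returns 0 for a type code that is not a fixed-size numeric type. */
int bam_numeric_field_len(char type)
{
    switch (type) {
    case 'c': case 'C': return 4;
    case 's': case 'S': return 5;
    case 'i': case 'I': case 'f': return 7;
    default: return 0;
    }
}
```

With this policy the value 255 is written as type 'C' in 4 bytes, while 256 needs type 'S' and 5 bytes.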
String fields and hex-formatted byte arrays are represented as NUL-terminated text strings:
    tag  Z  char  char  char  ...  NUL
The representation of a 'B' array field starts with a sub-type character similar to the numeric field types above and a count (uint32_t, but limited by memory and block_size) giving the number of elements in the array. The array elements follow, encoded as binary integers or IEEE floats sized according to the sub-type:
5 Indexing BAM
Indexing aims to achieve fast retrieval of alignments overlapping a specified region without scanning all of the alignments. BAM must be sorted by the reference ID and then the leftmost coordinate before indexing.
This section describes the binning scheme underlying coordinate-sorted BAM indices and its implementation in the long-established BAI format. The CSI format documented elsewhere uses a similar binning scheme and can also be used to index BAM.
5.1 Algorithm
5.1.1 Basic binning index
The UCSC binning scheme was suggested by Richard Durbin and Lincoln Stein and is explained in Kent et al. In this scheme, each bin represents a contiguous genomic region which is either fully contained in or non-overlapping with another bin; each alignment is associated with a bin which represents the smallest region containing the entire alignment. The binning scheme is essentially a representation of an R-tree. A distinct bin uniquely corresponds to a distinct internal node in an R-tree. Bin A is a child of Bin B if the region represented by A is contained in B.
To find the alignments that overlap a specified region, we need to get the bins that overlap the region, and then test each alignment in the bins to check overlap. To quickly find alignments associated with a specified bin, we can keep in the index the start file offsets of chunks of alignments which all have the bin. As alignments are sorted by the leftmost coordinates, alignments having the same bin tend to be clustered together on the disk and therefore usually a bin is only associated with a few chunks. Traversing all the alignments having the same bin usually needs a few seek calls. Given the set of bins that overlap the specified region, we can visit alignments in the order of their leftmost coordinates and stop seeking the rest when an alignment falls outside the required region. This strategy saves half of the seek calls on average.
In the BAI format, each bin may span 2^29, 2^26, 2^23, 2^20, 2^17 or 2^14 bp. Bin 0 spans a 512 Mbp region, bins 1-8 span 64 Mbp, 9-72 8 Mbp, 73-584 1 Mbp, 585-4680 128 kbp, and bins 4681-37448 span 16 kbp regions. This implies that this index format does not support reference chromosome sequences longer than 2^29 - 1.
The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as are supported by SAM and BAM.
5.1.2 Reducing small chunks
Around the boundary of two adjacent bins, we may see many small chunks, some having a shorter bin while the rest have a larger bin. To reduce the number of seek calls, we may join two chunks having the same bin if they are close to each other. After this process, a joined chunk will contain alignments with different bins. We need to keep in the index the file offset of the end of each chunk to identify its boundaries.
5.1.3 Combining with linear index
For an alignment starting beyond 64 Mbp, we always need to seek to some chunks in bin 0, which can be avoided by using a linear index. In the linear index, for each tiling 16384 bp window on the reference, we record the smallest file offset of the alignments that overlap with the window. Given a region [rbeg, rend), we only need to visit a chunk whose end file offset is larger than the file offset of the 16 kbp window containing rbeg.
With both binning and linear indices, we can retrieve alignments in most regions with just one seek call.
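The linear-index filter can be sketched as follows. The array ioffset[] stands for the per-window offsets of one reference; clamping to the last window when rbeg lies beyond the indexed range is an assumption made for this sketch, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the recorded file offset of the 16384 bp window containing rbeg
 * (0-based). Chunks ending at or before this offset cannot overlap a query
 * starting at rbeg. If rbeg is past the last indexed window, the last
 * window's offset is used (an assumption of this sketch). */
uint64_t linear_min_offset(const uint64_t *ioffset, size_t n_intv, int64_t rbeg)
{
    assert(rbeg >= 0);
    if (n_intv == 0) return 0;          /* no linear index: cannot exclude anything */
    size_t w = (size_t)(rbeg >> 14);    /* window number: rbeg / 16384 */
    if (w >= n_intv) w = n_intv - 1;
    return ioffset[w];
}

/* A chunk needs visiting only if it ends after the window's offset. */
int chunk_may_overlap(uint64_t chunk_end, uint64_t min_offset)
{
    return chunk_end > min_offset;
}
```

A query would compute min_offset once and then skip every candidate chunk, from any overlapping bin, for which chunk_may_overlap() returns 0.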
5.1.4 A conceptual example
Suppose we have a genome shorter than 144 kbp. We can design a binning scheme which consists of three types of bins: bin 0 spans 0-144 kbp, bins 1, 2 and 3 span 48 kbp each, and bins from 4 to 12 span 16 kbp each:
                                  0 (0-144kbp)
    1 (0-48kbp)                2 (48-96kbp)               3 (96-144kbp)
    4        5        6        7        8        9        10       11       12
An alignment starting at 65 kbp and ending at 67 kbp would have a bin number 8, which is the smallest bin containing the alignment. Similarly, an alignment starting at 51 kbp and ending at 70 kbp would go to bin 2, while an alignment between [40, 49) kbp would go to bin 0. Suppose we want to find all the alignments overlapping region [65, 71) kbp. We first calculate that bins 0, 2 and 8 overlap with this region and then traverse the alignments in these bins to find the required alignments. With a binning index alone, we need to visit the alignment at [40, 49) kbp as it belongs to bin 0. But with a linear index, we know that such an alignment stops before 64 kbp and cannot overlap the specified region. A seek call can thus be saved.
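The three-level toy scheme can be sketched in the same style as the BAI functions. For simplicity this sketch takes 1 kbp to be 1000 bp and coordinates in bp; the function names are hypothetical.

```c
#include <assert.h>

/* Bin for an alignment covering [beg, end) in the toy 144 kbp scheme:
 * bins 4-12 span 16 kbp, bins 1-3 span 48 kbp, bin 0 spans everything. */
int toy_reg2bin(int beg, int end)
{
    assert(0 <= beg && beg < end && end <= 144000);
    --end; /* make the end coordinate inclusive */
    if (beg / 16000 == end / 16000) return 4 + beg / 16000;
    if (beg / 48000 == end / 48000) return 1 + beg / 48000;
    return 0;
}

/* List the bins that may overlap [beg, end); returns the number written.
 * At most 13 bins exist, so list[13] always suffices. */
int toy_reg2bins(int beg, int end, int list[13])
{
    int i = 0, k;
    assert(0 <= beg && beg < end && end <= 144000);
    --end;
    list[i++] = 0; /* bin 0 overlaps every region */
    for (k = 1 + beg / 48000; k <= 1 + end / 48000; ++k) list[i++] = k;
    for (k = 4 + beg / 16000; k <= 4 + end / 16000; ++k) list[i++] = k;
    return i;
}
```

With these definitions the alignments at 65-67 kbp, 51-70 kbp and 40-49 kbp fall in bins 8, 2 and 0 respectively, and a query for 65-71 kbp lists bins 0, 2 and 8.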
5.2 The BAI index format for BAM files
Field            Description                                                 Type       Value
magic            Magic string                                                char[4]    BAI\1
n_ref            # reference sequences                                       uint32_t
List of indices (n = n_ref)
  n_bin          # distinct bins (for the binning index)                     uint32_t
  List of distinct bins (n = n_bin)
    bin          Distinct bin                                                uint32_t
    n_chunk      # chunks                                                    uint32_t   limited
    List of chunks (n = n_chunk)
      chunk_beg  (Virtual) file offset of the start of the chunk             uint64_t
      chunk_end  (Virtual) file offset of the end of the chunk               uint64_t
  n_intv         # 16kbp intervals (for the linear index)                    uint32_t
  List of intervals (n = n_intv)
    ioffset      (Virtual) file offset of the first alignment in the interval  uint64_t
n_no_coor (optional)  Number of unplaced unmapped reads (RNAME *)            uint64_t
The index file may optionally contain additional metadata providing a summary of the number of mapped and placed unmapped read-segments per reference sequence, and of any unplaced unmapped read-segments. This is stored in an optional extra metadata pseudo-bin for each reference sequence, and in the optional trailing n_no_coor field at the end of the file.
The pseudo-bins appear in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range) and are laid out so as to be compatible with real bins and their chunks:
Field        Description                                                          Type       Value
bin          Magic bin number                                                     uint32_t   37450
n_chunk      # chunks                                                             uint32_t   2
ref_beg      (Virtual) file offset of the start of reads placed on this reference uint64_t
ref_end      (Virtual) file offset of the end of reads placed on this reference   uint64_t
n_mapped     Number of mapped read-segments for this reference                    uint64_t
n_unmapped   Number of unmapped read-segments for this reference                  uint64_t
The ref_beg/ref_end fields locate the first and last reads on this reference sequence, whether they are mapped or placed unmapped. Thus they are equal to the minimum chunk_beg and maximum chunk_end respectively.
5.3 C source code for computing bin number and overlapping bins
The following functions compute bin numbers and overlaps for a BAI-style binning scheme with 6 levels and a minimum bin size of 2^14. See the CSI specification for generalisations of these functions designed for binning schemes with arbitrary depth and sizes.
When these functions are called with regions representing unplaced unmapped reads, e.g., reg2bin(-1, 0), they involve operations such as (-1 >> 14) which are undefined or implementation-defined in some programming languages. They must be implemented as if these operations use the common two's-complement semantics: reg2bin(-1, 0) = 4680 and reg2bins(-1, 0, list) returns 6 with list set to {0, 0, 8, 72, 584, 4680}.
#include <stdint.h>

/* calculate bin given an alignment covering [beg,end) (zero-based, half-closed-half-open) */
int reg2bin(int beg, int end)
{
    --end; /* make the end coordinate inclusive */
    /* try the smallest (16 kbp) bins first, then each coarser level in turn */
    if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
    if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
    if (beg>>20 == end>>20) return ((1<<9)-1)/7 + (beg>>20);
    if (beg>>23 == end>>23) return ((1<<6)-1)/7 + (beg>>23);
    if (beg>>26 == end>>26) return ((1<<3)-1)/7 + (beg>>26);
    return 0; /* only bin 0 contains the whole region */
}

/* calculate the list of bins that may overlap with region [beg,end) (zero-based) */
#define MAX_BIN (((1<<18)-1)/7)
int reg2bins(int beg, int end, uint16_t list[MAX_BIN])
{
    int i = 0, k;
    --end;
    list[i++] = 0; /* bin 0 overlaps every region */
    /* 1, 9, 73, 585 and 4681 are the first bin numbers of levels 1 to 5 */
    for (k = 1 + (beg>>26); k <= 1 + (end>>26); ++k) list[i++] = k;
    for (k = 9 + (beg>>23); k <= 9 + (end>>23); ++k) list[i++] = k;
    for (k = 73 + (beg>>20); k <= 73 + (end>>20); ++k) list[i++] = k;
    for (k = 585 + (beg>>17); k <= 585 + (end>>17); ++k) list[i++] = k;
    for (k = 4681 + (beg>>14); k <= 4681 + (end>>14); ++k) list[i++] = k;
    return i; /* number of bins written to list */
}
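Where the behaviour of >> on negative values cannot be relied upon, the two's-complement requirement can be met with an explicit arithmetic shift. The following is a sketch of one way to do this; the names asr and reg2bin_portable are hypothetical, and the bin arithmetic is otherwise the same as in reg2bin() above.

```c
#include <assert.h>

/* Arithmetic right shift with two's-complement (floor) semantics.
 * For negative x, ~x is non-negative, so shifting it is well defined. */
int asr(int x, int n)
{
    assert(n >= 0 && n < 31);
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Same computation as reg2bin(), using asr() for every shift so that
 * regions with beg = -1 give the required result on any C implementation. */
int reg2bin_portable(int beg, int end)
{
    --end;
    if (asr(beg, 14) == asr(end, 14)) return ((1 << 15) - 1) / 7 + asr(beg, 14);
    if (asr(beg, 17) == asr(end, 17)) return ((1 << 12) - 1) / 7 + asr(beg, 17);
    if (asr(beg, 20) == asr(end, 20)) return ((1 << 9) - 1) / 7 + asr(beg, 20);
    if (asr(beg, 23) == asr(end, 23)) return ((1 << 6) - 1) / 7 + asr(beg, 23);
    if (asr(beg, 26) == asr(end, 26)) return ((1 << 3) - 1) / 7 + asr(beg, 26);
    return 0;
}
```

For the region [-1, 0) every shift yields -1, so the first test matches and the result is 4681 - 1 = 4680, as required for unplaced unmapped reads.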
Appendix A Parsing region notation
Parsing region notation such as name[:begin[-end]] (in which omission of the outer bracketed portion indicates a request for the entire reference sequence) would be simple if name could not itself contain ':' characters, but this is not the case. (No such notation with an optional interval appears in the SAM format itself, but various tools use this notation as a convenient way for their users to specify regions of interest.)
The set of valid reference sequence names is usually already known when parsing this notation, for example because the associated @SQ headers have already been encountered. Tools can use this set to determine unambiguously which colons could delimit a known-valid reference sequence name.
In pseudocode form, a string str can be parsed as follows:
consider the rightmost ' ' character, if any, of str 考虑 str 中最右边的" "字符(如果有)。
if is of the form 'prefix:NUM' or 'prefix:NUM-NUM' 如果 的形式为 "前缀:NUM "或 "前缀:NUM-NUM
or generally 'prefix: suffix' for some plausible interval suffix 或用 "前缀:后缀 "来表示某个可信的区间后缀
then 则
if both prefix and str are in the known set then ...error: ambiguous representation 如果前缀和 str 都在已知集合中,则...错误:表示不明确
else if prefix is in the known set then return (prefix, NUM. . . NUM) 否则,如果前缀在已知集合中,则返回 (前缀, NUM... NUM)
else if is in the known set then return (str, entire sequence) else 如果 在已知集合中,则返回 (str,整个序列)
else ...error: unknown reference sequence name 否则...错误:未知参考序列名称
else ...either str does not contain a colon or the suffix is not plausibly numeric 否则......要么 str 中不包含冒号,要么后缀不可能是数字
if is in the known set then return (str, entire sequence) 如果 在已知集合中,则返回 (str,整个序列)
else ...error: unknown reference sequence name or invalid interval syntax 否则...错误:参考序列名称未知或区间语法无效
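The pseudocode can be sketched in C as follows. This is an illustration under stated assumptions rather than a reference implementation: the only interval suffixes treated as plausible are NUM and NUM-NUM, the known set is a plain array of strings, numeric overflow is not checked, and every name used here (parse_region, struct region, the status codes) is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum region_status { REGION_OK, REGION_AMBIGUOUS, REGION_UNKNOWN };

struct region {
    const char *name;   /* points into the parsed string, not NUL-terminated */
    size_t name_len;
    long begin, end;    /* 0 means "not given"; both 0 means entire sequence */
};

/* True if the first len characters of s equal one of the n known names. */
static bool in_known_set(const char *s, size_t len,
                         const char *const *known, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strlen(known[i]) == len && memcmp(known[i], s, len) == 0)
            return true;
    return false;
}

/* Accepts only NUM or NUM-NUM; anything else is "not plausibly numeric". */
static bool parse_interval(const char *s, long *begin, long *end)
{
    char *p;
    if (!isdigit((unsigned char)s[0])) return false;
    *begin = strtol(s, &p, 10);
    *end = 0;
    if (*p == '\0') return true;
    if (*p != '-' || !isdigit((unsigned char)p[1])) return false;
    *end = strtol(p + 1, &p, 10);
    return *p == '\0';
}

enum region_status parse_region(const char *str, const char *const *known,
                                size_t n, struct region *out)
{
    size_t len = strlen(str);
    bool str_known = in_known_set(str, len, known, n);
    const char *colon = strrchr(str, ':');   /* rightmost ':' if any */
    long begin = 0, end = 0;

    if (colon != NULL && parse_interval(colon + 1, &begin, &end)) {
        size_t prefix_len = (size_t)(colon - str);
        bool prefix_known = in_known_set(str, prefix_len, known, n);
        if (prefix_known && str_known) return REGION_AMBIGUOUS;
        if (prefix_known) {
            *out = (struct region){ str, prefix_len, begin, end };
            return REGION_OK;
        }
    }
    /* No usable interval: the whole string must itself be a known name. */
    if (str_known) {
        *out = (struct region){ str, len, 0, 0 };
        return REGION_OK;
    }
    return REGION_UNKNOWN;
}
```

With a known set containing "chr1" and "HLA-A*01:01", the string "HLA-A*01:01" resolves to the entire sequence of that name, while "HLA-A*01:01:5-9" resolves to the same name with the interval 5 to 9. This sketch folds the two error cases of the final branch into the single REGION_UNKNOWN status.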
The check leading to "error: ambiguous representation" is important as it prevents confusing interpretations of actually ambiguous input. Typically the set of valid reference sequence names will not contain names that are prefixes of other names in the set, so in practice this error will not usually be encountered in non-malicious data.
Either in addition to this algorithm or as an alternative to it, tools can use additional delimiter characters to make an unambiguously parsable notation. We recommend a convention using curly brackets around the reference sequence name, {name}[:begin[-end]], as being memorable, easily typed, unambiguous, and not expanded by most shells.
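A minimal sketch of parsing the curly-bracket convention is given below. It assumes that reference sequence names cannot themselves contain curly brackets (our reading of the rules in Section 1.2.1), so the first '}' ends the name. The function name parse_braced_region and its calling convention are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Splits "{name}", "{name}:begin" or "{name}:begin-end".
 * On success returns 0 and sets *name / *name_len to the text between the
 * brackets; *begin and *end are 0 when not given.
 * Returns -1 if str does not follow the curly-bracket form. */
int parse_braced_region(const char *str, const char **name, size_t *name_len,
                        long *begin, long *end)
{
    const char *close;
    char *p;

    if (str[0] != '{') return -1;
    close = strchr(str, '}');
    if (close == NULL || close == str + 1) return -1;  /* missing or empty name */

    *name = str + 1;
    *name_len = (size_t)(close - str - 1);
    *begin = *end = 0;

    if (close[1] == '\0') return 0;                    /* entire sequence */
    if (close[1] != ':' || !isdigit((unsigned char)close[2])) return -1;
    *begin = strtol(close + 2, &p, 10);
    if (*p == '\0') return 0;                          /* begin only */
    if (*p != '-' || !isdigit((unsigned char)p[1])) return -1;
    *end = strtol(p + 1, &p, 10);
    return *p == '\0' ? 0 : -1;
}
```

Because the brackets delimit the name explicitly, no set of known names is needed: "{HLA-A*01:01}:5-9" always yields the name HLA-A*01:01 and the interval 5 to 9.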
Appendix B SAM Version History
This lists the date of each tagged SAM version along with changes that have been made while that version was current. The key changes that caused the version number to change are shown in bold.
Additions and changes to the standard predefined tags are listed in the separate Sequence Alignment/Map Optional Fields Specification.
1.6: 28 November 2017 to current
Add SINGULAR to the list of @RG PL header tag values. (May 2023)
Clarify that @RG PI values are integers. (May 2023)
Add ELEMENT and ULTIMA to the list of @RG PL header tag values. (Aug 2022)
Clarify that header field tags must be distinct within each line, and that the ordering of both header fields and alignment optional fields is not significant. (Jun 2021)
Clarify the meaning of TLEN when secondary alignments are present. (May 2021)
Bin calculation changed for alignment records whose CIGAR strings consume no reference bases: like unmapped records, they are considered to have length one (rather than zero). (Jan 2021)
Correct the description of index pseudo-bins, which previously stated that ref_beg/ref_end, then named unmapped_beg/unmapped_end, include only placed unmapped reads. (Jul 2020)
Add DNBSEQ to the list of @RG PL header tag values. (Apr 2020)
Restricted the allowable punctuation characters in reference sequence names (in @SQ SN, RNAME, etc). The sets of characters allowed in @SQ SN and @SQ AN are now identical, which enlarges the previous AN set. (Jan 2019)
We recommend that implementations validating reference sequence names do so using the rules in Section 1.2.1; are more lenient for files declaring an earlier version via @HD VN; and validate AN only against these rules, not the previous more restrictive AN rules.
Add @HD SS sorting details header tag. (Oct 2018)
B array optional fields may have no entries: this was already representable in BAM, and it is now clarified that empty arrays are permitted in SAM too. (Jul 2018)
Add @RG BC header tag. (Apr 2018)
Permit UTF-8 in a few header tags. (Mar 2018)
Add support for CIGAR strings with more than 65,535 operations. (Nov 2017)
1.5: 23 May 2013 to November 2017
Add @SQ AN header tag, allowing only alphanumeric and '*+.@_|-' characters in its names. (Jul 2017)
Removal of FLAG letters. (July 2010)
The SM header field, previously mandatory for @RG, is now optional. (July 2010)
1.0: 2009 to July 2010
Initial edition.
Hence in particular SAM files must not begin with a byte order mark (BOM) and lines of text are delimited by ASCII line terminator characters only. In addition to the local platform's text file line termination conventions, implementations may wish to support LF and CR LF for interoperability with other platforms.
The values in the FLAG column correspond to bitwise flags as follows: 99 = 0x63: first/next is reverse-complemented/properly aligned/multiple segments; 0: no flags set, thus a mapped single segment; 2064 = 0x810: supplementary/reverse-complemented; 147 = 0x93: last (second of a pair)/reverse-complemented/properly aligned/multiple segments.
Chimeric alignments are primarily caused by structural variations, gene fusions, misassemblies, RNA-seq or experimental protocols. They are more frequent given longer reads. For a chimeric alignment, the linear alignments constituting the alignment are largely non-overlapping; each linear alignment may have high mapping quality and is informative in SNP/INDEL calling. In contrast, multiple mappings are caused primarily by repeats. They are less frequent given longer reads. If a read has multiple mappings, all these mappings are almost entirely overlapping with each other; except the single-best optimal mapping, all the other mappings get low mapping quality and are ignored by most SNP/INDEL callers.
Characters that are not disallowed include '|', which historically appeared in reference names derived from NCBI FASTA files, and ':', which appears in HLA allele names. Appendix A describes approaches for parsing name[:begin-end] region notation unambiguously even though name may itself contain colons.
Best practice is to use lowercase tags while designing and experimenting with new data field tags or for fields of local interest only. For new tags that are of general interest, raise an hts-specs issue or email samtools-devel@lists.sourceforge.net to have an uppercase equivalent added to the specification. This way collisions of the same uppercase tag being used with different meanings can be avoided.
It is known that widely used software libraries have differing definitions of the queryname sort order, meaning care should be taken when operating on multiple files of varying provenance. Tools may wish to use the sub-sort field to explicitly distinguish between natural and lexicographical ordering. See Section 1.3.1.
The repetition of sort-order enables a limited form of validation. For example, @HD SO:queryname SS:coordinate:TLEN would indicate that the data has been re-sorted (by query name) by a non-SS-aware tool and the SS field should be ignored.
See https://www.ncbi.nlm.nih.gov/grc/help/definitions for descriptions of alternate locus and primary assembly.
For example, given '@SQ SN:MT AN:chrMT,M,chrM LN:16569 TP:circular', tools can ensure that a user's request for any of 'MT', 'chrMT', 'M', or 'chrM' succeeds and refers to the same sequence.
The previous footnote's example identifies MT as a circular chromosome. The TP field is often omitted, which implies linear.
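The alias lookup described in the @SQ example above might be sketched as follows, assuming the SN and AN values have already been extracted from the header line as C strings. The function name sq_name_matches is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `query` equals the SN value or any comma-separated entry of the
 * AN value. `an` may be NULL when the @SQ line has no AN field. */
bool sq_name_matches(const char *query, const char *sn, const char *an)
{
    size_t qlen = strlen(query);

    if (strcmp(query, sn) == 0) return true;
    while (an != NULL && *an != '\0') {
        size_t len = strcspn(an, ",");       /* length of the next alias */
        if (len == qlen && memcmp(an, query, len) == 0) return true;
        an += len;
        if (*an == ',') an++;                /* skip the separator */
    }
    return false;
}
```

For the example line, sq_name_matches("chrM", "MT", "chrMT,M,chrM") is true, whereas a query of "chr" matches nothing because whole aliases are compared rather than prefixes.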
Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with '*' or '='. See Section 1.2.1 for details and an explanation of the [:rname:] notation.
The manipulation of bitwise flags is described at Wikipedia (see "Bit field") and elsewhere.
For example, in Illumina paired-end sequencing, first (0x40) corresponds to the R1 'forward' read and last (0x80) to the R2 'reverse' read. (Despite the terminology, this is unrelated to the segments' orientations when they are mapped: either, neither, or both may have their reverse flag bits (0x10) set after mapping.)
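As a small illustration of such flag manipulation, the sketch below tests the three bits mentioned in the note above. The bit values are those of the FLAG table in Section 1.4; the constant and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    FLAG_REVERSE = 0x10,   /* SEQ is reverse complemented */
    FLAG_FIRST   = 0x40,   /* first segment in the template */
    FLAG_LAST    = 0x80    /* last segment in the template */
};

/* A bit is set when masking it out of FLAG leaves a non-zero value. */
bool is_first(uint16_t flag)   { return (flag & FLAG_FIRST) != 0; }
bool is_last(uint16_t flag)    { return (flag & FLAG_LAST) != 0; }
bool is_reverse(uint16_t flag) { return (flag & FLAG_REVERSE) != 0; }
```

For FLAG 99 (0x63) only is_first is true, while for FLAG 147 (0x93) both is_last and is_reverse are true, showing that read order and mapped orientation are independent bits.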
Thus a segment aligning in the forward direction at base 100 for length 50 and a segment aligning in the reverse direction at base 200 for length 50 indicate the template covers bases 100 to 249 and has length 150.
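The arithmetic in this example can be written out as a short sketch. It computes the unsigned length from the leftmost to the rightmost mapped base, taking each segment's 1-based leftmost position and the number of reference bases it covers; the function name template_length is hypothetical and the sign of TLEN is not handled here.

```c
/* Unsigned observed template length for two segments on the same reference.
 * pos1/pos2 are 1-based leftmost mapped positions; len1/len2 are the numbers
 * of reference bases each alignment covers. */
long template_length(long pos1, long len1, long pos2, long len2)
{
    long end1 = pos1 + len1 - 1;              /* last base of segment 1 */
    long end2 = pos2 + len2 - 1;              /* last base of segment 2 */
    long leftmost  = pos1 < pos2 ? pos1 : pos2;
    long rightmost = end1 > end2 ? end1 : end2;
    return rightmost - leftmost + 1;          /* inclusive span */
}
```

For the segments above, the second segment ends at 200 + 50 - 1 = 249, so template_length(100, 50, 200, 50) gives 249 - 100 + 1 = 150.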
The earliest versions of this specification used a different measure (taken in original orientation, TLEN#1; dashed parts of the reads indicate soft-clipped bases) while later ones used leftmost to rightmost mapped base (TLEN#2). Note: these two definitions agree in most alignments, but differ in the case of overlaps where the first segment aligns beyond the start of the last segment.
[Figure panels: Unambiguous scenario; Ambiguous scenario]
The number of digits in an integer optional field is not explicitly limited in SAM. However, BAM can represent values in the range [-2^31, 2^32), so in practice this is the realistic range of values for SAM's 'i' as well.
For example, the six-character Hex string '1AE301' represents the byte array [0x1a, 0xe3, 0x1].
Explicit typing eases format parsing and helps to reduce the file size when SAM is converted to BAM.
See SAMtags.pdf at https://github.com/samtools/hts-specs.
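The Hex string example in the notes above can be reproduced with a short decoding sketch. The names hex_value and hex_to_bytes are hypothetical; as an added leniency that is our assumption rather than part of the format, lowercase digits are accepted as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decodes pairs of hex digits into out[0..cap). Returns the number of bytes
 * written, or -1 on an odd-length string, a non-hex digit, or lack of space. */
long hex_to_bytes(const char *hex, uint8_t *out, size_t cap)
{
    size_t n = 0;

    while (hex[0] != '\0') {
        int hi = hex_value(hex[0]);
        int lo = hi < 0 ? -1 : hex_value(hex[1]);  /* a NUL here means odd length */
        if (hi < 0 || lo < 0 || n == cap) return -1;
        out[n++] = (uint8_t)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}
```

Decoding "1AE301" this way writes the three bytes 0x1a, 0xe3 and 0x01, matching the example.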
The impact of this representation on indexing and random access is yet to be explored by implementations.
Peter J. A. Cock, James K. Bonfield, Bastien Chevreux, and Heng Li, SAM/BAM format v1.5 extensions for de novo assemblies, bioRxiv 020024; doi:10.1101/020024.
Writing pads/gaps as '*'s in the SEQ field might have been more convenient, but this caused concerns for backward compatibility.
See Annotation and Padding in SAMtags.pdf.
L. Peter Deutsch, GZIP file format specification version 4.3, RFC 1952.
It is worth noting that there is a known bug in the Java GZIPInputStream class: concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug.
Empty in the sense of having been formed by compressing a data block of length zero.
An implementation that supports reopening a BAM file in append mode could produce a file by writing headers and alignment records to it, closing it (adding an EOF marker); then reopening it for append, writing more alignment records, and closing it (adding an EOF marker). The resulting BAM file would contain an embedded insignificant EOF marker block that should be effectively ignored when it is read.
It is useful to produce a diagnostic at the beginning of reading a file, so that interactive users can abort lengthy analysis of potentially-corrupted files. Of course, this is only possible if the stream in question supports random access.
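Such an up-front diagnostic might be sketched as below for a seekable stream. The 28 marker bytes are reproduced from our reading of the BGZF end-of-file marker section of this specification and should be checked against it; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The 28-byte empty BGZF block used as the end-of-file marker. */
static const unsigned char BGZF_EOF[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* True if the last 28 bytes of data[0..len) are the EOF marker. */
bool has_bgzf_eof_tail(const unsigned char *data, size_t len)
{
    return len >= sizeof BGZF_EOF &&
           memcmp(data + len - sizeof BGZF_EOF, BGZF_EOF, sizeof BGZF_EOF) == 0;
}

/* Checks the end of a seekable stream, then restores the read position.
 * Returns false for truncated files and for streams that cannot seek. */
bool file_has_bgzf_eof(FILE *fp)
{
    unsigned char tail[sizeof BGZF_EOF];
    long saved = ftell(fp);
    bool ok = false;

    if (saved < 0) return false;                      /* not seekable */
    if (fseek(fp, -(long)sizeof BGZF_EOF, SEEK_END) == 0 &&
        fread(tail, 1, sizeof tail, fp) == sizeof tail)
        ok = has_bgzf_eof_tail(tail, sizeof tail);
    fseek(fp, saved, SEEK_SET);
    return ok;
}
```

A reader could call file_has_bgzf_eof immediately after opening a file and warn the user when it returns false. A false result on a pipe or socket says nothing about truncation, since those streams cannot be inspected this way.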
As noted in Section 1.4, reserved FLAG bits should be written as zero and ignored on reading by current software.
For backward compatibility, an absent QNAME (represented as '*' in SAM) is stored as a C string "*\0".
The signedness and size used for each integer value is an implementation choice, but is typically the smallest that suffices.
The BAM representation of 'H' field values as textual hexadecimal digits rather than binary data is for historical reasons. Modern applications may prefer to use 'B,C' array fields rather than 'H' fields.
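One way to pick "the smallest that suffices" for an integer value is sketched below, using the BAM integer type codes c, C, s, S, i and I. Preferring the unsigned code for non-negative values is an assumption of this sketch, not a requirement; the function name smallest_bam_int_type is hypothetical.

```c
#include <stdint.h>

/* Returns the BAM type code of the narrowest integer type that can hold v,
 * or '\0' if v lies outside [-2^31, 2^32) and so cannot be stored. */
char smallest_bam_int_type(int64_t v)
{
    if (v < 0) {                              /* negative values need signed types */
        if (v >= INT8_MIN)  return 'c';
        if (v >= INT16_MIN) return 's';
        if (v >= INT32_MIN) return 'i';
        return '\0';
    }
    if (v <= UINT8_MAX)  return 'C';          /* non-negative: try unsigned types */
    if (v <= UINT16_MAX) return 'S';
    if (v <= UINT32_MAX) return 'I';
    return '\0';
}
```

Under this choice the value 255 is stored as 'C' in one byte, whereas -129 needs the two-byte 's'.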
See CSIv1.pdf at https://github.com/samtools/hts-specs. This is a separate specification because CSI is also used to index other coordinate-sorted file formats in addition to BAM.
W. James Kent et al., The Human Genome Browser at UCSC, Genome Res. 2002 12: 996-1006; doi:10.1101/gr.229102; PMID:12045153. See in particular The Database, p1003.
The number of chunks in a single bin is effectively limited by available memory and in any case is typically a maximum of some thousands.
By placed unmapped read we mean a read that is unmapped according to its FLAG but whose RNAME and POS fields are filled in, thus "placing" it on a reference sequence (see Section 2). In contrast, unplaced unmapped reads have '*' and 0 for RNAME and POS.
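The distinction between placed and unplaced unmapped reads can be expressed as a small classification sketch. It assumes the unmapped bit is 0x4, as in the FLAG table of Section 1.4, and treats a read as placed only when both RNAME and POS are filled in; the names used are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum read_placement { READ_MAPPED, READ_PLACED_UNMAPPED, READ_UNPLACED_UNMAPPED };

/* Classifies a record from its FLAG, RNAME and 1-based POS fields. */
enum read_placement classify_read(uint16_t flag, const char *rname, long pos)
{
    if ((flag & 0x4) == 0)
        return READ_MAPPED;                   /* unmapped bit not set */
    if (strcmp(rname, "*") != 0 && pos != 0)
        return READ_PLACED_UNMAPPED;          /* RNAME and POS filled in */
    return READ_UNPLACED_UNMAPPED;            /* '*' and 0 */
}
```

A record with FLAG 69, RNAME chr1 and POS 100 is therefore a placed unmapped read, whereas FLAG 4 with RNAME '*' and POS 0 is unplaced.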