SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAM file may optionally specify the version being used via the @HD VN tag. For full version history see Appendix B.
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8. Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.
Where it makes a difference, SAM file contents should be read and written using the POSIX / C locale. For example, floating-point values in SAM always use '.' for the decimal-point character.
The regular expressions in this specification are written using the POSIX / IEEE Std 1003.1 extended syntax.
1.1 An example
Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment.
1.2 Terminologies and Concepts
Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
Segment: A contiguous sequence or subsequence.
Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e., one portion of the alignment on forward strand and another portion of alignment on reverse strand). A linear alignment can be represented in a single SAM record.
Chimeric alignment: An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the "representative" alignment, and the others are called "supplementary" and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.
Read alignment: A linear alignment or a chimeric alignment that is the complete representation of the alignment of the read.
Multiple mapping: The correct placement of a read may be ambiguous, e.g., due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.
1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is $[3, 7]$. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is $[2, 7)$. The BAM, BCFv2, BED, and PSL formats use the 0-based coordinate system.
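The two conventions differ only in the start coordinate: a 1-based closed interval and the 0-based half-open interval for the same bases share the same end value. The following is a minimal Python sketch of the conversion; the function names are hypothetical and not part of any SAM library.

```python
from typing import Tuple


def one_based_to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start-1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start+1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# The 3rd to 7th bases inclusive, in both conventions.
print(one_based_to_zero_based(3, 7))
print(zero_based_to_one_based(2, 7))
```

In both conventions the number of bases covered is the same: end - start + 1 for the closed form and end - start for the half-open form.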
Phred scale: Given a probability $0 < p \le 1$, the phred scale of $p$ equals $-10 \log_{10} p$, rounded to the closest integer.
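As an illustration, the Phred scale can be sketched in Python as follows, assuming the usual definition of minus ten times the base-10 logarithm of the probability. The function names are hypothetical. Python's built-in round() resolves exact .5 ties to the nearest even integer, which is one possible reading of "closest integer".

```python
import math


def phred(p: float) -> int:
    """Phred-scale a probability p with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must satisfy 0 < p <= 1")
    return round(-10.0 * math.log10(p))


def probability(q: int) -> float:
    """Inverse mapping: the probability corresponding to a Phred value q."""
    return 10.0 ** (-q / 10.0)


print(phred(0.001), phred(0.5), phred(1.0))
```

Because of the rounding step, probability(phred(p)) only approximates p.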
1.2.1 Character set restrictions
Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF. To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
Query or read names may contain any printable ASCII characters in the range [!-~] apart from '@', so that SAM alignment lines can be easily distinguished from header lines. (They are also limited in length.)
Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets (i.e., apart from '\ , " ` ' ( ) [ ] { } < >'), and may not start with '*' or '='.
Thus they match the following regular expression:
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
For clarity, elsewhere in this specification we write this set of allowed characters as a character class [:rname:] and extend the POSIX regular expression notation to use ^*= to indicate the omission of '*' and '=' from the character class. Thus this regular expression can be written more clearly as [:rname:^*=][:rname:]*.
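The reference-name rule can be checked mechanically. The sketch below is an assumption-laden illustration: the character classes are derived by taking the printable range [!-~], removing the disallowed delimiter characters, and additionally excluding '*' and '=' from the first position. The names RNAME_RE and is_valid_rname are hypothetical.

```python
import re

# Allowed characters: printable ASCII minus \ , " ` ' ( ) [ ] { } < >
# The first character additionally excludes '*' and '='.
RNAME_FIRST = r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"
RNAME_REST = r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]"
RNAME_RE = re.compile(RNAME_FIRST + RNAME_REST + "*")


def is_valid_rname(name: str) -> bool:
    """Return True if the whole string is an acceptable reference sequence name."""
    return RNAME_RE.fullmatch(name) is not None


for candidate in ["chr1", "HLA-A*01:01:01:01", "*", "=chr1", "chr1,chr2"]:
    print(candidate, is_valid_rname(candidate))
```

Using fullmatch rather than match ensures that a disallowed character anywhere in the name causes rejection.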
1.3 The header section
Each header line begins with the character '@' followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format 'TAG:VALUE' where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.
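A header line can be split into its record type and TAG:VALUE fields with a few lines of Python. This is a sketch, not a reference parser: the pattern below is one reading of the TAG:VALUE grammar (a letter, then a letter or digit, a colon, and a printable value), the function name is hypothetical, and the "text" key used for @CO lines is an invented convenience.

```python
import re
from typing import Dict, Tuple

HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Return (record type, {TAG: VALUE}) for one header line."""
    line = line.rstrip("\n")
    if COMMENT_RE.match(line):
        # @CO lines are free text after the first TAB.
        return "CO", {"text": line[4:]}
    if not HEADER_RE.match(line):
        raise ValueError("not a well-formed header line: %r" % line)
    record_type, *fields = line.split("\t")
    tags: Dict[str, str] = {}
    for field in fields:
        tag, value = field[:2], field[3:]
        if tag in tags:
            raise ValueError("tag %s appears more than once" % tag)
        tags[tag] = value
    return record_type[1:], tags


print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
```

The duplicate-tag check reflects the rule that no field tag may appear more than once within a line; field order is not checked because it is not significant.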
The following table describes the header record types that may be used and their predefined tags. Tags listed with '*' are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.
Tag     Description
@HD     File-level metadata. Optional. If present, there must be only one @HD line and it must be the first line of the file.
  VN*   Format version. Accepted format: /^[0-9]+\.[0-9]+$/.
sciences), SINGULAR, SOLID, and ULTIMA. This field should be omitted when the technology is
not in this list (though the PM field may still be present in this case) or is unknown.
  PM    Platform model. Free-form text providing further details of the platform/technology used.
  PU    Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.
  SM    Sample. Use pool name where a pool is being sequenced.
@PG     Program.
  ID*   Program record identifier. Each @PG line must have a unique ID. The value of ID is used in the alignment PG tag and PP tags of other @PG lines. PG IDs may be modified when merging SAM files in order to handle collisions.
  PN    Program name
  CL    Command line. UTF-8 encoding may be used.
  PP    Previous @PG-ID. Must match another @PG header's ID tag. @PG records may be chained using PP tag, with the last record in the chain having no PP tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e., the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. The PG ID on a SAM record is not required to refer to the newest PG record in a chain. It may refer to any PG record in a chain, implying that the SAM record has been operated on by the program in that PG record, and the program(s) referred to via the PP tag.
  DS    Description. UTF-8 encoding may be used.
  VN    Program version
@CO     One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be used.
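The PP chaining described for @PG records can be followed with a simple loop. The sketch below assumes the @PG lines have already been parsed into a dictionary keyed by ID; that data layout, the function name, and the program IDs in the example are all hypothetical.

```python
from typing import Dict, List


def program_chain(pg_records: Dict[str, Dict[str, str]], pg_id: str) -> List[str]:
    """Follow PP links from pg_id, returning IDs from most recent to oldest program."""
    chain: List[str] = []
    seen = set()
    current = pg_id
    while current is not None:
        if current in seen:
            raise ValueError("PP tags form a cycle at %r" % current)
        if current not in pg_records:
            raise KeyError("PP/PG refers to unknown @PG ID %r" % current)
        seen.add(current)
        chain.append(current)
        # The last record in a chain has no PP tag, which ends the loop.
        current = pg_records[current].get("PP")
    return chain


records = {
    "aligner": {"PN": "aligner"},
    "sorter": {"PN": "sorter", "PP": "aligner"},
    "marker": {"PN": "marker", "PP": "sorter"},
}
print(program_chain(records, "marker"))
```

Starting from an ID in the middle of the chain yields only that program and the ones before it, which matches the statement that a record's PG tag need not refer to the newest @PG record.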
1.3.1 Defined sub-sort terms
While the SS sub-sort field allows implementation-defined keywords, some terms are predefined with specific meanings.
lexicographical sort order is defined as a character-based dictionary sort with the character order as defined by the POSIX C locale. For example "abc", "abc17", "abc5", "abc59" and "abcd" are in lexicographical order.
natural sort order is similar to lexicographical order except that runs of adjacent digits are considered to be numbers embedded within the text string, ordered numerically when compared to each other and ordered as single digits when compared to the surrounding non-digit characters. Runs that differ only in the number of leading zeros (thus are numerically tied) are ordered by more-zeros coming before fewer-zeros. The characters '-' and '.' are considered as ordinary characters, so apparently negative or fractional values are not treated as part of an embedded number. For example, "abc", "abc+5", "abc-5", "abc.d", "abc03", "abc5", "abc008", "abc08", "abc8", "abc17", "abc17.+", "abc17.2", "abc17.d", "abc59" and "abcd" are in natural order.
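The natural ordering can be approximated with a sort key. The following Python sketch is one interpretation rather than a reference implementation: each run of ASCII digits becomes a single token that sorts in the digit range against non-digit characters, numerically against other runs, and with longer (more zero-padded) runs first on numeric ties. It applies the leading-zero tie-break at the first run where the strings differ, which is an assumption. The name natural_key is hypothetical.

```python
import re
from typing import List, Tuple

_TOKEN_RE = re.compile(r"[0-9]+|[^0-9]")


def natural_key(text: str) -> List[Tuple[int, int, int]]:
    """Build a sort key implementing the natural order described above."""
    key = []
    for token in _TOKEN_RE.findall(text):
        if "0" <= token[0] <= "9":
            # Digit run: sorts among digits, then by value, then more zeros first.
            key.append((ord("0"), int(token), -len(token)))
        else:
            # Ordinary character: sorts by its code point, as in the C locale.
            key.append((ord(token), 0, 0))
    return key


names = ["abc17", "abc8", "abc08", "abc008", "abc5", "abcd", "abc"]
print(sorted(names, key=natural_key))
print(sorted(names))  # plain lexicographical order, for comparison
```

For ASCII text, Python's default string comparison is by code point, so sorted() without a key gives the lexicographical order of the POSIX C locale.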
umi is a lexicographical sort by the UMI tag. The MI tag should be used for comparing UMIs. The RX tag may be used in its absence but is not guaranteed to be unique across multiple libraries.
1.3.2 Reference MD5 calculation
The M5 tag on @SQ lines allows reference sequences to be uniquely identified through the MD5 digest of the sequence itself. As the digest is based on the sequence and nothing else, it can help resolve ambiguities with reference naming. For example, it allows a quick way of checking that references named '1', 'Chr1' and 'chr1' in different files are in fact the same.
The reference sequence must be in the 7-bit US-ASCII character set. All valid reference bases can be represented in this set, and it avoids the problem of determining exactly which 8-bit representation may have been used. Padding characters (see Section 3.2) must be represented only using the '*' character.
The digest is calculated as follows:
1. All characters outside of the inclusive range 33 ('!') to 126 ('~') are stripped out. This removes all unprintable and whitespace characters including spaces and new lines. Everything else is retained, even if not a legal nucleotide code.
2. All lowercase characters are converted to uppercase. This operation is equivalent to calling toupper() on characters in the POSIX locale.
3. The MD5 digest is calculated as described in RFC 1321 and presented as a 32 character lowercase hexadecimal number.
As an example, if the reference contains the following characters (including spaces):
ACGT ACGT ACGT
acgt acgt acgt
... 12345 !!!
then the digest is that of the string ACGTACGTACGTACGTACGTACGT...12345!!! and the resulting tag would be M5:dfabdbb36e239a6da88957841f32b8e4.
In padded SAM files, the padding bases should be inserted into the reference as '*' characters. Taking the example in Section 3.2, the padded version of the reference is
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
and the corresponding tag is M5:caad65b937c4bc0b33c08f62a9fb5411.
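The digest procedure can be sketched with Python's standard hashlib module. The function names are hypothetical. The retained range is taken to be code points 33 to 126 inclusive, i.e. the printable non-space ASCII characters.

```python
import hashlib


def normalize_reference(sequence: str) -> str:
    """Drop characters outside 33..126 and convert the rest to uppercase."""
    kept = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126)
    # Only ASCII remains at this point, so upper() behaves like C toupper().
    return kept.upper()


def reference_md5(sequence: str) -> str:
    """Return the 32-character lowercase hexadecimal MD5 digest for an M5 tag."""
    return hashlib.md5(normalize_reference(sequence).encode("ascii")).hexdigest()


example = "ACGT ACGT ACGT\nacgt acgt acgt\n... 12345 !!!\n"
print(normalize_reference(example))
print("M5:" + reference_md5(example))
```

If the steps are implemented as described, running this on the example text should reproduce the M5 value quoted above; padded references need no special handling because '*' lies inside the retained range.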
1.4 The alignment section: mandatory fields
In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line consists of 11 or more TAB-separated fields. The first eleven fields are always present and in the order shown below; if the information represented by any of these fields is unavailable, that field's value will be a placeholder, either '0' or '*' as determined by the field's type. The following table gives an overview of these mandatory fields in the SAM format:
Col  Field  Type    Regexp/Range                  Brief description
1    QNAME  String  [!-?A-~]{1,254}               Query template NAME
2    FLAG   Int     $[0, 2^{16}-1]$               bitwise FLAG
3    RNAME  String  \*|[:rname:^*=][:rname:]*     Reference sequence NAME
4    POS    Int     $[0, 2^{31}-1]$               1-based leftmost mapping POSition
5    MAPQ   Int     $[0, 2^{8}-1]$                MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+       CIGAR string
7    RNEXT  String  \*|=|[:rname:^*=][:rname:]*   Reference name of the mate/next read
8    PNEXT  Int     $[0, 2^{31}-1]$               Position of the mate/next read
9    TLEN   Int     $[-2^{31}+1, 2^{31}-1]$       observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+                segment SEQuence
11   QUAL   String  [!-~]+                        ASCII of Phred-scaled base QUALity+33
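Since every alignment line has the same eleven leading TAB-separated columns, it can be split into a typed record directly. The sketch below uses a hypothetical Alignment tuple and function name; the sample line and its NM:i:1 optional field are illustrative only, and no range or regular-expression validation is attempted.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]


def parse_alignment_line(line: str) -> Alignment:
    """Split one alignment line into its 11 mandatory fields plus optional fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 TAB-separated fields, got %d" % len(cols))
    return Alignment(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], cols[11:],
    )


line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_alignment_line(line)
print(record.qname, record.flag, record.pos, record.cigar)
```

Integer fields are converted eagerly so that malformed numeric columns fail at parse time rather than later in an analysis.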
All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.
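The transformation applied to a reverse-strand segment can be sketched as follows. The function name is hypothetical, and the complement table covers the standard IUPAC nucleotide codes as an assumption; characters not in the table (such as N, or the '*' placeholder) pass through unchanged.

```python
# Complement pairs for IUPAC nucleotide codes; S, W and N are their own complements.
_COMPLEMENT = str.maketrans(
    "ACGTRYKMBDHVacgtrykmbdhv",
    "TGCAYRMKVHDBtgcayrmkvhdb",
)


def reverse_strand_record(seq: str, qual: str):
    """Return (SEQ, QUAL) as they would be stored for a reverse-strand mapping."""
    # SEQ is complemented and reversed; QUAL is only reversed so that each
    # quality value stays attached to its base.
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


print(reverse_strand_record("AACG", "!#%&"))
```

Applying the function twice returns the original sequence and qualities, which is a convenient sanity check when restoring reads to their as-sequenced orientation.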
1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME '*' indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
2. FLAG: Combination of bitwise FLAGs. Each bit is explained in the following table:
Bit            Description
1     (0x1)    template having multiple segments in sequencing
2     (0x2)    each segment properly aligned according to the aligner
4     (0x4)    segment unmapped
8     (0x8)    next segment in the template unmapped
16    (0x10)   SEQ being reverse complemented
32    (0x20)   SEQ of the next segment in the template being reverse complemented
64    (0x40)   the first segment in the template
128   (0x80)   the last segment in the template
256   (0x100)  secondary alignment
512   (0x200)  not passing filters, such as platform/vendor quality controls
1024  (0x400)  PCR or optical duplicate
2048  (0x800)  supplementary alignment
For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies 'FLAG & 0x900 == 0'. This line is called the primary line of the read.
Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called a supplementary line.
Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.
Bit 0x10 indicates whether SEQ has been reverse complemented and QUAL reversed. When bit 0x4 is unset, this corresponds to the strand to which the segment has been mapped: bit 0x10 unset indicates the forward strand, while set indicates the reverse strand. When 0x4 is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine.
Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used. If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or when this information is lost during data processing.
If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.
Bits that are not listed in the table are reserved for future use. They should not be set when writing and should be ignored on reading by current software.
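The FLAG bits can be decoded with ordinary bitwise operations. In the sketch below the bit values come from the FLAG table, while the short names are hypothetical labels chosen for readability. The primary-line test is taken to mean that neither the secondary (0x100) nor the supplementary (0x800) bit is set.

```python
from typing import List

# (bit value, hypothetical short label)
FLAG_NAMES = [
    (0x1, "multiple_segments"),
    (0x2, "properly_aligned"),
    (0x4, "unmapped"),
    (0x8, "next_unmapped"),
    (0x10, "reverse"),
    (0x20, "next_reverse"),
    (0x40, "first_segment"),
    (0x80, "last_segment"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """List the labels of all defined bits set in flag; reserved bits are ignored."""
    return [name for bit, name in FLAG_NAMES if flag & bit]


def is_primary(flag: int) -> bool:
    """True for a primary line: neither secondary nor supplementary."""
    # Parentheses are required: in Python, == binds more tightly than &.
    return (flag & 0x900) == 0


print(decode_flag(99), is_primary(99))
print(decode_flag(2064), is_primary(2064))
```

Ignoring unknown bits on reading, as decode_flag does, follows the rule that reserved bits should be ignored by current software.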
3. RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not '*') must be present in one of the SQ-SN tags. An unmapped segment without coordinate has a '*' at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is '*', no assumptions can be made about POS and CIGAR.
4. POS: 1-based leftmost mapping POSition of the first CIGAR operation that "consumes" a reference base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.
5. MAPQ: MAPping Quality. It equals $-10 \log_{10} \Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
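A small sketch of converting between MAPQ and the probability that the mapping position is wrong, assuming MAPQ is the Phred-scaled value of that probability. The function names are hypothetical, and capping computed values at 254 so that they never collide with the "not available" value 255 is a choice made for this example, not something the text above requires.

```python
import math
from typing import Optional

MAPQ_UNAVAILABLE = 255


def mapq_from_error(p_wrong: float) -> int:
    """Phred-scale the probability that the mapping position is wrong."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("probability must satisfy 0 < p <= 1")
    # Cap at 254 (an assumption of this sketch) because 255 means "unavailable".
    return min(254, round(-10.0 * math.log10(p_wrong)))


def mapq_to_error(mapq: int) -> Optional[float]:
    """Return the error probability implied by MAPQ, or None if unavailable."""
    if mapq == MAPQ_UNAVAILABLE:
        return None
    return 10.0 ** (-mapq / 10.0)


print(mapq_from_error(0.001), mapq_to_error(20), mapq_to_error(255))
```

Treating 255 as a sentinel rather than a very high quality avoids silently trusting alignments whose quality was never computed.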
6. CIGAR: CIGAR string. The CIGAR operations are given in the following table (set '*' if unavailable):
Op  BAM  Description                                            Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)  yes             yes
I   1    insertion to the reference                             yes             no
D   2    deletion from the reference                            no              yes
N   3    skipped region from the reference                      no              yes
S   4    soft clipping (clipped sequences present in SEQ)       yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)   no              no
P   6    padding (silent deletion from padded reference)        no              no
=   7    sequence match                                         yes             yes
X   8    sequence mismatch                                      yes             yes
"Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
H can only be present as the first and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
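The CIGAR rules above lend themselves to a short worked example. The following Python sketch parses a CIGAR string, sums the operations that consume the query and the reference according to the table, and checks the placement rules for H and S. All function names are hypothetical, and the CIGAR strings are illustrative.

```python
import re
from typing import List, Tuple

_CIGAR_FULL = re.compile(r"([0-9]+[MIDNSHP=X])+")
_CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Return [(length, op), ...]; '*' (unavailable) yields an empty list."""
    if cigar == "*":
        return []
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def query_length(ops: List[Tuple[int, str]]) -> int:
    """Sum of operation lengths that consume the query; should equal len(SEQ)."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def reference_length(ops: List[Tuple[int, str]]) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


def clipping_is_valid(ops: List[Tuple[int, str]]) -> bool:
    """H only at the ends; S only at the ends once any H is set aside."""
    core = [op for _, op in ops]
    if core and core[0] == "H":
        core = core[1:]
    if core and core[-1] == "H":
        core = core[:-1]
    if "H" in core:
        return False
    return "S" not in core[1:-1]


ops = parse_cigar("8M2I4M1D3M")
print(query_length(ops), reference_length(ops))
```

With POS known, the last reference base covered by the alignment is POS + reference_length(ops) - 1 in 1-based coordinates.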
7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the next read is the first read in the template. If @SQ header lines are present, RNEXT (if not '*' or '=') must be present in one of the SQ-SN tags. This field is set as '*' when the information is unavailable, and set as '=' if RNEXT is identical to RNAME. If not '=' and the next read in the template has one primary mapping (see also bit 0x100 in FLAG), this field is identical to RNAME at the primary line of the next read. If RNEXT is '*', no assumptions can be made on PNEXT and bit 0x20.
8. PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
9. TLEN: signed observed Template LENgth. For primary reads where the primary alignments of all reads in the template are mapped to the same reference sequence, the absolute value of TLEN equals the distance between the mapped end of the template and the mapped start of the template, inclusively (i.e., end - start + 1). Note that mapped base is defined to be one that aligns to the reference as described by CIGAR, hence excludes soft-clipped bases. The TLEN field is positive for the leftmost segment of the template, negative for the rightmost, and the sign for any middle segment is undefined. If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing signs. It is set as 0 for a single-segment template or when the information is unavailable (e.g., when the first or last segment of a multi-segment template is unmapped or when the two are mapped to different reference sequences).
The intention of this field is to indicate where the other end of the template has been aligned without needing to read the remainder of the SAM file. Unfortunately there has been no clear consensus on the definitions of the template mapped start and end. Thus the exact definitions are implementation-defined.
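Because the exact definition is implementation-defined, the following Python sketch shows only one possible convention for a two-segment template: the template start is the smaller POS, the template end is the larger last mapped reference base, and the leftmost segment receives the positive sign. The function name and argument layout are hypothetical, and the caller is assumed to have already checked that both segments are mapped to the same reference (otherwise TLEN would be 0).

```python
from typing import Tuple


def template_lengths(pos1: int, ref_len1: int, pos2: int, ref_len2: int) -> Tuple[int, int]:
    """Return (TLEN of segment 1, TLEN of segment 2) under the convention above.

    pos* are 1-based leftmost positions; ref_len* are the numbers of reference
    bases consumed by each segment's CIGAR (soft-clipped bases excluded).
    """
    end1 = pos1 + ref_len1 - 1
    end2 = pos2 + ref_len2 - 1
    length = max(end1, end2) - min(pos1, pos2) + 1
    # On a tie the choice of leftmost segment is arbitrary; segment 1 is used here.
    if pos1 <= pos2:
        return length, -length
    return -length, length


# Segment 1 at POS 7 spanning 16 reference bases, segment 2 at POS 37 spanning 9.
print(template_lengths(7, 16, 37, 9))
```

Other tools may measure the template differently, for example from the 5' ends of the two reads, so values produced this way should not be expected to match every aligner.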
10. SEQ: segment SEQuence. This field can be a '*' when the sequence is not stored. If not a '*', the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR. An '=' denotes the base is identical to the reference base. No assumptions can be made on the letter cases.