SeqAn3 3.4.0
The Modern C++ library for sequence analysis.
Provides files and formats for handling read mapping data.
| Classes | | |
|---|---|---|
| class | seqan3::format_bam | The BAM format. |
| class | seqan3::format_sam | The SAM format (tag). |
| struct | seqan3::ref_info_not_given | Type tag which indicates that no reference information has been passed to the SAM file on construction. |
| class | seqan3::sam_file_header< ref_ids_type > | Stores the header information of SAM/BAM files. |
| class | seqan3::sam_file_input< traits_type_, selected_field_ids_, valid_formats_ > | A class for reading SAM files, both SAM and its binary representation BAM are supported. |
| struct | seqan3::sam_file_input_default_traits< ref_sequences_t, ref_ids_t > | The default traits for seqan3::sam_file_input. |
| interface | seqan3::sam_file_input_format< t > | The generic concept for alignment file input formats. |
| struct | seqan3::sam_file_input_options< sequence_legal_alphabet > | The options type defines various option members that influence the behaviour of all or some formats. |
| interface | sam_file_input_traits | The requirements a traits_type for seqan3::sam_file_input must meet. |
| class | seqan3::sam_file_output< selected_field_ids_, valid_formats_, ref_ids_type > | A class for writing SAM files, both SAM and its binary representation BAM are supported. |
| interface | seqan3::sam_file_output_format< t > | The generic concept for alignment file output formats. |
| struct | seqan3::sam_file_output_options | The options type defines various option members that influence the behaviour of all or some formats. |
| struct | seqan3::sam_file_program_info_t | Stores information of the program/tool that was used to create a SAM/BAM file. |
| struct | seqan3::sam_flag_printer< sam_flag > | A sam_flag can be printed as an integer value. |
| class | seqan3::sam_record< field_types, field_ids > | The record type of seqan3::sam_file_input. |
| class | seqan3::sam_tag_dictionary | The SAM tag dictionary class that stores all optional SAM fields. |
| struct | seqan3::sam_tag_type< tag_value > | The generic base class. |
| Enumerations | | |
|---|---|---|
| enum class | seqan3::sam_flag : uint16_t { seqan3::sam_flag::none = 0, seqan3::sam_flag::paired = 0x1, seqan3::sam_flag::proper_pair = 0x2, seqan3::sam_flag::unmapped = 0x4, seqan3::sam_flag::mate_unmapped = 0x8, seqan3::sam_flag::on_reverse_strand = 0x10, seqan3::sam_flag::mate_on_reverse_strand = 0x20, seqan3::sam_flag::first_in_pair = 0x40, seqan3::sam_flag::second_in_pair = 0x80, seqan3::sam_flag::secondary_alignment = 0x100, seqan3::sam_flag::failed_filter = 0x200, seqan3::sam_flag::duplicate = 0x400, seqan3::sam_flag::supplementary_alignment = 0x800 } | An enum flag that describes the properties of an aligned read (given as a SAM record). |
| Other literals | | |
|---|---|---|
| template<small_string< 2 > str> constexpr uint16_t | seqan3::literals::operator""_tag () | The SAM tag literal, such that tags can be used in constant expressions. |
| template<small_string< 2 > str> constexpr uint16_t | operator""_tag () | The SAM tag literal, such that tags can be used in constant expressions. |
SAM/BAM files are primarily used to store pairwise alignments of read mapping data.
The SAM file abstraction supports reading 10 different fields:
There exists one more field for SAM files, the seqan3::field::header_ptr, but this field is mostly used internally. Please see the seqan3::sam_file_output::header member function for details on how to access the seqan3::sam_file_header of the file.
All of these fields are retrieved by default (and in that order).
The seqan3::sam_file_input class comes with four constructors: one for construction from a file name and one for construction from an existing stream and a known format, each with or without additional reference information. Constructing from a file name automatically picks the format based on the extension of the file name. Constructing from a stream can be used if you have a non-file stream, like std::cin or std::istringstream. It also comes in handy if you cannot use file-extension based detection but know that your input file has a certain format.
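To make the idea of extension-based format selection concrete, here is a minimal standalone sketch using only the standard library. It is not the library's implementation: `detected_format` and `detect_format` are hypothetical names, and the assumption that the relevant extensions are exactly `.sam` and `.bam` is made for illustration only.

```cpp
#include <filesystem>
#include <string>

// Hypothetical result type; not part of SeqAn3.
enum class detected_format
{
    sam,
    bam,
    unknown
};

// Picks a format from the file extension alone (assumed extensions: ".sam" and ".bam").
// The file does not need to exist; only the name is inspected.
detected_format detect_format(std::filesystem::path const & file_name)
{
    std::string const extension = file_name.extension().string();

    if (extension == ".sam")
        return detected_format::sam;
    if (extension == ".bam")
        return detected_format::bam;

    return detected_format::unknown; // a stream constructor with an explicit format would be needed here
}
```

A name such as `reads.txt` falls into the `unknown` branch, which corresponds to the situation described above where the format must be stated explicitly.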
Passing reference information, e.g.
comes in handy once you want to convert the CIGAR string, read from your file, into an actual alignment. This will be covered in the section "Transforming the CIGAR information into an actual alignment".
In most cases the template parameters are deduced automatically:
Reading from a std::istringstream:
Note that this is not the same as writing sam_file_input<> (with angle brackets). In the latter case they are explicitly set to their default values, in the former case automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file, sam_file_input<> would have also worked, but for opening from stream it would not have.
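The difference between omitting the angle brackets and writing empty angle brackets is a general property of C++ class template argument deduction. The following sketch shows it on a tiny class template; `holder` is a hypothetical type invented for this illustration and has nothing to do with the SeqAn3 classes.

```cpp
#include <type_traits>

// Hypothetical class template with a defaulted template parameter.
template <typename value_t = int>
struct holder
{
    value_t value;

    explicit holder(value_t v) : value{v}
    {}
};

// Without angle brackets the parameter is deduced from the constructor argument.
static_assert(std::is_same_v<decltype(holder{'x'}), holder<char>>);

// With empty angle brackets the default (int) is used, regardless of the argument type.
static_assert(std::is_same_v<decltype(holder<>{'x'}), holder<int>>);
```

In the same way, explicitly writing `<>` on a file class fixes all template parameters to their defaults, which may not match the arguments that are actually passed.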
You can define your own traits type to further customise the types used by and returned by this class, see seqan3::sam_file_input_default_traits for more details. As mentioned above, specifying at least one template parameter yourself means that you lose automatic deduction. The following is equivalent to the automatic type deduction example with a stream from above:
You can iterate over this file record-wise:
In the above example, rec has the type seqan3::sam_file_input::record_type which is a specialisation of seqan3::record and behaves like a std::tuple (that's why we can access it via get). Instead of using the seqan3::field based interface on the record, you could also use std::get<0> or even std::get<dna4_vector> to retrieve the sequence, but it is not recommended, because it is more error-prone.
If you want to skip specific fields from the record you can pass a non-empty fields trait object to the seqan3::sam_file_input constructor to select the fields that should be read from the input. For example, you may only be interested in the mapping flag and mapping quality of your SAM data to get some statistics. The following snippets demonstrate the usage of such a fields trait object.
When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids parameter) are ignored and the respective value in the record stays empty.
Instead of using get on the record, you can also use structured bindings to decompose the record into its elements. Considering the example of reading only the flag and mapping quality like before you can also write:
In this case you immediately get the two elements of the tuple: flag of seqan3::sam_file_input::flag_type and mapq of seqan3::sam_file_input::mapq_type.
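As a standalone illustration of decomposing a two-element record with structured bindings, the sketch below uses a plain `std::tuple` in place of the library's record type. The alias `flag_and_mapq` and the function `count_confident` are hypothetical, and the element types are assumptions; the bit value `0x4` for "unmapped" is taken from the enumeration on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a record that only holds flag and mapping quality.
using flag_and_mapq = std::tuple<std::uint16_t, std::uint8_t>;

// Counts records that are mapped (bit 0x4 not set) and reach a minimum mapping quality.
std::size_t count_confident(std::vector<flag_and_mapq> const & records, std::uint8_t min_mapq)
{
    std::size_t count = 0;

    for (auto const & [flag, mapq] : records) // structured bindings name both elements directly
    {
        bool const is_mapped = (flag & 0x4) == 0;
        if (is_mapped && mapq >= min_mapq)
            ++count;
    }

    return count;
}
```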
In SeqAn, we represent an alignment as a tuple of two seqan3::aligned_sequences.
The conversion from a CIGAR string to an alignment can be done with the function seqan3::alignment_from_cigar. You need to pass the reference sequence with the position the read was aligned to and the read sequence. All of it is already in the record when reading a SAM file:
The code will print the following:
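Independently of the library function, the idea behind the conversion can be sketched with plain strings: the CIGAR operations are walked from left to right, and gap characters are inserted into the reference or the read as needed. This is a simplified illustration, not the behaviour of seqan3::alignment_from_cigar; `expand_cigar` is a hypothetical helper, the reference position is assumed to be 0-based, and the input is assumed to be a valid CIGAR string that fits both sequences.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Returns {gapped reference, gapped read} for the aligned part of the read.
std::pair<std::string, std::string> expand_cigar(std::string_view reference,
                                                 std::size_t reference_position,
                                                 std::string_view read,
                                                 std::string_view cigar)
{
    std::string aligned_reference;
    std::string aligned_read;
    std::size_t ref_pos = reference_position;
    std::size_t read_pos = 0;
    std::size_t count = 0;

    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + static_cast<std::size_t>(c - '0'); // accumulate the operation length
            continue;
        }

        switch (c)
        {
        case 'M': case '=': case 'X': // (mis)match: consumes reference and read
            aligned_reference.append(reference.substr(ref_pos, count));
            aligned_read.append(read.substr(read_pos, count));
            ref_pos += count;
            read_pos += count;
            break;
        case 'I': // insertion: gap in the reference
            aligned_reference.append(count, '-');
            aligned_read.append(read.substr(read_pos, count));
            read_pos += count;
            break;
        case 'D': case 'N': // deletion or skipped region: gap in the read
            aligned_reference.append(reference.substr(ref_pos, count));
            aligned_read.append(count, '-');
            ref_pos += count;
            break;
        case 'S': // soft clipping: read characters that are not part of the alignment
            read_pos += count;
            break;
        default: // 'H' and 'P' consume neither sequence in this sketch
            break;
        }

        count = 0;
    }

    return {aligned_reference, aligned_read};
}
```

For example, `expand_cigar("AAGTCCGT", 2, "TGTACCT", "1S2M1I2M1D1M")` yields the pair `"GT-CCGT"` and `"GTACC-T"`.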
Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:
You can check whether a file is at its end by comparing begin() and end() (if they are the same, the file is at its end).
We currently support reading the following formats:
The seqan3::sam_file_output class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cout or std::ostringstream, that you want to write to and/or if you cannot use file-extension based detection but know that your output file has a certain format.
In most cases the template parameters are deduced completely automatically:
Writing to std::cout:
Note that this is not the same as writing sam_file_output<> (with angle brackets). In the latter case they are explicitly set to their default values, in the former case automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file, sam_file_output<> would have also worked, but for opening from stream it would not have.
The easiest way to write to a SAM/BAM file is to use the push_back() member functions. These work similarly to how they work on a std::vector. You may also use a tuple-like interface or the emplace_back() function, but this is not recommended since one would have to keep track of the correct order of many fields (14 in total). For the record-based interface using push_back(), please also see the seqan3::record documentation on how to specify a record with the correct field and type lists.
You may also use the output file's iterator for writing, however, this rarely provides an advantage.
If you want to omit non-required parameters or change the order of the parameters, you can pass a non-empty fields trait object to the seqan3::sam_file_output constructor to select the fields that are used for interpreting the arguments.
The following snippet demonstrates the usage of such a fields trait object.
A different way of passing custom fields to the file is to pass a seqan3::record – instead of a tuple – to push_back(). The seqan3::record clearly indicates which of its elements has which seqan3::field so the file will use that information instead of the template argument. This is especially handy when reading from one file and writing to another, because you don't have to configure the output file to match the input file, it will just work:
This will copy the seqan3::field::flag and seqan3::field::ref_offset value into the new output file.
You can write multiple records at once, by assigning to the file:
Record-wise writing in batches also works for writing from input files directly to output files, because input files are also input ranges in SeqAn:
This can be combined with file-based views to create I/O pipelines:
We currently support writing the following formats:
An enum flag that describes the properties of an aligned read (given as a SAM record).
The SAM flags are bitwise flags, which means that each value corresponds to a specific bit that is set and that they can be combined and tested using binary operations. See this tutorial for an introduction on bitwise operations on enum flags.
Example:
The following additional information on some flag values is adapted from the SAM specification:
| Enumerator | |
|---|---|
| none | None of the flags below are set. |
| paired | The aligned read is paired (paired-end sequencing). |
| proper_pair | The two aligned reads in a pair have a proper distance between each other. |
| unmapped | The read is not mapped to a reference (unaligned). |
| mate_unmapped | The mate of this read is not mapped to a reference (unaligned). |
| on_reverse_strand | The read sequence has been reverse complemented before being mapped (aligned). |
| mate_on_reverse_strand | The mate sequence has been reverse complemented before being mapped (aligned). |
| first_in_pair | Indicates the ordering (see details in the seqan3::sam_flag description). |
| second_in_pair | Indicates the ordering (see details in the seqan3::sam_flag description). |
| secondary_alignment | This read alignment is an alternative (possibly suboptimal) to the primary. |
| failed_filter | The read alignment failed a filter, e.g. quality controls. |
| duplicate | The read is marked as a PCR duplicate or optical duplicate. |
| supplementary_alignment | This sequence is part of a split alignment and is not the primary alignment. |
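How such bitwise flags are combined and tested can be sketched with a small standalone enum. The type `flag`, its operators, and the helper `has` are hypothetical stand-ins written for this illustration and are not the library's definitions; only the numeric values are copied from the seqan3::sam_flag enumeration listed on this page.

```cpp
#include <cstdint>

// Hypothetical stand-in for a subset of seqan3::sam_flag.
enum class flag : std::uint16_t
{
    none = 0,
    paired = 0x1,
    proper_pair = 0x2,
    unmapped = 0x4,
    on_reverse_strand = 0x10
};

// Combines two flag values: every bit set in either operand is set in the result.
constexpr flag operator|(flag lhs, flag rhs)
{
    return static_cast<flag>(static_cast<std::uint16_t>(lhs) | static_cast<std::uint16_t>(rhs));
}

// Intersects two flag values: only bits set in both operands remain.
constexpr flag operator&(flag lhs, flag rhs)
{
    return static_cast<flag>(static_cast<std::uint16_t>(lhs) & static_cast<std::uint16_t>(rhs));
}

// Tests whether all bits of `bit` are set in `value`.
constexpr bool has(flag value, flag bit)
{
    return (value & bit) == bit;
}

static_assert(has(flag::paired | flag::unmapped, flag::unmapped));
static_assert(!has(flag::paired, flag::unmapped));
```

Since each enumerator occupies its own bit, a combined value such as `flag::paired | flag::proper_pair` has the integer value 3, which is the form in which flags appear in a SAM file.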
The SAM tag literal, such that tags can be used in constant expressions.
| char_t | The char type. Usually char. Parameter pack ...s must be of length 2 since SAM tags consist of two letters (char0 and char1). |
A SAM tag consists of two letters, initialized via the string literal ""_tag, which delegates to its unique id.
The purpose of those tags is to fill or query the seqan3::sam_tag_dictionary for a specific key (tag_id) and retrieve the corresponding value.
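The mapping from a two-letter tag to a numeric key can be sketched as follows. The text above only states that each tag has a unique id; packing the two characters into the high and low byte of a 16-bit integer is an assumption made for this illustration. `tag_id` and `tag_dictionary` are hypothetical names, and storing values as strings is a simplification.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Packs two characters into one 16-bit id (assumed scheme: first letter in the high byte).
constexpr std::uint16_t tag_id(char char0, char char1)
{
    return static_cast<std::uint16_t>(static_cast<unsigned char>(char0) * 256
                                      + static_cast<unsigned char>(char1));
}

// Hypothetical dictionary keyed by tag id; values are plain strings in this sketch.
using tag_dictionary = std::map<std::uint16_t, std::string>;

// The id is usable in constant expressions, and the letter order matters.
static_assert(tag_id('N', 'M') != tag_id('M', 'N'));
```

With these definitions, `dict[tag_id('N', 'M')] = "3";` fills the dictionary and `dict.at(tag_id('N', 'M'))` retrieves the stored value again.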