Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Need to extract fastq of specific region in bam file

    I have a bam file that aligns a number of reads against a single reference. And I want to extract the fastq of each read in the exact location that those reads overlap a specific position in the reference.

    The reference is ~3kb in length, and I want to extract the sequence from each read where it overlaps the positions 946-1047 in the reference. What is the best way to do that?


    Thanks,
    John

  • #2
    use samtools mpileup to convert from a mapping of each read to a nucleotide by nucleotide report of the reference sequence and quality scores of each basecall at that reference position.

    If you need the output in fastq that would not be the answer. If you want the information of fastq (basecall and quality) in a reference-oriented format, then it would be the answer!
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment


    • #3
      Wouldn't mpileup flatten my data into consensus REF and ALT alleles and just show me counts of the alleles at each position? I need to get the sequence from each read (the query) across the specific region of the reference to which it maps. I thought mpileup would give me a per reference-base VCF output, with alleles just counted.

      Sorry if I'm being dense. Would mpileup let me get the read-based sequence across a specific position defined in terms of the reference that the read is mapped on? At the end I need 1 sequence per read that mapped, but just the piece of that read that mapped to a specific place in the reference.

      Comment


      • #4
        @jmartin: Use the samtools view option posted by Devon in this thread: http://seqanswers.com/forums/showthread.php?t=39766

        You can then use reformat.sh from BBMap to extract the reads from the SAM file. Reads may extend beyond the exact boundries you are setting.

        Code:
        $ reformat.sh in=region.sam out=reads.fq

        Comment


        • #5
          The samtools view option from GenoMax will extract the reads. If you wanted to examine the basecalls across that region on a nucleotide by nucleotide basis, then the mpileup would be helpful, I think.

          mpileup output is this:

          chr1 276 G 22 ...T,,...,,...,,.. hhhhhhhdddhhhhh

          This says at position 276 of chr1 there were 22 reads. Most matched the reference G in forward or reverse read orientation (. or ,) but one T basecall was also present. The last column gives the quality score of the basecalls.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

          Comment


          • #6
            @GenoMax: In my case the issue is specifically that I need to produce, per query read, the read-based sequence that exactly covers a specific region in the reference. I am able to find the subset of reads that map to the region using samtools as you suggest.

            My problem is that I need to only have the sequence within the boundaries that I've set (the read sequence mapping to the exact position of the feature of interest in my reference).

            Comment


            • #7
              @SNPsaurus: Is the 5th column of the output you illustrate (the "...T,,...,,...,,..") showing me in a consistent order, the state of each mapped read at that base in the reference? So if I have 1000 reads mapping onto a base, I would have 1000 characters in that string?

              Assuming the order was consistent, I could use that to reconstruct each read. But I would need to know the id of the read.

              That sounds close to what I need, but I believe that format would foil me when the read depth at a base position changes. The length of that "...T,,...,,...,,.." string would change when the number of reads covering a position change. And thus I couldn't assume that the 40th character in the 5th column string would always be from the same read.

              Comment


              • #8
                Hmm, the mpileup specification has characters for read start and read end. I haven't had to worry about reconstructing the haplotype so I haven't investigated the behavior of how those are used.

                You can cap the number of basecalls at a position, but also have it show all (I've done mpileups with read depths of 30,000 or more when looking for rare variants). You can also show only basecalls of a certain quality score but you wouldn't want to exclude any.

                I would hope that without filtering, every start read symbol in a particular column would have an end read symbol in the same column (after accounting for these special characters) 100 bp away (or whatever the read length).

                This still wouldn't get you back to the read ID, though.
                Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                Comment


                • #9
                  Originally posted by jmartin View Post
                  @GenoMax: In my case the issue is specifically that I need to produce, per query read, the read-based sequence that exactly covers a specific region in the reference. I am able to find the subset of reads that map to the region using samtools as you suggest.

                  My problem is that I need to only have the sequence within the boundaries that I've set (the read sequence mapping to the exact position of the feature of interest in my reference).
                  If that is a hard requirement then that would almost certainly require custom scripting to cut the reads (extending outside the interval) down to the exact interval, if you want to retain fastq format. If you only need the sequence then perhaps doing a multiple sequence alignment with extracted reads (using fasta format) may suffice to get the exact region.

                  Comment


                  • #10
                    @GenoMax: I'm afraid that a multiple alignment including all my reads that pile up over the feature + my reference will stretch and pull the individual read sequences to make them fit the new alignment. Thats a problem for me because my end goal is to look at the state of 106 variant loci within each read along the place they align to the feature in the reference. I am afraid that a multiple alignment would deform the alignments I have currently, which have been highly refined using GATK's indel realigner.

                    Comment


                    • #11
                      I don't know a way of keeping the realignment intact in addition to respecting those boundaries while doing read extraction. Perhaps someone else (who does this sort of thing regularly) will have other ideas.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-25-2024, 11:49 AM
                      0 responses
                      19 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-24-2024, 08:47 AM
                      0 responses
                      18 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      62 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      60 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X