Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extracting unmapped reads from BAM using Samtools?

    Hi, I am trying to extract unmapped reads from my BWA sorted BAM file (paired end reads);

    First I filter for the following things using these command lines which seems to work fine:
    1) An unmapped read whose mate is mapped.
    samtools view -u -f 4 -F264 alignments.bam > temp1.bam
    2) A mapped read who's mate is unmapped
    samtools view -u -f 8 -F 260 alignments.bam > temp2.bam
    3) Both reads of the pair are unmapped
    samtools view -u -f 12 -F 256 alignments.bam > temp3.bam

    Then I try to merge the files and sort it so it's ordered by read name using the following command
    samtools merge -u - temp[123].bam | samtools sort -n - unmapped

    BUT I get this error message:
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] EOF marker is absent. The input is probably truncated.


    My next step would be to convert the BAM to fastq but I can't seem to get past this error despite trying out various other options. I get the same message after just trying to samtools merge comman on its own?

    I also tried the bam2fastq script from Hudson alpha but it only seemed to output reads where both in the pair had not mapped, but I want to extract all unmapped reads?

    Any help MUCH appreciated!

  • #2
    Originally posted by Lspoor View Post
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    you get this warning when you read an uncompressed BAM from hard drive. i dont think this is the error.

    you should sort before merge.

    also you might filter the same read into different files. im not sure if this doesnt cause problems if you merge them later on.

    Comment


    • #3
      OK Thanks, maybe I have duplicated reads somewhere? I'll try sorting the files before merging them and see if I get anywhere.

      Comment


      • #4
        Thanks I got that to work, now I have a BAM file containing an alignment made from single (mixed) reads (whose corresponding pair was not of good enough quality to include in the analysis), so they are a mixture of reads sequenced in each direction ie Some have headers ending in 1: @EBRI093151_0019_FC:1:2:2016:3554#CGATGN/1 and some have headers ending in 2:@EBRI093151_0019_FC:1:2:2042:1391#CGATCT/2.

        These reads were mapped using BWA to produce a BAM fie.

        I am trying to filter the unmapped reads out of this file using the following command line:
        samtools view -uf 4 alignments.bam | bamToFastq -bam stdin -fq1 unmapped.fastq -fq2 empty.fastq

        (This is using Hydra-sv bamtoFastq command). But it only outputs the reads ending in 2.

        I'm guessing this command can't handle a mixed input of reads (which are not paired, but are similarly not single end!). Do you know if there is any way I can adapt this, or to filter the input BAM file to separate the reads perhaps?

        Comment


        • #5
          Hi Lspoor,

          I'm wondering whether the command below can extract all reads unmapped from a bam file.

          samtools view -u -f 4 alignments.bam>tmp.bam

          Comment


          • #6
            Originally posted by volks View Post
            you get this warning when you read an uncompressed BAM from hard drive. i dont think this is the error.
            Agreed. This warning is a false positive due to a bug in samtools -u output, there is a one line fix but it hasn't been applied yet:
            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

            Comment


            • #7
              Originally posted by Delphine Song View Post
              Hi Lspoor,

              I'm wondering whether the command below can extract all reads unmapped from a bam file.

              samtools view -u -f 4 alignments.bam>tmp.bam
              in most cases you should go for:
              samtools view -b -f 4 alignments.bam > tmp.bam

              Options: -b output BAM
              -u uncompressed BAM output (force -b)

              Comment


              • #8
                Why make 3 tmp files

                I've recently been asked to do the same thing that Lspoor has posted about. Why do we need 3 tmp files? Wouldn't using -f 4 retrieve all unmapped lines?

                My guess would be that this keeps all the reads in some kind of order, but if you are going to merge and sort them in the end what is the point?

                Comment


                • #9
                  Originally posted by cwisch88 View Post
                  I've recently been asked to do the same thing that Lspoor has posted about. Why do we need 3 tmp files? Wouldn't using -f 4 retrieve all unmapped lines?
                  Yes. The OP also wanted mapped reads whose mates did not map, that's what step 2 does.

                  Comment


                  • #10
                    Just to clarify...

                    Thanks swbarnes!

                    Using these bits allows for lots of specificity on what you want to retrieve from your alignments, it's just a little cumbersome to someone who hasn't strayed into this portion of the samtools.

                    Thanks for confirming that -f 4 just retrieves the unmapped reads.

                    Comment


                    • #11
                      -f 4 will get every read with the 4 flagged. That usually means unmapped. There is one exception that I know of. If your read hangs off the end of one of your reference sequences, bwa will map it at the end of that reference, and give it the 4 flag, as a sign that something is off. (This confuses Picard Tools, and you have to either fix the .bam, or tell Picard tools to be lenient about that problem)

                      Comment


                      • #12
                        Originally posted by swbarnes2 View Post
                        -f 4 will get every read with the 4 flagged. That usually means unmapped. There is one exception that I know of. If your read hangs off the end of one of your reference sequences, bwa will map it at the end of that reference, and give it the 4 flag, as a sign that something is off. (This confuses Picard Tools, and you have to either fix the .bam, or tell Picard tools to be lenient about that problem)
                        That isn't an exception. FLAG bit 0x4 isn't a sign that something is off, it is the authoritative indicator that a read is unmapped.

                        What you're describing in BWA is the fact that the implementation actually searches against all your references concatenated together - and then checks if you've got an unlucky read at the boundary. In this case it isn't a real mapping, so FLAG bit 0x4 is set (unmapped). Confusingly (but allowed within the specification) BWA leaves the potential mapping position in place.

                        Comment


                        • #13
                          Originally posted by swbarnes2 View Post
                          -f 4 will get every read with the 4 flagged. That usually means unmapped. There is one exception that I know of. If your read hangs off the end of one of your reference sequences, bwa will map it at the end of that reference, and give it the 4 flag, as a sign that something is off. (This confuses Picard Tools, and you have to either fix the .bam, or tell Picard tools to be lenient about that problem)
                          Could you please further specify how to fix the .bam file? I donot want to use the lenient stringency in Picard, because it causes my further problem when calling SNPs in GATK.

                          Comment


                          • #14
                            hi,
                            very thanks for that ,now I have BAM files and how can I extracting exon reads from BAM using Samtools?
                            any ideas will be awesome!


                            Thanks,
                            huili

                            Comment


                            • #15
                              lhlbio,

                              If you have the annotations for your reference you can collect them all in the: 'chr1:start-end' format and then run this command:

                              samtools view aln.sorted.bam chr2:20,100,000-20,200,000 [region2 [...]]

                              If that doesn't work because there are too many arguments; I'd do each one separately in a shell script and then 'samtools merge' the results.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Exploring the Dynamics of the Tumor Microenvironment
                                by seqadmin




                                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                07-08-2024, 03:19 PM
                              • seqadmin
                                Exploring Human Diversity Through Large-Scale Omics
                                by seqadmin


                                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                06-25-2024, 06:43 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 07-16-2024, 05:49 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-15-2024, 06:53 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-10-2024, 07:30 AM
                              0 responses
                              40 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 09:45 AM
                              0 responses
                              205 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X