Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • splaisan
    senior molecular biologist
    • Jun 2009
    • 32

    keep read address using tophat

    Maybe overlooking something but ...
    when I use tophat with paired reads having a name as
    @SRR479052.1 HWI-ST188:1:1101:1222:2140/1
    I end up with the second part clipped and a arbitrary read number put instead in the resulting bam as SRR479052.5415964 which does not support marking optical duplicates with picard
    Code:
    READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
    Q: Can I preserve the full read address when using tophat with some magic argument? or should I parse both fastQ and bam to reconstitute this missing info?

    Thanks for help
    Stephane
    http://www.bits.vib.be/index.php
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    That's because the second part isn't part of the read name. There's an option in fastq-dump to put the original read name where it should be rather than just numbering things sequentially.

    Comment

    • splaisan
      senior molecular biologist
      • Jun 2009
      • 32

      #3
      Problem is I downloaded the fastq pre-made from the EBI repo and mapped them all :-( without figuring this out. I can fix this by patching the fatsQ but will still need to remap the whole shebang...

      Thanks for the info anyway (for next time)
      http://www.bits.vib.be/index.php

      Comment

      • splaisan
        senior molecular biologist
        • Jun 2009
        • 32

        #4
        picard markDuplicate compatible reads from SRA data

        few days later, the issue is fixed by:
        • NOT downloading the fastq files from SRA but instead the .sra formatted data using Aspera (I used the browser link)
        • Use the sratoolkit command fastq-dump (thanks Devon) to convert .sra to .fastq and split reads in paired files. The trick was here to use the specific parameter -F|--origfmt to ensure 'Defline contains only original sequence name' and that the remaining text was discarder

        The resulting command in my case was (after correcting typo!):
        fastq-dump -F --split-3 --gzip *.sra -O fastq_read_folder
        TIP: I used P|P|S|S to speed this dramatically for the 26 input files on my 24 thread machine.

        My reads have now a header line as

        @HWI-ST188:1:1101:1222:2140
        NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA
        +HWI-ST188:1:1101:1222:2140
        #1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA
        which after tophat mapping results for that particular read in

        HWI-ST188:1:1101:1222:2140 99 chr10 59953037 50 101M = 59953061 125 NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA #1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C100 YT:Z:UU NH:i:1
        Running picard on such BAM data is now able to identify few 1000' optical repeats in the full sample.

        CQFD
        Last edited by splaisan; 02-15-2014, 06:14 AM. Reason: typo
        http://www.bits.vib.be/index.php

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Don't see a "-F" in your fastq-dump command above. Typo?

          Comment

          • splaisan
            senior molecular biologist
            • Jun 2009
            • 32

            #6
            shame on me! corrected now (thanks)
            http://www.bits.vib.be/index.php

            Comment

            Latest Articles

            Collapse

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, 06-05-2026, 10:09 AM
            0 responses
            11 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-04-2026, 08:59 AM
            0 responses
            23 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-02-2026, 12:03 PM
            0 responses
            28 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-02-2026, 11:40 AM
            0 responses
            22 views
            0 reactions
            Last Post SEQadmin2  
            Working...