Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by maubp View Post
    Very misguided given PHRED zero would have been fine for this

    Thanks for the information. I'm not sure what off the shelf solution to recommend here - personally I'd write a Python script to filter out these duff reads...
    Being a non-programmer I used the tools form scriptome. Here is my pipeline to get rid of those reads:

    1Fasta to tab
    perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) { print "\n" } s/ |$/\t/; $count++; $_ .= "\t"; } else { s/ //g; $len += length($_) } print $_; } print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n"; ' Corrida_16_01RMDSPFR004_2_F3.csfasta > Trepo_2.tab

    2Choose lines not containing "."
    perl -e ' $string=q{.}; $count=0; while(<>) { if (!/\Q$string\E/) { print $_; $count++ } } warn "\nChose $count lines with string [$string] out of $. total lines.\n"; ' Trepo_2.tab > TrepoNoDot_2.tab

    3Choose col 0
    perl -e ' @cols=(0); while(<>) { s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n" } warn "\nChose columns ", join(", ", @cols), " for $. lines\n\n" ' TrepoNoDot_2.tab > Trepo2ID.lst

    4 tab to csfasta
    perl -e ' $len=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) { print " $F[1]" } print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n"; } warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n"; ' TrepoNoDot_2.tab > TrepoNoDot_2_F3.csfasta


    Qual file

    5 Extract the corresponding .qual values from a list of IDs
    perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' Trepo2ID.lst Corrida_16_01RMDSPFR004_2_F3_QV.qual > TrepoNoDot_2_F3_QV.qual
    Last edited by pepperoni; 10-17-2011, 07:17 PM.

    Comment


    • #32
      Originally posted by maubp View Post
      How would you supply a potentially very long list of IDs to grep?
      using the '-f FILE' option, where FILE contains the list of (regexps matching the) readnames, or, if the readnames are all fixed strings, then also use '-F' switch to grep in addition to '-f FILE'.

      cdbfasta/cdbyank is very useful but there seems to be an upper limit on the number of reads that can be indexed, and hence retrieved.
      Last edited by cs; 07-05-2012, 01:49 PM. Reason: added info

      Comment


      • #33
        to extract reads from fastq file based on a id list (one id per line)

        Code:
        cat in.fastq | grep -f id.txt -A 3 | grep -v "\-\-" > out.fastq

        Comment


        • #34
          Originally posted by pmiguel View Post
          cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

          But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

          --
          Phillip
          You can specify multiple patterns with the -f option. Place the IDs in a file and run

          grep file.fastq -F -A 3 -f filter.ids | grep -v '^--$' > output.fastq

          You need the second grep because if two matches are consecutive they will be separated by a '---' line. Using -F speeds up the search as it specifies that the IDs are fixed strings.

          Whether that will be faster or slower than other options is open to debate, but as it does not require file parsing I'd bet it is fastest.

          And I concur with Phillip: whenever there is a basic standalone command that can do the job, it is generally best to use it directly instead of making new, more specialized and unneeded programs. Of course, writing programs is fun and helps you learn, but I can think of better ways for using my scarce time.

          Comment


          • #35
            Don't forget the "-w" option (otherwise looking for "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:2455" will pull you "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:24553").

            Code:
            grep -F -w -A3 -f IDs.list all.fastq | grep -v '^--$'

            Comment


            • #36
              Grep mystery: no output if input file too long?

              Well, the grep command above seems to work only with a list of limited length. With a subset of the list many records are retrieved but with the full list the output is an empty file. Any idea why grep could just fail with no error message?

              Comment


              • #37
                I eventually used fqextract.c and it worked even on a large list.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Recent Advances in Sequencing Analysis Tools
                  by seqadmin


                  The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                  05-06-2024, 07:48 AM
                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:35 AM
                0 responses
                14 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-09-2024, 02:46 PM
                0 responses
                18 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-07-2024, 06:57 AM
                0 responses
                17 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-06-2024, 07:17 AM
                0 responses
                19 views
                0 likes
                Last Post seqadmin  
                Working...
                X