Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by maubp View Post
    Very misguided given PHRED zero would have been fine for this

    Thanks for the information. I'm not sure what off the shelf solution to recommend here - personally I'd write a Python script to filter out these duff reads...
    Being a non-programmer I used the tools form scriptome. Here is my pipeline to get rid of those reads:

    1Fasta to tab
    perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) { print "\n" } s/ |$/\t/; $count++; $_ .= "\t"; } else { s/ //g; $len += length($_) } print $_; } print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n"; ' Corrida_16_01RMDSPFR004_2_F3.csfasta > Trepo_2.tab

    2Choose lines not containing "."
    perl -e ' $string=q{.}; $count=0; while(<>) { if (!/\Q$string\E/) { print $_; $count++ } } warn "\nChose $count lines with string [$string] out of $. total lines.\n"; ' Trepo_2.tab > TrepoNoDot_2.tab

    3Choose col 0
    perl -e ' @cols=(0); while(<>) { s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n" } warn "\nChose columns ", join(", ", @cols), " for $. lines\n\n" ' TrepoNoDot_2.tab > Trepo2ID.lst

    4 tab to csfasta
    perl -e ' $len=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) { print " $F[1]" } print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n"; } warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n"; ' TrepoNoDot_2.tab > TrepoNoDot_2_F3.csfasta


    Qual file

    5 Extract the corresponding .qual values from a list of IDs
    perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' Trepo2ID.lst Corrida_16_01RMDSPFR004_2_F3_QV.qual > TrepoNoDot_2_F3_QV.qual
    Last edited by pepperoni; 10-17-2011, 07:17 PM.

    Comment


    • #32
      Originally posted by maubp View Post
      How would you supply a potentially very long list of IDs to grep?
      using the '-f FILE' option, where FILE contains the list of (regexps matching the) readnames, or, if the readnames are all fixed strings, then also use '-F' switch to grep in addition to '-f FILE'.

      cdbfasta/cdbyank is very useful but there seems to be an upper limit on the number of reads that can be indexed, and hence retrieved.
      Last edited by cs; 07-05-2012, 01:49 PM. Reason: added info

      Comment


      • #33
        to extract reads from fastq file based on a id list (one id per line)

        Code:
        cat in.fastq | grep -f id.txt -A 3 | grep -v "\-\-" > out.fastq

        Comment


        • #34
          Originally posted by pmiguel View Post
          cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

          But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

          --
          Phillip
          You can specify multiple patterns with the -f option. Place the IDs in a file and run

          grep file.fastq -F -A 3 -f filter.ids | grep -v '^--$' > output.fastq

          You need the second grep because if two matches are consecutive they will be separated by a '---' line. Using -F speeds up the search as it specifies that the IDs are fixed strings.

          Whether that will be faster or slower than other options is open to debate, but as it does not require file parsing I'd bet it is fastest.

          And I concur with Phillip: whenever there is a basic standalone command that can do the job, it is generally best to use it directly instead of making new, more specialized and unneeded programs. Of course, writing programs is fun and helps you learn, but I can think of better ways for using my scarce time.

          Comment


          • #35
            Don't forget the "-w" option (otherwise looking for "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:2455" will pull you "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:24553").

            Code:
            grep -F -w -A3 -f IDs.list all.fastq | grep -v '^--$'

            Comment


            • #36
              Grep mystery: no output if input file too long?

              Well, the grep command above seems to work only with a list of limited length. With a subset of the list many records are retrieved but with the full list the output is an empty file. Any idea why grep could just fail with no error message?

              Comment


              • #37
                I eventually used fqextract.c and it worked even on a large list.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 10:49 AM
                0 responses
                9 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-25-2024, 11:49 AM
                0 responses
                21 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-24-2024, 08:47 AM
                0 responses
                20 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                62 views
                0 likes
                Last Post seqadmin  
                Working...
                X