Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by bioliyezhang View Post
    maybe try FixMateInformation of picard tools.
    Not sure whether it will work.
    I believe FixMateInformation is to be used on ALIGNED SAM files, to make sure both mates of a paired-end read have the same info, and not on pre-mapped fastq files to ensure both files contain the same mate pairs and in the same order

    Comment


    • #17
      Dear btmb,
      I'm afraid I still cannot run it. Sorry to keep bothering?

      I have corrected tabs and spaces to avoid getting the Unexpected indent Error,

      but now I get:

      Traceback (most recent call last):
      File "fastq_mate_pair", line 61, in <module>
      if count==1:
      NameError: name 'count' is not defined
      Thanks again for any help,

      Carmen

      Comment


      • #18
        Hi bmtb,

        I hope this is what you are looking for. Even I had the same problem where reads in forward and reverse files were not in order. I use this shell script to extract common reads and orphan reads. It produces 4 files forward_matched, reverse_matched, forward_orphan and reverse_orphan.

        Usage is simple

        Code:
        filteresCommonPEreads.sh forward.fastq reverse.fastq
        Hope this helps.

        P.S : change file extension to sh.
        Attached Files

        Comment


        • #19
          btmb,

          I got your edited script to run today, thank you. However, it prints nothing out to the created files. I get the following error.
          finished reading in: 46na_1_trimmed_clipped.fastq
          finished reading in: 46na_1_trimmed_clipped.fastq
          number of paired reads: 37819703
          number of single reads left from 46na_1_trimmed_clipped.fastq: 3984165
          number of single reads left from 46na_2_trimmed_clipped.fastq: 4582643
          Traceback (most recent call last):
          File "/usr/bin/fastqcombinepairedend_edit.py", line 106, in <module>
          SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
          KeyError: 'HWI-ST594:1:2110:2163:8239#TTAGGCTTAGGC'
          Here are the corresponding lines in the script:
          print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
          for label in single1:
          SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
          It looks like the dictionary (label) contains nothing, and I can not find it anywhere else in the script. This is my first Python script to use and I would like to get it running. How would I go about fixing this?

          Comment


          • #20
            Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:
            Code:
            paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'
            it joins, sorts and splits again.

            best,

            Phil

            Comment


            • #21
              Originally posted by sphil View Post
              Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:
              Code:
              paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'
              it joins, sorts and splits again.

              best,

              Phil
              Hey Phil

              Maybe I didn't understand but I used the "cat" command in front of the line and ran it. It produced two files, but the script just finished without error (thought it was done) after writing only a couple of reads in file1 and just some "2:N:0:ACTGAT" lines in file2. What is the problem here do you think?

              -Bo

              Comment


              • #22
                try this one:
                paste pe_file1.fastq pe_file2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'

                the shuf and head command kinda crashed it. sry..

                Comment


                • #23
                  count and name are not defined

                  I have the same problem as carmeyeii.
                  After declare count=0 name=None in fastqcombinepairedend.py, it gives me the empty files as following:

                  ./PECombiner.sh
                  finished reading in: /fastqData/SRR579533_1_trimmed_clipped.fastq
                  finished reading in: /fastqData/SRR579533_2_trimmed_clipped.fastq
                  number of paired reads: 0
                  number of single reads left from /fastqData/SRR579533_1_trimmed_clipped.fastq: 0
                  number of single reads left from /fastqData/SRR579533_2_trimmed_clipped.fastq: 0


                  Is the way I declare the variable is not right?
                  Something went wrong here, is anybody can help?

                  Thank you in advance,



                  Originally posted by carmeyeii View Post
                  Dear btmb,

                  I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

                  Thanks!
                  Last edited by newuserofrnaseq; 01-13-2014, 11:17 AM.

                  Comment


                  • #24
                    I also struggled to get that script to work and I was a little frustrated once I did get it working. Mainly because it uses a lot of memory, as someone commented previously, but also it strips the pair information off the output and creates hardcoded file names.

                    I ended up writing my own tool for pairing reads called Pairfq. The problem I kept running into is that most approaches assume 4 line Fastq as input and the sequence name has to be in a certain format. That means you have to come up with different ways to solve this simple task if you are using Fasta or your sequence names are a little different. It was my aim to try and solve these problems.

                    Here is an example of the usage:

                    Code:
                    $ pairfq makepairs -f s_1_1_trimmed.fq \
                    -r s_1_2_trimmed.fq \
                    -fp s_1_1_trimmed_p.fq \
                    -rp s_1_2_trimmed_p.fq \
                    -fs s_1_1_trimmed_s.fq \
                    -rs s_1_2_trimmed_s.fq
                    My observations are that the above command uses about 43% as much memory as the Python script listed above in the thread. This command is a bit slower because it is not making any assumptions about the format (see below). It is also possible to specify that an index should be used. For example,

                    Code:
                    $ pairfq makepairs -f s_1_1_trimmed.fq \
                    -r s_1_2_trimmed.fq \
                    -fp s_1_1_trimmed_p.fq \
                    -rp s_1_2_trimmed_p.fq \
                    -fs s_1_1_trimmed_s.fq \
                    -rs s_1_2_trimmed_s.fq \
                    --index
                    This will result in almost no memory being used (15 MB RAM actually). The execution will be much slower with this option, but this is the only method to my knowledge that can handle pairing really large sequence sets without a big memory machine.

                    The input can be Fasta or Fastq, compressed (with gzip or bzip2) or uncompressed, and the sequence identifiers can be in Casava 1.4 or 1.8+ format as explained on the project wiki (note that pairing the reads is just one of the functions of Pairfq). The outputs are separate files of paired and unpaired forward and reverse reads (which can be optionally compressed).

                    Hopefully, this will save you some time and help to avoid crafting custom shell commands for this task.

                    Comment


                    • #25
                      I also have a tool for fixing paired reads from interleaved files. It's extremely fast (over 200MB/s) and uses little memory, but cannot fix 2-file input, only single-file interleaved. It's based on the fact that if you corrupt an interleaved file by throwing away some of the reads, all remaining pairs will still be adjacent to their mate. So, you never need to store more than 1 read in memory.

                      To run it, download BBMap and us use this:
                      bbsplitpairs.sh in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

                      It reads/writes fasta or fastq, raw or gzipped. It can also do trimming and quality filtering, for example:

                      bbsplitpairs.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlen=40
                      This will quality-trim the right and left sides of reads to Q10 and discard anything that ends up less than 40bp. For pairs, if one read is shorter than 40bp and the mate is longer, the longer one will be written to the 'outsingle' destination and the shorter will be discarded. in2 and out2 are only needed for paired 2-file input. If a single file is input, then whether it is interleaved or single-ended will be autodetected based on read names, but this can be overridden with the 'int=t' or 'int=f' flag.
                      Last edited by Brian Bushnell; 02-24-2014, 10:29 AM.

                      Comment


                      • #26
                        Originally posted by Brian Bushnell View Post
                        I also have a tool for fixing paired reads from interleaved files. It's extremely fast (over 200MB/s) and uses little memory, but cannot fix 2-file input, only single-file interleaved. It's based on the fact that if you corrupt an interleaved file by throwing away some of the reads, all remaining pairs will still be adjacent to their mate. So, you never need to store more than 1 read in memory.

                        To run it, download BBMap and us use this:
                        bbsplitpairs.sh in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

                        It reads/writes fasta or fastq, raw or gzipped. It can also do trimming and quality filtering, for example:

                        bbsplitpairs.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlen=40
                        This will quality-trim the right and left sides of reads to Q10 and discard anything that ends up less than 40bp. For pairs, if one read is shorter than 40bp and the mate is longer, the longer one will be written to the 'outsingle' destination and the shorter will be discarded. in2 and out2 are only needed for paired 2-file input. If a single file is input, then whether it is interleaved or single-ended will be autodetected based on read names, but this can be overridden with the 'int=t' or 'int=f' flag.
                        I tried the command above, and I can't get it to run. Here is the output:

                        Code:
                        $ ./bbsplitpairs.sh in=bbsplittest_trimmed_cat.fastq out=bbsplit_test_p.fq outsingle=bbsplit_test_s.fq fixpairs
                        : invalid option
                        Usage:	/bin/bash [GNU long option] [option] ...
                        	/bin/bash [GNU long option] [option] script-file ...
                        GNU long options:
                        	--debug
                        	--debugger
                        	--dump-po-strings
                        	--dump-strings
                        	--help
                        	--init-file
                        	--login
                        	--noediting
                        	--noprofile
                        	--norc
                        	--posix
                        	--protected
                        	--rcfile
                        	--rpm-requires
                        	--restricted
                        	--verbose
                        	--version
                        	--wordexp
                        Shell options:
                        	-irsD or -c command or -O shopt_option		(invocation only)
                        	-abefhkmnptuvxBCHP or -o option
                        If I try to execute with bash:

                        Code:
                        $ bash bbsplitpairs.sh in=bbsplittest_trimmed_cat.fastq out=bbsplit_test_p.fq outsingle=bbsplit_test_s.fq fixpairs
                        : command not foundne 3: 
                        : command not foundne 6: 
                        'bsplitpairs.sh: line 8: syntax error near unexpected token `{
                        'bsplitpairs.sh: line 8: `calcXmx () {
                        As a side note, you might want to put your scripts in directory because people get kind of annoyed when you untar a project and it inflates a bunch of scripts.

                        Comment


                        • #27
                          Originally posted by SES View Post
                          As a side note, you might want to put your scripts in directory because people get kind of annoyed when you untar a project and it inflates a bunch of scripts.
                          Ahh, sorry about that; I'll change it.

                          The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

                          java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

                          Comment


                          • #28
                            Originally posted by Brian Bushnell View Post
                            The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

                            java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs
                            Thanks for the response. As you can see in my post above, that is how I executed the script ("bash bbsplitpairs.sh"; on Linux with bash 3.2.25) and it produced the error I posted. As far as the Java command, I will experiment and see what I find...

                            Comment


                            • #29
                              Originally posted by Brian Bushnell View Post
                              Ahh, sorry about that; I'll change it.

                              The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

                              java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs
                              Okay, I have tried the Java command and I got it run but it did not produce any pairs, whereas Pairfq and the Python script mentioned above produced the same result. What is the expected input?

                              Comment


                              • #30
                                Can you post the top 16 or so lines from the input file?

                                The expected input is something like this:

                                @blahA /1
                                ACGT
                                +
                                ????
                                @blahA /2
                                ACGT
                                +
                                ????
                                @blahB /2
                                ACGT
                                +
                                ????
                                @blahC /1
                                ACGT
                                +
                                ????
                                @blahC /2
                                ACGT
                                +
                                ????


                                In this case, "blahA /1" and "blahA /2" would be output as a pair, as would "blahC /1" and "blahC /2", while "blahB /2" would be output to singletons.

                                If the reads don't contain a " /1" and " /2" or a " 1:" and a " 2:", it won't work.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                66 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X