  • Multiple Pair-end Reads

    Hey guys,

    I'm completely new to de novo assembly and I have some questions about how to deal with Illumina sequencing reads.

    So we have a set of multiple paired-end reads for a strain of Streptomyces that we sequenced on an Illumina HiSeq. I'm wondering if there's a way to merge these pairs into a single pair so that we can assemble them together in one run. I've heard that you can just concatenate the reads using the cat command in Linux, but I'm not sure if that's the best way to merge the reads.

    Any suggestions on this issue would be greatly appreciated.

  • #2
    I am not 100% sure what you are asking, but the assembly software is going to put them together eventually. Look into SPAdes.

    If you are asking about collapsing a read pair into a single extended read then that is only possible where the insert size is appropriate and you have a certain expectation that the reads are going to overlap in the middle. My hunch is that your HiSeq reads are probably not long enough to allow this.

    FLASH and BBMerge are examples of software packages that can overlap and collapse PE reads.
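As a rough back-of-the-envelope check of the overlap condition above: two reads of length L sequenced from the ends of a fragment of length F overlap by about 2L - F bases, so merging is only plausible when that number is comfortably positive. A minimal sketch, where `expected_overlap` and the example fragment sizes are hypothetical, and "insert size" is taken to mean the full fragment length:

```shell
# Expected overlap (in bases) between the two reads of a pair.
# Usage: expected_overlap READ_LENGTH FRAGMENT_LENGTH
# A negative result is the size of the unsequenced gap between the reads.
# (Ignores the special case of fragments shorter than a single read.)
expected_overlap() {
    echo $(( 2 * $1 - $2 ))
}

expected_overlap 150 250   # 2x150 reads, 250 bp fragments: prints 50
expected_overlap 150 500   # 2x150 reads, 500 bp fragments: prints -200 (no overlap)
```

If you do not know the fragment size of your library, your sequencing provider should be able to tell you.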

    • #3
      Thanks for the response!

      So I did try using FLASH. I'm just not sure what I should do with the output it gives. It produced three FASTQ files:
      - out.extendedFrags.fastq
      - out.notCombined_1.fastq
      - out.notCombined_2.fastq

      Which files should I take to assemble?

      • #4
        If your reads are really overlapping (see the caveat in my last post in this thread), then the extendedFrags file should be significantly larger than the other two. Is that the case?

        What kind of PE data do you have (2x100 or 2x150)? You can use the two read files (R1 and R2) as input for SPAdes.
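One way to put a number on "significantly larger" is to compare read counts rather than file sizes. The sketch below assumes uncompressed FASTQ with four lines per record and the output file names from the post above; `count_fastq` and `merged_percent` are hypothetical helper names, and the demo files are made up.

```shell
# Number of records in an uncompressed FASTQ file (4 lines per record).
count_fastq() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Percentage of read pairs that were merged into a single fragment.
# Usage: merged_percent EXTENDED_FRAGS NOT_COMBINED_1
merged_percent() {
    merged=$(count_fastq "$1")
    unmerged=$(count_fastq "$2")
    total=$(( merged + unmerged ))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( 100 * merged / total ))
    fi
}

# Tiny made-up demo: 1 merged pair, 3 unmerged pairs.
demo=$(mktemp -d)
printf '@m1\nACGTACGT\n+\nIIIIIIII\n' > "$demo/out.extendedFrags.fastq"
for i in 1 2 3; do
    printf '@u%s\nACGT\n+\nIIII\n' "$i"
done > "$demo/out.notCombined_1.fastq"

pct=$(merged_percent "$demo/out.extendedFrags.fastq" "$demo/out.notCombined_1.fastq")
echo "merged: ${pct}%"
```

A low percentage suggests the fragments are too long for the reads to overlap, in which case the original R1 and R2 files are the better assembler input.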

        • #5
          You're right, the extendedFrags file was not as big as the other two. I guess it doesn't really work well for my case.

          Anyway, I think I might not have articulated my question very well (I'm a newbie at this!). We sent the genome to Illumina for HiSeq sequencing (I believe it is 2x150), and they sent us back 3 pairs of PE read files (see picture). I'm just not sure what I should do with them. Should I assemble each pair individually, or should I somehow concatenate them together into just 1 pair?

          Thanks again!

          • #6
            Thanks for posting that image. That was informative.

            You received data for your sample where respective reads (R1 and R2) were split into three pieces.

            In this case you can "cat" the respective pieces together to make a single combined file for each of the two reads. The two combined data files (R1 and R2) can then go into an assembler.

            Code:
            $ cat 31_TTAGGC_L001_R1_001.fastq.gz 31_TTAGGC_L001_R1_002.fastq.gz 31_TTAGGC_L001_R1_003.fastq.gz > 31_TTAGGC_L001_R1_combined.fastq.gz
            Code:
            $ cat 31_TTAGGC_L001_R2_001.fastq.gz 31_TTAGGC_L001_R2_002.fastq.gz 31_TTAGGC_L001_R2_003.fastq.gz > 31_TTAGGC_L001_R2_combined.fastq.gz
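Concatenating the .gz pieces directly works because gzip tools read a file made of several gzip members as one stream. The pieces must be joined in the same order for R1 and R2 so that mates stay in sync. A minimal sketch of a sanity check after combining, using made-up demo files and hypothetical helper names (`count_reads_gz`, `combine_pieces`):

```shell
# Number of records in a gzipped FASTQ file (4 lines per record).
count_reads_gz() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Concatenate the pieces of one read direction, in the order given.
# Usage: combine_pieces OUTPUT PIECE...
combine_pieces() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Tiny made-up demo: two pieces per read direction, one record each.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$demo_dir/demo_R1_001.fastq.gz"
printf '@r2\nGGCC\n+\nIIII\n' | gzip > "$demo_dir/demo_R1_002.fastq.gz"
printf '@r1\nTTAA\n+\nIIII\n' | gzip > "$demo_dir/demo_R2_001.fastq.gz"
printf '@r2\nCCGG\n+\nIIII\n' | gzip > "$demo_dir/demo_R2_002.fastq.gz"

# The glob expands in sorted order (001, 002, ...) for both directions.
combine_pieces "$demo_dir/demo_R1_combined.fastq.gz" "$demo_dir"/demo_R1_0*.fastq.gz
combine_pieces "$demo_dir/demo_R2_combined.fastq.gz" "$demo_dir"/demo_R2_0*.fastq.gz

r1_count=$(count_reads_gz "$demo_dir/demo_R1_combined.fastq.gz")
r2_count=$(count_reads_gz "$demo_dir/demo_R2_combined.fastq.gz")
if [ "$r1_count" -eq "$r2_count" ]; then
    echo "OK: $r1_count read pairs"
else
    echo "Mismatch: R1=$r1_count R2=$r2_count" >&2
fi
```

Equal counts do not prove the pairing is intact, but a mismatch is a clear sign that a piece was skipped or added twice.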
            Last edited by GenoMax; 07-08-2014, 07:50 AM.

            • #7
              Thanks!
              Also, another question if you don't mind.
              We have a pipeline that uses multiple assemblers (ABySS, Velvet and SOAP), so it produces 3 different assemblies. Right now we are just using the one that has the largest contigs. Is there a good way to post-process these 3 files and merge them into one final list of contigs?

              • #8
                phynex,

                Dedupe was written explicitly for the purpose of combining multiple assemblies from the same reads into a single assembly. Another option is Minimus. The main difference is that Dedupe is much faster and will never create chimeric contigs (since it only discards redundant contigs), while Minimus merges contigs that overlap. So Dedupe is safer, but Minimus will produce a smaller final assembly. For multiple assemblies, my group always runs Dedupe and then Minimus, to reduce the data volume Minimus has to process.

                Sample command line:

                dedupe.sh in=assembly1.fa,assembly2.fa,assembly3.fa out=merged.fa edits=10 minoverlap=200

                • #9
                  Originally posted by phynex92 View Post
                  We have a pipeline that uses multiple assemblers (ABySS, Velvet and SOAP), so it produces 3 different assemblies. Right now we are just using the one that has the largest contigs. Is there a good way to post-process these 3 files and merge them into one final list of contigs?
                  Perhaps you can try suggestions from a recent thread: http://seqanswers.com/forums/showthread.php?t=44768 (GAP5 and SSPACE).

                  You should also look at the quality of your assemblies with QUAST.
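For a quick first look before a full QUAST report, the largest contig and the N50 of each assembly can be computed with standard tools. This is not QUAST, just a plain sort/awk sketch; `contig_lengths` and `assembly_stats` are hypothetical helper names and the demo FASTA is made up. N50 here is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length.

```shell
# Print contig lengths, one per line, from a (possibly multi-line) FASTA file.
contig_lengths() {
    awk '/^>/ { if (len > 0) print len; len = 0; next }
         { len += length($0) }
         END { if (len > 0) print len }' "$1"
}

# Print "LARGEST_CONTIG N50" for a FASTA file.
assembly_stats() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "0 0"; exit }
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { print lens[1], lens[i]; exit }
            }
        }'
}

# Tiny made-up demo: contigs of length 8, 6, 4 and 2 (total 20).
demo_fa=$(mktemp)
printf '>c1\nACGTACGT\n>c2\nACG\nTGA\n>c3\nACGT\n>c4\nAC\n' > "$demo_fa"

stats=$(assembly_stats "$demo_fa")
echo "largest contig and N50: $stats"
```

Running `assembly_stats` on each assembler's contig file gives a like-for-like comparison, although contiguity alone says nothing about misassemblies.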

                  • #10
                    Hey Brian,

                    Thanks for the advice.
                    So I ran the command using

                    sh ./dedupe.sh -Xmx2g in=masurca_contigs.fasta,velvet_contigs.fasta,spades_contigs.fasta,idba_contigs.fasta out=merged.fasta edits=10 minoverlap=200

                    But it came out with an error:

                    ./dedupe.sh: 92: ./dedupe.sh: Bad substitution
                    ./dedupe.sh: 100: ./dedupe.sh: [[: not found
                    ./dedupe.sh: 100: ./dedupe.sh: [[: not found
                    ./dedupe.sh: 106: ./dedupe.sh: source: not found
                    ./dedupe.sh: 107: ./dedupe.sh: parseXmx: not found
                    ./dedupe.sh: 108: ./dedupe.sh: [[: not found
                    ./dedupe.sh: 111: ./dedupe.sh: freeRam: not found
                    java -ea -Xmxm -Xmsm -cp /home/yongl/bbmap/current/ jgi.Dedupe -Xmx2g
                    Invalid maximum heap size: -Xmxm
                    Error: Could not create the Java Virtual Machine.
                    Error: A fatal exception has occurred. Program will exit.

                    I've checked the Java installation on my computer and it is installed properly (Java 7). Do you have any insight into why it's not working?

                    Thanks

                    • #11
                      phynex,

                      The shell script only works in bash. So if you're using a different shell, you can either try "bash" instead of "sh", or use this command:

                      java -Xmx2g -cp /path/to/bbmap/current/ jgi.Dedupe in=masurca_contigs.fasta,velvet_contigs.fasta,spades_contigs.fasta,idba_contigs.fasta out=merged.fasta edits=10 minoverlap=200

                      -Brian
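The errors in the previous post (`[[: not found`, `source: not found`, `Bad substitution`) are typical of a bash script being run by a plain POSIX shell; on some systems, Debian and Ubuntu for example, `sh` is dash rather than bash. A minimal sketch of the difference, using a made-up one-line script and assuming `bash` is installed and on the PATH:

```shell
# Write a tiny script that uses the bash-only [[ ... ]] test.
script=$(mktemp)
cat > "$script" <<'EOF'
if [[ "abc" == a* ]]; then echo "pattern matched"; fi
EOF

# Running it explicitly with bash works regardless of what "sh" points to.
result=$(bash "$script")
echo "$result"

# Show what "sh" actually is on this machine (often a symlink to dash or bash).
ls -l "$(command -v sh)"
```

For the command in this thread, that means starting it as `bash ./dedupe.sh ...` rather than `sh ./dedupe.sh ...`.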
