**GenoMax** · 07-07-2014, 02:58 PM

I am not 100% sure what you are asking, but the assembly software is going to put them together eventually. Look into SPAdes.

If you are asking about collapsing a read pair into a single extended read then that is only possible where the insert size is appropriate and you have a certain expectation that the reads are going to overlap in the middle. My hunch is that your HiSeq reads are probably not long enough to allow this.

FLASH and BBMerge are two examples of software packages that can overlap and collapse PE reads.
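
Whether a pair can overlap at all comes down to simple arithmetic: two reads of a given length can only meet in the middle if together they are longer than the fragment they were sequenced from. A minimal sketch of that check follows; `expected_overlap` is a hypothetical helper name, and the read lengths and insert sizes in the comments are made-up illustrations, not values from this thread.

```shell
# Expected overlap (in bases) between the two reads of a pair:
#   2 * read_length - insert_size
# A negative result means there is a gap between the reads,
# so they cannot be merged into one extended read.
expected_overlap() {
    echo $(( 2 * $1 - $2 ))
}

# Illustrative values only (assumptions, not taken from the thread):
expected_overlap 150 250   # 2x150 reads, 250 bp fragment
expected_overlap 150 500   # 2x150 reads, 500 bp fragment
```

Merging tools also need a minimum overlap to be confident in a merge, so a small positive number here is not a guarantee that a pair will be combined.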

**phynex92** · 07-07-2014, 10:09 PM

Thanks for the response!

So I did try using FLASH. I'm just not sure what I should do with the output it gives. It produced 3 fastq files:
- out.extendedFrags.fastq
- out.notCombined_1.fastq
- out.notCombined_2.fastq

Which files should I use for the assembly?

**GenoMax** · 07-08-2014, 05:28 AM

If your reads are really overlapping (see the caveat from my last post in this thread) then the extendedFrags file should be significantly larger than the other two. Is that the case?
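
File size is a rough proxy here; counting reads is a more direct comparison. The sketch below assumes standard 4-line FASTQ records (no wrapped sequence lines) and uses the FLASH output names mentioned earlier in the thread. `count_reads` is a hypothetical helper name.

```shell
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the read count for each FLASH output file that exists
# in the current directory; missing files are skipped silently.
for f in out.extendedFrags.fastq out.notCombined_1.fastq out.notCombined_2.fastq; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

If most pairs merged, the extendedFrags count should be much higher than the notCombined counts.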

What kind of PE data do you have (2x100 or 2x150)? You can use the two read files as input for SPAdes.

**phynex92** · 07-08-2014, 06:53 AM

You're right, the extendedFrags file was not as big as the other two. I guess it doesn't really work well for my case.

Anyway, I think I might not have articulated my question very well (I'm a newbie at this!). We sent the genome to Illumina for HiSeq sequencing (I believe it is 2x150), and they sent us back 3 pairs of PE read files (see picture). I'm just not sure what I should do with them. Should I assemble each pair individually, or should I somehow concatenate them into just 1 pair?

Thanks again!

Attached file: Capture.JPG

**GenoMax** · 07-08-2014, 07:48 AM

Thanks for posting that image. That was informative.

You received data for your sample where respective reads (R1 and R2) were split into three pieces.

In this case you can "cat" the respective pieces together to make a single combined file for each of the two reads. The two combined data files (R1 and R2) can then go into an assembler.

Code:

$ cat 31_TTAGGC_L001_R1_001.fastq.gz 31_TTAGGC_L001_R1_002.fastq.gz 31_TTAGGC_L001_R1_003.fastq.gz > 31_TTAGGC_L001_R1_combined.fastq.gz

Code:

$ cat 31_TTAGGC_L001_R2_001.fastq.gz 31_TTAGGC_L001_R2_002.fastq.gz 31_TTAGGC_L001_R2_003.fastq.gz > 31_TTAGGC_L001_R2_combined.fastq.gz
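
Because pairing depends on read order, the pieces need to be concatenated in the same order for R1 and R2, as in the two commands above. One quick sanity check afterwards is that both combined files hold the same number of reads. This is only a sketch, assuming 4-line FASTQ records; `count_reads_gz` and `same_read_count` are hypothetical helper names.

```shell
# Count reads in a gzipped FASTQ file (4 lines per record assumed).
# gzip -cd reads through concatenated gzip members, so files built with cat work.
count_reads_gz() {
    gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# Succeed only if both files contain the same number of reads.
same_read_count() {
    [ "$(count_reads_gz "$1")" -eq "$(count_reads_gz "$2")" ]
}

r1=31_TTAGGC_L001_R1_combined.fastq.gz
r2=31_TTAGGC_L001_R2_combined.fastq.gz
if [ -f "$r1" ] && [ -f "$r2" ]; then
    if same_read_count "$r1" "$r2"; then
        echo "R1 and R2 read counts match"
    else
        echo "R1 and R2 read counts differ; check that every piece was included" >&2
    fi
fi
```

Equal counts do not prove the mates are in the same order, but a mismatch is a clear sign that something went wrong during concatenation.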

**phynex92** · 07-08-2014, 09:41 AM

Thanks!
Also, another question if you don't mind.
We have a pipeline that runs multiple assemblers (ABySS, Velvet and SOAP), so it produces 3 different assemblies. Right now we are just using the one that has the largest contigs. Is there a good way to post-process these 3 assemblies into one final set of contigs?

**Brian Bushnell** · 07-08-2014, 09:56 AM

phynex,

Dedupe was written explicitly for the purpose of combining multiple assemblies from the same reads into a single assembly. Another option is Minimus. The main difference is that Dedupe is much faster and will never create chimeric contigs (since it only discards redundant contigs), while Minimus merges contigs that overlap. So Dedupe is safer, but Minimus will produce a smaller final assembly. For multiple assemblies, my group always runs Dedupe then Minimus, to reduce the data volume Minimus has to process.

Sample command line:

dedupe.sh in=assembly1.fa,assembly2.fa,assembly3.fa out=merged.fa edits=10 minoverlap=200

**GenoMax** · 07-08-2014, 10:00 AM

Originally posted by phynex92

We have a pipeline that runs multiple assemblers (ABySS, Velvet and SOAP), so it produces 3 different assemblies. Right now we are just using the one that has the largest contigs. Is there a good way to post-process these 3 assemblies into one final set of contigs?

Perhaps you can try suggestions from a recent thread: http://seqanswers.com/forums/showthread.php?t=44768 (GAP5 and SSPACE).

You should also look at the quality of your assemblies with QUAST.
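
For a quick first look before running a full evaluation, basic contiguity numbers can be computed with standard tools. The sketch below is not QUAST and reports far less; `assembly_stats` is a hypothetical helper name, and the file names in the loop are placeholders for your own assemblies.

```shell
# Print contig count, largest contig, total length and N50 for a FASTA file.
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # header line: emit previous contig length
             { len += length($0) }                      # sequence line: add to current contig
        END  { if (len > 0) print len }                 # emit the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0"; exit }
            half = total / 2
            # N50: length of the contig at which the running sum
            # (largest first) reaches half of the total assembly length.
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d largest=%d total=%d N50=%d\n", NR, lens[1], total, n50
        }
    '
}

# Placeholder file names; only files that exist are reported.
for f in assembly1.fa assembly2.fa assembly3.fa; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(assembly_stats "$f")"
    fi
done
```

Picking the assembly with the largest contigs only looks at one of these numbers; comparing N50 and total length side by side gives a slightly fuller picture, though none of them says anything about correctness.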

**phynex92** · 07-08-2014, 10:42 AM

Hey Brian,

Thanks for the advice.
So I ran the command using

sh ./dedupe.sh -Xmx2g in=masurca_contigs.fasta,velvet_contigs.fasta,spades_contigs.fasta,idba_contigs.fasta out=merged.fasta edits=10 minoverlap=200

But it came out with an error:

./dedupe.sh: 92: ./dedupe.sh: Bad substitution
./dedupe.sh: 100: ./dedupe.sh: [[: not found
./dedupe.sh: 100: ./dedupe.sh: [[: not found
./dedupe.sh: 106: ./dedupe.sh: source: not found
./dedupe.sh: 107: ./dedupe.sh: parseXmx: not found
./dedupe.sh: 108: ./dedupe.sh: [[: not found
./dedupe.sh: 111: ./dedupe.sh: freeRam: not found
java -ea -Xmxm -Xmsm -cp /home/yongl/bbmap/current/ jgi.Dedupe -Xmx2g
Invalid maximum heap size: -Xmxm
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I've checked Java on my computer and it is installed properly (java7). I'm wondering if you have any insight into why it's not working.

Thanks

**Brian Bushnell** · 07-08-2014, 11:22 AM

phynex,

The shell script only works in bash. So if you're using a different shell, you can either try "bash" instead of "sh", or this command:

java -Xmx2g -cp /path/to/bbmap/current/ jgi.Dedupe in=masurca_contigs.fasta,velvet_contigs.fasta,spades_contigs.fasta,idba_contigs.fasta out=merged.fasta edits=10 minoverlap=200
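
Errors such as "[[: not found" and "source: not found" are typical when a script that uses bash-only syntax is run by a plain POSIX sh. The following throwaway demonstration (not part of BBMap) shows the "bash instead of sh" route; it assumes bash is installed and on the PATH.

```shell
# Write a tiny script that uses [[ ... ]], a bash feature that
# plain POSIX shells such as dash do not provide.
demo=$(mktemp)
cat > "$demo" <<'EOF'
if [[ -n "$1" ]]; then echo "bash ok: $1"; fi
EOF

# Running it through bash works regardless of what "sh" points to.
bash "$demo" test

# The same idea applied to the command from this thread would be:
#   bash ./dedupe.sh -Xmx2g in=masurca_contigs.fasta,velvet_contigs.fasta,spades_contigs.fasta,idba_contigs.fasta out=merged.fasta edits=10 minoverlap=200
```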

-Brian

Multiple Pair-end Reads