**cmbetts** · 03-15-2016, 11:42 AM

The quick answer for the first question is that the sequencer runs as many cycles as you tell it to, and that's how long the reads come out. If the insert is shorter than the read length, it reads into the adapter on the opposite side, and gibberish (mostly As) beyond that. The bases in the adapter need to be removed by sequence identity, not quality.
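To illustrate what removing the adapter "by sequence identity" means, here is a minimal Python sketch. It uses exact matching only, whereas real trimmers such as cutadapt tolerate mismatches. The adapter string, the example insert, and the function name `trim_adapter` are assumptions chosen for illustration, not values taken from this thread.

```python
# Illustrative 3' adapter; substitute the one from your own library prep.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Return the read with the 3' adapter, and everything after it, removed.

    Exact matching only: a deliberately simplified stand-in for a real trimmer.
    """
    # Case 1: the whole adapter is present, so the insert was shorter than
    # the read and the sequencer read through the adapter and beyond.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only a prefix
    # of the adapter is present at the 3' end of the read.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: leave the read untouched.
    return read


if __name__ == "__main__":
    insert = "TCGGACCAGGCTTCATTCCCC"          # hypothetical 21-nt small RNA
    read = insert + ADAPTER + "AAAAAAAA"      # read-through plus poly-A tail
    print(trim_adapter(read))
```

The trimming decision depends only on where the adapter sequence is found, not on base quality, which is why quality trimming alone would leave the adapter in place.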

I can't answer the second question, as I've never needed to do a genome assembly.

**sfh838t** · 03-15-2016, 12:50 PM

Thank you. I can see the nonsense part, and yes, it was 36 nt after adapter removal.

**MU Core** · 03-16-2016, 05:30 AM

If you are using bcl2fastq for adapter trimming, I believe the default minimum-trimmed-read-length is 35. If trimming would cut a read down to fewer than 35 bases, then the bases between the end of the trimmed read and position 35 are “masked” by replacing them with N’s. So the adapter remaining after 20 bases would be masked. Our group has set the minimum-trimmed-read-length to 10 for small RNA data sets. This may not be your situation, but I thought it worth mentioning.
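The masking behaviour described above can be modelled in a few lines of Python. This is a sketch of the rule as stated in this post, not bcl2fastq's actual code; the function name `trim_or_mask` and the example read are hypothetical.

```python
def trim_or_mask(read: str, adapter_start: int, min_len: int = 35) -> str:
    """Model trimming with a minimum trimmed read length.

    adapter_start is the 0-based position where the adapter begins in the read.
    """
    if adapter_start >= min_len:
        # The trimmed read is long enough: simply cut at the adapter.
        return read[:adapter_start]
    # Trimming would leave fewer than min_len bases, so keep min_len
    # positions and replace the adapter-derived ones with N.
    kept = read[:min_len]
    return kept[:adapter_start] + "N" * (len(kept) - adapter_start)


if __name__ == "__main__":
    # Hypothetical 50-cycle read: 20 nt of insert followed by 30 nt of adapter.
    read = "A" * 20 + "C" * 30
    print(trim_or_mask(read, adapter_start=20))              # default minimum of 35
    print(trim_or_mask(read, adapter_start=20, min_len=10))  # small RNA setting
```

With the default of 35, a 20-nt insert comes out as a 35-character read padded with N's; lowering the minimum to 10 lets the same read be cut back to its 20 real bases.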

**sfh838t** · 03-16-2016, 05:53 AM

I used cutadapt for adapter removal, which, as best I can tell, will remove all parts of the search string no matter where they occur.
It still puzzles me, though, why I can find reads that align (regardless of read length) covering literally my whole ref seq but can only come up with contigs covering 1 kb of nearly 8 kb. I know, aligning and assembling are two different things/algorithms, but still.
Does anyone have an idea where else I could ask this question?

**GenoMax** · 03-16-2016, 06:02 AM

Originally posted by sfh838t:

It still puzzles me though why I can find reads that align (regardless of read length) covering literally my whole ref seq but can only come up with contigs covering 1kb of nearly 8 kb.

Let me see if I am understanding this right.

If you align you can find reads covering the entire reference (8kb?) but if you try to assemble those reads then you can only get contigs that represent just 1 kb of the 8kb reference?

Sequence assembly is a hard problem. If there are repeats in your reference (coupled with the short reads in your dataset) then that result is not surprising.
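One way to picture the short-read part of this, assuming a de Bruijn graph assembler (velvet and ABySS both work this way): reads are broken into k-mers, and a contig can only extend while consecutive reference k-mers are all present in the reads. The Python sketch below uses simulated data, with the read length, spacing, and k values picked purely for illustration. It shows reads that cover every base of a reference while supplying only a fraction of its k-mers when k is close to the read length.

```python
import random


def covered_bases(ref_len, read_starts, read_len):
    """Count reference positions covered by at least one read."""
    covered = [False] * ref_len
    for start in read_starts:
        for i in range(start, min(start + read_len, ref_len)):
            covered[i] = True
    return sum(covered)


def covered_kmers(ref, reads, k):
    """Return (reference k-mers found in the reads, total reference k-mers)."""
    read_kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    ref_kmers = [ref[i:i + k] for i in range(len(ref) - k + 1)]
    return sum(km in read_kmers for km in ref_kmers), len(ref_kmers)


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "".join(rng.choice("ACGT") for _ in range(201))  # toy reference
    read_len, step = 21, 15            # 21-nt reads overlapping by 6 nt
    starts = list(range(0, len(ref) - read_len + 1, step))
    reads = [ref[s:s + read_len] for s in starts]

    print("bases covered:", covered_bases(len(ref), starts, read_len), "of", len(ref))
    for k in (5, 21):
        found, total = covered_kmers(ref, reads, k)
        print(f"k={k}: {found} of {total} reference k-mers present in reads")
```

In this toy case neighbouring reads overlap by only 6 nt, so any k larger than 7 leaves gaps in the k-mer chain even though an aligner would report full coverage. Real data add sequencing errors and uneven depth on top of that.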

**sfh838t** · 03-16-2016, 06:12 AM

yes, you did understand correctly.
I used either BWA or bowtie2 to align reads to the ref seq, then went through the samtools steps to keep only the reads that align, converted those back to fastq, then ran velvet or ABySS and got mostly nothing, depending on read depth.
I have three plant samples with apparently varying degrees of virus infection; assembled contig coverage increases from 1 kb to 2 kb and 6 kb of the 8 kb total virus length with increasing read depth. However, for each sample I can use IGV to look at the read alignments and bedtools to give me numbers for them, and if I use all reads regardless of their length I have coverage of the entire target virus minus 1 to 6 nts.
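The "entire virus minus a few nts" figure is a breadth-of-coverage number. A minimal Python sketch of that calculation follows, assuming alignments are given as BED-style 0-based, half-open intervals that lie within the reference. The interval values and the function name `uncovered_regions` are made up for illustration and do not come from this data set.

```python
def uncovered_regions(ref_len, intervals):
    """Return the reference regions not covered by any alignment interval.

    Intervals are (start, end) pairs, 0-based and half-open, as in BED.
    """
    gaps = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < ref_len:
        gaps.append((cursor, ref_len))
    return gaps


if __name__ == "__main__":
    # Hypothetical alignments on an 8000-nt reference.
    alignments = [(2, 3000), (2500, 6000), (6000, 7997)]
    gaps = uncovered_regions(8000, alignments)
    missing = sum(end - start for start, end in gaps)
    print("uncovered regions:", gaps)
    print("uncovered bases:", missing)
```

A number like this says every base is touched by some read; it does not say that neighbouring reads overlap enough for an assembler to join them.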

**fanli** · 03-16-2016, 06:14 AM

Is there something particular about your virus that you'd be trying to do assembly with really short reads? I don't think a lot of the assemblers out there are optimized for this...

**sfh838t** · 03-16-2016, 06:17 AM

Looking for variants, maybe strain identification, etc.
velvet seems to be commonly used for this. Any suggestions for a different assembler?

**GenoMax** · 03-16-2016, 06:23 AM

Is what we are discussing now unrelated to the original question or is this an ssRNA virus? (I can split later posts into a new thread if that is so).

Is there a reason you are trying to assemble the virus (when you have a reference)? (Edit: Looks like @fanli already asked this question while I was typing this).

If you have some time take a look at tadpole.sh from BBMap. It may provide a fresh option. I would also look into BBSplit to separate the viral reads before doing the assembly with tadpole.

**sfh838t** · 03-16-2016, 07:16 AM

It was the second question, so I don't know if it should be split.
I will look into tadpole and the other suggestions, thanks!

why are sRNA output reads longer than siRNA?
