Seqanswers Leaderboard Ad

**GenoMax** · 08-03-2015, 06:18 AM

I am going to guess that "bases not mapping to reference genome" that the authors are referring to are likely contamination (e.g. adapter) that is present in a fraction of the reads, especially if one has inserts smaller than the length of the read.

There are many trimming programs (trimmomatic and bbduk.sh (part of BBMap), search for their threads here) are popular ones. If you have paired end data then you would want to use a trimming program that is "PE-aware" to keep the reads in sync in R1/R2 files.

**antifolate** · 08-03-2015, 08:58 AM

Thank you for the reply. I've tried trimmomatic before and have just tried to use it with the included adapters but for no result. FastQC still shows an error in kmer content and sequence duplication, although trimming the first 14 bases from each read fixed the GC content distribution. Do you have further advice?

**GenoMax** · 08-03-2015, 09:09 AM

FastQC flagging a particular metric (with a warning or error) does not automatically indicate a problem with the data. FastQC uses internal ranges for various metrics (which are not universally applicable for all kinds on data but Simon had to start somewhere) and the warning flags are based on those values.

The "14 base at beginning" GC content feature is observed with most RNAseq/Nextera datasets due to the "random primers not being so random" and "transposon showing some sequence bias for insertions". These sequences will align fine and generally do not require trimming.

If you encountered an error with trimmomatic post the command line you used. BBduk.sh is more straightforward to use/understand. You may want to try that as well.

**Brian Bushnell** · 08-03-2015, 09:17 AM

Originally posted by antifolate View Post

I tried using prinseq- and other trimming software- with guessed parameters but when I get to the assembly part, velvet produces empty files. I'm guessing that's because of my poor pre-processing.

Velvet is fairly robust against low-quality input. If it produced empty output, then probably you were running it incorrectly, or it crashed, or you trimmed your input down to literally nothing, or your coverage is way too low. So -

1) How big / what taxa is the organism?
2) How much data do you have, and what kind?
3) What were your command lines?
4) How much data remained after processing?
5) What was the console output of Velvet?

**antifolate** · 08-03-2015, 09:23 AM

I'm using fastsqc as a preliminary test. The actual criteria that tells me whether the filtering/trimming worked is whether the new sequences can be assembled with velvet. So far, velvet keeps producing empty files.

Here is the last trimmomatic command I used:

java -jar trimmomatic-0.33.jar PE SRR1145661.fastq SRR1145662.fastq F_paired.fg.gz F_unpaired.fg.gz R_paired.fq.gz R_unpaired.fq.gz ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 HEADCROP:14 TRAILING:4 AVGQUAL:20

And the velvet commands:

./velveth FINAL 65 -shortPaired -separate -fastq.gz F_paired.fq.gz R_paired.fq.gz

./velvetg FINAL -exp_cov auto -cov_cutoff auto

Could the problem be from velvet?

**Brian Bushnell** · 08-03-2015, 09:25 AM

Originally posted by antifolate View Post

Could the problem be from velvet?

It could be from anywhere. But without knowing the answers to questions 1, 2, 4, and 5, it would be very difficult to say.

**GenoMax** · 08-03-2015, 09:32 AM

@antifolate: I am not sure what exactly you are trying to do.

Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

No wonder this is not working.

**antifolate** · 08-03-2015, 09:33 AM

@Brian Bushnell
1) Saccharomyces cerevisiae
2) I only have two fastq files about 3 gigs each, one left and one right.
3)Prinseq commands I tried before:

perl prinseq-lite.pl -fastq SRR1145661.fastq -fastq2 SRR1145662.fastq -out_format 3 -qual_noscale -trim_left 15 -trim_right 4 -min_qual_score 20

4)Each file became ~0.5 gigs.
5)Output:

./velveth FINAL 65 -shortPaired -separate -fastq.gz F_paired.fq.gz R_paired.fq.gz

[0.000000] Reading FastQ file /home/zhenguolin/Trimmomatic-0.33/F_paired.fq.gz;
[0.000058] Reading FastQ file /home/zhenguolin/Trimmomatic-0.33/R_paired.fq.gz;
[80.711071] 20786476 sequences found in total in the paired sequence files
[80.711089] Done
[80.711169] Reading read set file FINAL/Sequences;
[83.945761] 20786476 sequences found
[103.790458] Done
[103.790485] 20786476 sequences in total.
[103.790570] Writing into roadmap file FINAL/Roadmaps...
[112.211672] Inputting sequences...
[112.211699] Inputting sequence 0 / 20786476
[112.693165] Inputting sequence 1000000 / 20786476
[113.181203] Inputting sequence 2000000 / 20786476
[113.666087] Inputting sequence 3000000 / 20786476
[114.154794] Inputting sequence 4000000 / 20786476
[114.660421] Inputting sequence 5000000 / 20786476
[115.149249] Inputting sequence 6000000 / 20786476
[115.644338] Inputting sequence 7000000 / 20786476
[116.141600] Inputting sequence 8000000 / 20786476
[116.632185] Inputting sequence 9000000 / 20786476
[117.109180] Inputting sequence 10000000 / 20786476
[117.627538] Inputting sequence 11000000 / 20786476
[118.146591] Inputting sequence 12000000 / 20786476
[118.663219] Inputting sequence 13000000 / 20786476
[119.182846] Inputting sequence 14000000 / 20786476
[119.715876] Inputting sequence 15000000 / 20786476
[120.235799] Inputting sequence 16000000 / 20786476
[120.755655] Inputting sequence 17000000 / 20786476
[121.276797] Inputting sequence 18000000 / 20786476
[121.813735] Inputting sequence 19000000 / 20786476
[122.331596] Inputting sequence 20000000 / 20786476
[122.737475] === Sequences loaded in 11.848269 s
[122.887945] Done inputting sequences
[122.887971] Destroying splay table
[122.887988] Splay table destroyed

./velvetg FINAL -exp_cov auto -cov_cutoff auto

[0.000000] Reading roadmap file FINAL/Roadmaps
[15.924738] 20786476 roadmaps read
[15.926398] Creating insertion markers
[16.370955] Ordering insertion markers
[16.794351] Counting preNodes
[17.030447] 0 preNodes counted, creating them now
[18.270607] Sequence 1000000 / 20786476
[19.500979] Sequence 2000000 / 20786476
[20.738540] Sequence 3000000 / 20786476
[21.972517] Sequence 4000000 / 20786476
[23.209002] Sequence 5000000 / 20786476
[24.446846] Sequence 6000000 / 20786476
[25.683393] Sequence 7000000 / 20786476
[26.919510] Sequence 8000000 / 20786476
[28.153803] Sequence 9000000 / 20786476
[29.389264] Sequence 10000000 / 20786476
[30.618777] Sequence 11000000 / 20786476
[31.849972] Sequence 12000000 / 20786476
[33.088403] Sequence 13000000 / 20786476
[34.318037] Sequence 14000000 / 20786476
[35.551025] Sequence 15000000 / 20786476
[36.783885] Sequence 16000000 / 20786476
[38.014411] Sequence 17000000 / 20786476
[39.241511] Sequence 18000000 / 20786476
[40.476229] Sequence 19000000 / 20786476
[41.715549] Sequence 20000000 / 20786476
[42.704830] Adjusting marker info...
[42.704849] Connecting preNodes
[42.719462] Connecting 1000000 / 20786476
[42.733764] Connecting 2000000 / 20786476
[42.748063] Connecting 3000000 / 20786476
[42.762352] Connecting 4000000 / 20786476
[42.776638] Connecting 5000000 / 20786476
[42.790994] Connecting 6000000 / 20786476
[42.805293] Connecting 7000000 / 20786476
[42.819559] Connecting 8000000 / 20786476
[42.833855] Connecting 9000000 / 20786476
[42.848158] Connecting 10000000 / 20786476
[42.862433] Connecting 11000000 / 20786476
[42.876719] Connecting 12000000 / 20786476
[42.891000] Connecting 13000000 / 20786476
[42.905284] Connecting 14000000 / 20786476
[42.919563] Connecting 15000000 / 20786476
[42.933853] Connecting 16000000 / 20786476
[42.948162] Connecting 17000000 / 20786476
[42.962456] Connecting 18000000 / 20786476
[42.976740] Connecting 19000000 / 20786476
[42.991028] Connecting 20000000 / 20786476
[43.002293] Cleaning up memory
[43.003039] Done creating preGraph
[43.003047] Concatenation...
[43.003050] Renumbering preNodes
[43.003052] Initial preNode count 0
[43.003088] Destroyed 0 preNodes
[43.003092] Concatenation over!
[43.003094] Clipping short tips off preGraph
[43.003097] Concatenation...
[43.003099] Renumbering preNodes
[43.003101] Initial preNode count 0
[43.003104] Destroyed 0 preNodes
[43.003108] Concatenation over!
[43.003111] 0 tips cut off
[43.003113] 0 nodes left
[43.022506] Writing into pregraph file FINAL/PreGraph...
[43.022778] Reading read set file FINAL/Sequences;
[46.057763] 20786476 sequences found
[66.791762] Done
[74.097462] Reading pre-graph file FINAL/PreGraph
[74.097519] Graph has 0 nodes and 20786476 sequences
[76.249032] Correcting graph with cutoff 0.200000
[76.249119] Determining eligible starting points
[76.249126] Done listing starting nodes
[76.249128] Initializing todo lists
[76.249130] Done with initilization
[76.249134] Activating arc lookup table
[76.249136] Done activating arc lookup table
[76.249139] Concatenation...
[76.249156] Renumbering nodes
[76.249159] Initial node count 0
[76.249163] Removed 0 null nodes
[76.249165] Concatenation over!
[76.249167] Clipping short tips off graph, drastic
[76.249170] Concatenation...
[76.249172] Renumbering nodes
[76.249173] Initial node count 0
[76.249176] Removed 0 null nodes
[76.249178] Concatenation over!
[76.249180] 0 nodes left
[76.249261] Writing into graph file FINAL/Graph2...
[76.249429] Measuring median coverage depth...
[76.249433] EMPTY GRAPH
[76.249439] Removing contigs with coverage < 0.000000...
[76.249443] Concatenation...
[76.249445] Renumbering nodes
[76.249447] Initial node count 0
[76.249449] Removed 0 null nodes
[76.249452] Concatenation over!
[76.249454] Concatenation...
[76.249455] Renumbering nodes
[76.249457] Initial node count 0
[76.249460] Removed 0 null nodes
[76.249462] Concatenation over!
[76.249466] Clipping short tips off graph, drastic
[76.249468] Concatenation...
[76.249470] Renumbering nodes
[76.249472] Initial node count 0
[76.249474] Removed 0 null nodes
[76.249476] Concatenation over!
[76.249478] 0 nodes left
[76.249480] WARNING: NO EXPECTED COVERAGE PROVIDED
[76.249482] Velvet will be unable to resolve any repeats
[76.249485] See manual for instructions on how to set the expected coverage parameter
[76.249487] Concatenation...
[76.249489] Renumbering nodes
[76.249491] Initial node count 0
[76.249493] Removed 0 null nodes
[76.249495] Concatenation over!
[76.249497] Removing reference contigs with coverage < 0.000000...
[76.249500] Concatenation...
[76.249502] Renumbering nodes
[76.249504] Initial node count 0
[76.249506] Removed 0 null nodes
[76.249508] Concatenation over!
[76.391317] Writing contigs into FINAL/contigs.fa...
[76.391372] Writing into stats file FINAL/stats.txt...
[76.391511] Writing into graph file FINAL/LastGraph...
[76.391556] Estimated Coverage = 0.000000
[76.391566] Estimated Coverage cutoff = 0.000000
[76.468917] EMPTY GRAPH
Final graph has 0 nodes and n50 of 0, max 0, total 0, using 0/20786476 reads

**antifolate** · 08-03-2015, 09:43 AM

Originally posted by GenoMax View Post

@antifolate: I am not sure what exactly you are trying to do.

Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

No wonder this is not working.

Now that would be just awesome

. This is the first time I've ever used genbank to access data. Now that I think about, I totally guessed that the reads are paired-end. Would you mind taking a look at the files? http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

Thanks.

**Brian Bushnell** · 08-03-2015, 09:45 AM

Originally posted by GenoMax View Post

@antifolate: I am not sure what exactly you are trying to do.

Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

No wonder this is not working.

This is an excellent point! But I think it would still fail even if the correct files were being used.

You're trimming down to possibly almost nothing (3GB -> 0.5GB, which could just be the headers). How long are the reads before and after? I'm guessing that after they are ending up shorter than your kmer length of 65, so that nothing can assemble... considering that Velvet said it used 0 reads.

Posting the fastQC report of before and after trimming may shed some light on this - it includes useful info like read lengths.

**GenoMax** · 08-03-2015, 10:00 AM

Originally posted by antifolate View Post

Now that would be just awesome

. This is the first time I've ever used genbank to access data. Now that I think about, I totally guessed that the reads are paired-end. Would you mind taking a look at the files? http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

Thanks.

It looks like you have two independent single-end yeast RNAseq samples.

Code:

GSM1315085 	DY1
GSM1315086 	BY4741

The description appears to indicate that they were mapped using TopHat (which would be logical since this is RNAseq data).

So if you are trying to replicate the published analysis then you should

1. Trim the two samples independently (use trimmomatic SE or bbduk.sh).
2. Align using TopHat (or BBmap). Former if you are exactly trying to replicate the analysis in the paper.
3. Followed by cufflinks analysis.

No assembly needed (unless you want to try and assemble the data for other reasons).

**antifolate** · 08-03-2015, 10:10 AM

Here's the fastqc report. I didn't know how to post it on here so I uploaded it on dropbox

Dropbox - Error - Simplify your life

https://www.dropbox.com/sh/vtwv5kqqa6xuvr2/AAAuYzN3Hmm7xkQMyRygcMtCa?dl=0

EDIT:

Thank you @GenoMax! I have been struggling with this for more than 2 weeks. The reason I want to assemble the genome is to follow the method in the paper (http://gbe.oxfordjournals.org/content/6/9/2516.long , under Genome Assembly). They used velvet then CLC Workbench. I was thinking maybe I could use bowtie instead of CLC. That's probably gonna have its own thread though

Thanks again everybody.

**GenoMax** · 08-03-2015, 10:48 AM

Originally posted by antifolate View Post

The reason I want to assemble the genome is to follow the method in the paper (http://gbe.oxfordjournals.org/content/6/9/2516.long , under Genome Assembly). They used velvet then CLC Workbench. I was thinking maybe I could use bowtie instead of CLC. That's probably gonna have its own thread though

It looks like there are (should be) two independent datasets here.

1. RNASeq - http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

Other is supposed to be for WGS (which is what was used for assembly). It is not clear where this is located right away. Will take a look.

BTW: bowtie is only for alignment so you will need to try other assemblers if you don't have access to CLC.

**GenoMax** · 08-03-2015, 11:10 AM

@antifolate: You are going to have to contact the authors about the WGS dataset. Even though it is supposed to have been submitted under PRJNA169002, it is not linked from there. It is also not under S. cerevisiae genomes (http://www.ncbi.nlm.nih.gov/genome/genomes/15)

Topics	Statistics	Last Post
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, Yesterday, 10:49 AM	0 responses 18 views 0 likes	Last Post by seqadmin Yesterday, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM

Seqanswers Leaderboard Ad

Announcement

Trimming bases if the reads didn't map to the reference genome

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News