Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trimming bases if the reads didn't map to the reference genome

    Hello everyone,

    I'm fairly new to bioinformatics as I've only recently started training. My PI gives me papers and asks me to follow their methods so I would learn. Currently, I'm learning hybrid de novo/reference guided assembly. Before getting into assembly and alignment, I have to pre-process the raw reads.

    The authors say they assessed the quality of the data using fastqc, which I get. My problem is that they say they trimmed the first and last few bases of the reads that didn't map to the reference genome using prinseq. How could they do that? Prinseq doesn't seem to have this feature.

    I tried using prinseq- and other trimming software- with guessed parameters but when I get to the assembly part, velvet produces empty files. I'm guessing that's because of my poor pre-processing.

    I would really appreciate it if you could help me understand the method mentioned in the paper.

  • #2
    I am going to guess that "bases not mapping to reference genome" that the authors are referring to are likely contamination (e.g. adapter) that is present in a fraction of the reads, especially if one has inserts smaller than the length of the read.

    There are many trimming programs (trimmomatic and bbduk.sh (part of BBMap), search for their threads here) are popular ones. If you have paired end data then you would want to use a trimming program that is "PE-aware" to keep the reads in sync in R1/R2 files.

    Comment


    • #3
      Thank you for the reply. I've tried trimmomatic before and have just tried to use it with the included adapters but for no result. FastQC still shows an error in kmer content and sequence duplication, although trimming the first 14 bases from each read fixed the GC content distribution. Do you have further advice?
      Last edited by antifolate; 08-03-2015, 09:09 AM.

      Comment


      • #4
        FastQC flagging a particular metric (with a warning or error) does not automatically indicate a problem with the data. FastQC uses internal ranges for various metrics (which are not universally applicable for all kinds on data but Simon had to start somewhere) and the warning flags are based on those values.

        The "14 base at beginning" GC content feature is observed with most RNAseq/Nextera datasets due to the "random primers not being so random" and "transposon showing some sequence bias for insertions". These sequences will align fine and generally do not require trimming.

        If you encountered an error with trimmomatic post the command line you used. BBduk.sh is more straightforward to use/understand. You may want to try that as well.

        Comment


        • #5
          Originally posted by antifolate View Post
          I tried using prinseq- and other trimming software- with guessed parameters but when I get to the assembly part, velvet produces empty files. I'm guessing that's because of my poor pre-processing.
          Velvet is fairly robust against low-quality input. If it produced empty output, then probably you were running it incorrectly, or it crashed, or you trimmed your input down to literally nothing, or your coverage is way too low. So -

          1) How big / what taxa is the organism?
          2) How much data do you have, and what kind?
          3) What were your command lines?
          4) How much data remained after processing?
          5) What was the console output of Velvet?

          Comment


          • #6
            I'm using fastsqc as a preliminary test. The actual criteria that tells me whether the filtering/trimming worked is whether the new sequences can be assembled with velvet. So far, velvet keeps producing empty files.

            Here is the last trimmomatic command I used:

            java -jar trimmomatic-0.33.jar PE SRR1145661.fastq SRR1145662.fastq F_paired.fg.gz F_unpaired.fg.gz R_paired.fq.gz R_unpaired.fq.gz ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 HEADCROP:14 TRAILING:4 AVGQUAL:20

            And the velvet commands:

            ./velveth FINAL 65 -shortPaired -separate -fastq.gz F_paired.fq.gz R_paired.fq.gz

            ./velvetg FINAL -exp_cov auto -cov_cutoff auto

            Could the problem be from velvet?

            Comment


            • #7
              Originally posted by antifolate View Post
              Could the problem be from velvet?
              It could be from anywhere. But without knowing the answers to questions 1, 2, 4, and 5, it would be very difficult to say.

              Comment


              • #8
                @antifolate: I am not sure what exactly you are trying to do.

                Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

                No wonder this is not working.

                Comment


                • #9
                  @Brian Bushnell
                  1) Saccharomyces cerevisiae
                  2) I only have two fastq files about 3 gigs each, one left and one right.
                  3)Prinseq commands I tried before:

                  perl prinseq-lite.pl -fastq SRR1145661.fastq -fastq2 SRR1145662.fastq -out_format 3 -qual_noscale -trim_left 15 -trim_right 4 -min_qual_score 20


                  4)Each file became ~0.5 gigs.
                  5)Output:

                  ./velveth FINAL 65 -shortPaired -separate -fastq.gz F_paired.fq.gz R_paired.fq.gz

                  [0.000000] Reading FastQ file /home/zhenguolin/Trimmomatic-0.33/F_paired.fq.gz;
                  [0.000058] Reading FastQ file /home/zhenguolin/Trimmomatic-0.33/R_paired.fq.gz;
                  [80.711071] 20786476 sequences found in total in the paired sequence files
                  [80.711089] Done
                  [80.711169] Reading read set file FINAL/Sequences;
                  [83.945761] 20786476 sequences found
                  [103.790458] Done
                  [103.790485] 20786476 sequences in total.
                  [103.790570] Writing into roadmap file FINAL/Roadmaps...
                  [112.211672] Inputting sequences...
                  [112.211699] Inputting sequence 0 / 20786476
                  [112.693165] Inputting sequence 1000000 / 20786476
                  [113.181203] Inputting sequence 2000000 / 20786476
                  [113.666087] Inputting sequence 3000000 / 20786476
                  [114.154794] Inputting sequence 4000000 / 20786476
                  [114.660421] Inputting sequence 5000000 / 20786476
                  [115.149249] Inputting sequence 6000000 / 20786476
                  [115.644338] Inputting sequence 7000000 / 20786476
                  [116.141600] Inputting sequence 8000000 / 20786476
                  [116.632185] Inputting sequence 9000000 / 20786476
                  [117.109180] Inputting sequence 10000000 / 20786476
                  [117.627538] Inputting sequence 11000000 / 20786476
                  [118.146591] Inputting sequence 12000000 / 20786476
                  [118.663219] Inputting sequence 13000000 / 20786476
                  [119.182846] Inputting sequence 14000000 / 20786476
                  [119.715876] Inputting sequence 15000000 / 20786476
                  [120.235799] Inputting sequence 16000000 / 20786476
                  [120.755655] Inputting sequence 17000000 / 20786476
                  [121.276797] Inputting sequence 18000000 / 20786476
                  [121.813735] Inputting sequence 19000000 / 20786476
                  [122.331596] Inputting sequence 20000000 / 20786476
                  [122.737475] === Sequences loaded in 11.848269 s
                  [122.887945] Done inputting sequences
                  [122.887971] Destroying splay table
                  [122.887988] Splay table destroyed

                  ./velvetg FINAL -exp_cov auto -cov_cutoff auto

                  [0.000000] Reading roadmap file FINAL/Roadmaps
                  [15.924738] 20786476 roadmaps read
                  [15.926398] Creating insertion markers
                  [16.370955] Ordering insertion markers
                  [16.794351] Counting preNodes
                  [17.030447] 0 preNodes counted, creating them now
                  [18.270607] Sequence 1000000 / 20786476
                  [19.500979] Sequence 2000000 / 20786476
                  [20.738540] Sequence 3000000 / 20786476
                  [21.972517] Sequence 4000000 / 20786476
                  [23.209002] Sequence 5000000 / 20786476
                  [24.446846] Sequence 6000000 / 20786476
                  [25.683393] Sequence 7000000 / 20786476
                  [26.919510] Sequence 8000000 / 20786476
                  [28.153803] Sequence 9000000 / 20786476
                  [29.389264] Sequence 10000000 / 20786476
                  [30.618777] Sequence 11000000 / 20786476
                  [31.849972] Sequence 12000000 / 20786476
                  [33.088403] Sequence 13000000 / 20786476
                  [34.318037] Sequence 14000000 / 20786476
                  [35.551025] Sequence 15000000 / 20786476
                  [36.783885] Sequence 16000000 / 20786476
                  [38.014411] Sequence 17000000 / 20786476
                  [39.241511] Sequence 18000000 / 20786476
                  [40.476229] Sequence 19000000 / 20786476
                  [41.715549] Sequence 20000000 / 20786476
                  [42.704830] Adjusting marker info...
                  [42.704849] Connecting preNodes
                  [42.719462] Connecting 1000000 / 20786476
                  [42.733764] Connecting 2000000 / 20786476
                  [42.748063] Connecting 3000000 / 20786476
                  [42.762352] Connecting 4000000 / 20786476
                  [42.776638] Connecting 5000000 / 20786476
                  [42.790994] Connecting 6000000 / 20786476
                  [42.805293] Connecting 7000000 / 20786476
                  [42.819559] Connecting 8000000 / 20786476
                  [42.833855] Connecting 9000000 / 20786476
                  [42.848158] Connecting 10000000 / 20786476
                  [42.862433] Connecting 11000000 / 20786476
                  [42.876719] Connecting 12000000 / 20786476
                  [42.891000] Connecting 13000000 / 20786476
                  [42.905284] Connecting 14000000 / 20786476
                  [42.919563] Connecting 15000000 / 20786476
                  [42.933853] Connecting 16000000 / 20786476
                  [42.948162] Connecting 17000000 / 20786476
                  [42.962456] Connecting 18000000 / 20786476
                  [42.976740] Connecting 19000000 / 20786476
                  [42.991028] Connecting 20000000 / 20786476
                  [43.002293] Cleaning up memory
                  [43.003039] Done creating preGraph
                  [43.003047] Concatenation...
                  [43.003050] Renumbering preNodes
                  [43.003052] Initial preNode count 0
                  [43.003088] Destroyed 0 preNodes
                  [43.003092] Concatenation over!
                  [43.003094] Clipping short tips off preGraph
                  [43.003097] Concatenation...
                  [43.003099] Renumbering preNodes
                  [43.003101] Initial preNode count 0
                  [43.003104] Destroyed 0 preNodes
                  [43.003108] Concatenation over!
                  [43.003111] 0 tips cut off
                  [43.003113] 0 nodes left
                  [43.022506] Writing into pregraph file FINAL/PreGraph...
                  [43.022778] Reading read set file FINAL/Sequences;
                  [46.057763] 20786476 sequences found
                  [66.791762] Done
                  [74.097462] Reading pre-graph file FINAL/PreGraph
                  [74.097519] Graph has 0 nodes and 20786476 sequences
                  [76.249032] Correcting graph with cutoff 0.200000
                  [76.249119] Determining eligible starting points
                  [76.249126] Done listing starting nodes
                  [76.249128] Initializing todo lists
                  [76.249130] Done with initilization
                  [76.249134] Activating arc lookup table
                  [76.249136] Done activating arc lookup table
                  [76.249139] Concatenation...
                  [76.249156] Renumbering nodes
                  [76.249159] Initial node count 0
                  [76.249163] Removed 0 null nodes
                  [76.249165] Concatenation over!
                  [76.249167] Clipping short tips off graph, drastic
                  [76.249170] Concatenation...
                  [76.249172] Renumbering nodes
                  [76.249173] Initial node count 0
                  [76.249176] Removed 0 null nodes
                  [76.249178] Concatenation over!
                  [76.249180] 0 nodes left
                  [76.249261] Writing into graph file FINAL/Graph2...
                  [76.249429] Measuring median coverage depth...
                  [76.249433] EMPTY GRAPH
                  [76.249439] Removing contigs with coverage < 0.000000...
                  [76.249443] Concatenation...
                  [76.249445] Renumbering nodes
                  [76.249447] Initial node count 0
                  [76.249449] Removed 0 null nodes
                  [76.249452] Concatenation over!
                  [76.249454] Concatenation...
                  [76.249455] Renumbering nodes
                  [76.249457] Initial node count 0
                  [76.249460] Removed 0 null nodes
                  [76.249462] Concatenation over!
                  [76.249466] Clipping short tips off graph, drastic
                  [76.249468] Concatenation...
                  [76.249470] Renumbering nodes
                  [76.249472] Initial node count 0
                  [76.249474] Removed 0 null nodes
                  [76.249476] Concatenation over!
                  [76.249478] 0 nodes left
                  [76.249480] WARNING: NO EXPECTED COVERAGE PROVIDED
                  [76.249482] Velvet will be unable to resolve any repeats
                  [76.249485] See manual for instructions on how to set the expected coverage parameter
                  [76.249487] Concatenation...
                  [76.249489] Renumbering nodes
                  [76.249491] Initial node count 0
                  [76.249493] Removed 0 null nodes
                  [76.249495] Concatenation over!
                  [76.249497] Removing reference contigs with coverage < 0.000000...
                  [76.249500] Concatenation...
                  [76.249502] Renumbering nodes
                  [76.249504] Initial node count 0
                  [76.249506] Removed 0 null nodes
                  [76.249508] Concatenation over!
                  [76.391317] Writing contigs into FINAL/contigs.fa...
                  [76.391372] Writing into stats file FINAL/stats.txt...
                  [76.391511] Writing into graph file FINAL/LastGraph...
                  [76.391556] Estimated Coverage = 0.000000
                  [76.391566] Estimated Coverage cutoff = 0.000000
                  [76.468917] EMPTY GRAPH
                  Final graph has 0 nodes and n50 of 0, max 0, total 0, using 0/20786476 reads

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    @antifolate: I am not sure what exactly you are trying to do.

                    Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

                    No wonder this is not working.
                    Now that would be just awesome . This is the first time I've ever used genbank to access data. Now that I think about, I totally guessed that the reads are paired-end. Would you mind taking a look at the files? http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

                    Thanks.

                    Comment


                    • #11
                      Originally posted by GenoMax View Post
                      @antifolate: I am not sure what exactly you are trying to do.

                      Based on the SRA acccession #'s in your trimmomatic command line, you are trying to use two "separate" yeast RNAseq samples in your trimming command. They seem to be single-end reads to boot.

                      No wonder this is not working.
                      This is an excellent point! But I think it would still fail even if the correct files were being used.

                      You're trimming down to possibly almost nothing (3GB -> 0.5GB, which could just be the headers). How long are the reads before and after? I'm guessing that after they are ending up shorter than your kmer length of 65, so that nothing can assemble... considering that Velvet said it used 0 reads.

                      Posting the fastQC report of before and after trimming may shed some light on this - it includes useful info like read lengths.

                      Comment


                      • #12
                        Originally posted by antifolate View Post
                        Now that would be just awesome . This is the first time I've ever used genbank to access data. Now that I think about, I totally guessed that the reads are paired-end. Would you mind taking a look at the files? http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

                        Thanks.
                        It looks like you have two independent single-end yeast RNAseq samples.

                        Code:
                        GSM1315085 	DY1
                        GSM1315086 	BY4741
                        The description appears to indicate that they were mapped using TopHat (which would be logical since this is RNAseq data).

                        So if you are trying to replicate the published analysis then you should

                        1. Trim the two samples independently (use trimmomatic SE or bbduk.sh).
                        2. Align using TopHat (or BBmap). Former if you are exactly trying to replicate the analysis in the paper.
                        3. Followed by cufflinks analysis.

                        No assembly needed (unless you want to try and assemble the data for other reasons).

                        Comment


                        • #13
                          Here's the fastqc report. I didn't know how to post it on here so I uploaded it on dropbox



                          EDIT:

                          Thank you @GenoMax! I have been struggling with this for more than 2 weeks. The reason I want to assemble the genome is to follow the method in the paper (http://gbe.oxfordjournals.org/content/6/9/2516.long , under Genome Assembly). They used velvet then CLC Workbench. I was thinking maybe I could use bowtie instead of CLC. That's probably gonna have its own thread though

                          Thanks again everybody.
                          Last edited by antifolate; 08-03-2015, 10:17 AM.

                          Comment


                          • #14
                            Originally posted by antifolate View Post
                            The reason I want to assemble the genome is to follow the method in the paper (http://gbe.oxfordjournals.org/content/6/9/2516.long , under Genome Assembly). They used velvet then CLC Workbench. I was thinking maybe I could use bowtie instead of CLC. That's probably gonna have its own thread though
                            It looks like there are (should be) two independent datasets here.

                            1. RNASeq - http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE54433

                            Other is supposed to be for WGS (which is what was used for assembly). It is not clear where this is located right away. Will take a look.

                            BTW: bowtie is only for alignment so you will need to try other assemblers if you don't have access to CLC.

                            Comment


                            • #15
                              @antifolate: You are going to have to contact the authors about the WGS dataset. Even though it is supposed to have been submitted under PRJNA169002, it is not linked from there. It is also not under S. cerevisiae genomes (http://www.ncbi.nlm.nih.gov/genome/genomes/15)
                              Last edited by GenoMax; 08-03-2015, 11:15 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 10:49 AM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-25-2024, 11:49 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-24-2024, 08:47 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              62 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X