
  • HiSeq2000 adapter question

    Hello all,

    I have a question regarding Illumina adapters. I'm a grad student coming into a previously uncompleted de novo assembly project. The sequencing was done commercially over a year ago, so I have limited access to the specifics of what was ordered. What I do know is that I have 24 million paired-end reads of 100 bp each. From perusing the sequencing company's website, it appears they use a HiSeq 2000, though I can't be 100% confident this is the machine used to generate my reads.

    My question concerns the adapters and whether or not trimming is necessary. My understanding is that sequencing of the DNA fragment begins after the adapter, so the only way to have actual adapter sequence in my reads is if the read length was greater than the fragment size and the read ran into the adapter on the opposite end. Is this a correct assumption? Is it valid for all Illumina machines (MiSeq, GA, etc.)?

    Basically I'm just trying to figure out if I need to do any adapter trimming before attempting assembly. The FastQC report provided with my reads does not show any overrepresented sequences, though I don't know if the company used a TruSeq kit, which I believe are the adapters FastQC looks for. Any help / insight would be greatly appreciated.
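
The read-through reasoning in the question above can be written out as a small calculation. This is only a sketch: the function name is made up for illustration, and `fragment_len` is taken to mean the insert length without adapters.

```python
def adapter_bases_in_read(fragment_len: int, read_len: int) -> int:
    """Bases at the 3' end of a read that lie beyond the insert.

    Sequencing starts at the first base of the insert, so anything past
    fragment_len comes from the adapter on the opposite end (or from
    whatever follows it).
    """
    return max(0, read_len - fragment_len)


# Illustrative insert sizes for 100 bp reads.
for fragment_len in (50, 100, 150):
    extra = adapter_bases_in_read(fragment_len, read_len=100)
    print(f"insert {fragment_len} bp -> {extra} non-insert bases in the read")
```

Under this model, only inserts shorter than the read length contribute adapter sequence, and always at the 3' end of the read.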

  • #2
    If the FastQC report does not show any overrepresented sequences, you do not need adapter trimming. In any case, your assumption is correct. If you sequence microRNAs, for example, you get adapter sequence in every read, because the inserts are shorter than the read length.

    Comment


    • #3
      Thanks for the reply, but would FastQC show overrepresented sequences even if a TruSeq kit wasn't used? I thought it only looked for certain adapter sequences, so if a kit was used whose adapters it didn't know, it wouldn't be able to identify them.

      Also, is there any way I can know for sure that my fragment size is bigger than 100 bp, to ensure no adapters were sequenced? How does one go about determining the fragment size without having been present during library preparation? Are there tools for this?

      Thanks for your help, sorry to be paranoid.

      Comment


      • #4
        Give Trimmomatic a try. It comes with the TruSeq adapters (there are 2 versions, so give both a try) and you can see if it actually removes any reads. If it does not, then you are probably good to go forward with the assembly.

        Are you not able to ask the sequence provider (or the original sample preparer) what kit (if any) was used?

        Comment


        • #5
          There are two ways to get an estimate of the average fragment length. The first is to ask the people who did the library prep. The second is to let software calculate a value. Some assemblers, Velvet for example, will calculate the average fragment length based on a preliminary assembly of the reads.

          I don't think FastQC will necessarily tell you whether you have adapter sequences in your reads or not. The over-represented sequences listed by FastQC are sequences from the first 50 bases at the 5' end of the reads. Most adapter sequences occur towards the 3' end of the reads, as you observed above, when you read into the adapter sequence because your fragment length was shorter than the read length.
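
One way to check the 3' ends directly, as a rough sketch rather than a replacement for a real trimmer: scan each read for the start of the adapter, including a partial copy at the very end of the read. The sequence `AGATCGGAAGAGC` is assumed here as the common start of TruSeq-style adapters; substitute the sequence for your kit. The function name and the `min_overlap` parameter are illustrative, and only exact matches are considered.

```python
# Assumed adapter start (common to TruSeq-style adapters); replace as needed.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def find_adapter_start(read, adapter=ADAPTER_PREFIX, min_overlap=5):
    """Return the index where the adapter begins in the read, or None.

    A full copy of the adapter may occur anywhere; a partial copy (at least
    min_overlap bases) is only accepted at the 3' end of the read.
    """
    idx = read.find(adapter)
    if idx != -1:
        return idx
    # Try the longest possible partial overlap first.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return len(read) - overlap
    return None


reads = [
    "ACGTACGTAC" + "AGATCGGAAGAGC" + "TTT",  # full adapter inside the read
    "ACGTACGTAC" + "AGATCGG",                # partial adapter at the 3' end
    "ACGTACGTACGT",                          # no adapter
]
hits = [find_adapter_start(r) for r in reads]
print(hits)
```

Counting how many reads return a position, and where those positions fall, gives a quick idea of how much read-through is present.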

          Comment


          • #6
            Thanks for all of the replies.

            So I used a tool that comes with the SGA assembler called preQC, which gives some useful information about the reads prior to assembly, including an estimated fragment size histogram. The peak in that graph was around 125 to 150 bp, which seems a little low, though still long enough for the reads not to run into the adapter sequences.

            I'm in the process of talking with the sequencing center to see what kit and insert size they used. In the meantime, I downloaded Trimmomatic as GenoMax suggested. When running that program you have to select the encoding for the quality scores, which I assume is phred-33, since the FastQC report says Sanger / Illumina 1.9. Is that correct?

            Also, what quality score cutoff is generally used? The manual suggests '3', which would be a 50% error probability by my calculations. Is that the norm, or should I be more stringent?
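
The 50% figure follows from the standard Phred definition, Q = -10 * log10(P). A minimal sketch of the conversion and of decoding a phred-33 quality string (the function names are illustrative):

```python
def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list:
    """Convert a phred-33 encoded quality string to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


for q in (3, 10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")

print(decode_phred33("!5I"))
```

A cutoff of 3 therefore only removes bases that are about as likely to be wrong as right.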

            Thanks again for the help.

            Comment


            • #7
              Yes, Illumina 1.9 uses -phred33.

              The quality cutoff of '3' is used to remove 'N' bases from the ends of the reads; you may want to set a higher cutoff to remove other low-quality bases.

              To deal with low-quality regions in the middle of the reads, you may want to try Trimmomatic's SLIDINGWINDOW step.
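
The idea behind a sliding-window quality trim can be sketched as follows. This illustrates the general technique only and is not Trimmomatic's actual implementation; the function name and the default `window` and `min_avg` values are assumptions for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut a read at the first window whose mean quality is below min_avg.

    seq is the base string and quals the matching list of Phred scores.
    Scanning runs from the 5' end; everything from the start of the first
    failing window onward is discarded.
    """
    n = len(quals)
    for start in range(0, n - window + 1):
        window_mean = sum(quals[start:start + window]) / window
        if window_mean < min_avg:
            return seq[:start], quals[:start]
    # No failing window (or read shorter than the window): keep as is.
    return seq, quals


seq = "ACGTACGTAC"
quals = [30] * 6 + [2] * 4  # quality collapses over the last four bases
print(sliding_window_trim(seq, quals))
```

Unlike a fixed-length trim, this keeps the full read when quality stays high and only shortens reads that actually degrade.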

              The peak at 125-150 bp in the fragment-length histogram suggests that most of your reads shouldn't contain adapter sequence, but any fragments shorter than 100 bp will give reads with some adapter sequence, and you may also get a few reads that are adapter dimers or polymers with no insert.

              Comment


              • #8
                So I am unable to get adapter sequences from the sequencing center. I believe they may have used a NEBNext kit, but I can't find those adapter sequences to have Trimmomatic screen for them. Since my coverage is relatively high, I'm thinking I will just use Trimmomatic to trim the 3' ends and ideally remove any adapter fragments. I'm wondering how many base pairs I should trim to ensure most of them are removed. How big is the average adapter? I know the actual sequence with PCR primers and all is much bigger, but I'm referring to what could actually be in my read, assuming a worst-case scenario of a very short ~50 bp fragment. The TruSeq adapter files included with Trimmomatic are all around 30 base pairs each. Assuming the worst case, where the reads go through the entire adapter, should I just trim 30 base pairs off the end? Is there anything on the other side of the adapters that could potentially be in my reads? My calculated coverage would only go down to about 45x if I did this, so since I can't find the actual sequences, this seems like my best option. Thoughts appreciated.

                Comment


                • #9
                  Is there a reason why you are so worried about the adapters? Just go ahead and try the assembly. Hopefully you have a good reference (or a close relative available) so you can look at the assembled sequences later on to see if there are unexplained sequences in there.

                  BTW: NEBNext adapters for Illumina are on page 5 of this manual, and the indexes are on page 8: https://www.neb.com/~/media/Catalog/...anualE7335.pdf

                  Comment


                  • #10
                    I've already attempted the assembly and it didn't come out very well. It's a fungal genome with an expected size of around 40 Mb. By my calculations we had around 60x coverage of very high quality reads. (I was thinking maybe it was too much coverage, which is another reason I'm thinking about blindly trimming the reads.) Anyway, the assembly came back with ~100k contigs and a very low N50, so not very good. I'm trying to make sure everything is as accurate as possible before retrying. We only have one library, and I know having multiple ones would help (preferably mate pair), but it is what it is.

                    So I really just want to know what would be a good starting point for how many base pairs to trim off the ends. Like I said, I think I have high enough coverage to do this; I'm just wondering where to start. Those NEBNext adapters are very long compared to the ones screened for in Trimmomatic, which are around 30 bp. I know that some of the adapter sequence is used to bind to the flow cell, but it still seems like a lot. Right now I'm leaning toward trimming around 30 base pairs off the 3' end.
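
The coverage figures in this thread follow from coverage = (number of reads × read length) / genome size. The sketch below assumes the 24 million figure from the first post is the total read count (the thread does not say whether it counts reads or pairs; this reading matches the ~60x quoted) and uses the expected 40 Mb genome size.

```python
def coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Expected average depth: total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


N_READS = 24_000_000      # assumed total number of reads
GENOME_SIZE = 40_000_000  # expected genome size from the thread (~40 Mb)

print(coverage(N_READS, 100, GENOME_SIZE))  # untrimmed 100 bp reads
print(coverage(N_READS, 70, GENOME_SIZE))   # after a blind 30 bp trim
```

With these assumptions the blind 30 bp trim leaves roughly 42x, in the same range as the ~45x estimated earlier in the thread.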

                    Comment


                    • #11
                      Just want to jump in here -- 99% of the time the adapter sequences are the same regardless of the library prep kit unless you are doing something fancy like amplicon sequencing where it requires custom sequencing oligos.

                      AFAIK, the main difference between TruSeq and NEBNext is that in TruSeq the index is added during the ligation step. PCR is then performed using primers complementary to the outer regions (the parts that bind the flow cell). In the NEBNext kit, you ligate short "stub adapters" like in the older Illumina Multiplexing kits, and the indexes are added during PCR with a very long primer. Either way, this results in exactly the same structure going onto the flow cell.

                      Regardless of what you do, I think you're not hurting anything by adapter trimming before you assemble. At the very worst you lose some reads, but you will cut out a lot of adapter sequence that would otherwise really hurt your assembly performance.

                      Comment


                      • #12
                        You are unlikely to get an answer that would be guaranteed to work since every data set has different characteristics. Go ahead and try the 30 bp trim and see if that improves things. You may also want to subsample the data and try the assembly with half the sequence.

                        You do not say what assembler you have tried. If it is just one so far then there are other programs you will have to try. De novo assembly is a hard problem, any which way you look at it.
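
The subsampling suggested above needs to keep read pairs in sync: whatever is kept from the forward file must also be kept from the reverse file. A minimal in-memory sketch, with a hypothetical function name and toy records standing in for FASTQ entries:

```python
import random


def subsample_pairs(reads1, reads2, fraction, seed=0):
    """Keep roughly `fraction` of the read pairs, preserving pairing.

    reads1 and reads2 are equal-length lists in which element i of each
    list belongs to the same pair. A fixed seed makes the draw repeatable.
    """
    if len(reads1) != len(reads2):
        raise ValueError("paired read lists must have the same length")
    rng = random.Random(seed)
    keep = [i for i in range(len(reads1)) if rng.random() < fraction]
    return [reads1[i] for i in keep], [reads2[i] for i in keep]


# Toy records; real data would be parsed from the two FASTQ files.
forward = [f"read{i}/1" for i in range(1000)]
reverse = [f"read{i}/2" for i in range(1000)]
sub_fwd, sub_rev = subsample_pairs(forward, reverse, fraction=0.5)
print(len(sub_fwd), len(sub_rev))
```

For data sets too large to hold in memory, the same per-pair decision can be made while streaming both files in lockstep.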

                        Comment
