
  • What to do with Illumina 76bp data with no reference?

    Sorry for the length of the post, long-time bioinfo man, total noob for NGS.

    I am trying to help a lab that sequenced human fecal samples from individuals with two distinct phenotypes. They used a company that ran an Illumina GA to generate 76bp unpaired reads, and I have the data as 4 pools: 2 pools for phenotype A and 2 for phenotype B. Each pool is in fact a pool of RNA extracted from 4 individuals. The sequences are actually only 72bp long, as an individual-identifying tag was removed by the company.

    From what I can gather from communicating with the company, the lab jumped the gun and went straight to metabolome analyses without doing any of the prerequisite work, such as 16S profiling or paired-end "deep" sequencing. (They obtained 3-4 million sequences per pool in the current data files, after cleaning by the company.)

    I can match about 10% of the data to human RNAs (using UCSC known genes or the genome assemblies as a reference), but only 3-5% against the NCBI bacterial genomes. I have also downloaded tons of data from other meta-bacterial sequencing projects, and while I can get more hits (still < 10%), there is nothing in the way of annotation on those sequences, so these publicly available scaffolds have not helped matters.

    My goal is to do something with the data. At the very least, get a dendrogram showing that pools 1 and 2 are separable from pools 3 and 4. But since the majority of the data is just raw short sequences that I can neither align to a reference nor assemble, I do not really know what I am supposed to do with it.

    Can someone please point me in the right direction? I have read quite a bit of this forum today and there is a ton of info, maybe too much in terms of the different programs available, but my problem just does not appear to fit in any of the categories and does not seem to be solvable based on the descriptions of the 10-15 programs I have read about.

    Thanks in advance

  • #2
    8 controls and 8 cases, just like an early microarray experiment.
    Consult a statistician to assess how much power the experiment has.
    Then report the species of the bacterial genomes in the 3-5% of reads that aligned to NCBI and that the statistician lets you call significant.


    • #3
      I'm not sure if this would lead toward your goal, but you can BLAST all unmapped reads against the nt database, then use something like MEGAN to taxonomically evaluate the BLAST results. I use this approach for samples with unknown contents. It is computationally taxing to BLAST millions of queries against such a large database, but it works. This references the technique.
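As a rough first pass before loading everything into MEGAN, you can tally the top-scoring hit per read from BLAST's tabular output. A minimal Python sketch, assuming tabular format (`-outfmt 6`), whose 12th column is the bit score; the read and subject names below are made up:

```python
def best_hits(blast_lines):
    """Parse BLAST tabular (-outfmt 6) lines and keep the
    top-scoring subject per query read."""
    best = {}
    for line in blast_lines:
        f = line.rstrip("\n").split("\t")
        query, subject, bitscore = f[0], f[1], float(f[11])
        # keep only the highest-bit-score hit seen for this query
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

MEGAN does this properly (with lowest-common-ancestor assignment), so treat this only as a sanity check on the raw BLAST output.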


      • #4
        I am very interested in how many more human sequences the authors managed to remove by blastn against human, and in exactly what they mean by "human ambiguous sequences" in "a BLAST homology search against human genomic DNA and human ambiguous sequences extracted from the nt database".

        But I have yet to receive a reply.

        MEGAN looks interesting; I will explore that!


        • #5
          Thanks for the replies so far. I have set up a Mosaik run to pass each pool through human cDNA, then the human genome, and then the NCBI bacterial genomes, collecting and filtering the results. But I know I am going to end up with > 60% of the data in the nowhere bin. I guess I can quality-trim and then BLAST against nr, but deciphering that output sounds like a nightmare at those expect values.

          I will definitely need to look at MEGAN (thanks for the info)

          The first thing I did was to merge all the full-length (72mer) sequences into one database, stored as a hash with one instance per sequence, and count the number of times each sequence was found in each pool. What I would really like is a program that could use the quality information, allow mismatches, and generate a rough estimate of frequencies per pool, so that I could see whether there is a clear separation between pools 1 & 2 versus pools 3 & 4.

          If anyone knows of such a tool, or a way to use an alignment program such as Mosaik or something to accomplish this goal, your suggestions would be appreciated.


          • #6

            I'm curious, how many redundant sequences did you get? (And were they perfectly redundant?) Theoretically you get very few of these with Illumina sequencing, so I don't think the strategy you described, looking for similar reads without assembling, would be an effective one...

            MOSAIK is a good choice for what you're trying to do compared to maq.


            • #7
              The hashing program would check the hash; if the sequence was not already seen, it would reverse complement and recheck, and if still not seen, insert it. Out of the 4 × ~3 million reads, the highest count was only about 5. I actually thought that strange, but I am happy to hear that it is normal.
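For what it's worth, the hash-plus-reverse-complement counting described above can be sketched in a few lines of Python (the pool names and in-memory lists are illustrative; real data would stream from the FASTQ files):

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def count_canonical(reads_per_pool):
    """reads_per_pool: dict of pool name -> iterable of read strings.
    Returns dict of canonical sequence -> {pool: count}, where the
    canonical key is whichever orientation was seen first."""
    counts = {}
    for pool, reads in reads_per_pool.items():
        for seq in reads:
            if seq in counts:
                key = seq
            elif revcomp(seq) in counts:
                key = revcomp(seq)
            else:
                key = seq  # first sighting defines the canonical form
            counts.setdefault(key, {})
            counts[key][pool] = counts[key].get(pool, 0) + 1
    return counts
```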

              For Mosaik, should I make a reference out of all the pools combined and then align that same sequence set to itself? Would that be a valid way to find "equivalent" reads with sequencing errors? I have read the Mosaik docs (excellent by the way) but do not see any downstream application that I could use to make the next step (after sorting the alignment results). Any suggestions?


              • #8
                Do you have the total number of duplicate reads found, rather than the max number of duplicates for any one read? I'm curious now.

                Interesting idea with mapping reads to reads. That wouldn't take too long to run if you want to try it. A few thoughts. You have "long" reads, so I'd additionally look for reads that are identical but shifted a few bases in either direction. You could do this a few ways. One would be to use the full length reads in your artificial chromosomes, then map trimmed reads (shave just a few bases from either end) to those. Another would be to write your own script that searches pairwise through all the reads for longest common substrings with lengths of at least x with an edit distance of x-n. Basically the same thing. Finally, an assembler will do much the same thing. I believe the output format of Velvet is a FASTA file of contigs with a header that lists the number of reads used to make the contig--in essence a measure of how many related sequences you have.
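The "identical but shifted" idea above can be tested pairwise like this (an illustrative sketch only; for millions of reads you would bucket reads by a shared k-mer first rather than compare all pairs, and the short `min_overlap` in the test exists only to keep the example small):

```python
def shifted_match(a, b, max_shift=3, min_overlap=60):
    """Return the shift at which read b matches read a exactly
    (b offset right by `shift` bases relative to a), or None."""
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            x, y = a[shift:], b
        else:
            x, y = a, b[-shift:]
        # compare the overlapping region only
        n = min(len(x), len(y))
        if n >= min_overlap and x[:n] == y[:n]:
            return shift
    return None
```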

                I wouldn't underestimate using BLAST (dc-megablast) with the nr/nt database. It is well annotated; each sequence has a taxon id associated with it that you can use to parse your data. Handling data from GenBank is sometimes a pain, though. The tools with BLAST+ are useful (blastdbcmd allows you to make queries that, in conjunction with awk commands, will e.g. extract all sequences for a certain taxon id), though you can do the same things online.

                Hope that helps. I'm dealing with a similar issue, except I'll be handling this data on a regular basis. I've thought about this a lot, but don't necessarily have the best answers yet. Please let me know how this goes for you.

                One other thought. The reads I'm working with now are from libraries that were poorly prepared. The insert lengths were much shorter than the read length, resulting in sequencing into adapter sequences, which prevents mapping by standard algorithms. You might grep your reads for the first ~12 bases of the Illumina adapter sequence in case they are present, since it sounds like this company used somewhat of a modified/homebrew protocol.
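A quick scan along those lines, using the common AGATCGGAAGAGC read-through prefix as the probe (substitute whatever adapter your provider actually used; a position histogram also tells you roughly how short the inserts were):

```python
def adapter_hits(reads, probe="AGATCGGAAGAGC"):
    """Count reads containing the adapter probe, keyed by the
    position at which the probe starts within the read."""
    positions = {}
    for seq in reads:
        pos = seq.find(probe)
        if pos != -1:
            positions[pos] = positions.get(pos, 0) + 1
    return positions
```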


                • #9
                  Thanks zbjorn, you have given me lots to think about.

                  I have spent a lot of time on these adapters so far and can summarize that for you now.
                  The company gave me these sequences:
                  >5'-end adapter
                  >3'-end adapter
                  5'-CCTTGGCTGTCACTCACTGCGA-(dT25)-3'
                  (dT25 means the adapter is followed by 25 Ts)

                  We first got "raw" data, then a few days later "cleaned" reads. While I had the raw data, I wrote a program to remove adapters allowing mismatches; it removed them on either strand. When I got the cleaned data, I re-ran this program on it and was still masking/removing lots of sequence. I found many instances where the 5'-end adapter on the reverse-complement strand was exactly adjacent to an instance of the 3'-end adapter on the sense strand. I emailed them some examples and also asked why they do not remove instances with mismatches, since these seem to me to be obvious adapters with sequencing errors (when I searched the NCBI bacterial genomes with these adapter sequences allowing up to 5 mismatches, I could not find them). After a couple of days they replied and sent back a third set of files, claiming that their cleaning program is not very customizable.
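The masking program described above amounts to a sliding-window Hamming comparison against the adapter on both strands. A minimal sketch (the adapter string in the example is made up, and for 12 million reads you would want something faster than pure Python):

```python
def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def hamming(a, b):
    """Mismatch count between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def mask_adapter(read, adapter, max_mismatch=2):
    """Mask every window matching the adapter (either strand)
    within max_mismatch mismatches, replacing it with Ns."""
    k = len(adapter)
    probes = (adapter, revcomp(adapter))
    out = list(read)
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if any(hamming(window, p) <= max_mismatch for p in probes):
            out[i:i + k] = "N" * k
    return "".join(out)
```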

                  This experience has made me quite suspicious of the company. With my discovery of this forum, I have begun to become acquainted with the "tools of the trade", and right now I am more interested in tools like Mosaik that take the quality data into account than in my blast/blat pipeline, because even this third round of cleaned data still has lots of sequences with big stretches of 2s in the quality file. I now have the impression that what I was doing initially, treating identical sequences as identical without considering the quality info, was wrong.

                  I will read up on Velvet today as that is a program I have not yet seen.

                  Meanwhile, maybe you could take a look at this question too:

                  Thanks again.


                  • #10
                    Hi there,

                    I'm a little unclear what you mean about the 5' end adjacent to the 3'. Do you mean you had 5'end-3'end adapter heterodimers? If the libraries were size selected properly, I think these are not supposed to be present. Or there's some modification on the oligo(s) to make them not form. I forget.

                    Ideally you should have *no* adapter sequences in your reads. If you do, depending on your application you should trim them off. That's all I meant, as a possible reason why your reads aren't assembling or mapping as well as you'd like. It's also an indicator of whether or not your libraries were prepared well.

                    If I understand you correctly, you're removing entire reads based on the presence of mismatches in the adapter sequences you're seeing in your reads? I wouldn't use errors in the adapter sequence as an indicator of overall read quality. Read quality decreases with length, so if there are errors they are more likely to be at the 3' end of the read, which is where you'd read into the adapter. You can map your PhiX control (hopefully the company ran one) all the way out to 76 bp instead of just the default 25 to get an idea of how the flow cell performed on the whole as you get into deeper cycles.

                    This is in response to the question in the link you sent (moderators, this is a related question anyway so hope it's not an issue posting in this thread). I'm having troubles logging in to that branch of the site, and my reply is more discussion-like than definitive answer-like. (I'll reply to the post above in a bit.)
                    I have a similar question in with Illumina tech support right now (mine was more open-ended: do you guys evaluate reads somehow and filter on that metric?). I don't filter reads at all after running the Illumina pipeline (which does some of its own filtering), considering that the most commonly used alignment algorithms (maq being the pioneer) take the quality scores into consideration when mapping and generating the consensus, as you mentioned. This hasn't resulted in any noticeable problems for me, but I can't say whether or not filtering would improve your analysis. For sure, if you're writing your own algorithm that doesn't consider the quality scores of each base or each read, then filtering your input file would be wise. It may be the case that reads with an average score of 2 are effectively a single called base followed by 75 Ns, which algorithms should not do anything with anyway. (And yes, 40 is the max on the Illumina scale; -5 is the lowest.) I'd say your parameters are reasonable.
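If you do pre-filter for a quality-unaware algorithm, a mean-Phred cutoff is the simplest version. A sketch, assuming Sanger-scaled (Phred+33) quality strings; the cutoff of 10 is an arbitrary illustration:

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(records, min_mean_q=10, offset=33):
    """records: iterable of (sequence, quality) pairs.
    Keep reads whose mean Phred score meets the cutoff."""
    return [(s, q) for s, q in records
            if mean_quality(q, offset) >= min_mean_q]
```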

                    I forget the Mosaik workflow... check that you are processing the Illumina quality scores properly in your first step. I usually go Illumina-->FASTQ first. If the quality scores are still Illumina scale, that will result in problems. However, this sounds like another problem.
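The Illumina-to-Sanger rescaling mentioned above is just a character-offset shift for Illumina 1.3+ scores (Phred+64). Note that pre-1.3 Solexa scores, the ones that go down to -5, are on a log-odds scale and need a different formula, so this sketch assumes 1.3+ data:

```python
def illumina_to_sanger(qual):
    """Convert an Illumina 1.3+ quality string (Phred+64)
    to Sanger/FASTQ scale (Phred+33)."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)
```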

                    You might check with this company how they are "cleaning" the reads. Are they using something other than the Illumina pipeline? You can't use the Illumina pipeline in an iterative manner very easily... If they are using something other than the Illumina pipeline, I would be asking questions; that pipeline was well engineered. Then again, there are open-source alternatives that are purportedly better at base calling at least, but I'm not familiar with them at all.

                    -- it's a little late here; if I missed anything, don't hesitate to restate.


