  • Where to start with sequence analysis of 24 influenza virus samples (Illumina MiSeq)

    Hello Folks

    We ran 24 influenza virus samples on our MiSeq and obtained 48 FASTQ files (R1 and R2 for each sample). We also got 24 contig files.
    We did not run any standard alongside them.
    How should we proceed with the analysis? Which software should we use? Should we combine R1 and R2 first, or can we align R1 and R2 to a reference from the beginning? I have no background in bioinformatics, so a detailed reply for a beginner would be appreciated.
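Whichever tool ends up being used, the 48 FASTQ files first have to be grouped into 24 R1/R2 pairs. A minimal Python sketch of that bookkeeping follows. It assumes Illumina-style file names such as `Flu01_S1_L001_R1_001.fastq.gz`; the sample names are made up for illustration and the pattern may need adjusting to the real file names.

```python
import re
from collections import defaultdict

# Assumed naming: <sample>_R1_001.fastq.gz / <sample>_R2_001.fastq.gz
READ_TAG = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])(_\d+)?\.fastq(\.gz)?$")


def pair_fastq(filenames):
    """Group FASTQ file names into {sample: (R1, R2)} pairs."""
    found = defaultdict(dict)
    for name in filenames:
        match = READ_TAG.match(name)
        if match is None:
            continue  # skip anything that does not look like a read file
        found[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(found.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing R1 or R2")
        pairs[sample] = (reads["1"], reads["2"])
    return pairs


# Hypothetical file names, two samples only.
files = [
    "Flu01_S1_L001_R1_001.fastq.gz",
    "Flu01_S1_L001_R2_001.fastq.gz",
    "Flu02_S2_L001_R1_001.fastq.gz",
    "Flu02_S2_L001_R2_001.fastq.gz",
]
for sample, (r1, r2) in pair_fastq(files).items():
    print(sample, r1, r2)
```

Raising an error on an incomplete pair is a deliberate choice here: a paired-end mapper given only one mate would otherwise run without complaint on half the data.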

  • #2
    What was the goal of sequencing them in the first place? This might inform any potential answers...



    • #3
      If the aim is to call SNPs against a reference, then running `bbmap.sh` followed by `callvariants.sh` from the BBMap suite is one path.

      You could also use `tadpole.sh` to assemble the viral genomes de novo, followed by alignment to the reference.

      In either case, scanning/trimming with `bbduk.sh` should be the first step, to remove any extraneous sequence present in the read data. `bbmerge.sh` can also be tried to see whether the reads overlap (which may indicate short inserts).

      All these tools have their own threads here, which can be looked up. Usage guides are also available in the "docs" directory of the BBMap software.
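The trim, map, call order described above could be written down per sample roughly as follows. This is only a sketch in Python that builds the command lines and prints them without running anything. The file names are hypothetical, and the parameter values (for example `ktrim=r k=23 mink=11 hdist=1 tpe tbo` and `ploidy=1`) are assumptions taken from commonly quoted BBTools usage, so they should be checked against the guides in the "docs" directory before use.

```python
import shlex


def bbtools_steps(sample, r1, r2, reference, adapters):
    """Return the three command lines for one sample, in the order to run them."""
    trim1 = f"{sample}_trimmed_R1.fastq.gz"
    trim2 = f"{sample}_trimmed_R2.fastq.gz"
    sam = f"{sample}_mapped.sam"
    vcf = f"{sample}_variants.vcf"
    return [
        # Step 1: trim adapter sequence from the raw read pair.
        ["bbduk.sh", f"in1={r1}", f"in2={r2}", f"out1={trim1}", f"out2={trim2}",
         f"ref={adapters}", "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo"],
        # Step 2: map the trimmed pair to the reference genome.
        ["bbmap.sh", f"ref={reference}", f"in1={trim1}", f"in2={trim2}",
         f"out={sam}", "nodisk"],
        # Step 3: call variants from the alignment (a virus is treated as haploid).
        ["callvariants.sh", f"in={sam}", f"ref={reference}", f"vcf={vcf}", "ploidy=1"],
    ]


# Hypothetical inputs for one sample.
steps = bbtools_steps(
    sample="flu01",
    r1="flu01_R1.fastq.gz",
    r2="flu01_R2.fastq.gz",
    reference="reference.fasta",
    adapters="adapters.fa",
)
for number, command in enumerate(steps, start=1):
    print(f"{number}. {shlex.join(command)}")
```

Each printed line can be pasted into a terminal once the paths are real; looping the function over all 24 samples gives the full set of commands.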



      • #4
        Thanks for the reply. We want to obtain full genome sequences of influenza viruses sampled from our region.
        Your advice is useful. Could you lay it out stepwise (1, 2, 3) and, for each step, name a tool and a command we can use? Thanks again.



        • #5
          If you got 24 contig files, then aren't the sequences of your strains in those? How many contigs did you get per sample?

          --
          Phillip



          • #6
            Hi Phillip,

            Yes, I got 24 contig files, one for each sample. I think they will also contain my sequences.
            I was also wondering whether I should work on the contigs or on the R1 and R2 FASTQ files.
            A colleague here suggested that I use the BWA aligner to align the R1 and R2 FASTQ files to the reference.
            I have no idea where to start.



            • #7
              The .fastq data would be the raw reads. "Contigs" implies that some program has been used to combine the reads. I would suggest you start with the contig files. They will probably be in a simple format like FASTA, so you can even just copy and paste the text in the file into a web BLAST query.

              --
              Phillip



              • #8
                Dear Phillip, thanks.
                Which program should I use, then? BWA?
                I think our first step will be alignment.



                • #9
                  If you really have contig files, that signifies you have assembled data, i.e. the sequence data that came from the sequencer has been processed and assembled. Is that the case? Assuming the analysis was done right, you should not have to worry about your original FASTQ files. Save them for reference.

                  You can use the Influenza Virus Resource at NCBI to run BLAST searches against known strains of flu.

                  If your data is still raw, i.e. in FASTQ format, then you will need to do a lot more work. If you are very new to this, I suggest you take a look at the chapters in this WikiBook to get started.
                  Last edited by GenoMax; 01-24-2018, 07:22 AM.



                  • #10
                    I agree with GenoMax. If you have contig files, then it is likely that most of the work is done and all you need to do is take each contig file and BLAST it against the resource that GenoMax links to.
                    But can you take a look at a contig file and see whether you can read it by eye? Ideally each one would contain about 13.6 thousand bases of sequence in 8 segments.
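If reading the file by eye is awkward, a few lines of Python can report the same numbers. This is a minimal sketch that assumes the contig file is plain FASTA; the 500-base cutoff for a "long" contig is an arbitrary illustration, not a standard.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise(lines, min_length=500):
    """Count contigs and bases so they can be compared with ~13,600 bases in 8 segments."""
    lengths = [len(sequence) for _, sequence in read_fasta(lines)]
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "long_contigs": sum(1 for n in lengths if n >= min_length),
    }


# Tiny in-memory example; a real file would be passed as an open file handle.
example = [">contig_a", "ACGTAC", "GT", ">contig_b", "GGCC"]
print(summarise(example, min_length=5))
```

For a real file, `summarise(open("contigs.fasta"))` would do the same thing, where `contigs.fasta` is a placeholder name. A result close to 8 long contigs totalling around 13.6 thousand bases would suggest the assembly is usable as it stands.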

                    --
                    Phillip



                    • #11
                      Hello Folks
                      As suggested by GenoMax and Phillip, I went to look at the contigs from my run.
                      They are in the MiSeqOutput > Data > Intensities > BaseCalls > Alignment and Alignment2 folders. Each of those folders has 24 contig files that look like the excerpt copied below. Contrary to what Phillip hoped for, they do not contain 13.6 thousand bases of sequence in 8 segments.

                      >NODE_726_length_56_cov_25.964285
                      GCATACGAGATTCGCTTTAGTCTCGTGGGCGCGGAGATTTGTAGAAGAGACAGATCCCACAGTGTCTCTGTTTACACCACAAAAGG
                      >NODE_1383_length_73_cov_1.000000
                      AGAATGGGAGACCTTCCCTACCTCCAGAGCCGAAATGCTGGCTCTTATACCCCTCTCCGAGCCCAAGAGACTCAGGCGCAAATCGTATGCCGTCTTCTGCTTT
                      >NODE_2134_length_64_cov_1.000000
                      AGTGCACCAGTTGACTAGCTTAGTGACTCCACCTTGGACCCATGCAACGGTATTTCTCTTTTTTGCTTCTTGTATAGTTTTACTGCTCTATCCA
                      >NODE_2206_length_62_cov_1.000000
                      ATAGTTGGAGAAATTTCACCATTACCTCCTATTAAAGGACATACTTTTGAGGATGTCAAAACTGCACTTGGGGTCCTCATCGGAGGACTTGA
                      >NODE_2254_length_32_cov_151.625000
                      GTGGGCTCGGAGATGTGAATAAAAGACAGGATCAGTAGAAACAAGGGTGTTTTTTATCATTA
                      >NODE_2284_length_34_cov_1.117647
                      AGAAATGAGAAGTGGCGGGGACAATTTGTGCAGCAAATTTGGGGAAAAAAGGGGGTTATTTGAG
                      >NODE_2285_length_39_cov_1.025641
                      AAATTTGGGGAAAAAAGGGGGTTATTTGAGGCAAAAGGGCCAGATTGTAAGCGACAGAGAAAAGGTTTG
                      >NODE_2746_length_45_cov_1.066667
                      AGCGTAGACGCTTTATCCAAAATGCTCTAACTGGGAATGGGGACGCGAACAACATGGATCGAGCAGTTAAACTAT



                      • #12
                        While some of those fragments are flu virus, they are not of significant length, especially if your aim is to put together reasonably complete genomes.

                        You will almost certainly need to do the assembly outside the software available on the sequencer. If you use BaseSpace, you could align the reads to a standard flu genome to see what the coverage looks like in your data. That gives an idea of how complete any assemblies you try are going to be.
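Once reads are aligned to a reference, "what the coverage looks like" can be reduced to two numbers per segment: mean depth and the fraction of positions covered well enough. The Python sketch below assumes a three-column, tab-separated depth table (segment name, position, depth) with one line for every reference position, which is the layout `samtools depth -a` writes. The segment names and the depth cutoff of 10 are made up for illustration.

```python
from collections import defaultdict


def coverage_summary(depth_lines, min_depth=10):
    """Summarise per-segment coverage from 'name<TAB>position<TAB>depth' lines."""
    positions = defaultdict(int)   # number of reference positions seen
    covered = defaultdict(int)     # positions with depth >= min_depth
    depth_sum = defaultdict(int)   # summed depth, for the mean
    for line in depth_lines:
        if not line.strip():
            continue
        name, _position, depth = line.rstrip("\n").split("\t")
        depth = int(depth)
        positions[name] += 1
        depth_sum[name] += depth
        if depth >= min_depth:
            covered[name] += 1
    return {
        name: {
            "mean_depth": depth_sum[name] / positions[name],
            "fraction_covered": covered[name] / positions[name],
        }
        for name in positions
    }


# Toy input: two positions on one segment, one position on another.
example = ["HA\t1\t20", "HA\t2\t0", "NA\t1\t15"]
for segment, stats in coverage_summary(example).items():
    print(segment, stats)
```

A segment whose covered fraction is well below 1 is unlikely to assemble into a full-length sequence, whichever assembler is tried.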

                        I suggest that you start with the WikiBooks links if you have not done this before.



                        • #13
                          OK, thanks. I will start from that book and come back if I have questions. Goodbye.



                          • #14
                            Originally posted by musohail (post #11)
                            Those look like contigs created by the program SPAdes. SPAdes is a very good de novo assembler that should have been able to easily assemble an influenza genome.

                            That said, the extremely short contig lengths and low k-mer coverages displayed in the headers suggest that these are just "junk" contigs. Maybe they are just the last 8 sequences in the contig file? You want to look at the first 8-20 contigs in these files.
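Pulling out the longest contigs, together with the coverage value from each header, can be done with a short script. The Python sketch below assumes headers of the form `NODE_<id>_length_<n>_cov_<x>`, as in the excerpt posted in this thread. In that excerpt the `length` value in each header is smaller than the sequence underneath it, so the sketch counts the bases directly instead of trusting the header.

```python
import re

HEADER = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def rank_contigs(fasta_text, top=20):
    """Return the `top` longest contigs with their header fields, longest first."""
    records = []
    for block in fasta_text.split(">")[1:]:
        lines = block.splitlines()
        if not lines:
            continue
        match = HEADER.match(lines[0].strip())
        if match is None:
            continue  # header does not follow the assumed pattern
        sequence = "".join(part.strip() for part in lines[1:])
        records.append({
            "node": int(match.group(1)),
            "header_length": int(match.group(2)),
            "coverage": float(match.group(3)),
            "bases": len(sequence),  # measured from the sequence itself
        })
    records.sort(key=lambda record: record["bases"], reverse=True)
    return records[:top]


# Two records copied from the excerpt earlier in the thread.
excerpt = """>NODE_2254_length_32_cov_151.625000
GTGGGCTCGGAGATGTGAATAAAAGACAGGATCAGTAGAAACAAGGGTGTTTTTTATCATTA
>NODE_2284_length_34_cov_1.117647
AGAAATGAGAAGTGGCGGGGACAATTTGTGCAGCAAATTTGGGGAAAAAAGGGGGTTATTTGAG
"""
for record in rank_contigs(excerpt):
    print(record)
```

Contigs that are both long and well covered are the ones worth checking with BLAST; short contigs with coverage near 1 can usually be set aside.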

                            What method was used to create the Illumina libraries that were sequenced in this MiSeq run?

                            --
                            Phillip



                            • #15
                              I wrote a pipeline for (avian) influenza typing for the CLC workbench (a commercial program).

                              Instead of doing de novo assembly, it maps the reads to a list of distinct subtypes. In the next step I extract the consensus sequence and, to confirm there are no weird artifacts, re-map the reads to the consensus.

                              This approach works very well, and one advantage is that you get full-length fragments including the repeats (which are always hard to assemble).
                              Another advantage is that, because it is based on an annotated reference, I can just transfer the annotation from the reference to the consensus (with an additional check that the CDS has a valid ORF).

                              In case of stalk deletions the reads won't map properly to the consensus, and the pipeline then performs de novo assembly of the breakpoint.
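The consensus step in a map-then-consensus approach can be illustrated with a simple majority vote per reference position. The Python sketch below is not the CLC pipeline described above, only a toy version of the idea. The input format (one string of observed bases per position) and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def call_consensus(columns, min_depth=10, min_fraction=0.5):
    """Call one base per reference position from the read bases covering it.

    `columns` holds one string per position, e.g. "AAAAG" means five reads
    covered that position. Positions with too little depth, or without a
    clear majority base, are reported as "N".
    """
    consensus = []
    for bases in columns:
        counts = Counter(base for base in bases.upper() if base in "ACGT")
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / depth > min_fraction else "N")
    return "".join(consensus)


# Four positions: clear majority, tie, too shallow, no reads at all.
print(call_consensus(["AAAA", "AAGG", "CC", ""], min_depth=3))
```

Re-mapping the reads to a consensus built this way, as described in the post, is a useful check: stretches that end up as `N`, or where reads no longer map cleanly, mark the places that need a closer look.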

                              This approach is relatively fast: one sample (10,000x coverage) takes less than 5 minutes on a laptop. It was used in the 2016/2017 avian flu outbreak in the Netherlands and typed a couple of hundred samples without major problems or manual intervention.

                              Unfortunately, I can't give away the pipeline (I wrote it for a customer).
