
**Bukowski** · 01-22-2018, 01:58 AM

What was the goal of sequencing them in the first place? This might inform any potential answers...

**GenoMax** · 01-22-2018, 05:53 AM

If the aim is to call SNPs against a reference, then `bbmap.sh` followed by `callvariants.sh` from the BBMap suite is one path.

You could also use `tadpole.sh` to assemble the viral genomes de novo, followed by alignment to the reference.

In either case, scanning/trimming with `bbduk.sh` should be the first step, to remove any extraneous sequence present in the read data. `bbmerge.sh` can also be tried to see if the reads overlap (which may indicate short inserts).

All these tools have their own threads here, which can be looked up. Usage guides are also available in the "docs" directory of the BBMap software.
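
For readers who want the reference-based path above spelled out step by step, here is a minimal sketch in Python that only builds and prints the three command lines (trim, map, call variants); it does not run them. The file names, the `adapters.fa` path and the `bbtools_plan` helper are hypothetical, and the trimming parameters are the commonly quoted adapter-trimming settings, not values anyone in this thread recommended, so check them against the guides in the "docs" directory.

```python
import shlex


def bbtools_plan(r1, r2, reference, adapters, prefix):
    """Return the BBTools commands for one paired-end sample as argument lists.

    All paths are placeholders chosen for illustration (assumptions).
    """
    trimmed1 = f"{prefix}.trim_R1.fastq"
    trimmed2 = f"{prefix}.trim_R2.fastq"
    return [
        # Step 1: trim adapter sequence from the right end of the reads.
        ["bbduk.sh", f"in={r1}", f"in2={r2}",
         f"out={trimmed1}", f"out2={trimmed2}",
         f"ref={adapters}", "ktrim=r", "k=23", "mink=11", "hdist=1",
         "tpe", "tbo"],
        # Step 2: map the trimmed reads to the reference genome.
        ["bbmap.sh", f"ref={reference}", f"in={trimmed1}", f"in2={trimmed2}",
         f"out={prefix}.sam", "nodisk=t"],
        # Step 3: call variants; ploidy=1 is an assumption for a viral genome.
        ["callvariants.sh", f"in={prefix}.sam", f"ref={reference}",
         f"vcf={prefix}.vcf", "ploidy=1"],
    ]


def format_plan(plan):
    """Render each command as a single shell-quoted line."""
    return [shlex.join(cmd) for cmd in plan]


if __name__ == "__main__":
    plan = bbtools_plan("s1_R1.fastq", "s1_R2.fastq",
                        "flu_ref.fa", "adapters.fa", "s1")
    for number, line in enumerate(format_plan(plan), start=1):
        print(f"{number}. {line}")
```

Printing the commands first makes it easy to review them for one sample before looping over all 24.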

**musohail** · 01-22-2018, 08:53 PM

Thanks for the reply. We want to get full genome sequences of influenza viruses sampled from our region.
Thanks for your advice, it is useful. Could you lay it out step by step (1, 2, 3), and for each step name a tool and a command that we can use? Thanks again.

**pmiguel** · 01-23-2018, 01:47 PM

If you got 24 contig files, then aren't the sequences of your strains in those? How many contigs did you get per sample?

--
Phillip

**musohail** · 01-23-2018, 09:45 PM

Hi Phillip

Yes, I got 24 contig files, one for each sample. I think they will also contain my sequences.
I was also wondering whether I should work on the contigs or on R1.FASTQ and R2.FASTQ.
One colleague here suggested that I use the BWA aligner to align R1.FASTQ and R2.FASTQ to the reference.
I have no idea where to start.

**pmiguel** · 01-24-2018, 04:21 AM

The .fastq data would be the raw reads. "Contigs" implies that some program has been used to combine the reads. I would suggest you start with the contig files. They will probably be in a simple format like FASTA, so you can even just copy and paste the text in the file into a web BLAST query.

--
Phillip

**musohail** · 01-24-2018, 04:37 AM

Dear Phillip, thanks.
Which program should I use then? BWA?
I think our first step will be alignment.

**GenoMax** · 01-24-2018, 04:45 AM

If you really have contig files, then that signifies you have assembled data, i.e. the sequence data that came from the sequencer has been processed and assembled. Is that the case? Assuming the analysis was done right, you should not have to worry about your original fastq files. Save them for reference.

You can use the Influenza Virus Resource at NCBI to do BLAST searches against known strains of flu.

If your data is still raw, i.e. in fastq format, then you will need to do a lot more work. If you are very new at this, then I suggest that you take a look at the chapters in this WikiBook to get started.

**pmiguel** · 01-24-2018, 06:48 AM

I agree with GenoMax. If you have contig files, then it is likely that most of the work is done, and all you need to do is take your contig file and BLAST it against the resource that GenoMax links to.
But can you take a look at the contigs file and see if you can read it by eye? Ideally each one would contain about 13.6 thousand bases of sequence in 8 segments.

--
Phillip
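
For files too long to check by eye, the same sanity check can be scripted. This is a minimal sketch, assuming the contig files are plain FASTA: it counts the contigs, sums their bases and reports the longest one, so the result can be compared with the roughly 13.6 thousand bases in 8 segments mentioned above. The `min_len` cutoff of 500 bases is an arbitrary assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records, min_len=500):
    """Summarize contig count, total bases and longest contig.

    min_len is an assumed cutoff for "long enough to be a segment".
    """
    lengths = sorted((len(seq) for _, seq in records), reverse=True)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": lengths[0] if lengths else 0,
        "contigs_at_least_min_len": sum(1 for n in lengths if n >= min_len),
    }


if __name__ == "__main__":
    # Tiny in-memory example; in practice pass an open file handle instead.
    demo = [">a", "ACGT", "AC", ">b", "GG"]
    print(summarize(read_fasta(demo), min_len=5))
```

A file with thousands of contigs and no contig above a few hundred bases would point to a failed or fragmentary assembly rather than a finished genome.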

**musohail** · 01-24-2018, 11:17 PM

Hello folks,
As suggested by GenoMax and Phillip, I went to look at the contigs in my sequencing output.
They are in the MiSeqOutput > Data > Intensities > BaseCalls > Alignment and Alignment2 folders. Each of the Alignment and Alignment2 folders has 24 contig files that look like the excerpt copied below. They are not the 13.6 thousand bases of sequence in 8 segments that Phillip described.

>NODE_726_length_56_cov_25.964285
GCATACGAGATTCGCTTTAGTCTCGTGGGCGCGGAGATTTGTAGAAGAGACAGATCCCACAGTGTCTCTGTTTACACCACAAAAGG
>NODE_1383_length_73_cov_1.000000
AGAATGGGAGACCTTCCCTACCTCCAGAGCCGAAATGCTGGCTCTTATACCCCTCTCCGAGCCCAAGAGACTCAGGCGCAAATCGTATGCCGTCTTCTGCTTT
>NODE_2134_length_64_cov_1.000000
AGTGCACCAGTTGACTAGCTTAGTGACTCCACCTTGGACCCATGCAACGGTATTTCTCTTTTTTGCTTCTTGTATAGTTTTACTGCTCTATCCA
>NODE_2206_length_62_cov_1.000000
ATAGTTGGAGAAATTTCACCATTACCTCCTATTAAAGGACATACTTTTGAGGATGTCAAAACTGCACTTGGGGTCCTCATCGGAGGACTTGA
>NODE_2254_length_32_cov_151.625000
GTGGGCTCGGAGATGTGAATAAAAGACAGGATCAGTAGAAACAAGGGTGTTTTTTATCATTA
>NODE_2284_length_34_cov_1.117647
AGAAATGAGAAGTGGCGGGGACAATTTGTGCAGCAAATTTGGGGAAAAAAGGGGGTTATTTGAG
>NODE_2285_length_39_cov_1.025641
AAATTTGGGGAAAAAAGGGGGTTATTTGAGGCAAAAGGGCCAGATTGTAAGCGACAGAGAAAAGGTTTG
>NODE_2746_length_45_cov_1.066667
AGCGTAGACGCTTTATCCAAAATGCTCTAACTGGGAATGGGGACGCGAACAACATGGATCGAGCAGTTAAACTAT

**GenoMax** · 01-25-2018, 04:54 AM

While some of those fragments are flu virus, they are not of significant length, especially if your aim is to put together reasonably complete genomes.

You will almost certainly need to do the assembly outside the software available on the sequencer. If you use BaseSpace, then you could align to a standard flu genome to see what the coverage looks like in your data, which gives an idea of how complete any assemblies you try are going to be.

I suggest that you start looking at the WikiBooks links if you have not done this before.
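
To make "what the coverage looks like" concrete, here is a minimal sketch of how per-base depth and breadth of coverage could be computed once the reads are aligned. It assumes the alignments have already been reduced to 0-based, half-open `(start, end)` intervals on one reference segment; extracting those intervals from a SAM/BAM file is not shown, and the function names are illustrative only.

```python
def depth_profile(ref_len, intervals):
    """Per-base depth from 0-based, half-open (start, end) alignment intervals."""
    diff = [0] * (ref_len + 1)
    for start, end in intervals:
        # Clip intervals that run past either end of the reference.
        start = max(0, start)
        end = min(ref_len, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for i in range(ref_len):
        running += diff[i]
        depth.append(running)
    return depth


def breadth(depth, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d >= min_depth) / len(depth)


if __name__ == "__main__":
    profile = depth_profile(10, [(0, 5), (3, 8)])
    print(profile)
    print(breadth(profile), breadth(profile, min_depth=2))
```

A segment with low breadth at a modest depth threshold is unlikely to assemble completely, whichever assembler is used.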

**musohail** · 01-25-2018, 07:28 AM

OK, thanks. I will start from that book and return if I have questions. Goodbye.

**pmiguel** · 01-25-2018, 08:29 AM

Those look like contigs created by the program SPAdes. SPAdes is a very good de novo assembler that should have been able to easily assemble an influenza genome.

That said, the extremely short contig lengths and low k-mer coverages displayed in the headers suggest that these are just "junk" contigs. Maybe these are just the last 8 sequences in the contig file? You want to look at the first 8-20 contigs in these files.

What method was used to create the Illumina libraries that were sequenced in this MiSeq run?

--
Phillip
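
The header check described above can be automated. This is a minimal sketch that parses headers of the form `NODE_<id>_length_<n>_cov_<x>` and keeps only contigs above a length and coverage threshold, longest first. The thresholds are assumptions, not recommendations from this thread. Note also that the length field need not equal the number of bases: the first record pasted earlier is 86 bases long while its header says 56, so measuring the sequences themselves is a useful cross-check.

```python
import re

# Matches headers such as ">NODE_726_length_56_cov_25.964285".
HEADER = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(header):
    """Return node id, header length and coverage, or None if no match."""
    match = HEADER.match(header.lstrip(">"))
    if match is None:
        return None
    return {
        "node": int(match.group(1)),
        "length": int(match.group(2)),
        "cov": float(match.group(3)),
    }


def keep_contigs(headers, min_length=500, min_cov=5.0):
    """Filter out short or low-coverage contigs (thresholds are assumptions)."""
    kept = []
    for header in headers:
        info = parse_header(header)
        if info and info["length"] >= min_length and info["cov"] >= min_cov:
            kept.append(info)
    return sorted(kept, key=lambda item: item["length"], reverse=True)


if __name__ == "__main__":
    pasted = [
        ">NODE_726_length_56_cov_25.964285",
        ">NODE_1383_length_73_cov_1.000000",
        ">NODE_2254_length_32_cov_151.625000",
    ]
    print(keep_contigs(pasted))
```

With these thresholds none of the headers pasted in the thread survive, which is consistent with them being leftover fragments rather than genome segments.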

**Patrick Dekker** · 01-29-2018, 06:27 AM

I wrote a pipeline for (avian) influenza typing for the CLC Workbench (a commercial program).

Instead of doing de novo assembly, it maps the reads to a list of distinct subtypes. In the next step I extract the consensus sequence, and to confirm there are no weird artifacts I re-map the reads to the consensus.

This approach works very well, and one advantage is that you get full-length fragments including the repeats (which are always hard to assemble).
Another advantage is that, because it is based on an annotated reference, I can just transfer the annotation from the reference to the consensus (with an additional check that the CDS has a valid ORF).

In case of stalk deletions the reads won't map properly to the consensus, and then it will perform de novo assembly of the breakpoint.

This approach is relatively fast: one sample (10,000x coverage) takes less than 5 minutes on a laptop. It was used in the 2016/2017 avian flu outbreak in the Netherlands, and it typed a couple of hundred samples without major problems or manual intervention.

Unfortunately, I can't give away the pipeline (I wrote it for a customer).
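
To illustrate the "extract consensus" step in general terms, here is a toy majority-vote sketch. It is not the CLC pipeline described above, which is not public. It assumes reads have already been placed on the reference as ungapped `(start, sequence)` pairs, and it ignores indels, base qualities and stalk deletions; the `min_depth` cutoff is an assumption.

```python
from collections import Counter


def consensus(ref_len, placed_reads, min_depth=2):
    """Majority-vote consensus from ungapped reads placed on a reference.

    placed_reads: iterable of (start, sequence) with 0-based start positions.
    Positions covered by fewer than min_depth reads are reported as "N".
    """
    columns = [Counter() for _ in range(ref_len)]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len:
                columns[pos][base.upper()] += 1
    called = []
    for column in columns:
        depth = sum(column.values())
        if depth < min_depth:
            called.append("N")
        else:
            called.append(column.most_common(1)[0][0])
    return "".join(called)


if __name__ == "__main__":
    reads = [(0, "ACGT"), (0, "ACGA"), (1, "CGT"), (2, "GT")]
    print(consensus(6, reads))
```

Re-mapping the reads to a consensus built this way, as described in the post, is what would expose positions where the majority call is an artifact.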


Where to start for sequence analysis of 24 virus Illumina Miseq
