Processing the outputs of Illumina HiSeq

sklages replied

11-23-2011, 09:44 AM
yep ... why simply say 'passed' if you can say 'not failed' ... keep things complicated ;-)

SCNR,
Sven
GenoMax replied

11-23-2011, 09:33 AM
CASAVA v.1.8.2 FASTQ files only contain reads that passed filtering (unless you run the analysis with the "--with-failed-reads" option, which then includes reads that would normally be filtered out).

"N" here means the sequence is *not* filtered i.e. it is good quality.
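
If you want to check the flag in a script, here is a minimal sketch. It assumes the CASAVA 1.8 header layout in which the part after the space reads read:is_filtered:control_number:index_sequence. The class and function names are made up for illustration and do not come from any Illumina tool.

```python
from typing import NamedTuple


class ReadComment(NamedTuple):
    read_number: int      # 1 or 2 for paired-end runs
    is_filtered: bool     # True means "Y": the read FAILED the filter
    control_number: int   # 0 when no control bits are set
    index_sequence: str   # may be empty, as in the header quoted in this thread


def parse_casava18_comment(header: str) -> ReadComment:
    """Parse the part after the space in a CASAVA 1.8 style FASTQ header."""
    _, _, comment = header.rstrip("\n").partition(" ")
    fields = comment.split(":")
    if len(fields) != 4 or fields[1] not in ("Y", "N"):
        raise ValueError(f"not a CASAVA 1.8 style header: {header!r}")
    read, flag, control, index = fields
    return ReadComment(int(read), flag == "Y", int(control), index)


example = parse_casava18_comment("@HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:")
print(example.is_filtered)  # False, i.e. the read passed the filter
```

Counting how many headers come back with is_filtered set to True is a quick way to confirm that a file contains only passing reads.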

Last edited by GenoMax; 11-23-2011, 09:36 AM. Reason: added info
kalyankpy replied

11-23-2011, 08:56 AM
CASAVA .bcl to fastq output

Hi,

We have a control run on a new HiSeq machine installed recently. The fastq files extracted from CASAVA 1.8.2 have a different format (pasted below):

@HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:
CGGTTTTTATTAAACATATAAACAATTCTTACAGATTGACATCGTACGAGC
+
;@@DDD++<CD:2:A<<a@F:333<3AFAC9+1**1:C**11CE0?DGF

The manual says that when sequences are filtered they will have "Y" in the header. However, all my sequences (100%) have "N". I have run FastQC on these sequences and it shows the quality to be EXCELLENT. I am also attaching pictures of the read quality. The presence of "N" worries me and I want to know if this is good sequence or bad! What does "N" actually mean here?
Attached Files

per_base_quality.png (8.5 KB, 64 views)

per_sequence_quality.png (19.5 KB, 36 views)
GenoMax replied

08-31-2011, 08:29 AM
Originally posted by kjaja View Post

I have a question related to using Galaxy. I have tried to map one sample to the reference using BWA and it took a few hours to do that!! Is that normal?

Yes. That is normal for Galaxy. Remember you are sharing the site with tens of other users and jobs. Even if you do this locally on your own hardware, it will take on the order of two or three hours per sample to do alignments for large genomes (human).

Originally posted by kjaja View Post

How do people go about processing many samples, would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?

Some do not have access to local computer hardware infrastructure, so for them Galaxy (or Galaxy on the Amazon cloud) is a good (or only) option.

If you are comfortable with the command line and have access to local compute infrastructure then you do not need public Galaxy. But if you still want to use the easy web interface of Galaxy then consider setting up a local instance of Galaxy (http://wiki.g2.bx.psu.edu/) and use it that way.
swbarnes2 replied

08-31-2011, 08:20 AM
Originally posted by kjaja View Post

Thank you all, that was helpful.

I have a question related to using Galaxy. I have tried to map one sample to the reference using BWA and it took a few hours to do that!! Is that normal? How do people go about processing many samples, would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?

Yes, a few hours to align tens or hundreds of millions of reads to a mammalian genome is normal. If you can use multiple processors (with the -t option in bwa), that'll speed things up.
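
To make the -t suggestion concrete, here is a minimal sketch in Python that only builds the command line and does not run anything. It assumes the "bwa aln" mode and a reference that has already been indexed with "bwa index". The file names and the thread count are placeholders, and the helper name is hypothetical.

```python
import shlex


def bwa_aln_command(reference: str, fastq: str, threads: int) -> list:
    """Build (but do not run) a bwa aln command that uses several threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # -t sets the number of threads bwa aln will use
    return ["bwa", "aln", "-t", str(threads), reference, fastq]


cmd = bwa_aln_command("reference.fa", "sample_R1.fastq", threads=8)
print(shlex.join(cmd))
```

When actually running it, remember that bwa aln writes its result to standard output, so redirect that to a file, for example by passing an open file handle as stdout to subprocess.run.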
kjaja replied

08-31-2011, 06:37 AM
Thank you all, that was helpful.

I have a question related to using Galaxy. I have tried to map one sample to the reference using BWA and it took a few hours to do that!! Is that normal? How do people go about processing many samples, would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?
swbarnes2 replied

08-29-2011, 10:30 AM
Originally posted by kjaja View Post

thanks GenoMax for the input

It looks like I will be getting the raw data, probably in “.bcl” format. Based on reading some papers, I can use CASAVA to convert it into “fastq” format and then use BWA to align against the reference. I have seen other papers use “Maq” or “ELAND”. Does anyone know the difference between BWA, Maq and ELAND?
In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

thanks

There are a number of aligners out there. ELAND is Illumina's. Maq is pretty old school. bwa uses the Burrows-Wheeler Transform, which has been the preferred approach for a while. Speed doesn't matter if you are working on small genomes, like bacteria, but for anything larger than tens of megabases you will be better off with a BWT-based aligner.

You want your output to be in SAM or BAM format (BAM is binary, compressed SAM). This is becoming the standard: it's a file where every line is one read, with all the information about where and how well that read mapped.

You are then going to want to do stuff to the BAM files. SAMtools is one suite of programs that can help, as is the Broad's Genome Analysis Toolkit (GATK). SAMtools is a lot less complex. There are a few different tools for visualization, like Galaxy and IGV. BEDTools can be useful too. For exome capture, you probably want to align to the whole genome, then filter for just the reads that overlap your exons. BEDTools can do that.
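
To illustrate the "one line per read" layout and the exon filter, here is a rough sketch in plain Python. It is not how SAMtools or BEDTools work internally. The SAM line and the exon coordinates are invented and the helper names are hypothetical. It assumes exon intervals in the BED convention (0-based start, end exclusive), while SAM positions are 1-based.

```python
import re

# CIGAR operations that consume bases on the reference
REF_CONSUMING = set("MDN=X")


def read_span(sam_line: str):
    """Return (chrom, start, end) for an aligned read, 0-based half-open.

    Returns None for unmapped reads.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    if flag & 0x4 or cigar == "*":  # 0x4 is the "read unmapped" flag bit
        return None
    length = sum(
        int(count)
        for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    start = pos - 1  # convert the 1-based SAM position to 0-based
    return chrom, start, start + length


def overlaps_exon(span, exons) -> bool:
    """True if the read span shares at least one base with any exon."""
    chrom, start, end = span
    return any(
        c == chrom and start < e_end and e_start < end
        for c, e_start, e_end in exons
    )


exons = [("chr1", 1000, 1200)]  # made-up interval for illustration
line = "read1\t0\tchr1\t1191\t60\t40M2D10M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50
span = read_span(line)
print(span, overlaps_exon(span, exons))
```

On real data you would not loop over reads like this. The point is only to show which columns of a SAM line decide whether a read falls on an exon.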
GenoMax replied

08-29-2011, 09:42 AM
kjaja,

I doubt you are going to get data in the BCL format. You will need the Illumina pipeline software to process the raw data in BCL format. Last I checked this software was not freely available. If you were doing this only for one experiment then you would not want to spend time on installing CASAVA (assuming you got your hands on a copy).

The bcl --> fastq conversion step is generally performed by the facility where you get your sequence from. Depending on their policy, you can request that your sequences be aligned to your "reference" genome using ELAND, Illumina's short-read alignment tool. The most commonly used aligners are bwa, bowtie and SOAP (this site has a long list of software for NGS data analysis: http://seqanswers.com/wiki/Software/list).

Galaxy has tutorials available at the links below for RNA-seq analysis:

http://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

http://usegalaxy.org/u/jeremy/p/transcriptome-analysis-faq

They also have video tutorials ("live quickies") on the main page of Galaxy (http://main.g2.bx.psu.edu/) to get you started.

kjaja replied

08-29-2011, 09:20 AM
thanks GenoMax for the input

It looks like I will be getting the raw data, probably in “.bcl” format. Based on reading some papers, I can use CASAVA to convert it into “fastq” format and then use BWA to align against the reference. I have seen other papers use “Maq” or “ELAND”. Does anyone know the difference between BWA, Maq and ELAND?
In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

thanks
GenoMax replied

08-29-2011, 07:09 AM
The output of the sequencer will be fastq files. If the facility where you are getting these from uses the new version (v.1.8) of the Illumina pipeline, each sample may have multiple gzip-compressed files that you will need to merge (or analyze in parallel and then merge). The quality values in the fastq files will be in the "sanger" format (http://en.wikipedia.org/wiki/FASTQ_format). The files are going to be ready for analysis (starting with some QC).

Are you planning to analyze the data using local computing infrastructure or with an online tool like Galaxy?
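
In case it helps with planning, here is a small Python sketch of the two points above: merging the per-sample gzip parts and reading "sanger" (Phred+33) quality values. The function names are hypothetical and any file names you pass in are your own. It relies on the fact that gzip files concatenated byte-for-byte still form a valid gzip file.

```python
import shutil


def merge_gzip_parts(parts, merged):
    """Concatenate gzip files byte-for-byte into one valid gzip file.

    `parts` should already be in the order you want (e.g. sorted by name).
    """
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, out)


def phred33_scores(quality: str):
    """Convert a Sanger (Phred+33) quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


print(phred33_scores(";@@DDD"))  # [26, 31, 31, 35, 35, 35]
```

Merge the read 1 parts and the read 2 parts separately and in the same order, otherwise the pairs will no longer line up.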
kjaja started a topic Processing the outputs of Illumina HiSeq

08-29-2011, 06:59 AM
Processing the outputs of Illumina HiSeq

Hi,

We will be using an Illumina HiSeq 2000 to sequence exomes. I have not received the data yet, and I am looking to put a plan together on the steps for analysis.

Does anyone know what type of files I will be starting with (the output from the Illumina sequencer)? Would it be in "fastq" format? Is there an outline on how to process the files up to the analysis stage?

thanks