  • sklages
    replied
    yep ... why simply saying 'passed' if you can say 'not failed' ... keep things complicated ;-)

    SCNR,
    Sven


  • GenoMax
    replied
    CASAVA v.1.8.2 FASTQ files contain only reads that passed filtering (unless you run the analysis with the "--with-failed-reads" option, which then includes reads that would normally be filtered out).

    "N" here means the sequence is *not* filtered i.e. it is good quality.
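
    As a minimal sketch of how the filter flag could be checked programmatically: in CASAVA 1.8 headers, the part after the space is read:filter:control:index. The function and class names below are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class ReadFlags(NamedTuple):
    read_number: int      # 1 or 2 (member of a pair)
    is_filtered: bool     # True for "Y" (read failed the filter), False for "N"
    control_bits: int     # 0 when no control bits are set
    index_sequence: str   # may be empty, as in the header quoted in this thread


def parse_casava18_flags(header: str) -> ReadFlags:
    """Parse the part after the space in a CASAVA 1.8 FASTQ header line."""
    # The second whitespace-separated token holds read:filter:control:index.
    comment = header.strip().split()[1]
    read_number, filter_flag, control_bits, index_sequence = comment.split(":", 3)
    if filter_flag not in ("Y", "N"):
        raise ValueError(f"unexpected filter flag: {filter_flag!r}")
    return ReadFlags(int(read_number), filter_flag == "Y",
                     int(control_bits), index_sequence)


flags = parse_casava18_flags("@HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:")
print(flags.is_filtered)  # False: the read was not filtered out
```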


    Originally posted by kalyankpy
    Hi,

    We have a control run on a new HiSeq machine that was installed recently. The fastq files extracted with CASAVA 1.8.2 have a different format (pasted below):


    @HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:
    CGGTTTTTATTAAACATATAAACAATTCTTACAGATTGACATCGTACGAGC
    +
    ;@@DDD++<CD:2:A<<a@F:333<3AFAC9+1**1:C**11CE0?DGF

    The manual says that when sequences are filtered they will have "Y" in the header. However, all my sequences (100%) have "N". I ran FastQC on these sequences and it shows the quality to be EXCELLENT. I am also attaching a picture of the read quality. The presence of "N" worries me, and I want to know whether this is good sequence or bad. What does "N" actually mean here?
    Last edited by GenoMax; 11-23-2011, 09:36 AM. Reason: added info


  • kalyankpy
    replied
    CASAVA .bcl to fastq output

    Hi,

    We have a control run on a new HiSeq machine that was installed recently. The fastq files extracted with CASAVA 1.8.2 have a different format (pasted below):


    @HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:
    CGGTTTTTATTAAACATATAAACAATTCTTACAGATTGACATCGTACGAGC
    +
    ;@@DDD++<CD:2:A<<a@F:333<3AFAC9+1**1:C**11CE0?DGF

    The manual says that when sequences are filtered they will have "Y" in the header. However, all my sequences (100%) have "N". I ran FastQC on these sequences and it shows the quality to be EXCELLENT. I am also attaching a picture of the read quality. The presence of "N" worries me, and I want to know whether this is good sequence or bad. What does "N" actually mean here?


  • GenoMax
    replied
    Originally posted by kjaja
    I have a question related to using Galaxy. I tried to map one sample to the reference using BWA and it took a few hours! Is that normal?
    Yes. That is normal for Galaxy. Remember that you are sharing the site with tens of other users and jobs. Even if you do this locally on your own hardware, alignments for large genomes (human) will take on the order of two to three hours per sample.

    Originally posted by kjaja
    How do people go about processing many samples? Would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?
    Some people do not have access to local computer hardware infrastructure, so for them Galaxy (or Galaxy on the Amazon cloud) is a good (or the only) option.

    If you are comfortable with the command line and have access to local compute infrastructure, then you do not need public Galaxy. But if you still want to use Galaxy's easy web interface, consider setting up a local instance of Galaxy (http://wiki.g2.bx.psu.edu/) and using it that way.


  • swbarnes2
    replied
    Originally posted by kjaja
    Thank you all, that was helpful.

    I have a question related to using Galaxy. I tried to map one sample to the reference using BWA and it took a few hours! Is that normal? How do people go about processing many samples? Would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?
    Yes, a few hours to align tens or hundreds of millions of reads to a mammalian genome is normal. If you can use multiple processors (with the -t option in bwa), that'll speed things up.
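
    A rough sketch of what the -t suggestion looks like in practice, assuming a script that only assembles the command line. The file names and the helper function are hypothetical. It uses `bwa mem`, a subcommand newer than this thread; the older `bwa aln` accepts -t as well.

```python
import os


def bwa_mem_command(reference, fastq, threads=None):
    """Build (but do not run) a multi-threaded `bwa mem` command line."""
    if threads is None:
        # Default to every core the machine reports; fall back to 1.
        threads = os.cpu_count() or 1
    return ["bwa", "mem", "-t", str(threads), reference, fastq]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("ref.fa", "sample_R1.fastq.gz", threads=8)
print(" ".join(cmd))  # bwa mem -t 8 ref.fa sample_R1.fastq.gz
```

    The list can be handed to subprocess.run with stdout redirected to a file, since bwa mem writes its SAM output to standard output.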


  • kjaja
    replied
    Thank you all, that was helpful.

    I have a question related to using Galaxy. I tried to map one sample to the reference using BWA and it took a few hours! Is that normal? How do people go about processing many samples? Would Galaxy be the tool to use? Can we use command lines or scripts to process data using Galaxy?


  • swbarnes2
    replied
    Originally posted by kjaja
    Thanks, GenoMax, for the input.

    It looks like I will be getting the raw data, probably in “.bcl” format. Based on reading some papers, I can use CASAVA to convert it into “fastq” format and then use BWA to align against the reference. I have seen other papers use “Maq” or “ELAND”. Does anyone know the difference between BWA, Maq, and ELAND?
    In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

    thanks
    There are a number of aligners out there. ELAND is Illumina's. Maq is pretty old school. bwa uses the Burrows-Wheeler Transform, which has been the preferred approach for a while. Speed doesn't matter much if you are working on small genomes, like bacteria, but for anything larger than tens of megabases you will be better off with a Burrows-Wheeler aligner.

    You want your output to be in sam or bam format (bam is binary, compressed sam). This is becoming the standard: it's a file where every line is one read, with all the information about where and how well that read mapped.

    You are then going to want to do stuff to the bam files. SAMTools is one suite of programs that can help, as is the Broad's Genome Analysis Toolkit (GATK). SAMTools is a lot less complex. There are a few different tools for visualization, like Galaxy and IGV. BEDTools can be useful too. For exome capture, you probably want to align to the whole genome, then filter for just the reads that overlap your exons. BEDTools can do that.
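
    To make the "one read per line" layout and the exon filtering concrete, here is a minimal Python sketch. It is not how BEDTools is implemented; it only illustrates the coordinate logic. The read, the exon intervals, and the function names are made up for the example.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def read_interval(sam_line):
    """Return (chrom, start, end) of an aligned read, 0-based half-open."""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 3, 4 and 6 of a SAM line are RNAME, POS and CIGAR.
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    if rname == "*" or pos == 0 or cigar == "*":
        return None  # unmapped read
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar)
                  if op in REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based; BED-style intervals are 0-based
    return rname, start, start + ref_len


def overlaps_exon(sam_line, exons):
    """True if the read overlaps any (chrom, start, end) exon interval."""
    interval = read_interval(sam_line)
    if interval is None:
        return False
    chrom, start, end = interval
    return any(c == chrom and start < e and s < end for c, s, e in exons)


# Hypothetical read and exon coordinates, for illustration only.
sam = "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50
print(read_interval(sam))                        # ('chr1', 99, 149)
print(overlaps_exon(sam, [("chr1", 140, 200)]))  # True
print(overlaps_exon(sam, [("chr1", 149, 200)]))  # False
```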


  • GenoMax
    replied
    kjaja,

    I doubt you are going to get data in BCL format. You would need the Illumina pipeline software to process raw BCL data, and last I checked this software was not freely available. If you are doing this for only one experiment, you would not want to spend time installing CASAVA (assuming you got your hands on a copy).

    The bcl --> fastq conversion step is generally performed by the facility that provides your sequence. Depending on their policy, you can request that your sequences be aligned to your "reference" genome using ELAND, Illumina's short-read alignment tool. The most commonly used aligners are bwa, bowtie, and SOAP (this site has a long list of software for NGS data analysis: http://seqanswers.com/wiki/Software/list).

    Galaxy has tutorials available at the links below for RNA-seq analysis:

    Galaxy is a community-driven web-based analysis platform for life science research.


    They also have video tutorials ("live quickies") on the main page of Galaxy (http://main.g2.bx.psu.edu/) to get you started.

    Originally posted by kjaja
    Thanks, GenoMax, for the input.

    It looks like I will be getting the raw data, probably in “.bcl” format. Based on reading some papers, I can use CASAVA to convert it into “fastq” format and then use BWA to align against the reference. I have seen other papers use “Maq” or “ELAND”. Does anyone know the difference between BWA, Maq, and ELAND?
    In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

    thanks


  • kjaja
    replied
    Thanks, GenoMax, for the input.

    It looks like I will be getting the raw data, probably in “.bcl” format. Based on reading some papers, I can use CASAVA to convert it into “fastq” format and then use BWA to align against the reference. I have seen other papers use “Maq” or “ELAND”. Does anyone know the difference between BWA, Maq, and ELAND?
    In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

    thanks


  • GenoMax
    replied
    The output of the sequencer will be fastq files. If the facility you are getting these from uses the new version (v.1.8) of the Illumina pipeline, each sample may have multiple gzip-compressed files that you will need to merge (or analyze in parallel and then merge the results). The quality values in the fastq files will be in the "sanger" format (http://en.wikipedia.org/wiki/FASTQ_format). The files will be ready for analysis (starting with some QC).
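
    A small sketch of both points, with hypothetical function names. Gzip files can be merged by plain byte concatenation, because the gzip format allows several compressed members in one file. "Sanger" quality characters decode as the ASCII code minus 33.

```python
import gzip
import shutil


def merge_gzipped_fastq(parts, merged_path):
    """Concatenate gzip files byte-for-byte into one multi-member gzip file."""
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def phred33_scores(quality_line):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


print(phred33_scores("!5I"))  # [0, 20, 40]
```

    The merged file can be read back with gzip.open like any other gzip file; the parts should be listed in a consistent order for both reads of a pair.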

    Are you planning to analyze the data using local computing infrastructure or with an online tool like Galaxy?


  • kjaja
    started a topic processing the outputs of illumina hi seq

    processing the outputs of illumina hi seq

    Hi,

    We will be using an Illumina HiSeq 2000 to sequence exomes. I have not received the data yet, and I am looking to put together a plan for the analysis steps.

    Does anyone know what type of files I will be starting with (the output from the Illumina sequencer)? Will they be in "fastq" format? Is there an outline of how to process the files up to the analysis stage?

    thanks
