  • Several questions about our RNA-seq results

    Hello,

    I have several questions:

    1-) We did SE 50 bp sequencing on an Illumina platform. I am trying to analyze the data on the Galaxy server. After uploading the FASTQ files, I saw dots in some of the reads, and they follow a pattern: one set of reads has dots at the 33rd and 34th positions, while another set has them elsewhere but again at the same positions within that set. Reads containing dots seem to make up about 10% of all reads.

    After grooming, the dots didn't change, but I ran TopHat anyway. I am not sure whether I need to replace the dots with "N"s to be able to use those reads (do I?). If so, how should I do that? (I am not using Linux, so I would be happy with a solution that works in Galaxy.)

    2-) One of the FASTQ files gives an overrepresented sequence like this:
    Sequence Count Percentage Possible Source
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTAT 34465 0.11089473793609078 TruSeq Adapter, Index 8 (97% over 37bp)

    Do I need to remove those reads? I assume they won't map anyway, so they shouldn't be a problem.

    3-) I guess a per-base quality graph like the one below is good, and I don't need any trimming or quality cutoff?

    Last edited by sazz; 01-27-2013, 02:57 AM.

  • #2
    1) I haven't encountered dots in Illumina fastq files (only in SOLiD output) but I suppose you may need to convert them to N, yes.

    2) You are right, you probably don't need to remove them. (You would need to if you were doing de novo assembly).

    3) It is great.
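For point 1, the dot-to-N conversion could also be done outside Galaxy. The following is only a minimal sketch, assuming strict four-line FASTQ records; the function name `dots_to_n` is hypothetical and not part of Galaxy or TopHat. Only the sequence line is touched, because `.` can be a legitimate character in the quality line.

```python
def dots_to_n(fastq_lines):
    """Yield FASTQ lines with '.' replaced by 'N' in sequence lines only.

    Assumes strict four-line records (header, sequence, '+', quality).
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record is the sequence
            yield line.replace(".", "N")
        else:
            yield line  # header, separator, and quality stay untouched


# Small made-up record for illustration (not taken from the poster's data)
example = [
    "@read1",
    "ACGT..ACGT",
    "+",
    "IIII..IIII",
]
converted = list(dots_to_n(example))
print(converted[1])
```

For a real file, the same generator could be fed an open file handle and its output written line by line to a new file.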

    • #3
      Thanks for the answers kopi-o,

      I have converted the dots to "N" and am now aligning both versions to see whether there is any difference between their alignments.

      I have some other questions now. For SE alignments with TopHat, I can't get detailed statistics at the end (with flagstat). I can only see the number of mapped reads, and it reports 100% of the reads as mapped. But I also want to know how many reads were discarded, how many mapped uniquely, how many mapped twice, etc. (I used the default settings, which let a read align up to 20 times.) Is there a program that shows that? Also, why 20? Isn't that a little high?

      Additionally, in TopHat's default settings the max mismatch is 2. As I am doing expression analysis (comparing 2 samples), should I allow more than 2, or is that fine?

      I would be happy to hear any other suggestions about TopHat parameters.

      • #4
        I have found it can be more informative to use bam_stat.py from the RSeQC package or Picard's CollectAlignmentSummaryMetrics to get detailed information about the alignment statistics.
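If neither tool is at hand, a rough tally of unique versus multi-mapped reads can be sketched directly from SAM text (for example, the output of `samtools view -h`). This assumes the aligner writes the optional `NH:i:` tag (number of reported alignments for the read), which TopHat does as far as I know; the function name `tally_hits` is hypothetical. Note that if the BAM file contains only aligned reads, flagstat will naturally report 100% mapped, so discarded reads would have to be counted against the original FASTQ.

```python
from collections import Counter


def tally_hits(sam_lines):
    """Count distinct reads by their NH value (number of reported alignments)."""
    hits_per_read = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 4:  # flag bit 0x4 marks an unmapped read
            continue
        for tag in fields[11:]:  # optional tags start at column 12
            if tag.startswith("NH:i:"):
                hits_per_read[fields[0]] = int(tag[5:])
                break
    # Keys are NH values, values are the number of reads with that NH
    return Counter(hits_per_read.values())


# Made-up records: r1 maps once, r2 maps to two places
example = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
]
counts = tally_hits(example)
print(counts[1], counts[2])
```

The sketch keys on read name, so it suits single-end data like the SE runs discussed here; paired-end data would need mate-aware handling.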

        Why it is 20 is anyone's guess - you could always change it :-)

        I would think the max mismatch can be increased considerably if you have reads of length 100 or similar, but in fact I have never touched this parameter myself. I don't think it existed in many older versions of TopHat. The versions I have used had a max mismatch parameter for each segment (sub-read) but not for the whole read.
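One way to judge whether the mismatch limit matters for a given data set is to look at how the accepted alignments are distributed by edit distance. The sketch below assumes the alignments carry the optional `NM:i:` tag; note that NM is the edit distance to the reference and so counts indels as well as mismatches. The function name `edit_distance_histogram` is hypothetical.

```python
from collections import Counter


def edit_distance_histogram(sam_lines):
    """Count alignment records by their NM value (edit distance to the reference)."""
    histogram = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        # Optional tags start at column 12 of a SAM record
        for tag in line.rstrip("\n").split("\t")[11:]:
            if tag.startswith("NM:i:"):
                histogram[int(tag[5:])] += 1
                break
    return histogram


# Made-up records for illustration
example = [
    "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNM:i:0",
    "r2\t0\tchr1\t200\t50\t50M\t*\t0\t0\t*\t*\tNM:i:2",
    "r3\t0\tchr1\t300\t50\t50M\t*\t0\t0\t*\t*\tNM:i:2",
]
hist = edit_distance_histogram(example)
print(hist[0], hist[2])
```

If most alignments sit well below the limit, raising it is unlikely to change the expression comparison much; if many sit exactly at the limit, it may be worth testing a higher value on one sample.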
