  • Trim the paired-end data

    I am a beginner at bioinformatics, so please forgive me if I ask silly questions.

    I am trying to align some paired-end Illumina data. I used FASTX-Toolkit to trim my data and then tried to align it with bowtie. After trimming, the number of reads in the Read 1 file differs from the number in the Read 2 file, so bowtie cannot find any matches. If I use bowtie 2, it gives the error message "Error, fewer reads in file specified with -2 than in file specified with -1".
    I guess I have the following options to solve this problem:
    1. Skip trimming and align the raw data files with bowtie. (I tried this and it worked, because the raw Read 1 and Read 2 files have the same number of reads. I got around a 70% alignment rate, which is not that satisfactory.)
    2. Use some software to match up the Read 1 and Read 2 files after trimming. Can anyone suggest software for this?
    3. Maybe there is a better method for aligning this kind of paired-end data?
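
For option 2, one way to re-synchronise two independently trimmed files is to match records by read ID. The following is a minimal Python sketch, assuming unwrapped four-line FASTQ records and read names that are either identical in both files or differ only by a trailing /1 or /2. The function names are made up for illustration; this is not how any particular tool does it.

```python
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, ...]  # header, sequence, plus line, qualities


def parse_fastq(text: str) -> Iterator[Record]:
    """Yield 4-line FASTQ records from a string (assumes no line wrapping)."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield tuple(lines[i:i + 4])


def read_id(header: str) -> str:
    """Drop the leading '@', any comment after whitespace, and a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def repair(r1_text: str, r2_text: str):
    """Split two FASTQ strings into synchronised pairs and orphaned reads."""
    # Index every Read 2 record by its ID (this keeps all of Read 2 in memory).
    r2_by_id: Dict[str, Record] = {
        read_id(rec[0]): rec for rec in parse_fastq(r2_text)
    }
    paired_r1: List[Record] = []
    paired_r2: List[Record] = []
    orphans_r1: List[Record] = []
    for rec in parse_fastq(r1_text):
        mate = r2_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            orphans_r1.append(rec)      # mate was lost during trimming
        else:
            paired_r1.append(rec)       # same index in both paired lists
            paired_r2.append(mate)
    orphans_r2 = list(r2_by_id.values())  # Read 2 records never claimed
    return paired_r1, paired_r2, orphans_r1, orphans_r2
```

Because the Read 2 records are all held in memory, this only suits modest file sizes; it is meant to show the idea rather than to replace a dedicated tool.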

  • #2
    If you use trimmomatic to trim your paired-end data, it will give you
    separate files with the reads that end up unpaired after trimming.


    • #3
      Originally posted by mastal
      If you use trimmomatic to trim your paired-end data, it will give you
      separate files with the reads that end up unpaired after trimming.

      http://www.usadellab.org/cms/?page=trimmomatic
      So you mean I can use Trimmomatic to trim my paired-end Read 1 and Read 2 files, and I will then get two trimmed files that are not paired, so that I can run the bowtie alignment on each of them?

      By the way, Trimmomatic can also output paired files after trimming. Is the number of reads in the paired output files the same?
      Thank you.
      Last edited by zhoujiayi; 09-16-2013, 04:50 AM.


      • #4
        trimmomatic will give you 4 output files, in fastq format:
        R1_paired, R1_unpaired, R2_paired, and R2_unpaired.

        R1_paired and R2_paired will have the same number of reads,
        in the same order, just like the untrimmed Illumina data, except that
        the reads where R1 or R2 was removed by the trimming process will be removed from both files.

        Bowtie doesn't do mixtures of paired and unpaired reads, so you will
        have to run the R1_paired, R2_paired as one run, and the unpaired files as a separate run.

        Hope this makes sense.
        Maria
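
The workflow described above could be sketched roughly as follows. This Python snippet only builds the command lines and does not run anything. It assumes Trimmomatic's PE mode and Bowtie 2's `-x`, `-1`, `-2`, `-U` and `-S` options; the jar path, index name, file names and the two trimming steps are placeholders to adapt, not recommendations.

```python
from typing import Dict, List


def trimmomatic_pe_command(r1: str, r2: str, prefix: str,
                           jar: str = "trimmomatic.jar") -> List[str]:
    """Build a Trimmomatic PE command: two inputs, then four outputs."""
    return [
        "java", "-jar", jar, "PE", r1, r2,
        f"{prefix}_R1_paired.fastq", f"{prefix}_R1_unpaired.fastq",
        f"{prefix}_R2_paired.fastq", f"{prefix}_R2_unpaired.fastq",
        # Example trimming steps only; choose these based on your own QC.
        "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]


def bowtie2_commands(index: str, prefix: str) -> Dict[str, List[str]]:
    """Build one paired-end run and one separate run for the unpaired reads."""
    paired = [
        "bowtie2", "-x", index,
        "-1", f"{prefix}_R1_paired.fastq",
        "-2", f"{prefix}_R2_paired.fastq",
        "-S", f"{prefix}_paired.sam",
    ]
    unpaired = [
        "bowtie2", "-x", index,
        # -U takes a comma-separated list of single-end read files.
        "-U", f"{prefix}_R1_unpaired.fastq,{prefix}_R2_unpaired.fastq",
        "-S", f"{prefix}_unpaired.sam",
    ]
    return {"paired": paired, "unpaired": unpaired}


if __name__ == "__main__":
    print(" ".join(trimmomatic_pe_command("R1.fastq", "R2.fastq", "sample")))
    for cmd in bowtie2_commands("genome_index", "sample").values():
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` once the paths are real.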


        • #5
          Originally posted by mastal
          trimmomatic will give you 4 output files, in fastq format:
          R1_paired, R1_unpaired, R2_paired, and R2_unpaired.

          R1_paired and R2_paired will have the same number of reads,
          in the same order, just like the untrimmed Illumina data, except that
          the reads where R1 or R2 was removed by the trimming process will be removed from both files.

          Bowtie doesn't do mixtures of paired and unpaired reads, so you will
          have to run the R1_paired, R2_paired as one run, and the unpaired files as a separate run.

          Hope this makes sense.
          Maria
          Thank you for your prompt reply.
          By the way, is it fair to say that it is better to align the trimmed paired files when the raw data files are paired-end? If so, what is the point of aligning the trimmed unpaired files when the raw data is paired-end?


          • #6
            It doesn't matter whether your data is single-end or paired-end, it is always better to do QC first, and then trim the reads if the QC indicates that you have low quality regions or adapter sequences.


            • #7
              Originally posted by mastal
              It doesn't matter whether your data is single-end or paired-end, it is always better to do QC first, and then trim the reads if the QC indicates that you have low quality regions or adapter sequences.
              Sorry for my poor English. I guess I didn't make my point clear.
              I know it is always better to do QC first.
              For example:
              I have two FASTQ files (R1.fastq and R2.fastq), which are paired-end data.
              After trimming with Trimmomatic, I get R1_trimmed_paired.fastq, R1_trimmed_unpaired.fastq, R2_trimmed_paired.fastq, and R2_trimmed_unpaired.fastq.
              Then,
              1. I can run bowtie with R1_trimmed_paired.fastq and R2_trimmed_paired.fastq as paired-end data to get an alignment file, say R1R2.sam.
              2. Or I can run bowtie with R1_trimmed_unpaired.fastq and R2_trimmed_unpaired.fastq separately to get two alignment files, say R1.sam and R2.sam.

              As I understand it, step 1 makes sense because we are processing paired-end files. But why would we do step 2? Step 2 seems to process paired-end files as single-end files. If we can do that, why don't we just treat all the files as single-end and process them that way?


              • #8
                Originally posted by zhoujiayi
                Sorry for my poor English. I guess I didn't make my point clear.
                I know it is always better to do QC first.
                For example:
                I have two FASTQ files (R1.fastq and R2.fastq), which are paired-end data.
                After trimming with Trimmomatic, I get R1_trimmed_paired.fastq, R1_trimmed_unpaired.fastq, R2_trimmed_paired.fastq, and R2_trimmed_unpaired.fastq.
                Then,
                1. I can run bowtie with R1_trimmed_paired.fastq and R2_trimmed_paired.fastq as paired-end data to get an alignment file, say R1R2.sam.
                2. Or I can run bowtie with R1_trimmed_unpaired.fastq and R2_trimmed_unpaired.fastq separately to get two alignment files, say R1.sam and R2.sam.

                As I understand it, step 1 makes sense because we are processing paired-end files. But why would we do step 2? Step 2 seems to process paired-end files as single-end files. If we can do that, why don't we just treat all the files as single-end and process them that way?
                Doing 1. makes total sense and does what you describe. Doing 2. may or may not be worthwhile (in my experience, at least, aligning an R2_trimmed_unpaired file is usually not worthwhile). The reads in the unpaired files are not the same as those in the paired files. In brief, if one read of a pair has terrible quality, is mostly adapter, or has some other problem that results in it being trimmed too short for use, then its mate is written to the appropriate unpaired file. These, then, are single-end reads, because their mates aren't useful for anything. In general, paired-end reads will give you a slightly more certain alignment (they can also more easily be used for determining structural variations and other things, if that's your goal).
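
To make the routing concrete, here is a toy Python sketch of how a pair-aware trimmer might decide which output a read goes to. The trimming rule (cut the low-quality tail, then apply a minimum length) and the thresholds are simplifying assumptions for illustration, not Trimmomatic's actual algorithm.

```python
from typing import Dict, List, Tuple

Read = Tuple[str, List[int]]  # bases and per-base Phred quality scores


def trim_tail(read: Read, min_q: int = 20) -> Read:
    """Remove trailing bases whose quality is below min_q."""
    seq, quals = read
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def route_pair(r1: Read, r2: Read, min_len: int = 4) -> Dict[str, Read]:
    """Trim both mates, then decide which output file(s) they belong in."""
    t1, t2 = trim_tail(r1), trim_tail(r2)
    keep1 = len(t1[0]) >= min_len
    keep2 = len(t2[0]) >= min_len
    if keep1 and keep2:
        return {"R1_paired": t1, "R2_paired": t2}  # both survive: stay paired
    if keep1:
        return {"R1_unpaired": t1}                 # R2 was lost: R1 is an orphan
    if keep2:
        return {"R2_unpaired": t2}                 # R1 was lost: R2 is an orphan
    return {}                                      # both mates dropped


if __name__ == "__main__":
    good = ("ACGTACGT", [30] * 8)
    bad = ("ACGTACGT", [30, 30, 2, 2, 2, 2, 2, 2])
    print(route_pair(good, good))
    print(route_pair(good, bad))
```

A read only lands in an unpaired file when its mate failed, which is why the paired files stay the same length and in the same order.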


                • #9
                  Because there are advantages to using paired-end reads.

                  When you are doing alignment or assembly, it is easier to map the reads correctly if you know that R2 should map within so many bases from R1.
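
As a rough illustration of that constraint, the Python sketch below keeps only the candidate R2 positions that would put the fragment within an expected size range of R1. It assumes 0-based leftmost alignment positions, ignores read orientation, and uses illustrative bounds of 0 to 500 bases; real aligners apply more elaborate rules.

```python
from typing import List


def fragment_length(r1_start: int, r1_len: int,
                    r2_start: int, r2_len: int) -> int:
    """Span from the leftmost aligned base to the rightmost aligned base."""
    left = min(r1_start, r2_start)
    right = max(r1_start + r1_len, r2_start + r2_len)
    return right - left


def plausible_mate_positions(r1_start: int, r1_len: int,
                             r2_candidates: List[int], r2_len: int,
                             min_frag: int = 0,
                             max_frag: int = 500) -> List[int]:
    """Keep R2 candidate positions that give a plausible fragment size."""
    return [
        pos for pos in r2_candidates
        if min_frag <= fragment_length(r1_start, r1_len, pos, r2_len) <= max_frag
    ]


if __name__ == "__main__":
    # R2 matches equally well at two places (e.g. a repeat); R1 is unique.
    print(plausible_mate_positions(1000, 100, [1250, 50000], 100))
```

Here the mate's position resolves an R2 placement that would be ambiguous if the read were aligned on its own.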


                  • #10
                    Memory Space issues and Unpaired Reads

                    Hello.

                    I finished trimming my data and now have both paired-end and unpaired reads.

                    I have limited disk space and want to delete the unpaired reads. To be sure I do not need the unpaired data, would a FastQC report on the trimmed paired data suffice? That is, can I delete the unpaired data if I know that the trimmed paired reads have good quality?

                    Thank you.


                    • #11
                      I guess it depends on how many of your reads are paired-end and how many are single-end after trimming.

                      I would also run FastQC on the single-end reads, to see how the quality compares with that of the trimmed paired-end reads. Then decide whether you want to delete them or not.
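
A quick way to see that split is to count the reads in each output file. Below is a minimal Python sketch, assuming unwrapped four-line FASTQ records; the function names and dictionary keys are made up for illustration.

```python
import io
from typing import Dict, TextIO


def count_fastq_reads(handle: TextIO) -> int:
    """Count records in an unwrapped FASTQ stream (4 lines per read)."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; truncated file?")
    return n_lines // 4


def trimming_summary(raw_pairs: int, paired: int,
                     r1_unpaired: int, r2_unpaired: int) -> Dict[str, float]:
    """Fraction of the original pairs in each outcome after trimming."""
    orphaned = r1_unpaired + r2_unpaired
    return {
        "still_paired": paired / raw_pairs,
        "one_mate_left": orphaned / raw_pairs,
        "both_dropped": (raw_pairs - paired - orphaned) / raw_pairs,
    }


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n")
    print(count_fastq_reads(demo))
    print(trimming_summary(raw_pairs=100, paired=90, r1_unpaired=6, r2_unpaired=1))
```

For real files, pass an open handle such as `open("R1_paired.fastq")`. If only a small fraction of the pairs end up with a single surviving mate, the unpaired files add little either way.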


                      • #12
                        Deleting the Unpaired Reads

                        Hello. Thank you for your reply.

                        I may not have time to compare the trimmed paired and unpaired reads for every sample; I have too many.

                        So if my trimmed paired data passes FastQC, it would make sense to use only the paired-end data. Comparing is not time-efficient.

                        Especially since a lot of folks are writing:

                        "Doing 1. makes total sense and does what you describe. Doing 2. may or may not be worthwhile (in my experience, at least, aligning an R2_trimmed_unpaired file is usually not worthwhile). "

                        Thank you.
