Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Code to split paired end MiSeq data?

    Hello all,

    I have some paired end MiSeq data that I have cleaned/trimmed. However the software I used to clean everything up requires that the paired end data files be concatenated into one fastq with all reads and their respective mates, and now I need to split them back into two fastq's each containing one read from a paired end set (R1 and R2).

    My fastq file with all paired end reads follows a typical fastq format, example from "head" of original fastq file:

    @M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
    TTATACTC
    +
    @A@AA@A@
    @M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
    T
    +
    E

    Where the bolded "1" or "2" indicate which member of the pair the read is, so I am trying to get all 1's into a separate file and all 2's into a separate file. I have a bit of perl code that a lab mate passed along to me that is supposed to split the reads in this way, and it runs and creates two separate files, but it is not splitting the data correctly based on the 1 and 2 in each header and I get a mixture in each file the program creates. I was wondering if anyone had any idea how to fix this? The code is below.

    Code:
    use strict;
    use warnings;
    
    my $readdata = $ARGV[0];
    open(FILE, "<$readdata") || die "cannot open $readdata\n"; 
    open(OUT1, ">$readdata\_1") || die "cannot open $readdata\_1\n"; 
    open(OUT2, ">$readdata\_2") || die "cannot open $readdata\_2\n";
    
    while(<FILE>){
    	chomp;
    	print OUT1 "$_\/1\n";
    	print OUT2 "$_\/2\n";
    	
    	my $newline = <FILE>; chomp($newline); 
    	print OUT1 substr($newline, 0, length($newline)/2)."\n";
    	print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    	
    	$newline = <FILE>; chomp($newline); 
    	print OUT1 "$newline\/1\/n";
    	print OUT2 "$newline\/2\/n";
    	
    	$newline = <FILE>; chomp($newline);
    	print OUT1 substr($newline, 0, length($newline)/2)."\n";
    	print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    	
    }
    
    close(FILE)
    Thanks so much for your help,

    ~Ana

  • #2
    Originally posted by akjones View Post
    Where the bolded "1" or "2" indicate which member of the pair the read is, so I am trying to get all 1's into a separate file and all 2's into a separate file.
    Hi Ana,

    This is not perl but it should work, assuming you are on a Linux/Unix machine:

    Code:
    paste - - - - < test.fq \
    | tee >(awk 'BEGIN{FS="\t"; OFS="\n"} {if (match($1, " 1:N")) print $1,$2,$3,$4}' > test.r1.fq ) \
    | awk 'BEGIN{FS="\t"; OFS="\n"} {if (match($1, " 2:N")) print $1,$2,$3,$4}' > test.r2.fq
    test.fq is your merged input file. test.r1.fq and test.r2.fq are the split files. I think the issue is to set the matching pattern to be specific enough to correctly separate read 1 from read 2 given the read name. Here I set the patterns to " 1:N" and " 2:N" for read 1 and 2 respectively.

    Loosely related to this question, take care that most programs expect the paired fastq files to have read 1 and read 2 in the same order (e.g. most aligners) and your pipeline above seems to break this requirement.

    All the best
    Dario

    Comment


    • #3
      Hi Dario,

      Thanks very much, your code works! What do you mean by
      Loosely related to this question, take care that most programs expect the paired fastq files to have read 1 and read 2 in the same order (e.g. most aligners) and your pipeline above seems to break this requirement.
      are you referring to the direction of the reads (5'-3' or 3'-5') or the order of the reads themselves, as in read 1 followed by read 2 of a pair? Sorry if that is a silly question, I just want to make sure I am understanding you correctly.

      Thanks again,
      ~Ana

      Comment


      • #4
        Originally posted by akjones View Post
        Hi Dario,

        Thanks very much, your code works! What do you mean by are you referring to the direction of the reads (5'-3' or 3'-5') or the order of the reads themselves, as in read 1 followed by read 2 of a pair? Sorry if that is a silly question, I just want to make sure I am understanding you correctly.

        Thanks again,
        ~Ana
        Hi- The fastq files mate 1 should have the same number and order of reads as in the fastq for mate 2. So, if a read is missing from one file, its mate should be removed from the other file as well to keep the correct pairing.

        E.g. Say this is your file for mate 1:
        read_1.fq:
        Code:
        @read1 1:N
        ACTG
        +
        IIII
        @read2 1:N
        ACTG
        +
        IIII
        @read3 1:N
        ACTG
        +
        IIII
        This file for mate 2 would be ok:
        Code:
        @read1 2:N
        AAAA
        +
        IIII
        @read2 2:N
        TTTT
        +
        IIII
        @read3 2:N
        CCCC
        +
        IIII
        This file for mate 2 would be "wrong" as one read is missing:
        Code:
        @read1 2:N
        ACTG
        +
        IIII
        @read3 2:N
        ACTG
        +
        IIII
        Last edited by dariober; 12-04-2013, 09:18 AM. Reason: mate corrected

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Choosing Between NGS and qPCR
          by seqadmin



          Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
          10-18-2024, 07:11 AM
        • seqadmin
          Non-Coding RNA Research and Technologies
          by seqadmin




          Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

          Nobel Prize for MicroRNA Discovery
          This week,...
          10-07-2024, 08:07 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 11-01-2024, 06:09 AM
        0 responses
        18 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 10-30-2024, 05:31 AM
        0 responses
        18 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 10-24-2024, 06:58 AM
        0 responses
        24 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 10-23-2024, 08:43 AM
        0 responses
        53 views
        0 likes
        Last Post seqadmin  
        Working...
        X