
  • Questions for barcode file splitting, forward/reverse data sorting

    Hi, I am pretty new to sequence analysis and a total newcomer here. My questions might be too basic, but answers would help me get my sequence analysis work off to a good start. Can anyone help me? Much appreciated!

    All I have are three separate raw data files in FASTQ format (each with more than 40 million reads): forward reads (data 1), barcode reads (data 2) and reverse reads (data 3). The reads in all three files correspond to each other and are in the same order from the beginning. I used a separate barcode file (6 different barcodes) to split data 2 into 6 files (with the Galaxy barcode splitter). Now I am stuck and can't go further until I figure out the following questions:

    1. How can I sort the forward reads (data 1) and reverse reads (data 3) using the 6 files generated by the barcode splitter? Is there software to do this? By the way, I do not have much bioinformatics background, so any good suggestions are welcome.

    2. How do I know where the adapter sequences are, or whether there are any adapter sequences at all, in the forward/reverse reads from data 1 and 3? Knowing this would help me trim adapters from the original sequences.

    The following is just one example extracted from my original data. All I have are these 3 data files plus a separate barcode file. I do not have any extra information, such as how the barcodes were designed or how the library was constructed.



    Data 1 (forward read)

    @IPAR1:2:1:4029:1196:1#0/1 ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
    + BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF


    Data 2 (barcode read; TGACCTTG is the barcode tag, and I do not yet know what the ATCTCGT after the tag is)

    @IPAR1:2:1:4029:1196:1#0/2
    TGACCTTGATCTCGT
    +
    HIHIIGIIIH8CCDC


    Data 3 (reverse read)

    @IPAR1:2:1:4029:1196:1#0/3 GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
    + IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE

  • #2
    If you managed to get the forward and reverse reads into separate files for each sample then you have made good progress. At this stage you probably want to do some QC on the files.

    Here is a link for some practical info to get you started: http://en.wikibooks.org/wiki/Next_Ge...Pre-processing

    As for some of your other questions, use the "search" functionality on this site with clever combinations of keywords. You will find many past threads that have the answers you need (and also additional info you may not have thought to ask about).



    • #3
      Reply GenoMax

      GenoMax,

      Thanks very much for your answer. However, my problem now is how to separate the forward/reverse reads using my split barcode files (generated from data 2), considering that there are more than 40 million reads in each of the forward and reverse files.


      • #4
        Can you provide some additional information about where you got the original data, and in what format? What did the file names look like? Generally a provider will de-multiplex the samples for you (using the Illumina pipeline software). It is much simpler to do it that way.

        You may be able to use the script from the Qiime package: http://qiime.org/scripts/split_libraries_fastq.html to do the demultiplexing as suggested in this thread: http://seqanswers.com/forums/showthread.php?t=24215



        • #5
          Thank you GenoMax,

          The sad part is that I took over someone's project without knowing much about the background of the original data (I cannot even get in contact with the person who prepared these data).

          These FASTQ data consist of three files (forward reads in data 1, barcode reads in data 2 and reverse reads in data 3), each containing more than 40,000,000 different sequences. The names/IDs of the sequences in the three files are very similar apart from the XXXX fields, as shown below.

          Name/ID of sequence data

          @IPAR1:2:1:XXXX:XXXX:1#0/1 forward read data 1
          @IPAR1:2:1:XXXX:XXXX:1#0/2 barcode read data 2
          @IPAR1:2:1:XXXX:XXXX:1#0/3 reverse read data 3

          The sequences in all three files are in the same order.

          For example,

          the 18th sequence in data 1 is
          @IPAR1:2:1:4029:1196:1#0/1 ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
          + BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

          the 18th sequence (15 bp) in data 2 is

          @IPAR1:2:1:4029:1196:1#0/2
          TGACCTTGATCTCGT
          +
          HIHIIGIIIH8CCDC

          the 18th sequence in data 3 is
          @IPAR1:2:1:4029:1196:1#0/3 GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
          + IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE

          However, because the barcode sequence (15 bp) in data 2 is not part of the sequences in data 1 and 3, I cannot sort the sequences in data 1 and 3 directly from their own content. I do have extra barcode information for splitting the more than 40,000,000 barcode reads in data 2: 6 different 8-mer barcode sequences, each of which matches the start of reads in data 2. For example, TGACCTTG matches the start of the sequence in

          @IPAR1:2:1:4029:1196:1#0/2
          TGACCTTGATCTCGT
          +
          HIHIIGIIIH8CCDC


          So, I used this barcode information to split data 2 into 6 different files (each representing one sample). At this point, I need to go back to the forward/reverse sequences in data 1 and 3 and split them into 6 different files too.

          PS. Sorry for the complicated explanation above, but my case is really different from the other cases of Illumina read data I have been able to find.

          Any suggestions would be much appreciated!
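One possible way to carry that split over to data 1 and 3, sketched here under assumptions rather than as a tested pipeline: collect the cluster IDs from one of the split barcode files, then keep only the records in a read file whose ID (ignoring the trailing /1, /2 or /3) is in that set. The function names and file paths are hypothetical, and the sketch assumes standard four-line FASTQ records.

```python
def cluster_key(header):
    """Return the part of an ID line shared by all reads of one cluster.

    '@IPAR1:2:1:4029:1196:1#0/2' -> 'IPAR1:2:1:4029:1196:1#0'
    """
    name = header.strip().split()[0]          # ignore anything after whitespace
    return name.lstrip("@").rsplit("/", 1)[0]  # drop '@' and the '/N' read number


def load_keys(barcode_fastq_path):
    """Collect the cluster keys of every record in one split barcode file."""
    keys = set()
    with open(barcode_fastq_path) as handle:
        for line_number, line in enumerate(handle):
            # Assumes four lines per record, so every 4th line is an ID line.
            if line_number % 4 == 0 and line.strip():
                keys.add(cluster_key(line))
    return keys


def filter_by_keys(read_fastq_path, keys, out_path):
    """Write the records whose cluster key is in `keys`; return how many were kept."""
    kept = 0
    with open(read_fastq_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file
            if cluster_key(record[0]) in keys:
                dst.writelines(record)
                kept += 1
    return kept
```

For example, `keys = load_keys("sample1_barcodes.fastq")` followed by `filter_by_keys("data1.fastq", keys, "sample1_R1.fastq")`, repeated for data 3 and for each of the 6 barcode files (all file names here are placeholders). Holding several million IDs in a set can need a fair amount of memory.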



          • #6
            @IPAR1:2:1:4029:1196:1#0/1
            This is the important part of the three files that you need to look at. If you look at the description of the FASTQ format (Illumina sequence identifiers), that string uniquely identifies a cluster. The /1, /2 and /3 at the end signify that these are R1 = forward read, R2 = tag read and R3 = reverse read (as you have already figured out).

            So for the following tag read:

            @IPAR1:2:1:4029:1196:1#0/2
            TGACCTTGATCTCGT
            +
            HIHIIGIIIH8CCDC

            The two corresponding real reads are the /1 and /3 records. In the Illumina pipeline, the tag read is automatically taken into account and its sequence is added to the ID lines of R1 and R2 (the reverse read takes the R2 designation), like so:

            @HWUSI-EAS100R:6:73:941:1973#NNNNN/1 (NNNNN = tag)
            When you split the files (either with your own script or with the one from QIIME), make sure that you add the tag sequence to the ID, otherwise it may be difficult to keep track of it later on.

            You should also format the files so they are in the correct FASTQ format:

            @ID
            Sequence goes on this line
            +
            Quality values for corresponding bases on this line
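As a rough illustration of both points (the four-line layout and adding the tag to the ID), here is a minimal Python sketch. The function names are hypothetical, and the ID pattern is only assumed from the example IDs in this thread (`@cluster#index/read`); other Illumina ID styles would need a different pattern.

```python
import re

# Assumed ID layout, based on the examples above: @<cluster>#<index>/<read number>
HEADER_RE = re.compile(r"^@(?P<cluster>.+)#(?P<index>[^/]*)/(?P<read>\d+)$")


def read_fastq(handle):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"not a four-line FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header, seq, qual


def add_tag(header, tag, read_number=None):
    """Replace the index field after '#' with `tag`, optionally renumbering the read."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unexpected ID line: {header!r}")
    read = read_number or match.group("read")
    return f"@{match.group('cluster')}#{tag}/{read}"


def format_record(header, seq, qual):
    """Return one record in the four-line layout shown above."""
    return f"{header}\n{seq}\n+\n{qual}\n"
```

With this, `add_tag("@IPAR1:2:1:4029:1196:1#0/3", "TGACCTTG", read_number=2)` gives an ID for the reverse read that carries the tag and the R2 designation.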



            • #7
              Thanks GenoMax, That helps a lot!

              So my data ID lines do NOT have tags on them. Does that mean my data has not been processed by the Illumina pipeline?

              Can I ask Illumina to re-add the tags to the ID lines using the Illumina pipeline, or can I do it myself by downloading the pipeline? This confuses me because, as I understand it, the tags are supposed to have been added ALREADY when I got the raw data for R1 and R2, right?



              • #8
                Your files have been processed by the Illumina pipeline, but the samples have not been demultiplexed. If the samples had been de-multiplexed by your sequence provider, they would have given you just two files (R1 and R2 reads). You seem to have three files, with the tag in a separate file.

                If you are able to ask the provider to demultiplex the samples, that would be the best solution, but since this data is old it may not be feasible at this time.



                • #9
                  GenoMax,

                  I have split the more than 40,000,000 tag reads and grouped them into 6 separate files using the 6 different barcode sequences I was given separately. I want to confirm with you that the next step is to use my own script, or the one from QIIME, to split the R1 and R2 files using EACH of the 6 separate files, right? R1 and R2 do not contain the barcode tags (or barcode sequences), but they do have matching ID header lines, as follows.

                  R1 file
                  @IPAR1:2:1:4029:1196:1#0/1

                  Splitted barcode file
                  @IPAR1:2:1:4029:1196:1#0/2

                  R2 file
                  @IPAR1:2:1:4029:1196:1#0/3

                  Can the QIIME script help me split R1 and R2 using just these ID header lines?
                  Last edited by rzeng; 08-21-2013, 12:37 PM.



                  • #10
                    It sounds like even though your files were not demultiplexed, they were somehow sorted on the tags so that all the corresponding R1 and R3 reads (based on R2) are in the same order in the original files. If you have managed to separate the samples into 6 files, are you able to write a script that adds the "tag" to the IDs of the separated files, so that you end up with something like the following (the tag replaces the 0 after the #):

                    R1 file
                    @IPAR1:2:1:4029:1196:1#NNNNN/1

                    R2 file
                    @IPAR1:2:1:4029:1196:1#NNNNN/2

                    If you are not able to do this yourself, then the better option is below.

                    The script included in the QIIME package takes the R1 file along with the R2 (tag) file as input and produces separate files for each of your samples (it sounds like you have 6). You then repeat the process with the R3 file along with the R2 (tag) file to produce the corresponding files containing the other read of each pair.

                    You can run the QIIME script as follows (disclaimer: I have not used the QIIME script myself, but based on the info on the help page I expect it to work as noted below).

                    Code:
                    $ split_libraries_fastq.py -i /path_to/Read1.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode-type 8 --sample-id replace_with_sample_name -o output_dir_name
                    Repeat for Paired-read
                    Code:
                    $ split_libraries_fastq.py -i /path_to/Read3.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode-type 8 --sample-id replace_with_sample_name -o output_dir_name
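If you would rather write your own script, the same idea can be sketched in Python. This is not what the QIIME script does internally; it is a minimal sketch that assumes the three files hold four-line FASTQ records in exactly the same order, that the barcode is the first 8 bases of the tag read (as in the TGACCTTG example earlier in the thread), and that only exact barcode matches count. The function names, sample names and output file names are hypothetical.

```python
from collections import Counter
from contextlib import ExitStack
from pathlib import Path


def records(handle):
    """Yield FASTQ records as lists of four lines (newlines stripped)."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # end of file
        yield lines


def demultiplex(r1_path, tag_path, r3_path, barcodes, out_dir, barcode_len=8):
    """Split R1 and R3 by the first `barcode_len` bases of the tag read.

    `barcodes` maps barcode sequence -> sample name. Reads whose barcode is
    not an exact match go to 'unassigned'. Returns read-pair counts per sample.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = Counter()
    with ExitStack() as stack:
        r1, tags, r3 = [stack.enter_context(open(p)) for p in (r1_path, tag_path, r3_path)]
        writers = {}
        for sample in set(barcodes.values()) | {"unassigned"}:
            writers[sample] = (
                stack.enter_context(open(out_dir / f"{sample}_R1.fastq", "w")),
                stack.enter_context(open(out_dir / f"{sample}_R2.fastq", "w")),
            )
        # zip() stops at the shortest input, so a truncated file is not detected here.
        for rec1, tag, rec3 in zip(records(r1), records(tags), records(r3)):
            cluster = tag[0].rsplit("/", 1)[0]
            if rec1[0].rsplit("/", 1)[0] != cluster or rec3[0].rsplit("/", 1)[0] != cluster:
                raise ValueError(f"read order differs between files at {tag[0]}")
            barcode = tag[1][:barcode_len]
            sample = barcodes.get(barcode, "unassigned")
            name = cluster.partition("#")[0]  # e.g. '@IPAR1:2:1:4029:1196:1'
            out1, out2 = writers[sample]
            # Put the tag in the ID; the reverse read takes the /2 designation.
            out1.write(f"{name}#{barcode}/1\n{rec1[1]}\n+\n{rec1[3]}\n")
            out2.write(f"{name}#{barcode}/2\n{rec3[1]}\n+\n{rec3[3]}\n")
            counts[sample] += 1
    return counts
```

A call such as `demultiplex("Read1.fastq", "Read2.fastq", "Read3.fastq", {"TGACCTTG": "sample1"}, "demux_out")` would then write `sample1_R1.fastq` and `sample1_R2.fastq`, plus an `unassigned` pair, into `demux_out`. The returned counts are worth checking against the sizes of the 6 files from the barcode splitter.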
                    Last edited by GenoMax; 08-21-2013, 01:53 PM.
