**GenoMax** · 08-19-2013, 11:53 AM

Originally posted by rzeng

I used an extra barcode file (6 different barcodes) to split data 2 into 6 separate files (with the Galaxy barcode splitter). Now I am stuck and can't keep going until I figure out the following questions:

1. How can I sort the forward data 1 and reverse data 3 using the 6 files generated by the barcode splitter? Is there software to do this? By the way, I do not have much bioinformatics background, so any good suggestions are welcome.

2. How do I know where the adapter sequences are, or whether there are any adapter sequences at all, in the forward/reverse sequences from data 1 and 3? Knowing this would be very helpful for trimming adapters from the original sequences.

If you managed to get the forward and reverse reads into separate files for each sample then you have made good progress. At this stage you probably want to do some QC on the files.

Here is a link for some practical info to get you started: http://en.wikibooks.org/wiki/Next_Ge...Pre-processing

As for some of your other questions, use the "search" functionality on this site along with clever combinations of keywords. You will find many past threads that have the answers you need (and also additional info you may not have thought to ask about).

**rzeng** · 08-19-2013, 12:35 PM

Reply GenoMax

GenoMax,

Thanks very much for your answer. However, my problem now is how to separate the forward/reverse reads using my separate barcode files (generated from data 2), given that there are more than 40 million sequences in both the forward and reverse reads.

**GenoMax** · 08-19-2013, 05:33 PM

Can you provide some additional information about where you got the original data and what format it was in? What did the names of the files look like? Generally a provider will de-multiplex the samples for you (using the Illumina pipeline software). It is much simpler to do it that way.

You may be able to use the script from the Qiime package: http://qiime.org/scripts/split_libraries_fastq.html to do the demultiplexing as suggested in this thread: http://seqanswers.com/forums/showthread.php?t=24215

**rzeng** · 08-20-2013, 09:42 AM

Thank you GenoMax,

Sadly, I took over someone's project without knowing much about the background of the original data (I cannot even get in contact with the person who prepared the data).

The FASTQ data consist of three files (forward sequence data 1, barcode sequence data 2 and reverse sequence data 3), each containing more than 40,000,000 different sequences. The names/IDs of these sequences in the three files are very similar apart from the XXXX fields, as shown below:

Name/ID of sequence data

@IPAR1:2:1:XXXX:XXXX:1#0/1 forward read data 1
@IPAR1:2:1:XXXX:XXXX:1#0/2 barcode read data 2
@IPAR1:2:1:XXXX:XXXX:1#0/3 reverse read data 3

The sequences in all three files are in the same order.

For example,

the 18th sequence in data 1 is
@IPAR1:2:1:4029:1196:1#0/1 ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
+ BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

the 18th sequence (15 bp) in data 2 is

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

the 18th sequence in data 3 is
@IPAR1:2:1:4029:1196:1#0/3 GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
+ IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE

However, the barcode sequence (15 bp) in data 2 is not part of the sequences in data 1 and 3, so I cannot sort the sequences in data 1 and 3 using data 2 directly. What I do have is extra barcode information for splitting the 400,000,000 barcode sequences in data 2: 6 different 8-mer barcode sequences that overlap with the sequences in data 2. For example, TGACCTTG overlaps with the sequence in

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

So, I used this barcode information to split data 2 into 6 different files (each representing one sample). At this point, I need to go back to the forward/reverse sequences in data 1 and 3 and split them into 6 different files too.
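The barcode-matching step described above can be sketched in a few lines of Python. This is only an illustration, assuming the 8-mer barcode sits at the start of the 15 bp tag read (as in the TGACCTTG example) and that only exact matches count. Of the table entries below, only TGACCTTG comes from this thread; the second barcode and both sample names are made up.

```python
# Hypothetical barcode table: only TGACCTTG is taken from the thread,
# the second barcode and both sample names are invented for illustration.
BARCODES = {
    "TGACCTTG": "sample_1",
    "ACGTACGT": "sample_2",
}


def assign_sample(tag_read, barcodes=BARCODES, barcode_len=8):
    """Return the sample whose barcode starts the tag read, or None."""
    return barcodes.get(tag_read[:barcode_len])


print(assign_sample("TGACCTTGATCTCGT"))
```

Tag reads with a sequencing error in the first 8 bases return None here; allowing mismatches would need a different lookup.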

PS. Sorry for the complicated explanation above, but my case is really different from the other cases of Illumina read data I have been able to find.

Any suggestions would be much appreciated!

**GenoMax** · 08-20-2013, 02:02 PM

@IPAR1:2:1:4029:1196:1#0/1

This is the important part of the three files that you need to be looking at. If you look at the description of the fastq format (Illumina sequence identifiers), that string uniquely identifies a cluster. The /1, /2 and /3 on the end signify that these are R1 = forward read, R2 = tag read and R3 = reverse read (as you have already figured out).
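To make the ID structure concrete, here is a small Python sketch that takes such an ID line apart. The function name is made up, and it assumes the older `#index/read` style of ID shown in this thread.

```python
def parse_header(header):
    """Split an ID line such as '@IPAR1:2:1:4029:1196:1#0/1'
    into (cluster_id, index_field, read_number)."""
    name = header.strip().lstrip("@")
    cluster_and_index, read_number = name.rsplit("/", 1)
    cluster_id, index_field = cluster_and_index.rsplit("#", 1)
    return cluster_id, index_field, int(read_number)


# The three records for one cluster share the same cluster_id.
for suffix in ("/1", "/2", "/3"):
    print(parse_header("@IPAR1:2:1:4029:1196:1#0" + suffix))
```

Two records belong to the same cluster when their `cluster_id` values are equal, whatever the read number is.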

So for the following tag read:

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

The two corresponding real reads are the /1 and /3 records. In the Illumina pipeline the tag read is automatically taken into account and then added to the ID lines of R1 and R2 (the reverse read takes the R2 designation), like so:

@HWUSI-EAS100R:6:73:941:1973#NNNNN/1 (NNNNN = tag)

When you split the files (either with your own script or with the one from qiime), make sure that you add the tag sequence to the ID; otherwise it may be difficult to keep track of it later on.
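As a rough sketch of what "adding the tag to the ID" could mean in a script of your own, the Python function below replaces the `0` after the `#` with the tag and, optionally, renumbers the read (for example /3 to /2). The function name and its parameters are made up for illustration.

```python
def add_tag(header, tag, read_number=None):
    """Return the ID line with the field after '#' replaced by the tag.

    If read_number is given, it replaces the number after the final '/'.
    """
    prefix, rest = header.rstrip("\n").split("#", 1)
    _, old_number = rest.rsplit("/", 1)
    number = old_number if read_number is None else read_number
    return f"{prefix}#{tag}/{number}"


print(add_tag("@IPAR1:2:1:4029:1196:1#0/1", "TGACCTTG"))
print(add_tag("@IPAR1:2:1:4029:1196:1#0/3", "TGACCTTG", read_number=2))
```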

You should also format the files so they are in the correct fastq format:

@ID
Sequence goes on this line
+
Quality values for corresponding bases on this line
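If you want to check that a file really follows this four-line layout, something like the Python sketch below would do. It is a minimal reader written for this thread, not a library function; it stops at the end of the file (or at the first empty line) and raises an error on a record that does not fit the layout.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file, or an empty line
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, sequence, quality


# The tag read quoted in this thread, used as a tiny in-memory test file.
example = io.StringIO(
    "@IPAR1:2:1:4029:1196:1#0/2\nTGACCTTGATCTCGT\n+\nHIHIIGIIIH8CCDC\n"
)
records = list(read_fastq(example))
print(records)
```

For a real file, pass `open("your_file.fastq")` instead of the `io.StringIO` object.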

**rzeng** · 08-20-2013, 02:41 PM

Thanks GenoMax, that helps a lot!

So my data ID lines do NOT have tags on them. Does that mean my data has not been processed by the Illumina pipeline?

Can I ask Illumina to add the tags to the ID lines using the Illumina pipeline, or can I do it myself by downloading the pipeline? I am confused because, as I understand it, the tags should ALREADY have been added when I received the raw data for R1 and R2, right?

**GenoMax** · 08-20-2013, 03:19 PM

Your files have been processed by the Illumina pipeline but the samples have not been demultiplexed. If the samples had been de-multiplexed by your sequence provider then they would have given you just two files (R1 and R2 reads). You seem to have three files, with the tag in a separate file.

If you are able to ask the provider to demultiplex the samples that would be the best solution but since this data is old it may not be feasible at this time.

**rzeng** · 08-21-2013, 12:33 PM

GenoMax,

I have split the 400,000,000 tag reads into 6 separate files using the 6 extra barcode sequences. I want to confirm with you that the next step is to use my own script, or the one from Qiime, to split the R1 and R2 files using EACH of the 6 separate files, right? R1 and R2 do not have the barcode tags (or barcode sequence), only a similar ID line (highlighted in RED below).

R1 file
@IPAR1:2:1:4029:1196:1#0/1

Splitted barcode file
@IPAR1:2:1:4029:1196:1#0/2

R2 file
@IPAR1:2:1:4029:1196:1#0/3

Can the Qiime script help me split R1 and R2 using just these RED highlighted ID lines?

**GenoMax** · 08-21-2013, 01:15 PM

It sounds like even though your files were not demultiplexed, they were somehow sorted on the tags so that all the corresponding R1 and R3 reads (based on R2) were in the same order in the original files. If you have managed to separate the samples into 6 files, are you able to write a script that will add the "tag" to the IDs of the separated files so that you end up with something that looks like below (see changes marked in red):

R1 file
@IPAR1:2:1:4029:1196:1#NNNNN/1

R2 file
@IPAR1:2:1:4029:1196:1#NNNNN/2
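A script of this kind could look something like the Python sketch below. It is only an illustration and has not been run on real data: it assumes every file uses the four-line fastq layout, that the read file and the tag file list clusters in the same order, and that the sample barcode is the first 8 bases of the tag read. The function names and the sample name are made up, and the read sequence in the example is shortened.

```python
import io
from itertools import islice


def records(handle):
    """Yield FASTQ records as lists of four lines (assumes 4-line layout)."""
    while True:
        rec = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(rec) < 4:
            return
        yield rec


def demultiplex(read_handle, tag_handle, barcodes, barcode_len=8):
    """Yield (sample, record) for each read whose tag matches a barcode.

    Walks the read file and the tag file in step, checks that the IDs
    agree, and writes the barcode into the ID line of the read.
    Reads without a matching barcode are skipped.
    """
    for read, tag in zip(records(read_handle), records(tag_handle)):
        read_id, read_number = read[0].rsplit("/", 1)
        tag_id = tag[0].rsplit("/", 1)[0]
        if read_id != tag_id:
            raise ValueError(f"files out of sync: {read[0]} vs {tag[0]}")
        barcode = tag[1][:barcode_len]
        sample = barcodes.get(barcode)
        if sample is None:
            continue
        cluster = read_id.split("#", 1)[0]
        read[0] = f"{cluster}#{barcode}/{read_number}"
        yield sample, read


# Tiny in-memory example; the read sequence is truncated for brevity.
r1 = io.StringIO("@IPAR1:2:1:4029:1196:1#0/1\nATTTTGCC\n+\nBHGHHHHH\n")
r2 = io.StringIO("@IPAR1:2:1:4029:1196:1#0/2\nTGACCTTGATCTCGT\n+\nHIHIIGIIIH8CCDC\n")
result = list(demultiplex(r1, r2, {"TGACCTTG": "sample_1"}))
print(result)
```

With tens of millions of reads, write each yielded record straight to a per-sample output file instead of collecting everything in a list. Note that `zip` stops silently at the shorter file, so it is worth checking beforehand that both files contain the same number of records. Run it once with the forward read file and once with the reverse read file.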

If you are not able to do this yourself, then the better option is the one below.

The script included in the "qiime" package will take as input the R1 file along with the R2 (tag) file and will then produce separate files for each of your samples (it sounds like you have 6). You will then repeat the process with the R3 file along with the R2 (tag) file to produce the corresponding files that will contain the paired-end reads.

You can run the qiime script as follows (Disclaimer: I have not used the qiime script myself, but based on the info provided on the help page I expect the script to work as noted below).

Code:

$ split_libraries_fastq.py -i /path_to/Read1.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode_type 8 --sample_id replace_with_sample_name -o output_dir_name

Repeat for the paired read:

Code:

$ split_libraries_fastq.py -i /path_to/Read3.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode_type 8 --sample_id replace_with_sample_name -o output_dir_name

Questions for barcode file splitting, forward/reverse data sorting