Hi Guys,
I was following a tutorial that provided a set of paired-end reads, as separate forward and reverse files.
Goal: write an automated script that takes single-end, paired-end and mate-pair reads and runs a few sanity checks:
- 1. Determines whether the input files really are of the stated type.
- 2. Checks whether the paired-end/mate-pair files are matched, sorted and annotated and, if not, cleans them up.
- 3. Determines the insert size and inner distance, to compare against the expected experimental Bioanalyzer data.
- 4. If all of the above is OK, takes a sample of the large data sets and runs the above QC with BWA/Bowtie.
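For check 3, a minimal Python sketch, assuming the reads have already been aligned to SAM text and that the aligner fills in the TLEN column (column 9). It takes the median TLEN of properly paired records as the insert size and derives the inner distance as insert size minus twice the read length, which is only one common convention. The function name `insert_stats` and the use of the median are assumptions made for illustration.

```python
import statistics

def insert_stats(sam_lines, read_length):
    """Return (median_insert_size, median_inner_distance) from SAM text lines."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # column 2: bitwise FLAG
        tlen = int(fields[8])             # column 9: signed template length
        # 0x2 = read mapped in a proper pair. Keeping only TLEN > 0 counts
        # each pair once (the leftmost mate carries the positive value).
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    if not sizes:
        return None, None
    median_insert = statistics.median(sizes)
    # Assumed convention: inner distance = insert size - 2 * read length.
    return median_insert, median_insert - 2 * read_length
```

If the values are "all over the place", looking at the spread (for example `statistics.pstdev(sizes)`) in addition to the median would show it.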
So I ran BWA and noticed the INED values were all over the place. I went back and realized that the reads in the files neither have /1 and /2 appended to their names nor are sorted into matching order. I'm assuming that's my first problem.
Is that assumption correct?
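One way to test that assumption is to walk both files in parallel and compare read names. A minimal Python sketch, assuming plain 4-line FASTQ records and that mates share the same read name apart from an optional /1 or /2 suffix; `read_ids` and `first_mismatch` are names made up for illustration.

```python
from itertools import zip_longest

def read_ids(fastq_lines):
    """Yield the read name of every 4-line FASTQ record, without '@' or /1 /2."""
    records = (line for line in fastq_lines if line.strip())  # ignore blank lines
    for i, line in enumerate(records):
        # Use the line position, not a leading '@': quality strings
        # may also start with '@'.
        if i % 4 == 0:
            name = line.split()[0][1:]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name

def first_mismatch(forward_lines, reverse_lines):
    """Return the 0-based index of the first record whose names differ, or None."""
    pairs = zip_longest(read_ids(forward_lines), read_ids(reverse_lines))
    for index, (fwd, rev) in enumerate(pairs):
        if fwd != rev:   # also catches one file being longer than the other
            return index
    return None
```

Called on the two example files at the end of this post, this would return 0, since the first forward read (2:27:10054:113252) and the first reverse read (2:24:20503:16510) already have different names.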
So, to sort the files and append /1 and /2 to the read names, I was going to use something like this:
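As a rough Python sketch of that step (not a tested pipeline): it assumes 4-line FASTQ records, that mates carry identical read names in the two files, and that the names do not already end in /1 or /2. It loads both files into memory, so it only suits small files or samples. `parse_fastq` and `pair_and_tag` are hypothetical names.

```python
def parse_fastq(lines):
    """Yield (read_name, sequence, quality) from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:].split()[0], seq, qual

def pair_and_tag(forward_lines, reverse_lines):
    """Return (forward_text, reverse_text) with only reads found in both
    inputs, sorted by read name, and /1 or /2 appended to each header."""
    fwd = {name: (seq, qual) for name, seq, qual in parse_fastq(forward_lines)}
    rev = {name: (seq, qual) for name, seq, qual in parse_fastq(reverse_lines)}
    shared = sorted(fwd.keys() & rev.keys())   # drop reads without a mate

    def render(reads, suffix):
        return "".join(
            f"@{name}{suffix}\n{reads[name][0]}\n+\n{reads[name][1]}\n"
            for name in shared
        )

    return render(fwd, "/1"), render(rev, "/2")
```

Usage would be along the lines of `out1, out2 = pair_and_tag(open("fwd.fastq"), open("rev.fastq"))`, with the file names here being placeholders.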
For the sampling I was going to use this:
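A minimal Python sketch of one option, reservoir sampling, which picks a fixed number of read pairs in a single pass without knowing the file size in advance. It assumes the two files are already in matched order with 4 lines per record; `sample_pairs`, `k` and `seed` are names chosen for illustration.

```python
import random

def sample_pairs(forward_lines, reverse_lines, k, seed=0):
    """Reservoir-sample up to k read pairs from two FASTQ inputs in matched order."""
    rng = random.Random(seed)          # fixed seed makes the sample reproducible
    fwd_iter, rev_iter = iter(forward_lines), iter(reverse_lines)
    reservoir = []
    n = 0                              # number of pairs seen so far
    while True:
        fwd_rec = [next(fwd_iter, None) for _ in range(4)]
        rev_rec = [next(rev_iter, None) for _ in range(4)]
        if None in fwd_rec or None in rev_rec:
            break                      # one of the inputs is exhausted
        if n < k:
            reservoir.append((fwd_rec, rev_rec))
        else:
            j = rng.randint(0, n)      # inclusive on both ends
            if j < k:
                reservoir[j] = (fwd_rec, rev_rec)
        n += 1
    return reservoir
```

Because both mates are drawn together, the sampled forward and reverse records stay paired, which is what the aligner needs.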
For the alignment, I could use BWA, Velvet/Qualimap/HTSeq.
Ideally I would use Python for this (as I'm more comfortable with it). Any thoughts or ideas on this from anybody? Am I on the right track? Am I correct that the unsorted FASTQ files are the issue? Have you done similar things with your data?
Here are the first two reads from each file as an example:
Forward FASTQ:
@HWI-ST534_129:2:27:10054:113252:CGATGT
GCGGAGCCGGGTGACTGGCGAGCCGGAACATCAGGCGCCGCCGCAGAGAA
+
EEEECBEED>EEGEF<DA:A<CDDD?5DAAACA=C?D<@CGFEADEDDBE
@HWI-ST534_129:2:62:5677:145482:CGATGT
CGGAACATCAGGCGCCGCCGCAGAGAAGAACTATGGAGGAGCCCTCTGAG
+
HBGFHHHDFGHHGFHHGHFFFHHFHEFE<GEAAABHEDF@EEFFDFAEED
Reverse FASTQ:
@HWI-ST534_129:2:24:20503:16510:CGATGT
CTGAGAGCCGGGGAAGCCGGCGGAGCCGGGGGACTGGCGAGCCGGAACAT
+
HHHHHHHHHHEFDDGDDFBFGG>7D4<9;<&?:;<DC>CCDD@?=?A###
@HWI-ST534_129:2:42:2118:9580:CGATGT
GGCGGAGCCGGGTGACTGGCGAGCCGGAACATCAGGCGCCGCCGCAGAGA
+
GEECGGGBGIDF6FFFFEF=IDEFBEE8E8E?EEB@6=9B##########