Hi,
just wanted to let you know that it was a matter of the awk installed on my system (Ubuntu).
I just added MAWK, works great (including retrieval of the cluster ID).
Additonally the BAM file I created based on the reformatted fastq file works in all downstream applications (GBrowse/IGV/MACS etc).!
Best
Maxim
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Hi Maxim, I'm glad you solved your issue...
It seems you have many other now :-)
Originally posted by Maxim View PostAwk still fails but I think I should be able to make it work.
Originally posted by Maxim View PostLooks like I have to dive a little more into awk.
Originally posted by Maxim View PostWhat I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.
I've recommended to some people in our institution to use IGV for fast visualization of coverage data. Version 1.4 supports BAM files (actually I don't remember right now if you have to preprocess them, but I don't think so...).
In addition, you may want to output alignment results in BAM on the fly, like this:
Code:$ bowtie [options] -S <ref> <file.fq> | samtools view -ut <st-ref> - | samtools sort - outfile
Code:$ samtools faidx
Originally posted by Maxim View PostJust by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?
Best wishes
Maxim
As IGV, UCSC genome browser and GBrowse support BAM you will get a coverage representation just loading SAM/BAM files...
d
Leave a comment:
-
thanks again!
Awk still fails but I think I should be able to make it work. The goal is not only to parse it to a BED but to get the unique identifier from the run, too, in order to put it in the "name" field of the BED file.
Looks like I have to dive a little more into awk.
Despite of this you are right, for further processing the file is know suitable, MACS started smoothly, but it looks like it won't find any peaks, anyhow it will produce a .wig file!Good!
What I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.
Just by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?
Best wishes
MaximLast edited by Maxim; 11-26-2009, 01:23 PM.
Leave a comment:
-
Originally posted by Maxim View PostHi,
I ran the sed command, but I had to do it like:
Code:sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile
the resulting fastq file looks like:
@SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
+SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,
Originally posted by Maxim View Post
I ran Bowtie and ran your awk command
Bowtie Output:
Code:@HD VN:1.0 SO:unsorted @SQ SN:chr1 LN:247249719 ..... @SQ SN:chrM LN:16571 @PG ID=Bowtie VN=0.11.3 CL="./bowtie -S -p 4 hg18 reads/tmpfile" SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76 16 chr18 39746869 255 76M * 0 0 GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC +,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<< XA:i:1 MD:Z:61T14 NM:i:1
Code:./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }' awk: line 2: function and never defined [bam_header_read] EOF marker is absent. [main_samview] fail to read the header.
gma
Code:$ ./samtools view -S -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
Leave a comment:
-
Hi,
I ran the sed command, but I had to do it like:
Code:sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile
the resulting fastq file looks like:
@SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
+SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,
I ran Bowtie and ran your samtools/awk command as suggested.
Bowtie Output:
Code:@HD VN:1.0 SO:unsorted @SQ SN:chr1 LN:247249719 ..... @SQ SN:chrM LN:16571 @PG ID=Bowtie VN=0.11.3 CL="./bowtie -S -p 4 hg18 reads/tmpfile" SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76 16 chr18 39746869 255 76M * 0 0 GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC +,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<< XA:i:1 MD:Z:61T14 NM:i:1
But samtools fails:
Code:./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }' awk: line 2: function and never defined [bam_header_read] EOF marker is absent. [main_samview] fail to read the header.
gmaLast edited by Maxim; 11-26-2009, 12:51 PM.
Leave a comment:
-
Whoops!
Originally posted by dawe View PostCode:$ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam
Code:$ sed -e s/\ /_/g
Code:$ perl -p -e "s/ /_/g"
d
Leave a comment:
-
Hi again,
Originally posted by Maxim View PostHi,
thank you for taking the time!!!
Originally posted by Maxim View PostI ran:
./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam
Code:head -n 100 SRR027956_1.sam @HD VN:1.0 SO:unsorted @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 @SQ SN:chr3 LN:199501827 @SQ SN:chr4 LN:191273063 ...... @PG ID=Bowtie VN=0.11.3 CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam" SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76 16 chr18 39746869255 76M * 0 0 GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC +,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<< XA:i:1 MD:Z:61T14 NM:i:1 SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76 0 chr19 32642499255 76M * 0 0 GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA <<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,, XA:i:1 MD:Z:14A61 NM:i:1
Maxim
SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76
is made by
SRR027956.3 SL-XAU_2_FC30E0LAAXX = instrument ID
1 = lane
9 = tile
38, 1713 = cluster coordinates
length=76 additional parameter added when the run has been started
?
I assume it's an illumina run... I believe you should remove spaces in your input. You can try this
Code:$ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam
Also, bowtie outputs results to stdout so you should always redirect them, I believe it would be better pipe them into samtools to have a BAM and save some space.
If the code above doesn't work, convert spaces to a file and give it to bowtie:
Code:$ sed -e s/ /_/g reads/SRR027956_1.fastq > tmpfile $ bowtie -S -p 4 hg18 tmpfile > SRR027956_1.sam
d
Leave a comment:
-
Hi,
thank you for taking the time!!!
I ran:
./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam
Code:head -n 100 SRR027956_1.sam @HD VN:1.0 SO:unsorted @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 @SQ SN:chr3 LN:199501827 @SQ SN:chr4 LN:191273063 ...... @PG ID=Bowtie VN=0.11.3 CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam" SSRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76 16 chr18 39746869 255 76M * 0 0 GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC +,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<< XA:i:1 MD:Z:61T14 NM:i:1 SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76 0 chr19 32642499 255 76M * 0 0 GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA <<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,, XA:i:1 MD:Z:14A61 NM:i:1 SRR027956.2 SL-XAU_2_FC30E0LAAXX:1:9:27:1722 length=76 0 chr2 102083878 255 76M * 0 0 GCATGGTTAATTCTNTAGTGTAAAGAGACAAAGGGTAAACTAACAGAATGTGACACACTGGAAAGCATTTTGGAGA <:<<<<<<<<<<.;!<<<<<6<;<<;<<<<3334,--,--------,--++-$,----*-,*,,-+,,,,,,,,-, XA:i:1 MD:Z:14C19T17T23 NM:i:3 SRR027956.6 SL-XAU_2_FC30E0LAAXX:1:9:26:1619 length=76 16 chrX 100648521 255 76M * 0 0 TGCTTCCAAATTAATTTCAATTTCCTTTTAAAATTACTTTTAATTCCAGATTACTTACAATNCTCTTAACTATTTC ---,,,-,--,,--,,,---,,,--++,,----,,,--,,-44333;;<;99;<<<<<9<<!<<<;<<6<<<<<<< XA:i:1 MD:Z:61T14 NM:i:1 SRR027956.5 SL-XAU_2_FC30E0LAAXX:1:9:9:1703 length=76 16 chr11 15738810 255 76M * 0 0 TAGTATGCTCTTTTCTGAACACCCTAAGCTGCTCTTCTCAGAGGGGGAGAGGCTGGAGCTGNAGTTCCACAAAACC ,-,,,,,-,-,+,,-,----,---,-,--,--+-,,-,---43334<<<<<<<;<<<<<<<!<<<<<<<<<<<<<< XA:i:1 MD:Z:61G14 NM:i:1
Maxim
P.S.: this is basically one fastq file of a paired read. Is it ok to process the runs independently in Bowtie?
The problem with this dataset is that the mates will come from very different positions (or even chromosomes) as they are resuling from ligation reactions after chromosome conformation capture assay. As I do not know whether the standard aligners can handle this I thought to process the runs independently and to load them afterwards in my own database mate them again.Last edited by Maxim; 11-26-2009, 11:43 AM.
Leave a comment:
-
Hi Dawe,
thanks for the suggestions, unfortunately both do not work in my hands:
awk way gives such output (which could be reformatted):
Code:length=76 0 1 SRR027956.80319 chr1 + length=76 16 17 SRR027956.80320 chr2 + length=76 0 1 SRR027956.80321 chr4 + length=76 0 1 SRR027956.80322 chr9 + length=76 16 17 SRR027956.80323 chr1 + length=76 0 1 SRR027956.80324 chrX + length=76 0 1 SRR027956.80315 chr21 +
Code:macs -t Hi-C_HindIII_GM_1_1.sam --format=SAM --wig INFO @ Thu, 26 Nov 2009 17:22:42: # ARGUMENTS LIST: # name = NA # format = SAM # ChIP-seq file = Hi-C_HindIII_GM_1_1.sam # control file = None # effective genome size = 2.70e+09 # tag size = 25 # band width = 300 # model fold = 32 # pvalue cutoff = 1.00e-05 # Ranges for calculating regional lambda are : peak_region,1000,5000,10000 INFO @ Thu, 26 Nov 2009 17:22:42: #1 read tag files... INFO @ Thu, 26 Nov 2009 17:22:42: #1 read treatment tags... Traceback (most recent call last): File "/usr/local/bin/macs", line 273, in <module> main() File "/usr/local/bin/macs", line 57, in main (treat, control) = load_tag_files_options (options) File "/usr/local/bin/macs", line 252, in load_tag_files_options treat = options.build(open2(options.tfile, gzip_flag=options.gzip_flag)) File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1480, in build_fwtrack (chromosome,fpos,strand) = self.__fw_parse_line(thisline) File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1500, in __fw_parse_line bwflag = int(thisfields[1]) ValueError: invalid literal for int() with base 10: 'SL-XAU_2_FC30E0LAAXX:1:9:538:1739'
Maxim
Leave a comment:
-
Hi Maxim,
You have at least two options:
1- using MACS with SAM files (which are supported)
2- convert to BED using a simple bash pipeline:
Code:samtools view -F 0x0004 $YOURFILE | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
Leave a comment:
-
Next-Gen data and it's processing to BED
Hi,
my Boss wants me to look at publicly available Solexa data and I'm not sure what to do with it in order to visualize the data in a genome browser. I would like to start with a graphical representation of the coverage to identify regions with many reads. In case the data looks suitable I'd like to run MACS in order to identify overrepresented regions in the genome (human hg18).
What I did so far:
1. I ran Bowtie and received a .sam file as output
this data looks like:
Code:@HD VN:1.0 SO:unsorted @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 @SQ SN:chr3 LN:199501827 @SQ SN:chr4 LN:191273063 @SQ SN:chr5 LN:180857866 @SQ SN:chr6 LN:170899992 @SQ SN:chr7 LN:158821424 @SQ SN:chr8 LN:146274826 @SQ SN:chr9 LN:140273252
Code:SRR027956.6443341 SL-XAU_2_FC30E0LAAXX:1:100:1461:151 length=76 4 * 00 * * 0 0 CCATGATCAAGTGGGCTTCATCCCTGGGATGCAAGGCTGGTTCAACATACGAAAATCAAAGATCGGAAGAGCGGTT <<3<<<<<-<6<;06<<<<<<<<<<<<<<<44334,,,---,,,-+,,,,,,,,,,+,,,+,,*+(,,,,++*+&+ XM:i:0
2: I downloaded ConvertToBed utility from Vancouver short read archive and invoke it like
Code:java -jar conversion_util/ConvertToBed.jar -aligner sam -input Hi-C_HindIII_GM_1_1.sam -qualityfilter 10 -output /output
Code:Error: Coundn't create log file : /output/AlignReadsToBed.log
What might be wrong?
I'm on Ubuntu 9.1. I have no clue how to get this script running.
In parallel I tried the samToBed.py I ound on Sourceforge, no luck there, too:
Code:python samToBed.py -s Hi-C_HindIII_GM_1_1.sam Traceback (most recent call last): File "samToBed.py", line 138, in <module> sys.exit(main()) File "samToBed.py", line 135, in main processSAM(samFile, aType) File "samToBed.py", line 47, in processSAM makeBED(samLine, alignType) File "samToBed.py", line 53, in makeBED samFlag = int(samFields[1]) ValueError: invalid literal for int() with base 10: 'VN:1.0'
Suggestions are really appreciated!!
MaximTags: None
Latest Articles
Collapse
-
by seqadmin
The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...-
Channel: Articles
07-08-2024, 03:19 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 07-25-2024, 06:46 AM
|
0 responses
9 views
0 likes
|
Last Post
by seqadmin
07-25-2024, 06:46 AM
|
||
Started by seqadmin, 07-24-2024, 11:09 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
07-24-2024, 11:09 AM
|
||
Started by seqadmin, 07-19-2024, 07:20 AM
|
0 responses
161 views
0 likes
|
Last Post
by seqadmin
07-19-2024, 07:20 AM
|
||
Started by seqadmin, 07-16-2024, 05:49 AM
|
0 responses
127 views
0 likes
|
Last Post
by seqadmin
07-16-2024, 05:49 AM
|
Leave a comment: