Seqanswers Leaderboard Ad

**dawe** · 11-26-2009, 01:42 AM

Hi Maxim,
You have at least two options:
1- using MACS with SAM files (which are supported)
2- convert to BED using a simple bash pipeline:

Code:

samtools view -F 0x0004 $YOURFILE | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'

**Maxim** · 11-26-2009, 08:26 AM

Hi Dawe,

thanks for the suggestions, unfortunately both do not work in my hands:

awk way gives such output (which could be reformatted):

Code:

length=76	0	1	SRR027956.80319	chr1	+
length=76	16	17	SRR027956.80320	chr2	+
length=76	0	1	SRR027956.80321	chr4	+
length=76	0	1	SRR027956.80322	chr9	+
length=76	16	17	SRR027956.80323	chr1	+
length=76	0	1	SRR027956.80324	chrX	+
length=76	0	1	SRR027956.80315	chr21	+

MACS refuses to take the file with :

Code:

 macs -t Hi-C_HindIII_GM_1_1.sam --format=SAM --wig
INFO  @ Thu, 26 Nov 2009 17:22:42: 
# ARGUMENTS LIST:
# name = NA
# format = SAM
# ChIP-seq file = Hi-C_HindIII_GM_1_1.sam
# control file = None
# effective genome size = 2.70e+09
# tag size = 25
# band width = 300
# model fold = 32
# pvalue cutoff = 1.00e-05
# Ranges for calculating regional lambda are : peak_region,1000,5000,10000 
INFO  @ Thu, 26 Nov 2009 17:22:42: #1 read tag files... 
INFO  @ Thu, 26 Nov 2009 17:22:42: #1 read treatment tags... 
Traceback (most recent call last):
  File "/usr/local/bin/macs", line 273, in <module>
    main()
  File "/usr/local/bin/macs", line 57, in main
    (treat, control) = load_tag_files_options (options)
  File "/usr/local/bin/macs", line 252, in load_tag_files_options
    treat = options.build(open2(options.tfile, gzip_flag=options.gzip_flag))
  File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1480, in build_fwtrack
    (chromosome,fpos,strand) = self.__fw_parse_line(thisline)
  File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1500, in __fw_parse_line
    bwflag = int(thisfields[1])
ValueError: invalid literal for int() with base 10: 'SL-XAU_2_FC30E0LAAXX:1:9:538:1739'

I really believe my SAM-file is not correct. I'll restart the BOWTIE mapping when I'm on my computer.

Maxim

**dawe** · 11-26-2009, 08:33 AM

Originally posted by Maxim View Post

I really believe my SAM-file is not correct. I'll restart the BOWTIE mapping when I'm on my computer.

Maxim

I agree... can you post the first, say, 2-5 lines of your alingment results?

**Maxim** · 11-26-2009, 11:28 AM

Hi,

thank you for taking the time!!!

I ran:
./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam

Code:

head -n 100 SRR027956_1.sam
@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:247249719
@SQ	SN:chr2	LN:242951149
@SQ	SN:chr3	LN:199501827
@SQ	SN:chr4	LN:191273063
......
@PG	ID=Bowtie	VN=0.11.3	CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam"
SSRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76	0	chr19	32642499	255	76M	*	0	0	GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA	<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,	XA:i:1	MD:Z:14A61	NM:i:1
SRR027956.2 SL-XAU_2_FC30E0LAAXX:1:9:27:1722 length=76	0	chr2	102083878	255	76M	*	0	0	GCATGGTTAATTCTNTAGTGTAAAGAGACAAAGGGTAAACTAACAGAATGTGACACACTGGAAAGCATTTTGGAGA	<:<<<<<<<<<<.;!<<<<<6<;<<;<<<<3334,--,--------,--++-$,----*-,*,,-+,,,,,,,,-,	XA:i:1	MD:Z:14C19T17T23	NM:i:3
SRR027956.6 SL-XAU_2_FC30E0LAAXX:1:9:26:1619 length=76	16	chrX	100648521	255	76M	*	0	0	TGCTTCCAAATTAATTTCAATTTCCTTTTAAAATTACTTTTAATTCCAGATTACTTACAATNCTCTTAACTATTTC	---,,,-,--,,--,,,---,,,--++,,----,,,--,,-44333;;<;99;<<<<<9<<!<<<;<<6<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
SRR027956.5 SL-XAU_2_FC30E0LAAXX:1:9:9:1703 length=76	16	chr11	15738810	255	76M	*	0	0	TAGTATGCTCTTTTCTGAACACCCTAAGCTGCTCTTCTCAGAGGGGGAGAGGCTGGAGCTGNAGTTCCACAAAACC	,-,,,,,-,-,+,,-,----,---,-,--,--+-,,-,---43334<<<<<<<;<<<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61G14	NM:i:1

That is how it looks like.
Maxim

P.S.: this is basically one fastq file of a paired read. Is it ok to process the runs independently in Bowtie?
The problem with this dataset is that the mates will come from very different positions (or even chromosomes) as they are resuling from ligation reactions after chromosome conformation capture assay. As I do not know whether the standard aligners can handle this I thought to process the runs independently and to load them afterwards in my own database mate them again.

**dawe** · 11-26-2009, 11:51 AM

Hi again,

Originally posted by Maxim View Post

Hi,

thank you for taking the time!!!

:-)

Originally posted by Maxim View Post

I ran:
./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam

Code:

head -n 100 SRR027956_1.sam
@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:247249719
@SQ	SN:chr2	LN:242951149
@SQ	SN:chr3	LN:199501827
@SQ	SN:chr4	LN:191273063
......
@PG	ID=Bowtie	VN=0.11.3	CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam"
SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76	16	chr18	39746869255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76	0	chr19	32642499255	76M	*	0	0	GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA	<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,	XA:i:1	MD:Z:14A61	NM:i:1

That is how it looks like.
Maxim

At a first look I thought it could be an error in specifying options... But now I wonder read names in your fastq file contain spaces... Could it be that

SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76

is made by
SRR027956.3 SL-XAU_2_FC30E0LAAXX = instrument ID
1 = lane
9 = tile
38, 1713 = cluster coordinates
length=76 additional parameter added when the run has been started
?
I assume it's an illumina run... I believe you should remove spaces in your input. You can try this

Code:

$ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam

assuming that '-c' option can make bowtie read from a pipe (I had troubles with this just today o__O ).
Also, bowtie outputs results to stdout so you should always redirect them, I believe it would be better pipe them into samtools to have a BAM and save some space.
If the code above doesn't work, convert spaces to a file and give it to bowtie:

Code:

$ sed -e s/ /_/g reads/SRR027956_1.fastq > tmpfile
$ bowtie -S -p 4 hg18 tmpfile > SRR027956_1.sam

HTH

d

**dawe** · 11-26-2009, 11:56 AM

Whoops!

Originally posted by dawe View Post

Code:

$ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam

You probably would escape spaces in sed expression:

Code:

$ sed -e s/\ /_/g

you can also use perl do do this (without escaping)

Code:

$ perl -p -e "s/ /_/g"

I've substituted spaces with underscore but you are free to choose whatever you like :-)
d

**Maxim** · 11-26-2009, 12:37 PM

Hi,

I ran the sed command, but I had to do it like:

Code:

sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile

without the ticks it gave an error. But presumably the result differs from what you wanted, see below!

the resulting fastq file looks like:

@SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
+SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,

I ran Bowtie and ran your samtools/awk command as suggested.

Bowtie Output:

Code:

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:247249719
.....
@SQ	SN:chrM	LN:16571
@PG	ID=Bowtie	VN=0.11.3	CL="./bowtie -S -p 4 hg18 reads/tmpfile"
SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1

Looks fine, or?

But samtools fails:

Code:

./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
awk: line 2: function and never defined
[bam_header_read] EOF marker is absent.
[main_samview] fail to read the header.

Isn't the EOF "ctrl-d"?

gma

**dawe** · 11-26-2009, 01:00 PM

Originally posted by Maxim View Post

Hi,

I ran the sed command, but I had to do it like:

Code:

sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile

without the ticks it gave an error. But presumably the result differs from what you wanted, see below!

the resulting fastq file looks like:

@SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
+SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,

Nope, it is exactly what I meant. You see now the sequence IDs in a single word. When awk (or any other tool) tries to divide in column, the seq name is now a single value...

Originally posted by Maxim View Post

I ran Bowtie and ran your awk command

Bowtie Output:

Code:

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:247249719
.....
@SQ	SN:chrM	LN:16571
@PG	ID=Bowtie	VN=0.11.3	CL="./bowtie -S -p 4 hg18 reads/tmpfile"
SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1

But samtools/awk fails:

Code:

./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
awk: line 2: function and never defined
[bam_header_read] EOF marker is absent.
[main_samview] fail to read the header.

Isn't the EOF "ctrl-d"?

gma

Sorry, I've usually put the samtools/awk after bam generation... You should add '-S' option to your samtools view command.

Code:

$ ./samtools view -S -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'

If the awk doesn't work... nevermind, you now should have SAM files to feed MACS with :-)

**Maxim** · 11-26-2009, 01:17 PM

thanks again!

Awk still fails but I think I should be able to make it work. The goal is not only to parse it to a BED but to get the unique identifier from the run, too, in order to put it in the "name" field of the BED file.

Looks like I have to dive a little more into awk.

Despite of this you are right, for further processing the file is know suitable, MACS started smoothly, but it looks like it won't find any peaks, anyhow it will produce a .wig file!Good!

What I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.

Just by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?

Best wishes

Maxim

**dawe** · 11-26-2009, 01:51 PM

Hi Maxim, I'm glad you solved your issue...
It seems you have many other now :-)

Originally posted by Maxim View Post

Awk still fails but I think I should be able to make it work.

Strange... is it possible you are not using GNU/awk (i.e. you are running on FreeBSD or Solaris machine)?

Originally posted by Maxim View Post

Looks like I have to dive a little more into awk.

Definitely! awk and other command line utilities are really useful for processing this kind of data... you can perform many simple task by piping text streams to those commands.

Originally posted by Maxim View Post

What I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.

We have a local mirror of UCSC genome browser for data visualization... Essentially for the high annotation level on genomes we are working with. we keep data in bigWig format.
I've recommended to some people in our institution to use IGV for fast visualization of coverage data. Version 1.4 supports BAM files (actually I don't remember right now if you have to preprocess them, but I don't think so...).
In addition, you may want to output alignment results in BAM on the fly, like this:

Code:

$ bowtie [options] -S <ref> <file.fq> | samtools view -ut <st-ref> - | samtools sort - outfile

where st-ref is the fasta reference file indexed with

Code:

$ samtools faidx

You'll save a lot of space!

Originally posted by Maxim View Post

Just by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?

Best wishes

Maxim

I've been trying to pipe "samtools pileup" output to wig2bdg command from MACS distribution (modified to read stdin). Unfortunately it is rather slow and resulting bedGraph (analogous to wig) is not "optimized" for size. It should be said that I usually generate coverage tracks as side product of other analysis, so I rely on MACS or FindPeaks output.
As IGV, UCSC genome browser and GBrowse support BAM you will get a coverage representation just loading SAM/BAM files...

d

**Maxim** · 11-27-2009, 11:00 AM

Hi,

just wanted to let you know that it was a matter of the awk installed on my system (Ubuntu).

I just added MAWK, works great (including retrieval of the cluster ID).
Additonally the BAM file I created based on the reformatted fastq file works in all downstream applications (GBrowse/IGV/MACS etc).!

Best

Maxim

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 23 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Next-Gen data and it's processing to BED

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News