Header Leaderboard Ad

Collapse

Next-Gen data and it's processing to BED

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Next-Gen data and it's processing to BED

    Hi,

    my Boss wants me to look at publicly available Solexa data and I'm not sure what to do with it in order to visualize the data in a genome browser. I would like to start with a graphical representation of the coverage to identify regions with many reads. In case the data looks suitable I'd like to run MACS in order to identify overrepresented regions in the genome (human hg18).

    What I did so far:

    1. I ran Bowtie and received a .sam file as output
    this data looks like:
    Code:
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:chr1	LN:247249719
    @SQ	SN:chr2	LN:242951149
    @SQ	SN:chr3	LN:199501827
    @SQ	SN:chr4	LN:191273063
    @SQ	SN:chr5	LN:180857866
    @SQ	SN:chr6	LN:170899992
    @SQ	SN:chr7	LN:158821424
    @SQ	SN:chr8	LN:146274826
    @SQ	SN:chr9	LN:140273252
    and later:
    Code:
    SRR027956.6443341 SL-XAU_2_FC30E0LAAXX:1:100:1461:151 length=76	4	*	00	*	*	0	0	CCATGATCAAGTGGGCTTCATCCCTGGGATGCAAGGCTGGTTCAACATACGAAAATCAAAGATCGGAAGAGCGGTT	<<3<<<<<-<6<;06<<<<<<<<<<<<<<<44334,,,---,,,-+,,,,,,,,,,+,,,+,,*+(,,,,++*+&+	XM:i:0
    This looks like correct sam-format to me.

    2: I downloaded ConvertToBed utility from Vancouver short read archive and invoke it like
    Code:
    java -jar conversion_util/ConvertToBed.jar -aligner sam -input Hi-C_HindIII_GM_1_1.sam -qualityfilter 10 -output /output
    but get an error saying:

    Code:
    Error: Coundn't create log file : /output/AlignReadsToBed.log
    But that logfile is there, so it has been created.

    What might be wrong?

    I'm on Ubuntu 9.1. I have no clue how to get this script running.
    In parallel I tried the samToBed.py I ound on Sourceforge, no luck there, too:
    Code:
    python samToBed.py -s Hi-C_HindIII_GM_1_1.sam
    Traceback (most recent call last):
      File "samToBed.py", line 138, in <module>
        sys.exit(main())
      File "samToBed.py", line 135, in main
        processSAM(samFile, aType)
      File "samToBed.py", line 47, in processSAM
        makeBED(samLine, alignType)
      File "samToBed.py", line 53, in makeBED
        samFlag = int(samFields[1])
    ValueError: invalid literal for int() with base 10: 'VN:1.0'
    Might it be related to an improperly formated sam-file that I got from Bowtie?
    Suggestions are really appreciated!!
    Maxim

  • #2
    Hi Maxim,
    You have at least two options:
    1- using MACS with SAM files (which are supported)
    2- convert to BED using a simple bash pipeline:

    Code:
    samtools view -F 0x0004 $YOURFILE | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'

    Comment


    • #3
      Hi Dawe,

      thanks for the suggestions, unfortunately both do not work in my hands:

      awk way gives such output (which could be reformatted):
      Code:
      length=76	0	1	SRR027956.80319	chr1	+
      length=76	16	17	SRR027956.80320	chr2	+
      length=76	0	1	SRR027956.80321	chr4	+
      length=76	0	1	SRR027956.80322	chr9	+
      length=76	16	17	SRR027956.80323	chr1	+
      length=76	0	1	SRR027956.80324	chrX	+
      length=76	0	1	SRR027956.80315	chr21	+
      MACS refuses to take the file with :

      Code:
       macs -t Hi-C_HindIII_GM_1_1.sam --format=SAM --wig
      INFO  @ Thu, 26 Nov 2009 17:22:42: 
      # ARGUMENTS LIST:
      # name = NA
      # format = SAM
      # ChIP-seq file = Hi-C_HindIII_GM_1_1.sam
      # control file = None
      # effective genome size = 2.70e+09
      # tag size = 25
      # band width = 300
      # model fold = 32
      # pvalue cutoff = 1.00e-05
      # Ranges for calculating regional lambda are : peak_region,1000,5000,10000 
      INFO  @ Thu, 26 Nov 2009 17:22:42: #1 read tag files... 
      INFO  @ Thu, 26 Nov 2009 17:22:42: #1 read treatment tags... 
      Traceback (most recent call last):
        File "/usr/local/bin/macs", line 273, in <module>
          main()
        File "/usr/local/bin/macs", line 57, in main
          (treat, control) = load_tag_files_options (options)
        File "/usr/local/bin/macs", line 252, in load_tag_files_options
          treat = options.build(open2(options.tfile, gzip_flag=options.gzip_flag))
        File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1480, in build_fwtrack
          (chromosome,fpos,strand) = self.__fw_parse_line(thisline)
        File "/usr/local/lib/python2.6/site-packages/MACS/IO/__init__.py", line 1500, in __fw_parse_line
          bwflag = int(thisfields[1])
      ValueError: invalid literal for int() with base 10: 'SL-XAU_2_FC30E0LAAXX:1:9:538:1739'
      I really believe my SAM-file is not correct. I'll restart the BOWTIE mapping when I'm on my computer.

      Maxim

      Comment


      • #4
        Originally posted by Maxim View Post
        I really believe my SAM-file is not correct. I'll restart the BOWTIE mapping when I'm on my computer.

        Maxim
        I agree... can you post the first, say, 2-5 lines of your alingment results?

        Comment


        • #5
          Hi,

          thank you for taking the time!!!

          I ran:
          ./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam

          Code:
          head -n 100 SRR027956_1.sam
          @HD	VN:1.0	SO:unsorted
          @SQ	SN:chr1	LN:247249719
          @SQ	SN:chr2	LN:242951149
          @SQ	SN:chr3	LN:199501827
          @SQ	SN:chr4	LN:191273063
          ......
          @PG	ID=Bowtie	VN=0.11.3	CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam"
          SSRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
          SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76	0	chr19	32642499	255	76M	*	0	0	GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA	<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,	XA:i:1	MD:Z:14A61	NM:i:1
          SRR027956.2 SL-XAU_2_FC30E0LAAXX:1:9:27:1722 length=76	0	chr2	102083878	255	76M	*	0	0	GCATGGTTAATTCTNTAGTGTAAAGAGACAAAGGGTAAACTAACAGAATGTGACACACTGGAAAGCATTTTGGAGA	<:<<<<<<<<<<.;!<<<<<6<;<<;<<<<3334,--,--------,--++-$,----*-,*,,-+,,,,,,,,-,	XA:i:1	MD:Z:14C19T17T23	NM:i:3
          SRR027956.6 SL-XAU_2_FC30E0LAAXX:1:9:26:1619 length=76	16	chrX	100648521	255	76M	*	0	0	TGCTTCCAAATTAATTTCAATTTCCTTTTAAAATTACTTTTAATTCCAGATTACTTACAATNCTCTTAACTATTTC	---,,,-,--,,--,,,---,,,--++,,----,,,--,,-44333;;<;99;<<<<<9<<!<<<;<<6<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
          SRR027956.5 SL-XAU_2_FC30E0LAAXX:1:9:9:1703 length=76	16	chr11	15738810	255	76M	*	0	0	TAGTATGCTCTTTTCTGAACACCCTAAGCTGCTCTTCTCAGAGGGGGAGAGGCTGGAGCTGNAGTTCCACAAAACC	,-,,,,,-,-,+,,-,----,---,-,--,--+-,,-,---43334<<<<<<<;<<<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61G14	NM:i:1
          That is how it looks like.
          Maxim

          P.S.: this is basically one fastq file of a paired read. Is it ok to process the runs independently in Bowtie?
          The problem with this dataset is that the mates will come from very different positions (or even chromosomes) as they are resuling from ligation reactions after chromosome conformation capture assay. As I do not know whether the standard aligners can handle this I thought to process the runs independently and to load them afterwards in my own database mate them again.
          Last edited by Maxim; 11-26-2009, 11:43 AM.

          Comment


          • #6
            Hi again,

            Originally posted by Maxim View Post
            Hi,

            thank you for taking the time!!!
            :-)

            Originally posted by Maxim View Post
            I ran:
            ./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam

            Code:
            head -n 100 SRR027956_1.sam
            @HD	VN:1.0	SO:unsorted
            @SQ	SN:chr1	LN:247249719
            @SQ	SN:chr2	LN:242951149
            @SQ	SN:chr3	LN:199501827
            @SQ	SN:chr4	LN:191273063
            ......
            @PG	ID=Bowtie	VN=0.11.3	CL="./bowtie hg18 -S reads/SRR027956_1.fastq -p 4 SRR027956_1.sam"
            SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76	16	chr18	39746869255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
            SRR027956.1 SL-XAU_2_FC30E0LAAXX:1:9:165:1796 length=76	0	chr19	32642499255	76M	*	0	0	GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA	<<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,	XA:i:1	MD:Z:14A61	NM:i:1
            That is how it looks like.
            Maxim
            At a first look I thought it could be an error in specifying options... But now I wonder read names in your fastq file contain spaces... Could it be that

            SRR027956.3 SL-XAU_2_FC30E0LAAXX:1:9:38:1713 length=76

            is made by
            SRR027956.3 SL-XAU_2_FC30E0LAAXX = instrument ID
            1 = lane
            9 = tile
            38, 1713 = cluster coordinates
            length=76 additional parameter added when the run has been started
            ?
            I assume it's an illumina run... I believe you should remove spaces in your input. You can try this

            Code:
            $ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam
            assuming that '-c' option can make bowtie read from a pipe (I had troubles with this just today o__O ).
            Also, bowtie outputs results to stdout so you should always redirect them, I believe it would be better pipe them into samtools to have a BAM and save some space.
            If the code above doesn't work, convert spaces to a file and give it to bowtie:

            Code:
            $ sed -e s/ /_/g reads/SRR027956_1.fastq > tmpfile
            $ bowtie -S -p 4 hg18 tmpfile > SRR027956_1.sam
            HTH

            d

            Comment


            • #7
              Whoops!

              Originally posted by dawe View Post
              Code:
              $ sed -e s/ /_/g reads/SRR027956_1.fastq | bowtie -S -p 4 -q -c hg18 - > SRR027956_1.sam
              You probably would escape spaces in sed expression:

              Code:
              $ sed -e s/\ /_/g
              you can also use perl do do this (without escaping)

              Code:
              $ perl -p -e "s/ /_/g"
              I've substituted spaces with underscore but you are free to choose whatever you like :-)
              d

              Comment


              • #8
                Hi,

                I ran the sed command, but I had to do it like:
                Code:
                sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile
                without the ticks it gave an error. But presumably the result differs from what you wanted, see below!

                the resulting fastq file looks like:

                @SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
                GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
                +SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
                <<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,

                I ran Bowtie and ran your samtools/awk command as suggested.

                Bowtie Output:
                Code:
                @HD	VN:1.0	SO:unsorted
                @SQ	SN:chr1	LN:247249719
                .....
                @SQ	SN:chrM	LN:16571
                @PG	ID=Bowtie	VN=0.11.3	CL="./bowtie -S -p 4 hg18 reads/tmpfile"
                SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
                Looks fine, or?

                But samtools fails:

                Code:
                ./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
                awk: line 2: function and never defined
                [bam_header_read] EOF marker is absent.
                [main_samview] fail to read the header.
                Isn't the EOF "ctrl-d"?

                gma
                Last edited by Maxim; 11-26-2009, 12:51 PM.

                Comment


                • #9
                  Originally posted by Maxim View Post
                  Hi,

                  I ran the sed command, but I had to do it like:
                  Code:
                  sed -e 's/ /_/g' SRR027956_1.fastq > tmpfile
                  without the ticks it gave an error. But presumably the result differs from what you wanted, see below!

                  the resulting fastq file looks like:

                  @SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
                  GGTGTATTATTCTCNCAGAGTTGAACCCTTATTTTGATTGAGAAGTTTGGAAACAGTTTTTTTGTAGAATCTGCAA
                  +SRR027956.1_SL-XAU_2_FC30E0LAAXX:1:9:165:1796_length=76
                  <<<<<<<<<<<<<<!<<<;<<<<<<<<<<<34344---------------,,----,----------,,----,,,
                  Nope, it is exactly what I meant. You see now the sequence IDs in a single word. When awk (or any other tool) tries to divide in column, the seq name is now a single value...

                  Originally posted by Maxim View Post

                  I ran Bowtie and ran your awk command

                  Bowtie Output:
                  Code:
                  @HD	VN:1.0	SO:unsorted
                  @SQ	SN:chr1	LN:247249719
                  .....
                  @SQ	SN:chrM	LN:16571
                  @PG	ID=Bowtie	VN=0.11.3	CL="./bowtie -S -p 4 hg18 reads/tmpfile"
                  SRR027956.3_SL-XAU_2_FC30E0LAAXX:1:9:38:1713_length=76	16	chr18	39746869	255	76M	*	0	0	GCCATGGTGGATTCTTATTGAATGTGTGCTGAGTAACAGATGACTTGATTAATATAAAATGNATAATATTTTCCCC	+,++,,,+,,,-&-+,,,,--,,-,-,,,,--,,-------43334<<<<<<<<<;<<<<<!<<<<<<<<<<<<<<	XA:i:1	MD:Z:61T14	NM:i:1
                  But samtools/awk fails:

                  Code:
                  ./samtools view -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
                  awk: line 2: function and never defined
                  [bam_header_read] EOF marker is absent.
                  [main_samview] fail to read the header.
                  Isn't the EOF "ctrl-d"?

                  gma
                  Sorry, I've usually put the samtools/awk after bam generation... You should add '-S' option to your samtools view command.

                  Code:
                  $ ./samtools view -S -F 0x0004 SRR027956_1.sam | awk '{OFS="\t"; if (and($2, 16)) print $3,$4,$4+length($10),$1,$5,"-"; else print $3,$4,$4+length($10),$1,$5,"+" }'
                  If the awk doesn't work... nevermind, you now should have SAM files to feed MACS with :-)

                  Comment


                  • #10
                    thanks again!

                    Awk still fails but I think I should be able to make it work. The goal is not only to parse it to a BED but to get the unique identifier from the run, too, in order to put it in the "name" field of the BED file.

                    Looks like I have to dive a little more into awk.

                    Despite of this you are right, for further processing the file is know suitable, MACS started smoothly, but it looks like it won't find any peaks, anyhow it will produce a .wig file!Good!

                    What I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.

                    Just by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?

                    Best wishes

                    Maxim
                    Last edited by Maxim; 11-26-2009, 01:23 PM.

                    Comment


                    • #11
                      Hi Maxim, I'm glad you solved your issue...
                      It seems you have many other now :-)

                      Originally posted by Maxim View Post
                      Awk still fails but I think I should be able to make it work.
                      Strange... is it possible you are not using GNU/awk (i.e. you are running on FreeBSD or Solaris machine)?

                      Originally posted by Maxim View Post
                      Looks like I have to dive a little more into awk.
                      Definitely! awk and other command line utilities are really useful for processing this kind of data... you can perform many simple task by piping text streams to those commands.

                      Originally posted by Maxim View Post
                      What I'm actually aiming for is a gene browser like representation of the coverage. So far I tried Gbrowse2, which did not show any data, perhaps due to the incorrectly formatted BAM files. Similarly IGV refuses to show any data, perhaps this is now getting better.
                      We have a local mirror of UCSC genome browser for data visualization... Essentially for the high annotation level on genomes we are working with. we keep data in bigWig format.
                      I've recommended to some people in our institution to use IGV for fast visualization of coverage data. Version 1.4 supports BAM files (actually I don't remember right now if you have to preprocess them, but I don't think so...).
                      In addition, you may want to output alignment results in BAM on the fly, like this:

                      Code:
                      $ bowtie [options] -S <ref> <file.fq> | samtools view -ut <st-ref> - | samtools sort - outfile
                      where st-ref is the fasta reference file indexed with

                      Code:
                      $ samtools faidx
                      You'll save a lot of space!

                      Originally posted by Maxim View Post
                      Just by chance: do you have a recommendation for an quick and easy way of making the Bowtie/Samtools output available for a common genome browser to look at the coverage (independent of ChIP-se like peak-building models depending on + and - reads like SISSRS/MACS etc)?

                      Best wishes

                      Maxim
                      I've been trying to pipe "samtools pileup" output to wig2bdg command from MACS distribution (modified to read stdin). Unfortunately it is rather slow and resulting bedGraph (analogous to wig) is not "optimized" for size. It should be said that I usually generate coverage tracks as side product of other analysis, so I rely on MACS or FindPeaks output.
                      As IGV, UCSC genome browser and GBrowse support BAM you will get a coverage representation just loading SAM/BAM files...

                      d

                      Comment


                      • #12
                        Hi,

                        just wanted to let you know that it was a matter of the awk installed on my system (Ubuntu).

                        I just added MAWK, works great (including retrieval of the cluster ID).
                        Additonally the BAM file I created based on the reformatted fastq file works in all downstream applications (GBrowse/IGV/MACS etc).!


                        Best

                        Maxim
                        Last edited by Maxim; 11-28-2009, 02:09 AM.

                        Comment

                        Working...
                        X