Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • owwa
    Member
    • Jul 2012
    • 23

    how to create a custom bed file

    I have a BAM file and also the related refference genome (fasta), and now I need to extract all the position for some tags, such as "GCTTCT" in the genome, and then make a custom bed file for these tags as the following:

    chr1 127471196 127472363 +
    chr1 127472363 127473530 +
    chr1 127473530 127474697 +
    chrx 127474697 127475864 +
    chr2 127475864 127477031 -
    chr3 127477031 127478198 -
    chr5 127478198 127479365 -
    chr7 127479365 127480532 +
    chr7 127480532 127481699 -

    Your help will be greatly appreciated!
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    Originally posted by owwa View Post
    I have a BAM file and also the related refference genome (fasta), and now I need to extract all the position for some tags, such as "GCTTCT" in the genome, and then make a custom bed file for these tags as the following:

    chr1 127471196 127472363 +
    chr1 127472363 127473530 +
    chr1 127473530 127474697 +
    chrx 127474697 127475864 +
    chr2 127475864 127477031 -
    chr3 127477031 127478198 -
    chr5 127478198 127479365 -
    chr7 127479365 127480532 +
    chr7 127480532 127481699 -

    Your help will be greatly appreciated!
    Hello,

    I'm not sure why you mention the bam file... If you want to get the position of a pattern in a fasta file I happen to write a while ago a simple python script (attached) which searches a fasta file for a regular expression and returns the position of the matches in bed format. (You need python2.7 or any python with argparse module)

    After unzipping fastaRegexFinder.py.gz you would do:

    Code:
    ## Default: Search tag on forward and reverse strand, case insensitive
    
    python fastaRegexFinder.py -r 'GCTTCT' -f reference.fasta
    ref1	3	9	ref1_3_9_for	6	+	GCTTCT
    ref2	0	6	ref2_0_6_rev	6	-	AGAAGC
    
    ## Where example.fasta looks like:
    cat example.fasta 
    >ref1
    NNNGCTTCTACTG
    >ref2
    AGAAGCNNNN
    It's not superfast as it is pure pyhton but it should suite most needs. In fact, if you don't care searching for regular expressions you could use some general aligner tools like vmatch.

    Dario
    Attached Files

    Comment

    • owwa
      Member
      • Jul 2012
      • 23

      #3
      Originally posted by dariober View Post
      Hello,

      I'm not sure why you mention the bam file... If you want to get the position of a pattern in a fasta file I happen to write a while ago a simple python script (attached) which searches a fasta file for a regular expression and returns the position of the matches in bed format. (You need python2.7 or any python with argparse module)

      After unzipping fastaRegexFinder.py.gz you would do:

      Code:
      ## Default: Search tag on forward and reverse strand, case insensitive
      
      python fastaRegexFinder.py -r 'GCTTCT' -f reference.fasta
      ref1	3	9	ref1_3_9_for	6	+	GCTTCT
      ref2	0	6	ref2_0_6_rev	6	-	AGAAGC
      
      ## Where example.fasta looks like:
      cat example.fasta 
      >ref1
      NNNGCTTCTACTG
      >ref2
      AGAAGCNNNN
      It's not superfast as it is pure pyhton but it should suite most needs. In fact, if you don't care searching for regular expressions you could use some general aligner tools like vmatch.

      Dario
      Hi Dario,

      thank you so much for your kind help! iI will try it.
      I just want to extract the read count from the BAM file for the position where has the tag GCTTCT or some other tags,
      Last edited by owwa; 07-03-2013, 04:42 AM.

      Comment

      • owwa
        Member
        • Jul 2012
        • 23

        #4
        Originally posted by dariober View Post
        Hello,

        I'm not sure why you mention the bam file... If you want to get the position of a pattern in a fasta file I happen to write a while ago a simple python script (attached) which searches a fasta file for a regular expression and returns the position of the matches in bed format. (You need python2.7 or any python with argparse module)

        After unzipping fastaRegexFinder.py.gz you would do:

        Code:
        ## Default: Search tag on forward and reverse strand, case insensitive
        
        python fastaRegexFinder.py -r 'GCTTCT' -f reference.fasta
        ref1	3	9	ref1_3_9_for	6	+	GCTTCT
        ref2	0	6	ref2_0_6_rev	6	-	AGAAGC
        
        ## Where example.fasta looks like:
        cat example.fasta 
        >ref1
        NNNGCTTCTACTG
        >ref2
        AGAAGCNNNN
        It's not superfast as it is pure pyhton but it should suite most needs. In fact, if you don't care searching for regular expressions you could use some general aligner tools like vmatch.

        Dario
        The script works well. Thanks!

        if there is a fix for palindromic sequence, it will be better! for example, when I use the palindromic sequence tag CCGG, it return repeated records:

        CL1.Contig1_243 size 1223 gap 0 0% 311 315 CL1.Contig1_243 size 1223 gap 0 0%_311_315_for 4 + CCGG
        CL1.Contig1_243 size 1223 gap 0 0% 311 315 CL1.Contig1_243 size 1223 gap 0 0%_311_315_rev 4 - CCGG
        CL1.Contig1_243 size 1223 gap 0 0% 1056 1060 CL1.Contig1_243 size 1223 gap 0 0%_1056_1060_for 4 + CCGG
        CL1.Contig1_243 size 1223 gap 0 0% 1056 1060 CL1.Contig1_243 size 1223 gap 0 0%_1056_1060_rev 4 - CCGG
        CL1.Contig2_243 size 879 gap 0 0% 311 315 CL1.Contig2_243 size 879 gap 0 0%_311_315_for 4 + CCGG
        CL1.Contig2_243 size 879 gap 0 0% 311 315 CL1.Contig2_243 size 879 gap 0 0%_311_315_rev 4 - CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 513 517 CL2.Contig1_243 size 1389 gap 0 0%_513_517_for 4 + CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 513 517 CL2.Contig1_243 size 1389 gap 0 0%_513_517_rev 4 - CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 622 626 CL2.Contig1_243 size 1389 gap 0 0%_622_626_for 4 + CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 622 626 CL2.Contig1_243 size 1389 gap 0 0%_622_626_rev 4 - CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 867 871 CL2.Contig1_243 size 1389 gap 0 0%_867_871_for 4 + CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 867 871 CL2.Contig1_243 size 1389 gap 0 0%_867_871_rev 4 - CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 874 878 CL2.Contig1_243 size 1389 gap 0 0%_874_878_for 4 + CCGG
        CL2.Contig1_243 size 1389 gap 0 0% 874 878 CL2.Contig1_243 size 1389 gap 0 0%_874_878_rev 4 - CCGG
        Last edited by owwa; 07-03-2013, 11:40 PM.

        Comment

        • dariober
          Senior Member
          • May 2010
          • 311

          #5
          Originally posted by owwa View Post
          The script works well. Thanks!

          if there is a fix for palindromic sequence, it will be better! for example, when I use the palindromic sequence tag CCGG, it return repeated records:

          CL1.Contig1_243 size 1223 gap 0 0% 311 315 CL1.Contig1_243 size 1223 gap 0 0%_311_315_for 4 + CCGG
          CL1.Contig1_243 size 1223 gap 0 0% 311 315 CL1.Contig1_243 size 1223 gap 0 0%_311_315_rev 4 - CCGG
          CL1.Contig1_243 size 1223 gap 0 0% 1056 1060 CL1.Contig1_243 size 1223 gap 0 0%_1056_1060_for 4 + CCGG
          CL1.Contig1_243 size 1223 gap 0 0% 1056 1060 CL1.Contig1_243 size 1223 gap 0 0%_1056_1060_rev 4 - CCGG
          CL1.Contig2_243 size 879 gap 0 0% 311 315 CL1.Contig2_243 size 879 gap 0 0%_311_315_for 4 + CCGG
          CL1.Contig2_243 size 879 gap 0 0% 311 315 CL1.Contig2_243 size 879 gap 0 0%_311_315_rev 4 - CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 513 517 CL2.Contig1_243 size 1389 gap 0 0%_513_517_for 4 + CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 513 517 CL2.Contig1_243 size 1389 gap 0 0%_513_517_rev 4 - CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 622 626 CL2.Contig1_243 size 1389 gap 0 0%_622_626_for 4 + CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 622 626 CL2.Contig1_243 size 1389 gap 0 0%_622_626_rev 4 - CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 867 871 CL2.Contig1_243 size 1389 gap 0 0%_867_871_for 4 + CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 867 871 CL2.Contig1_243 size 1389 gap 0 0%_867_871_rev 4 - CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 874 878 CL2.Contig1_243 size 1389 gap 0 0%_874_878_for 4 + CCGG
          CL2.Contig1_243 size 1389 gap 0 0% 874 878 CL2.Contig1_243 size 1389 gap 0 0%_874_878_rev 4 - CCGG
          Hi- glad you made it work! So with palindromic patterns you want to search only one strand, right? You can use the --noreverse flag:

          Code:
          echo ">ref1
          NNNGCTTCTACTG
          >ref2
          AGAAGCNNNN
          NNNCCGG" | fastaRegexFinder.py -f - -r 'CCGG' --noreverse
          ref2	13	17	ref2_13_17_for	4	+	CCGG
          Type "fastaRegexFinder.py -h" to see the program's help

          By the way, if you want to count how many reads in your bam file overlap the positions of your "tag" you could use bedtools coverageBed:

          Code:
          fastaRegexFinder.py -f myseq.fa -r 'CCGG' --noreverse | coverageBed -abam myaln.bam -b - -counts > tag.count.bed
          Dario
          Last edited by dariober; 07-04-2013, 12:08 AM.

          Comment

          • owwa
            Member
            • Jul 2012
            • 23

            #6
            Originally posted by dariober View Post
            Hi- glad you made it work! So with palindromic patterns you want to search only one strand, right? You can use the --noreverse flag:

            Code:
            echo ">ref1
            NNNGCTTCTACTG
            >ref2
            AGAAGCNNNN
            NNNCCGG" | fastaRegexFinder.py -f - -r 'CCGG' --noreverse
            ref2	13	17	ref2_13_17_for	4	+	CCGG
            Type "fastaRegexFinder.py -h" to see the program's help

            By the way, if you want to count how many reads in your bam file overlap the positions of your "tag" you could use bedtools coverageBed:

            Code:
            fastaRegexFinder.py -f myseq.fa -r 'CCGG' --noreverse | coverageBed -abam myaln.bam -b - -counts > tag.count.bed
            Dario
            Thank you so much Dario!

            Actually, I need to search both + and - strand for palindromic patterns. Could you please help again?

            By the way, how about intersectBed for the read counts. Thanks!
            Last edited by owwa; 07-04-2013, 04:53 PM.

            Comment

            • dariober
              Senior Member
              • May 2010
              • 311

              #7
              Originally posted by owwa View Post
              Actually, I need to search both + and - strand for palindromic patterns. Could you please help again?
              If you want to search both strands than do not use the --noreverse flag and for palindromic patterns, by definition, you will get the same position twice (on + and -). But it seems this is not what you wanted in your previous post.

              By the way, how about intersectBed for the read counts. Thanks!
              You would have to convert your bam file to bed and pass it to intersectBed as the b-file with -c flag. This will count for each feature in the a-file the number of hits with the b-file. But why not to use coverageBed that it is designed for this purpose?

              Dario

              Comment

              • owwa
                Member
                • Jul 2012
                • 23

                #8
                Originally posted by dariober View Post
                If you want to search both strands than do not use the --noreverse flag and for palindromic patterns, by definition, you will get the same position twice (on + and -). But it seems this is not what you wanted in your previous post.



                You would have to convert your bam file to bed and pass it to intersectBed as the b-file with -c flag. This will count for each feature in the a-file the number of hits with the b-file. But why not to use coverageBed that it is designed for this purpose?

                Dario
                got it. Many thanks to Dario!

                Comment

                • B124816A
                  Junior Member
                  • Aug 2010
                  • 1

                  #9
                  Hi, I have a related task:

                  I have a tab-delimited list of primers of a known gene:
                  <name 1> <sequence 1>
                  <name 2> <sequence 2>
                  etc.
                  We know on which chromosome the gene is, of course.

                  I would like to run the entire list and get a BED file of the primer locations:
                  chr"n" Start Stop <name 1> 0 +or- Start Stop
                  chr"n" Start Stop <name 2> 0 +or- Start Stop
                  etc.

                  The BED file is used to show the primers on a custom annotation track in the Alamut program. I am looking for a way to do this with a batch of primers instead of copying and pasting them in a dialog box one by one. I tried BLAT but that requires primers of at least 20 nt length, which we don't always have.

                  Thank you!

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    Yesterday, 10:05 AM
                  • SEQadmin2
                    Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                    by SEQadmin2


                    With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                    Introduction

                    Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                    05-22-2026, 06:42 AM
                  • SEQadmin2
                    Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                    by SEQadmin2

                    Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                    Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                    05-06-2026, 09:04 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 12:03 PM
                  0 responses
                  17 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, Yesterday, 11:40 AM
                  0 responses
                  13 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-28-2026, 11:40 AM
                  0 responses
                  29 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-26-2026, 10:12 AM
                  0 responses
                  31 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...