Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how do I output the CS tag for BWA align of SOLID reads?

    Hi,
    I am trying to use GATK to run some analysis on BWA aligned colorspace reads but I encountered this problem described here


    basically "org.broadinstitute.sting.utils.StingException: Unable to find color space information in SOLiD read."

    I am aware that BFAST generates this info. How do I make BWA generate this tag as well?
    http://kevin-gattaca.blogspot.com/

  • #2
    Originally posted by KevinLam View Post
    Hi,
    I am trying to use GATK to run some analysis on BWA aligned colorspace reads but I encountered this problem described here


    basically "org.broadinstitute.sting.utils.StingException: Unable to find color space information in SOLiD read."

    I am aware that BFAST generates this info. How do I make BWA generate this tag as well?
    It doesn't. Contact the developer if you want this feature in the future.

    Comment


    • #3
      Originally posted by nilshomer View Post
      It doesn't. Contact the developer if you want this feature in the future.
      Hi Nils, can you share how bfast generates the CS tag?

      I think I probably will have to write a post alignment script to add this tag
      http://kevin-gattaca.blogspot.com/

      Comment


      • #4
        Originally posted by KevinLam View Post
        Hi Nils, can you share how bfast generates the CS tag?

        I think I probably will have to write a post alignment script to add this tag
        The CSFASTQs I use as input are not "double-encoded" but instead contain the origin color sequence (unadulterated) and color qualities. This allows the CS/CQ tags to filled out easily. BWA "double-encodes" the color sequence and does some trimming too, thus making it impossible to recover the CS/CQ without going back to the original data (CSFASTA and QUAL).

        Comment


        • #5
          I see...
          So far, I only have these info. so presumably I have to reverse the order of the original CS and CQ if the read is mapped in another direction?

          the trimming bit might indeed be a problem though even if trying to construct a query db of the original csfasta and qual files isn't computationally intensive.


          Color read sequence on the same strand as the reference 4
          CS Z
          Color read quality on the same strand as the reference; encoded in the same way as <QUAL> 4
          CQ Z

          On a raw SOLiD read, the first nucleotide is the primer base and the first color is the one between the primer base
          and the first nucleotide from the sample being sequenced. The primer base and the first color must be present in CS.
          http://kevin-gattaca.blogspot.com/

          Comment


          • #6
            Originally posted by KevinLam View Post
            I see...
            So far, I only have these info. so presumably I have to reverse the order of the original CS and CQ if the read is mapped in another direction?

            the trimming bit might indeed be a problem though even if trying to construct a query db of the original csfasta and qual files isn't computationally intensive.


            Color read sequence on the same strand as the reference 4
            CS Z
            Color read quality on the same strand as the reference; encoded in the same way as <QUAL> 4
            CQ Z

            On a raw SOLiD read, the first nucleotide is the primer base and the first color is the one between the primer base
            and the first nucleotide from the sample being sequenced. The primer base and the first color must be present in CS.
            I don't ever match the direction of the CS/CQ tags to the reference, since the reverse (not compliment) is not symmetric. The adapter would then be the last base.

            Comment


            • #7
              Thanks Nils,
              I have an example of CS CQ tags

              VAB_S1332068_1358_1351 131 1 227 255 25M = 1373 1171 CTAACCCCTAACCCTAACCCTAAAC !A@B?@@@CAC?@?AAC?
              ??B@AA! RG:Z:TG133 CS:Z:G3230100023010023010023001 CQ:Z:<<<<<<<<<<;:<;* MD:Z:25 OQ:Z:!@@@@@@@@@@@@@@@@@@@@@@@!

              So am I correct in saying that CS and CQ are essentially the original csfasta and qual line?
              or at least bfast outputs it in this way?

              additionally
              but if I were to do the same I might have problems as bwa does trimming?
              http://kevin-gattaca.blogspot.com/

              Comment


              • #8
                Originally posted by KevinLam View Post
                Thanks Nils,
                I have an example of CS CQ tags

                VAB_S1332068_1358_1351 131 1 227 255 25M = 1373 1171 CTAACCCCTAACCCTAACCCTAAAC !A@B?@@@CAC?@?AAC?
                ??B@AA! RG:Z:TG133 CS:Z:G3230100023010023010023001 CQ:Z:<<<<<<<<<<;:<;* MD:Z:25 OQ:Z:!@@@@@@@@@@@@@@@@@@@@@@@!

                So am I correct in saying that CS and CQ are essentially the original csfasta and qual line?
                or at least bfast outputs it in this way?

                additionally
                but if I were to do the same I might have problems as bwa does trimming?
                They should be the qualities from the csfasta/qual lines. I don't think it has to be matched to the trimming. When it means "original", it means unaltered "original" values IMHO.

                Comment


                • #9
                  Solution?

                  Hi Kevin,

                  Did you work out a program to integrate the color space data into the SAM files for BWA alignments? If so, could you share? I am running into the same issue, and would like avoid re-implementation if possible.

                  Thanks,
                  Todd

                  Comment


                  • #10
                    Hi Todd,
                    Unfortunately, I decided to switch mapper for SOLID reads in the end.
                    I am trying to find the fastq indexer program that I intended to use for this with scripts to post process the bam
                    but I can only find this http://ivory.idyll.org/blog/mar-10/s...ving-sequences

                    Hope it helps!
                    http://kevin-gattaca.blogspot.com/

                    Comment


                    • #11
                      Ah found it!
                      This is a long over due tool for those trying to do non-typical analysis with your reads. Finally you can index and compress your NGS reads...
                      http://kevin-gattaca.blogspot.com/

                      Comment


                      • #12
                        Thanks Kevin! I will take a look.

                        BTW, did you ever try the BWA alignment from within BFAST (bfast+bwa)? That would seem to solve the problem -- assuming it includes the CS and CQ tags. I will be trying that soon.

                        Todd

                        Comment


                        • #13
                          bfast+bwa only replaces the match step. Postprocess is the same as in traditional bfast. You will have those tags.
                          -drd

                          Comment


                          • #14
                            Just in case anyone's still interested:
                            I wrote a small (rather dirty) python script to integrate those tags. It is slow and relies on the fact that bwa solid2fastq, bwa aln and bwa samse step doesn't change the order of the reads. Anyway it might be of interest to someone...


                            Code:
                            #! /usr/bin/python
                            
                            #ADD CS and CQ tags from original CSfasta and csqual file
                            
                            import sys
                            
                            #try getting file names from comand line
                            try:
                              SAMfile = sys.argv[1]
                              csfastafile = sys.argv[2]
                              qualfile = sys.argv[3]
                              outputfile=sys.argv[4]
                            except: 
                              print ("Usage: ./add_CSCQ.py <input SAM> <input csfasta> <input csqual> <output SAM>")
                              sys.exit()
                            
                            #try open files specified in command line
                            try:
                               SAM = open(SAMfile)
                               csfasta = open (csfastafile)
                               qual = open (qualfile)
                               output = open (outputfile, "w")
                            except:
                              print ("Couldn't open SAMfile")
                              sys.exit()
                            
                            #reading the first lines of the three files
                            SAMline = SAM.readline()
                            cs = csfasta.readline()
                            
                            #iterate till no comment
                            startcs=cs[0:1]
                            while startcs=='#':
                              cs=csfasta.readline()
                              startcs=cs[0:1]
                              
                            cq = qual.readline()
                            startcq=cq[0:1]
                            while startcq=='#':
                              cq=qual.readline()
                              startcq=cq[0:1]
                            
                            count = 0
                            
                            #iterate through all the files and add CS / CQ tags in the reads
                            #assuming solid2fastq didn't change the order of the reads
                            while SAMline:
                              info=SAMline.split()
                              start=SAMline[0:1]
                              alignment=SAMline[:-1]
                              
                              if (((count % 100000) == 0) and (count != 0)):
                                print count, "alignments processed"
                            
                            #print out header section
                              if start == '@':    
                                output.write(alignment+'\n')
                            
                            #print alignment section and add CS and CQ tags
                              else:
                                #read csfasta file until no comments
                                cs=csfasta.readline()
                                cq=qual.readline()
                                cs="\tCS:Z:"+cs
                                cs=cs[:-1]
                                
                                #encode quals to sanger quals
                                cq=cq[:-1]
                                intquals=cq.split()
                                asciiquals=""
                                for quality in intquals:
                                   quality = int (quality)
                                   ascii=chr(quality+33)
                                   asciiquals=asciiquals+ascii
                                cq="\tCQ:Z:"+asciiquals
                                
                                #write alignment to file
                                output.write(alignment+cs+cq+'\n')
                                cs=csfasta.readline()
                                cq=qual.readline()
                              SAMline = SAM.readline()
                              count = count + 1
                              
                            SAM.close()
                            csfasta.close()
                            qual.close()

                            Comment


                            • #15
                              Nice! Have u tried it with gatk?
                              http://kevin-gattaca.blogspot.com/

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM
                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                09-09-2024, 10:59 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X