Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Illumina - mapping genome on reference - how to extract assembly?

    Hello,

    I'm having trouble extracting assembled reads after using soapaligner.

    I mapped my raw reads to my reference genome using soap aligner, how do I then extract the assembly and submit for annotation? I tried converting soap to sam format, then using,

    # This is for extracting mapping reads from bam files
    samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam

    converted sam to bam and bam to fastq, then fastq to fasta -
    I need a genebank file to upload sequences to http://rast.nmpdr.org/rast.cgi

    Please help!

    bgansw

  • #2
    Technically thats not really an assembly.

    TO get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):

    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    Comment


    • #3
      Originally posted by jimmybee View Post
      To get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):
      But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

      I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.

      Comment


      • #4
        Hello Jimmybee and Gringer,

        Thanx for your guidance with this.

        I've been trying the commands (samtools, bcftools).. and I'm getting an error:

        bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
        [mpileup] 1 samples in 1 input files
        <mpileup> Set max per-file depth to 8000
        [afs] 0:0.000 1:0.000 2:0.000
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.

        I've tried installing samtools 1.19, and running this.. but I keep getting the same error.

        Do you'll know what I'm doing wrong??

        Really really appreciate your help. Thanx v v much

        Bgansw

        Comment


        • #5
          Originally posted by Bgansw View Post

          Code:
          bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
          ...
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          I recommend checking your file names. I get an error one line after this when I give an incorrect fasta file name to samtools mpileup. There is a chance that your fasta file is called
          Code:
          533240.5.fa
          rather than
          Code:
          533240.5.f[b][i]n[/i][/b]a

          Comment


          • #6
            Hi Gringer,

            Thanx f the tip. I've gone over all the file names, and have changed the 53340.5.fna to 533240.5.fa but I'm still getting the same error. Does this command work well for you? I'm trying to figure out what the problem could be.

            Do u know if theres any other way to extract the consensus sequence after mapping one genome on another using soap aligner?

            Comment


            • #7
              Does this command work well for you?
              I have a slightly different processing script, but the command you have seems to work for me on my mitochondrial data (i.e. it produces a fastq file at the end of it). Here's the actual command sequence that I run for my processing:

              Code:
                  echo "** getting VCF for ${y} (max depth = ${maxdepth}) **";
                  pv results/${y}.bam | samtools mpileup -L ${maxdepth} -uf db/hs_mtDNA_ref.fasta - | \
                      bcftools view -v -g - > results/${y}.vcf
                  echo "** getting consensus FASTQ for ${y} **";
                  perl scripts/vcf2fq.pl -Q 20 -L 20 -f db/hs_mtDNA_ref.fasta results/${y}.vcf > results/con_${y}.fastq
                  echo "** converting ${y} consensus FASTQ to plain FASTA **";
                  perl scripts/fastq2fasta.pl results/con_${y}.fastq | perl -pe 'tr/acgt/ACGT/' > results/con_${y}.fasta
              I've attached my modified vcf2fq script to this post [also the fairly trivial fastq2fasta], maybe it will help to diagnose the problem.
              Attached Files

              Comment


              • #8
                Originally posted by gringer View Post
                But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

                I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.
                Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?

                Comment


                • #9
                  Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?
                  Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.

                  Comment


                  • #10
                    Originally posted by gringer View Post
                    Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.
                    hi gringer, thanks for your reply. I use standard vcfutils.pl of samtools(Version: 0.1.19-44428cd). Here is my code:

                    Code:
                    samtools mpileup -uf ${refFile} ${uniquelyMappedBAMFile} > ${mpileupFile}
                    bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}
                    seqtk seq -l 0 -A ${cnsFileFQ} > ${cnsFileFA}
                    the cns.fq file contains "A,T,C,G,a,t,c,g,n". but i check length of sequences in fastq file, it doesn't same with length in reference sequence. so i guess the error coming from command "bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}". i will check the command again.
                    Last edited by xmubingo; 01-12-2014, 04:26 PM.

                    Comment


                    • #11
                      Is this behaviour consistent regardless of what BAM file you use?

                      Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.

                      Comment


                      • #12
                        Do not use a reference sequence to assembly. They are different. If you want to get a consistent after soap,you can use SOAPsnp. This software will generate a new consistent sequence for you.

                        Comment


                        • #13
                          Originally posted by gringer View Post
                          Is this behaviour consistent regardless of what BAM file you use?

                          Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.
                          Thanks, gringer!! I think you figure out the problem. I use the following command to check the input of vcf2fq. I found that the output of bcftools did not cover each base.
                          Code:
                          bcftools view -cg ${mpileupFile} | more
                          For example, i pick up several lines of bcftools ouput. You can see that contig "AGTP01133720.1" ends in position of 23,295bp, but actually "AGTP01133720.1" has a length of 23,339bp. More clearly, contig "AGTP01133827.1" starts in position of 30bp, not in the beginning.
                          Code:
                          AGTP01133720.1  23292   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23293   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23294   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23295   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  30      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  31      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  32      .       C       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  33      .       T       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  34      .       A       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  35      .       G       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          So, gringer, what is the cause of my problem?
                          1. the bam file i used? for reads didn't cover each base in reference sequence.
                          2. or the parameters i used in "samtools mpileup"?

                          Great thanks!!
                          Last edited by xmubingo; 01-12-2014, 05:51 PM.

                          Comment


                          • #14
                            Originally posted by xmubingo View Post
                            1. the bam file i used? for reads didn't cover each base in reference sequence.
                            2. or the parameters i used in "samtools mpileup"?
                            This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                            Code:
                            samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less

                            Comment


                            • #15
                              Originally posted by gringer View Post
                              This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                              Code:
                              samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less
                              No, you can see:
                              Code:
                              AGTP01133720.1  23292   A       1       ,       >
                              AGTP01133720.1  23293   A       1       ,       ;
                              AGTP01133720.1  23294   A       1       ,       9
                              AGTP01133720.1  23295   A       1       ,$      8
                              AGTP01133827.1  30      G       1       ^S.     C
                              AGTP01133827.1  31      G       1       .       C
                              AGTP01133827.1  32      C       1       .       C
                              AGTP01133827.1  33      T       2       .^S,    F>
                              AGTP01133827.1  34      A       2       .,      FG
                              AGTP01133827.1  35      G       2       .,      FI

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Best Practices for Single-Cell Sequencing Analysis
                                by seqadmin



                                While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                                06-06-2024, 07:15 AM
                              • seqadmin
                                Latest Developments in Precision Medicine
                                by seqadmin



                                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                Somatic Genomics
                                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                05-24-2024, 01:16 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 07:24 AM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 08:58 AM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-12-2024, 02:20 PM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-07-2024, 06:58 AM
                              0 responses
                              184 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X