  • Unfamiliar SAM file format outputted by Rockhopper program

    Hi,

    I'm using Rockhopper to analyze E. coli RNA-Seq data.
    rockhopper, rna-seq, rnaseq, analysis, bacteria, bacterial, bioinformatics

    I'm not familiar with the SAM format that Rockhopper outputs.
    Has anyone seen this format before, or does anyone have ideas on how to convert it to the standard SAM format, which I could then view in IGV or on the UCSC Genome Browser? I'm quite comfortable with both Python and R, but I don't understand the current format, so I'm unable to convert it.
    The data is paired-end.

    Here are the first lines from the SAM file (three header lines and twelve alignment records).
    I've put more lines in the attached file.

    Code:
    [blancha@lg-1r14-n04 samFiles]$ samtools view -h -f 2 IK_21C-EM9-1_R1.sam | more
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:gi|556503834|ref|NC_000913.3|	LN:4641652	SP:Escherichia coli str. K-12 substr. MG1655
    @PG	ID:Rockhopper	PN:Rockhopper	VN:2.03
    D69F08P1:403:C6Y8VACXX:5:1101:1436:2236 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	2527763	255	50M	=	2527927	213	TGGCAAATGGCATCCCGATGGCAAACATTCTGTTCCCCACATCGGTGATC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIII
    +	131	gi|556503834|ref|NC_000913.3|	2527927	255	49M	=	2527763	-213	CGCAACTGGTCCAGCCCCTGAAGCGTCCGCTTTAAGCTTTATCGGCGCT	BBBFFFFFFFFFFIIIIIIIFIIIIFIIIIIIIIIIIIIIIIIIIIIFF
    D69F08P1:403:C6Y8VACXX:5:1101:1606:2216 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3441734	255	50M	=	3441811	126	CGACAACCGTTATGAGGGATCGGAGTCACATCAGTAATGTTAGTGATGCG	BBBFBFF<F0<FFIIIIIF7FFFFFIIIIIIFFFBFFFF<FFFB7B7B<F
    +	131	gi|556503834|ref|NC_000913.3|	3441811	255	49M	=	3441734	-126	GAATCTGGAAGTTATGGTTAAAGGTCCGGGTCCAGGCCGCGAAACTACT	BBBBFBFFFBF<FFFIIB<FFFIBFFFFF7BBFFFFFIFFIFF<FFFFB
    D69F08P1:403:C6Y8VACXX:5:1101:1955:2210 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3471221	255	50M	=	3471324	152	CCCGTACGGTGGTGATTGCAGCGGTCAGAGTAGTTTTACCGTGGTCAACG	BBBFFFFFFFFFFFFIIIIIIIIIIIIIIIFFFIIIIFFIIIIIIIIIII
    +	131	gi|556503834|ref|NC_000913.3|	3471324	255	49M	=	3471221	-152	GCTCTCTCCTGAAGGGGAGAGCACTATAGTAAGGAATATAGCCGTGTCT	BBBFFFFFFFFFFIIIIIIIIIIIIIIFIFFIIIIIIIIIIIIIFIIII
    D69F08P1:403:C6Y8VACXX:5:1101:2133:2203 1:N:0:AGTCAAC	115	gi|556503834|ref|NC_000913.3|	1719838	255	50M	=	1719872	83	AAGAGACAGACCTACCATTGAAACAACCAATACGCGTTTAATCATTGAAA	BBBFFFFFFFFFFIIIIIIFFIIIFIIIIIIFFFBFBFFFIIIFFFFFFB
    +	179	gi|556503834|ref|NC_000913.3|	1719872	255	49M	=	1719838	-83	GCTTGCGTGGCGTTTCATGGTGAACAGGAGATTTTTCAATGATTAAACG	BBBFFFFFFFFFFFFFIIIIBFBFFIIIFFBFFFIIIIBFFBFIFBBFB
    D69F08P1:403:C6Y8VACXX:5:1101:1916:2222 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3444439	255	50M	=	3444490	100	CCCACGACCACCGGTTTTACCGAGGCCAGAACCGATACCACGACCCAGGC	BBBFFFFFFFFFFFFFFFFIFFII<BBFFFFIIFFIF<<<BF<BBFBF7B
    +	131	gi|556503834|ref|NC_000913.3|	3444490	255	49M	=	3444439	-100	TGCGTTTAAATACTCTGTCTCCGGCCGAAGGCTCCAAAAAGGCGGGTAA	BB<FFFFFFFFFFFBFFBBBFBFFFFFFFB7BFFIBFFFBFB<BBB0<B
    D69F08P1:403:C6Y8VACXX:5:1101:2117:2249 1:N:0:AGTCAAC	115	gi|556503834|ref|NC_000913.3|	639393	255	50M	=	639501	157	GGCGACGCCAACGCCGCTATGGCGTGAAAGACGAAGGAAATTTAGATTTT	<BBFBFFFBBFBFFFIFFBFFIIIIIFBFFIIIIF7<BF<BBBBBBBBB<
    +	179	gi|556503834|ref|NC_000913.3|	639501	255	49M	=	639393	-157	GTAAAATCAAAGCAGCACAGTACGTAGCTTCTCACCCAGGTGAAGTTTG	B<BFFFFFFFFFFFBFFFFBBFFFFFFIIIFFFIFFBFFFFIBFFBFFF
    Thank you for your help.
    Last edited by blancha; 07-09-2015, 04:11 PM. Reason: Put lines from SAM file in Code box

  • #2
    It mostly looks like a normal sam file; the specification is here: https://samtools.github.io/hts-specs/SAMv1.pdf
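
    For readers unfamiliar with the layout, here is a minimal Python sketch that splits one alignment line into the eleven mandatory fields named in that specification. The function name `parse_sam_record` and the extra `TAGS` key are illustrative choices, not part of any library; the example record is the second alignment line from the excerpt in the first post.

```python
# The eleven mandatory SAM fields, in column order, as named in the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_record(line):
    """Split one tab-separated alignment line into a dict keyed by field name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError(f"expected at least 11 columns, got {len(columns)}")
    record = dict(zip(SAM_FIELDS, columns))
    # These five mandatory fields are integers.
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    # Anything after column 11 is an optional tag (hypothetical "TAGS" key).
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Second alignment line from the Rockhopper output quoted in the first post.
example = "\t".join([
    "+", "131", "gi|556503834|ref|NC_000913.3|", "2527927", "255", "49M",
    "=", "2527763", "-213",
    "CGCAACTGGTCCAGCCCCTGAAGCGTCCGCTTTAAGCTTTATCGGCGCT",
    "BBBFFFFFFFFFFIIIIIIIFIIIIFIIIIIIIIIIIIIIIIIIIIIFF",
])
record = parse_sam_record(example)
print(record["QNAME"], record["FLAG"], record["POS"], record["TLEN"])
```

    Every column parses as a standard field; only the QNAME value stands out.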

    However, the second alignment line has "+" as the read name, which is odd to say the least. Can you run head on the input FASTQ file to show the first 8 lines?

    Edit - looking at the attachment, it appears that either you have an odd fastq file with read2 always named "+" or that Rockhopper has a bug causing it to incorrectly report the read name.
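
    If the "+" names are the only obstacle to loading the file elsewhere, one possible workaround is to copy the read name from the preceding record. The Python sketch below does that under two assumptions: every "+" record is the mate of the record immediately before it (as in the posted excerpt), and the text after the space in the first read's name can be dropped, since the SAM specification does not allow whitespace in QNAME. The function name `fix_read_names` is hypothetical, and this has not been tested against a full Rockhopper output file.

```python
def fix_read_names(lines):
    """Yield SAM lines with a usable QNAME on every alignment record.

    Assumption: a record whose QNAME is "+" is the mate of the record
    immediately before it, as in the excerpt from the first post.
    """
    previous_name = None
    for line in lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "+":
            if previous_name is None:
                raise ValueError("'+' record with no preceding named read")
            fields[0] = previous_name
        else:
            # Keep only the text before the first space, because the SAM
            # specification does not allow whitespace in QNAME.
            fields[0] = fields[0].split(" ")[0]
            previous_name = fields[0]
        yield "\t".join(fields) + "\n"


# Header plus the first read pair from the excerpt in the first post.
ref = "gi|556503834|ref|NC_000913.3|"
demo = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "\t".join(["D69F08P1:403:C6Y8VACXX:5:1101:1436:2236 1:N:0:AGTCAAC", "67",
               ref, "2527763", "255", "50M", "=", "2527927", "213",
               "TGGCAAATGGCATCCCGATGGCAAACATTCTGTTCCCCACATCGGTGATC",
               "BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIII"]) + "\n",
    "\t".join(["+", "131", ref, "2527927", "255", "49M", "=", "2527763", "-213",
               "CGCAACTGGTCCAGCCCCTGAAGCGTCCGCTTTAAGCTTTATCGGCGCT",
               "BBBFFFFFFFFFFIIIIIIIFIIIIFIIIIIIIIIIIIIIIIIIIIIFF"]) + "\n",
]
fixed = list(fix_read_names(demo))
for line in fixed:
    print(line.split("\t")[0])
```

    To apply the same function to a whole file, open the input and output files and pass the input handle to `fix_read_names`, writing each yielded line to the output.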


    • #3
      Thank you Brian.
      You are correct that the only problem with the format is the + sign on every other line.
      The + simply marks the second read of each pair, i.e. the mate from the paired FASTQ file.
      If this were the only issue I had with Rockhopper, I would be happy.

      The main problem is that when I view the alignments in IGV, at least half of the reads consist mostly of mismatches relative to the reference genome.
      I've tried all four settings: fr, ff, rf, and rr.
      I cannot figure out why Rockhopper insists on aligning reads in what appears to be the wrong location.
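
      One way to see what orientation the file itself reports is to decode the FLAG column. The Python sketch below uses the bit definitions from the SAM specification (the short labels and the function name `decode_flag` are illustrative) and applies them to the four FLAG values that appear in the excerpt from the first post: 67, 131, 115, and 179.

```python
# Bit values from the FLAG table in the SAM specification; labels are shorthand.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# The four FLAG values seen in the posted excerpt.
for flag in (67, 131, 115, 179):
    print(flag, decode_flag(flag))
```

      In the excerpt, pairs flagged 67/131 have neither the reverse nor the mate-reverse bit set, and pairs flagged 115/179 have both set, so each pair is reported with both mates on the same strand. Whether that reflects the actual alignment strand or only how Rockhopper sets the flag is not something the excerpt can settle, but it may be worth checking when the reads look wrong in IGV.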

      I think I'll just give up on the software, even if it appears to be widely used in respected publications for E. coli RNA-Seq analysis.
