
  • Unfamiliar SAM file format outputted by Rockhopper program

    Hi,

    I'm using Rockhopper to analyze E. coli RNA-Seq data.

    I'm not familiar with the SAM format outputted by Rockhopper.
    Has anyone seen this format before, or does anyone have ideas on how to convert it to the traditional format, which I could then view in IGV or on the UCSC Genome Browser? I'm quite comfortable with both Python and R, but I really don't understand the current format, so I'm unable to convert it.
    The data is paired-end.

    Here are the first fourteen lines from the SAM file.
    I've put more lines in the attached file.

    Code:
    [blancha@lg-1r14-n04 samFiles]$ samtools view -h -f 2 IK_21C-EM9-1_R1.sam | more
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:gi|556503834|ref|NC_000913.3|	LN:4641652	SP:Escherichia coli str. K-12 substr. MG1655
    @PG	ID:Rockhopper	PN:Rockhopper	VN:2.03
    D69F08P1:403:C6Y8VACXX:5:1101:1436:2236 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	2527763	255	50M	=	2527927	213	TGGCAAATGGCATCCCGATGGCAAACATTCTGTTCCCCACATCGGTGATC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIII
    +	131	gi|556503834|ref|NC_000913.3|	2527927	255	49M	=	2527763	-213	CGCAACTGGTCCAGCCCCTGAAGCGTCCGCTTTAAGCTTTATCGGCGCT	BBBFFFFFFFFFFIIIIIIIFIIIIFIIIIIIIIIIIIIIIIIIIIIFF
    D69F08P1:403:C6Y8VACXX:5:1101:1606:2216 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3441734	255	50M	=	3441811	126	CGACAACCGTTATGAGGGATCGGAGTCACATCAGTAATGTTAGTGATGCG	BBBFBFF<F0<FFIIIIIF7FFFFFIIIIIIFFFBFFFF<FFFB7B7B<F
    +	131	gi|556503834|ref|NC_000913.3|	3441811	255	49M	=	3441734	-126	GAATCTGGAAGTTATGGTTAAAGGTCCGGGTCCAGGCCGCGAAACTACT	BBBBFBFFFBF<FFFIIB<FFFIBFFFFF7BBFFFFFIFFIFF<FFFFB
    D69F08P1:403:C6Y8VACXX:5:1101:1955:2210 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3471221	255	50M	=	3471324	152	CCCGTACGGTGGTGATTGCAGCGGTCAGAGTAGTTTTACCGTGGTCAACG	BBBFFFFFFFFFFFFIIIIIIIIIIIIIIIFFFIIIIFFIIIIIIIIIII
    +	131	gi|556503834|ref|NC_000913.3|	3471324	255	49M	=	3471221	-152	GCTCTCTCCTGAAGGGGAGAGCACTATAGTAAGGAATATAGCCGTGTCT	BBBFFFFFFFFFFIIIIIIIIIIIIIIFIFFIIIIIIIIIIIIIFIIII
    D69F08P1:403:C6Y8VACXX:5:1101:2133:2203 1:N:0:AGTCAAC	115	gi|556503834|ref|NC_000913.3|	1719838	255	50M	=	1719872	83	AAGAGACAGACCTACCATTGAAACAACCAATACGCGTTTAATCATTGAAA	BBBFFFFFFFFFFIIIIIIFFIIIFIIIIIIFFFBFBFFFIIIFFFFFFB
    +	179	gi|556503834|ref|NC_000913.3|	1719872	255	49M	=	1719838	-83	GCTTGCGTGGCGTTTCATGGTGAACAGGAGATTTTTCAATGATTAAACG	BBBFFFFFFFFFFFFFIIIIBFBFFIIIFFBFFFIIIIBFFBFIFBBFB
    D69F08P1:403:C6Y8VACXX:5:1101:1916:2222 1:N:0:AGTCAAC	67	gi|556503834|ref|NC_000913.3|	3444439	255	50M	=	3444490	100	CCCACGACCACCGGTTTTACCGAGGCCAGAACCGATACCACGACCCAGGC	BBBFFFFFFFFFFFFFFFFIFFII<BBFFFFIIFFIF<<<BF<BBFBF7B
    +	131	gi|556503834|ref|NC_000913.3|	3444490	255	49M	=	3444439	-100	TGCGTTTAAATACTCTGTCTCCGGCCGAAGGCTCCAAAAAGGCGGGTAA	BB<FFFFFFFFFFFBFFBBBFBFFFFFFFB7BFFIBFFFBFB<BBB0<B
    D69F08P1:403:C6Y8VACXX:5:1101:2117:2249 1:N:0:AGTCAAC	115	gi|556503834|ref|NC_000913.3|	639393	255	50M	=	639501	157	GGCGACGCCAACGCCGCTATGGCGTGAAAGACGAAGGAAATTTAGATTTT	<BBFBFFFBBFBFFFIFFBFFIIIIIFBFFIIIIF7<BF<BBBBBBBBB<
    +	179	gi|556503834|ref|NC_000913.3|	639501	255	49M	=	639393	-157	GTAAAATCAAAGCAGCACAGTACGTAGCTTCTCACCCAGGTGAAGTTTG	B<BFFFFFFFFFFFBFFFFBBFFFFFFIIIFFFIFFBFFFFIBFFBFFF
    Thank you for your help.
    Last edited by blancha; 07-09-2015, 04:11 PM. Reason: Put lines from SAM file in Code box

  • #2
    It mostly looks like a normal sam file; the specification is here: https://samtools.github.io/hts-specs/SAMv1.pdf

    However, the second line has "+" for the read name, which is odd to say the least. Can you run head on the input fastq file to show the first 8 lines?

    Edit - looking at the attachment, it appears that either you have an odd fastq file with read2 always named "+" or that Rockhopper has a bug causing it to incorrectly report the read name.
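
If the "+" really is just a placeholder for the second mate's name, one possible repair is to copy the name from the record immediately before it. The following is a minimal sketch under that assumption; the function name `repair_qnames` is hypothetical and is not part of Rockhopper or samtools. It also trims the FASTQ comment after the space (e.g. `1:N:0:AGTCAAC`), since the SAM specification does not allow whitespace in read names.

```python
def repair_qnames(lines):
    """Yield SAM lines in which a '+' read name is replaced by the name
    of the preceding record (assumed here to be its mate)."""
    last_name = None
    for line in lines:
        line = line.rstrip("\n")
        # Header lines (@HD, @SQ, @PG, ...) and blank lines pass through unchanged.
        if not line or line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        if fields[0] == "+":
            if last_name is None:
                raise ValueError("'+' record with no preceding named record")
            fields[0] = last_name
        else:
            # Keep only the part of the name before the first space.
            fields[0] = fields[0].split(" ", 1)[0]
            last_name = fields[0]
        yield "\t".join(fields)
```

A typical use would be to iterate over the open input file, write each repaired line plus a newline to a new file, and then sort and index the result with samtools before loading it into IGV. This only fixes the names; it does not change where the reads were aligned.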

    • #3
      Thank you Brian.
      You are correct in pointing out that the only problem with the format is the + sign on every other line.
      The + just corresponds to the paired FASTQ read.
      If this was the only issue I had with Rockhopper, I would be happy.

      The main problem I have is that when I view the alignments in IGV, at least half the reads consist mostly of mismatches relative to the reference genome.
      I've tried all the different settings, fr, ff, rf, and rr.
      I cannot figure out why Rockhopper insists on aligning reads in what appears to be the wrong location.
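
When checking orientation settings, it can help to decode the FLAG column of the records posted above (67, 131, 115, 179). This is a small sketch based on the bit definitions in the SAM specification; the names `FLAG_BITS` and `decode_flag` are made up for illustration.

```python
# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

for flag in (67, 131, 115, 179):
    print(flag, decode_flag(flag))
```

In these records, 67 and 131 have neither strand bit set, while 115 and 179 have both `reverse` and `mate_reverse` set, so each pair is reported with both mates on the same strand. For a standard paired-end library the two mates would usually be on opposite strands, so this may be worth checking, although nothing here establishes that it explains the mismatches seen in IGV.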

      I think I'll just give up on the software, even if it appears to be widely used in respected publications for E. coli RNA-Seq analysis.
