  • fulvio.dan
    Junior Member
    • Dec 2012
    • 3

    Converting RNA-Seq bam in fastq

    Hi everyone!
    I have to convert some RNA-Seq BAM files into the corresponding paired-end FASTQ files.
    I tried "samtools view" followed by Picard "SamToFastq":
    Code:
    samtools view -h -o sample.sam sample.bam
    Code:
    java -jar SamToFastq.jar INPUT=sample.sam FASTQ=sample_1.fastq SECOND_END_FASTQ=sample_2.fastq
    It resulted in this error:
    Code:
    Error parsing text SAM file. MRNM not specified but flags indicate mate mapped
    and empty fastq files.

    This is the sample.sam
    Code:
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:chr1	LN:249250621
    @SQ	SN:chr10	LN:135534747
    @SQ	SN:chr11	LN:135006516
    @SQ	SN:chr12	LN:133851895
    @SQ	SN:chr13	LN:115169878
    @SQ	SN:chr14	LN:107349540
    @SQ	SN:chr15	LN:102531392
    @SQ	SN:chr16	LN:90354753
    @SQ	SN:chr17	LN:81195210
    @SQ	SN:chr18	LN:78077248
    @SQ	SN:chr19	LN:59128983
    @SQ	SN:chr2	LN:243199373
    @SQ	SN:chr20	LN:63025520
    @SQ	SN:chr21	LN:48129895
    @SQ	SN:chr22	LN:51304566
    @SQ	SN:chr3	LN:198022430
    @SQ	SN:chr4	LN:191154276
    @SQ	SN:chr5	LN:180915260
    @SQ	SN:chr6	LN:171115067
    @SQ	SN:chr7	LN:159138663
    @SQ	SN:chr8	LN:146364022
    @SQ	SN:chr9	LN:141213431
    @SQ	SN:chrM_rCRS	LN:16569
    @SQ	SN:chrX	LN:155270560
    @SQ	SN:chrY	LN:59373566
    @RG	ID:110624_UNC14-SN744_0134_AD0CVTABXX_8_	PL:illumina	PU:barcode	LB:TruSeq	SM:110624_UNC14-SN744_0134_AD0CVTABXX_8_
    UNC14-SN744_134:8:2102:15138:99673/2	147	chr7	99998918	69	42M2357N8M	=	99998683	-2642	CCAAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCT	IIIGHGGGIJIIIGHAHGHH@JIGJJJIHEJJIJJJJHHHHHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2101:1447:161692/2	147	chr7	99998918	69	42M2357N8M	=	99998797	-2528	CCAAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCT	HGHGHDCHGGGIIDGIIHIHEIHGGJIGGHIIIJJJJHHGHHFFFFF@C@	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2207:13624:39322/2	147	chr7	99998920	69	40M2357N10M	=	99998689	-2638	AAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCT	@HF<JIGIJIHCCGD9CIGIHGGJIGDJIGJJJJJJJHHGHHEDDDFCB@	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2108:11461:118679/2	163	chr7	99998929	60	31M2357N19M	=	100001809	2930	CTTTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCC	CCCFFFFFFHHHHJJJJJJJJJIJJIJJJJJJIHIJJIJJIJJJJIJDIH	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1	XS:A:-
    UNC14-SN744_134:8:1107:2904:31086/1	99	chr7	99998929	60	31M2357N19M	=	100001809	2930	CTTTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCC	BCCFFFFFFHHHHJJJJJJJJJJJJJJIJIJJGHHJJJJJJJJJJJJJJJ	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1	XS:A:-
    UNC14-SN744_134:8:2107:8382:2405/1	83	chr7	99998936	69	24M2357N26M	=	99998696	-2647	GAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCCTTTTCCA	JJJIIJJJJIGGJJJJJIJJJJJJJIJJJIHHJIJJJHHHFHFFFFDCCB	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2106:3457:77846/1	83	chr7	99999623	69	42M474N8M	=	99998870	-1277	TCCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGC	DDDDDFHJJJJJJJJJJJJJJIJJJJJIGGD?JJJJJHHHHHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2107:3652:145199/2	163	chr7	99999624	69	41M474N9M	=	100001398	2216	CCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGCT	CCCFFFFFFGHHHJJJIJJJIJJJJJJJJGHIJGIEIGHIJJJIJIJGIG	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:1201:13771:91534/2	163	chr7	99999624	69	41M474N9M	=	100001333	1759	CCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGCT	BCCFFFFFHHGHHJHJFIJIGHIIHGIIHIIEIHHHIIJJIJIIGCGIIG	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2103:11276:160481/1	83	chr7	99999642	69	23M474N27M	=	99998948	-1218	TGTGCCTGCATCACCCCCAAGCCCTCTTGGCTTGGTTTTTTGGGTCTGTA	DEBFFFFHFHEB;IIIIIJJIJGGEIIJJIJIJJJIFHHHGHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    I understand that some lines have no MRNM specified, such as:
    Code:
    UNC14-SN744_134:8:2206:10660:87358/2	145	chr7	100001077	60	50M	*	0	0	ATCCGCTTCCCTCGGCCTCCCAAAGTGCTGGGATCACAGGCGTGAGCCAC	9:BBAF@5'HEAIJGIGEHF<HEBA;D@?HHGGBCA@AD<?4;FFFF@BB	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1
    but I don't understand why the other, correctly paired reads are not written to the output fastq files.
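
The Picard message seems to refer to records like the one quoted above. Flag 145 has the paired bit (0x1) set and the mate-unmapped bit (0x8) clear, yet the mate reference column (RNEXT, called MRNM in older versions of the SAM specification) is `*`. A minimal sketch for listing such records is below. The function name `find_mrnm_conflicts` is made up for illustration and is not part of samtools or Picard.

```shell
# Hypothetical helper (not part of samtools or Picard): list SAM records whose
# FLAG says "paired, mate mapped" while RNEXT (column 7, formerly MRNM) is "*".
find_mrnm_conflicts() {
    awk -F '\t' '
        /^@/ { next }                            # skip header lines
        {
            flag = $2 + 0
            paired        = flag % 2             # bit 0x1: read is paired
            mate_unmapped = int(flag / 8) % 2    # bit 0x8: mate is unmapped
            if (paired && !mate_unmapped && $7 == "*")
                print $1 "\t" flag               # read name and offending flag
        }
    ' "$@"
}
```

It reads SAM text from a file argument or from standard input, so something like `samtools view -h sample.bam | find_mrnm_conflicts` should show how many records are affected.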

    I also tried adding these two options to Picard SamToFastq
    Code:
    INCLUDE_NON_PF_READS=TRUE VALIDATION_STRINGENCY=SILENT
    and it resulted in all reads unpaired and empty fastq files.

    Then I tried another tool, TopHat2 bam2fastx.
    First, I sorted sample.bam by read name
    Code:
    samtools sort -n sample.bam sample_sn
    resulting in
    Code:
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:chr1	LN:249250621
    @SQ	SN:chr10	LN:135534747
    @SQ	SN:chr11	LN:135006516
    @SQ	SN:chr12	LN:133851895
    @SQ	SN:chr13	LN:115169878
    @SQ	SN:chr14	LN:107349540
    @SQ	SN:chr15	LN:102531392
    @SQ	SN:chr16	LN:90354753
    @SQ	SN:chr17	LN:81195210
    @SQ	SN:chr18	LN:78077248
    @SQ	SN:chr19	LN:59128983
    @SQ	SN:chr2	LN:243199373
    @SQ	SN:chr20	LN:63025520
    @SQ	SN:chr21	LN:48129895
    @SQ	SN:chr22	LN:51304566
    @SQ	SN:chr3	LN:198022430
    @SQ	SN:chr4	LN:191154276
    @SQ	SN:chr5	LN:180915260
    @SQ	SN:chr6	LN:171115067
    @SQ	SN:chr7	LN:159138663
    @SQ	SN:chr8	LN:146364022
    @SQ	SN:chr9	LN:141213431
    @SQ	SN:chrM_rCRS	LN:16569
    @SQ	SN:chrX	LN:155270560
    @SQ	SN:chrY	LN:59373566
    @RG	ID:110624_UNC14-SN744_0134_AD0CVTABXX_8_	PL:illumina	PU:barcode	LB:TruSeq	SM:110624_UNC14-SN744_0134_AD0CVTABXX_8_
    UNC14-SN744_134:8:1101:1284:144798/1	83	chr7	100276741	21	50M	=	100276627	-164	ATTTTTATTATATTTTCAGTTTTTCCATAAAGGAGCCAATTCCAACNCTG	###############################################CC@	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:1
    UNC14-SN744_134:8:1101:1284:144798/2	163	chr7	100276627	59	50M	=	100276741	164	CAGGAGGCCCTCATCCTTCTGCTGCCCTGGCGTTGGGGCCTCACCCCTCT	BCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJIJJHHIIHIJJJJJJJJJ	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:1
    UNC14-SN744_134:8:1101:1295:171825/1	99	chr7	100210452	69	50M	=	100210588	383	GTCCGGGGCCCCCTGGGCGGGGGTCCCGGGGCGCCCCTCCTCCCTTGGGA	@@BFF>DFHHHGHIJJIJJJJDD7@BBDDBBDBBBDDDDDDDDD8@CCD8	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1295:171825/2	147	chr7	100210588	69	32M197N18M	=	100210452	-383	TAACCCCACAGGAACTGCGCTTCGCTTCCGAGTCCTGTGCACAGCACCTG	AHGIIHHFGIIJJIJJJIIFCAJGHGGJJIGGGIIGGAHHHHDDD=F@@B	XF:Z:GTAG,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:+
    UNC14-SN744_134:8:1101:1296:110092/1	65	chr7	100417813	52	50M	*	0	0	CGGCACTGGCAGACGGCTGATCCAATGGTGTTAGAGTGGCTAATAGCTGG	@@@DDDDDHHHHFGADG@AGCBH*?:9D*::B>DHGBFHD9?B#######	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:2
    UNC14-SN744_134:8:1101:1296:110092/2	129	chr7	100417873	57	50M	*	0	0	CAGGACCCTTCTCCTGACAGGGGCTTGAAGGTGCCCTGGGCACTGGCAGG	CCCFFFFFHHHHHJJJGHIJJJJJJIA>GDH?BBHHBDGGB>B98B####	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:3
    UNC14-SN744_134:8:1101:1298:165228/1	83	chr7	100463356	69	50M	=	100459519	-3887	ACACGTTGGTCCTAGGTTTCTACGATGACGCTCCACCGCAGGACCATTTC	IGGJJJJIJJJIJIJJJJGJJIIJJJJJJJJIIHEIIHHHHHFFFFF@@B	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1298:165228/2	163	chr7	100459519	69	15M769N35M	=	100463356	3887	CCCTGGGAGACCTCGACTCCCTGCCCTCGGACCCTGTACAGCCGCAGTAT	CCCFFFFFHHHHHJIIJJJJIIJJJIJJJJJJJJJJHIHIJCHJIHIHHE	XF:Z:GTAG,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:+
    UNC14-SN744_134:8:1101:1306:60600/1	99	chr7	100417799	69	50M	=	100419893	2144	GGAAGTACCCGACGCGGCACTGGCAGACGGCTGATCCAATGGTGTTAGAG	BCCFFDDFHHHFHJJJJJJJGIJIIJ;F@FA@B=ACH;B;@C);.;;>C>	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1306:60600/2	147	chr7	100419893	69	50M	=	100417799	-2144	CTCGGCACTTGGTGTTCCCCTCAGCTGCCTCGAACCCCGGAGCACAGCTG	<B>HHECHFIIIHCHGIIIGGEIIJIIJJIJIHFJJJHHHHHFDFFFCCC	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    and then I used TopHat2 bam2fastx
    Code:
    bam2fastx -q -A -o sample.fastq -P -N sample_sn.bam
    resulting in this error
    Code:
    Error: couldn't retrieve both reads for pair UNC14-SN744_134:8:1101:1284:144798/1. Perhaps the input file is not sorted by name?
    (using 'samtools sort -n' might fix this)
    Could someone explain this issue? Do you have any suggestions?

    Thanks!
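
One way to look into the bam2fastx message is to count how many read-name templates do not have exactly two records. This is only a diagnostic sketch: the function name `count_incomplete_pairs` is hypothetical, it assumes mates are distinguished by a trailing /1 or /2 as in the listing above, and it assumes one alignment per read (the records shown carry IH:i:1). It also keeps every read name in memory, so it is meant for a subset rather than a full BAM.

```shell
# Hypothetical helper: count read-name templates (QNAME with a trailing /1 or /2
# removed) that do not occur exactly twice in a SAM stream.
count_incomplete_pairs() {
    awk -F '\t' '
        /^@/ { next }                    # skip header lines
        {
            name = $1
            sub(/\/[12]$/, "", name)     # drop the mate suffix, if present
            seen[name]++
        }
        END {
            bad = 0
            for (n in seen) if (seen[n] != 2) bad++
            print bad                    # number of templates without two records
        }
    ' "$@"
}
```

For example, `samtools view sample_sn.bam | head -n 100000 | count_incomplete_pairs` would give a rough idea of whether unpaired records are common in the file.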
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    If this is TCGA RNA-seq data from UNC then the following would work. Send me a PM if you have any problems.


    In certain circumstances, a small fraction of the sequences and quality scores in these reads are rearranged such that they cannot perfectly reconstruct the original fastq record. To remedy this error we have provided fastq files to CGHUB.

    OR

    A sam2fastq option is available in UBU version 1.2. It is only properly tested against Mapsplice paired end.

    Sample usage:

    Code:
    $ java -Xmx512M -jar ubu.jar sam2fastq --in sorted_by_name.bam --fastq1 1.fastq --fastq2 2.fastq --end1 /1 --end2 /2
    The input BAM should be sorted by name, i.e. with "samtools sort -n".

    The standalone jar file ubu-1.2-jar-with-dependencies.jar is available from the UBU downloads page:


    • fulvio.dan
      Junior Member
      • Dec 2012
      • 3

      #3
      Thanks GenoMax! You are right!
      They are TCGA RNA-Seq data, and ubu sam2fastq worked!


      • jstjohn
        Member
        • Jun 2010
        • 35

        #4
        UBU likes only paired reads in the BAM files

        In case this helps anyone else: when I was converting TCGA RNA-seq reads to fastq format, UBU complained about the presence of unpaired reads. The following was my workaround.
        1. Split paired and unpaired bam records.
          Code:
          # -f 1 keeps records with the paired flag (0x1); -U writes the rest to unpaired.bam
          samtools view -b -U unpaired.bam -o paired.bam \
                  -@ 3 -f 1 \
                  "$BAM"
        2. Sort paired reads by name.
          Code:
          samtools sort \
                  -n -o namesort.bam  -T namesort_pre -@ 3 -m 3G -O bam \
                  paired.bam
        3. Run UBU sam2fastq on the paired, name-sorted reads, outputting --fastq1 and --fastq2
          Code:
          java -jar -Xmx512m ubu-1.3-SNAPSHOT-jar-with-dependencies.jar sam2fastq \
                  --in namesort.bam \
                  --fastq1 r1.fastq \
                  --fastq2 r2.fastq \
                  --mapsplice
        4. Run UBU sam2fastq on unpaired reads, outputting --fastq1 only into an unpaired fastq file.
          Code:
          java -jar -Xmx512m ubu-1.3-SNAPSHOT-jar-with-dependencies.jar sam2fastq \
                  --in  unpaired.bam  \
                  --fastq1 fu.fastq \
                  --mapsplice
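
After step 3 it may be worth checking that the two paired fastq files line up record by record. The sketch below is one way to do that with standard tools. The function name `check_fastq_pairs` is hypothetical, and it assumes plain 4-line FASTQ records whose names either match exactly or differ only by a trailing /1 and /2.

```shell
# Hypothetical helper: compare two FASTQ files record by record.
# Assumes 4-line FASTQ records and read names that may end in /1 and /2.
check_fastq_pairs() {
    r1=$1
    r2=$2
    paste "$r1" "$r2" | awk -F '\t' '
        NR % 4 == 1 {                        # header line of each record
            a = $1; b = $2
            sub(/\/1$/, "", a)               # strip mate suffixes before comparing
            sub(/\/2$/, "", b)
            total++
            if (a != b) mismatched++
        }
        END { printf "%d records, %d mismatched\n", total, mismatched }
    '
}
```

Calling `check_fastq_pairs r1.fastq r2.fastq` prints the number of records compared and how many had names that did not match. If one file is shorter than the other, the extra records are counted as mismatched.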
