Header Leaderboard Ad

Collapse

Several input files w. Novoalign?

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Several input files w. Novoalign?

    Hi all,

    I am using Novoalign to align 50bp illumina SE reads to a viral genome. The reads were processed with Casava 1.8, so I have a bunch of individual read files - e.g.

    read001.gz
    read002.gz
    .
    .
    read036.gz

    Now, is there a way I can tell Novoalign to align all of these reads in one go? (I know I could pipe it with gunzip, but I would rather not do that). I have tried to put them in a folder and point Novoalign like so:

    novoalign -f Reads/ -c 6 -F ILM1.8 --**ILQ_SKIP -d reference -k -o SAM 2> reads.novoalign_logS.txt | samtools view -S -b -q 1 - | samtools sort - reads_sortedS

    But that doesn't work - neither does replacing 'Reads/' with read0*.gz.

    Any thoughts?

  • #2
    Why not just align each of them independantly, then merge BAMS?

    Comment


    • #3
      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

      Any other workarounds?

      Comment


      • #4
        Yeah, why not just gunzip -c *.gz > reads.fq and start from there?

        Comment


        • #5
          Hi nickloman,

          Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .

          Comment


          • #6
            Yep "why think, why not do the experiment"

            Comment


            • #7
              Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:

              gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM

              I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal.

              Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline.

              Comment


              • #8
                That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.

                Comment


                • #9
                  Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.

                  Comment


                  • #10
                    You know what - that actually worked....

                    gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                    I'll settle with that option - not too many compromises here.

                    Thanks nickloman!

                    Comment


                    • #11
                      Originally posted by kga1978 View Post
                      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

                      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

                      Any other workarounds?
                      I'm not quite sure what you mean about being optimal or a billlion commands. You will use the same amount of resources to do the alignment. It would be very simple to construct a bash loop to submit all your jobs too.

                      Comment


                      • #12
                        Hi RockChalkJayhawk

                        I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge .

                        Comment


                        • #13
                          if the names of your files are read001.gz, read002.gz, read036.gz, etc. do
                          Code:
                          for x in read *gz
                          do
                          gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here]
                          done

                          Comment


                          • #14
                            Originally posted by kga1978 View Post
                            Hi nickloman,

                            Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .
                            You may try this:

                            https://bitbucket.org/dawe/utils/src...e6/makeMake.sh

                            Just modify it for novoalign, like this:

                            Code:
                            #!/bin/bash
                            
                            # Look for binaries we need, exit if not found
                            
                            SAMTOOLS=/home/elia/bin/samtools
                            NOVOALIGN=/home/elia/dawe/novocraft/novoalign
                            INDEXPATH=/home/elia/dawe/index
                            
                            [[ -z ${BWA} ]] && exit 1
                            [[ -z ${SAMTOOLS} ]] && exit 1
                            
                            # Use the SampleSheet.csv to create the script and align
                            # SampleSheets has a header we want to skip
                            
                            nlines=(`wc -l SampleSheet.csv`)
                            while read line
                            do
                              # There may be reasons to skip this line... i.e. when there's no indexed genome 
                              Skip=1
                            
                              # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index)
                              IFS=","
                              lineArray=($line)
                              Lane=`printf "%03d" ${lineArray[1]}`
                              ID=${lineArray[2]}
                              Genome=${lineArray[3]}
                              if [ -z ${lineArray[4]} ]
                              then 
                                # If there's no index, set it to NoIndex, just like Illumina does
                                Index="NoIndex"
                              else 
                                Index=${lineArray[4]}
                              fi
                              
                              # Create the Experiment name, this is also the name Illumina gives to files
                              expName=${ID}_${Index}_L${Lane}
                              
                              # Check how many chunks we have, this will be the number of SGE array tasks
                            
                              ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l`
                            
                              # Also check if this is paired end sequencing. This changes the bwa line
                              isPE=0
                              if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ]
                              then
                                isPE=1
                              fi  
                              
                              # Ok, what if we don't have the genome?
                              [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0
                              if [ $Skip -eq 1 ]
                              then
                                echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it"
                                GFile="FIXME"
                              else
                                # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ )
                                GFile=`ls ${INDEXPATH}/${Genome}/*.nix`
                                #GFile=${GFile%%.bwt}
                                echo "Using ${GFile}"
                              fi  
                            done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv)
                            
                            # generate Makefile
                            
                            unset IFS
                            
                            # set the list of BAM files
                            bamTargets=''
                            ibam=''
                            for x in `seq ${ntask}`
                            do
                              chunk=$(printf "%03d" ${x})
                              ibam="${ibam} INPUT=${expName}_${chunk}.bam"
                              bamTargets="${bamTargets} ${expName}_${chunk}.bam"
                            done  
                            
                            cat > Makefile << ENDOFMAKE
                            SHELL = /bin/bash
                            NOVO = /home/elia/dawe/novocraft/novoalign 
                            SAMTOOLS = ${SAMTOOLS}
                            REFINDEX = ${GFile}
                            PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56
                            
                            all: ${expName}.bam.bai
                            	/bin/mv ${expName}.bam ${expName}.bam.lock
                            	/bin/rm -f ${bamTargets}
                            	/bin/mv ${expName}.bam.lock ${expName}.bam
                            	@echo "Time to give some meaning to these files"
                            
                            ${expName}.bam: ${bamTargets}
                            	java -jar  \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT
                            
                            ${expName}.bam.bai: ${expName}.bam
                            	\$(SAMTOOLS) index ${expName}.bam
                            
                            ENDOFMAKE
                            
                            for x in `seq ${ntask}`
                            do
                              chunk=`printf "%03d" ${x}`
                              if [ $isPE == 1 ]
                              then
                                cat >> Makefile << ENDOFMAKE1
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz
                            	\$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE1
                            
                              else
                                cat >> Makefile << ENDOFMAKE2
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz
                            	\$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE2
                            
                              fi
                            done
                            Launch it in each Sample directory created by CASAVA, you will find a Makefile. That done you may just issue make -j N (just like CASAVA) to perform parallel processing of your fastq.gz files. Everything will be done up to your final merged sorted BAM file.

                            Comment


                            • #15
                              Originally posted by kga1978 View Post
                              You know what - that actually worked....

                              gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                              I'll settle with that option - not too many compromises here.

                              Thanks nickloman!
                              That's a neat way and it may improve performance as decompression is in its own thread.
                              One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work.

                              Before we implemented use of gzipped input files in Novoalign we were doing paired end like this:

                              mkfifo read1_pipe read2_pipe
                              gunzip -c read1.fastq.gz > read1.pipe
                              gunzip -c read2.fastq.gz > read2.pipe
                              novoalign -f read1.pipe read2.pipe -F STDFQ ....

                              The same technique could be used to process multiple paired read files.

                              Colin

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
                                by seqadmin


                                ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

                                01-24-2023, 01:19 PM
                              • seqadmin
                                Introduction to Single-Cell Sequencing
                                by seqadmin
                                Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

                                The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
                                ...
                                01-09-2023, 03:10 PM

                              ad_right_rmr

                              Collapse
                              Working...
                              X