**dariober** · 11-28-2013, 03:08 PM

Hi, this Python script should do what you need. There is no error checking: *all* the files in each input directory are concatenated, and files are paired by their position in the sorted directory listing, so every directory is assumed to hold the same genes in the same order (as you mention above).

Assuming your input dirs are species1, species2..., this will dump the concatenated files in directory OUTDIR:

Code:

python -c "
import os

INPUTDIRS= ['./species1', './species2', './species3', './species4']
OUTDIR= './'
PREFIX= 'gene.'

# Sorted file listing for each input directory
geneDict= {}
for d in INPUTDIRS:
    geneDict[d]= sorted(os.listdir(d))

# Output names come from the first directory; files are paired by position
outNames= geneDict[INPUTDIRS[0]]
for i in range(len(outNames)):
    with open(os.path.join(OUTDIR, PREFIX + outNames[i]), 'w') as fout:
        for d in INPUTDIRS:
            with open(os.path.join(d, geneDict[d][i])) as fin:
                for line in fin:
                    fout.write(line)
"

(There must be an easier way of doing it!)
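
Since the script above has no error checking, here is a minimal sketch of a variant that pairs files by name rather than by position and refuses to run if a gene is missing from any directory. The function name `concat_by_gene` and its parameters are made up for illustration; it assumes the first directory defines the full gene list, and extra files in the other directories are ignored.

```python
import os

def concat_by_gene(input_dirs, out_dir, prefix="gene."):
    """Concatenate same-named files from each input directory.

    Returns the list of output paths. Raises FileNotFoundError if a file
    found in the first directory is missing from any other directory.
    """
    first = input_dirs[0]
    # Assumption: the first directory defines the full list of genes
    names = sorted(n for n in os.listdir(first)
                   if os.path.isfile(os.path.join(first, n)))

    # Check every directory before writing anything
    for d in input_dirs[1:]:
        missing = [n for n in names if not os.path.isfile(os.path.join(d, n))]
        if missing:
            raise FileNotFoundError(f"{d} is missing: {', '.join(missing)}")

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in names:
        out_path = os.path.join(out_dir, prefix + name)
        with open(out_path, "w") as fout:
            for d in input_dirs:
                with open(os.path.join(d, name)) as fin:
                    text = fin.read()
                # Keep records apart if a file lacks a trailing newline
                if text and not text.endswith("\n"):
                    text += "\n"
                fout.write(text)
        written.append(out_path)
    return written
```

Something like `concat_by_gene(['./species1', './species2', './species3', './species4'], './combined')` would then write one `gene.<name>` file per gene into `./combined`.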

Dario

**gringer** · 11-28-2013, 04:12 PM

Originally posted by gevielr

I've got 1000 gene sequences, each in a separate FASTA file, for 4 different species. So, each species has its own sequence for 1000 different genes. Each species has its own directory, and all the sequence files are in the same order for each species, so:

Directory1(species1): geneA.fa, geneB.fa, geneC.fa, ...
Directory2(species2): geneA.fa, geneB.fa, geneC.fa, ...
etc...

I want to concatenate all the sequences for geneA into a single file, and likewise for every other gene, to end up with 1000 FASTA files with 4 sequences (1 from each species) in every file.

Is there an easy way to automate this? I could just use cat and go one gene at a time, but I'd like to do it more quickly.

File globbing in the shell makes this easy:

Code:

mkdir -p combined_files
# Take the gene list from the first directory, then glob across all of them
for f in Directory1/*.fa; do
  x=$(basename "${f}")
  echo "Creating combined_files/${x}"
  cat Directory*/"${x}" > "combined_files/${x}"
done

For a more complicated situation, I might use find and exec.
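
As one example of a more complicated situation, suppose the FASTA files sit at different depths under each species directory. The sketch below is a Python stand-in for that kind of recursive search (it is not the find/exec command itself): it walks a directory tree, groups files by base name, and concatenates each group. The function name `combine_nested` and the `*.fa` pattern are assumptions for illustration.

```python
from collections import defaultdict
from pathlib import Path

def combine_nested(root, out_dir, pattern="*.fa"):
    """Group files under root by base name and concatenate each group.

    Returns a dict mapping each file name to the number of inputs combined.
    """
    root, out_dir = Path(root), Path(out_dir)
    out_resolved = out_dir.resolve()
    groups = defaultdict(list)
    # Sorting by full path keeps the species order stable between runs
    for path in sorted(root.rglob(pattern)):
        # Skip directories and anything already inside the output directory
        if path.is_file() and out_resolved not in path.resolve().parents:
            groups[path.name].append(path)

    out_dir.mkdir(parents=True, exist_ok=True)
    for name, paths in groups.items():
        with open(out_dir / name, "w") as fout:
            for p in paths:
                text = p.read_text()
                # Keep records apart if a file lacks a trailing newline
                if text and not text.endswith("\n"):
                    text += "\n"
                fout.write(text)
    return {name: len(paths) for name, paths in groups.items()}
```

The returned counts make it easy to spot genes that were not found for every species, for example any entry whose count is not 4.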

fasta file manipulation- combining sequences by gene rather than species
