**dariober** · 11-28-2013, 03:08 PM

Hi, this Python script should do what you need. There is no error checking: *all* the files in each input directory are concatenated, and they are assumed to be in the same order in every directory (as you mention above).

Assuming your input dirs are species1, species2, ..., this will write the concatenated files to the directory OUTDIR:

Code:

python -c "
import os

INPUTDIRS= ['./species1', './species2', './species3', './species4']
OUTDIR= './'
PREFIX= 'gene.'

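# Sorted file list for each input directory
geneDict = {}
for d in INPUTDIRS:
    geneDict[d] = sorted(os.listdir(d))

# Output names are taken from the first input directory
refNames = geneDict[INPUTDIRS[0]]
for i in range(len(refNames)):
    with open(os.path.join(OUTDIR, PREFIX + refNames[i]), 'w') as fout:
        # Append the i-th file from every directory, in INPUTDIRS order
        for d in INPUTDIRS:
            with open(os.path.join(d, geneDict[d][i])) as fin:
                for line in fin:
                    fout.write(line)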
"

(There must be an easier way of doing it!)

Dario
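
As a hedged variant of the script above, the sketch below matches files by name rather than by position and stops before writing anything if a gene file is missing from any directory. It assumes the files carry the same names in every species directory and end in `.fa`, as in the original question. The function name `concat_by_gene` and its parameters are illustrative, not part of the original script.

```python
import os


def concat_by_gene(input_dirs, out_dir, prefix="gene."):
    """Concatenate same-named .fa files from each directory into out_dir.

    Assumption: every directory holds the same set of file names.
    Returns the sorted list of gene file names that were combined.
    """
    # The first directory serves as the reference list of gene files.
    gene_files = sorted(f for f in os.listdir(input_dirs[0]) if f.endswith(".fa"))

    # Check everything up front so no partial output is written on error.
    for name in gene_files:
        missing = [d for d in input_dirs
                   if not os.path.isfile(os.path.join(d, name))]
        if missing:
            raise FileNotFoundError(f"{name} not found in: {', '.join(missing)}")

    os.makedirs(out_dir, exist_ok=True)
    for name in gene_files:
        with open(os.path.join(out_dir, prefix + name), "w") as fout:
            for d in input_dirs:
                with open(os.path.join(d, name)) as fin:
                    text = fin.read()
                fout.write(text)
                # Make sure the next FASTA header starts on its own line.
                if text and not text.endswith("\n"):
                    fout.write("\n")
    return gene_files
```

Calling `concat_by_gene(['./species1', './species2'], './combined')` would write one `gene.<name>.fa` file per gene into `./combined`. Keeping the output directory separate from the input directories avoids picking up earlier output on a rerun.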

**gringer** · 11-28-2013, 04:12 PM

Originally posted by gevielr

I've got 1000 gene sequences, each in a separate fasta file, for 4 different species. So, each species has its own sequence for 1000 different genes. Each species is in its own directory, and all the sequence files are in the same order for each species, so:

Directory1(species1): geneA.fa, geneB.fa, geneC.fa, ...
Directory2(species2): geneA.fa, geneB.fa, geneC.fa, ...
etc...

I want to concatenate all the sequences for geneA into a single file, and likewise for every other gene, to end up with 1000 fasta files with 4 sequences (1 from each species) in every file.

Is there an easy way to automate this? I could just use cat and go one gene at a time, but I'd like to do it more quickly.

File globbing in the shell makes this easy:

Code:

mkdir -p combined_files
for f in Directory1/*.fa; do
  x=$(basename "${f}")
  echo "Creating combined_files/${x}"
  cat Directory*/"${x}" > "combined_files/${x}"
done

For a more complicated situation, I might use find and exec.
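
One way to read "a more complicated situation" is FASTA files nested at different depths under each species directory; that layout is an assumption here, not something described in the thread. The minimal sketch below is a Python stand-in for a recursive find, using `pathlib.Path.rglob` to group files by base name and then concatenate each group. The function names `group_fasta_by_name` and `write_combined` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_fasta_by_name(root, pattern="*.fa"):
    """Recursively collect files under root, grouped by base file name."""
    groups = defaultdict(list)
    # Sorting the paths keeps the species order stable between runs.
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            groups[path.name].append(path)
    return dict(groups)


def write_combined(groups, out_dir):
    """Write one combined file per gene name into out_dir.

    out_dir should sit outside the searched tree, otherwise a rerun
    would pick up the combined files as input.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, paths in groups.items():
        with open(out / name, "w") as fout:
            for p in paths:
                fout.write(p.read_text())
```

For example, `write_combined(group_fasta_by_name('data'), 'combined_files')` would gather every `geneA.fa` found anywhere under `data` into `combined_files/geneA.fa`.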

fasta file manipulation- combining sequences by gene rather than species
