">chromosomeseq
cccccccccccccccccccccccc
>plasmid1
p1p1p1p1p1p1p1p1
>plasmid2
p2p2p2p2p2p2p2p2"
...would be correct.
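For anyone who would rather script this than use a text editor, here is a minimal Python sketch of the same idea as 'cat': it copies each input file into one multi-FASTA file and leaves every '>' header line in place, so the sequences are never merged. The function name concatenate_fasta, the file names and the toy sequences are all made up for illustration.

```python
import os
import tempfile


def concatenate_fasta(input_paths, output_path):
    """Copy every record from input_paths into one multi-FASTA file.

    Each record keeps its own '>' header line. Returns the sequence names
    (the first word of each header) in the order they were written.
    """
    names = []
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line.strip():
                        continue  # drop blank lines between records
                    if line.startswith(">"):
                        fields = line[1:].split()
                        names.append(fields[0] if fields else "")
                    # Always end with a newline so the next header
                    # never gets glued onto the previous sequence line.
                    out.write(line + "\n")
    return names


# Demo with toy data (hypothetical file names and sequences).
workdir = tempfile.mkdtemp()
toy_records = {
    "chromosome.fasta": ">chromosomeseq toy chromosome\nACGTACGTACGT\n",
    "plasmid1.fasta": ">plasmid1\nGGGGCCCC\n",
    "plasmid2.fasta": ">plasmid2\nTTTTAAAA",  # note: no trailing newline
}
input_paths = []
for filename, content in toy_records.items():
    path = os.path.join(workdir, filename)
    with open(path, "w") as handle:
        handle.write(content)
    input_paths.append(path)

reference_path = os.path.join(workdir, "reference.fasta")
names = concatenate_fasta(input_paths, reference_path)
with open(reference_path) as handle:
    combined = handle.read()

print(names)
print(combined)
```

The resulting reference.fasta has three headers, one per sequence, which is the layout the mapping program expects when it builds its index.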
-
Originally posted by piet
Usually you will map your reads against a set of reference sequences. All reference sequences are stored in a single FASTA file and this FASTA file is passed to the mapping program (bwa for example). If the genome you want to use as a reference comprises one chromosome and two plasmids, then you have to copy the three sequences into a single file. This is called concatenation and is often done with the Unix command line program 'cat'. For a small genome like E. coli you may also do it with a text editor like Notepad.
Does that mean I just paste all the chromosome and plasmid sequences together without breaks under one heading? For example:
">combinedsequences
ccccccccccccccp1p1p1p1p1p1p1p1p2p2p2p2p2p2p2p2p2"
c is chromosome sequence
p1 is plasmid one sequence
p2 is second plasmid's sequence
OR
in the notepad file, there are three headings?
">chromosomeseq
cccccccccccccccccccccccc
>plasmid1
p1p1p1p1p1p1p1p1
>plasmid2
p2p2p2p2p2p2p2p2"
Many thanks!
-
Originally posted by michaellim
But how do I add the plasmid sequences too
-
Originally posted by GenoMax
@michaellim: What is your ultimate aim with this RNAseq study? Are you looking to do differential expression or just checking to see what is expressed under some specific condition(s)?
For the immediate issue of not being able to see annotations you can use the gff file from Piet's example and see if that works with IGV. You should compare the pre- and post-trimming FastQC plots to see if there is an improvement in stats. Plots you have posted don't look bad but it is difficult to say if you have adapter contamination unless you try the trimming. No fastq grooming in galaxy should be necessary with MiSeq data. It is already in sanger fastq format.
Once you go away from "model" organisms tools such as galaxy start becoming limiting (as you have already discovered). Depending on your overall goals it may be beneficial to start learning how to do these analyses on command line. If this is a small part of whatever you are trying to do then enlisting the help of a friend/local bioinformatics support folks may be the easiest thing to do so you can get a set of hypotheses to test at the bench and move on.
Thank you for the advice. Do you have any recommended handbooks, guides, manuals or websites that teach RNA-seq analysis using command-line software?
I would really like to get a bioinformatician who knows about RNA-seq to help; however, no one in my department knows about it. That's why I have been hunting for information online.
It would be easier if all the sequences were already available in Galaxy and IGV; however, as you said, they only have certain 'model organisms' and the rest are not in there. That's why I was thinking of inputting the reference genome (chromosome and plasmids; separately or combined, whichever is more appropriate for the analyses) myself.
Thank you.
-
Originally posted by piet
The primary format NCBI has used for ages is the GenBank flat file format. Download the entry in 'GenBank' format and then use a tool like 'seqret' from the EMBOSS package to convert the GenBank flat file into GFF.
Or you may use the TogoWS web service to download the entry in GFF format directly:
wget http://togows.org/entry/nucleotide/407479587.gff
Please note, that GFF is not a strict format but rather a framework to invent your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file the name of the sequence is the first word of the description line (all the characters before the first space).
Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.
With regard to your questions about read trimming and about inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next weeks should be to make yourself familiar with all these tools and to establish a basic workflow. Once you have found such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
--
piet
Thanks for letting me know, I was unaware of the difference in sequence type. I will give it a go with just the chromosome first. But how do I add the plasmid sequences too?
-
@michaellim: What is your ultimate aim with this RNAseq study? Are you looking to do differential expression or just checking to see what is expressed under some specific condition(s)?
For the immediate issue of not being able to see annotations you can use the gff file from Piet's example and see if that works with IGV. You should compare the pre- and post-trimming FastQC plots to see if there is an improvement in stats. Plots you have posted don't look bad but it is difficult to say if you have adapter contamination unless you try the trimming. No fastq grooming in galaxy should be necessary with MiSeq data. It is already in sanger fastq format.
Once you go away from "model" organisms tools such as galaxy start becoming limiting (as you have already discovered). Depending on your overall goals it may be beneficial to start learning how to do these analyses on command line. If this is a small part of whatever you are trying to do then enlisting the help of a friend/local bioinformatics support folks may be the easiest thing to do so you can get a set of hypotheses to test at the bench and move on.
Last edited by GenoMax; 12-21-2014, 08:29 AM.
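To make the "quality score of 20" discussion in this thread a bit more concrete, here is a rough Python sketch of how Sanger-format (Phred+33) quality strings translate into scores, and how one might find the first read position where the average drops below a threshold. Note the assumptions: FastQC draws medians and quartiles, whereas this sketch uses a plain mean per position, and the function names and the toy quality strings are invented for the example.

```python
def phred33_scores(quality_line):
    """Decode one Sanger/Illumina 1.8+ quality string (ASCII offset 33)."""
    return [ord(ch) - 33 for ch in quality_line]


def mean_quality_per_position(quality_lines):
    """Return the mean Phred score at each read position (1st, 2nd, ...)."""
    totals = []
    counts = []
    for quality_line in quality_lines:
        for i, score in enumerate(phred33_scores(quality_line)):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


def first_position_below(means, threshold=20):
    """1-based position of the first mean below threshold, or None."""
    for position, mean in enumerate(means, start=1):
        if mean < threshold:
            return position
    return None


# Toy quality strings: 'I' = Q40, '5' = Q20, '+' = Q10, '#' = Q2.
toy_qualities = ["II5+", "II5#"]
means = mean_quality_per_position(toy_qualities)
cutoff = first_position_below(means, threshold=20)
print(means)
print(cutoff)
```

A position reported this way is only a hint about where quality trimming might start; a dedicated trimmer works read by read rather than applying one fixed cut to every read.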
-
Originally posted by michaellim
download the GFF file from NCBI, but I don't see any "http://www.ncbi.nlm.nih.gov/nuccore/407479587" place for me to download the GFF file from the "Display Settings" (Top Left of the screen).
Or you may use the TogoWS web service to download the entry in GFF format directly:
wget http://togows.org/entry/nucleotide/407479587.gff
Please note, that GFF is not a strict format but rather a framework to invent your own format. Column 9 of the GFF file comprises several tags. The names of these tags are more or less arbitrary. The tag names assigned by TogoWS may or may not meet the requirements of your sequence viewer. Furthermore, column 1 of the GFF file holds the name of the sequence. The name used in column 1 of the GFF file must be EXACTLY the same as the name used in the corresponding FASTA file. In a FASTA file the name of the sequence is the first word of the description line (all the characters before the first space).
Please also note that GI=407479587 is an isolate from the German HUSEC outbreak in 2011, which is sequence type 678 and differs from the uropathogenic ST131 you asked about before.
With regard to your questions about read trimming and about inclusion of plasmids, I would recommend that you initially start with just a single chromosomal sequence and without any read trimming. You should be able to map 60 to 80 percent of your reads that way. Your goal for the next weeks should be to make yourself familiar with all these tools and to establish a basic workflow. Once you have found such a workflow, you can try to improve the number of reads mapped by either adding plasmid sequences to your set of reference sequences or by doing some read trimming.
--
piet
Last edited by piet; 12-21-2014, 09:14 AM.
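Since a mismatch between column 1 of the GFF file and the FASTA sequence name is an easy mistake to make, a small check can help before loading both files into a viewer. The Python sketch below is only an illustration under a few assumptions: the helper names fasta_names and gff_seqids are invented, the GFF is tab-separated, and the names chr1, plasmid1 and plasmidA are toy values rather than real accessions.

```python
def fasta_names(fasta_text):
    """Sequence names in a FASTA text: first word of each '>' line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gff_seqids(gff_text):
    """Distinct values of column 1 in a tab-separated GFF text."""
    seqids = set()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # GFF3 files may carry embedded sequences after this line
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqids.add(line.split("\t")[0])
    return seqids


# Toy inputs (hypothetical names, not real accessions).
toy_fasta = ">chr1 some free-text description\nACGTACGT\n>plasmid1\nGGCC\n"
toy_gff = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t8\t.\t+\t.\tID=gene1\n"
    "plasmidA\texample\tgene\t1\t4\t.\t+\t.\tID=gene2\n"
)

# Names used in the GFF that have no exactly matching FASTA sequence.
unmatched = gff_seqids(toy_gff) - fasta_names(toy_fasta)
print(sorted(unmatched))
```

Any name printed here is used in the GFF but has no identically named sequence in the FASTA file, so annotations on it would not line up with the reference.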
-
Originally posted by GenoMax
Watch this short video from Illumina that explains how their sequencing technology works (it addresses adapters/indexes): https://www.youtube.com/watch?v=HMyCqWhwB8E
The index read sequence will not be part of the actual read. It will be included in the FASTQ read header (http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers skip to CASAVA 1.8 format headers).
Post the FastQC plots for your sample(s) if you need specific comments but in general if you had inserts that were shorter than your read length then you are going to have adapters in your sequences. If your data was processed on the MiSeq by MiSeq reporter then the adapters may have already been removed (ask the facility if you are not sure).
BBDuk is especially good at documenting statistics about how many reads had adapters or were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the reference directory).
Thanks for the explanation. I've attached four plots for you to comment on. Honestly, I don't have much idea about them. All I was told was that as long as the read is above 20 on the Y-axis, it's good to use. Bases below 20 may have been called wrongly and may need to be trimmed before mapping.
By the way, I did a trial mapping of one of the groomed RNA-seq files against a reference chromosome sequence, but when I try to view the BAM file in the Integrative Genomics Viewer (IGV), the reference chromosome is not in the drop-down list. When I tried to upload my own FASTA file downloaded from NCBI, there was no gene annotation in it. Do you know how I should upload the annotation? I tried reading the IGV website, it says to download the GFF file from NCBI, but I don't see any "http://www.ncbi.nlm.nih.gov/nuccore/407479587" place for me to download the GFF file from the "Display Settings" (top left of the screen).
Could you kindly advise?
Thank you.
-
Originally posted by michaellim
Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?
By the way, looking at the Per Base Sequence Quality, for all of my samples, the lower end of the yellow box goes below the 20 Quality Score after base-150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base-150?
Thank you.
BBDuk is especially good at documenting statistics about how many reads had adapters or were trimmed. Make sure you use the correct adapter reference files (Nextera, TruSeq, etc.; they are included in the BBMap download, in the reference directory).
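On the question of whether the index sequence still shows up in the FASTQ file: as discussed in this thread, it sits in the read header rather than in the read itself. The Python sketch below pulls it out of a CASAVA 1.8 style header. It assumes the header layout described on the Wikipedia FASTQ page (seven colon-separated fields, a space, then four more); the function name and dictionary keys are invented here, and files written by other software versions may put something else in the last field.

```python
def parse_casava18_header(header):
    """Split a CASAVA 1.8 style FASTQ header line into named fields.

    Assumed layout:
    @instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    left, _, right = header[1:].strip().partition(" ")
    left_fields = left.split(":")
    right_fields = right.split(":")
    if len(left_fields) != 7 or len(right_fields) != 4:
        raise ValueError("not a CASAVA 1.8 style header: " + header)
    return {
        "instrument": left_fields[0],
        "run_number": int(left_fields[1]),
        "flowcell_id": left_fields[2],
        "lane": int(left_fields[3]),
        "tile": int(left_fields[4]),
        "x_pos": int(left_fields[5]),
        "y_pos": int(left_fields[6]),
        "read": int(right_fields[0]),          # 1 or 2 for paired reads
        "is_filtered": right_fields[1] == "Y",  # Y means the read failed the filter
        "control_number": int(right_fields[2]),
        "index_sequence": right_fields[3],
    }


# Example header in the style shown on the Wikipedia FASTQ page.
example_header = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
parsed = parse_casava18_header(example_header)
print(parsed["index_sequence"])
print(parsed["read"], parsed["lane"])
```

The point is simply that the 6-nucleotide index is metadata on the header line, so there is nothing to trim from the read for it; adapters, by contrast, can appear inside the read when the insert is shorter than the read length.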
-
Originally posted by Brian Bushnell
All aligners are designed to handle references with multiple contigs; you don't need to combine anything (nor should you). You just need to index it.
Well since you ask me, I will recommend BBMap, which also handles RNA-seq data, but is faster and more sensitive than Tophat. But bacteria generally lack introns - when they are present, they are very short and only in a handful of genes. So it's not strictly necessary to use a splice-aware aligner for bacterial RNA-seq, though I would still recommend it.
Can I get some further clarification from you too? I was looking at some genomes in NCBI and they are deposited as Chromosome and multiple plasmids.
In this case, when I'm mapping, am I supposed to combine all the sequences (chromosome and plasmid) in NCBI? Or do I index them as you've mentioned? Sorry, I have no prior knowledge at all on DNA sequencing/RNA sequencing.
I was trying to map the RNA seq data in Galaxy, but I can only choose one reference at a time.
Thank you.
-
Originally posted by Sergioo
By now, you've got many suggestions from more experienced readers. You are lucky because you've just got to sit down and think of which option to use.
I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is that its contigs are unordered.
Now, once you've got your draft sequences ordered, you are free to compare it with what you think is more related (for example sequences from the same ST as your isolate).
Hope it helps.
Yes, I'm truly very grateful for all the responses given. I'm slowly understanding more about the software options and their uses, and about the initial mapping analyses. I currently have no idea, as I've not done this before and there is no one in the department who has done this kind of work, so I couldn't get any advice internally.
By the way, how do you do the mapping when, for example, you have 1 chromosome sequence and 5 plasmid sequences on NCBI? I was looking at Galaxy and you can only choose one reference genome for any single mapping task.
Thank you.
-
Originally posted by GenoMax
It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft clip them, but if you are planning to do any assembly you want to start with clean reads. BTW adapters and indexes are not the same thing. With Illumina technology, index sequences are never a part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).
BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).
Sorry, it's my first time doing RNA-seq and dealing with sequencing data. I was using MiSeq for the sequencing (the running of the flow cell was done by the sequencing lab; I prepared everything up to the denatured libraries). From the Illumina Library Prep manual, I (mis)understood 'adapters' to be the same as 'index/indices' (unique 6-nucleotide sequences used to label each RNA sample).
Could you please explain how they are different? Will the sequence still be in the sequencing FASTQ file?
By the way, looking at the Per Base Sequence Quality, for all of my samples, the lower end of the yellow box goes below the 20 Quality Score after base-150 (all sequences are 200 bases). Does this mean I need to trim the adapters and also everything after base-150?
I was reading some blogs, and there are arguments about whether or not it is important to trim before mapping. It's rather confusing to me.
Thank you.
-
Originally posted by michaellim
Hi Sergioo,
Yes, MLST. For example, E. coli ST11 will be different from ST131. However, we aren't certain whether there are any genes specific to ST131 that cannot be found in other E. coli sequence types.
So, if ST11 has a completed genome but ST131 is in contigs, and my current RNA-seq data is from ST131, should I use ST131 (multiple contigs) as the reference, or the completed genome of ST11, which is less closely related? That was my question. Hope that makes it clearer.
Thank you.
I am not familiar with RNA-seq projects, but if it were whole-genome sequencing, I would first go for an assembly (even de novo) using a complete genome (not the one in multiple contigs). The complete genome, even if not closely related, will allow you to order your contigs and resolve misassemblies. Note that you cannot rely on a draft genome sequence, since its biggest drawback is that its contigs are unordered.
Now, once you've got your draft sequences ordered, you are free to compare it with what you think is more related (for example sequences from the same ST as your isolate).
Hope it helps.
-
It is always a good idea to check for and trim adapter sequences, if present. Many aligners will soft clip them, but if you are planning to do any assembly you want to start with clean reads. BTW adapters and indexes are not the same thing. With Illumina technology, index sequences are never a part of the main read, so they do not need to be trimmed (unless you are using custom inline indexes).
BBDuk is easy to use (on Windows/Mac/*nix), and so is Trimmomatic. You could do this in Galaxy, but at some point you will need to move to the command line (e.g. if you decide to use Mauve).
Last edited by GenoMax; 12-20-2014, 06:04 AM.
-
Originally posted by GenoMax
That is a likely explanation. If submitters are not completely sure that the contigs go together (there could be multiple plasmids in some bacteria and the separate pieces may be real) they would be left in that state.
May I check with you whether I need to trim the adapter sequence from my RNA-seq FASTQ file? My libraries were about 260 bp each.
Any suggestions on how I should do this? Do I just set the software to trim from base 1 to base X, or do I need to give the individual adapter sequences to the trimmer? I've noticed quite a few versions of trimmers online. There is a built-in one in Galaxy too.
Many thanks.