Hi,
You have asked some interesting questions about contig creation in the paper you mentioned. I’m not an expert on this topic, but I can try to give you some answers based on what I’ve read from other sources.
(a) The contigs from Megahit and Trinity are not necessarily unique. Both assemblers were run on the same reads, so some contigs will be identical or overlapping between the two sets, while others will appear in only one of them because the programs use different algorithms and parameters. The two totals therefore cannot simply be added together. To estimate the number of unique contigs, you would need to pool the two sets and cluster them by sequence similarity with a tool like CD-HIT-EST. QUAST can then be used to compare summary statistics of the assemblies, but it does not merge contigs.
(b) The origin of the non-human contigs in the sample is not easy to determine, because they could come from various sources such as bacteria, viruses, fungi, parasites, or environmental contaminants. To identify the origin of these contigs, you would need to compare them to reference databases using tools like BLAST or Kraken. However, these tools may not be able to identify all of the contigs, especially if they are novel or divergent from known sequences. In that case, you would need to use other methods such as phylogenetic analysis or functional annotation to infer their origin.
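To make point (a) a bit more concrete, here is a minimal Python sketch of the simplest possible comparison: counting contigs that are exactly identical between two assemblies, treating a sequence and its reverse complement as the same contig. The function names and the toy sequences are made up for illustration. This is not what CD-HIT-EST does (it clusters by a similarity threshold, so it also catches near-identical and contained contigs), so treat the result as a lower bound on the real overlap.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def canonical(seq: str) -> str:
    """Return the smaller of a sequence and its reverse complement,
    so that both strands map to the same key."""
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def count_distinct(assembly_a, assembly_b) -> Dict[str, int]:
    """Count exact-match contigs shared by, or private to, two assemblies."""
    a = {canonical(seq) for _, seq in assembly_a}
    b = {canonical(seq) for _, seq in assembly_b}
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(a & b),
        "distinct_total": len(a | b),
    }


# Toy data (hypothetical): the second assembly contains the reverse
# complement of the first contig of the first assembly.
toy_megahit = [">m1", "ACGTAC", ">m2", "GGGTTT"]
toy_trinity = [">t1", "GTACGT", ">t2", "CATCAT"]
result = count_distinct(read_fasta(toy_megahit), read_fasta(toy_trinity))
print(result)
```

With real data you would pass open file handles for the two FASTA files to `read_fasta` instead of the toy lists.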
Best regards, Hanna
Questions about contig creation.
If you have a BED file with the regions where you would like to count, bedtools multicov does the job quickly and well. It requires an indexed BAM file.
This forum is called SEQ "answers"? It is a complete misnomer. There are no answers to be had here. Two weeks and not one single reply. Doesn't anybody know the answers to my questions? Or are people too scared to answer them?
Is there nobody here who can help me with this? I am quite surprised that nobody had even replied to my post. I thought someone would have.
Questions about contig creation.
Hi everyone,
My first questions. I am quite a noob, so please forgive me if I say or ask anything silly. My questions are about contig creation in the paper called 'A new coronavirus associated with human respiratory disease in China', which is here:
In the paper, it says this:
"Sequencing reads were first adaptor and quality trimmed using the Trimmomatic program [32]. The remaining 56,565,928 reads were assembled de novo using both Megahit (v.1.1.3) and Trinity (v.2.5.1) with default parameter settings. Megahit generated a total of 384,096 assembled contigs (size range of 200–30,474 nt), whereas Trinity generated 1,329,960 contigs with a size range of 201–11,760 nt. All of these assembled contigs were compared (using BLASTn and Diamond BLASTx) against the entire non-redundant (nr) nucleotide and protein databases."
And this:
"Of the 384,096 contigs assembled by Megahit, the longest (30,474 nucleotides (nt)) had a high abundance and was closely related to a bat SARS-like coronavirus."
So, in summary, after filtering out known human sequences, they were left with 384,096 assembled contigs from Megahit and 1,329,960 assembled contigs from Trinity. And one of these contigs from Megahit was 30,474 nucleotides in length, which was closely related to a bat SARS-like coronavirus.
My questions are:
(a) Are the contigs from Megahit and Trinity individually unique, giving a total of 384,096 + 1,329,960 = 1,714,056 unique contigs? Or did Megahit and Trinity both find some of the same contigs, giving a lower number of unique contigs?
(b) Whatever the answer to question (a) above, leaving aside the one 30,474 nucleotide contig which was closely related to a bat SARS-like coronavirus, there remain about a million (or maybe more) contigs of non-human origin in the sample. They are not of human origin, so what is their origin? Obviously there will be other viruses, bacteria, etc, in the bronchoalveolar lavage sample. But a million or more individual unique contigs? I don't know, but that seems like a lot to me. Do these contigs represent a million or more different individual organisms, viruses, bacteria, etc? Or, if not individual organisms, how many different kinds of virus, bacteria, etc, would normally be present in a bronchoalveolar lavage sample? Wow, that question went on a bit longer than I intended. Anyway, please just answer what you can. I don’t expect this question to be answered completely. Maybe some of the things I asked are not actually known.
(c) This question, however, I think may be able to be answered completely. Is it possible that some (or maybe lots) of these million or more contigs could be generated simply by the computer algorithms, and do not actually exist in reality? Because if you have 56,565,928 reads, and only four letters (ATCG) to form the nucleotide strings from, I assume that you would get some overlapping strings in the alignment process, just from the theory of probability alone. Am I correct in assuming this?
(d) If there was a high abundance of strings of 30,000+ nucleotides in the bronchoalveolar lavage sample, wouldn't you be able to find those strings using gel electrophoresis?
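The probability argument in question (c) can be turned into a rough back-of-the-envelope calculation. The sketch below assumes a deliberately crude model that is not taken from the paper: every base is drawn independently and uniformly from A, C, G and T, so two unrelated stretches of length k match with probability (1/4)^k, and the number of read pairs is N(N-1)/2. The function names and the k values are chosen only for illustration and are not the assemblers' settings.

```python
from math import comb


def chance_match_probability(k: int) -> float:
    """Probability that two unrelated k-base stretches are identical,
    assuming independent, uniformly random bases (a crude model)."""
    return 0.25 ** k


def expected_chance_overlaps(n_reads: int, k: int) -> float:
    """Expected number of read pairs whose k-base stretches match purely
    by chance, counting one stretch per read and one orientation."""
    return comb(n_reads, 2) * chance_match_probability(k)


N_READS = 56_565_928  # read count quoted from the paper

# Illustrative overlap lengths only.
for k in (10, 20, 30, 50):
    print(f"k={k:>2}: expected chance matches = {expected_chance_overlaps(N_READS, k):.3g}")
```

Under these assumptions, short matches are expected to occur by chance in very large numbers, while the expected count drops below one somewhere between k = 20 and k = 30 and keeps shrinking by a factor of four with each extra base. Real sequences are not random (repeats and regions shared between related organisms exist), so this does not rule out wrongly joined contigs; it only suggests that long exact overlaps are unlikely to arise from chance alone.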
I think that’s enough for now. I expect you are all busy people and I don’t want to take up too much of your time. Thank you for reading my post. Any answers I receive will be greatly appreciated.
Warm regards,
Bobby.
Last edited by Bobby_Quan; 11-26-2022, 08:22 AM.