Internship in Codon Genomic by thonzenn

Week 1 Day 1 2/7/2018, Mon

thonzenn — 2018-07-05 08:22:36 UTC

Since molecular biology is not part of my syllabus and I have not yet taken microbial genetics, I have only a very basic knowledge of molecular biology. What little I know comes from high school biology and some first-year courses that covered it only roughly. I took bioinformatics in the semester that has just ended, and I think I need to learn more about it since what we learned in class was very basic. This company came by to give us a workshop and tell us what they do. They certainly got my attention, although my initial intention was to work on molecular biology.

I was briefed on the company policy and the things that a new 'employee' needs to know. I was thrilled to learn that I will get the chance to be involved in a workshop as a facilitator and to participate in a conference. Also, a mini project will be assigned to me after I build up some basics.

One thing I did not expect was that I need to learn the command line to communicate with the computer. This is something that I have never explored before and never thought I someday would. However, I am looking forward to what's coming.

I like the environment and the ambience here. The office doesn't really look like an office; it rather resembles a house, and the people here are nice. I especially like how they work together professionally; everyone is so focused on their work (at least they look like they are focusing). Sometimes they relax after some intense work or thinking. There's even a big pillow in the office so that we can lie down and play with our phones after some intense brain frying.

This is the office for Genome Science, where I will spend most of my time

The living room with its attractive colourful wallpaper and some bean bags

I like the transparent glass door that can also be a huge drawing board for project planning

There are an Xbox, a Wii and a massage chair in the living room for the scientists to relax their minds when work gets intense!

Another surprise is that every intern needs to prepare a lunch, starting with the proposal and budgeting, all the way to the cooking. I currently don't have anything in mind for what to cook, but anyhow I have another intern to work that out with me.

I was asked to study Sanger sequencing and NGS after my interview for this internship position. What I did today was some intense study. Before I can handle data, I first need to know how it's generated. I enjoyed being able to study without any interruption. Learning new things gives me satisfaction.

Study on Sanger Sequencing, NGS
- Illumina (Solexa) sequencing
- 454 sequencing/ Roche (Pyrosequencing)
- SOLiD sequencing

Study Illumina (Solexa) Sequencing in detail
- Sample preparation (library preparation)
  - DNA fragmentation
  - Ligation of adapter
  - Components of adapter
    - 3 sections: region complementary to the flow cell oligonucleotide, indices, sequencing primer binding site
- Cluster generation (clonal amplification)
- SBS (sequencing by synthesis; massively parallel sequencing)
- Data analysis

Spent the entire day studying Illumina (Solexa) sequencing in detail; a messy sketch to work out the flow

Week 1 Day 2 3/7/2018, Tue

thonzenn — 2018-07-05 08:23:59 UTC

Today I continued studying NGS. I spent the entire morning until lunch summarizing what I studied yesterday by writing it out neatly on paper so that I can revise.

There are a number of varieties of NGS. The ones I studied are Illumina (Solexa) sequencing, 454/Roche sequencing, SOLiD sequencing and Ion Torrent sequencing.

Except for Illumina, the remaining three sequencing methods are rather similar in that they require attachment of DNA fragments to beads followed by emulsion PCR.

454 sequencing, also known as pyrosequencing, relies on the release of pyrophosphate when a nucleotide is incorporated. The released pyrophosphate is used by ATP sulfurylase to convert APS to ATP, which powers the luciferase-catalysed reaction of luciferin, resulting in light emission that is picked up by the detector.

Ion Torrent is very similar to 454 sequencing but relies on a change in pH: an H+ is released when a nucleotide is incorporated, causing a drop in pH. Therefore, unlike the other sequencing methods, Ion Torrent does not involve a light signal.

SOLiD stands for Sequencing by Oligonucleotide Ligation and Detection. It relies on ligation instead of polymerization of DNA. 16 different types of probes are used, one for each combination of 2 nucleotides. Because a few more bases are needed for proper ligation, 3 degenerate bases are added too. The primer is reset a few times to cover the regions occupied by the degenerate bases.

A lot neater as compared to what I did yesterday!

Week 1 Day 3 4/7/2018, Wed

thonzenn — 2018-07-05 08:24:19 UTC

A detailed study of Sanger sequencing and Maxam-Gilbert sequencing. The developers of these methods were awarded the Nobel Prize in Chemistry in 1980 for their contributions concerning the determination of nucleic acid sequences. However, Sanger sequencing remains a popular sequencing method whereas the Maxam-Gilbert method is slowly being forgotten. This is due to its technical complexity and extensive use of hazardous chemicals such as radioactive substances and hydrazine, which is a neurotoxin. Anyhow, M-G sequencing served us well as it was one of the first methods that let us peek into the mystery of genetic information.
With the emergence of NGS and the more recent SMRT, Sanger sequencing is believed to face the same destiny as M-G sequencing.

Sanger sequencing is not without problems: it has difficulty determining bases in runs of identical nucleotides, has trouble reading stretches of CG, and secondary structures can interfere with DNA synthesis and cause premature termination. However, it remains the standard protocol for some scientists as it offers long reads and gold-standard accuracy.

Even though the emergence of NGS has drastically reduced the time taken for sequencing and the cost per base sequenced, there are conditions where Sanger is still preferred.

Week 1 Day 4 5/7/2018, Thu

thonzenn — 2018-07-05 08:24:42 UTC

This company is co-organizing a symposium and is in charge of the workshop, which will be held at my university (UPM). Today we needed to go over to the computer lab that will be used for the workshop to install software, make sure every computer is able to connect to the cloud server, and do a test run on the computers in the lab.

The poster for the symposium & workshop co-organised by this company

I was already told on the first day of my internship that I'll be one of the facilitators for the workshop and that I'll be going through the material before it gets to the participants. However, for the past few days I was only focused on studying the evolution of sequencing and have not touched the command line. So, today is the very first day I'm doing it, and I'll have to deal with 30 computers in the lab alongside Zhe Kin, another intern in this company, and Melvin, one of the CoGen staff.

What we needed to do there was very simple, though; maybe due to time constraints we didn't get to test more stuff. I only had to follow what was demonstrated, which was just some simple commands, and repeat it on the 30 computers there.

For the first time I feel like I’m a hacker!

What we did was log in to the server from every computer with PuTTY and WinSCP and download the workshop materials to the desktop. We also simplified the steps for the participants, such as creating shortcuts at the top level and saving the host name so that there's less chance that a participant connects to the wrong server.

Other than that, we also did a test run of depositing the genome sequence of a sample into the online genome database developed by this company (https://arkgene.com/). Only then did I realize that depositing a genome is not just about posting the sequences, like how I copy and paste a sequence into BLAST. More things need to be included, such as annotation, gene ontology, karyotype, KEGG and many more.

After the testing we went back to the office, and my plan was to continue on third generation sequencing.

There are 3 types of third generation sequencing available on the market: PacBio SMRT sequencing, Oxford Nanopore and Illumina TruSeq Synthetic Long-Read technology.

Week 1 Day 5 6/7/2018, Fri

thonzenn — 2018-07-05 08:27:39 UTC

Today started with some study of third generation sequencing, particularly PacBio SMRT sequencing and Oxford Nanopore technology.

The distinguishing difference between these technologies and NGS is that they do not require clonal amplification of the DNA, which reduces the chance of errors being introduced during amplification. The data can also be read in real time, which means it's like watching a live show of DNA polymerization, whereas the sequence data for NGS cannot be read in real time. NGS is about massively parallel sequencing of DNA fragments, which results in short reads of a few hundred nucleotides that can be a problem in genome assembly. In the third generation, the read length averages 10-20 kb, which can contribute to a more accurate assembled sequence.

SMRT sequencing technology was developed by the company Pacific Biosciences. SMRT stands for single molecule, real-time, which means the sequencing takes place directly on the sample DNA without clonal amplification and is read in real time as the DNA is polymerized. This technology enables, for the first time, the observation of natural DNA synthesis by a DNA polymerase as it occurs. DNA sequencing is performed on SMRT cells, each containing tens of thousands of zero-mode waveguides (ZMWs), which are illuminated by laser light so that the signal released by a fluorescently labeled nucleotide can be detected when the nucleotide is incorporated. A single DNA polymerase is attached to the bottom of each ZMW. The sequencing mechanism is similar to Illumina sequencing, that is, sequencing by synthesis.

Oxford Nanopore sequencing technology was released more recently, in 2016. With this technology, DNA is sequenced by threading it through a microscopic pore in a membrane. Bases are identified by the way they affect ions flowing through the pore from one side of the membrane to the other. The flow of ions through the pore creates a current. Each base blocks the flow to a different degree, altering the current.

The weakness of third generation sequencing is its high error rate (10-15%); however, this can be corrected by some assembly programs to an accuracy of up to 99.99%.

Also, I have started on the command line. The slides ran me through some basic Linux commands. I tried these commands with a software named PuTTY. Some of the commands I learned include:
$ ls - list files and directories
$ mkdir - make directory (create folder)
$ cd - to change directory (click in/enter)
$ cd ~ - go to the home directory (plain cd with no argument does the same)
$ cd .. - go one level up
$ pwd - print working directory (show path)
$ cp - copy a file to another destination
$ mv - move a file to a new destination
$ rm - remove a file (delete)
$ cat/less/head/tail - to display content
$ grep "string" - to search something in a file (like ctrl + F)
$ top - to view processor activity

I find it kind of fun doing things that I don't usually do!
Though it wasn't so hard, it took me some time to learn the technique of using it and to pick up some shortcuts and useful tricks so I can save some typing.

Then I started on the second module, pre-processing, the first step of bioinformatics analysis. The sequencing data (reads) have to be pre-processed because bad sequence quality can affect the analysis. Bad data have to be removed before the sequences are assembled.

The very first task in pre-processing is to run FastQC. To run FastQC, the sequence data have to be in FASTQ format. I was only exposed to the FASTA format in my lectures, which is a simpler format that contains only the identity of the sequence and the sequence itself. The FASTQ format is a 4-line format: the first line, which always starts with @, is the tag that identifies the sequence; the second line is the base sequence; the third line, which starts with a + sign, is an optional sequence identifier; and the last line holds the quality scores corresponding to the bases in line 2.
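
The 4-line layout can be made concrete with a minimal Python sketch. This is only an illustration, not how FastQC itself reads files: the names `FastqRecord` and `parse_fastq` and the example read are made up, and the sketch assumes every record takes exactly 4 lines (no wrapped sequence lines).

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # line 1 without the leading '@'
    sequence: str   # line 2
    quality: str    # line 4, one character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must have a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical single-read example
example = ["@read1", "ACGTN", "+", "IIII#"]
records = list(parse_fastq(example))
print(records[0].name, records[0].sequence, records[0].quality)
```

The length check reflects the rule that line 4 carries one quality character for every base in line 2.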

In the FastQC report, there are basic statistics and 9 other parameters. A good quality sequence should have a green tick beside each of the parameters. Parameters with a red cross or a yellow exclamation mark deserve more attention. Besides the basic statistics, the parameters are
1. Per base sequence quality - has to be high across all bases
2. Per sequence quality score - to see if there are any bad sequences
3. Per base sequence content - the lines in the graph should be parallel, each at about 25%
4. Per sequence GC content - to see if there is any biased sequence; the graph should follow the normal distribution (bell shape). If it doesn't overlap nicely, the sample might be contaminated.
5. Per base N content - tells if there are any uncalled bases (N); ideally there shouldn't be any at all
6. Length distribution - to see if the length is consistent; all sequences should have the same length
7. Sequence duplication level - determines how unique the sequences in the library are; most of the sequences in the library should occur only once
8. Over-represented sequences - to show if there are sequences such as adapters in the library
9. K-mer content - to show short sequences (k-mers) that are over-represented
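
To get a feel for what two of these parameters measure, here is a rough Python sketch of a per-base mean quality and a per-base N fraction. It is only an approximation for illustration: FastQC's own plots show more than a plain mean, the function names are made up, and the quality strings are assumed to use an ASCII offset of 33.

```python
from typing import List


def per_base_mean_quality(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    longest = max(len(q) for q in quality_strings)
    means = []
    for pos in range(longest):
        scores = [ord(q[pos]) - offset for q in quality_strings if len(q) > pos]
        means.append(sum(scores) / len(scores))
    return means


def per_base_n_fraction(sequences: List[str]) -> List[float]:
    """Fraction of reads with an uncalled base (N) at each position."""
    longest = max(len(s) for s in sequences)
    fractions = []
    for pos in range(longest):
        bases = [s[pos] for s in sequences if len(s) > pos]
        fractions.append(bases.count("N") / len(bases))
    return fractions


# Made-up reads: 'I' = Q40, '5' = Q20, '+' = Q10 with offset 33
quals = ["II5", "I5+"]
seqs = ["ACN", "ACG"]
mean_quality = per_base_mean_quality(quals)
n_fraction = per_base_n_fraction(seqs)
print(mean_quality)
print(n_fraction)
```

A drop in the mean towards the end of the reads, or a non-zero N fraction, is the kind of pattern that would earn a warning in the report.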

Today's messy sketch!

Week 2 Day 1 9/7/2018, Mon

thonzenn — 2018-07-05 08:51:01 UTC

On the previous working day I was working on FastQC, and I was kind of curious about why I can do that on the command line. Like, do IT people know biology too?! I mean, there are people who are good at both computer science and biology, but not every one of them is a bioinformatician. So, I did a little research.

First, what is FastQC? FastQC is a quality control tool developed at the Babraham Institute that aims to provide a simple way to do some quality checks on raw sequence data coming from high-throughput sequencing pipelines. It reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each of which helps to identify a potential type of problem in your data. According to the official Babraham website (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
the main functions of FastQC are
1. Import data from BAM, SAM or FastQ files (any variant)
2. Providing a quick overview to tell you in which areas there may be problems
3. Summary graphs and tables to quickly assess your data
4. Export of results to an HTML based permanent report
5. Offline operation to allow automated generation of reports without running the interactive application

The Babraham Institute, which developed FastQC, is a world-class institute in partnership with the University of Cambridge. Its bioinformatics group provides an interface between the computational infrastructure of the institute and the biology performed within the research groups.

FastQC is a Java application. In order to run FastQC, a suitable Java Runtime Environment (JRE) has to be installed on your system. Java versions earlier than 1.6 will have problems running FastQC. However, installing the latest Java version is easy.

The pre-processing of the raw sequence data was done with the FASTX-Toolkit. The FASTX-Toolkit is a collection of command line tools for processing short-read FASTA/FASTQ files. NGS machines usually produce FASTA or FASTQ files containing multiple short-read sequences.
The tools offered by this toolkit include:
- FASTQ-to-FASTA converter
- FASTQ information/Quality filter/Quality trimmer/ Masker
- FASTQ/A Collapser/Trimmer/Renamer/Clipper/Reverse-complement/Barcode splitter/Formatter/Nucleotide changer

These tools can be used in two forms:
1. Web-based (with Galaxy) (http://main.g2.bx.psu.edu/)
2. Command-line

In my case, we use the command line to process the raw data:

1. Trim off undesired segments with
$ fastx_trimmer -Q 33 -f # -l # -i (input file) -o (output file)
where,
-Q 33 = quality encoding of the file (Phred scores stored with an ASCII offset of 33),
-f = first base to keep,
-l = last base to keep

2. Trim low-quality bases and filter short reads with
$ fastq_quality_trimmer -Q 33 -t # -l # -i (input file) -o (output file)
where,
-t = quality threshold for trimming (the number is a Phred score, e.g. Q20 = error probability of 0.01, i.e. 99% base call accuracy),
-l = minimum length of the sequence
* bases with Q < threshold are trimmed off the end of the read, and reads left shorter than the minimum length are discarded

3. Remove adapter sequences (reads containing ambiguous N bases are also discarded by default) with
$ fastx_clipper -Q 33 -a XXXXXXXXXXXXX -i (input file) -o (output file)
where,
-a = adapter sequence (default = CCTTAAGG, a dummy adapter)

4. Remove low quality reads with
$ fastq_quality_filter -Q 33 -q # -p # -i (input file) -o (output file)
where,
-q = minimum quality score to keep
-p = minimum percentage of bases in a sequence that must have at least [-q] quality
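
The thresholds in steps 2 and 4 are easier to reason about with the Phred relation P = 10^(-Q/10) in hand. The Python sketch below mimics, in a simplified way, what the trimming and filtering steps decide for a single read. It is not the FASTX-Toolkit source: the function names are made up, trimming is assumed to happen only at the 3' end, and the quality strings are assumed to use offset 33.

```python
from typing import List, Optional


def phred_to_error_prob(q: int) -> float:
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into Phred scores (offset 33, as with -Q 33)."""
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_end(seq: str, quality: str, threshold: int,
                         min_length: int) -> Optional[str]:
    """Cut bases below `threshold` off the 3' end; return None if the read gets too short."""
    scores = decode_quality(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end] if end >= min_length else None


def passes_quality_filter(quality: str, min_quality: int, min_percent: float) -> bool:
    """True if at least min_percent % of the bases have quality >= min_quality."""
    scores = decode_quality(quality)
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_quality)
    return 100.0 * good / len(scores) >= min_percent


# Made-up read: four Q40 bases ('I') followed by two Q2 bases ('#')
print(phred_to_error_prob(20))
print(trim_low_quality_end("ACGTAC", "IIII##", threshold=20, min_length=3))
print(passes_quality_filter("IIII##", min_quality=20, min_percent=80))
```

In this example only 4 of 6 bases reach Q20, so the read survives trimming as ACGT but would fail a filter that demands 80% of bases at Q20 or better.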

The last step before assembly is to separate the unpaired reads. Some researchers find it unnecessary, as such reads will be discarded anyway when they are not mapped. Anyhow, separating them allows us to get better statistics, which may be of some meaningful use. The server I'm accessing already has the script pre-installed, so I only have to key in the command, sometimes including the full path, to carry out this task. The command I used is
$ /share/apps/scripts/separateunpaired2.bpy
where the long string is the full path to the script, the first two files are the inputs, and the PE and SE files are the outputs (PE = paired end [PE 1 = read 1; PE 2 = read 2], SE = single end)
Some researchers suggest a Galaxy tool for this task. The tool is a short Python script which divides a FASTQ file into paired reads and single (orphan) reads. The input can be two separate files for the forward/reverse reads, or the reads can be interleaved in a single file. The FASTQ variant is unimportant here.
It can be accessed from the following link (https://toolshed.g2.bx.psu.edu/repository?repository_id=3c4a991608b4c8dd)
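
The idea behind separating paired and orphan reads can be sketched in a few lines of Python. This is neither the script on the server nor the Galaxy tool, just an illustration under stated assumptions: reads are given as (name, sequence, quality) tuples, and the two mates of a pair are assumed to share a name apart from a trailing /1 or /2 (or a space-separated tag).

```python
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)


def pair_key(name: str) -> str:
    """Strip the mate suffix so that both mates of a pair share one key."""
    key = name.split()[0]
    if key.endswith("/1") or key.endswith("/2"):
        key = key[:-2]
    return key


def separate_pairs(forward: List[Read], reverse: List[Read]):
    """Return (paired_forward, paired_reverse, singles), keeping the input order."""
    rev_by_key: Dict[str, Read] = {pair_key(r[0]): r for r in reverse}
    fwd_keys = {pair_key(r[0]) for r in forward}
    pe1 = [r for r in forward if pair_key(r[0]) in rev_by_key]
    pe2 = [rev_by_key[pair_key(r[0])] for r in pe1]
    singles = [r for r in forward if pair_key(r[0]) not in rev_by_key]
    singles += [r for r in reverse if pair_key(r[0]) not in fwd_keys]
    return pe1, pe2, singles


# Made-up reads: 'a' has both mates, 'b' and 'c' lost their partner during filtering
forward = [("a/1", "ACGT", "IIII"), ("b/1", "GGGG", "IIII")]
reverse = [("a/2", "TTTT", "IIII"), ("c/2", "CCCC", "IIII")]
pe1, pe2, singles = separate_pairs(forward, reverse)
print([r[0] for r in pe1], [r[0] for r in pe2], [r[0] for r in singles])
```

Keeping PE1 and PE2 in the same order matters, because assemblers that take two separate files expect the mates to line up record by record.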

Week 2 Day 2 10/7/2018, Tue

thonzenn — 2018-07-05 08:56:57 UTC

Bioinformatics depends heavily on Linux-based computers and software. A lot of good scientific software is written specifically for Linux/Unix. Additionally, Linux usually has popular programming languages like Python, Perl and C already installed and ready to use. Since I am now practicing on this company's server, which runs on Linux, I came to wonder how I could practice once I'm no longer with this company. Then I found some websites that offer online Linux simulators. A shell account is a user account on a remote server, traditionally running under the Unix operating system, which gives access to a shell via a command-line interface protocol such as telnet or SSH. Shell providers often offer shell accounts at low cost or for free. These shell accounts generally provide users with access to various software and services, and some may also allow tunneling of traffic to bypass corporate firewalls.
Below are several links to online Linux simulators for practicing Linux commands
http://www.freelinuxconsole.info/

http://www.tutorialspoint.com/codingground.htm

http://bellard.org/jslinux/

http://www.webminal.org
http://linuxzoo.net/
http://www.masswerk.at/jsuix/
https://witsbits.com
(suggested by https://gooroo.io/GoorooThink/Article/16501/Online-simulator-in-Linux--Practice-Linux-Commands-Untitled/20590#.W0QWjNIzZPa)

There are also websites that offer courses for Linux novices like me who want to learn basic Linux without getting into too much technical detail. This website (https://www.bits.vib.be/training-list/112-bits/training/upcoming-trainings/124-linux-for-bioinformatics) is good to try out.

Some useful Linux command line exercises, or rather a cheat sheet, for NGS data processing can be accessed here (http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html)

After pre-processing, we proceed to the next stage, genome assembly. Genome assembly starts with de novo assembly using the Velvet assembler. Velvet is a software package designed for de novo genome assembly and short-read sequencing alignments based on de Bruijn graphs. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EBI) in the UK. The Velvet manual can be accessed here (https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf)

Velvet takes the reads as input and turns them into contigs. It consists of two steps. The first step, velveth, constructs the data set (hashes the reads) for velvetg and includes information about the meaning of each sequence file. Afterward, velvetg builds the de Bruijn graph from the k-mers obtained from velveth, runs simplification and error correction over the graph, and creates the contigs.

The first step is to run velveth (chopping the reads into short k-mers) in order to construct a hash table for the selected k-mer length. The selected k-mer length (a.k.a. hash length) must be an odd number to avoid palindromes. The recommended k-mer length is 61-73. The k-mer should be long enough to avoid false positives, but not so long that reads are missed out of the assembly. A longer k-mer has higher specificity (fewer spurious overlaps) but lower coverage, whereas a shorter k-mer has higher sensitivity. Read this for more information on how k-mer length affects the assembly (http://ivory.idyll.org/blog/the-k-parameter.html). The command used on my system for velveth is
$ velveth hash61 61 -fastq -shortPaired -separate PE1.fastq PE2.fastq > log_velveth
where,
velveth = command,
hash61 = the output directory
61 = k-mer length
-fastq = input file format
-shortPaired = input read type, paired end
(-short = single end)
-separate = the two reads of each pair are in two separate files
PE1.fastq/PE2.fastq = input files
> log_velveth = file that the screen output is redirected to

After velveth, the output directory should contain
1. Log - a useful reminder of what commands were typed to get the assembly result
2. Roadmaps - contains the index that has just been created
3. Sequences - contains the sequences that were put in
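
What velveth does when it "chops reads into k-mers", and why the k-mer length has to be odd, can be shown with a small Python sketch. It is an illustration only (Velvet itself is written in C and works on hashed k-mers); the function names are made up. The point about odd lengths is that a k-mer of odd length can never equal its own reverse complement, because its middle base would have to be its own complement.

```python
from itertools import product


def kmers(read: str, k: int) -> list:
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def is_own_reverse_complement(kmer: str) -> bool:
    return kmer == reverse_complement(kmer)


print(kmers("ACGTAC", 3))                  # a 6-base read gives 6 - 3 + 1 = 4 k-mers
print(is_own_reverse_complement("ACGT"))   # even length: palindromes are possible

# Exhaustive check over all 64 possible 3-mers: none is its own reverse complement
odd_palindromes = ["".join(p) for p in product("ACGT", repeat=3)
                   if is_own_reverse_complement("".join(p))]
print(odd_palindromes)
```

A read of length L yields L - k + 1 k-mers, which is also why a longer k-mer leaves fewer k-mers per read and lower k-mer coverage.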

After velveth comes out with the k-mers, velvetg builds the de Bruijn graph and runs simplification and error removal to come out with contigs.

Simplification is about merging nodes that do not affect the paths generated in the graph. Errors in the graph can be caused by the sequencing process or by errors contained in the sample itself.
Velvet recognizes 3 kinds of error, namely
1. tips - nodes that are disconnected on one of their ends
2. bubbles - when two distinct paths start and end at the same nodes; these are removed by the Tour Bus algorithm
3. erroneous connections - connections that do not generate paths or do not create any recognizable structure within the graph.

This website provides a comprehensive breakdown on the velvet algorithm (http://melbournebioinformatics.github.io/MelBioInf_docs/tutorials/assembly/assembly-background/)

The command used for velvetg in my case is as follows
$ velvetg hash61 -exp_cov auto -read_trkg yes > log_velvetg
where
velvetg = command
hash61 = the directory created by velveth, which velvetg also writes its output to
-exp_cov = expected coverage of unique regions (median coverage depth)
(auto = let Velvet infer it)
-read_trkg = tracking of short read positions in the assembly
(add -unused_reads yes to save unassembled reads into a file named UnusedReads.fa)

The log file for velvetg (log_velvetg) provides information such as
1. median coverage depth
2. number of nodes - number of contigs in the graph
3. N50 size
4. max contig size
5. total contig size
6. number of reads utilized to construct the de Bruijn graph

After velvetg, the output directory should contain
1. contigs.fa - the assembled contigs in FASTA format
2. LastGraph - detailed representation of the de Bruijn graph
3. PreGraph
4. UnusedReads.fa
5. Graph2
6. stats.txt - information about each contig node (average coverage, length in k-mers, and how many edges go in/out of the node)

The N50 length is the length of the shortest contig such that the sum of the lengths of all contigs of equal or greater length is at least 50% of the total length of all contigs. More descriptive explanations can be found at the following links
(https://www.molecularecologist.com/2017/03/whats-n50/)
(http://jermdemo.blogspot.com/2008/11/calculating-n50-from-velvet-output.html)
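
The N50 definition above translates almost word for word into code. The following Python function is a minimal sketch with made-up contig lengths; it works on plain lengths and does not read Velvet's output files.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Made-up contig lengths: total = 290, so half = 145.
# 80 alone is not enough, 80 + 70 = 150 passes the halfway mark, so N50 = 70.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

Sorting from longest to shortest and stopping at the halfway mark is the whole trick; a few long contigs push N50 up, many short ones pull it down.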

To get the best k-mer, optimization can be done by testing different values of hash length (61-93). That literally means repeating velveth and velvetg with every odd k-mer length from 61 to 93 (61, 63, 65, ..., 91, 93). A script can be prepared to automate this in a brute-force way

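#!/bin/bash
# run_velvet.sh
# Run velveth and velvetg once for every odd k-mer length from 61 to 93.
for f in $(seq 61 2 93)
do
    # hash the reads for this k-mer length into the directory h<k>
    /share/apps/velvet_1.2.08/velveth h"$f" "$f" -fastq -shortPaired -separate PE1.fastq PE2.fastq >> log.batch.velveth
    # build the graph and write the contigs into the same directory
    /share/apps/velvet_1.2.08/velvetg h"$f" -exp_cov auto -read_trkg yes -unused_reads yes -ins_length 153 -ins_length_sd 100 >> log.batch.velvetg
done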

This script varies depending on where Velvet is installed on your server and on the Velvet version used. The brute-force approach is to try every possibility and decide afterwards which one is best. In this context, running Velvet in a brute-force way means trying every k-mer length and seeing which gives the best result.

The script is run with the following command
$ bash run_velvet.sh &
bash = the command processor (shell) that executes the script
run_velvet.sh = the script file
& = run in the background (leave out '&' if you do not want the command to run in the background)

It takes a couple of minutes to run; by the end of the run the output files can be found (log.batch.velveth/log.batch.velvetg)
To check the reports from every Velvet run directory, use this command
$ tail -n 1 h*/Log
The report is a compilation of the last line of the velvetg log for every k-mer length.
Whereas the number of contigs (sequences in contigs.fa) for each assembly can be checked with the following command line
$ grep -c ">" h*/contigs.fa

A combined graph of k-mer length against N50 (bp) and k-mer length against number of contigs can be plotted to make finding the best k-mer length easier. A good k-mer length should give a low number of contigs and a long N50. The k-mer length that best fulfills these two conditions at this stage is selected to proceed to scaffolding.
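
Instead of reading the best k-mer length off a graph, the same choice can be sketched in Python. The numbers below are hypothetical and only for illustration (they are not from a real run), and the rule used here, highest N50 first and fewest contigs as the tie-breaker, is just one possible way to combine the two conditions.

```python
def pick_best_k(results):
    """results: list of (k, n50, n_contigs). Prefer the highest N50, then the fewest contigs."""
    return max(results, key=lambda r: (r[1], -r[2]))[0]


# Hypothetical (k, N50, number of contigs) values, not real data
runs = [(61, 12000, 310), (63, 15500, 280), (65, 15500, 260), (67, 9000, 400)]
best_k = pick_best_k(runs)
print(best_k)
```

Here k = 63 and k = 65 tie on N50, and k = 65 wins because it produces fewer contigs.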

Week 2 Day 3 11/7/2018, Wed

thonzenn — 2018-07-05 09:03:33 UTC

Note that Velvet leverages the de Bruijn graph method for de novo assembly, but what is that method about?

First things first: genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assemblies assume no prior knowledge of the source DNA sequence length, layout or composition. In a genome sequencing project with NGS, the DNA of the target organism is broken up into millions of small pieces and read on a sequencing machine. These 'reads' vary from 20 to 20k nucleotides in length depending on the sequencing method used. The goal of the assembly is to produce long contiguous pieces of sequence (contigs) from these reads. The contigs are then ordered and oriented in relation to one another to form scaffolds. There are two commonly used methods for de novo assembly: Overlap-Layout-Consensus (OLC) assembly and de Bruijn graph (DBG) assembly.

In OLC assembly, as stated in its name, there are three steps: overlap, layout and consensus. The overlap step finds potentially overlapping reads. This is done by all-against-all pairwise comparison. The graph is built with reads as nodes and overlaps as edges. The layout step merges reads into contigs by analysing, simplifying and cleaning the overlap graph and then determining the Hamiltonian path. The consensus step derives the DNA sequence and corrects read errors by aligning the reads along the assembly path and calling bases using weighted voting. This method is more suitable for longer reads (>=1 kb). Some of the challenges faced by this method include
1. Building overlaps is slow (how to determine all pairwise overlaps of reads)
2. The overlap graph is big (how to limit the size of the graph used to represent the data)
A detailed explanation of OLC assembly can be accessed here (http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_olc.pdf)

Overlap–layout–consensus genome assembly algorithm: Reads are provided to the algorithm. Overlapping regions are identified. Each read is graphed as a node and the overlaps are represented as edges joining the two nodes involved. The algorithm determines the best path through the graph (Hamiltonian path). Redundant information (i.e., unused nodes and edges) is discarded. This process is carried out multiple times and resulting sequences are combined to give the final consensus sequence that represents the genome.
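
The overlap step can be sketched naively in Python. This is a toy version that uses exact suffix-prefix matching on made-up reads, whereas real OLC assemblers tolerate mismatches and use indexing to avoid comparing every pair; the function names are hypothetical. It also shows why the step is slow: every ordered pair of reads is compared.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` equal to a prefix of `b`, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len):
    """Edges {(a, b): overlap length} from an all-against-all comparison of the reads."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges


# Made-up reads that chain together: ACGTTG -> TTGCAA -> CAAGGT
reads = ["ACGTTG", "TTGCAA", "CAAGGT"]
edges = overlap_graph(reads, min_len=3)
print(edges)
```

With n reads there are n * (n - 1) ordered pairs to test, which is the scaling problem mentioned in the challenges above.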

The de Bruijn graph focuses not on the reads but on what the reads can tell about the sequence of the original genome. Each read is broken up into smaller substrings of length k (k-mers). The k-mers end up representing the nodes in the graph. An edge is created when 2 k-mers are immediately adjacent in a read (they overlap by exactly k-1 letters). The graph is constructed by assigning each unique k-mer as a node and connecting immediately overlapping k-mers by an edge. At positions where related sequences diverge due to allelic polymorphism, splicing variation, repeats or sequencing errors, the graph forms a bubble. After building the graph from all the reads, the graph is pruned to remove bubbles and structures that are likely to be sequencing errors. The graph is then compacted by collapsing nodes that form linear unbranched chains of overlapping k-mers.
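
The construction described above can be sketched in Python as follows. This is a toy illustration on made-up reads, following the node and edge definition given in this post (k-mers as nodes, an edge between k-mers that are adjacent in a read). It ignores reverse complements, coverage counts and all of the pruning and compaction that Velvet performs.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer (node) to the sorted list of k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        for km in kms:
            graph.setdefault(km, set())       # every unique k-mer becomes a node
        for left, right in zip(kms, kms[1:]):
            graph[left].add(right)            # adjacent k-mers overlap by k-1 letters
    return {node: sorted(successors) for node, successors in graph.items()}


# Made-up reads; the last one differs from the second by a single base
reads = ["ACGTC", "CGTCA", "CGTGA"]
graph = de_bruijn_graph(reads, 3)
for node, successors in graph.items():
    print(node, "->", successors)
```

The node CGT ends up with two successors (GTC and GTG), which is how a divergence between related sequences shows up as a branch in the graph.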

The de Bruijn graph method is very sensitive to sequencing errors, which create false k-mers that complicate the graph structure and reduce its apparent efficiency. An Eulerian path goes through each edge exactly once. A typical graph will have an exponential number of Eulerian paths, so finding the one path that represents the genome being assembled is an intractable computational problem. It is made even more challenging by errors, which further complicate the choice between competing paths. The graph is also a relatively complex structure to construct. This paper provides more insight into why de Bruijn graphs are useful for genome assembly (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5531759/pdf/nihms329513.pdf)

The major differences between OLC assembly and dBG assembly are
OLC
- nodes = reads; edges = overlaps
- Determine Hamiltonian path
- less sensitive to repeats and read errors
- graph construction is more demanding (computationally)
- doesn't scale to voluminous short reads

dBG
- nodes = one for each unique k-mer; edges = k-1 exact overlap between 2 nodes
- determine Eulerian path (linear time algorithm)
- more sensitive to repeats and read errors
- graph converges at repeats of length k or longer
- one read error introduces k false nodes
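
The last point, that one read error introduces k false nodes, can be checked with a tiny Python example. The reads are made up, and the count of k is an upper bound: an error close to the end of a read, or a false k-mer that happens to exist elsewhere, gives fewer.

```python
def kmer_set(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def false_kmers(true_read, observed_read, k):
    """k-mers present in the observed read but absent from the error-free read."""
    return kmer_set(observed_read, k) - kmer_set(true_read, k)


true_read = "AAACCCGGGTTT"
bad_read = "AAACCTGGGTTT"    # single substitution at the 6th base (C -> T)
errors = sorted(false_kmers(true_read, bad_read, 3))
print(errors)
```

With k = 3, the one wrong base sits inside exactly 3 overlapping k-mers, so 3 nodes that do not belong to the genome are added to the graph.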

Week 2 Day 4 12/7/2018, Thu

thonzenn — 2018-07-05 09:03:35 UTC

Before proceeding to scaffolding, let's first get familiar with the vocabulary.
Contig - the sequence of a maximal path through the graph / a contiguous length of genomic sequence in which the order of bases is known to a high confidence level
Scaffold - contigs linked and oriented based on paired-end information; it is composed of contigs and gaps
Gap - occurs where reads from the two sequenced ends of at least one fragment overlap with other reads in two different contigs

PacBio has given a comprehensive review on the comparison of contigs and scaffolds. Contigs are continuous stretches containing only A, C, G or T bases without gaps. Scaffolds are created by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome. Contigs in a scaffold are separated by gaps, which are designated by a variable number of 'N' letters. Scaffolding is often used for short-read assemblies to make sense of the fragmented genome assemblies containing short contigs.
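
Since gaps are written as runs of 'N', a scaffold can be taken apart into its contigs and gap sizes with a short Python sketch. The scaffold string is made up and the function name is hypothetical; real scaffolds would be read from a FASTA file.

```python
import re


def split_scaffold(scaffold: str):
    """Return (contigs, gap_lengths) by cutting the scaffold at runs of N."""
    scaffold = scaffold.upper()
    contigs = [c for c in re.split(r"N+", scaffold) if c]
    gaps = [len(g) for g in re.findall(r"N+", scaffold)]
    return contigs, gaps


# Made-up scaffold: three contigs separated by gaps of 5 and 3 N's
contigs, gaps = split_scaffold("ACGTACNNNNNGGCTANNNTTAG")
print(contigs, gaps)
```

The gap lengths recovered this way are only the number of N's written into the scaffold, which is not necessarily the true amount of missing sequence.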

Using paired-read sequencing data, it is possible to assess the order, distance and orientation of contigs and combine them into scaffolds. Paired-read sequencing technology may help to reduce the number of contigs, as the known intermediate distance between read pairs can be used to place contigs in their likely order and orientation. The length of the resulting scaffolds (or supercontigs) also reflects the estimated distance between the initial contigs.
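
The arithmetic behind the estimated distance can be sketched in Python. This is a simplified model, not the calculation of any particular scaffolder. It assumes a forward read on contig A, its mate in reverse on contig B, and an insert size measured from the outer end of one read to the outer end of the other. All coordinates and names are hypothetical.

```python
from statistics import mean


def estimate_gap(insert_size, contig_a_length, read1_start, read2_end):
    """Estimate the gap between contig A and contig B from one read pair.

    read1_start: 0-based start of the forward read on contig A
    read2_end:   end position (exclusive) of the reverse read on contig B
    """
    bases_on_a = contig_a_length - read1_start   # part of the insert lying on contig A
    bases_on_b = read2_end                       # part of the insert lying on contig B
    return insert_size - bases_on_a - bases_on_b


# Three hypothetical read pairs linking the same two contigs
pairs = [(800, 150), (820, 160), (790, 145)]
estimates = [estimate_gap(500, 1000, r1, r2) for r1, r2 in pairs]

print(estimates)               # [150, 160, 145]
print(round(mean(estimates)))
```

Real insert sizes vary from fragment to fragment, which is one reason the gap size written into a scaffold is only an estimate.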

There are 3 principal deficiencies of scaffolds
1. Scaffolds miss critical information
- gaps represent missing genomic information and these gaps often coincide with important genomic loci.
2. The length of a scaffold gap often has no relation to the true gap size
- In most cases, the true length of the sequence represented by the gap differs from the set gap size. The uncertainty in scaffold gap sizes makes it impossible to understand the true spatial relationships of functional elements in the genome, and leads to an underestimate of the actual extent of missing information
3. Gap-flanking scaffold sequence can be low-quality and is sometimes completely wrong
- The sequences surrounding gaps often fall into areas where short-read technologies have deficiencies due to GC-bias or read-length limitations. In some ways, having incorrect flanking sequence in scaffolds is worse than having 'N' gaps, since that erroneous sequence is considered and included for downstream analyses. (https://www.pacb.com/blog/genomes-vs-gennnnes-difference-contigs-scaffolds-genome-assemblies/)

The scaffolding will be done using SSPACE. SSPACE, which stands for SSAKE*-based Scaffolding of Pre-Assembled Contigs after Extension, is a stand-alone program for scaffolding pre-assembled contigs (in my case, velvet is used for the assembly) using NGS paired-read data. Scaffolding algorithms are often built-in functions of de novo assembly tools and cannot be independently controlled. SSPACE is unique in offering the possibility to manually control the scaffolding process. SSPACE is implemented in Perl and requires bowtie (http://bowtie-bio.sourceforge.net/index.shtml).

Some insight from the developer of SSPACE, Marten Boetzer: he developed this program because, back in 2010, he couldn't find a program able to do stand-alone scaffolding, apart from Bambus. He also found lots of issues with Bambus, including errors and complicated input datasets. He listed the main features of SSPACE as
1. Input are simple FASTA contig sequences as well as (multiple) FASTA/FASTQ paired-read data
2. High quality scaffolds in a short runtime and limited memory requirements
3. High reduction of the amount of contigs stored in scaffolds and high N50 value
4. Multiple library input of both paired-end and/or mate pair datasets
5. Possible contig extension of unmapped sequence reads
6. Easy interpretation of the final scaffolds
7. Visualization of the final scaffolds using GraphViz
(http://seqanswers.com/forums/showthread.php?t=8350)

*For more information about SSAKE, please read
René L. Warren, Granger G. Sutton, Steven J. M. Jones, Robert A. Holt; Assembling millions of short DNA sequences using SSAKE, Bioinformatics, Volume 23, Issue 4, 15 February 2007, Pages 500–501, https://doi.org/10.1093/bioinformatics/btl629

The SSPACE algorithm
(for lazy readers, read the highlighted part)
1. Short paired DNA reads are filtered by removing sequences containing non-ATCG characters; the remaining read pairs are mapped against the pre-assembled contigs using Bowtie*
2. The position and orientation of each pair that could be mapped is stored in a hash.
3. Duplicate read pairs are removed
4. *start of scaffolding* Putative contig pairs are computed based on the position of the paired reads on different contigs
- contig pairs are only considered if the calculated distance between them satisfies the user-defined distance range
5. After pairing of contigs, scaffolds are formed by iteratively combining contigs, starting with the largest contig
- only if a minimum number of read pairs (k) support the connection (default k=5)
- If there are alternative connections, the algorithm seeks to place all alternatives in the correct order using the estimated insertion. Otherwise a ratio is calculated between the two best alternatives. If the ratio is below the threshold (default a=0.7), a connection with the best scoring alternative is established.
6. Extension of a scaffold is aborted if either
- a contig has no links with other contigs or
- the ratio for alternatives is exceeded
7. The scaffolding process is repeated until all contigs are incorporated into linear scaffolds
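
The link rule in steps 5 and 6 can be sketched in Python. This is a rough sketch, not SSPACE's actual Perl code. It assumes the ratio is the link count of the second-best alternative divided by that of the best one, and it skips the attempt to order alternatives by distance. The function name, contig names and counts are hypothetical.

```python
def choose_connection(link_counts, min_links=5, max_ratio=0.7):
    """Pick the contig to connect to, or None if the extension should stop.

    link_counts maps a candidate contig name to the number of read pairs
    that link it to the contig currently being extended.
    """
    # Keep only connections supported by at least k read pairs
    supported = {name: n for name, n in link_counts.items() if n >= min_links}
    if not supported:
        return None                      # no usable links: extension stops
    ranked = sorted(supported.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]              # a single, unambiguous connection
    (best_name, best_links), (_, second_links) = ranked[0], ranked[1]
    ratio = second_links / best_links
    # Connect only when the best alternative clearly beats the runner-up
    return best_name if ratio < max_ratio else None


print(choose_connection({"contig7": 20, "contig9": 6}))   # contig7 (ratio 0.3)
print(choose_connection({"contig7": 10, "contig9": 9}))   # None (ratio 0.9)
print(choose_connection({"contig7": 3}))                  # None (fewer than k links)
```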

This paper provides detailed information on scaffolding with SSPACE
(Bioinformatics, Volume 27, Issue 4, 15 February 2011, Pages 578–579, https://doi.org/10.1093/bioinformatics/btq683)

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. It employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a small memory footprint.
This paper describes Bowtie in detail.
(Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25. http://doi.org/10.1186/gb-2009-10-3-r25)
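
To get a feel for the Burrows-Wheeler idea, here is a toy Python sketch. It is nowhere near Bowtie's real index: it builds the transform by sorting all rotations, which is only feasible for tiny strings, and then counts pattern occurrences by backward search. The function names are made up for the example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    text = text + "$"                     # '$' marks the end and sorts first
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text using backward search on the BWT."""
    last_column = bwt(text)
    first_column = sorted(last_column)
    first_row = {}                        # first row of each character in the sorted column
    for row, char in enumerate(first_column):
        first_row.setdefault(char, row)

    low, high = 0, len(last_column)       # current range of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + last_column[:low].count(char)
        high = first_row[char] + last_column[:high].count(char)
        if low >= high:
            return 0
    return high - low


print(bwt("ACAACG"))                      # GC$AAAC
print(count_occurrences("ACAACG", "AC"))  # 2
```

The pattern is matched from its last character to its first, and only the transformed string plus a few counts are consulted, which hints at why such an index can stay small.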

Before performing scaffolding with SSPACE, an SSPACE library file is needed. The file basically consists of the names of the PE1 and PE2 files, the insert size, the allowed error rate and the paired-end indicator.
The following command line is used to run the scaffolding
$ SSPACE_Basic_vXXXX -l sspace_library.txt -s contigs.fa -z # -k # -a # -n # -T # > log_sspace
where
# = number
-l = the library file
-s = the contigs file from the assembly (velvet)
-z = minimum contig length used for scaffolding. Filter out contigs that are below -z (default = 0, optional)
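-k = minimum number of links (read pairs) to compute scaffold (default = 5, optional)
-a = maximum link ratio between the two best contig pairs; higher values lead to less accurate scaffolding (default = 0.7, optional)
-n = minimum overlap required between contigs to merge adjacent contigs in a scaffold (default = 15, optional)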
-T = number of threads used for Bowtie map (default = 1)
-b = base name for your output files
more info can be found here (https://github.com/ablab/external-tools/blob/master/scaffolding/sspace/MANUAL)
or the Standard User manual
(http://gyra.ualg.pt/_static/F132-03_SSPACE_Standard_User_manual_v3.0.pdf)

What I encountered was that the command could not be called; in this case, try writing it out with the full path. In my case, I tried
$ perl /share/apps/SSPACE-BASIC.... (depending on where it is stored on your server)
The outputs include
1. log_sspace - logs execution time / errors
2. standard_output.final.scaffolds.fasta - final scaffolds produced by SSPACE
3. standard_output.final.evidence - produced scaffolds including the initial numbered contigs
4. standard_output.summaryfile.txt - gives a summary after every step. Summary of number of inserted sequences, filtered sequences, contig sequences, mapping stats, pairing stats and contig/scaffold size summaries
5. reads - FASTA file, converted files of the paired-read data, in which each two consecutive sequences are a pair. This file is used as input for both the contig extension and the scaffolding step
6. bowtieoutput - folder containing 4 output files of bowtie, the index files generated by 'bowtie-build'. Produced for each library.
7. intermediate_results - FASTA files of all contig sequences, both extended and non-extended
8. pairinfo - folder consisting of 2 files, namely the pairing distribution file and the pairing issues file.
more detail can be found here (https://github.com/nsoranzo/sspace_basic)

View the output assembly file at standard_output.final.scaffolds.fasta
The number of scaffolds can be counted with the command
$ grep -c ">" standard_output.final.scaffolds.fasta

Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated length. Some of these gaps can be closed by re-processing latent information in the raw reads. Gap filling is the last phase in genome assembly. The input is the scaffolds (linearly ordered contigs) and the reads, whereas the output is scaffolds in which gaps between contigs have been filled.

GapFiller is chosen for gap closing in my case. GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds. It is unique in offering the possibility to manually control the gap closure process. By using the distance information of paired-read data, GapFiller seeks to close the gap from each edge in an iterative manner.

GapFiller was developed by the developer of SSPACE too! The developer claims that GapFiller is more accurate than IMAGE and SOAPdenovo's GapCloser. Although GapFiller yields similar results to GapCloser in terms of the number of gaps/nucleotides closed, its smaller error rate indicates that GapFiller is more appropriate for reliable gap filling.
The main features of GapFiller mentioned are as follows
1. Input are simple FASTA scaffold sequences as well as (multiple) FASTA/FASTQ paired-read data
2. Multiple library input of both paired-end and/or mate pair datasets
3. High quality closing of gaps
4. High reduction of the number of gaps, and the number of gapped nucleotides
5. Detailed output of gaps (such as no. of reads used, no. of nucleotides, remaining gapped nucleotides)
6. Detailed output of the gapclosing process
(http://seqanswers.com/forums/showthread.php?t=21493)

Note that the seed-and-extend description that follows comes from a different tool that shares the name GapFiller (Nadalin et al., cited further down). The innovative feature of that GapFiller is the possibility to produce highly reliable output that, having been certified correct and hence needing no further validation, can be used to improve or validate a whole genome assembly. The method is based on a seed-and-extend scheme: it selects one read and tries to extend it using reads that overlap it over a significant region. The main drawback of seed-and-extend assemblers is their inherent inability to cope with complex (i.e. repetitive) genomes. The advantage of this method lies in generating correct and certified contigs and, as a by-product, in identifying "difficult" areas such as repeats and low-coverage regions, thus avoiding the production of wrong contigs. The algorithm is based on a carefully chosen hash function together with a set of heuristics able to avoid or detect errors, as well as on a test for establishing the correctness of a sequence.

How does seed-and-extend fill the gaps?
Seed-and-extend assemblers repeatedly pick a seed (either a read or a previously assembled contig) and extend it using other reads. This procedure is realised by computing and analysing all, or almost all, of the overlaps between the seed's tip and the remaining available reads. The reads used for an extension are those with the highest alignment score. Their computational bottleneck is the ability to quickly compute all the alignment scores that have to be determined.
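
A toy version of this loop in Python may make it more concrete. It is a sketch under strong simplifications and not the algorithm of any real assembler: only exact overlaps are allowed, the length of the overlap stands in for the alignment score, and each read is used once. The reads and function names are invented.

```python
def best_extension(seed, reads, min_overlap=4):
    """Find the read whose start has the longest exact overlap with the seed's tip."""
    best_read, best_length = None, 0
    for read in reads:
        # The read must add at least one new base, so the overlap is shorter than the read
        longest_possible = min(len(seed), len(read) - 1)
        for length in range(longest_possible, min_overlap - 1, -1):
            if seed.endswith(read[:length]):
                if length > best_length:
                    best_read, best_length = read, length
                break
    return best_read, best_length


def extend(seed, reads, min_overlap=4):
    """Repeatedly extend the seed until no remaining read overlaps its tip."""
    contig = seed
    remaining = list(reads)
    while True:
        read, overlap = best_extension(contig, remaining, min_overlap)
        if read is None:
            return contig
        contig += read[overlap:]          # append only the bases past the overlap
        remaining.remove(read)


reads = ["GTACGGA", "CGGATTC", "TTTTTT"]  # hypothetical reads
print(extend("ACGTAC", reads))            # ACGTACGGATTC
```

Checking every read against the tip at every step is the slow part, which matches the bottleneck described above.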

How does this GapFiller fill the gaps?
1. Dataset preparation. All useful reads are stored in a memory-efficient data structure
- this allows overlaps between the contig under construction and the remaining available reads to be computed readily
2. Each seed read (possibly belonging to a new set of paired reads) is selected one after the other and used to start an extension phase
3. Contig extension. Halts when a stop condition is reached.
- the contig produced is labelled as trusted or not trusted depending on the stop condition

Steps in contig extension

(a) The putative overlapping reads, selected by their fingerprint values, are checked for the presence of mismatches and possibly discarded.
(b) The consensus string is computed for every position j such that either j ≤ F(C) or at least m = 2 reads are available. In the paper's figure, the characters circled in gray and red refer to low-represented and non-represented positions, respectively. In the presence of ambiguities (i.e., positions in which more than one character occurs with the same representation rate), GapFiller chooses the character belonging to the first read encountered, from left to right.
(c) Reads with mismatches at the low-represented positions are discarded, hence they do not contribute to reaching the threshold m to compute a new consensus string. In the figure's example, the tail of read r4 is cut at the non-represented position, regardless of whether it matches the consensus string or not.
(d) The reads still alive after Step 3 are used to compute the final consensus string Cnew. Since there are 2 ≥ m available reads exceeding Si’s tail, Cnew is computed, attached to Si, and the extended contig Si+1 is obtained.
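
The consensus part of these steps can be sketched in Python. This is a simplified illustration, not the code from the paper: it leaves out the fingerprint check and the discarding of reads with mismatches. It only shows the column-by-column vote, the minimum of m reads per position and the tie rule (first read wins). The input is assumed to be the part of each overlapping read that hangs past the contig's tail. All names and sequences are hypothetical.

```python
from collections import Counter


def consensus_past_tail(read_tails, m=2):
    """Build the consensus of the read parts extending past the contig's tail.

    A position is called only while at least m reads cover it; on a tie the
    character of the first read encountered is chosen.
    """
    consensus = []
    position = 0
    while True:
        column = [tail[position] for tail in read_tails if position < len(tail)]
        if len(column) < m:
            break                                   # too few reads: stop extending
        counts = Counter(column)
        top = max(counts.values())
        # Reads are visited in order, so ties go to the first read encountered
        consensus.append(next(base for base in column if counts[base] == top))
        position += 1
    return "".join(consensus)


tails = ["GATTA", "GATC", "GA"]                     # hypothetical read tails
print(consensus_past_tail(tails))                   # GATT
```

The fourth position is a tie between T and C, so the first read decides; the fifth position is covered by one read only, so the extension stops there.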

For more detailed explanation, please refer to the following paper
Nadalin, F., Vezzi, F., & Policriti, A. (2012). GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics, 13(Suppl 14), S8. https://doi.org/10.1186/1471-2105-13-S14-S8

Like SSPACE, the GapFiller from the SSPACE developer needs a library file. It is almost exactly the same as the one for SSPACE, except that it also includes the aligner. An example of the library file would be
Lib1 bowtie PE1.fastq PE2.fastq 4000 0.25 FR
where
Lib1 - name of the library
bowtie - name of the aligner, either bowtie, bwa or bwasw
PE1/2.fastq - fasta or fastq files for both ends. For each paired read, one of the reads should be in the first file and the other in the second file
4000 - expected/observed insert size between paired reads
0.25 - minimum allowed error
* with an expected insert size of 4000 and an error of 0.25, the distance can deviate by 4000*0.25 = 1000 in either direction. Thus pairs with a distance between 3000 and 5000 are valid pairs
FR - the orientation of the paired reads; orientation can be FF, FR, RF, or RR, where F stands for forward --> and R stands for reverse <--
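
The distance window can be worked out with a few lines of Python. This helper is not part of GapFiller; it is a hypothetical function that splits one library line into the fields described above and computes the accepted range.

```python
def parse_library_line(line):
    """Split one library line into named fields and compute the valid distance range."""
    name, aligner, file1, file2, insert_size, error, orientation = line.split()
    insert_size = int(insert_size)
    error = float(error)
    margin = insert_size * error          # allowed deviation in either direction
    return {
        "name": name,
        "aligner": aligner,
        "files": (file1, file2),
        "insert_size": insert_size,
        "error": error,
        "orientation": orientation,
        "min_distance": insert_size - margin,
        "max_distance": insert_size + margin,
    }


library = parse_library_line("Lib1 bowtie PE1.fastq PE2.fastq 4000 0.25 FR")
print(library["min_distance"], library["max_distance"])   # 3000.0 5000.0
```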

The following command is used to run gap filling with GapFiller
$ GapFiller.pl -l gapfill_library.txt -s standard_output.final.scaffolds.fasta -m 60 -o 15 -r 0.8 -n 30 -t 10 -T 2 > log_gapfill
where
-l = the library file
-s = the final scaffold output file from SSPACE
-m = minimum number of overlapping bases with the edge of the gap (default = 29)
-o = minimum number of reads needed to call a base during an extension (default = 2)
-r = percentage of reads that should have a single nucleotide extension in order to close a gap in a scaffold (default = 0.7)
-n = minimum overlap required between contigs to merge adjacent sequences in a scaffold (default = 10)
-t = number of bases to trim off the start and end of the sequence (usually misassembled/low-coverage reads) (default = 10, optional)
-T = number of threads to run (default = 1)
-i = number of iterations to fill the gaps (default = 10, optional)

more detailed explanation on the parameter can be accessed from the manual (http://stab.st-andrews.ac.uk/wiki/index.php?title=GapFiller&action=pdfbook&format=single)

After running the program, the output files in the folder standard_output are
1. standard_output.filled.final.txt
2. standard_output.gapfilled.final.fa - the output assembly file
3. standard_output.summaryfile.final.txt
4. standard_output.closed.evidence.final.txt
5. alignoutput (folder)
6. intermediate_results (folder)
7. reads (folder)

Count the number of scaffolds in the final assembly with
$ grep -c ">" standard_output.gapfilled.final.fa

Compare the number of scaffolds before and after gap filling. If the number drops, the gap filling succeeded in closing some of the gaps. However, if the number stays the same, it doesn't mean the gaps are not getting smaller!
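
Since the scaffold count alone can hide progress, it may help to also count the gaps and the N bases before and after. Below is a small Python sketch that does this for FASTA lines, assuming gaps are written as runs of N. The function names and the two tiny example assemblies are made up.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_summary(lines):
    """Return (number of scaffolds, number of gaps, number of N bases)."""
    scaffolds = gaps = n_bases = 0
    for _, sequence in read_fasta(lines):
        scaffolds += 1
        runs = re.findall(r"[Nn]+", sequence)
        gaps += len(runs)
        n_bases += sum(len(run) for run in runs)
    return scaffolds, gaps, n_bases


before = [">s1", "ACGTNNNNNACGT", ">s2", "GGNNCC"]   # hypothetical scaffolds
after = [">s1", "ACGTTTNNACGT", ">s2", "GGATCC"]     # same scaffolds after gap filling

print(gap_summary(before))   # (2, 2, 7)
print(gap_summary(after))    # (2, 1, 2)
```

For real files, an open file handle can be passed instead of a list, for example `with open("standard_output.gapfilled.final.fa") as handle: print(gap_summary(handle))`.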

This step marks the end of genome assembly.

Fun fact! There are tools that can scaffold and complete short-read assemblies while the long-read sequencing run is in progress! The assembly metrics are reported in real time, so the sequencing run can be terminated once an assembly of sufficient quality is obtained.

A stream of long reads is aligned to the existing contigs to create alignment records. Bridges connecting contigs are formed and used for extending scaffolds. These steps are performed in a streaming fashion.

Read this paper for more info
Cao, M. D., Nguyen, S. H., Ganesamoorthy, D., Elliott, A. G., Cooper, M. A., & Coin, L. J. (2017). Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nature Communications, 8, 14515.
https://doi.org/10.1038/ncomms14515

ASM Genome Announcement

thonzenn — 2018-07-11 09:37:15 UTC

For genome sequences, manuscripts should provide:
- A rationale or significance for the sequencing.
- The provenance for the organism sequenced.
- Taxonomic identification down to genus for prokaryotic isolates.
- A description of how the organism was isolated and growth conditions for cultivation. For single-cell amplified genomes, authors should instead supply information about how the cell was identified and isolated.
- Detailed methods for DNA isolation, library preparation, and sequencing (including the technology and chemistry used).
- A description of how the reads were quality controlled.
- Details on how the genome was assembled and, if applicable, annotated.
- A citation and version number for every piece of software used.
- Relevant statistics for the sequencing run (e.g., read length and number of reads in total).
- Relevant statistics for the assembly (e.g., number of contigs and N50 values).
- Genome GC content and total size.
- Accession numbers for both the assembly and raw reads that link to publicly available data.
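
The assembly statistics in this list (number of contigs, N50, GC content and total size) can be computed with a short Python sketch. It is an illustration with made-up sequences, and it makes two assumptions: N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size, and GC content is calculated over A, C, G and T bases only, ignoring N.

```python
def assembly_stats(sequences):
    """Return contig count, total size, N50 and GC percentage for a list of sequences."""
    lengths = sorted((len(seq) for seq in sequences), reverse=True)
    total = sum(lengths)

    # Walk from the longest contig down until half of the assembly is covered
    n50, running = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    acgt = sum(seq.upper().count(base) for seq in sequences for base in "ACGT")
    gc_percent = 100 * gc / acgt if acgt else 0.0

    return {"contigs": len(lengths), "total_size": total, "n50": n50, "gc_percent": gc_percent}


contigs = ["GCGCATAT", "GGCCAA", "ATAT", "GC"]   # hypothetical contigs
print(assembly_stats(contigs))
# {'contigs': 4, 'total_size': 20, 'n50': 6, 'gc_percent': 50.0}
```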

thonzenn — 2018-07-12 01:41:01 UTC

thonzenn — 2018-07-12 09:26:54 UTC

thonzenn — 2018-07-12 09:27:10 UTC