下一代測序 (NGS)/Velvet

Velvet 實踐：第 1 部分

本實踐將涵蓋以下內容

編譯 Velvet
單端
K-mer 長度
覆蓋率截止值
整個基因組序列作為輸入？？？

準備環境

首先確保您位於主目錄，方法是鍵入

cd

並透過鍵入以下內容絕對確保您在那裡

pwd

現在為本練習和其他兩個 Velvet 實踐建立子目錄。所有這些目錄都將作為整個課程名為 NGS 的目錄的子目錄建立。為此，您可以使用以下命令

mkdir -p NGS/velvet/{part1,part2,part3}

# The -p tells mkdir (make directory) not to worry if a parent directory is missing.
# That is, if a sub-directory cannot be made because its parent directory does not exist,
# just make the parent directory  first rather than reporting an error.
# The “one at a time” approach would be:
mkdir NGS
mkdir NGS/velvet
mkdir NGS/velvet/part1
mkdir NGS/velvet/part2
mkdir NGS/velvet/part3

建立目錄後，檢查結構並透過鍵入以下內容移動到準備進行第一個 Velvet 練習的目錄中

ls -R NGS;

cd NGS/velvet/part1; pwd;

下載並編譯 Velvet

您可以在以下位置找到最新版本的 Velvet

http://www.ebi.ac.uk/~zerbino/velvet/

您可以訪問此 URL 並下載最新版本的 Velvet，或者等效地，您可以鍵入以下內容，這將下載、解壓縮、檢查、編譯並執行您本地編譯的 Velvet 版本

cd ~/NGS/velvet/part1; pwd;

cp ~/NGS/Data/velvet_1.2.07.tgz . ;

tar xzf ~/NGS/Data/velvet_1.2.07.tgz;

ls -R; cd velvet_1.2.07;

make velveth velvetg;

./velveth'''

檢視您建立的可執行檔案。它們將透過以下命令顯示為綠色

ls --color=always;

The switch --color, instructs that files be coloured according to their type.
This is often the default. Here we are just being cautious. 
The =always is included as colouring is usually suppressed in scripts.
If you run this exercise using the script provided, just --color would not be enough.
--color=always insists on file colouring even from a script.

檢視命令生成的輸出，您將看到以下引數傳遞到編譯器中

“MAXKMERLENGTH=31” 和 “CATEGORIES=2”

這表示預設編譯設定為最大大小為 31 的 De Bruijn 圖 KMER，並允許最多 2 個讀取類別。您可以使用命令列引數覆蓋這些和其他預設配置選項。假設您想使用 3 個類別執行 KMER 長度為 41 的 Velvet，需要透過鍵入以下內容重新編譯 Velvet 來啟用此功能

make clean; make velveth velvetg MAXKMERLENGTH=41 CATEGORIES=3; ./velveth

Velvet 也可用於處理 SOLID 顏色空間資料。為此，您需要另一個 make 引數。使用以下命令清除上次編譯並嘗試以下引數

make clean; make MAXKMERLENGTH=41 CATEGORIES=3 color ./velveth_de

有關 Velvet 編譯和執行時引數的更多說明，請參閱 Velvet 手冊：https://github.com/dzerbino/velvet/wiki/Manual

單端讀取組裝

The data you will examine is from Staphylococcus aureus USA300 which has a genome of around 3MB.
The reads are  Illumina and are unpaired, also known as single-end library. 
Even though you have carefully installed velvet in your own  workspace, we will use a pre-installed version. 
The data needed for this section can be obtained from the Sequence Read Archive (SRA). 
For the following example use the run data SRR022825 and SRR022823 from the SRA Sample SRS004748.
The SRA experiment could be viewed by setting your browser to the URL:
	
http://www.ebi.ac.uk/ena/data/view/SRS004748

以下練習重點介紹使用單端讀取的 Velvet，可用引數如何影響組裝以及如何衡量和比較變化。首先，先回到您為本練習準備的目錄，建立一個新資料夾，為本部分命名一個合適的名稱，然後進入該資料夾。從網際網路下載檔案的命令將是

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022825/SRR022825.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022823/SRR022823.fastq.gz

或者，如果您在本地安裝了這些檔案，只需建立指向這些檔案的軟連結。繼續複製（或鍵入）

<syntax highlight="text> cd ~/NGS/velvet/part1 mkdir SRS004748 cd SRS004748 pwd ln -s ~/NGS/Data/SRR022825.fastq.gz . ln -s ~/NGS/Data/SRR022823.fastq.gz . ls -l </syntaxhighlight>

You are ready to process your data with velvet, which is a compressed fastq file. Velvet has two main components:
velveth -used to construct, from raw read data, a dataset organised in the fashion expected by the second component, velvetg.
velvetg -the core of velvet where the de Bruijn graph assembly is built and manipulated.
You can always get further information about the usage of both velvet programs by typing velvetg or velveth in your terminal.

現在使用以下選項為 SRR022825.fastq.gz 和 SRR022823.fastq.gz 中的讀取執行 velveth

- 25 的 De Bruijn 圖 k-mer

- 名為 run_25 的輸出目錄

velveth run_25 25 -fastq.gz -short SRR022825.fastq.gz SRR022823.fastq.gz

velveth 自行執行一段時間，最後在輸出目錄中生成一些檔案。進入輸出目錄 run_25，看看 velveth 到目前為止做了什麼。UNIX 命令 less 允許您檢視輸出檔案（按 q 退出）。以防您仍然需要提示

cd run_25; 
ls -l;
head Sequences;

現在向上移動一個目錄級別，並使用以下命令在您的輸出目錄上執行 velvetg

cd ..
time velvetg run_25

回到您的結果目錄以檢查 velvetg 的效果

cd run_25; ls -l;

為您：執行上面的命令後

Q1：您在 run_25 資料夾中看到了哪些額外檔案？
Q2：您認為它們可能代表什麼？
Q3：在 run_25 中的 Log 檔案中，N50 是多少？

N50 statistic: Broadly, it is the median (not average) of a sorted data set using the length of a set of sequences.
Usually it is the length of the contig whose length.
When added to the length of all longer contigs, makes a total greater that half the sum of the lengths of all contigs. 
Easy, but messy – a more formal definition can be found here:
	
http://www.broadinstitute.org/crd/wiki/index.php/N50

備份 contigs.fa 檔案，並使用以下命令計算 N50（以及 N25、N75）值

cp contigs.fa contigs.fa.0

您現在嘗試

gnx -min 100 -nx 25,50,75 contigs.fa

Q4. N50 值是否與 Log 檔案中儲存的值一致？
Q5. 如果沒有，您認為可能是什麼原因？

In order to improve our results, take a closer look at the standard options of velvetg by typing 'velvetg' without parameters.
For the moment focus on the two options -cov_cutoff and -exp_cov. 
Clearly -cov_cutoff will allow you to exclude contigs for which the kmer coverage is low, implying unacceptably poor quality. 
The -exp_cov switch is used to give velvetg an idea of the coverage to expect. 
If the expected coverage of any contig is substantially in excess of the suggested expected value,
maybe this would indicate a repeat. 
For further details of how to choose the parameters, go to 'Choice of a coverage cutoff': 	
  http://wiki.github.com/dzerbino/velvet/

簡而言之，每個重疊群的 Kmer 覆蓋率（以及更多資訊）都儲存在 stats.txt 檔案中，可與 R 一起使用以視覺化覆蓋率分佈。檢視 stats.txt 檔案，啟動 R，載入並使用以下命令視覺化資料

R
library(plotrix) 
data <- read.table("stats.txt", header=TRUE)
x11()
weighted.hist(data$short1_cov, data$lgth, breaks=0:50)

加權直方圖是視覺化覆蓋率資訊的更好方法，因為存在噪音（許多非常短的重疊群）。您可以在下面看到示例輸出

選擇預期覆蓋率和覆蓋率截止值後，您可以透過鍵入以下內容退出 R

q()
n

加權直方圖表明，預期覆蓋率約為 14，而低於 6 的所有內容可能都是噪音。

覆蓋率也約為 20、30 和大於 50，這可能是汙染或重複（取決於資料集），但目前您不必擔心。

要檢視改進，請首先使用 -cov_cutoff 6 重新執行 velvetg，並檢查 N50 後僅使用/新增 -exp_cov 14 到命令列選項。

還要保留 contigs 檔案的副本以供比較

 cd ~/NGS/velvet/part1/SRS004748
time velvetg run_25 -cov_cutoff 6
# Make a copy of the run
cp run_25/contigs.fa run_25/contigs.fa.1

time velvetg run_25 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.2

time velvetg run_25 -cov_cutoff 6 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.3

Q6. 沒有引數時的 N50 是多少？
Q7. -cov_cutoff 6 時的 N50 是多少？
Q8. -exp_cov 14 時的 N50 是多少？
Q9. -cov_cutoff 6 -exp_cov 14 時的 N50 是多少？
Q10 您是否注意到 velvetg 執行時間有所變化？如果是，您能解釋一下原因嗎？

您一直在使用給定的 -exp_cov 和 -cov_cutoff 引數執行 velvetg。現在嘗試使用不同的截止值、預期引數進行實驗，並探索其他設定（例如 -max_coverage、-max_branch_length、-unused_reads、-amos_file、-read_trkg 或檢視 velvetg 幫助選單）。

In particular, look at the -amos_file parameter which instructs velvetg to create a version of the assembly that can be
processed and viewed with a program called AMOS Hawkeye. 
Another  program, called tablet, can also understand and display the AMOS file. 
For now, we will take a look at tablet.

僅使用 -cov_cutoff 6 執行 velvetg，但請求 amos 檔案

velvetg run_25 -cov_cutoff 6 -amos_file yes

現在使用 tablet 快速檢視組裝

tablet run_25/velvet_asm.afg &

velvet_asm.afg 是 velvetg 對包含 -amos_file yes 開關的響應而生成的。tablet 將佔用相當多的記憶體，您可以安全地忽略 tablet 的抱怨。檔案載入後，選擇其中一個較長的重疊群。注意缺少讀取資訊和重疊群顯示的一維性。在您已經看到足夠的資訊後，關閉 tablet。現在重新執行 velvetg，新增 -exp_cov 7 額外引數，無需儲存任何檔案，因為 amos 檔案需要更改。現在重新執行 velvetg，新增 -exp_cov 14 額外引數，無需儲存任何檔案，因為 AMOS 檔案需要更改。

velvetg run_25 -cov_cutoff 6 -exp_cov 14 -amos_file yes

再次使用 tablet 檢視 amos 檔案

tablet run_25/velvet_asm.afg &

與之前一樣，選擇一個較長的重疊群，並探索顯示選項，然後關閉 tablet。為您

Q11. 您認為為什麼第二次使用 tablet 時只有讀取資訊？
Q12. 您可能已經注意到，-exp_cov 14 開啟時，velvetg 執行時間更長。這有道理嗎？

如果你想進一步探索天鵝絨的行為，你可以嘗試以下操作。

減少序列覆蓋率，只選擇一個輸入檔案進行組裝，例如SRR022825.fastq.gz。

增加序列覆蓋率，從同一個ENA樣本SRS004748下載更多檔案 (http://www.ebi.ac.uk/ena/data/view/SRS004748）。

針對你：N50 如何變化

Q13. 減少覆蓋率？
Q14. 增加覆蓋率？