生物資訊學資料管理/SQL練習
針對以下問題,我們將考慮以下關於微陣列基因表達資料的表。您的任務是將每個給定的查詢用 SQL 表達出來。為了方便起見,這些表中提供了示例資料,但請注意,它們只是示例資料。您的查詢必須適用於所有潛在的資料,而不僅僅是這裡所示的示例資料。
Genes gid | name | organism | annotation -----+---------+----------+---------------------------- g1 | YLR180C | yeast | hypothetical protein g2 | YLR181D | yeast | response to desiccation g3 | sp15 | yeast | drought stress responsive g4 | pdp77 | pine | putative stress responsive g5 | hsp70 | pine | heat shock protein g6 | hsp90 | pine | heat shock protein
Expression gid | experimentid | level | significance -----+--------------+-------+-------------- g1 | exp12 | 3.5 | 1 g2 | exp23 | -3 | 1 g3 | exp12 | 1 | 2 g3 | exp13 | -1.5 | 2 g3 | exp23 | 1.7 | 4 g4 | exp12 | 1.5 | 2 g4 | exp13 | 1.5 | 2 g4 | exp23 | 1.5 | 2 g4 | exp6 | 1.5 | 2 g5 | exp6 | 2 | 1 g5 | exp13 | 2.5 | 2 g6 | exp6 | -3.86 | 3
Experiments experimentid | name | whoperformed | date --------------+--------------------+---------------+------------ exp12 | Systematic Torture | Prof. Pain | 2004-06-02 exp23 | Heaped Abuse | Tommy Student | 2004-06-03 exp13 | Salt Stress | Gasch | 1998-07-04 exp6 | Sorbitol Exposure | Gasch | 1999-07-05
Membership gid | category -----+-------------------- g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding
GOTree
category | parent_category
--------------------+--------------------
antioxidant | molecular function
binding | molecular function
glutathione | antioxidant
glycine binding | amino acid binding
amino acid binding | binding
查詢 Pain 教授在 2004 年 1 月 1 日之後進行的實驗名稱。
SELECT name
FROM Experiments
WHERE whoperformed = 'Prof. Pain'
AND date > '2004-01-01';
查詢在某些實驗中表達量至少為兩倍或更高的基因名稱,或在某些實驗中表達量至少為兩倍或更低的基因名稱。將它們與其生物體一起列出,以兩列格式顯示。
SELECT Genes.gid, name, level, significance
FROM Expression, Genes
WHERE Expression.gid = Genes.gid
AND significance >= 1.0
AND (level >= 2.0 OR level <= -2.0);
'甘氨酸結合' 的祖先類別是什麼?
SELECT Parents.parent_category
FROM GOTree as Children, GOTree as Parents
WHERE Children.category = 'glycine binding'
AND Children.parent_category = Parents.category;
查詢在某個 Gasch 實驗之前進行的實驗名稱。
SELECT E1.name
FROM Experiments AS E1, Experiments AS E2
WHERE E1.date < E2.date
AND E2.whoperformed = 'Gasch';
SELECT name
FROM Experiments
WHERE Experiments.date < (
SELECT MAX(date)
FROM Experiments
WHERE whoperformed = 'Gasch'
);
查詢在至少兩個實驗中表達量超過 0.5 倍(顯著性為 1.0 或更高)的松樹基因名稱。
首先,我們必須找到基因上調且顯著的實驗。
CREATE VIEW Upregulated AS
SELECT gid, experimentid
FROM Expression
WHERE significance >= 1.0
AND level >= 0.5;
接下來,我們必須確定在至少兩個實驗中上調的基因。我們透過獲取上調基因的乘積並選擇基因 ID 相同但實驗 ID 不同的行來做到這一點。
CREATE VIEW UpInTwoOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid;
最後,我們確定這些基因中哪些來自松樹,並投影它們的名稱。
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
我們也可以在一個查詢中完成所有這些步驟。
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine';
作為另一種方法,我們可以使用子查詢來找到答案。關鍵是要建立一個 **相關子查詢**,其中子查詢依賴於父查詢的某些屬性(在本例中為基因 ID)。請注意,我們仍然需要使用上調上面建立的檢視以減少程式碼冗餘。
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
);
我們可以使用某些資料庫系統(如 PostgreSQL 和 MySQL)的擴充套件功能,透過以下方式以更方便的方式執行這些查詢GROUP BY和COUNT.
CREATE VIEW UpInTwoOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 1;
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
查詢在至少三個實驗中表達量上調 0.5 倍或更高(顯著性為 1.0 或更高)的松樹基因名稱。
類似於問題 5 的答案。這裡的關鍵是,雖然等式評估是可傳遞的,但不等式評估不是,因此必須涵蓋每種情況。
CREATE VIEW UpInThreeOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2, Upregulated as U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
或者
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
我們需要在原始的相關子查詢中構建另一個相關子查詢才能使它工作。
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
只需調整計數評估。
CREATE VIEW UpInThreeOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 2;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
找出在恰好兩個實驗中上調 0.5 倍或更多(顯著性為 1.0 或更多)的松樹基因的名稱。
這裡的關鍵是識別,將上調於兩個或更多個實驗的基因集減去上調於三個或更多個實驗的基因集,即可得到在恰好兩個實驗中上調的基因集。因此,我們的答案是問題 5 的答案減去問題 6 的答案。
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
或者
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine'
EXCEPT
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
)
EXCEPT
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
找出在所有記錄的實驗中表現出陽性表達的基因。
SELECT Expression.gid, level from Expression WHERE level>1.0;
在實驗 exp23 中表達最正面的基因的名稱是什麼?假設最小顯著性為 1.0。
以下是一種編寫 SQL 查詢以查詢在實驗 exp23 中表達最正面的基因名稱的方法
SELECT Genes.name FROM Genes JOIN Expression ON Genes.gid = Expression.gid WHERE Expression.experimentid = 'exp23' AND Expression.level > 0 AND Expression.significance >= 1 ORDER BY Expression.level DESC LIMIT 1;
此查詢從 "Genes" 表中選擇 "name" 列,並根據 "gid" 列將其與 "Expression" 表聯接。然後,它根據 "WHERE" 子句中提供的條件過濾結果。WHERE 子句中的條件檢查實驗 ID 是否為 'exp23',水平是否大於 0,顯著性是否大於或等於 1。這將返回 "Genes" 表中所有在實驗 exp23 中以至少 1.0 的顯著性水平正向表達的基因的行。然後,查詢按水平降序對結果進行排序,並將結果限制為 1,這將給出表達水平最高的基因。
此查詢將返回在實驗 exp23 中以至少 1.0 的顯著性水平表達最正面的基因的名稱。
這個問題建立在問題 8 的基礎上;表達“第二正面”的基因的名稱是什麼?再次假設最小顯著性為 1.0。
哪些基因在“Experiments”表中列出的所有實驗中都表現出陽性表達?對顯著性水平沒有限制。
準備一個基因表,其中包含它們的註釋以及任何在其中表達最高或最低(任何顯著性水平)的實驗。包含第四列來表明它們是最高還是最低。