LI Xiong
(School of Software, East China Jiaotong University, Nanchang, Jiangxi 330013, China)
Methods for mining omics data of complex diseases
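Analyses of individual omics data types have uncovered some of the genetic and environmental factors genuinely associated with tumors, but these findings may be only the tip of the iceberg of the underlying complex genetic mechanisms. A key reason for this limitation may be that disease models are oversimplified, that is, they ignore the relationships among multi-level omics data. We argue that, building on a deeper understanding of genome-wide SNP data, further integrating multi-source omics data and clarifying phenomena such as epistasis and heterogeneity will improve tumor risk assessment and help realize the goal of personalized medicine. This paper compares and analyzes existing omics data mining methods for complex diseases from the perspectives of SNP data analysis and multi-source omics data analysis.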
SNP; genome-wide association study; systems biology; machine learning
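Complex diseases are a class of threats to human health that arise from multiple factors and whose mechanisms are not yet understood. Common examples include mental disorders, multiple sclerosis, and tumors, with tumors among the most common. According to the latest statistics of the Chinese Cancer Registry Annual Report, about six people in China are diagnosed with cancer every minute, and patients are trending younger; tumors therefore pose a serious threat to the quality of life of the population. A single nucleotide polymorphism (SNP) is a genetic variation at the DNA sequence level that can substantially alter biomolecules such as regulatory elements, genes, and protein structures, thereby increasing an individual's tumor risk. Researchers worldwide have carried out genome-wide association studies (GWAS) of various tumors, and a number of important SNPs have been reliably identified and recorded in the GWAS Catalog[1]. Further analysis, however, has shown that traditional GWAS suffers from poor reproducibility, low interpretability, and missing heritability. The key causes are an insufficient understanding of the interactions among susceptibility loci (epistasis) and the examination of SNP data in isolation. From a computer science perspective, they can be roughly summarized in three points. First, genome-wide SNP data contain millions of loci, which poses great challenges to the computational methods and hardware resources of bioinformatics and makes in-depth mining difficult[2]. Second, the lack of a systematic and complete understanding of complex diseases such as tumors makes their definitions vague or even ambiguous, so that case samples exhibit several different genetic architectures (heterogeneity), which partly masks the associations between genetic variants and different tumor subtypes[3]. Third, tumor initiation and progression involve interactions among many kinds of biomolecules; analyzing only one level of omics data deviates further from the true disease model, makes it hard to find a true and complete set of risk factors, and leads to missing heritability[4].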
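High-throughput data generation technologies have greatly enriched genome-wide SNP, epigenomic, transcriptomic, metabolomic, and other omics data, which favors the application of big-data-driven research to tumor studies. The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) projects each provide multi-level omics data for many cancers, laying a solid data foundation for systematically exploring the interaction mechanisms among the multi-source omics data underlying tumors. In addition, the team of Chen Luonan[5,6] compared classification and prediction methods for complex diseases based on molecular biomarkers and network biomarkers, and pointed out that network-biomarker-based analyses of interactions among multiple molecules and multiple data sources reflect complex systems more systematically and completely.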
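In-depth study of a single type of omics data not only helps uncover new genetic variants but also provides a foundation for subsequent multi-source data fusion. The following summarizes key steps of SNP data analysis, taking dimensionality reduction and case-control studies as examples.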
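Dimensionality reduction: There are millions of SNPs or more across the genome, whereas the number of samples available for a disease under study is usually very limited; such high-dimensional, small-sample data lead to problems such as overfitting. Although cross-validation and permutation testing can mitigate the problem to some extent, reducing dimensionality before model training both extracts more representative features and greatly improves the efficiency of subsequent analysis while markedly reducing the cost of multiple hypothesis testing[7]. For example, an exhaustive two-locus interaction analysis of genome-wide SNP data with 5 million loci requires statistical tests on about 1.25 × 10^13 SNP-SNP pairs, and a three-locus analysis raises this to about 2.08 × 10^19, so the computational cost of the association study (storage, running time, and so on) grows exponentially with the interaction order. Assuming a device that performs one million tests per second, processing all two-locus and all three-locus interactions would take about 3,500 hours and about 5.8 × 10^9 hours, respectively; even with GPUs or similar hardware the cost remains prohibitive. Effective dimensionality reduction is therefore of practical value[8-15].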
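The arithmetic behind this example can be checked with a short script. This is only an illustrative sketch: the function name `exhaustive_scan_cost` is hypothetical, and the throughput of one million tests per second is the assumption stated in the text. It reproduces the order of magnitude of the figures quoted above.

```python
from math import comb


def exhaustive_scan_cost(n_snps, order, tests_per_second=1_000_000):
    """Return (number of order-way SNP combinations, hours needed to test them all)."""
    n_tests = comb(n_snps, order)  # unordered combinations of `order` loci
    hours = n_tests / tests_per_second / 3600
    return n_tests, hours


for order in (2, 3):
    n_tests, hours = exhaustive_scan_cost(5_000_000, order)
    print(f"order={order}: {n_tests:.3e} tests, {hours:.3e} hours")
```

Moving from two to three loci multiplies the number of tests by roughly n/3, which is why each additional interaction order quickly becomes infeasible without first shrinking n.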
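Susceptibility locus identification: To control errors from multiple hypothesis testing, traditional single-locus GWAS sets a stringent significance level such as p < 5 × 10^-8, so that association signals of weak or even moderate effect are ignored, which contributes to the missing heritability of complex diseases and to poorly reproducible results[16]. Studies have pointed out that ignoring epistasis among causal factors oversimplifies the disease model and leads to missing heritability[17].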
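As a rough illustration of how such a threshold discards moderate signals, the sketch below applies a Bonferroni correction. Reading 5 × 10^-8 as 0.05 divided by one million independent tests is a common convention and an assumption here, not something stated in the text; the SNP names and p-values are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumption: one million effectively independent single-locus tests.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical single-locus p-values.
p_values = {"snp_strong": 3e-9, "snp_moderate": 2e-6, "snp_weak": 4e-3}
significant = [name for name, p in p_values.items() if p < threshold]
print(threshold, significant)
```

Only the strongest locus survives; the moderate one, which would pass a nominal 0.05 level by several orders of magnitude, is dropped. This is the kind of discarded signal the paragraph above describes.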
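Genome-wide methods for identifying susceptibility loci have the advantage of analyzing every locus in the data set, which avoids overlooking true susceptibility loci through manual selection. In high-order epistasis analysis, however, they face a combinatorial explosion and hence enormous computational cost. Genome-wide epistasis detection can be further divided into four classes: exhaustive search, stochastic search, heuristic search, and machine learning[18]. Of these, exhaustive search examines the largest space of epistatic combinations.

Multifactor dimensionality reduction (MDR)[19] is one of the most representative exhaustive methods. It partitions multi-locus genotype combinations into a high-risk class and a low-risk class, converting a high-dimensional genotype prediction problem into a one-dimensional one and thereby enabling high-order epistasis analysis. Because MDR is suitable only for small data sets, Yang et al.[20] proposed a fast MDR-based computational framework that improves efficiency. To make exhaustive strategies more widely applicable, Hemani et al.[21] and Kam-Thong et al.[22] each proposed GPU-based exhaustive epistasis search methods, and the work in [23-25] improved efficiency using high-performance computing architectures such as cloud computing. These methods, however, to some extent ignore heterogeneity or do not bring other levels of biological data into the analysis.

Stochastic search based on random sampling makes epistasis detection more efficient. BEAM[26], for example, combines a Bayesian locus-partition model with Markov chain Monte Carlo sampling to maximize the posterior probability of the model; methods based on random forests[27] and the ant-colony-based AntEpiSeeker[28] have also improved the efficiency of searching epistatic combinations. As the combination space grows rapidly, however, stochastic strategies become much less stable: under identical experimental conditions they can return quite different epistatic combinations, which reduces their practical value. Jing and Shen[18] therefore proposed MACOED, a multi-objective optimization method based on complementary criteria that uses ant colony optimization to find non-dominated Pareto solutions, which improves robustness.

Heuristic search uses additional information to guide the epistasis search, avoiding both randomness and exhaustive enumeration. CART, for example, optimizes the classification performance of SNP subsets according to measures such as information entropy and splits SNPs iteratively until a classification tree satisfying the stopping conditions is produced[29]; it is poorly suited to pure epistasis. MSCD[30] searches the space of high-order epistatic combinations heuristically on the basis of energy distribution differences and can effectively identify more significant high-order epistasis. The effectiveness of this class of methods depends on how relevant the heuristic information is to the research objective.

Machine-learning-based epistasis detection does not require prior knowledge of the relationship between genotype and phenotype; instead, a learning model is trained to capture the complex genotype-phenotype relationship. Zhang et al.[31], for example, used a functional regression model to examine collectively the epistasis between all pairs of loci from two genomic regions; its main advantage is that it handles epistasis between rare SNPs effectively. A machine learning model, however, is like a black box: it is hard for researchers to understand the underlying biology, and little is learned about the relative importance of the susceptibility loci[32-39].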
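The core idea of MDR described above, collapsing multi-locus genotype cells into a single high-risk/low-risk variable and searching all locus combinations exhaustively, can be sketched as follows. This is a simplified illustration, not the published algorithm: it omits the cross-validation that MDR normally uses, scores each combination by balanced accuracy on the full data, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mdr_score(genotypes, labels, snps):
    """Balanced accuracy of the high/low-risk rule built on one SNP subset."""
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, labels):
        cell = tuple(row[s] for s in snps)  # multi-locus genotype of this sample
        (cases if y == 1 else controls)[cell] += 1
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    ratio = n_cases / n_controls  # overall case:control ratio used as threshold
    # A cell is high-risk when its own case:control ratio exceeds the overall one.
    high_risk = {c for c in set(cases) | set(controls)
                 if cases[c] > ratio * controls[c]}
    sensitivity = sum(cases[c] for c in high_risk) / n_cases
    specificity = sum(n for c, n in controls.items() if c not in high_risk) / n_controls
    return 0.5 * (sensitivity + specificity)


def best_combination(genotypes, labels, order=2):
    """Exhaustively score every SNP subset of the given order."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), order),
               key=lambda snps: mdr_score(genotypes, labels, snps))


# Toy data (hypothetical): disease status is the XOR of SNP 0 and SNP 1,
# a pure epistatic effect with no single-locus signal; SNP 2 is noise.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in genotypes]
print(best_combination(genotypes, labels))
```

On this toy data the pair (0, 1) separates cases from controls perfectly, while any pair involving the noise locus scores no better than chance. The call to `combinations` is also where the combinatorial explosion discussed above enters.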
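Current methods for fusing multi-source omics data fall roughly into two classes: multi-stage fusion and multi-feature fusion[7]. In multi-stage fusion, each stage builds a model from only two levels of omics data, and the multi-source data are examined hierarchically, stage by stage. Multi-feature fusion instead merges the features of all data sources simultaneously to build a model of the association between multi-source biological data and complex diseases.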
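Multi-stage fusion is a filter-like approach: the large set of genetic variants in the initial multi-source data is filtered level by level and stage by stage, so that variants unrelated to the trait under study are removed. The filters are usually based on statistical significance values or prior knowledge. Holzinger and Ritchie[40] proposed a three-stage method for integrating data such as gene expression profiles and SNPs. It first removes loci that are not significantly associated with the disease, using a genome-wide significance threshold; it then examines the relationship between the retained loci and gene expression values to identify eQTLs; finally, it examines the relationship between the susceptibility genes or loci and the trait under study. eQTLs associated with the expression of susceptibility genes have also been used in drug analysis[41]. For the eQTL identification step of the three-stage method, several studies have made improvements by refining the statistical tests[42], examining the regulatory relationships between SNPs and genes in more depth[43], and speeding up the analysis[44]. The effectiveness of multi-stage fusion thus depends on the reliability of the statistical tests and prior information; it is limited by and biased toward prior knowledge, and to some extent it ignores the interactions among the multi-source data[45,46].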
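The three-stage filtering logic can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the method of [40]: simple Pearson-correlation cutoffs stand in for the statistical tests a real pipeline would use, and the function names, cutoffs, SNP and gene identifiers, and data are all hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def three_stage_filter(snp_pvalues, snp_genotypes, gene_expression, trait,
                       p_cutoff=5e-8, r_cutoff=0.8):
    # Stage 1: drop SNPs not significantly associated with the disease.
    kept = [s for s, p in snp_pvalues.items() if p < p_cutoff]
    # Stage 2: pair surviving SNPs with genes whose expression they track (candidate eQTLs).
    eqtls = [(s, g) for s in kept for g, expr in gene_expression.items()
             if abs(pearson(snp_genotypes[s], expr)) >= r_cutoff]
    # Stage 3: keep pairs whose gene expression is also related to the trait.
    return [(s, g) for s, g in eqtls
            if abs(pearson(gene_expression[g], trait)) >= r_cutoff]


# Hypothetical data for six samples.
snp_genotypes = {"rs_a": [0, 0, 1, 1, 2, 2],
                 "rs_b": [0, 1, 2, 0, 1, 2],
                 "rs_c": [2, 2, 1, 1, 0, 0]}
snp_pvalues = {"rs_a": 1e-9, "rs_b": 1e-9, "rs_c": 0.2}
gene_expression = {"gene_x": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
                   "gene_y": [5.0, 1.0, 4.0, 2.0, 3.0, 6.0]}
trait = [0.1, 0.2, 1.0, 1.1, 2.0, 2.2]
print(three_stage_filter(snp_pvalues, snp_genotypes, gene_expression, trait))
```

In this toy example `rs_c` is removed at stage 1 even though its genotypes mirror those of `rs_a`, so everything downstream depends on the reliability of the first filter, which is the limitation noted above.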
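Multi-feature fusion is divided, according to how feature information is integrated, into numerical (concatenation-based) integration, transformation-based integration, and model-based integration[7,47,48]. Kim et al.[49] used transformation-based integration: the features of each data type are first transformed into subgraphs, which are then fused using the relationships among the subgraphs. The main advantages of this class of methods are that the properties specific to each original data type are preserved and that measurement scales need not be unified across data types, although the transformation may lose some information. Model-based integration first trains separate models on each level of data and then integrates the models; common approaches include grammatical evolution neural networks[50], voting algorithms[51], and deep learning models[52]. It is well suited to heterogeneous data, but overlap among the models may lead to bias or overfitting.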
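Model-based integration by voting can be illustrated with a small sketch. It is not any of the cited methods: a nearest-centroid classifier is used as a stand-in per-layer model, the three layers (SNP, expression, methylation) and all values are hypothetical, and ties are not handled because the example uses an odd number of models with binary labels.

```python
from collections import Counter


def train_centroid_model(samples, labels):
    """Fit a nearest-centroid classifier on one data layer and return its predictor."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

    return predict


def vote(models, layers_for_sample):
    """Majority vote over the per-layer predictions for one sample."""
    ballots = [model(x) for model, x in zip(models, layers_for_sample)]
    return Counter(ballots).most_common(1)[0][0]


# Hypothetical training data: four samples, three omics layers.
labels = [0, 0, 1, 1]
snp_layer = [[0, 0], [0, 1], [2, 2], [2, 1]]
expr_layer = [[1.0], [1.2], [3.0], [2.8]]
meth_layer = [[0.9], [0.2], [0.8], [0.1]]  # deliberately uninformative layer

models = [train_centroid_model(layer, labels)
          for layer in (snp_layer, expr_layer, meth_layer)]
new_sample = ([2, 2], [2.9], [0.85])
print(vote(models, new_sample))
```

Each layer keeps its own scale and model, which is why this style suits heterogeneous data; the uninformative methylation model is simply outvoted here, but correlated errors across models would not cancel in the same way.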
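We believe that understanding the multi-level biomolecular interaction mechanisms behind complex diseases and extending the use of cloud computing in big-data mining for complex diseases span several research fields and have theoretical significance for explaining how complex diseases arise. Identifying molecular network biomarkers genuinely associated with complex diseases and building risk assessment models, in turn, has practical value. Therefore, building on a deeper understanding of genome-wide SNP data, further integrating multi-source omics data and clarifying phenomena such as epistasis and heterogeneity will improve risk assessment for complex diseases and help realize the goal of personalized medicine.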
[1]Welter D,MacArthur J,Morales J,et al.The NHGRI GWAS Catalog,a curated resource of SNP-trait associations[J].Nucleic Acids Research,2014,42(Database issue):D1001-D1006.
[2]Xiong H Y,Alipanahi B,Lee L J,et al.The human splicing code reveals new insights into the genetic determinants of disease[J].Science,2015,347(6218):1254806.
[3]Urbanowicz R J,Andrew A S,Karagas M R,et al.Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome:a learning classifier system approach[J].Journal of the American Medical Informatics Association,2013,20(4):603-612.
[4]Li P,Guo M,Wang C,et al.An overview of SNP interactions in genome-wide association studies[J].Briefings in functional genomics,2014,14(2):143-55.
[5]Zeng T,Zhang W W,Yu X T,et al.Edge biomarkers for classification and prediction of phenotypes[J].Science China Life Sciences,2014,57(11):1103-1114.
[6]Liu R,Wang X,Aihara K,et al.Early diagnosis of complex diseases by molecular biomarkers,network biomarkers,and dynamical network biomarkers[J].Medicinal research reviews,2014,34(3):455-478.
[7]Ritchie M D,Holzinger E R,Li R,et al.Methods of integrating data to uncover genotype-phenotype interactions[J].Nature Reviews Genetics,2015,16(2):85-97.
[8]Patil N,Berno A J,Hinds D A,et al.Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21[J].Science,2001,294(5547):1719-1723.
[9]Ting C K,Lin W T,Huang Y T.Multi-objective tag SNPs selection using evolutionary algorithms[J].Bioinformatics,2010,26(11):1446-1452.
[10]Liao B,Li X,Zhu W,et al.A novel method to select informative SNPs and their application in genetic association studies[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(5):1529-1534.
[11]Liao B,Li X,Cai L,et al.A Hierarchical Clustering Method of Selecting Kernel SNP to Unify Informative SNP and Tag SNP[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(1):113-122.
[12]Li X,Liao B,Cai L,et al.Informative SNPs selection based on two-locus and multilocus linkage disequilibrium:Criteria of max-correlation and min-redundancy[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2013,10(3):688-695.
[13]Hung C L,Chen W P,Hua G J,et al.Cloud computing-based tag SNP selection algorithm for Human Genome Data[J].International journal of molecular sciences,2015,16(1):1096-1110.
[14]Wu C,Cui Y.Boosting signals in gene-based association studies via efficient SNP selection[J].Briefings in bioinformatics,2014,15(2):279-291.
[15]Mooney M,Wilmot B,McWeeney S.The GA and the GWAS:using genetic algorithms to search for multilocus associations[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(3):899-910.
[16]Jia P,Zhao Z.Network-assisted analysis to prioritize GWAS results:principles,methods and perspectives[J].Human genetics,2014,133(2):125-138.
[17]Gibson G.Hints of hidden heritability in GWAS[J].Nature genetics,2010,42(7):558-560.
[18]Jing P J,Shen H B.MACOED:a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies[J].Bioinformatics,2015,31(5):634.
[19]Ritchie M D,Hahn L W,Roodi N,et al.Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer[J].The American Journal of Human Genetics,2001,69(1):138-147.
[20]Yang C H,Lin Y D,Yang C S,et al.An efficiency analysis of high-order combinations of gene-gene interactions using multifactor-dimensionality reduction[J].BMC Genomics,2015,16(1):489.
[21]Hemani G,Theocharidis A,Wei W,et al.EpiGPU:exhaustive pairwise epistasis scans parallelized on consumer level graphics cards[J].Bioinformatics,2011,27(11):1462-1465.
[22]Kam-Thong T,Pütz B,Karbalai N,et al.Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs[J].Bioinformatics,2011,27(13):i214-i221.
[23]Sluga D,Curk T,Zupan B,et al.Heterogeneous computing architecture for fast detection of SNP-SNP interactions[J].BMC bioinformatics,2014,15(1):216.
[24]Kässens J C,Wienbrandt L,González-Domínguez J,et al.High-speed exhaustive 3-locus interaction epistasis analysis on FPGAs[J].Journal of Computational Science,2015,9:131-136.
[25]Guo X,Meng Y,Yu N,et al.Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering[J].BMC bioinformatics,2014,15(1):102.
[26]Zhang Y,Liu J S.Bayesian inference of epistatic interactions in case-control studies[J].Nature genetics,2007,39(9):1167-1173.
[27]Mao W,Lee J.A combinatorial analysis of genetic data for Crohn’s disease[C]//Bioinformatics and Biomedical Engineering,2007.ICBBE 2007.The 1st International Conference on.IEEE,2007:1031-1034.
[28]Wang Y,Liu X,Robbins K,et al.AntEpiSeeker:detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm[J].BMC research notes,2010,3(1):117.
[29]Chattopadhyay A S,Hsiao C L,Chang C C,et al.Summarizing techniques that combine three non-parametric scores to detect disease-associated 2-way SNP-SNP interactions[J].Gene,2014,533(1):304-312.
[30]Ding X,Wang J,Zelikovsky A,et al.Searching high-order SNP combinations for complex diseases based on energy distribution difference[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(3):695-704.
[31]Zhang F,Boerwinkle E,Xiong M.Epistasis analysis for quantitative traits by functional regression model[J].Genome research,2014,24(6):989-998.
[32]Kam-Thong T,Azencott C A,Cayton L,et al.GLIDE:GPU-based linear regression for detection of epistasis[J].Human heredity,2012,73(4):220-236.
[33]Beam A L,Motsinger-Reif A,Doyle J.Bayesian neural networks for detecting epistasis in genetic association studies[J].BMC bioinformatics,2014,15(1):368.
[34]Lee I,Blom U M,Wang P I,et al.Prioritizing candidate disease genes by network-based boosting of genome-wide association data[J].Genome research,2011,21(7):1109-1121.
[35]Chen L S,Hutter C M,Potter J D,et al.Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data[J].The American Journal of Human Genetics,2010,86(6):860-871.
[36]Braun R,Buetow K.Pathways of distinction analysis:a new technique for multi-SNP analysis of GWAS data[J].PLoS Genetics,2011,7(6):e1002101.
[37]Askland K,Read C,O’Connell C,et al.Ion channels and schizophrenia:a gene set-based analytic approach to GWAS data for biological hypothesis testing[J].Human genetics,2012,131(3):373-391.
[38]Yang C H,Lin Y D,Chuang L Y,et al.Evaluation of breast cancer susceptibility using improved genetic algorithms to generate genotype SNP barcodes[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB),2013,10(2):361-371.
[39]Li X,Liao B,Chen H.A new technique for generating pathogenic barcodes in breast cancer susceptibility analysis[J].Journal of theoretical biology,2015,366:84-90.
[40]Holzinger E R,Ritchie M D.Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies[J].Pharmacogenomics,2012,13(2):213-222.
[41]Huang R S,Duan S,Bleibel W K,et al.A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity[J].Proceedings of the National Academy of Sciences,2007,104(23):9758-9763.
[42]Huang Y T,VanderWeele T J,Lin X.Joint analysis of SNP and gene expression data in genetic association studies of complex diseases[J].The annals of applied statistics,2014,8(1):352.
[43]Kang M,Zhang C,Chun H W,et al.eQTL epistasis:detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways[J].Bioinformatics,2015,31(5):656-664.
[44]Shabalin A A.Matrix eQTL:ultra fast eQTL analysis via large matrix operations[J].Bioinformatics,2012,28(10):1353-1358.
[45]Giacalone G,Clarelli F,Osiceanu A M,et al.Analysis of genes,pathways and networks involved in disease severity and age at onset in primary-progressive multiple sclerosis[J].Multiple Sclerosis,2015,21(11).
[46]Wang J G.Research on molecular network models of complex diseases[J].Scientia Sinica Mathematica (Chinese edition),2014,44(4):317-328.
[47]Fridley B L,Lund S,Jenkins G D,et al.A Bayesian integrative genomic model for pathway analysis of complex traits[J].Genetic epidemiology,2012,36(4):352-359.
[48]Mankoo P K,Shen R,Schultz N,et al.Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles[J].PLoS One,2011,6(11):e24709.
[49]Kim D,Shin H,Song Y S,et al.Synergistic effect of different levels of genomic data for cancer clinical outcome prediction[J].Journal of biomedical informatics,2012,45(6):1191-1198.
[50]Holzinger E R,Dudek S M,Frase A T,et al.ATHENA:the analysis tool for heritable and environmental network associations[J].Bioinformatics,2014,30(5):698-705.
[51]Drăghici S,Potter R B.Predicting HIV drug resistance with neural networks[J].Bioinformatics,2003,19(1):98-107.
[52]Liang M,Li Z,Chen T,et al.Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2014,12(4):928-937.
Methods for mining omics data of complex diseases
LI Xiong
(School of Software,East China Jiaotong University,Nanchang 330013,China)
Studies of single types of omics data have identified some of the genetic and environmental factors that are truly associated with tumors, but these may be only the tip of the iceberg of the complex genetic mechanisms behind the disease. A key reason for this limitation may be that the disease model is oversimplified, namely, that it ignores the interrelationships among multi-level omics data. We think that deepening the understanding of genome-wide SNP data, further integrating multi-source omics data, and better understanding epistasis, heterogeneity, and related phenomena will enhance cancer risk assessment and is conducive to realizing the goal of personalized medicine. This paper compares existing omics data mining methods for complex diseases from the perspectives of SNP data analysis and multi-source omics data analysis.
SNP;genome-wide association study;systems biology;machine learning
1672-7010(2017)02-0012-07
2017-02-01
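Supported by the National Natural Science Foundation of China (61602174)
LI Xiong (1985-), from Shaoyang, Hunan; lecturer, Ph.D.; research interests: data mining and biological information processing. E-mail:lx_hncs@163.com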
TP311
A
Journal of Shaoyang University (Natural Science Edition), 2017, No. 2