Two Paradoxes in Linear Regression Analysis

2016-12-09 08:30:48 Ge FENG, Jing PENG, Dongke TU, Julia ZHENG, Changyong FENG
Shanghai Archives of Psychiatry, 2016, Issue 6
Keywords: medical journals; biomedicine; paradox

Biostatistics in psychiatry (36)

Ge FENG¹, Jing PENG², Dongke TU⁴, Julia Z. ZHENG⁵, Changyong FENG²,³*

Key words: forward selection; backward elimination; univariate regression; multiple regression

1. Introduction

Linear regression is the most widely used statistical model in data analysis.[1] The wide availability and ease of use of statistical software packages such as SAS, SPSS and R make linear regression accessible to people without any formal statistical training. Although wise use of statistical methods such as linear regression helps us, even novices, develop a better understanding of data and guides our decisions, it also causes confusion in the interpretation of results and leads to paradoxical findings. For example, we are often asked by our biomedical collaborators questions like “When I run the univariate regression of Y on the predictor X1, the p-value is very small. However, if I add some other predictors in the model, X1 is not significant anymore. Why?” The same problem also occurs in logistic regression for binary outcomes[2], log-linear regression for count data[2], and Cox proportional hazards regression for survival data.[3]

A simple answer to this question is that the univariate and multiple regression models make different assumptions. However, this answer is not very meaningful to non-statisticians. We discuss it in Section 2.

In many medical studies, regression analysis involves a large number of independent variables, or predictors. Model selection is required to find the predictors that are significantly associated with an outcome, or dependent variable, of interest. Here is how the model selection was done in a recent paper published in JAMA Surgery[4]:

“The administrative database was then evaluated by means of univariate and multivariate logistic regression. First we identified variables that were associated (P < .20) with readmission, the dependent variable. These potential confounders were then entered in multivariate stepwise (backward elimination) logistic regression, with readmission as the dependent variable. A logistic regression model was constructed to identify patient factors associated with readmission.”

Using this forward selection procedure as the first step to weed out “non-significant” predictors has become almost the gold standard for variable selection and has been used in many papers published in top medical journals.[5-24] The key idea of this method is first to run a univariate regression on each predictor. If the p-value is less than some pre-specified level, for example 0.1, then the predictor is used in the multiple regression. Otherwise, the predictor is assumed to have no significant effect on the outcome. This method seems quite logical and intuitively meaningful. Indeed, it has been used and is still being used by the biomedical and other research communities. Is this a valid procedure?
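The screening step described above can be sketched in a few lines. This is only an illustration of the procedure under discussion, not a recommended method: the function names `univariate_p_value` and `univariate_screen` are hypothetical, and the p-value uses a large-sample normal approximation rather than the exact t distribution, which is a simplifying assumption.

```python
import math
from statistics import NormalDist


def univariate_p_value(x, y):
    """Two-sided p-value for the slope in the simple regression of y on x.

    Assumption: a normal approximation replaces the t distribution.
    """
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares and standard error of the slope.
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = abs(slope / se)
    return 2 * (1 - NormalDist().cdf(z))


def univariate_screen(predictors, y, threshold=0.1):
    """Return the names of predictors whose univariate p-value is below threshold."""
    return [name for name, x in predictors.items()
            if univariate_p_value(x, y) < threshold]
```

Only the predictors returned by `univariate_screen` would then be passed to the multiple regression, so any predictor dropped here never gets a chance to show its effect jointly with the others.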

In this paper we use linear regression analysis to show two paradoxes in regression analysis. In Section 2 we use some very basic theory to show how univariate regression and multiple regression make different assumptions on the models. We use examples and simulation studies to show two paradoxes in regression analysis in Section 3. Section 4 briefly discusses the transitivity of correlation. Our results clearly invalidate the model selection procedure widely used in biomedical research.

2. Basic theory

Let (Y, X1, ..., Xp) be a random vector, where X1, ..., Xp are called the covariates (independent variables) and Y is called the outcome (dependent variable). The regression of Y on (X1, ..., Xp) is the conditional expectation of Y given (X1, ..., Xp), denoted by E[Y|X1, ..., Xp], which is a measurable function of (X1, ..., Xp). Denote this function by g(X1, ..., Xp). Without knowing the joint distribution of (X1, ..., Xp, Y), the form of g(X1, ..., Xp) is in general unknown. In statistical analysis, we usually assume some mathematically tractable form of g(X1, ..., Xp). For example, linear regression analysis[1] assumes that

$$g(X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p .$$

In logistic regression analysis with a 0-1 outcome[2], we assume that

$$g(X_1, \ldots, X_p) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)} .$$

In this paper we assume the outcome Y is continuous. Let

$$\varepsilon = Y - E[Y \mid X_1, \ldots, X_p] .$$

It is obvious that E[ε|X1, ..., Xp] = 0. We consider a stronger form of the linear regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, \tag{1}$$

and assume that, given X1, ..., Xp, the variance of ε is

$$\operatorname{Var}(\varepsilon \mid X_1, \ldots, X_p) = \sigma^2, \tag{2}$$

which does not depend on (X1, ..., Xp). This assumption is also used in most of the statistical literature on linear models.[1] We further assume that Xk, k = 1, ..., p, have finite second moments.

From (1) we have

$$E[Y \mid X_1] = \beta_0 + \sum_{k=1}^{p} \beta_k E[X_k \mid X_1] .$$

Let Zk = E[Xk|X1], k = 1, ..., p. (It is clear that Z1 = X1.) Then the regression of Y on X1 is

$$E[Y \mid X_1] = \beta_0 + \sum_{k=1}^{p} \beta_k Z_k ,$$

which still has a linear form. Let η = Y − E[Y|X1]. Then

$$Y = \beta_0 + \sum_{k=1}^{p} \beta_k Z_k + \eta . \tag{3}$$

Although (3) has the same form as (1), they are fundamentally different in the error terms. Note that E[η|X1] = 0 and Cov(Zk, η) = 0, k = 1, ..., p. However, the conditional variance of η given X1 is

$$\operatorname{Var}(\eta \mid X_1) = \sigma^2 + \operatorname{Var}\Big( \sum_{k=2}^{p} \beta_k X_k \,\Big|\, X_1 \Big) .$$

Therefore, the conditional variance of η given X1 is in general no longer a constant. This violates the fundamental assumption used in the linear regression model.[1]

The univariate linear regression of Y on X1 assumes the following form of the model

From (3) we know that generally

Suppose (Yi, Xi1, ..., Xip), i = 1, ..., n, is a random sample from (1), and consider the least squares estimate of the slope in the univariate regression of Yi on Xi1 in (4). Then

and
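As a sketch of what the univariate model (4) and result (5) express, the standard least squares formulas can be written out under assumed notation. The symbols γ0, γ1 and e for the univariate model are hypothetical (they are not taken from the paper), and model (1) is assumed to have the usual form Y = β0 + β1X1 + ... + βpXp + ε with E[ε|X1, ..., Xp] = 0.

```latex
% Assumed notation: \gamma_0, \gamma_1, e are hypothetical symbols for the univariate model.
\begin{align*}
Y &= \gamma_0 + \gamma_1 X_1 + e, \\
\hat\gamma_1 &= \frac{\sum_{i=1}^{n} (X_{i1} - \bar X_1)(Y_i - \bar Y)}
                     {\sum_{i=1}^{n} (X_{i1} - \bar X_1)^2},
\qquad \bar X_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}, \quad \bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i, \\
\hat\gamma_1 &\xrightarrow{\;p\;} \frac{\operatorname{Cov}(X_1, Y)}{\operatorname{Var}(X_1)}
  = \beta_1 + \sum_{k=2}^{p} \beta_k \, \frac{\operatorname{Cov}(X_1, X_k)}{\operatorname{Var}(X_1)} .
\end{align*}
```

Under this reading, the univariate slope estimates β1 only when every other covariate with a nonzero coefficient is uncorrelated with X1; otherwise it mixes β1 with the coefficients of the omitted covariates.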

3. Two paradoxes in linear regression analysis

In this section we show why the estimates of the coefficient of some covariates in the univariate regression and in the multiple regression do not match. More specifically, we show that in some cases the estimate from the univariate regression is significant, but the result from the multiple regression is not. On the other hand, in some cases the result is significant for the multiple regression but not for the univariate regression.

Suppose (1) is the true multiple regression model. The univariate regression model uses model (4) by assuming that = 0. This assumption is generally wrong unless E[Xk|X1] is a constant (k = 2, ..., p). Hence, with a correct multiple regression model, the estimate of the univariate analysis is based on a wrong model. This is the reason why the results from univariate regression and multiple regression do not match. Furthermore, result (5) shows that there is no clear interpretation of the estimate in the univariate analysis.

We discuss two paradoxes related to univariate and multiple regressions through both theoretical derivations and simulation studies.

3.1 Significant covariate effect in multiple regression but not in univariate regression

Let X2, X3, X4 and ε be independent random variables with standard normal distributions. Consider the following model

which is 0 if and only if

From (5) we know that if (7) is true, the least squares estimator of the coefficient of the univariate regression of Y on X1 will not be significant, even though X1 is necessary in specifying model (6).

Example 1. Let α1 = -3/5, α2 = 3, α3 = 4, β1 = 1, β2 = 2 in (6). The true model is

Table 1 shows the simulation results for the estimates and standard deviations of the coefficient of X1 in both univariate and multiple regressions after 10,000 replications. For a wide range of sample sizes, the least squares estimator of the coefficient of X1 in the multiple regression is very close to the true value, and the standard deviation decreases significantly with the sample size. However, the estimate of the coefficient in the univariate analysis is very close to 0 in all cases.

According to the practice in medical publications[4-24], X1 will not enter the multiple regression. Table 2 shows the least squares estimates of the coefficients of X2 and X3 after X1 is removed in (8). It is easy to see that the estimate of the coefficient of X2 in the multiple regression is dramatically biased once X1 has been removed on the basis of the univariate analysis.
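Example 1 can be checked numerically with a small simulation. The sketch below assumes a particular form for model (6): X1 = β1·X2 + β2·X4 and Y = α1·X1 + α2·X2 + α3·X3 + ε. This form is an assumption, chosen because it gives Cov(X1, Y) = α1(β1² + β2²) + α2β1 = 0 for the stated parameter values; the helper names `ols` and `simulate_example1` are hypothetical, and a single sample is drawn rather than 10,000 replications.

```python
import random


def ols(columns, y):
    """Least squares coefficients (intercept first) from the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate_example1(n=2000, seed=1):
    rng = random.Random(seed)
    x2, x3, x4, eps = ([rng.gauss(0, 1) for _ in range(n)] for _ in range(4))
    x1 = [b + 2 * d for b, d in zip(x2, x4)]  # assumed form: X1 = X2 + 2*X4
    y = [-0.6 * a + 3 * b + 4 * c + e for a, b, c, e in zip(x1, x2, x3, eps)]
    return {
        "univariate_x1": ols([x1], y)[1],      # slope of Y on X1 alone
        "multiple_x1": ols([x1, x2, x3], y)[1],  # coefficient of X1 in the full model
        "x2_without_x1": ols([x2, x3], y)[1],  # coefficient of X2 once X1 is dropped
    }


if __name__ == "__main__":
    print(simulate_example1())
```

Under the assumed form, the univariate slope for X1 stays near 0, the multiple regression coefficient stays near -3/5, and the coefficient of X2 in the model without X1 settles near α2 + α1β1 = 2.4 rather than the true value 3.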

3.2 Significant covariate effect in univariate regression but not in multiple regression

Suppose X1, X2, X3 and ε are independent standard normal random variables, and X4 = β1X1 + β2X2, where

Table 1. Estimate of the regression coefficient of X1

Table 2. Estimates of the regression coefficients of X2 and X3 with X1 being removed

Consider the following true model

If (9) is expanded to include X4 and the expanded model still satisfies the conditions of the linear regression, then the regression equation becomes

From (9) and (10) we have

or

Example 2. Let α0 = 0, α1 = 1, α2 = 2 in (9) and β1 = β2 = 1. Table 3 shows the least squares estimates of the coefficient of X4 in both univariate and multiple linear regressions after 10,000 replications. For all sample sizes, the univariate regression shows that X4 has a very significant effect on Y. However, in the multiple regression, the effect is not significant.
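The pattern in Example 2 can be reproduced with a short simulation. The sketch below rests on a stated assumption: it takes X4 = X1 + X3, a hypothetical choice that keeps X4 correlated with X1 (correlation about 0.707) without making it an exact linear combination of X1 and X2, which Section 4 says it is not. An exact combination of X1 and X2 would make the multiple regression on X1, X2 and X4 singular. The outcome follows Y = X1 + 2·X2 + ε, and the names `ols` and `simulate_example2` are illustrative only.

```python
import random


def ols(columns, y):
    """Least squares coefficients (intercept first) from the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate_example2(n=2000, seed=2):
    rng = random.Random(seed)
    x1, x2, x3, eps = ([rng.gauss(0, 1) for _ in range(n)] for _ in range(4))
    x4 = [a + c for a, c in zip(x1, x3)]  # assumed form: X4 = X1 + X3
    y = [a + 2 * b + e for a, b, e in zip(x1, x2, eps)]  # Y = X1 + 2*X2 + eps
    return {
        "univariate_x4": ols([x4], y)[1],        # slope of Y on X4 alone
        "multiple_x4": ols([x1, x2, x4], y)[3],  # coefficient of X4 given X1, X2
    }


if __name__ == "__main__":
    print(simulate_example2())
```

With this choice the univariate slope of Y on X4 is close to Cov(X4, Y)/Var(X4) = 1/2, while the coefficient of X4 in the regression on X1, X2 and X4 is close to 0, because X4 carries no information about Y beyond what X1 already provides.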

4. Transitivity of correlation

Another issue in regression analysis is the transitivity of correlation in interpretation. For example, some people may say: “Since factor A is highly correlated with outcome Y, and factor A and factor B are highly correlated, then B should be correlated with Y.” It seems very intuitive and reasonable that correlation is transitive. Unfortunately, this is not true. Here is a theoretical example. Suppose X and Z are independent standard normal random variables and Y = X + Z. It is clear that the correlation between X and Y and the correlation between Y and Z are both 0.707. However, the correlation between X and Z is 0.
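The counterexample above is easy to confirm by simulation. The following sketch draws X and Z as independent standard normal samples, sets Y = X + Z, and computes the three sample correlations; the function names `pearson` and `transitivity_demo` are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def transitivity_demo(n=20000, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    y = [a + b for a, b in zip(x, z)]  # Y = X + Z
    return pearson(x, y), pearson(y, z), pearson(x, z)


if __name__ == "__main__":
    r_xy, r_yz, r_xz = transitivity_demo()
    print(f"corr(X, Y) = {r_xy:.3f}, corr(Y, Z) = {r_yz:.3f}, corr(X, Z) = {r_xz:.3f}")
```

The first two correlations come out close to 1/√2 ≈ 0.707 and the third close to 0, so two strong correlations sharing a variable say nothing about the remaining pair.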

Table 3. Estimate of the regression coefficient of X4

In our Example 2, the correlations of X4 with X1 and Y are 0.707 and 0.408, respectively. However, we showed in Section 3.2 that X4 has no role in the multiple regression if X1 and X2 are in the model, although X4 is not a linear combination of X1 and X2.

5. Discussion

Regression analysis in medical research usually involves many predictors (independent variables). Model selection is needed to pick the covariates that have a significant effect on the outcome. A widely used method in medical publications[4-24] is first to screen the covariates through univariate analysis. If a covariate is not significant in the univariate regression analysis, it will not enter the multiple regression analysis. The underlying assumption of this method is that a covariate is significant in the multiple regression only if it is significant in the univariate regression analysis. Our results indicate that this assumption is wrong. A covariate may be very significant in the univariate regression but have no role in the multiple regression (see Example 2 in Section 3). On the other hand, a covariate may be a necessary part of a multiple regression yet be uncorrelated with the outcome (see Example 1 in Section 3). The initial univariate screening method totally ignores the correlation among covariates. There is no theoretical work to support this method. Our simulation results clearly show that the multiple regression results after the univariate screening may be dramatically biased and misleading. The biomedical community should stop using this procedure in their research and publications.

Funding

None

Conflict of interest statement

The authors report no conflict of interest related to this manuscript.

Authors’ contributions

Ge Feng and Changyong Feng: theoretical derivation and revision

Jing Peng, Dongke Tu, and Julia Z. Zheng: Simulation and manuscript drafting

References

1. Seber GAF, Lee AJ. Linear regression analysis (2nd ed). Hoboken, NJ: Wiley; 2003

2. Agresti A. Categorical data analysis (2nd ed). Hoboken, NJ: Wiley; 2002

3. Cox DR. Regression models and life-tables (with discussion). J R Stat Soc B. 1972; 34: 187-220. doi: http://dx.doi.org/10.2307/2985181

4. McIntyre LK, Arbabi S, Robinson EF, Maier RV. Analysis of risk factors for patient readmission 30 days following discharge from general surgery. JAMA Surgery. 2016; (Epub ahead of print). doi: http://dx.doi.org/10.1001/jamasurg.2016.1258

5. Bardia A, Sood A, Mahmood F, Orhurhu V, Mueller A, Montealegre-Gallegos M, et al. Combined epidural-general anesthesia vs general anesthesia alone for elective abdominal aortic aneurysm repair. JAMA Surgery. 2016; (Epub ahead of print). doi: http://dx.doi.org/10.1001/jamasurg.2016.2733

6. Barlesi F, Mazieres J, Merlio JP, Debieuvre D, Mosser J, Lena H, et al. Routine molecular profiling of patients with advanced non-small-cell lung cancer: results of a 1-year nationwide programme of the French Cooperative Thoracic Intergroup (IFCT). Lancet. 2016; 387: 1415-1426. doi: http://dx.doi.org/10.1016/S0140-6736(16)00004-0

7. Brooks GA, Kansagra AJ, Rao SR, Weitzman JI, Linden EA, Jacobson JO. A clinical prediction model to assess risk for chemotherapy-related hospitalization in patients initiating palliative chemotherapy. JAMA Oncology. 2015; 1(4): 441-447. doi: http://dx.doi.org/10.1001/jamaoncol.2015.0828

8. Cronin PR, DeCoste L, Kimball AB. A multivariate analysis of dermatology missed appointment predictors. JAMA Dermatology. 2013; 149(12): 1435-1437. doi: http://dx.doi.org/10.1001/jamadermatol.2013.5771

9. Fivez T, Kerklaan D, Mesotten D, Verbruggen S, Wouters PJ, Vanhorebeek I, et al. Early versus late parenteral nutrition in critically ill children. N Engl J Med. 2016; 374(12): 1111-1122. doi: http://dx.doi.org/10.1056/NEJMoa1514762

10. Geng E, Kreiswirth B, Burzynski J, Schluger NW. Clinical and radiographic correlates of primary and reactivation tuberculosis: a molecular epidemiology study. JAMA. 2005; 293(22): 2740-2745. doi: http://dx.doi.org/10.1001/jama.293.22.2740

11. Hole J, Hirsch M, Ball E, Meads C. Music as an aid for postoperative recovery in adults: a systematic review and meta-analysis. Lancet. 2015; 386: 1659-1671. doi: http://dx.doi.org/10.1016/S0140-6736(15)60169-6

12. International CLL-IPI working group. An international prognostic index for patients with chronic lymphocytic leukaemia (CLL-IPI): a meta-analysis of individual patient data. Lancet Oncology. 2016; 17(6): 779-790. doi: http://dx.doi.org/10.1016/S1470-2045(16)30029-8

13. Leon MB, Smith CR, Mack MJ, Makkar RR, Svensson LG, Kodali SK, et al. Transcatheter or surgical aortic-valve replacement in intermediate-risk patients. N Engl J Med. 2016; 374(17): 1609-1620. doi: http://dx.doi.org/10.1056/NEJMoa1514616

14. Li Y, Stocchi L, Cherla D, Liu X, Remzi FH. Association of preoperative narcotic use with postoperative complications and prolonged length of hospital stay in patients with Crohn disease. JAMA Surgery. 2016; 151(8): 726-734. doi: http://dx.doi.org/10.1001/jamasurg.2015.5558

15. Lorant V, Deliège D, Eaton W, Robert A, Philippot P, Ansseau M. Socioeconomic inequalities in depression: a meta-analysis. Am J Epidemiol. 2003; 157(2): 98-112. doi: http://dx.doi.org/10.1093/aje/kwf182

16. van der Meer AJ, Veldt BJ, Feld JJ, Wedemeyer H, Dufour JF, Lammert F, et al. Association between sustained virological response and all-cause mortality among patients with chronic hepatitis C and advanced hepatic fibrosis. JAMA. 2012; 308(24): 2584-2593. doi: http://dx.doi.org/10.1001/jama.2012.144878

17. Mingrone G, Panunzi S, De Gaetano A, Guidone C, Iaconelli A, Nanni G, et al. Bariatric-metabolic surgery versus conventional medical treatment in obese patients with type 2 diabetes: 5 year follow-up of an open-label, single-centre, randomized controlled trial. Lancet. 2015; 386: 964-973. doi: http://dx.doi.org/10.1016/S0140-6736(15)00075-6

18. Nelson KB, Ellenberg JH. Antecedents of cerebral palsy: I. Univariate analysis of risks. Am J Dis Child. 1985; 139(10): 1031-1038. doi: http://dx.doi.org/10.1001/archpedi.1985.02140120077032

19. Nelson KB, Ellenberg JH. Antecedents of cerebral palsy: Multivariate analysis of risk. N Engl J Med. 1986; 315(2): 81-86. doi: http://dx.doi.org/10.1056/NEJM198607103150202

20. NICE-SUGAR Study Investigators. Hypoglycemia and risk of death in critically ill patients. N Engl J Med. 2012; 367(12): 1108-1118. doi: http://dx.doi.org/10.1056/NEJMoa1204942

21. Pagès F, Berger A, Camus M, Sanchez-Cabo F, Costes A, Molidor R, et al. Effector memory T cells, early metastasis, and survival in colorectal cancer. N Engl J Med. 2005; 353(25): 2654-2666. doi: http://dx.doi.org/10.1056/NEJMoa051424

22. Schwed AC, Boggs MM, Pham XD, Watanabe DM, Bermudez MC, Kaji AH, et al. Association of admission laboratory values and the timing of endoscopic retrograde cholangiopancreatography with clinical outcomes in acute cholangitis. JAMA Surgery. 2016; (Epub ahead of print). doi: http://dx.doi.org/10.1001/jamasurg.2016.2329

23. Templin C, Ghadri JR, Diekmann J, Napp LC, Bataiosu DR, Jaguszewski M, et al. Clinical features and outcomes of takotsubo (stress) cardiomyopathy. N Engl J Med. 2015; 373(10): 929-938. doi: http://dx.doi.org/10.1056/NEJMoa1406761

24. Wood GC, Benotti PN, Lee CJ, Mirshahi T, Still CD, Gerhard GS, Lent MR. Evaluation of the association between preoperative clinical factors and long-term weight loss after Roux-en-Y gastric bypass. JAMA Surgery. 2016; (Epub ahead of print). doi: http://dx.doi.org/10.1001/jamasurg.2016.2334

Ge Feng is a graduate student in the School of Geophysics and Oil Resources at Yangtze University, Wuhan, Hubei, China. His research interests include statistical analysis in rock physics.


Summary: Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection.

[Shanghai Arch Psychiatry. 2016; 28(6): 355-360. doi: http://dx.doi.org/10.11919/j.issn.1002-0829.216084]

¹School of Geophysics and Oil Resources, Yangtze University, Wuhan, China

²Department of Biostatistics & Computational Biology, University of Rochester, Rochester, NY, USA

³Department of Anesthesiology, University of Rochester, Rochester, NY, USA

⁴School of Philosophy, Wuhan University, Wuhan, China

⁵Department of Microbiology and Immunology, McGill University, Montreal, QC, Canada

*correspondence: Dr. Changyong Feng. Mailing address: Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Ave., Box 630, Rochester, NY, USA. Postcode: NY 14642. E-mail: Changyong_feng@urmc.rochester.edu

