ZHANG Yingxia (張盈俠), YANG Youlong (楊有龍), CUI Jianfei (崔劍飛)
1 School of Mathematics and Statistics, Xidian University, Xi'an 710126, China
2 School of Computer, Xidian University, Xi'an 710126, China
Bayesian networks (BNs), introduced by Pearl [1], graphically encode probabilistic relationships among a set of variables. They have therefore become a powerful tool for representation under uncertainty and have been used in numerous fields, including systems biology [2], medicine [3], and artificial intelligence [4]. Under the assumption of causal sufficiency, which guarantees that no two or more variables share a hidden common cause, the true structure can be represented by a directed acyclic graph (DAG) (see section 1 for the definition) [5]. In this case, a node represents a variable and a directed edge indicates a direct causal relationship between two nodes. However, finding a DAG that fits the observational data well is a difficult problem. Moreover, using only non-interventional data, identifying the exact DAG is generally impossible, which greatly limits our ability to infer the causal graph from uncontrolled data alone [6]. Hence, traditional algorithms often generate a partially directed acyclic graph (PDAG), which consists of both undirected and directed edges.
Generally speaking, constructing the true model graph is too complex for humans, since exact inference of a DAG is an NP-hard problem. Even so, many structure learning methods have been proposed over the last two decades, which can be broadly categorized into three main types. (1) Constraint-based methods use the causal relationships among variables, detected by conditional independence (CI) tests, to construct a PDAG. There are many constraint-based methods, such as the IC algorithm [7], the PC algorithm [5], and the TPDA algorithm [8]. These algorithms share a common procedure: they first find the skeleton of the BN and then orient some of the edges. However, the accuracy of constraint-based algorithms depends on the accuracy of the CI tests, which are sensitive to noise and sample size [9]. (2) Score-based algorithms search for a structure that fits the observational data well, as measured by a scoring function. However, an exhaustive search over all possible networks is infeasible, since their number is super-exponential in the number of variables. (3) Hybrid methods combine the two approaches and form another class of efficient structure learning algorithms. Hybrid methods such as the SC algorithm [10], the MMHC algorithm [11], and the COS algorithm [12] first find a super-structure of the skeleton using a constraint-based algorithm, and then search for a high-scoring network using a score-based algorithm. This strategy can lead to a BN with a higher score while decreasing the size of the search space.
Motivated by the aforementioned analysis, this paper presents a new constraint-based approach to learning the equivalence class of a BN. Firstly, it learns the skeleton of the BN in a more efficient way, reducing the number of CI tests compared with the PC algorithm. Secondly, some of the edges are oriented. The algorithm returns a representation of the BN whose quality is competitive with that of the PC algorithm, but it significantly outperforms the PC algorithm in running time and in the number of CI tests. We therefore name our algorithm FPC, as it is indeed faster than the PC algorithm. In addition, a large number of experiments show that the FPC algorithm also outperforms the TPDA algorithm in structural errors when the sample size is small (see section 3).
This paper is organized as follows. Section 1 introduces preliminaries. Section 2 describes the proposed algorithm. Section 3 presents the numerical experimental results and a comparison with other classic algorithms. In section 4, we conclude the paper and discuss further work.
In this section, we present some definitions that are used throughout the paper.
Definition 1 (Directed acyclic graph (DAG)) [13] A graph G = (V, E) is said to be a DAG if G = (V, E) is a directed graph such that there is no directed path (α1, α2, …, αn), where αi ∈ V (i = 1, 2, …, n) and α1 = αn.
A DAG G = (V, E) is said to be a BN with regard to (w.r.t.) a probability distribution P if it satisfies the Markov condition (local Markov property) that every variable x ∈ V is independent of any subset of its non-descendant variables conditioned on the set of its parents [9].
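Equivalently, the Markov condition yields the familiar factorization of the joint distribution over the parent sets Pa(xi) in G:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{Pa}(x_i)\big).$$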
Definition 2 (Skeleton) [13] The skeleton of a directed graph G = (V, E) is obtained by making the graph undirected. That is, the skeleton of G is the graph G′ = (V, E′), where (α, β) ∈ E′ ? (α, β) ∈ E or (β, α) ∈ E.
Fig. 1 The skeleton (d) of (a), (b) and (c)
Definition 3 (V-structure) A triple of nodes (x, z, y) forms a V-structure if x and y meet head to head at z (i.e., x → z ← y) while x and y are not directly connected in the graph.
Definition 4 (Faithfulness) [14] Let P be a joint probability distribution on the variable set V. A DAG G = (V, E) is faithful to P if and only if every independence present in P is entailed by G and the Markov condition. A distribution P is faithful if and only if there is a DAG G such that G is faithful to P. In particular, P and a BN are faithful to each other if and only if every conditional independence entailed by the BN corresponds to a Markov condition present in P.
Under the faithfulness condition, and if only non-interventional data are used, we cannot identify the exact DAG, because the orientation of edges cannot be detected except in V-structures [15].
Definition 5 (d-separation) Let G = (V, E) be a DAG, Z be a set of nodes in V, and x and y be two distinct nodes in V \ Z. Then x and y are said to be d-separated by Z if all trails between x and y are blocked by Z. For example, in the chain x → z → y, the set {z} blocks the only trail, so x and y are d-separated by {z}; in the V-structure x → z ← y, the trail is blocked by the empty set but not by {z}.
Under the faithfulness condition relating the distribution P and the DAG G, conditional independence in P and d-separation in G are equivalent, i.e., (x ⊥ y | Z)G ? (x ⊥ y | Z)P. In this case, we also say that G is a perfect map of P. We can use the dependence relationship between x and y in P to detect the relationship between x and y in G; thus CI tests on P can be used to discover the relationships among the variables in G.
Definition 6 (Equivalent BN structures) [7] Two DAGs G1 and G2 are said to be equal or Markov equivalent if and only if they have the same skeleton and the same set of V-structures. The corresponding BNs, N1 and N2, are then said to be equal, denoted as N1 ≡ N2.
Fig. 2 Equivalence classes of BN: (a), (b) and (c)
In Fig. 2, (a), (b) and (c) belong to the same equivalence class of BNs, as they have the same skeleton and V-structures.
In this section, we describe the proposed algorithm in detail. The new algorithm, FPC, can be divided into two phases. Beginning with a complete undirected graph, we first learn the skeleton of the BN from the observational data by CI tests. In the second phase, the set of V-structures is discovered, and the corresponding edges are oriented. The skeleton and V-structures are then combined to complete the PDAG. The proposed method is illustrated for the discrete case.
The FPC algorithm starts with a fully connected graph from which unnecessary edges are iteratively deleted one at a time. For each variable x, the conditional independence relationships between x and its neighbours are tested along the existing edges by conditioning on subsets of the current neighbour set of x (Nbx). For every element y of Nbx, the edge x-y is deleted if there exists a separating set S, a subset of Nbx, that renders the dependence between x and y insignificant.
For compact presentation, we declare some notation used in what follows. Nbx denotes the neighbour set of variable x, and the conditioning set used for CI tests is denoted as S. A minimal sketch of this first phase is given below.
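To make the first phase concrete, the following is a minimal Python sketch of this PC-style skeleton search (the experiments in section 3 were run in Matlab; Python is used here only for illustration). The function ci_test(x, y, S) stands for any CI test, such as the chi-square test of section 3; its name and the max_order cap are illustrative assumptions rather than the authors' implementation.

from itertools import combinations

def learn_skeleton(variables, ci_test, max_order=3):
    # PC-style skeleton search: start from a complete undirected graph
    # and delete the edge x-y whenever some subset S of the current
    # neighbours of x renders x and y conditionally independent.
    nb = {x: set(variables) - {x} for x in variables}   # complete graph
    sepset = {}                          # separating sets, reused in phase 2
    for order in range(max_order + 1):   # size of the conditioning set S
        for x in variables:
            for y in list(nb[x]):
                candidates = nb[x] - {y}
                if len(candidates) < order:
                    continue
                for S in combinations(candidates, order):
                    if ci_test(x, y, set(S)):   # is x independent of y given S?
                        nb[x].discard(y)        # delete the edge x-y
                        nb[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
    return nb, sepset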
The FPC algorithm contains two phases. Firstly, it learns the skeleton of the BN; the inspiration for this step comes from Ref. [16], which is reported to be more efficient than other relevant algorithms. In the second phase, the algorithm discovers the set of V-structures, in the same way as the PC algorithm does. The new algorithm decreases the number of CI tests and runs more quickly than the PC algorithm. In addition, compared with the classic TPDA algorithm, it decreases the total number of structural errors when the sample size is small. We show the efficiency of the FPC algorithm through multiple simulations on a standard benchmark network.
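The second phase can be sketched in the same illustrative style: every unshielded triple x - z - y (x and y non-adjacent) is oriented as the V-structure x → z ← y whenever z does not appear in the separating set recorded for x and y. This is the standard PC orientation rule; the data structures continue from the sketch above.

from itertools import combinations

def orient_v_structures(nb, sepset):
    # Orient each unshielded triple x - z - y as x -> z <- y when z is
    # absent from the separating set that removed the edge x-y.
    directed = set()                     # ordered pairs (u, v) meaning u -> v
    for z in nb:
        for x, y in combinations(sorted(nb[z]), 2):
            if y in nb[x]:               # x and y adjacent: triple is shielded
                continue
            key = frozenset((x, y))
            if key in sepset and z not in sepset[key]:
                directed.add((x, z))
                directed.add((y, z))
    return directed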
To evaluate the proposed algorithm, we have performed experiments with the Alarm network. First, we state the experimental setup: all simulations were performed in Matlab on a computer equipped with a Pentium(R) Dual-Core E5800 CPU with a 3.20-GHz processing speed, 1.96 GB of RAM, and Windows XP Pro.
In this paper, we focus on the PC and TPDA algorithms as case studies. For comparison, we restrict our experiments to discrete simulated data. The chi-square test is used for CI testing, with the significance level α set to 0.05. In addition, the reported TPDA results are obtained using the publicly available Causal Explorer Toolkit [17].
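As an illustration of the test, a conditional chi-square test of x ⊥ y | S on discrete data can be sketched as follows: stratify the data by each configuration of S, pool the per-stratum chi-square statistics and degrees of freedom, and accept independence when the pooled p-value exceeds α. This is one common construction; the paper does not spell out its implementation, so the details below are assumptions.

import numpy as np
import pandas as pd
from scipy.stats import chi2

def chi_square_ci_test(data, x, y, S, alpha=0.05):
    # Test x independent of y given S on a DataFrame of discrete variables
    # by pooling chi-square statistics over the strata defined by S.
    groups = [data] if not S else [g for _, g in data.groupby(list(S))]
    stat, dof = 0.0, 0
    for g in groups:
        table = pd.crosstab(g[x], g[y]).to_numpy()
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue                     # degenerate stratum: no information
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        mask = expected > 0
        stat += ((table - expected)[mask] ** 2 / expected[mask]).sum()
        dof += (table.shape[0] - 1) * (table.shape[1] - 1)
    if dof == 0:
        return True                      # no evidence of dependence
    return chi2.sf(stat, dof) > alpha    # True means "independent"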
In order to cover a wide variety of possible cases, 1 000 to 15 000 data points are generated randomly for the experiments. Each method is used to recover multiple skeletons or PDAGs from samples of varying sizes for the same network, and each sample size is run three times to obtain an average, which is used for comparison.
In the experiments, the number of CI tests, the structural errors, and the order of the CI tests are shown in figures. As the CI tests are used only in the first phase, only the numbers of added and missing edges are reported as structural errors.
Table 1 contains the number of CI tests, the structural errors, and the order of CI tests for the Alarm network, and we arrive at the following observations.
Table 1 Results of experiments for the Alarm network
The FPC algorithm has structural errors competitive with the PC algorithm, but decreases the number of CI tests significantly, which means the proposed algorithm runs more quickly. In addition, it is observed that both algorithms produce fewer added edges and more missing edges. We also arrive at the rule that as the sample size increases from 1 000 to 15 000, the total structural error decreases quickly.
For convenience, Fig. 3 shows the average number of CI tests for each algorithm as the number of data points varies from 1 000 to 15 000. Based on Fig. 3, the proposed FPC algorithm consistently improves the running time, as it greatly decreases the number of CI tests compared with the PC algorithm.
Fig. 3 Average number of CI tests with the Alarm network
Another important point is that FPC does not decrease the order of the CI tests. This may be the main reason that the two algorithms have competitive structural errors. As high-order CI tests are known to be unreliable, we will devote future research to decreasing the order of CI tests.
Fig. 4 shows the average number of missing edges of the TPDA algorithm in recovering the skeleton of the Alarm network as the number of observations varies from 1 000 to 15 000, and Fig. 5 shows the average number of added edges in the same setting. Note that the FPC algorithm outperforms the TPDA algorithm in terms of the average number of added edges, but is worse in terms of the average number of missing edges. Nevertheless, the FPC algorithm consistently improves the accuracy compared with the TPDA algorithm when the number of observations is small (see Fig. 6).
Fig. 4 Average number of missing edges with the Alarm network
Fig. 5 Average number of added edges with the Alarm network
Fig. 6 Average number of structure errors with the Alarm network
In this paper, a new constraint-based algorithm, FPC, has been proposed for learning BN structure. Under the faithfulness assumption, it usually returns an equivalence class of BNs. Compared with PC, the total number of CI tests favours FPC, while the number of structural errors of FPC is competitive with PC. In addition, the FPC algorithm improves the accuracy of skeleton recovery compared with the TPDA algorithm when the sample size is small. How to decrease the order of CI tests to obtain a more accurate BN is an important problem for future research; moreover, how to relax the faithfulness assumption remains an open problem.
[1] Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference [M]. San Francisco, USA: Morgan Kaufmann Publishers, 1988.
[2] Friedman N. Inferring Cellular Networks Using Probabilistic Graphical Models [J]. Science, 2004, 303(5659): 799-805.
[3] Cowell R G, Dawid A P, Lauritzen S L, et al. Probabilistic Networks and Expert Systems [M]. New York: Springer-Verlag, 1999.
[4] Russell S J, Norvig P. Artificial Intelligence: A Modern Approach [M]. 3rd ed. Upper Saddle River: Prentice Hall, 2009.
[5] Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search [M]. 2nd ed. London, England: The MIT Press, 2001.
[6] Pearl J. Causality: Models, Reasoning, and Inference [M]. Cambridge: Cambridge University Press, 2000.
[7] Verma T, Pearl J. Equivalence and Synthesis of Causal Models [C]. In Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence, New York, USA, 1990: 220-227.
[8] Cheng J, Greiner R, Kelly J, et al. Learning Bayesian Networks from Data: An Information-Theory Based Approach [J]. Artificial Intelligence, 2002, 137(1/2): 43-90.
[9] Mahdi R, Mezey J. Sub-local Constraint-Based Learning of Bayesian Networks Using a Joint Dependence Criterion [J]. Journal of Machine Learning Research, 2013, 14(1): 1563-1603.
[10] Friedman N, Nachman I, Pe'er D. Learning Bayesian Network Structure from Massive Datasets: The "Sparse Candidate" Algorithm [C]. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999: 206-215.
[11] Tsamardinos I, Brown L E, Aliferis C F. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm [J]. Machine Learning, 2006, 65(1): 31-78.
[12] Perrier E, Imoto S, Miyano S. Finding Optimal Bayesian Network Given a Super-structure [J]. Journal of Machine Learning Research, 2008, 9(10): 2251-2286.
[13] Koski T, Noble J M. Bayesian Networks: An Introduction [M]. Chichester, United Kingdom: John Wiley & Sons, 2009.
[14] Aliferis C F, Statnikov A, Tsamardinos I, et al. Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation [J]. The Journal of Machine Learning Research, 2010, 11: 171-234.
[15] Pellet J, Elisseeff A. Using Markov Blankets for Causal Structure Learning [J]. The Journal of Machine Learning Research, 2008, 9: 1295-1342.
[16] Fu S K, Desmarais M. Feature Selection by Efficient Learning of Markov Blanket [J]. Lecture Notes in Engineering and Computer Science, 2010, 2183: 302-308.
[17] Aliferis C F, Tsamardinos I, Statnikov A R, et al. Causal Explorer: A Causal Probabilistic Network Learning Toolkit for Biomedical Discovery [R]. Discovery Systems Laboratory (DSL), Vanderbilt University, Tennessee, USA, 2005.