Ziyang Li | Feng Hu | Chilong Wang | Weibin Deng | Qinghua Zhang
Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
Abstract The purpose of relation extraction is to identify the semantic relation between the two entities mentioned in a sentence. Recently, many variants of the convolution neural network (CNN) have been introduced to relation extraction for feature extraction; the quality of the neural network model directly affects the final quality of relation extraction. However, the traditional convolution network uses a fixed convolution kernel, so it is difficult to choose the size of the convolution kernel dynamically, which results in networks with weak representation ability. To address this, a novel CNN is designed with selective kernel networks and multigranularity. In the process of feature extraction, the model can adaptively select the size of the convolution kernel, that is, give more weight to the appropriate convolution kernel. It is then combined with multigranularity convolution to obtain richer semantic information. Finally, a new pooling method is designed to obtain more comprehensive information and improve model performance. Experimental results indicate that this method is effective without excessively deep network layers, and it also outperforms several competitive baseline methods.
Relation extraction is one of the core tasks of text mining and information extraction [1]: identifying two entities in a sentence, finding the relation between the two entities and finally expressing the extracted relation in triple form [2], <entity 1, relationship, entity 2>. For example, given the sentence 'Steve Jobs was the cofounder and CEO of Apple', a relation classifier aims to predict the relation 'founder'.
In recent years, much effort has been invested in relation extraction [3]. With the rapid increase in data volume, manual annotation has become quite time-consuming. To address this issue, distant supervision has been proposed. Mintz et al. [4] first applied distant supervision to the relation extraction task, using the relations in a knowledge base to label the training corpus. Meanwhile, many neural networks have been proposed to extract sentence features in the task of distantly supervised relation extraction. The convolution neural network (CNN) [5-7] is one of the first deep models to be applied to relation extraction.
Zeng et al. [8] proposed the piecewise convolutional neural network (PCNN) with multi-instance learning, which showed a significant improvement in effect. The sentence is divided into three segments according to the two given entities, and a piecewise max pooling layer is devised. Huang et al. [9] proposed the CNN with residual learning (ResCNN) for distantly supervised relation extraction, showing that deeper convolutional models help extract signals from noisy inputs; the method allows the network to select features not affected by noise when classifying the relation. However, the CNNs in these networks use a fixed filter size. These networks cannot choose the convolution kernel size adaptively or explore combinations of convolution kernels of different sizes, and the influence of different filter sizes on networks and results is ignored. Moreover, max pooling and piecewise max pooling [8] extract only the maximum value of the convolution features; they ignore strong features that appear frequently, and they do not capture overall information.
We propose a selective kernel and multigranularity neural network (SK-MGNet) based on selective kernel networks (SKNets) [10] and multigranularity [11,12] for distantly supervised relation extraction. Using the SK-MGNet, the model can adaptively adjust the filter size during relation extraction and combine filters of different sizes to obtain information at different granularities. In addition, we propose multipooling to capture more comprehensive information and the strong features that appear frequently. We evaluate on the New York Times-Freebase data set [13], and the results show that the model does not need an overly deep network to obtain relatively better results and improve relation extraction performance. Our contributions are threefold:
(i) We are the first to consider SKNets for weakly supervised relation extraction.
(ii) We use the idea of multigranularity to combine convolution kernels of different granularities, and the model can adaptively select the size of the convolution kernels.
(iii) We propose a new pooling method called multipooling to obtain more information and better results.
Relation extraction is basic, important work in natural language processing and can be applied in text summarisation, intelligent question answering, machine translation, knowledge mapping and other fields [14]. Relation extraction methods can be divided into traditional methods and deep learning methods. The traditional methods can be categorised as supervised, semi-supervised, weakly supervised and unsupervised. However, the traditional methods suffered from feature extraction error propagation, so relation extraction based on deep learning developed rapidly and achieved good results. In traditional supervised relation extraction, Zhou et al. [15] used a support vector machine as a classifier to study the influence of vocabulary, syntax and semantic features on entity semantic relation extraction. The training data of the traditional supervised method is generated by manual annotation, with each piece of annotation made at the statement level. However, manually annotating a large amount of training data requires tremendous manpower and material resources and has low efficiency. Thus, semi-supervised, weakly supervised and unsupervised relation extraction methods were proposed. Unlike fully supervised relation extraction, traditional semi-supervised relation extraction uses only a small amount of tagged data. Unsupervised relation extraction was first proposed by Hasegawa et al. [16] at the Meeting of the Association for Computational Linguistics. However, the traditional methods just mentioned all rely on natural language processing tools, which easily introduce errors and also suffer from feature extraction error propagation, both of which affect the effectiveness of relation extraction.

To solve these problems, researchers began to apply deep learning to relation extraction. According to the level of data set annotation, relation extraction based on deep learning can be divided into supervised and distantly supervised relation extraction. Supervised relation extraction based on deep learning was first proposed by Liu et al. [17], who used a CNN for the relation classification task for the first time. Zeng et al. [8] used the classic CNN structure and introduced location information. Then distant supervision [4] was proposed to alleviate the problem of increasing data volume; it can also reduce dependence on a manually annotated corpus and enhance domain mobility. Zeng et al. [8] also introduced multi-instance learning and proposed the PCNN, in which the sentence is divided into three segments with the two entities as boundaries and the max pooling result of each segment is extracted after convolution. This method greatly improved effectiveness. Lin et al. [18] introduced attention over instance learning (ATT) into the distantly supervised relation extraction task, and the attention mechanism has since become important for distantly supervised relation extraction. Qin et al. [19] considered dynamic selection strategies for robust distant supervision.
In the 1960s, Hubel and Wiesel found that a unique network structure can effectively reduce the complexity of the feedback neural network when they studied the neurons for local sensitivity and direction selection in the cat cortex; the CNN was proposed on this basis. The classic structure of the CNN includes an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. Generally speaking, the convolution and pooling layers are the main targets of improvement for researchers. In recent years, the CNN and its variants have achieved great success in the vision field [20]. The first work on the CNN was the LeNet-5 model proposed by LeCun [6]. In this model, convolution and pooling layers were alternately connected to propagate the input image forward, and the probability distribution of the output was finally obtained through the fully connected layer. The structure is a prototype of the CNN widely used at present. Krizhevsky et al. [21] proposed AlexNet, which had a five-layer convolutional network, about 650,000 neurons and 60 million trainable parameters. He et al. [22] proposed ResNet, which used residual networks and identity shortcut connections; its purpose was to solve the degradation problem of deep networks. Szegedy et al. [23-25] proposed a basic CNN module called Inception. The traditional network is basically a stack of convolution layers, and each layer uses only a single filter size. In fact, multiple filters of different sizes can be used in the same layer to obtain features at different scales, and combining these features is better than using a single filter. Zhang et al. proposed ShuffleNet [26], whose purpose was to solve the problem that the features of different groups fuse only at the last moment. Hu et al. proposed SENet [27], a model built on the observation that features produced by different channels may deserve different weights when combined. The squeeze-and-excitation block proposed in this model starts from the relationship between feature channels and considers that the importance of each channel is not the same; the model aims to find the interdependence between feature channels, that is, to determine the importance of each feature channel automatically by learning, and finally enhances useful features and suppresses features that are not useful for the current task.

In addition to the above research on improving the convolution layer, the pooling layer has also been of interest to researchers. Common pooling methods include max pooling, average pooling and so on. In response to the problems of max pooling, K-max and Chunk-max pooling were proposed. Many visual methods have been borrowed in natural language processing; for example, Conneau et al. [28] used the VDCNN for text classification, and Huang et al. [9] proposed a deep residual network model to solve the problem of increasing noise in deep networks.
The SKNet [10], proposed at CVPR 2019, differs from other CNNs in that it can dynamically select the convolution kernel size and adaptively adjust the size of the receptive field according to multiple scales of the input information. It is mainly divided into three operations: split, fuse and select. Split refers to the complete convolution of input vectors with different convolution kernel sizes. Fuse aims to obtain a global representation for selection weights and generates a gate mechanism controlling the flow of information into the different branches of the next convolution layer. Select aims to select different information sizes via soft attention between channels.
Zadeh [29] proposed the concept of granular computing for the first time in 1997. When people deal with complex information, they usually divide it into several simple blocks according to its characteristics and behaviour, and every divided block is regarded as a granule. According to Zadeh, the concept of information granules exists in many fields, taking different forms in different fields. From the view of artificial intelligence, granular computing is a structured solution model that simulates human thinking to solve large-scale complex problems. Starting from the needs of practical problems [30], people observe and analyse the same problem at very different granularities. Multigranularity can also be introduced to solve problems in the natural language field. For example, Rei et al. [31] combined supervision objectives of different granularities to better learn overall language representations and composition functions.
In this section, we describe the novel SK-MGNet architecture and a novel pooling scheme called multipooling for distantly supervised relation extraction. Figure 1 shows the model architecture, which uses a split of two convolution kernels. Specifically, it has three modules: vector representation, the SK-MGNet and multipooling. As is typical when using neural networks, we transform word tokens into low-dimensional vectors. The vector representation includes word and position embeddings. The SK-MGNet then combines the concepts of SKNets and multigranularity. The module contains the selective kernel and multigranularity fusion layers. The selective kernel layer can adjust the size of its convolution kernels according to the multi-scale adaptation of the input information. The multigranularity fusion layer can fuse the features acquired by the selective kernel layer and the features extracted by different filter sizes. The SK-MGNet can be combined at will, and other SK-MGNets can easily be introduced to increase network depth; this is one of the advantages of the model. Multipooling is a new pooling method for relation extraction that we introduce; it can extract more strong features and retain location information.
In this case, we select between only two convolution kernel sizes, but it is easy to expand this to multiple branches.
In relation extraction, natural language data that cannot be directly processed by a computer must be transformed into vector representations. This includes representing words as word vectors and representing word position information as position vectors.
Word embedding: word embedding is a distributed representation of words that aims to map words to a low-dimensional vector space. Word embeddings can be obtained from a pretrained vector matrix, V ∈ R^{|V|×dw}, where |V| is the size of the matrix V, that is, the size of the vocabulary, and dw is the dimension of the word vector.
Position embedding: position embedding maps position information to a low-dimensional vector. For this embedding, we use the relative position feature, that is, the relative distance between each word and the two entities in a sentence. The two relative distances are then mapped into two randomly initialised real-valued vectors of dimension dp.
Then, the word vector and position vectors of each word in a sentence are concatenated to obtain the vector representation of the sentence, S ∈ R^{l×d}, where l is the sentence length, d is the dimension of the sentence's vector representation and d = dw + dp × 2. Specifically, there are two position embeddings, one for entity 1 and the other for entity 2. Finally, we concatenate the word embeddings and position embeddings.
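As an illustration, the following is a minimal sketch of the vector representation module. It assumes PyTorch (the authors train with TensorFlow, whose code is not shown in the paper), and the vocabulary size, the maximum relative distance and the position dimension dp = 5 are hypothetical; only dw = 50 and l = 100 come from the experimental settings.

```python
# Minimal sketch (assumed PyTorch re-implementation, not the authors' code).
import torch
import torch.nn as nn

l, dw, dp = 100, 50, 5              # l and dw from the experiments; dp assumed
vocab_size, max_dist = 20000, 100   # hypothetical vocabulary and distance range

word_emb = nn.Embedding(vocab_size, dw)        # in practice loaded from pretrained V
pos_emb1 = nn.Embedding(2 * max_dist + 1, dp)  # relative distance to entity 1
pos_emb2 = nn.Embedding(2 * max_dist + 1, dp)  # relative distance to entity 2

tokens = torch.randint(0, vocab_size, (1, l))    # dummy token ids
d1 = torch.randint(0, 2 * max_dist + 1, (1, l))  # shifted distances to entity 1
d2 = torch.randint(0, 2 * max_dist + 1, (1, l))  # shifted distances to entity 2

# S has shape (batch, l, d) with d = dw + 2 * dp, matching S in R^{l x d}
S = torch.cat([word_emb(tokens), pos_emb1(d1), pos_emb2(d2)], dim=-1)
```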
To improve the network’s representation ability and integrate the information extracted by multiple filters,we propose a novel architecture that combines the ideas of SKNets and multigranularity for distantly supervised relation extraction.
3.2.1|Convolution
In neural networks,the convolution operation is a common method of obtaining local features.Convolution kernels of different scales can extract different local features obtained from multiple angles.More appropriate features can improve relation extraction task quality.To select the convolution kernel adaptively,we need to design the convolution kernel at different scales and then use several convolution kernels at different scales for the convolution operation.
FIGURE 1 The structure of our model: the selective kernel and multigranularity neural network (SK-MGNet) and multipooling. 'Other SK-MGNet' means that the SK-MGNet comprising the selective kernel and multigranularity fusion layers can be stacked in multiple layers
The convolution operation involves a filter w ∈ R^{m×d}, where m is the filter size. In this case, we use two convolution branches with sizes m = {m1, m2}, mj ≤ l, where l is the sentence length. We regard sentence S as a sequence, {x1, x2, …, xl}, xi ∈ R^d, and let x_{i:i+mj−1} refer to the concatenation of the words xi, xi+1, …, x_{i+mj−1}. The convolution operation then extracts features from x_{i:i+mj−1} and can be formulated as

$$c_{ji} = f\left(w_j \cdot x_{i:i+m_j-1} + b\right),$$

where b is a bias term and f is a nonlinear function, here the ReLU activation. Finally, features are obtained from the filters of different sizes, C = {c1, c2}. Under the assumption of using n filters, c1 = {c11, c12, …, c1n} and c2 = {c21, c22, …, c2n}.
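A hedged sketch of this two-branch convolution, continuing the PyTorch representation sketch above: the number of filters n = 230 and the 'same' padding (which keeps the feature maps at length l, consistent with c1, c2 ∈ R^{l×n} later in the text) are assumptions.

```python
# Two-branch convolution sketch (Section 3.2.1); n and padding are assumed.
import torch
import torch.nn as nn

n, d, l = 230, 60, 100                             # n assumed; d = dw + 2*dp
conv3 = nn.Conv1d(d, n, kernel_size=3, padding=1)  # branch with m1 = 3
conv5 = nn.Conv1d(d, n, kernel_size=5, padding=2)  # branch with m2 = 5

S = torch.randn(1, l, d)      # stands in for the sentence representation
x = S.transpose(1, 2)         # Conv1d expects (batch, channels, length)
c1 = torch.relu(conv3(x))     # feature maps of shape (1, n, l)
c2 = torch.relu(conv5(x))
```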
3.2.2|Selective kernel layer
After obtaining the features from filters of different sizes, the basic idea of the fuse and select processes [10] is to use gates to control the information flows from the multiple branches into the next layer of neurons, so that the neurons can adjust the size of their convolution kernels according to the multi-scale adaptation of the input information.
The fuse process is as follows. To adaptively select the kernels, we need to obtain all the feature information, so we first fuse the features from the two branches by element-wise summation:

$$U = c_1 + c_2,$$

where c1, c2 ∈ R^{l×n} and n is the number of filters.
To generate the channel statistics Q ∈ R^n, global average pooling is used to embed the global information. Specifically, the channel statistics are calculated by shrinking U through the spatial dimension 1 × l:

$$Q_t = f_{gp}(U_t) = \frac{1}{l}\sum_{i=1}^{l} U_t(i), \qquad t = 1, \ldots, n,$$

where f_gp denotes global average pooling, n is the number of filters and l is the sentence length.
After that, in order to select features accurately and adaptively, a compact feature Z ∈ R^n is created through a simple fully connected layer:

$$Z = f_{fc}(Q) = \delta(WQ),$$

where f_fc denotes the fully connected layer with weight matrix W and δ is the ReLU function.
The select step then uses cross-channel soft attention to adaptively select information at different spatial scales, applying a softmax operation on the channelwise digits:

$$a_t = \frac{e^{A_t Z}}{e^{A_t Z} + e^{B_t Z}}, \qquad b_t = \frac{e^{B_t Z}}{e^{A_t Z} + e^{B_t Z}},$$

where a_t and b_t denote the respective soft attention values for c1 and c2, and A_t, B_t are the t-th rows of the learned weight matrices A and B.
After obtaining the weight coefficients, we apply the attention weights to the output of each kernel branch to obtain the final feature. In effect, we use the information summarised across the multiple scales to guide how representational capacity is allocated to the kernel we focus on:

$$T_t = a_t \cdot c_{1t} + b_t \cdot c_{2t}, \qquad a_t + b_t = 1.$$

So far, we have obtained the output of the selective kernel layer, T = [T1, T2, …, Tn], Tt ∈ R^l, whose convolution kernel size is adjusted adaptively. Note that the convolution kernels {3, 5} are used in this example; this can be extended to a combination of filters of other sizes.
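Putting the split, fuse and select steps together, the following is a minimal sketch of the selective kernel layer under the same PyTorch assumption as above; the fully connected layers are given an n-to-n shape to match Z ∈ R^n as stated in the text.

```python
# Selective kernel layer sketch (split/fuse/select), assumed PyTorch form.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveKernel(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.fc = nn.Linear(n, n)      # Z = delta(W Q), Z in R^n
        self.fc_a = nn.Linear(n, n)    # produces the logits A Z for branch c1
        self.fc_b = nn.Linear(n, n)    # produces the logits B Z for branch c2

    def forward(self, c1, c2):         # c1, c2: (batch, n, l)
        U = c1 + c2                    # fuse: element-wise summation
        Q = U.mean(dim=-1)             # global average pooling over length l
        Z = F.relu(self.fc(Q))         # compact feature for selection
        logits = torch.stack([self.fc_a(Z), self.fc_b(Z)], dim=0)
        attn = torch.softmax(logits, dim=0)   # so that a_t + b_t = 1
        a = attn[0].unsqueeze(-1)             # (batch, n, 1), broadcast over l
        b = attn[1].unsqueeze(-1)
        return a * c1 + b * c2                # T: (batch, n, l)
```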
3.2.3|Multigranularity fusion layer
In fact, in the cognition and processing of real-world problems, human beings often adopt strategies that observe and analyse the same problem from different levels, and granularity is used to describe the objects of different sizes extracted at each level.
We can use multiple filter sizes to obtain features at different scales and then combine these features. Therefore, to improve network representation ability and obtain more information, we draw on the idea of multigranularity to fuse the features acquired by the selective kernel layer with the features extracted by the different filter sizes:

$$G = T \oplus c_1 \oplus c_2,$$

where ⊕ is the concatenation operator.
This multigranularity convolution combination can capture features from different angles and highlight strong features many times.
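Continuing the sketches above, the fusion layer then reduces to a channelwise concatenation (this form is inferred from the concatenation operator in the text, not taken verbatim from the paper):

```python
# Multigranularity fusion sketch: concatenate the adaptively selected features
# T with the raw branch features c1 and c2 along the channel dimension.
sk = SelectiveKernel(n)
T = sk(c1, c2)                     # (1, n, l)
G = torch.cat([T, c1, c2], dim=1)  # (1, 3n, l) fused multigranularity maps
```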
In this architecture, we use only two layers of the SK-MGNet; a deeper network is not required to achieve better results. This reduces the calculation requirements and better emphasises appropriate feature extraction so that strong features appear many times. The SK-MGNet can be combined at will or embedded in other network structures.
3.2.4|Multipooling
Max pooling is the most commonly used pooling method. It takes the highest score as the reserved value of the pooling layer and discards all other feature values. Taking the maximum means that only the strongest feature is retained while weak features are discarded. However, the shortcomings of max pooling are clearly recognised. Feature location information is completely lost in this step, yet location information is often especially important in the relation extraction task. Another disadvantage is that some strong features appear frequently, but max pooling extracts only the maximum value and thus ignores the other strong features.
Therefore, we propose a new pooling method dubbed multipooling for relation extraction. As shown in Figure 2, it takes the two highest values of each feature map and the mean pooling result of the feature map and then concatenates them:

$$h_i = \mathrm{max2}(G_i) \oplus \mathrm{ave}(G_i),$$

In the equation above, max2 means extracting the two highest values of each feature map G_i, and ave means mean pooling.
Incorporating this, the SK-MGNet highlights the strong features many times, and multipooling obtains more strong features and retains location information. Assuming n filters, the output after multipooling is H = [h1, h2, …], where each h_i ∈ R^3.
After we obtainH,these features are passed to a fully connected softmax layer to predict the final relations.
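The sketch below illustrates multipooling followed by the classifier under the same PyTorch assumption; the output size of 53 classes follows from the data set's 52 actual relations plus NA (Section 4), while the feature-map count is carried over from the earlier assumed hyperparameters.

```python
# Multipooling sketch (Figure 2): keep the two highest values plus the mean of
# each feature map, then classify with a fully connected softmax layer.
import torch
import torch.nn as nn

def multipooling(G):                       # G: (batch, channels, l)
    top2, _ = G.topk(2, dim=-1)            # two highest values per feature map
    ave = G.mean(dim=-1, keepdim=True)     # mean pooling per feature map
    return torch.cat([top2, ave], dim=-1)  # (batch, channels, 3)

G = torch.randn(1, 3 * 230, 100)           # fused feature maps from the layer above
H = multipooling(G).flatten(1)             # (batch, channels * 3)
classifier = nn.Linear(H.size(1), 53)      # 52 relations + NA
probs = torch.softmax(classifier(H), dim=-1)
```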
Our experiments are intended to provide evidence that the SK-MGNet and multipooling lead to increased performance. We first introduce the experimental settings and data set and then compare our method's performance with several baselines.
FIGURE 2 Multipooling.Max2 means extracting the two highest values of each feature map,so it can retain part of the location information and obtain more efficient features.Mean pooling can reflect global and location information
We use the word embeddings of Lin et al. [18], which were trained on the New York Times-Freebase corpus. The dimension of the word embeddings is 50. The input text is padded to a fixed size of 100. The TensorFlow stochastic gradient descent optimiser is used for training. The batch size is 160, the initial learning rate is 0.1, the split is 2, and the convolution kernel sizes are 3 and 5. Table 1 lists all of our hyperparameters.
To measure the effectiveness of our model on relation extraction, we use the data set released by Riedel et al. [13]. This data set is generated by aligning entity pairs from Freebase with the New York Times corpus. The data set contains 52 actual relations and a special relation, NA, which indicates that there is no relation between the two entity mentions. The training data includes 522,611 sentences, 281,270 entity pairs and 18,252 relational facts, and the test data includes 172,448 sentences, 96,678 entity pairs and 1950 relational facts. We report both the aggregate precision/recall curves and the Precision@N (P@N). In addition, because the paper contains many abbreviations, we provide abbreviations and annotations in Table A1.
To demonstrate the effect of the SK-MGNet, we compare our method with the CNN, PCNN and ResCNN, without utilising ATT [18] or multipooling. In Figure 3, we observe that the performance of the SK-MGNet is significantly better than that achieved by Zeng et al. [8] and Huang et al. [9]. Comparing our method with the CNN and its variants (PCNN, ResCNN), we conclude that the SK-MGNet can obtain richer semantic information and is better at extracting sentence features. In addition, our network, with fewer layers, achieves performance comparable to that of other state-of-the-art methods. In the experiment, we use just two layers of the SK-MGNet to outperform other deeper networks (e.g. the ResCNN for relation extraction proposed by Huang et al. [9] was constructed with nine convolutional layers). However, the number of layers of the SK-MGNet can easily be changed and extended to deeper configurations.
TABLE 1 Parameter settings
FIGURE 3 Comparing the selective kernel and multigranularity neural network with the convolution neural network (CNN), piecewise CNN and CNN with residual learning
To demonstrate that the SK-MGNet obtains better performance by adaptively selecting filters, we compare our network with networks that concatenate different filters without a selective kernel. The filter-size combinations for simple concatenation include {3,5}, {3,3,5} and {3,5,5}.
As shown in Figure 4, the performance of our method is much better than that of simply concatenating filters of different sizes. This means that although different filters can extract different local features, simply concatenating them cannot significantly improve performance, nor can it highlight more useful information and the strong features that appear frequently. The selective kernel layer derives more appropriate features by adaptively selecting the filter sizes.
In Figure 5, we compare multipooling with max pooling and piecewise max pooling; multipooling achieves better results than both. From the curves, we observe that the CNN with multipooling is better than the CNN with max pooling and the PCNN, and the performance of the combination of the SK-MGNet and multipooling (SK-MGNet-M) is much better than that of combining the SK-MGNet and max pooling. This result demonstrates that multipooling is beneficial and can capture more comprehensive information and frequently appearing strong features for relation extraction.
Several models from previous work and SK-MGNet-M with ATT (SK-MGNet-M+ATT) are compared in Figure 6. From the curves, we observe that combining the SK-MGNet-M with sentence-level attention can improve model performance under distant supervision, and SK-MGNet-M+ATT outperforms all previous baseline methods. In Table 2, we compare our models' performance to state-of-the-art baselines and show that SK-MGNet-M+ATT outperforms all other models. Specifically, the baselines include the PCNN with ATT (PCNN+ATT), the bidirectional recurrent neural network (RNN) with ATT (BiRNN+ATT), extracting information from noisy data through residual networks and identity shortcut connections with ATT (ResCNN+ATT), using adversarial training to improve the PCNN with ATT (PCNN+ATT+Adv) [32] and the CNN with reinforcement learning to improve relation extraction (CNN+RL) [33]. The result verifies the effectiveness of our proposed SK-MGNet and multipooling methods for distantly supervised relation extraction.
FIGURE 4 Comparing the selective kernel and multigranularity neural network with the network concatenated by different filters without a selective kernel
FIGURE 5 Comparing multipooling with max pooling and piecewise max pooling
FIGURE 6 Comparing the selective kernel and multigranularity neural network and multipooling+attention over instance learning with previous work
TABLE 2 Precision@N for relation extraction with different models
We develop a novel model by combining the SKNet and multigranularity for distantly supervised relation extraction.The model can adaptively select the size of the convolution kernel and obtain more features from convolutions of different granularities,highlighting strong features.The experimental results show that the accuracy of the proposed model is better than that of comparative models in the literature.
ACKNOWLEDGEMENTS
This work is supported by the National Key Research and Development Program of China (Program No. 2018YFC0832100, Project No. 2018YFC0832102) and the National Natural Science Foundation of China (No. 61876201).
APPENDICES
Because the paper contains many abbreviations, we provide an abbreviation and annotation table (Table A1) for illustration.
TABLE A1 Abbreviation and annotation