Yuling Xing | Jia Zhu
South China Normal University, Guangzhou, China
Abstract Action recognition based on 3D skeleton data has attracted much attention due to its wide range of applications, and it is one of the most popular research topics in computer vision. 3D skeleton data is an effective representation of motion dynamics and is not easily affected by lighting, scene variation, etc. Previous research on action recognition has mainly focused on video or RGB data. In recent years, the advantages of combining skeleton data with deep learning have been gradually demonstrated, and many impressive methods have been proposed, especially GCN-based methods. In this survey, we first introduce the development of 3D skeleton-based action recognition and the classification of graph convolutional networks, then introduce the commonly used NTU RGB+D and NTU RGB+D 120 datasets. Finally, we provide a detailed review of existing variants of three mainstream deep learning technologies and compare their performance along three dimensions. To the best of our knowledge, this is the first survey to integrate research on GCN-based methods and their various evolutionary variants. We compare existing variants for the action recognition task from different perspectives, describe a generic framework, summarize state-of-the-art practices, and explore the emerging trends of this topic.
3D Skeleton-based Action Recognition
Human action recognition based on skeletons is a very popular research topic in computer vision, and it has been widely used in video understanding, video monitoring, human-computer interaction, robot vision, autonomous driving, virtual reality, etc. In the past few years, with the rapid development of 3D skeleton data acquisition, research on action recognition based on skeleton data has proliferated. Skeleton data consists of the 3D coordinates of multiple skeleton joints over space and time, which is an effective representation of motion dynamics. It can be easily collected by low-cost depth sensors, and it can also be obtained directly from 2D images using video-based pose estimation algorithms, so it has attracted extensive attention. Compared with RGB and optical flow, skeleton data has the advantage of high computational efficiency because the amount of data is smaller. In addition, skeleton data is robust to illumination changes and background noise, and is invariant to camera views. Dynamic human skeletons usually carry rich and important information, which is complementary to appearance and optical flow.
One of the challenges of action recognition is how to correctly model spatial-temporal information. On the one hand, in many previous 3D action recognition methods, the bag-of-words model tends to overemphasize spatial information. On the other hand, some methods based on the Hidden Markov Model (HMM) or recurrent neural networks (RNNs) may overemphasize temporal information, concentrating on designing hand-crafted feature descriptors [1,2] to model temporal dynamics in sequences. For example, Wang et al. [1] extracted the 3D joint positions and the local occupancy pattern, and then processed them with the Fourier Temporal Pyramid (FTP) to represent the temporal dynamics of the actions. Vemulapalli et al. [3] employed Dynamic Time Warping (DTW) and FTP to handle issues such as rate variations, temporal misalignment, noise, etc. However, hand-crafted features are usually shallow and dataset-dependent. In recent years, deep learning has developed continuously and has demonstrated impressive performance in most computer vision tasks. Consequently, a great deal of effort has been directed to deep learning methods based on data-driven features (Figure 1 shows the general framework).
Deep learning architectures can learn hierarchical representations to perform pattern recognition and show impressive results in many pattern recognition tasks. For instance, recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) have been employed to model skeleton data for 3D action recognition [4–7]. Although RNN-based approaches present excellent results in 3D action recognition tasks due to their power in modeling temporal sequences, such structures lack the ability to efficiently learn the spatial relations between the skeleton joints [8]. To take advantage of the spatial relations, a hierarchical structure was proposed by Du et al. [9]. The authors represent each skeleton sequence as a 2D array, in which the temporal dynamics of the sequence are encoded as variations in columns and the spatial structure of each frame is represented as rows. The representation is then fed to convolutional neural networks (CNNs), which have the natural ability to learn structural information from 2D arrays. This type of representation is very compact, encoding the entire video sequence in a single image. However, representing the skeleton data as a vector sequence or a 2D array can hardly capture the deep correlations between body joints, so abundant and useful motion information is lost. To better capture the joint dependencies, some methods [10–12] construct a skeleton graph whose vertices are joints and whose edges are bones, and apply graph convolutional networks (GCNs) to extract correlated features. Table 1 compares traditional methods with the three deep network frameworks.
Graph Convolutional Network
There is an increasing interest in generalizing convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and spatial approaches. 1) The spectral perspective [13]. Spectral approaches work with a spectral representation of the graph, and the convolution is mainly realized by the Graph Fourier Transform. In brief, the Laplacian matrix of a graph is used to derive its Laplacian operator in the frequency domain, and the formula of graph convolution is then derived by analogy with convolution in Euclidean space expressed in the frequency domain. 2) The spatial perspective [14,15]. The core of spatial convolution is to aggregate the information of neighbour nodes. The convolution operation is defined directly on the connections of each node, which is more similar to convolution in a traditional convolutional neural network. Bruna et al. [14] first proposed a spectral graph-based extension of convolutional networks to graphs. In follow-up work, Defferrard et al. [16] defined graph convolutions using Chebyshev polynomials, which removes the need to compute the eigenvectors of the Laplacian. GraphSAGE [17] replaced the full graph Laplacian with learnable aggregation functions, which can generate embeddings for unseen nodes. It also used neighbor sampling to alleviate receptive field expansion. Chen et al. [18] proposed FastGCN, an efficient variant of GCN based on importance sampling. Instead of sampling neighbors for each node, FastGCN directly sampled the receptive field for each layer. Liao et al. [19] exploited multi-scale information by raising S to a higher order. GCNs have developed rapidly in the past two years and have a wide range of applications, such as text classification [17,20], relationship extraction [21], image classification [22], knowledge graph (KG) alignment [24], and social networking [25].
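As a rough illustration of the spatial perspective, the sketch below aggregates each node's neighbours with a symmetrically normalized adjacency matrix and then applies a linear projection. The normalization D^{-1/2}(A + I)D^{-1/2} is the commonly used first-order formulation and is an assumption here, not something taken from the survey; the function names and the toy chain graph are hypothetical.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(features, adj, weight):
    """One spatial graph convolution: aggregate neighbours, then project."""
    return normalized_adjacency(adj) @ features @ weight

# Toy graph (hypothetical): a 3-node chain 0 - 1 - 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.eye(3)          # one-hot node features
weight = np.ones((3, 2))      # stand-in for a learned projection
out = graph_conv(features, adj, weight)
```

With one-hot features and an all-ones weight, each output row is simply the sum of the normalized weights of the node and its neighbours, which makes the aggregation step easy to inspect.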
FIGURE 1 The general framework of skeleton-based action recognition using deep learning methods
TABLE 1 The comparison of traditional methods and three deep network framework methods
The rest of this survey is organized as follows. In Sec. 2, we introduce the most widely used indoor action recognition datasets at present, the NTU RGB+D 60 and NTU RGB+D 120 datasets. In Sec. 3, we introduce in detail three major network architectures for action recognition based on deep learning and skeleton data. For each network architecture, we present variants that try to relieve its limitations, and we additionally introduce several novel methods proposed recently. In Sec. 4, we compare the performance of the mentioned methods from three perspectives. Finally, we give a conclusion and discussion in Sec. 5, including the contributions of this review, challenges and emerging trends.
FIGURE 2 The joint labels of the NTU RGB+D 120 dataset
NTU RGB+D 60 [4] is currently the largest and most widely used indoor action recognition dataset, and it contains 56,880 action clips in 60 action classes. Each sample is provided in four modalities: RGB video, depth map sequence, 3D skeleton data and infrared (IR) video. Here, we only use the skeleton data. The data are captured by three Microsoft Kinect V2 cameras at 30 fps, set at the same height but aimed from different horizontal angles: -45°, 0°, +45°. Each camera provides the 3D locations of 25 joints, labeled as shown in Figure 2. The actions are performed by 40 volunteers aged from 10 to 35. For evaluating models, two standard evaluation protocols are recommended: Cross-Subject (CS) and Cross-View (CV). In Cross-Subject, 40,320 samples performed by 20 subjects form the training set, and the remaining 16,560 samples form the test set. Cross-View assigns data according to camera views: the training clips come from camera views 2 and 3, and the evaluation clips all come from camera view 1, so the training and test sets have 37,920 and 18,960 samples, respectively.
NTU RGB+D 120 [26] is the most recent large-scale 3D action recognition dataset. It was captured under various environmental conditions and consists of 114,480 RGB+D video samples captured using the Microsoft Kinect sensor. As in NTU RGB+D 60 [4], the dataset provides RGB frames, depth maps, infrared sequences and skeleton joints. It is composed of 120 action categories performed by 106 distinct subjects covering a wide range of ages. There are two evaluation protocols: Cross-Subject, which splits the 106 subjects into training and testing groups; and Cross-Setup, which uses samples with even setup IDs for training (16 setups) and samples with odd setup IDs for testing (16 setups). Performance is evaluated by computing the average recognition accuracy across all classes (Figures 2 and 3 show the joint labels and samples of the NTU RGB+D 120 dataset).
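The NTU RGB+D 60 split sizes quoted above can be checked with a few lines of arithmetic, and the Cross-View rule is simple enough to sketch. The `(sample_id, camera_id)` tuple layout and the function name are hypothetical and chosen only for illustration; they are not the dataset's actual file format.

```python
# Sample counts quoted in the text for NTU RGB+D 60
TOTAL = 56_880
CS_TRAIN, CS_TEST = 40_320, 16_560   # Cross-Subject
CV_TRAIN, CV_TEST = 37_920, 18_960   # Cross-View

def cross_view_split(samples):
    """Split (sample_id, camera_id) pairs: cameras 2 and 3 train, camera 1 tests."""
    train = [sid for sid, cam in samples if cam in (2, 3)]
    test = [sid for sid, cam in samples if cam == 1]
    return train, test

# Hypothetical toy sample list
toy = [("a", 1), ("b", 2), ("c", 3), ("d", 1)]
train_ids, test_ids = cross_view_split(toy)
```

Both protocols partition the same 56,880 clips, and the Cross-View test set is exactly one third of the data, which is consistent with each of the three cameras contributing an equal share.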
FIGURE 3 Samples of the NTU RGB+D 120 dataset
Existing reviews have compared previous action recognition technologies from the perspective of RGB data or network architecture, as well as from the perspective of hand-crafted features and deep features. However, there is no survey of the latest GCN-based methods, which have developed rapidly in the past two years. Therefore, in this survey we not only give an exhaustive discussion and comparison of RNN-based, CNN-based and GCN-based methods, but also focus on comparing GCN-based methods and their variants, which have brought rapid improvements in the past two years. For these three architectures, we introduce how later methods improve on the shortcomings of earlier ones. Additionally, we introduce several novel methods proposed recently and give our own opinions.
Recently, recurrent neural networks (RNNs), which can handle sequential data of variable length [27,28], have shown their strength in language modelling [29], video analysis [30–32], and RGB-based action recognition [33]. On the one hand, the application of these networks has also shown promising but very limited achievements in skeleton-based action recognition [4,34]. To address the well-known problems of vanishing and exploding gradients, the independently recurrent neural network (IndRNN) [34] regulates gradient backpropagation through time and allows the network to learn long-term dependencies. In addition, neurons in the same layer are independent of each other and connected across layers, which helps explain the behaviour of neurons in each layer. On the other hand, the key factors of this task lie in two aspects, the intra-frame spatial representation of joints and the inter-frame representation of temporal evolution, whereas these networks only paid attention to the temporal dynamic information and ignored the strong dependencies among the skeletal joints in the spatial domain. To model the dynamics and dependency relations in both the temporal and spatial domains, Liu et al. [35] proposed a spatio-temporal LSTM network (ST-LSTM) that includes an extra trust gate. Since a tree structure can better represent the adjacency configuration of the joints in the skeletal data, they proposed a traversal procedure that follows the tree structure (Figure 4) to exploit the kinematic relationship among the body joints and better model spatial dependencies.
The trust gate indicates when and how to update, forget, or remember the internal memory content that represents the long-term context information. The methods mentioned above generally use relative coordinate systems that depend on some joints, and model only the long-term dependency while excluding short-term and medium-term dependencies. Lee et al. [36] transformed a human skeleton into the human cognitive coordinate system by using the Gram-Schmidt process to obtain robustness to scale, rotation and translation, and then extracted salient motion features from the transformed skeletons instead of the raw skeletons (Figure 5). The network considered not only long-term dependencies but also short-term and medium-term dependencies.
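The Gram-Schmidt step mentioned above can be sketched as follows. The survey does not state which body vectors define the coordinate axes, so the `axes` argument and the `to_body_frame` helper are hypothetical; the sketch only shows how an orthonormal frame built by Gram-Schmidt makes the resulting coordinates independent of where the skeleton sits in the camera frame.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize the rows of `vectors` (assumed linearly independent)."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        for b in basis:
            v = v - np.dot(v, b) * b        # remove the component along b
        basis.append(v / np.linalg.norm(v))
    return np.stack(basis)

def to_body_frame(joints, origin, axes):
    """Express joints of shape (J, 3) in the orthonormal frame built from `axes`."""
    basis = gram_schmidt(axes)
    return (joints - origin) @ basis.T      # translate, then project on each axis

# Hypothetical example: three non-orthogonal direction vectors and two joints
axes = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
joints = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
origin = np.array([0.2, 0.1, 0.0])
shift = np.array([5.0, -3.0, 1.0])

local = to_body_frame(joints, origin, axes)
local_shifted = to_body_frame(joints + shift, origin + shift, axes)
```

Because the origin moves together with the skeleton, `local` and `local_shifted` coincide, which is the translation robustness the paragraph refers to.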
FIGURE 4 The process of transforming the human skeleton data to a tree structure. [35]
FIGURE 5 The process of skeleton data transformation and feature representation. [36]
Unlike others, Liu et al. [35] noticed that few works consider the interference of noisy data. When all joints are taken as input, the noise of irrelevant joints reduces performance, so more attention should be paid to the informative ones. However, the original LSTM network has no explicit attention ability. Consequently, Liu et al. [35] proposed the Global Context-Aware Attention LSTM (GCA-LSTM) with a recurrent attention mechanism, which is better able to focus selectively on the informative joints in each frame by using a global context memory cell. They also introduced a two-stream framework, which leverages coarse-grained attention and fine-grained attention, to achieve higher action recognition accuracy. Similar to the work [35], Si et al. [37] added an attention mechanism to enhance the information of key joints and proposed the Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM), which can explore the co-occurrence relationship between the spatial and temporal domains rather than just capture the discriminative features of spatial structure and temporal dynamics. The authors also proposed a temporal hierarchical structure to increase the temporal receptive field of the AGC-LSTM layer, which enhanced the ability to learn high-level semantic representations and significantly reduced the computational cost.
In the current skeleton-based action recognition literature, RNN and LSTM networks are mainly used to model the long-term context information across the temporal dimension by representing motion-based dynamics. It is difficult for them to learn high-level features from skeletons directly since the temporal modeling is done on the raw input space [38]. However, there are also strong dependency relations among the skeleton joints in the spatial domain, and the spatial dependency structure is usually discriminative for action classification. Benefiting from the excellent ability of CNN models to extract high-level information, more and more works have adopted CNNs to learn spatial-temporal skeleton features and have achieved impressive performance in recent years. As the forerunner of skeleton image representations, Du et al. [9] took advantage of the spatial relations to propose a hierarchical structure. The authors represented each skeleton sequence as a matrix in which each row corresponds to a chain of concatenated skeleton joint coordinates from frame t, and each column corresponds to the temporal evolution of joint j. At this point, the matrix size is J × T × 3, where J is the number of joints in each skeleton, T is the total number of frames in the video sequence and 3 is the number of coordinate axes (x, y, z). The values of this matrix are quantized into an image and normalized to handle the variable-length problem. Finally, they used this representation as the input to a CNN model composed of four convolutional layers and three max-pooling layers. After feature extraction, a feed-forward neural network with two fully connected layers is employed for classification. Moreover, this type of representation is very compact since it encodes the entire video sequence in a single image. Following the work [9], Ke et al. [39] proposed an improved representation of skeleton sequences in which the 3D coordinates are separated into three gray-scale images, and then applied a deep CNN to them. In the two-stream CNN [40], a skeleton transformer module was introduced to learn a new representation of skeleton joints. Furthermore, the authors proposed a two-stream CNN in which the input of one stream is the raw coordinates and the input of the other stream is motion data obtained by subtracting the joint coordinates of each two consecutive frames [40]. Wang et al. [41] presented a skeleton representation that encodes both the spatial configuration and the dynamics of joint trajectories into three texture images through color encoding, named Joint Trajectory Maps (JTMs). The authors applied rotations to the skeleton data to mimic multiple views and to enlarge the data, in order to overcome the drawback that CNNs are usually not view-invariant. They also encoded the motion magnitude of joints into saturation and brightness, claiming that changes in motion result in texture in the JTMs, which are generated by projecting the trajectories onto the three orthogonal planes. Finally, the authors individually fine-tuned three AlexNet [42] CNNs (one for each JTM) to perform classification.
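Two of the steps described above lend themselves to a short sketch: turning a J × T × 3 coordinate array into an image-like array, and building the motion stream by subtracting consecutive frames. The min-max scaling to 8-bit values is an assumed quantization scheme, and the resizing that handles variable sequence lengths is left out; function names are hypothetical.

```python
import numpy as np

def to_image(joints):
    """Min-max scale a (J, T, 3) coordinate array to uint8 so it can be viewed as an RGB image."""
    lo, hi = joints.min(), joints.max()
    scale = hi - lo if hi > lo else 1.0     # avoid division by zero for constant input
    return np.round(255.0 * (joints - lo) / scale).astype(np.uint8)

def motion_stream(joints):
    """Frame-to-frame differences along the time axis: (J, T, 3) -> (J, T-1, 3)."""
    return joints[:, 1:, :] - joints[:, :-1, :]

# Hypothetical toy sequence: 2 joints, 4 frames, 3 coordinates
joints = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
image = to_image(joints)
motion = motion_stream(joints)
```

In a two-stream setup of the kind described, `image` (raw coordinates) and an image built from `motion` would be fed to separate CNN streams.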
Undoubtedly, the methods mentioned above are complicated in their processing and are prone to losing important information. To overcome this shortcoming, Caetano et al. [43] proposed a new skeleton image representation as the input of CNNs, named SkeleMotion (see Figure 6). The temporal dynamics are first encoded by explicitly using motion information at multiple temporal scales to calculate the magnitude and orientation values of the skeleton joints. They trained a tiny CNN model with only three convolutional layers and two fully connected layers, which greatly improved the training speed. Similar to the work [40], it calculated the difference of moving joints between consecutive frames to perform motion coding on skeleton images. Caetano et al. [44] then further introduced the Tree Structure Reference Joints Image (TSRJI) as a skeleton representation, combining the use of reference joints and a tree-structured skeleton.
Many early works noticed the advantages of co-occurrence features and attempted to design and extract them from skeleton sequences. Recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) neurons are prevalently used to model the time series of the skeleton to obtain co-occurrence features [4,45,46]. For example, Zhu et al. [46] proposed an end-to-end fully connected deep LSTM network, which took the skeleton as the input at each time slot and introduced a novel regularization scheme to learn the co-occurrence features of skeleton joints. Although the performance improved, it is difficult to learn high-level features directly from the skeletons. In order to learn high-level features, Caetano et al. [43] proposed CNN-based methods to recognize the underlying action, but it is difficult to explore co-occurrences from all joints efficiently. Combining the advantages of co-occurrence features and the high-level features of the above methods, Li et al. [47] proposed an end-to-end convolutional co-occurrence feature learning framework. The authors used a CNN to learn hierarchical co-occurrence features from skeleton sequences automatically, where features are aggregated gradually from point-level features to global co-occurrence features, which are shown to be superior to local co-occurrences. They then added a global spatial aggregation scheme to gradually aggregate the contextual information of different levels after a transpose operation. Finally, a two-stream framework is introduced to fuse the skeleton motion feature explicitly. In fact, the spatial structure information among skeleton joints is hard to utilize effectively with both RNN-based and CNN-based methods, even though researchers have proposed additional constraints or dedicated network structures to encode the spatial structure of skeleton joints.
FIGURE 6 The process of SkeleMotion representation. [43]
Graph convolutional networks (GCNs) generalize convolutional neural networks (CNNs) to graphs of arbitrary structure, including the skeleton graph. Recently, GCN-based methods have been proposed and have attracted attention owing to their high performance. GCN-based methods represent joints as vertices and their natural connections in the human body as edges, and then compute convolutions over the vertices connected by edges. Several works [8,12,48,49] provided reasonably consistent evidence that a graph structure is more suitable for the human body skeleton than a sequence vector or a 2D pseudo-image.
Yan et al. [12] were the first to apply GCNs to model skeleton data and proposed the ST-GCN model, which includes a spatial graph and a temporal graph so that a sequence of skeletons can be input directly and features can be extracted from joints both within and across frames. Specifically, they constructed a spatial graph based on the natural connections of joints in the human body and added temporal edges between corresponding joints in consecutive frames. A distance-based sampling function was proposed for constructing the graph convolutional layer, which is employed as a basic module to build the final spatio-temporal graph convolutional network (ST-GCN). Inspired by the idea that the human skeleton is a combination of multiple body parts, Thakkar et al. [50] and Li et al. [51] proposed different approaches to divide the body into parts. Thakkar et al. [50] defined a part-based graph convolutional network (PB-GCN). Instead of regarding the whole skeleton as a single graph, it divided the skeleton graph into four subgraphs, because the part-based GCN can learn the important information of each part and the relations across different parts. Furthermore, they used relative coordinates and temporal displacements in place of 3D joint coordinates as node features to boost recognition performance. Li et al. [51] proposed a spatio-temporal graph routing (STGR) scheme to model the semantic connections among the joints in a disentangled way (Figure 7). To address the imbalance of joint connections in the fixed human skeleton, they introduced a spatial graph router (SGR) and a temporal graph router (TGR), which can adaptively learn the intrinsic high-order connectivity relationships between physically distant skeleton joints. SGR discovered the connectivity relationships among the joints based on sub-group clustering along the spatial dimension, and TGR explored the structural information by measuring the degree of correlation between temporal joint node trajectories. However, the human skeleton is a whole structure, and after it is cut into several parts, much of the internal semantic information between joints is lost. In particular, graph learning in STGR-GCN [51] has high computational complexity, and the spatial graph is constructed on clusters, each of which is assigned a weight, so it may not capture the implicit pairwise spatial relationships among joints.
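The spatio-temporal graph described for ST-GCN (bone edges within each frame, plus edges linking the same joint in consecutive frames) can be written down explicitly for a toy skeleton. This is only an illustration of the graph definition; building one dense matrix over all T × J nodes is an assumption made for clarity and is not how such networks are usually implemented. The function name and the 3-joint skeleton are hypothetical.

```python
import numpy as np

def spatiotemporal_adjacency(bones, num_joints, num_frames):
    """Adjacency over T*J nodes; node index = frame * num_joints + joint."""
    n = num_joints * num_frames
    adj = np.zeros((n, n), dtype=np.int8)
    for t in range(num_frames):
        off = t * num_joints
        # Spatial edges: natural bone connections inside frame t
        for i, j in bones:
            adj[off + i, off + j] = adj[off + j, off + i] = 1
        # Temporal edges: same joint in frame t and frame t + 1
        if t + 1 < num_frames:
            for j in range(num_joints):
                a, b = off + j, off + num_joints + j
                adj[a, b] = adj[b, a] = 1
    return adj

# Hypothetical 3-joint chain observed over 2 frames
bones = [(0, 1), (1, 2)]
st_adj = spatiotemporal_adjacency(bones, num_joints=3, num_frames=2)
```

For this toy case there are four spatial edges (two per frame) and three temporal edges, so the symmetric matrix holds fourteen non-zero entries.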
Unlike PB-GCN [50] and STGR-GCN [51], Shi et al. noticed that a heuristically predefined graph that represents only the physical structure of the human body cannot be guaranteed to be optimal for the action recognition task. Moreover, the topology of the graph applied in ST-GCN is fixed over all layers, so it is difficult to capture the changing human structure in complex scenes, and the model lacks the flexibility and capacity to represent the multilevel semantic information contained in the different layers. To solve these problems, 2s-AGCN [11] proposed a novel adaptive graph convolutional network, which parameterizes the graph structure of the skeleton data and embeds it into the network so that it is learned and updated jointly with the model in a data-driven manner. Specifically, they divided the adjacency matrix of the graph into three parts: the first part represents the physical structure of the human body, the elements of the second part are parameterized and optimized together with the other parameters during training, and the third part learns a unique graph for each sample of the dataset. Since the work [52] showed that bone information (the directions and lengths of bones) is a good modality for skeleton-based action recognition, bone information is also used to capture more of the information in the skeleton data: each bone is formulated as a vector pointing from its source joint to its target joint.
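The two ideas in this paragraph, bone vectors and a three-part adjacency matrix, can be sketched as below. The bone definition (target minus source) follows the text directly. The data-dependent part is simplified here to a row-wise softmax over raw feature similarity, which is an assumption: the learned embedding functions used in the original method are omitted, and all names are hypothetical.

```python
import numpy as np

def bone_vectors(joints, bone_pairs):
    """joints: (J, 3); bone_pairs: list of (source, target). Returns target - source per bone."""
    src = np.array([s for s, _ in bone_pairs])
    tgt = np.array([t for _, t in bone_pairs])
    return joints[tgt] - joints[src]

def softmax_rows(x):
    """Numerically stable softmax over each row."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adaptive_adjacency(a_physical, b_learned, features):
    """Sum of a fixed physical graph, a learned graph and a per-sample similarity graph."""
    c_sample = softmax_rows(features @ features.T)   # simplified data-dependent part
    return a_physical + b_learned + c_sample

# Hypothetical 3-joint skeleton
joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pairs = [(0, 1), (1, 2)]
bones_xyz = bone_vectors(joints, pairs)

a_phys = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
b_init = np.zeros((3, 3))            # stand-in for parameters learned in training
adj_total = adaptive_adjacency(a_phys, b_init, joints)
```

The length and direction of each row of `bones_xyz` carry the bone information; in a two-stream model this array would be processed alongside the joint coordinates.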
FIGURE 7 The overview of the spatio-temporal graph router. [51]
Existing graph-based methods usually represent the skeleton as an undirected graph and model the bones and joints with two separate networks, which cannot fully exploit the dependencies between joints and bones. To solve this problem, Shi et al. [53] proposed a novel directed graph neural network (DGNN), which first represents the skeleton data as a directed acyclic graph with joints as vertices and bones as edges, and then uses the constructed directed graph to propagate information between adjacent joints and bones and update their dependencies in each layer. This method performs well even relative to the well-established undirected-graph methods. Li et al. [10] proposed the actional-structural graph convolution network (AS-GCN), which generates the skeleton graph with actional links and structural links. They introduced an encoder-decoder structure to capture action-specific latent dependencies and extended the existing skeleton graphs to represent higher-order dependencies. Moreover, it is a multi-task learning model: an additional future pose prediction head captures more detailed patterns through self-supervision. Multi-task learning is also a promising direction for future research. Building on 2s-AGCN [11], and in order to help the model pay more attention to the important information, MS-AAGCN [54] added an attention mechanism in the form of a spatial-temporal-channel (STC) attention module, which adaptively recalibrates the activations of the joints, frames and channels for different data samples. The module is plugged into each graph convolutional layer, adding only a small number of parameters yet yielding an encouraging performance improvement.
All the methods mentioned above assume that complete skeleton joints can be captured well, and do not consider the incomplete case. However, it is often difficult to obtain a complete skeleton sequence in real scenarios; for example, students in a classroom may be occluded by desks, chairs or other students. When facing incomplete skeletons, traditional methods suffer varying degrees of performance deterioration.
Therefore, Song et al. [55] proposed a multi-stream richly activated GCN (RA-GCN), which uses class activation maps (CAM) to learn distinctive features of currently unactivated joints in multiple streams, in order to solve the problem of recognizing actions from incomplete skeletons. They constructed a synthetic occlusion dataset based on the NTU RGB+D 60 dataset, in which 'incomplete skeletons' are defined as spatially occluded or temporally missing skeleton features. However, the method still lacks flexibility. Yu et al. [56] then proposed a method for noise-robust skeleton-based action recognition, called the Predictively Encoded Graph Convolutional Network (PeGCN), which learns a representation by predicting the perfect sample from noisy samples in latent space via an autoregressive model. To some extent, this addresses how a model can process noisy skeleton samples (Figure 8).
FIGURE 8 Illustrations of various types of noisy skeletons. [56]
FIGURE 9 The conventional representation of the temporal graph (left). The extended representation of the temporal graph, which connects multiple neighboring vertices as well as the same vertex across frames (right). [57]
The traditional GCN-based action recognition methods [10,11,12] all feed the entire skeleton sequence into the network and are all single-pass feedforward networks, so the low-level layers cannot access the semantic information of the high-level layers. It is worth noting that when the whole skeleton sequence is used as the input, useful information is usually buried in motion-irrelevant and non-discriminative clips, and the computational complexity of the model increases. Therefore, Yang et al. [59] were the first to introduce a feedback mechanism into GCNs and proposed the Feedback Graph Convolutional Network (FGCN), which adopted a multi-stage temporal sampling strategy to avoid feeding in the whole skeleton sequence, and extracted effective features from skeleton data in a coarse-to-fine progressive process for action recognition (Figure 11 shows the detailed architecture of the proposed FGCB local network, which is the core component of the FGCN model).
FIGURE 10 Illustration of the GVFE module structure: it is composed of J TCN blocks. [58]
Since the graph learning in STGR-GCN [51] has high computational complexity, and its spatial graph is constructed on clusters, each of which is assigned a weight, it may not capture the implicit pairwise spatial relationships among joints. Consequently, Gao et al. [48] proposed a graph regression based GCN (GR-GCN) model that poses an optimization problem on the graph structure. The optimized graph not only connects each joint to its neighboring joints in the same frame, strongly or weakly, but also links it with relevant joints in the previous and subsequent frames, and enforces the sparsity of the underlying graph for efficient representation.
FIGURE 11 The detailed architecture of the proposed FGCB local network. [59]
FIGURE 12 (a) The original spatial and temporal modeling on skeleton graph sequences in GCN-based methods. (b) The proposed approach to capture cross-spacetime correlations between the current node and its neighbors. (c) Disentangling nodes in spatial-temporal neighborhoods based on node distance can effectively capture multi-scale features. [49]
However, these existing approaches extract multi-scale structural features and long-range dependencies by performing graph convolutions with higher-order polynomials of the skeleton adjacency matrix. Although the adjacency polynomial increases the receptive field of graph convolutions by making distant neighbors reachable, this formulation suffers from the biased weighting problem. Specifically, on skeleton graphs, a higher polynomial order is only marginally effective at capturing information from distant joints, since the aggregated features will be dominated by the joints from local body parts. This is a critical drawback limiting the scalability of existing multi-scale aggregators. Liu et al. [49] proposed a novel method that uses two pathways to improve action recognition performance. In the G3D pathway, integrating the disentangled aggregation scheme with a sliding temporal window provides a powerful feature extractor (MS-G3D, see Figure 12). Specifically, aggregation over different hop matrices and dilated temporal convolution provide multi-scale receptive fields across both the spatial and temporal dimensions. In the factorized pathway, one MS-GCN layer and two MS-TCN layers are stacked to capture spatial-temporal information. In addition, to lower the computational cost of the extra branches in MS-TCN, they deployed a bottleneck design and used different dilation rates instead of larger kernels to obtain larger receptive fields. In the spatial and temporal domains, the direct multi-scale aggregation of features and the larger receptive fields further improve model performance.
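The biased weighting problem and the disentangled alternative can be made concrete on a small chain graph. In the sketch below, powers of the self-looped adjacency matrix count walks, so nearby joints accumulate larger weights, whereas a k-hop matrix marks only the joints whose shortest-path distance is exactly k. The construction via thresholded matrix powers is one way to obtain such hop matrices and is written here as an illustrative assumption; the function name and the 4-joint chain are hypothetical.

```python
import numpy as np

def k_hop_adjacency(adj, k):
    """Return a 0/1 matrix with 1 where the shortest path between i and j is exactly k hops."""
    n = adj.shape[0]
    if k == 0:
        return np.eye(n, dtype=int)
    a_tilde = adj + np.eye(n, dtype=int)                        # add self-loops
    reach_k = np.linalg.matrix_power(a_tilde, k) >= 1           # reachable within k hops
    reach_prev = np.linalg.matrix_power(a_tilde, k - 1) >= 1    # reachable within k - 1 hops
    return (reach_k & ~reach_prev).astype(int)

# Hypothetical 4-joint chain: 0 - 1 - 2 - 3
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=int)

walks = np.linalg.matrix_power(chain + np.eye(4, dtype=int), 2)  # adjacency polynomial term
two_hop = k_hop_adjacency(chain, 2)
```

In `walks`, joint 0 weights itself more heavily than its 2-hop neighbour, which is the bias towards local joints; `two_hop` keeps only the pairs (0, 2) and (1, 3) with equal weight.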
In this section, we first introduce the experimental settings of several methods, then compare the performance of the above methods from three perspectives: accuracy, number of parameters and computational complexity (GFLOPs, giga floating-point operations).
For the length of input skeleton sequences, the RNN-based methods [37,60] always set N = 100. The GCN-based methods all set the maximum number of frames per sample to 300; samples with fewer than 300 frames are repeated until they reach 300 frames [11,12,49]. The initial learning rate (LR) of the GCN-based methods is almost always 0.1, about 1000 times that of the RNN-based methods. For the batch size, the CNN-based methods VA-CNN [61–63] and TSRJI [44] use 32 and 1000, and the RNN-based methods IndRNN [34] and VA-RNN [61–63] use 128 and 256, respectively. The batch size of the GCN-based methods is always 32 or 64, depending on the GPU used. In general, a cluster with four Nvidia GTX 1080 Ti GPUs is sufficient to run all of these experiments. In addition, almost all GCN-based methods are trained with SGD with momentum 0.9. We chose the seven action recognition algorithms shown in Table 2 to compare feature dimensions. The feature dimensions of RNN-based and GCN-based methods are usually 512 and 256, respectively, except for MS-G3D, which uses 384.
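The repetition-based padding used by the GCN-based methods can be sketched as follows. This is a minimal illustration, not the preprocessing code of any cited method: the function name is hypothetical, frames are treated as opaque list items, and truncating sequences longer than the limit is an assumption the text does not specify.

```python
def pad_by_repetition(frames, max_frames=300):
    """Repeat a frame sequence from its start until it has max_frames frames.

    Sequences already longer than max_frames are cut to the first
    max_frames frames (an assumption made for this sketch).
    """
    if not frames:
        raise ValueError("cannot pad an empty sequence")
    repeats = -(-max_frames // len(frames))  # ceiling division
    return (list(frames) * repeats)[:max_frames]


# A 120-frame sample: frames 0..119 are replayed 2.5 times.
sample = list(range(120))
padded = pad_by_repetition(sample)
print(len(padded), padded[120], padded[299])  # 300 0 59
```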
Firstly, we compare the accuracy of the mentioned methods for skeleton-based action recognition on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are shown in Tables 3 and 4. The compared methods are divided into RNN-based methods [37,60], CNN-based methods [43,47,61–63] and GCN-based methods [10–12,49,55], and the groups are separated by a line in the result tables. In addition, the average accuracy of these three network architectures on the NTU RGB+D 60 dataset is illustrated in Figure 13. Evidently, the GCN-based methods outperform the existing RNN-based and CNN-based methods by a sizable margin, which demonstrates the effectiveness of GCNs for action recognition. Among the GCN-based methods, we observed that using joint information and bone information of the skeleton data together [11,49] improves recognition accuracy.
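A small sketch may help make the joint and bone combination concrete. It assumes, for illustration only, that a bone is the vector from a parent joint to its child and that the two streams are fused by adding their class scores; the parent list, the scores and the function names are hypothetical and are not taken from [11,49].

```python
def bone_vectors(joints, parents):
    """Bone for joint i = joints[i] - joints[parents[i]].

    The root joint is its own parent, so its bone is the zero vector.
    """
    return [
        tuple(c - p for c, p in zip(joints[i], joints[parents[i]]))
        for i in range(len(joints))
    ]


def fuse_and_predict(joint_scores, bone_scores):
    """Add per-class scores of the two streams and return the best class index."""
    fused = [j + b for j, b in zip(joint_scores, bone_scores)]
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical 3-joint skeleton: joint 0 is the root, 1 hangs off 0, 2 off 1.
joints = [(0, 0, 0), (0, 1, 0), (1, 1, 0)]
parents = [0, 0, 1]
print(bone_vectors(joints, parents))  # [(0, 0, 0), (0, 1, 0), (1, 0, 0)]

# The joint stream prefers class 0, the bone stream class 1; fusion picks class 1.
print(fuse_and_predict([0.6, 0.4], [0.2, 0.8]))  # 1
```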
Secondly, we compare the number of network parameters of nine state-of-the-art methods covering the three network architectures for skeleton-based action recognition. As shown in Figure 14, among the CNN-based methods, HCN [47] has the fewest parameters and VA-CNN [61–63] has the most. Among the RNN-based methods, although AGC-LSTM [37] achieves performance comparable to the state of the art, the numbers of parameters of AGC-LSTM [37] and SR-TSL [60] are larger than those of the GCN-based methods. ST-GCN [12] has the lowest accuracy, but it has fewer parameters than every other method except HCN. To further improve the recognition performance of GCN-based methods, 2s-AGCN [11] and MS-G3D [49] fuse a joint stream and a bone stream, so the number of parameters shown in the figure is twice that of a single stream in their respective models. MS-G3D achieves the best performance, and it has fewer parameters than most methods.
TABLE 2 The implementation details of different methods
TABLE 3 Accuracy of recognition (%) on NTU RGB+D 120 dataset
Finally, as shown in Table 5, we compare the computational complexity of some methods whose code has been published, ranked in order of accuracy. The computational complexity of these methods is high, typically over 30 GFLOPs for one action sample. Some works [37,55] even reach 100 GFLOPs. The latest work, MS-G3D [49], achieves the highest accuracy without increasing computational complexity. Therefore, how to balance accuracy and computational complexity remains a problem to be solved.
TABLE 4 Accuracy of recognition (%) on NTU RGB+D 60 dataset
FIGURE 13 The average recognition accuracy (in percentage) of RNN-based, CNN-based and GCN-based methods for cross-subject and cross-view recognition
FIGURE 14 Comparison of the parameters of different methods
TABLE 5 Comparison of computational cost on NTU RGB+D 60 Cross-view task
Over the past few years, action recognition has become a popular and practical application in computer vision. Skeleton data and deep learning methods are powerful and effective tools in this domain and have helped to make significant progress. This progress is attributed to the expressiveness of skeleton data, the flexibility of the models and the high efficiency of the training algorithms. The main contributions of this survey are as follows: (1) We conduct a comprehensive review and summarize state-of-the-art practices of 3D skeleton action recognition based on deep learning methods, covering the latest algorithms within RNN-based, CNN-based and GCN-based techniques. We then describe a general framework of action recognition methods based on 3D skeleton data and deep learning. (2) To the best of our knowledge, this is the first work that integrates the research on the GCN-based method and its various evolutionary methods. (3) In particular, we compare the performance of the existing action recognition methods from three perspectives: accuracy, number of parameters and computational complexity.
For skeleton-based human action recognition, one of the challenges is the large diversity of viewpoints in the captured human action data. The position of the camera and the direction of the human action are the two causes of this problem. Other challenges are how to make full use of the dependencies between joints, how to optimize the spatial-temporal graph, and how to make good use of bone information. Researchers still face these challenges, which need to be studied and solved in the future.
In terms of future development directions, occlusion and self-occlusion, efficient real-time detection, lightweight models, applications on mobile devices and multi-task learning are potential directions worthy of study. Moreover, the interpretability of action recognition models is also a promising direction. In addition, accuracy on the NTU RGB+D dataset is already so high that it is hard to make further improvements. Future studies should pay more attention to larger and more complex datasets such as the NTU RGB+D 120 dataset.
ORCID
Yuling Xing https://orcid.org/0000-0002-4860-532X
Jia Zhu https://orcid.org/0000-0002-5959-390X
CAAI Transactions on Intelligence Technology, 2021, Issue 1