Qing-xin Shi, Chang-sheng Li, Bao-qiao Guo, Yong-gui Wang, Huan-yu Tian, Hao Wen, Fan-sheng Meng, Xing-guang Duan *
a School of Mechatronical Engineering, Beijing Institute of Technology, Beijing, 100081, China
b Beijing Advanced Innovation Center for Intelligent Robots and Systems, Beijing Institute of Technology, Beijing, 100081, China
c State Key Laboratory of Explosion Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Keywords: Robot applications; Object detection; Vehicle inspection; Identity verification; You only look once (YOLO)
ABSTRACT With the increasing number of vehicles, manual security inspections at road checkpoints are becoming more laborious. To address this, a specialized Road Checkpoints Robot (RCRo) system is proposed, which incorporates an enhanced You Only Look Once (YOLO) detector and a 6-degree-of-freedom (DOF) manipulator for autonomous identity verification and vehicle inspection. The modified YOLO, named "LF-YOLO", is characterized by sensitivity to large objects and faster detection speed. Both properties are achieved by a Dense module-based backbone network connected to a two-scale detection network, together with optimized anchor boxes and an improved loss function. During manipulator motion, an Octree-aided motion control scheme is adopted for collision-free motion through the Robot Operating System (ROS). The proposed LF-YOLO, which utilizes a continuous optimization strategy and a residual technique, provides a promising detector design. In experiments on actual object detection, it decreased the average detection time by 68.25% and 60.60% and increased the average Intersection over Union (IoU) by 20.74% and 6.79% compared with YOLOv3 and YOLOv4, respectively. Comprehensive functional tests of the RCRo system demonstrate the feasibility and competency of multiple unmanned inspections in practice.
As an effective means of tightening public security, manual security inspections are onerous for police officers at road checkpoints and city entry points, because with the increasing number of private vehicles, more and more people choose to drive to other places [1]. Especially during holidays, security inspections are laborious at the entry points of large cities. Therefore, replacing human labor with intelligent robots can provide significant cost savings, which has aroused the research interest of many scholars [2].
Representative robot-based vehicle inspections include license plate recognition by mobile robots in parking lots [3,4], plate recognition algorithms [5,6], and automated under-vehicle inspection based on visual sensors [1,3]. It would be better to also take body-type and exterior-color recognition into account, which helps sift out modified and fake-licensed vehicles [7]. Identity verification using a robot system includes face recognition [8,9] and computing the matching rate between the face and the ID card [4]. However, there is little literature on autonomous robot systems that inspect people inside vehicles. The above studies are all based on mobile robots, which do not have enough degrees of freedom (DOFs) to interact well with humans.
Visual information is needed during robot motion [10]. Deep learning builds bridges between the robotics, computer vision, and machine learning communities [11]. Because the region proposal generation stage is completely dropped [12], You Only Look Once (YOLO), a friendly open-source solution based on deep learning, directly predicts objects using a small set of candidate regions and can be used in real-time systems [13]. Several available versions of YOLO [14-16] have been widely used in many fields, such as integrating YOLOv4 into a Simultaneous Localization and Mapping system [17], detecting threat objects in X-ray baggage images with YOLOv2 [18], detecting objects under poor lighting with YOLOv3 through thermal imaging [19] or a Retinex image enhancement algorithm [20], and detecting pedestrians at night with YOLO [21]. YOLO-based detection of humans and vehicles optimizes the measurement process without loss of accuracy by using coded aperture measurements that can deal with high dynamic ranges of lighting conditions [22,23]. The method achieves good performance on mid-wave infrared videos even in low-light environments [24]. On this basis, a real-time system with wireless transmission demonstrates practical value in rescue operations and fire damage assessment [25]. Moreover, the prediction of stock movements further illustrates YOLO's ability to learn image features [26]. Although existing YOLO techniques seem able to solve classification problems, they may not be appropriate for some regression problems that require a more precise Intersection over Union (IoU) during manipulator motion [27,28].
To improve the performance of YOLO, it is common to enrich the training dataset [29], adjust convolutional kernels [30] or anchor boxes [31], modify the network architecture [32,33], and optimize the loss function according to specific requirements during training [6,34]. The following issues often arise when using a collected training dataset. Firstly, a lack of diversity in objects or backgrounds [35,36] leads to poor model generalization. Secondly, the high cost and time consumption [29] are also unsatisfactory. Clustering the anchor boxes facilitates better regression, but a method with greater randomness relies too much on the initial values [15,37], which results in inaccurate final values. Modifying and expanding the network architecture influences the detection accuracy [38,39], while the detection speed, which is always a focus of research [32], should also be discussed [6,33]. From the above analysis, the performance of object detection can be improved by constructing a specific network architecture and using suitable algorithms, which means that designing a classifier according to task requirements is feasible. More importantly, the detection accuracy of a shortened network deserves further discussion.
Motivated by replacing human labor with robots, we propose an autonomous Road Checkpoints Robot (RCRo) system with multiple inspections for security precaution. To meet the requirements of both speed and accuracy in the RCRo system, we also propose an improved YOLO model characterized by sensitivity to large objects and faster detection speed in the security inspection scene. We refer to this method as "LF-YOLO". The design of the RCRo system is described, and the LF-YOLO is constructed and analyzed in detail. The major contributions and novelties of this paper are threefold:
1) An autonomous RCRo system is presented, which performs multiple security inspections at road checkpoints, including identity verification, inspection of three vehicle features (i.e. license plate, body-type, and exterior-color), and under-vehicle inspection. To the best of our knowledge, no literature has reported placing such a robot system at checkpoints or ports of entry.
2) To help streamline the inspection at road checkpoints, we formulate a multi-inspection line using our RCRo system. Qualitative experiments illustrate that the RCRo system is able to replace human labor.
3) The proposed LF-YOLO model is used to locate the window of the inspected vehicle. A Dense-based network connected to a two-scale network, with optimized anchor boxes and an improved loss function, is designed to increase detection efficiency and enhance localization accuracy at our detection scene. The promising performance of the detector is verified through ample experiments.
The rest of this paper is arranged as follows. Section 2 describes the design of the RCRo system. Sections 3 and 4 present the construction of the LF-YOLO model and the vision-based motion strategy for the manipulator, respectively. Experiments, including the detection performance of LF-YOLO and functional tests of the RCRo system, are presented in Section 5, with conclusions summarized in Section 6.
The layout of our RCRo system is shown in Fig. 1(a)-(b). According to function, the RCRo system is divided into under-vehicle inspection, vehicle-features inspection, identity verification, and integrated control.
The under-vehicle inspection, which includes an under-vehicle device (MV-PD030001, Hikvision Co., Ltd.), a gas sensor, and a radiant sensor, checks for suspicious packages, toxic gases, and explosives. Because vehicles have to pass through the specific checkpoint lane, the under-vehicle device can be fixed to the ground instead of being carried by a mobile robot.
The vehicle-features inspection is carried out by a vehicle-features device (DS-TCG225, Hikvision Co., Ltd.). With this device, modified and fake-licensed vehicles are screened out through the extracted features of license plate, body-type, and exterior-color. For the same reason, this device is also fixed to the ground.
The identity verification comprises a manipulator, a manipulator control system, an identity device (DS-K5606, Hikvision Co., Ltd.), and a depth camera (RealSense D415). It not only realizes manipulator motion based on vehicle window detection, but also recognizes the person's face and gives a matching rate between the face and the ID card for sifting out fake and replaced ID cards. As the specific lane is narrow, the space for mobile robot motion may be insufficient. Besides, every driver parks at a different angle, so a robot with more DOFs is needed to reach more postures and realize drive-in inspection. As a result, we use a 6-DOF manipulator (six Kollmorgen integrated joints) with an identity device attached to the manipulator extremity. The manipulator motion is directly implemented by the manipulator control system via Controller Area Network open (CANopen). Visual information is acquired by the depth camera and transmitted to the manipulator control system through a Topic on a Robot Operating System (ROS) node.
The integrated control includes a master computer and a barrier gate. Its primary responsibilities are receiving the results of the under-vehicle inspection, vehicle-features inspection, and identity verification through Transmission Control Protocol (TCP), analyzing security according to the collected information, notifying the manipulator control system through TCP when to detect, move, and return, and controlling whether the barrier gate lifts.
The flowchart of security inspection is shown in Fig. 1(c). First, the RCRo system is started, which puts the manipulator on standby in an initial posture. When a vehicle enters the inspection area and passes over the under-vehicle device, the system automatically detects whether the bottom of the vehicle is safe. Then, the vehicle enters the camera view of the vehicle-features device, which sifts out modified or fake-licensed vehicles. When the vehicle stops, the master computer notifies the manipulator control system to perform window detection and manipulator motion. The manipulator extremity, equipped with the identity device, then moves to the vehicle window under visual guidance. After that, the identity device recognizes the human face and checks the ID card information when the person rolls down the window, faces the device, and swipes the ID card. Next, the master computer receives the identity information and notifies the manipulator control system to return the manipulator. Once the manipulator has returned to the initial posture, it is ready for the next vehicle inspection. Finally, our application processes the inspection results from the devices and sensors, and the multi-inspection results are displayed on the Graphical User Interface (GUI). The system lifts the barrier gate when all inspection results are qualified; otherwise, it does not operate the gate and informs the police for further inspection.
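The final decision step of this flow can be sketched in a few lines. This is a minimal illustration, not the authors' application: the class `InspectionResults`, its field names, and the returned action strings are hypothetical names introduced here, and the only logic taken from the text is that the gate is operated only when every inspection is qualified.

```python
from dataclasses import dataclass


@dataclass
class InspectionResults:
    """Hypothetical container for the three inspection outcomes."""
    under_vehicle_ok: bool
    vehicle_features_ok: bool
    identity_ok: bool


def gate_decision(results):
    """Return an action and the list of failed inspections.

    The gate is lifted only if all inspections are qualified;
    otherwise the police are informed for further inspection.
    """
    failed = [name for name, ok in vars(results).items() if not ok]
    if failed:
        return "notify_police", failed
    return "lift_barrier", []


print(gate_decision(InspectionResults(True, True, True)))
print(gate_decision(InspectionResults(True, True, False)))
```

Returning the list of failed checks alongside the action is a design choice of this sketch; it would let a GUI show which inspection triggered the manual follow-up.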
Fig. 1. The RCRo system. (a) Layout of the RCRo system (angle 1); (b) Layout of the RCRo system (angle 2); (c) Flowchart of security inspection.
With such a system, the security inspections of both vehicles and people at road checkpoints can be covered without human labor.
In this section, LF-YOLO is designed for high localization accuracy and high detection efficiency; it is characterized by robust detection of large objects.
The designed LF-YOLO directly extracts features from the input image, predicts the position and probability of the vehicle window from the features of the entire image, and transforms the positioning problem of the vehicle window into a regression problem to realize end-to-end detection. In YOLOv3, a multi-scale strategy is used, which means an input image is divided into $S_1 \times S_1$, $S_2 \times S_2$, and $S_3 \times S_3$ grid cells, respectively. The three scales are related as follows:
$$S_1=\frac{S_2}{2}=\frac{S_3}{4}=\frac{l}{32} \qquad (1)$$
where $l$ denotes the side length, in pixels, of the square input image. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
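As a small worked example of the grid relation, the sketch below computes the number of grid cells per side for the standard YOLOv3 down-sampling factors (32, 16, and 8) and finds the cell responsible for an object center. The input size of 416 pixels and the function names are assumptions chosen for illustration only.

```python
def grid_sizes(l, strides=(32, 16, 8)):
    """Number of grid cells per side for each detection scale."""
    if any(l % s for s in strides):
        raise ValueError("input side must be divisible by every stride")
    return [l // s for s in strides]


def responsible_cell(cx, cy, l, s):
    """(row, col) of the grid cell containing the object center (cx, cy), in pixels."""
    cell = l / s  # side length of one grid cell in pixels
    col = min(int(cx // cell), s - 1)
    row = min(int(cy // cell), s - 1)
    return row, col


sizes = grid_sizes(416)  # 416 is an assumed example input size
print(sizes)
print(responsible_cell(208, 100, 416, sizes[0]))
```

The coarsest grid has the fewest and therefore largest cells, which is why it suits large objects such as a vehicle window that fills much of the image.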
IoU, representing the similarity between the predicted box (bounding box) and the labeled box (ground truth), is obtained by:
$$\mathrm{IoU}=\frac{|A\cap B|}{|A\cup B|} \qquad (2)$$
where A and B are the areas occupied by the predicted box and the labeled box, respectively. However, IoU does not reflect the distance between the two objects. For example, if the IoU is equal to 0, we cannot know whether the two objects are in the vicinity of each other or very far from each other. In our RCRo system, the accuracy of the bounding box determines the success rate of the manipulator motion. Consequently, the Generalized Intersection over Union (GIoU) [34] algorithm (Eq. (3)) is introduced in our paper to address this weakness of IoU.
$$\mathrm{GIoU}=\mathrm{IoU}-\frac{|C\setminus(A\cup B)|}{|C|} \qquad (3)$$
where C is the smallest enclosing rectangle of A and B, and the second term on the right-hand side of Eq. (3) is the ratio of the area of C not covered by A or B to the total area of C. Unlike IoU, which focuses only on the overlapping area, GIoU also accounts for the non-overlapping area, and thus better reflects the overlap between two objects.
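The difference between the two measures can be made concrete with a short sketch for axis-aligned boxes. It assumes boxes given as `(x1, y1, x2, y2)` corner coordinates with positive area; the function names and the example boxes are illustrative and not taken from the paper.

```python
def box_area(b):
    """Area of a box (x1, y1, x2, y2); degenerate boxes have zero area."""
    x1, y1, x2, y2 = b
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_giou(a, b):
    """Return (IoU, GIoU) of two axis-aligned boxes with positive area."""
    # Intersection rectangle
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # Smallest rectangle C enclosing both boxes
    c_area = box_area((min(a[0], b[0]), min(a[1], b[1]),
                       max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    giou = iou - (c_area - union) / c_area
    return iou, giou


truth = (0, 0, 2, 2)
near = (3, 0, 5, 2)    # does not overlap, but close to the ground truth
far = (10, 0, 12, 2)   # does not overlap and far away
print(iou_giou(truth, near))
print(iou_giou(truth, far))
```

Both non-overlapping predictions have an IoU of 0, while their GIoU values differ: the nearer box scores higher, which is the distance information that IoU alone cannot provide.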
As the size of the anchor boxes directly affects the speed and accuracy of object detection, it is important to set the anchor box parameters according to the labeled vehicle window boxes. Considering the randomness of the K-means algorithm in selecting the initial cluster centers, we use the K-means++ algorithm, which has less randomness and can effectively reduce the clustering deviation caused by the initial selection.
To reduce the error that the Euclidean distance introduces into the anchor box parameters, the IoU between the labeled box and the anchor box is used in a new objective function that replaces the Euclidean distance. The objective function value represents the deviation between the labels and the cluster centers of the anchor boxes. The objective function D is:
where box denotes a labeled box, cen is a cluster center of the anchor boxes, n represents the number of labeled boxes, and k is the number of clusters. According to Eq. (3), the GIoU-based objective function can be expressed as:
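The clustering idea can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the distance is taken as 1 − IoU between width-height pairs aligned at a common corner (a common convention for anchor clustering), centers are updated by the per-cluster mean, and the reported objective is the mean distance of each labeled box to its nearest center, which may differ in form from Eq. (4). All function names and the toy box sizes are hypothetical.

```python
import random


def wh_iou(box, cen):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], cen[0]) * min(box[1], cen[1])
    return inter / (box[0] * box[1] + cen[0] * cen[1] - inter)


def kmeanspp_init(boxes, k, rng):
    """K-means++ seeding: later centers are drawn with probability proportional to d^2."""
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        d2 = [min(1.0 - wh_iou(b, c) for c in centers) ** 2 for b in boxes]
        if sum(d2) == 0.0:  # every box coincides with an existing center
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centers


def cluster_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) boxes into k anchors using the distance 1 - IoU."""
    rng = random.Random(seed)
    centers = kmeanspp_init(boxes, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: wh_iou(b, centers[j]))
            groups[best].append(b)
        # Mean update; an empty cluster keeps its previous center
        new = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new == centers:
            break
        centers = new
    objective = sum(1.0 - max(wh_iou(b, c) for c in centers) for b in boxes) / len(boxes)
    return sorted(centers, key=lambda c: c[0] * c[1]), objective


# Toy labeled-box sizes in pixels (illustrative values only)
boxes = [(10, 10), (11, 9), (9, 11), (100, 50), (102, 48), (98, 52)]
anchors, objective = cluster_anchors(boxes, k=2)
print(anchors, round(objective, 3))
```

Swapping `wh_iou` for a GIoU-based similarity would give the GIoU variant of the objective; for corner-aligned boxes the enclosing rectangle is simply the element-wise maximum of the two sizes.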
The loss function is one of the most important parts, as it determines the detection performance of the network. The smaller the loss value, the better the robustness of the model. The loss function is given by:
So far, most detection frameworks have not used IoU in the optimization of the loss function. Since IoU can be back-propagated, it can be directly used as a loss function for optimizing weights [40]. However, in all non-overlapping cases, IoU has zero gradient and therefore cannot be optimized in the loss function. GIoU, in contrast, has a gradient in all possible cases, including non-overlapping situations [34]. We use GIoU to replace the coordinate error and size error of the predicted bounding box in Eq. (6); the Gloss function can then be written as:
There are four considerations in the network design. First, the original YOLOv3 and YOLOv4 networks contain 106 and 161 layers, respectively. A deeper network clearly increases both training time and detection time. However, we only need to detect large vehicle windows at the detection scene, which indicates that so many layers are unnecessary for feature extraction in this work. This is why we do not base our improved network architecture on YOLOv4.
Secondly, in YOLOv3, the Residual Network (ResNet) is used in the backbone to address the problem that deep networks are difficult to train. In general, the ResUnit, the basic unit of ResNet, is constructed using a cross-layer connection that spans two or three layers, while the Dense module, the basic unit of the Dense Convolutional Network (DenseNet) [32], allows cross-layer connections between any two non-adjacent layers. As a result, we replace the ResUnits with Dense modules, which not only extends the residual performance with fewer layers but also makes the network more efficient.
Thirdly, three scales of grid cells are adopted in YOLOv3, i.e. $S_1 \times S_1$, $S_2 \times S_2$, and $S_3 \times S_3$. A scale with fewer grid cells is better at detecting larger objects. Given the placement of the D415 camera in our RCRo system, the vehicle window is a large object in a captured image. Thus we discard the scale with the largest number of grid cells, $S_3 \times S_3$, in LF-YOLO. In this way, the depth of the network is reduced, which also decreases the computational complexity.
Finally, the four convolutional layers with batch normalization before each output layer are replaced with two ResUnits. With this design, the number of network layers is not increased, vanishing gradients are avoided, and features are reused. The network architecture of LF-YOLO is shown in Fig. 2(a), and the architecture of each basic module in LF-YOLO is shown in Fig. 2(b).
YOLOv3 includes five down-sampling operations. Each down-sampling, which is followed by several ResUnits, is realized by a convolutional operation with Stride = 2, i.e. S = 2. The backbone of YOLOv3 contains 74 layers in total. Here, we replace all ResUnits following each down-sampling with one Dense module, which yields a 65-layer backbone network.
The basic component DBL consists of a convolutional layer (Conv), batch normalization (BN), and an activation function (Leaky ReLU). DBL 1×1 denotes that the size of the convolutional kernel is 1×1. DBL S=2 denotes that the stride of the convolutional operation is 2. The Zero Padding operation is a convenient method to enhance edge features.
Fig. 2. Network architecture of the LF-YOLO. (a) Main network; (b) Five basic modules in main network.
The Dense module contains four DBL1&3 and four Concat [41] operations. In each Concat operation, the feature maps from the previous Concat operation and the DBL1&3 are concatenated to increase depth, which requires all feature maps to have the same size. In this manner, deeper feature maps can be obtained with fewer network layers. Following the network-in-network idea, a 1×1 convolutional kernel is used in the Dense module, which greatly reduces the number of channels in each convolution. As a result, both the number of parameters and the computational complexity are reduced.
For the outputs of the two scales, Y1 represents the $S_1 \times S_1$ output of scale one, which is down-sampled 32 times. Y2, the $S_2 \times S_2$ output of scale two, is obtained by up-sampling the output of layer 71 and concatenating it with the output of layer 52, and is eventually down-sampled 16 times. For the convenience of readers, the parameters of each layer of the LF-YOLO are listed in Table A1 in the Appendix.
With the visual detector designed, manipulator motion is controlled under an eye-to-hand configuration based on ROS. The LF-YOLO model and the depth camera-based vision system, which are responsible for window detection, are integrated into ROS. The pipeline of vision and motion is shown in Fig. 3.
When a captured two-dimensional (2D) image is input into LF-YOLO, the 2D pixel coordinates of the detected rectangular box around the vehicle window are output. The corresponding three-dimensional (3D) coordinates of the vehicle window are found through the Software Development Kit (SDK) [42]. The transformation between the camera coordinate system and the robot coordinate system is written as:
where $P_r=[x_r,y_r,z_r]^T$ denotes the coordinates in the robot coordinate system, and $P_c=[x_c,y_c,z_c]^T$ represents the coordinates in the camera coordinate system. θ = -0.21 rad and φ = -0.174 rad denote the attitude parameters of the homogeneous transformation matrix, and the position parameters along x, y, and z are -140 mm, 460 mm, and 189 mm, respectively.
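A homogeneous transformation of this kind can be sketched as a rotation followed by a translation. The parameter values below come from the text, but the rotation axes and their order are NOT stated there: this sketch assumes, purely for illustration, that θ is a rotation about the y axis and φ a rotation about the z axis, composed as Rz(φ)·Ry(θ). The function names are hypothetical.

```python
import math


def rot_y(a):
    """Elementary rotation matrix about the y axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    """Elementary rotation matrix about the z axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def camera_to_robot(p_cam, theta=-0.21, phi=-0.174, t=(-140.0, 460.0, 189.0)):
    """Map a camera-frame point (mm) to the robot frame: p_robot = R @ p_cam + t.

    ASSUMPTION: R = Rz(phi) @ Ry(theta); the true axis convention is not given in the text.
    """
    r = matmul3(rot_z(phi), rot_y(theta))
    return tuple(sum(r[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


# The camera origin maps to the translation vector regardless of the rotation convention
print(camera_to_robot((0.0, 0.0, 0.0)))
```

Whatever the true convention, the structure is the same: only the matrix `r` would change if the attitude angles refer to different axes.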
An Octree is used to convert the 3D point cloud into an obstacle model, which is visualized in ROS through the Planning Scene. In particular, to avoid collision between the manipulator and the vehicle, all objects in the depth camera view are treated as obstacles in our design. Besides, we consider a safe distance δ = 60 mm: a normal vector with a magnitude of 60 mm is constructed at the center of the detected rectangular box, pointing away from the vehicle. We then set the endpoint of this vector as the terminal position for the manipulator motion.
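The safe-distance construction amounts to offsetting the window center along a unit normal. A minimal sketch follows, assuming the center and the outward normal are already expressed in the same frame and in millimetres; the function name and the example coordinates are hypothetical.

```python
import math


def terminal_position(center, outward_normal, delta=60.0):
    """Offset the box center by delta (mm) along the normal pointing away from the vehicle."""
    norm = math.sqrt(sum(c * c for c in outward_normal))
    if norm == 0.0:
        raise ValueError("normal vector must be non-zero")
    return tuple(p + delta * c / norm for p, c in zip(center, outward_normal))


# Example: window center 500 mm ahead, normal pointing back toward the robot (assumed values)
print(terminal_position((500.0, 0.0, 300.0), (-2.0, 0.0, 0.0)))
```

Normalizing the input means the caller can pass any vector with the right direction; the offset magnitude is always exactly δ.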
The trajectory planning of the manipulator is implemented with MoveIt!. A trajectory that avoids environmental obstacles can be generated using the OMPL algorithm in motion_planner. This trajectory is published as an Action and received by an embedded board, which then sends it to each drive of the manipulator via CANopen. In this way, the manipulator extremity moves to the vehicle window.
Fig. 3. Pipeline of vision and motion for manipulator.
Inspired by the task scenario [43], we take transparent and mirrorlike vehicle windows into consideration. In these cases, we cannot directly use the depth value at the central point of the detected rectangular box. When the window is transparent, the depth value measured by the depth camera corresponds to a point inside the vehicle. When the window acts like a mirror, the depth value corresponds to a point on an object that is visible through the mirrorlike window. Both cases are undesirable, so we calculate the depth value from the lower left and lower right corners of the detected rectangular box (the depth values of the two corners are denoted by $x_{LL}$ and $x_{LR}$ in the camera coordinate system) and assume that the depth value is equal to the average of these two depth values. Finally, the terminal object position $P$ in the camera coordinate system can be described as:
where UR denotes the upper right corner of the detected rectangular box.
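The depth-averaging step can be illustrated with a short sketch. It covers only the averaging described in the text, not the full position equation; the function name, the rejection of non-positive readings, and the example values are assumptions made for illustration.

```python
def window_depth(depth_lower_left, depth_lower_right):
    """Estimate the window depth as the mean of the two lower-corner depths (mm)."""
    if depth_lower_left <= 0 or depth_lower_right <= 0:
        # Assumed convention: a non-positive reading means no valid depth was measured
        raise ValueError("invalid depth reading")
    return 0.5 * (depth_lower_left + depth_lower_right)


# Illustrative readings: the box center 'sees' through a transparent window
center_depth = 1450.0
corner_depth = window_depth(980.0, 1020.0)
print(center_depth, corner_depth)
```

The lower corners are more likely to fall on or near the opaque door frame than the box center, which is the motivation given in the text for not using the center reading directly.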
Two experiments were conducted to evaluate our system. In the first experiment, the overall performance of LF-YOLO was verified quantitatively through ample tests. In the second experiment, the multiple inspection functions of the RCRo system were tested qualitatively.
In this subsection, the improved methods of YOLO were experimentally verified through a step-by-step strategy, which was used to verify the performance of each adopted method. The performance of YOLOv3 and YOLOv4 is given for comparison. The test results are also analyzed and discussed.
5.1.1. Object detection platform
The workstation, equipped with an NVIDIA GeForce RTX 2080 Ti GPU, was responsible for both training the network models and detecting objects. The CPU was an Intel Core i7-9700. The operating system was Ubuntu 18.04 with ROS Melodic. OpenCV 3.4.6, Darknet, CUDA 10.0.130, and cuDNN 7.6.0 were also installed on the workstation.
5.1.2. Making of dataset
We made a specific dataset for training because the correctness of the original image labels directly affects the localization accuracy. A total of 2367 square images of vehicles with different colors and types were randomly captured against different backgrounds, including roads, open parking lots, and underground parking lots. Moreover, referring to the various properties of vehicle windows [43], four cases were also considered: windows that were transparent, black, or mirrorlike, and windows with the driver inside. To enhance the dataset, zooming and adjustment of the lighting conditions were adopted during dataset processing. The purposes of the varied image collection and dataset enhancement are to ensure the diversity of the training dataset (see Fig. 4) and to improve the generalization of the model. The procedure for making the dataset is as follows. First, the lightness of 900 images randomly selected from the 2367 images is turned down. Second, the lightness of 200 images randomly selected from the other 1467 images is turned up. After these two operations, all 2367 images are randomly split into a training dataset and a test dataset at a ratio of 8:1. The test dataset, comprising 263 images, does not participate in training. Finally, the training dataset, comprising 2104 images, undergoes the zooming operation, which is a built-in function of the Darknet framework; hence the size of each square image is randomly selected in the range from 320 pixels to 608 pixels with a step of 32.
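The split sizes and the set of training resolutions can be checked with a short sketch. It only reproduces the arithmetic of the procedure above; the function name, the use of integer indices in place of image files, and the random seed are assumptions for illustration.

```python
import random


def split_dataset(items, ratio=(8, 1), seed=0):
    """Randomly split items into (train, test) at the given train:test ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = len(items) * ratio[1] // sum(ratio)
    return items[n_test:], items[:n_test]


train, test = split_dataset(range(2367))  # indices stand in for the 2367 images
print(len(train), len(test))

# Candidate square input sizes for the zooming (multi-scale) operation
sizes = list(range(320, 608 + 1, 32))
print(len(sizes), sizes[0], sizes[-1])
```

The step of 32 matches the coarsest down-sampling factor of the network, so every candidate size yields an integer number of grid cells.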
5.1.3. Clustering analysis
Using the original anchor boxes [15] would increase the training time and decrease the detection accuracy. To avoid these problems, a new clustering analysis of the labeled boxes was carried out to obtain anchor boxes better suited to the real window size at the detection scene.
There are two scales in our network. Each scale contains three anchor boxes, which means the total number of anchor boxes is 6, i.e. k = 6. K-means++ was applied to calculate the sizes of the anchor boxes using the objective functions in Eq. (4) and Eq. (5), respectively. The results of K-means are also listed in Table 1 for comparison. Compared with K-means, the values of both objective functions obtained by K-means++ are smaller, and the clustered sizes of the anchor boxes are also different. As a result, the anchor boxes clustered by K-means++ are adopted to train our network in what follows. In addition, the sizes of the anchor boxes obtained with the two objective functions are clearly different from each other and from the original anchor boxes of YOLOv3. Whether these differences lead to different detection performance is validated after model training.
5.1.4. Results and overall performance of trained models
In this subsection, the improved methods of YOLO were experimentally verified through a step-by-step strategy. Step one presents the performance of the designed network and the clustering method with four models. Step two illustrates the performance of GIoU with another four models.
Fig. 4. Samples of representative images captured and processed for training. (a) Road; (b) Open parking lot; (c) Underground parking lot; (d) Transparency; (e) Black; (f) Mirror image; (g) Driver in vehicle; (h) High lighting conditions; (i) Low lighting conditions.
Table 1 Results of clustering analysis.
During training of the weight models, to accelerate training and effectively prevent overfitting, the basic training hyperparameters were set empirically as follows: batch was 128, subdivisions was 64, weight decay was 0.0005, initial learning rate was 0.001, ignore_threshold was 0.5, max_batches (number of iterations) was 50,200, and Class was 1 (only window).
5.1.4.1. Step one: the roles of network and anchor boxes
In this step, four trained models were compared and discussed: the original YOLOv3 network with original anchor boxes (named YOLOv3), the original YOLOv3 network with IoU anchor boxes (named Model 1), the improved network with original anchor boxes (named Model 2), and the improved network with IoU anchor boxes (named Model 3). During training, the values of Loss and IoU were dynamically recorded, as shown in Fig. 5.
As Fig. 5(a) shows, the four loss curves gradually decrease and become stable as the training iterations increase. When the iteration count reaches 50,200, the loss values have decreased to 0.06, 0.055, 0.05, and 0.048, which means the four trained networks have all converged. Ordered from the largest to the smallest terminal loss value, the corresponding models are YOLOv3, Model 1, Model 2, and Model 3, respectively. Comparing the four curves, the two original networks converge earlier and also stop decreasing earlier than the two improved networks.
As Fig. 5(b) shows, the four IoU values at the beginning of each curve are all about 0. All the IoU values gradually increase with the training iterations. At the end of training, the terminal values of the two improved networks reach 0.9. According to Fig. 5, the trends of all curves are generally in line with expectations, which preliminarily indicates that the models are eligible. It also gives evidence that our improved network alleviates the vanishing-gradient problem, which makes its two loss values lower, while the two loss values of the original YOLOv3 network stop decreasing earlier.
After training, the final-weight file of each model was used to detect images to validate model performance. The vehicle window is marked automatically with a bounding box, and the corresponding confidence value is shown. Example results are shown in Fig. 6, where the test images were randomly downloaded from the internet.
From left to right, the detected images correspond to Model 3, Model 2, Model 1, and YOLOv3. As can be clearly seen, for the same image, the detection result of Model 3 is the closest to the ground truth among the four models.
Next, we evaluated the detection performance of Model 3 with its final-weight file in detail. The other three models were used for comparison. The test dataset (named 263-dataset), containing 263 images, was detected by each of the four models. Since we focused on the accuracy of window localization, which directly affects the manipulator motion, the IoU was used as the performance measure. The IoU, with IoU∈[0,1], is calculated by Eq. (2). The detection results are shown in Fig. 7(a), where the IoU is divided into seven intervals (i.e. IoU=0, IoU∈(0,0.5), IoU∈[0.5,0.7), IoU∈[0.7,0.8), IoU∈[0.8,0.9), IoU∈[0.9,0.95), IoU∈[0.95,1)) and the number of IoU values in each interval is counted. The analyses of the results are listed in Table 2, where the average IoU is the mean of the IoU values over all tested images.
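The interval counting and the average IoU can be sketched as below. The interval edges follow the text; the function name, the label strings, the choice to place an IoU of exactly 1 in the last bin, and the example values are assumptions for illustration and do not reproduce any figure from Table 2.

```python
from bisect import bisect_right

EDGES = [0.5, 0.7, 0.8, 0.9, 0.95]
LABELS = ["IoU=0", "(0,0.5)", "[0.5,0.7)", "[0.7,0.8)", "[0.8,0.9)", "[0.9,0.95)", ">=0.95"]


def summarize_ious(ious):
    """Count IoU values per interval and return (counts, average IoU)."""
    counts = {label: 0 for label in LABELS}
    for v in ious:
        if not 0.0 <= v <= 1.0:
            raise ValueError("IoU must lie in [0, 1]")
        # IoU = 0 gets its own bin; other values are located by their left-closed interval
        idx = 0 if v == 0.0 else 1 + bisect_right(EDGES, v)
        counts[LABELS[idx]] += 1
    return counts, sum(ious) / len(ious)


counts, average = summarize_ious([0.0, 0.3, 0.5, 0.85, 0.9, 0.95, 1.0])  # made-up values
print(counts)
print(round(average, 4))
```

Because images with an IoU of 0 are included in the mean, the average IoU penalizes missed windows as well as loose boxes.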
As shown in Fig. 7(a), with Model 3 the IoU values of 231 images are greater than 0.8, and those of only 6 images are less than 0.5. For the other three models, the numbers of IoU values above 0.8 are 203, 219, and 223, respectively, and the corresponding numbers of IoU values below 0.5 are 10, 7, and 7, respectively.
As Table 2 shows, compared with the other three models, Model 3 has the smallest missing rate and the highest average IoU. Besides, Model 3 has the shortest average detection time and the smallest weight file. Its training time is clearly shorter than that of both YOLOv3 and Model 1.
Based on the results of Figs. 5-7(a) and Table 2, Model 3 has converged successfully and, among the four models, has the highest localization accuracy and detection efficiency, the shortest training time, the lowest final loss value, and the smallest trained weight file; hence Model 3 is deemed the best detection model in step one.
Fig. 5. Dynamic values in training of the models in Step one and Step two. (a) Loss; (b) IoU.
Fig. 6. Example results using (left to right) Model 3, Model 2, Model 1 and YOLOv3. Ground truth is shown by yellow lines and predictions are represented with cyan lines.
Besides, comparing the results obtained with the same anchor boxes but different networks, the overall performance (average IoU, average detection time, and missing rate) of our improved network is better than that of the original YOLOv3 network for both the original boxes and the IoU boxes. This means the designed 65-layer backbone network based on DenseNet extends the residual performance and leads to a lower final loss value. It also makes the network more efficient owing to the decreased number of network layers. The two-scale network with small feature maps not only decreases the depth of the network and the computational complexity, but also leads to higher sensitivity to large objects, shorter training time, and shorter detection time.
Moreover, comparing the results obtained with the same network but different anchor boxes, we find that the detection performance with clustered anchor boxes is better than that with the original anchor boxes for both the original network and the improved network, which shows that the optimized clustering method yields more accurate bounding box regression. Therefore, it makes sense to further discuss the adoption of the GIoU-based clustering method.
5.1.4.2. Step two: the role of GIoU. The results from step one show that the network architecture we designed performs better and that the clustered anchor boxes also improve the results. We then applied the GIoU algorithm to our designed model, i.e., the GIoU anchor boxes and the Gloss function were adopted to train the model. Six trained results were used for comparison: Model 2, Model 3, the improved network with GIoU anchor boxes (named Model 4), the improved network with the Gloss function and original anchor boxes (named Model 5), the improved network with the Gloss function and IoU anchor boxes (named Model 6), and the improved YOLO network with the Gloss function and GIoU anchor boxes, i.e., the LF-YOLO model.
The corresponding training values of loss and IoU are also shown in Fig. 5. The six terminal loss values are all below 0.1, and the corresponding IoU values all reach 0.9 by the end of the iterations, which means the four new models have converged. Moreover, the terminal loss values obtained with the Gloss function are lower than those obtained with the original loss function, which indicates that GIoU avoids the vanishing-gradient problem and provides continuous optimization.
The four new final-weight files of the trained models are each used to detect the 263-dataset. The detection results are also shown in Fig. 7(a), and the corresponding analyses are listed in Table 2. From Fig. 7(a), with LF-YOLO the IoU values of 80 images are greater than 0.95, those of 256 images are greater than 0.8, and those of 3 images are less than 0.5. For the other three models, the numbers of IoU values above 0.95 are 41, 48, and 58, respectively; the numbers above 0.8 are 243, 247, and 250, respectively; and the numbers below 0.5 are 8, 2, and 3, respectively.
From Table 2, compared with the other designed models, LF-YOLO has the lowest missing rate. Although its average detection time is not the shortest, it is only 3.97% longer than that of the fastest model, and LF-YOLO is 3.15 times faster than YOLOv3.
To sum up, LF-YOLO has converged. Among the six models, it has the highest localization accuracy (compared with YOLOv3, the average IoU increases by 9.33%), nearly the highest detection efficiency, and nearly the shortest training time. Therefore, LF-YOLO can be deemed the best detection model across step one and step two in our experiment.
Besides, the three models trained with the Gloss function have higher average IoU values and lower missing rates than the three models trained without it. Likewise, each result with GIoU anchor boxes has a higher average IoU and a lower missing rate than the two corresponding results with the same loss function but other types of anchor boxes. This means that the Gloss function and the GIoU-based clustering method both help to enhance the localization accuracy, and that applying them together improves it further. It also indicates that our improvement is feasible.
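To make the role of GIoU concrete, the following sketch computes IoU and GIoU for two axis-aligned boxes using the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union). Treating the loss as 1 - GIoU is an assumption made here for illustration; the exact form of the Gloss function is not given in this section, and the function name is hypothetical.

```python
def iou_and_giou(a, b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap is clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest box enclosing both inputs.
    enclosing = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclosing - union) / enclosing
    return iou, giou

truth = (0.0, 0.0, 1.0, 1.0)
iou_near, giou_near = iou_and_giou(truth, (2.0, 0.0, 3.0, 1.0))  # disjoint, close
iou_far, giou_far = iou_and_giou(truth, (4.0, 0.0, 5.0, 1.0))    # disjoint, farther
loss_near = 1.0 - giou_near  # assumed loss form
loss_far = 1.0 - giou_far
```

For both disjoint predictions the IoU is zero, so an IoU-based loss cannot tell them apart, whereas the GIoU-based value is larger for the farther box. This is the sense in which GIoU keeps providing a useful training signal when boxes do not overlap.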
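The per-model statistics quoted from Fig. 7(a) and Table 2 can be reproduced from a list of per-image IoU values with a short sketch such as the one below. It assumes that an image with no output box is recorded as `None`, that the missing rate is the fraction of such images, and that the average IoU is taken over detected images only; these conventions and the function name are assumptions, not details confirmed by the paper. The missing rates reported later for YOLOv4 (0.76% of 263 images and 8.41% of 333 images) are at least consistent with 2/263 and 28/333.

```python
def summarize_ious(ious):
    """ious: one entry per test image; None marks an image with no output box."""
    total = len(ious)
    detected = [v for v in ious if v is not None]
    return {
        "above_0.95": sum(1 for v in detected if v > 0.95),
        "above_0.8": sum(1 for v in detected if v > 0.8),
        "below_0.5": sum(1 for v in detected if v < 0.5),
        # assumed definition: share of images without any output box
        "missing_rate": (total - len(detected)) / total,
        # assumed convention: average over detected images only
        "average_iou": sum(detected) / len(detected) if detected else 0.0,
    }

# Hypothetical per-image results for five test images.
example = summarize_ious([0.97, 0.85, 0.91, 0.42, None])
```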
Fig.7.Distribution of IoU values for models in Step one and Step two and YOLOv4.(a) Test with 263-dataset;(b) Test with 333-dataset.
Table 2 Overall performance of models in Step one and Step two with 263-dataset.
5.1.5.Performance validation with more random test images
When verifying generalization, although no image appears in both the training dataset and the test dataset, the two datasets still share some characteristics, such as the same background, the same lighting conditions, and the same camera. These similarities improved the results, even though this was not intended by the dataset maker. We therefore further increased the randomness of the test dataset by including images from the internet, different vehicles with different backgrounds, and different lighting conditions.
The new test dataset (named 333-dataset) consists of 333 images: 166 downloaded from the internet and 167 captured outdoors. Fig. 7(b) shows the performance of each model with the 333-dataset. With LF-YOLO, the IoU values of 249 images exceed 0.8, and those of only 24 images are less than 0.5. No other model from step one or step two achieves the localization accuracy of this model. Example results of LF-YOLO with the 333-dataset (Fig. 8(a1)-(a5)) show that detection is robust to transparent windows, black windows, mirrorlike windows, driver-inner windows, and windows under low lighting conditions. In Fig. 8(a1)-(a5), the upper images were downloaded from the internet and the lower images were captured outdoors. Together with the above experiments, these results show that LF-YOLO is the model with the best overall performance and is suitable for our detection scene.
Clearly, the average IoU with the 333-dataset is lower than that with the 263-dataset for every model. The main factor is that some of the selected test images are not appropriate, which has been recorded. Further analysis shows that the images with IoU values below 0.5 were mainly downloaded from the internet. For the eight models, the numbers of such images downloaded from the internet are 39(53), 37(55), 33(45), 33(44), 18(22), 31(37), 20(25), and 20(24), respectively (each number in parentheses is the total number of images with IoU values below 0.5 for that model). There are two main reasons. First, some images have low resolution. Second, some images are computer graphics from official vehicle websites, which differ somewhat from real vehicles, whereas all images in our training dataset were captured from real vehicles. Nevertheless, LF-YOLO is still the best model compared with the other seven models, which indicates that the results are reasonable and that the improvement of our model is successful.
Fig.8.Results of vehicle window detection.(a1)-(a5)Example results of five typical windows using LF-YOLO with 333-dataset;(b)P-R curves and corresponding AP values of each model with 333-dataset;(c1)-(c4) Example results of high IoU but low confidence.
5.1.6.Performance comparison of LF-YOLO with YOLOv4
YOLOv4 was trained with the same training dataset and the same hyperparameters, and tested with the 263-dataset and the 333-dataset, respectively. The average IoU values of the 18000-weight and the final-weight were calculated. Training took 216 h in total, of which the 18000-weight took 69 h. The results are shown in Fig. 7.
With the 263-dataset, the average IoU of the 18000-weight is 0.8826, which shows that LF-YOLO has better localization accuracy for the same training time. The average IoU of the final-weight is 0.8606, lower than that of the 18000-weight. We inferred that the trained YOLOv4 was overfitting; hence the average IoU of each weight was calculated. The 33000-weight achieves the highest average IoU, 0.9018, which is still lower than that of LF-YOLO. The file size of the 33000-weight is 9.0 GB, and the missing rate is 0.76%.
To avoid the aforementioned drawbacks of the 263-dataset, the 333-dataset was also tested. The average IoU of the 18000-weight is 0.6774, which means LF-YOLO has better localization accuracy for the same training time. The average IoU of the final-weight is 0.6145, lower than that of the 18000-weight. The average IoU of each weight was again calculated. The 35000-weight achieves the highest average IoU, 0.7218, which is likewise lower than that of our LF-YOLO. The average detection time of YOLOv4 is 25.28 ms, the file size of the 35000-weight is 9.5 GB, and the missing rate is 8.41%.
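The weight selection described above amounts to evaluating every saved weight file on the test set and keeping the one with the highest average IoU. A minimal sketch follows; the helper name is hypothetical, and the dictionary holds only the three 263-dataset values reported in the text, not the full set of evaluated weights.

```python
def best_checkpoint(avg_iou_by_weight):
    """Return the (name, average IoU) pair with the highest average IoU."""
    return max(avg_iou_by_weight.items(), key=lambda item: item[1])

# Average IoU of YOLOv4 weights on the 263-dataset, as reported in the text.
yolov4_263 = {
    "18000-weight": 0.8826,
    "33000-weight": 0.9018,
    "final-weight": 0.8606,
}
best_name, best_iou = best_checkpoint(yolov4_263)
```

The fact that an intermediate weight scores higher than the final weight is what supports the overfitting inference.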
To summarize,LF-YOLO has the best generalization among all the models through two test datasets.This detector has also been found to be more effective during actual object detection,in terms of decreased average detection time by 68.25% and 60.60%,and increased average IoU by 20.74% and 6.79% with 333-dataset compared to YOLOv3 and YOLOv4.
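The detection-time figures can be cross-checked with simple arithmetic. The sketch below uses the YOLOv4 detection time of 25.28 ms from the previous subsection and the LF-YOLO detection time of 9.96 ms stated in the conclusions. The YOLOv3 detection time is not given in this section, so the value computed here is only the time implied by the reported 68.25% reduction, not a measured result.

```python
def relative_reduction(baseline, new):
    """Fractional decrease of `new` relative to `baseline`."""
    return (baseline - new) / baseline

lf_yolo_ms = 9.96   # average detection time of LF-YOLO (from the conclusions)
yolov4_ms = 25.28   # average detection time of YOLOv4

reduction_vs_v4 = relative_reduction(yolov4_ms, lf_yolo_ms)

# Implied (not reported) YOLOv3 time, derived from the stated 68.25% reduction.
implied_yolov3_ms = lf_yolo_ms / (1 - 0.6825)
speedup_vs_v3 = implied_yolov3_ms / lf_yolo_ms
```

Under these inputs the reduction relative to YOLOv4 rounds to 60.60%, and the implied YOLOv3 time is about 31.4 ms, which corresponds to the factor of 3.15 quoted for the 263-dataset.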
5.1.7.Discussions of average IoU and mAP
In general, mean Average Precision (mAP) is used to measure model accuracy, so we explain here why average IoU is used instead of mAP in this paper. Our detection model contains only one class, which means mAP = AP in this paper. The Precision-Recall (P-R) curves are shown in Fig. 8(b). The AP values of the nine models with the 333-dataset are listed in Table 3, where the IoU threshold is set to 0.5. The AP of Model 2 is lower than that of Model 1, although Model 2 has a higher average IoU than Model 1. Similarly, the AP of YOLOv4 is higher than that of Model 4, although YOLOv4 has a lower average IoU than Model 4. When calculating mAP, all bounding boxes and their confidences are involved, and some of these bounding boxes have high IoU but low confidence (see Fig. 8(c1)-(c4)). Such bounding boxes are not output when an actual object detection command is executed [44], because YOLO has no means of knowing where the ground truth is and must rely on high confidence.
As we can see, bounding boxes with high IoU but low confidence contribute nothing in object detection tasks, yet they influence the mAP value. Therefore, as our performance measurement, the average IoU also drops the bounding boxes with high IoU but low confidence [38]. For example, in Fig. 8(c1)-(c4), the bounding boxes with confidences of 80.6% and 91.74% are output for the first two images, and no box is output for the last two images. As a result, the mAP value can be high while the average IoU is not (e.g., for Model 6, the mAP is as high as 0.9169 while the average IoU is only 0.7465). IoU, however, is a direct and practical measurement. In addition, cases such as those in Fig. 8(c1)-(c4) are rare, and the high-confidence results remain reliable. Consequently, we chose average IoU as the measurement.
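The gap between AP and average IoU can be illustrated with a simplified sketch. It assumes one candidate box per ground-truth object, a non-interpolated AP, a hypothetical output confidence threshold of 0.25, and that an image whose box is not output contributes an IoU of 0 to the average. None of these conventions is confirmed by the paper, and the four (confidence, IoU) pairs are invented for illustration.

```python
def average_precision(detections, iou_thresh=0.5):
    """Non-interpolated AP; detections are (confidence, iou) pairs,
    assuming one candidate box per ground-truth object."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, iou) in enumerate(ranked, start=1):
        if iou >= iou_thresh:
            true_pos += 1
            precision_sum += true_pos / rank  # precision at this recall step
    return precision_sum / len(detections)

def average_output_iou(detections, conf_thresh=0.25):
    """Average IoU when only boxes above conf_thresh are output;
    suppressed boxes count as IoU 0 (an assumption)."""
    kept = [iou if conf >= conf_thresh else 0.0 for conf, iou in detections]
    return sum(kept) / len(kept)

# Two confident boxes and two well-placed but low-confidence boxes.
boxes = [(0.92, 0.90), (0.81, 0.88), (0.20, 0.91), (0.15, 0.89)]
ap = average_precision(boxes)
avg_iou = average_output_iou(boxes)
```

Here every box overlaps the ground truth well, so the AP is 1.0, yet only two boxes would actually be output, so the thresholded average IoU is 0.445. The example mirrors the argument that AP credits boxes a deployed detector never reports.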
We conducted a security multi-inspection experiment to test the functions of the RCRo system with the corresponding inspection line. The experimental scenario is shown in Fig. 1(a)-(b). Two cases were considered. Other functions and methods, including the 6-DOF manipulator, LF-YOLO, real-time object detection, motion control, and collision-free manipulator motion, were indirectly validated at the same time.
Case 1: A driver drove into the inspection area with his ID card and kept the vehicle window closed until the manipulator moved to the window.
The test results are shown in Fig. 9. In Fig. 9(a), the window is correctly detected, and the corresponding depth values of the bounding box are shown in Fig. 9(b). The simulated motion planning of the manipulator is displayed in ROS-Rviz, as shown in Fig. 9(c); the blue cubes represent the simulated obstacle model. In this case, the manipulator extremity moved to the window without collision (see Fig. 9(d)). When the master computer received the driver's identity information, the manipulator returned to its initial posture. The multi-inspection results displayed on the GUI are shown in Fig. 9(e), where all inspection results are qualified. Finally, the barrier gate lifted automatically.
Case 2: Another driver drove into the inspection area with the first driver's ID card and kept the vehicle window open the whole time.
The result of window detection is shown in Fig. 9(f). The multi-inspection results (see Fig. A1 in the Appendix) indicate that the identity information does not match, so the barrier gate does not lift. As can be seen from Fig. 9(g)-(h), when the DOF is more than three, not only can the manipulator move near the vehicle window, but the rotation of the fifth joint also keeps the identity device parallel to the window surface, which gives people a more comfortable interaction. Furthermore, when the driver parks parallel to the lane, the 3-DOF robot cannot reach a posture parallel to the window, whereas a 6-DOF manipulator achieves it easily. Using a 6-DOF manipulator is therefore advantageous.
The feasibility of the RCRo system is verified qualitatively through two representative tests. The RCRo system completed the functional tests of the security inspections, including identity, vehicle-features, and under-vehicle inspections. Moreover, experiment 2 also illustrates the ability of the 6-DOF manipulator, the effectiveness of the LF-YOLO algorithm and real-time detection, and the feasibility of the motion control scheme and collision-free manipulator motion. It is worth mentioning that the window is a typical mirrorlike window in Case 1 and a driver-inner situation in Case 2. Both detection results are as desired, which shows the robustness of LF-YOLO owing to the diverse window properties in the training dataset.
Table 3 Overall performance of each model with 333-dataset.
Fig.9.Results of functional tests.(a)-(e) are results of Case 1:(a) Vehicle window detection;(b) Depth values corresponding to Fig.9(a);(c) Planned terminal posture of manipulator in ROS;(d) Actual terminal posture of manipulator;(e) GUI and multi-inspection results.(f)-(h) are results of Case 2 under different conditions.
In this work, the RCRo system is proposed for intelligent security inspection. LF-YOLO-based real-time object detection, vision-based motion control, and collision-free motion are incorporated into the RCRo system. The identity, vehicle-features, and under-vehicle inspections are also integrated into the RCRo system, which can execute multi-inspection work at road checkpoints.
LF-YOLO is specifically constructed for detecting large objects at higher speed. The experimental results show that the Dense-based network connected to a two-scale network with small feature maps not only decreases the network depth and computational complexity, but also leads to a lower final loss value, shorter detection time, and higher detection accuracy; that the optimized clustering method yields more accurate bounding box regression; and that the use of GIoU further improves the average IoU owing to more accurate anchor boxes and continuous optimization during training.
In the experiments, the detector achieves both high detection efficiency and high localization accuracy in detecting large objects. In particular, the average detection time is only 9.96 ms. Furthermore, the detection speed, average IoU, and training time of LF-YOLO are significantly better than those of YOLOv3 and YOLOv4.
It can be concluded from the extensive vehicle window detection in experiments 1 and 2 that LF-YOLO remains robust when processing transparent, black, mirrorlike, and driver-inner windows, which confirms that a training dataset containing diverse vehicle windows is useful. The above results demonstrate the adaptability of LF-YOLO to our detection scene.
The feasibility of the RCRo system is verified through functional tests.The results show the possibility of autonomous inspections covering identity,vehicle-features,and under-vehicle and the reasonableness of formulated inspection line,as well as the correctness of multi-inspection results.The functional tests also indicate the selection of a 6-DOF manipulator is reasonable and better than using the 3-DOF robot.To sum up,our RCRo system can replace human labor and carry out multi-inspection work.
Besides, we do not yet take on-line motion planning into account, which means the manipulator cannot sense an object that suddenly appears while it is moving. Solving this problem is left for future work.
This work was supported by the National Key Research and Development Program of China(grant number:2017YFC0806503).
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.dt.2021.04.004.
Table A1 LF-YOLO network and its parameters.
Fig.A1.The multi-inspection results of Case 2.