ZHANG Rui
(Robotics and Computer Vision Laboratory, Dept. of Electrical and Computer Engineering, Queen’s University, Canada)
A natural scene, also called a real-world scene, generally refers to a real physical environment with specific temporal and spatial boundaries. We always live in some specific natural scene, and we experience many different kinds of natural scenes in daily life. Consequently, our visual perception and cognition systems have acquired powerful natural scene analysis abilities through long-term natural evolution. One well-known ability of this kind, natural scene classification (identifying the type of a natural scene), has been regarded as one of the important technologies for next-generation computer vision systems. Natural scene classification can be used in many cutting-edge computer vision applications, such as smart video surveillance, unmanned vehicles, autonomous mobile robotics, and content-based image retrieval (CBIR)[1,4]. In those intelligent systems, natural scene classification usually serves as an important component of various high-level vision tasks.
Classifying a natural scene always involves two kinds of capabilities of the human brain: visual perception and cognition. Visual perception mainly refers to the visual information processing of the human visual system (HVS). It roughly falls into two stages: ① receiving the visual signal from the outside world; ② extracting effective visual information from the visual signal. The visual information is then transmitted to the cognition system in our brain for further processing, which is called visual cognition. The major purpose of visual cognition is to identify and understand the visual information captured during visual perception. Without visual perception, the brain has no way to acquire visual information and cognize what the world looks like; if the visual perception abilities are inadequate or disordered, the brain will go wrong in its visual cognition of the external world. On the other hand, if we lose our visual cognitive abilities, we will completely fail to identify and understand what we are looking at; and if those abilities are inadequate or disordered, we will have trouble recognizing things and comprehending what happens in front of our eyes. Therefore, any successful visual activity relies on both good visual perception and good visual cognition. The cooperation between the visual perception system and the cognition system in our brain gives us efficient, accurate and robust visual abilities.
However, most of the visual perception and cognition mechanisms that the brain uses for natural scene classification are still unknown. This issue has perplexed researchers in computer vision for decades. In the early stage, Marr’s era, the study of scenes mainly focused on indoor scenes. Since indoor scenes contain abundant objects, it was widely accepted that object identification was the basis of scene classification. It was believed that different collections and arrangements of objects corresponded to different types of scenes. In this “objects to scene” strategy, the perception of a scene is identified with the perception of objects, and the cognition of a scene is the analysis of the collections and arrangements of those objects. As more types of scenes were investigated, outdoor scenes became common. Outdoor scenes are usually composed of large geologic surfaces and physical substances, such as sky, water, grass, ground and buildings. Those surfaces and substances are distinct regions without regular shapes or contours. Hence, it is impossible to extract stable structural features from outdoor scenes via object identification. Consequently, a new “regions to scene” strategy was proposed. In this methodology, scene perception is decomposed into the perception of the regions in the scene, and scene cognition synthesizes the results of region identification. Both hypotheses seem consistent with our daily experience, but their computer implementations encounter great challenges. Object recognition is still an open problem. The objects in a natural scene are random and diverse, so it is difficult to fix a set of objects that corresponds only to a specific scene. For the region-based strategy, the major problem comes from the unstable performance of region segmentation. For natural scenes with simple spatial composition, the region-based approach is ideal; but as the complexity of the scene content increases, the performance of region segmentation decreases sharply. These issues have led researchers to question the rationality of “parts before whole” thinking and to refocus on the intrinsic visual perception and cognition mechanisms that the human brain uses when it performs natural scene categorization.
In fact, a very important phenomenon of human scene understanding has been known for decades. Psychologists discovered that humans possess a truly remarkable ability to recognize the content of complex scene images at a brief glimpse[16]. Unfortunately, scientists had not found a way to reproduce this ability in computer vision systems. In recent years, psychologists reconfirmed that humans can perform rapid natural scene categorization without preceding object recognition or region segmentation[17]. Neuroscientists also provide powerful evidence. They use functional magnetic resonance imaging (fMRI) to record the brain’s activity while humans conduct rapid natural scene categorization. The fMRI results reveal that rapid natural scene categorization activates only the Parahippocampal Place Area (PPA), a cortical area responsible for all kinds of visual activities relevant to scenes, while the cortical area for object recognition is not activated. These findings break with traditional beliefs about scene categorization and bring a new perspective. It has been widely accepted that humans can achieve natural scene categorization in a very short time (less than 200 ms), so natural scene categorization is independent of object recognition or region recognition. In fact, psychological studies also indicate that the perception of the natural scene type usually precedes object perception or region perception in most scene analysis activities. When conducting rapid natural scene classification, the brain usually perceives the global features of the scene first, and then perceives different levels of local detail as needed. Of course, the more complex the scenes or visual tasks are, the more local details the brain needs. Interestingly, researchers[13] also found that some kinds of global features carry the semantic information of natural scene types. These particular global features are called gist features, which integrate the overall visual information in the natural scene. Psychologists assert that the human visual perception system can grasp the gist features of a scene in a very short time, and that the human cognition system is responsible for building a bridge between gist features and a priori semantic knowledge. Hence, we can roughly divide human rapid scene categorization into two steps. In the first step, our visual perception system extracts the gist features from a specific scene; the second step is cognitive mapping, in which our cognition system maps the gist features into the corresponding semantic categories.
Although the existence of gist features is affirmed, the biological process that generates them is unknown. So far, “gist features” remains only a concept for most computer vision scientists, who find their own ways to build concrete gist features from natural scene images. These gist features may take quite different forms, but they share a common trait: they are holistic features of scene images. However, different kinds of gist features contain different semantic and discriminative information. It has been found that some gist features correspond closely to the high-level semantics of natural scene images[13]. It is widely believed that if we can obtain excellent gist features from natural scene images, then we can easily identify the semantic categories of those images via their gist features. Therefore, most researchers currently focus on building effective gist features from natural scene images. On the other hand, many researchers also realize the significance of cognition in natural scene categorization. The most common cognitive mapping strategy is to build mappings between gist features and the semantic categories of natural scene images. This approach is based on similarity measurement theory, which is consistent with our daily experience. Similarity measurement, however, is only the basic level of our cognitive abilities. Memorization, reasoning and learning are the most important parts of cognition. Recently, the reasoning mechanism of cognition has received much more attention. Researchers have successfully adopted the Bayesian method to imitate the reasoning mechanism at work when humans identify different types of natural scenes[14]. The other important cognition mechanisms, however, still receive less attention.
In this paper, we focus on a biologically inspired approach to simulating the human capability of rapid natural scene categorization. We investigate feasible ways to simulate some important visual perception and cognition mechanisms, and we propose strategies to combine those mechanisms to improve the performance of automatic natural scene categorization. The rest of the paper is organized as follows: Section 2 describes our biologically inspired visual perception model, which is used to extract gist features from natural scene images. Section 3 discusses three kinds of cognition mechanisms and how to introduce them into natural scene categorization. Section 4 evaluates the performance of our method on several public natural scene datasets. Finally, Section 5 concludes the paper.
For psychologists, perceiving the “gist” of a natural scene is the key to rapid natural scene categorization in humans. For computer vision scientists, extracting the gist features from a natural scene image is the key to rapid natural scene categorization in computers. A gist feature contains the holistic spatial structure information of a natural scene image. “Holistic” here has two levels of meaning: at the lower level, it is the opposite of “local” or “regional”; at the higher level, it means “mixed” or “sum”. There is no difficulty in extracting a gist feature from a natural scene image at the lower level. However, extracting the gist feature at the higher level is a challenge, and this challenge is determined by the nature of natural scene images.
Natural scene images are photographs that record actual spatial structures of the real world. The 3D spatial structure signals of the real world are projected as 2D spatial structure signals in natural scene images. Therefore, any natural scene image can be regarded as a mixture of diverse 2D spatial structure signals, each of which carries specific spatial structure information (e.g. spatial scale and orientation) about the natural scene. However, it is impossible to extract genuine spatial structure information directly from the original natural scene image, since all these 2D spatial structure signals are mixed in a “natural” way. Here “natural” means random and complex, so the mixing rules are almost unknowable.
Fortunately, however, our brain knows how to decompose the mixed 2D spatial structure signals in natural scene images. Our visual perception system has evolved a special neural mechanism to deal with complex visual signals from natural scenes. In the primary visual cortex (V1) of the brain, there are plenty of so-called simple cells. Every simple cell has an oval-shaped receptive field, which can perceive a specific type of 2D spatial structure signal that enters our retina. Different types of simple-cell receptive fields differ in size, in principal direction, or in both. Correspondingly, the 2D spatial structure signals acquired by simple-cell receptive fields vary in scale and orientation. Consequently, a set of simple-cell receptive fields is similar to a spatial filter bank with oriented and spatial bandpass characteristics. With this biological filter bank in V1, our visual perception system can decompose any complex 2D spatial structure signals in natural scene images.
Figure 1 Model of rapid natural scene categorization
Inspired by this visual perception mechanism in V1, computer vision scientists have designed many mathematical tools to simulate the same function in the past few years. Among these methods, the 2D Gabor wavelet transform has been acknowledged as the most successful mathematical model for imitating the oriented and spatial bandpass properties of the simple-cell receptive fields in V1. Recently, the 2D Gabor wavelet transform has been successfully used for decomposing 2D spatial structure signals from natural scene images[18]. However, the 2D Gabor wavelet transform is not suitable for rapid natural scene categorization because of its low computational efficiency. Therefore, several kinds of 2D Gabor-like wavelet transforms have been proposed in recent years. Just as in the “plane vs. bird” case, these 2D Gabor-like wavelet transforms may not have the same biological similarity as the 2D Gabor wavelet transform, but they take advantage of the same “biodynamic principle”.
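To make the notion of an oriented, spatially bandpass filter bank concrete, the following minimal sketch builds the real part of a textbook 2D Gabor kernel and a small bank of such kernels. It is only an illustration: the function name `gabor_kernel`, the parameter values, and the choice of two scales and four orientations are assumptions made here, not settings taken from this paper or from [18].

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real (even-symmetric) part of a 2D Gabor filter, as a list of rows.

    size       : odd kernel width/height in pixels
    wavelength : period of the cosine carrier (sets the spatial scale)
    theta      : orientation of the carrier in radians
    sigma      : standard deviation of the Gaussian envelope
    gamma      : aspect ratio that makes the envelope oval-shaped
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# Illustrative bank: 2 scales x 4 orientations (values are assumptions).
bank = [
    gabor_kernel(size=9, wavelength=wl, theta=k * math.pi / 4, sigma=wl / 2)
    for wl in (4.0, 8.0)
    for k in range(4)
]
```

Convolving an image with every kernel in such a bank costs one full 2D convolution per scale and orientation, which is one way to see the efficiency concern raised above.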
One of the promising alternatives is the 2D dual-tree complex wavelet transform (DT-CWT)[10], which is reported to preserve most of the desired properties of the 2D Gabor wavelet transform while having less redundancy and lower computational complexity[2-3]. In this paper, we adopt the 2D double-density dual-tree complex discrete wavelet transform (2D DD-DT CWT), an upgraded version of the 2D DT-CWT, as the spatial filter bank for decomposing the 2D spatial structure signals from natural scene images. Like the Gabor transform, the 2D DD-DT CWT has excellent oriented and spatial bandpass properties, so it can simulate the visual perception mechanism in V1 very well. Figure 2 illustrates the typical Gabor-like wavelets associated with the 2D DD-DT CWT. In addition, the 2D DD-DT CWT has a well-designed wavelet frame, which provides much higher computational efficiency than the 2D Gabor wavelet transform.
Figure 2 Typical wavelets associated with the 2D DD-DT CWT
The 2D DD-DT CWT is a hybrid of the 2D DT-CWT and the double-density DWT (DD-DWT); hence it is also an overcomplete DWT[9]. The 2D DT-CWT performs better than the 2D Gabor wavelet transform because of its lower computational complexity[2-3], and the DD-DWT enhances the high-frequency resolution of the DWT because of its double high-frequency (HF) filter structure. Therefore, the 2D DD-DT CWT simultaneously possesses the advantages of the 2D DT-CWT and the DD-DWT.
The 2D DD-DT CWT is based on a specially designed iterated spatial filter bank. A spatial filter bank contains row filters and column filters, which operate on the rows and columns of pixels in an image, respectively. The 2D DD-DT CWT has a parallel four-tree structure, in which each tree is a 2D DD-DWT. To create different filtering characteristics in each tree, the 2D DD-DWT in each tree uses distinct filter sets for the rows and columns. These four 2D DD-DWTs can also be divided into two groups, each of which constitutes a 2D DD-DT real DWT based on two distinct scaling functions φh(n), φg(n) and four distinct wavelets ψh1(n), ψh2(n), ψg1(n), and ψg2(n)[11]. The wavelets ψh1(n) and ψh2(n) are offset by one half from one another, as are ψg1(n) and ψg2(n):
$$\psi_{h2}(n)\approx\psi_{h1}(n-0.5),\qquad \psi_{g2}(n)\approx\psi_{g1}(n-0.5)$$
where the four wavelets ψh1(n), ψg1(n), ψh2(n), ψg2(n) respectively form two approximate Hilbert transform pairs:
$$\psi_{g1}(n)\approx\mathcal{H}\{\psi_{h1}(n)\},\qquad \psi_{g2}(n)\approx\mathcal{H}\{\psi_{h2}(n)\}$$
Note that the properties of the 2D DD-DT real DWT are determined by the choice of row filters and column filters. Different combinations of (scaling and wavelet) filters in rows and columns therefore lead to distinct bandpass effects. Through careful selection, the two 2D DD-DT real DWTs that constitute the 2D DD-DT CWT are designed to be conjugate. Consequently, the 2D DD-DT CWT possesses approximate shift invariance, which is a desirable property for image decomposition.
Via the iterated spatial filter bank of the 2D DD-DT CWT, a natural scene image can be decomposed into diverse 2D spatial structure signals with different spatial scales and orientations. The hierarchical decomposition of the 2D DD-DT CWT follows a fixed pattern. Each round of decomposition generates thirty-six subband signals: four low-frequency (LF) 2D spatial structure signals and thirty-two high-frequency (HF) 2D spatial structure signals. The four LF signals are approximations of the original natural scene image, with mixed scales and orientations. The thirty-two HF signals share a single spatial scale but belong to sixteen different orientations. Note that the spatial filter bank for the first-stage decomposition is different from those of the higher stages. As shown in Figure 3, the first-stage spatial filter bank has only one input, which is the original natural scene image. The higher-stage spatial filter bank illustrated in Figure 4, however, has four distinct inputs, which correspond to the four LF 2D spatial structure signals generated in the previous stage. The LF signals are iteratively decomposed because they still contain too many mixed 2D spatial structure signals that need to be isolated. The HF signals, by contrast, are retained because they carry discriminative information about the spatial structure of natural scenes, such as spatial scales and orientations.
Since each stage of decomposition contains a downsampling operation with factor 2, the spatial scale of the 2D spatial structure signals decreases as the number of decomposition stages increases. The orientations of the 2D spatial structure signals, however, are the same at every stage. Therefore, the 2D DD-DT CWT has only one tunable parameter, namely the number of decomposition stages. This makes the 2D DD-DT CWT easy to use and more robust. The optimum number of stages, however, varies with attributes of the natural scene images, such as size. The issue of adaptive parameter selection for the 2D DD-DT CWT will be discussed in the next section.
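The bookkeeping implied by the decomposition pattern can be sketched as follows. The sketch only counts subbands and tracks their side length for a square image; it does not perform any filtering, and the helper name `subband_layout` is introduced here for illustration. Boundary handling and non-power-of-two sizes are ignored, which is an assumption of this sketch.

```python
def subband_layout(image_size, stages, hf_per_stage=32, lf_per_stage=4):
    """List the subbands produced at each stage for a square image.

    Each stage halves the side length (downsampling by 2), keeps
    `hf_per_stage` HF subbands, and passes `lf_per_stage` LF subbands
    on to the next stage.
    """
    layout = []
    size = image_size
    for stage in range(1, stages + 1):
        size //= 2  # downsampling with factor 2
        layout.append({
            "stage": stage,
            "subband_size": size,
            "hf_subbands": hf_per_stage,
            "lf_subbands": lf_per_stage,
        })
    return layout

# Example: a 256 x 256 image decomposed over three stages.
layout = subband_layout(256, 3)
total_hf = sum(entry["hf_subbands"] for entry in layout)
for entry in layout:
    print(entry)
print("HF subbands retained in total:", total_hf)
```

With the number of stages as the only free parameter, the number of retained HF subbands grows linearly with it while their size shrinks geometrically.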
Figure 3 First-stage spatial filter bank of 2D DD-DT CWT
Figure 4 Higher-stage spatial filter bank of 2D DD-DT CWT
Once a natural scene image has been decomposed by the 2D DD-DT CWT, the dominant signal components related to the spatial structure of the scene are retained while the non-significant ones are discarded. According to Figure 1, the next critical step is to extract 2D spatial structure information from each of the 2D spatial structure signals. However, even the latest research cannot tell us exactly how the information is encoded in those signals. Inspired by the nature of natural scene images and the notion of “gist”, we devise an approach to decode useful information from those 2D spatial structure signals.
As is well known, natural scene images exhibit a variety of complex statistical characteristics. We can therefore infer that the underlying information in the 2D spatial structure signals should possess certain statistical properties. On the other hand, we know that the perception of the scene gist must be completed in a very short time. Therefore, we make the following assumptions: ① since the “gist” carries holistic information, it should consist of some kinds of statistical features; ② since the perceptual process is brief, the feature extraction approaches should be relatively simple; and ③ each of these features should be small in size.
Based on these assumptions, we propose a statistical feature extraction method that can rapidly extract a hybrid statistical feature from each of the 2D spatial structure signals. Each hybrid statistical feature is composed of two different kinds of statistical features: the wavelet entropy (WE) feature and the wavelet co-occurrence matrix (WCM) features. The WE is an application of the Shannon entropy in the wavelet domain. It provides a useful measure for analyzing and comparing the statistical characteristics of wavelet-domain signals. The 2D spatial structure signals are 2D wavelet-domain signals, since they are generated by the 2D DD-DT CWT. However, there are many different forms of WE, each of which is suitable for a specific type of wavelet-domain signal. By experiment, we found that the following formula is more suitable for the 2D DD-DT CWT signals:
where s is the 2D wavelet signal of size m×n, |·| denotes the absolute value, p is the power, and E(s) is the WE of s.
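As a hedged illustration of how an entropy of this kind can be computed from the quantities named above (s, m×n, |·|, p), the sketch below uses a norm-type entropy, i.e. the mean of |s(i,j)|^p over the subband. Both the norm-type form and the normalization by m×n are assumptions of this sketch; they are not confirmed as the exact formula used in this paper, and the default value of p is arbitrary.

```python
def wavelet_entropy(s, p=1.0):
    """Norm-type entropy of a 2D wavelet subband (ASSUMED form).

    s : list of rows of real or complex coefficients, size m x n
    p : power applied to the coefficient magnitudes

    Returns (1 / (m * n)) * sum(|s[i][j]| ** p). The 1/(m*n)
    normalization is an assumption made so that subbands of
    different sizes remain comparable.
    """
    m, n = len(s), len(s[0])
    total = sum(abs(value) ** p for row in s for value in row)
    return total / (m * n)

# Tiny usage example on a 2 x 2 subband.
subband = [[1.0, -1.0], [2.0, -2.0]]
print(wavelet_entropy(subband, p=2))
```

Because only one pass over the coefficients is needed, a feature of this kind is cheap to compute and yields a single number per subband, in line with the simplicity and small-size assumptions stated earlier.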
Similar to the notion of the WE, the WCM is an extension of the gray-level co-occurrence matrix (GLCM) to the 2D wavelet transform. The GLCM is a classical statistical method in texture analysis, which captures both the distribution of intensities and information about the relative positions of neighboring pixels in an image. Given an HF subband image I of size M×M, we calculate its WCM by the following equation:
$$C(i,j)=\sum_{x=1}^{M}\sum_{y=1}^{M}\begin{cases}1, & \text{if } I(x,y)=i \text{ and } I(x+\Delta x,\,y+\Delta y)=j\\ 0, & \text{otherwise}\end{cases}$$
where the offset (Δx,Δy) specifies the displacement between the pixel of interest and its neighbor, which makes the co-occurrence matrix sensitive to direction. By setting the offset vector, we can compute the WCM C(i,j) in different directions.
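A minimal sketch of the co-occurrence count for one offset is given below. It assumes the subband has already been quantized to integer levels 0..levels-1; wavelet coefficients are not integers, so some quantization step is needed in practice, and how it is done here is not specified by the text. The function name and the toy data are illustrative only.

```python
def cooccurrence_matrix(image, dx, dy, levels):
    """Count level pairs (image[y][x], image[y + dy][x + dx]).

    image  : list of rows of integer levels in range(levels)
             (quantization of the subband is assumed to be done already)
    dx, dy : offset to the neighbor, in columns and rows
    Pairs whose neighbor falls outside the image are skipped.
    """
    rows, cols = len(image), len(image[0])
    C = [[0] * levels for _ in range(levels)]
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                C[image[y][x]][image[y2][x2]] += 1
    return C

# Toy 3 x 3 image with 3 levels; offset (1, 0) is the horizontal (0 degree) direction.
toy = [[0, 0, 1],
       [1, 2, 2],
       [0, 1, 2]]
C0 = cooccurrence_matrix(toy, dx=1, dy=0, levels=3)
print(C0)  # → [[1, 2, 0], [0, 0, 2], [0, 0, 1]]
```

With image rows running downward, one common convention for the four usual directions is (dx, dy) = (1, 0), (1, -1), (0, -1) and (-1, -1) for 0°, 45°, 90° and 135°.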
In our model, the computation of the WCM features for each HF 2D spatial structure signal consists of two steps. In the first step, we compute four WCMs of the HF 2D spatial structure signal in four directions: 0°, 45°, 90°, and 135°. In the second step, we calculate four statistical features, “Contrast”, “Correlation”, “Energy” and “Homogeneity”, from each of the four WCMs:
Therefore, we obtain 16 WCM statistical features from each HF 2D spatial structure signal. At the hybrid stage, we combine the WCM features and the WE feature of each HF 2D spatial structure signal into a 17-dimensional hybrid statistical feature. Finally, all these hybrid statistical features are concatenated into the “gist” feature. Figure 5 shows the basic processes of the proposed visual perception model. With this model, a given natural scene can be converted into a one-dimensional “gist” feature.
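The assembly of the 17-dimensional hybrid feature can be sketched as follows. The four statistics are computed here with their commonly used textbook definitions on the normalized co-occurrence matrix (contrast with squared level difference, homogeneity with 1/(1+|i-j|)); these definitions, the ordering of the 17 entries, and the function names are assumptions of this sketch and may differ in detail from the formulas used in this paper.

```python
import math

def wcm_features(C):
    """Contrast, correlation, energy, homogeneity of a co-occurrence matrix.

    C is a square matrix of counts; it is normalized to probabilities first.
    Standard textbook definitions are ASSUMED here.
    """
    total = sum(sum(row) for row in C)
    n = len(C)
    P = [[C[i][j] / total for j in range(n)] for i in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(n)]
    mu_i = sum(i * P[i][j] for i, j in pairs)
    mu_j = sum(j * P[i][j] for i, j in pairs)
    var_i = sum((i - mu_i) ** 2 * P[i][j] for i, j in pairs)
    var_j = sum((j - mu_j) ** 2 * P[i][j] for i, j in pairs)
    contrast = sum((i - j) ** 2 * P[i][j] for i, j in pairs)
    energy = sum(P[i][j] ** 2 for i, j in pairs)
    homogeneity = sum(P[i][j] / (1 + abs(i - j)) for i, j in pairs)
    spread = math.sqrt(var_i * var_j)
    covariance = sum((i - mu_i) * (j - mu_j) * P[i][j] for i, j in pairs)
    # Correlation is undefined for a constant subband; 0.0 is a sketch choice.
    correlation = covariance / spread if spread > 0 else 0.0
    return [contrast, correlation, energy, homogeneity]

def hybrid_feature(wcms, entropy):
    """Concatenate 4 features x 4 directions, then append the entropy (17 values)."""
    feature = []
    for C in wcms:  # one matrix per direction: 0, 45, 90, 135 degrees
        feature.extend(wcm_features(C))
    feature.append(entropy)
    return feature

# Usage with toy 2-level matrices standing in for the four directions.
example = hybrid_feature([[[2, 0], [0, 2]]] * 4, entropy=2.5)
print(len(example))  # → 17
```

Under the fixed pattern of thirty-two HF signals per stage, concatenating one such vector per HF signal would give a gist feature of 17 × 32 values per decomposition stage; whether any LF signals also contribute is not stated here, so that count is indicative only.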
Figure 5 Our proposed visual perception model
For a newborn infant with a normal visual perception system, rapid natural scene categorization is an impossible task, because the infant has no a priori knowledge about natural scenes and the brain has never built up a mapping between scene appearance and scene semantic categories. Visual cognition is therefore clearly the critical step between visual perception and natural scene categorization. In the process of visual cognition, the brain fully utilizes the scene spatial structure information obtained in the visual perception stage, and it creates a cognitive map by combining a priori knowledge with a learning process. The cognitive mapping process is the key link that associates physical visual signals with abstract semantic concepts.
Evidently, the human brain has established a special cognitive mechanism through long-term practice of rapid natural scene categorization. This cognitive mechanism involves a body of a priori knowledge about natural scenes and some highly effective cognitive mapping strategies.
Therefore, computer vision scientists are working on building a similar a priori knowledge library and imitating those highly effective cognitive mapping strategies. In this paper, we investigate two aspects of the cognitive mechanisms that are likely involved in rapid natural scene categorization. First, we investigate the feedback mechanism from visual cognition to the front-end visual perception stage. Second, we investigate how visual cognition exploits a priori knowledge to realize more effective cognitive mapping. Correspondingly, we build models to simulate these visual cognition mechanisms.
When we observe the physical world, the observation scales we use are very important, because they determine what we can see. Suppose, for example, that the task is to observe a mountain and a stone in a scene. If we set the observation scale to a hundred meters, we can see only the mountain and not the stone. If we choose centimeters as the observation scale, we can see only the stone and not the mountain. Further, if the observation scale is a hundred miles, we can see neither the mountain nor the stone. More interestingly, if the observation is conducted at the atomic scale, we will see no difference between a mountain and a stone. Therefore, we can obtain effective discriminative information only when we choose a set of observation scales that is suitable for the observation problem. Observation scales that are too large or too small only waste computational resources or reduce the discriminative information.
On the other hand, our brain always works in the most energy-efficient way. For instance, when the brain performs visual activities, the extent of visual perception is restricted by the needs of the visual cognitive task. Usually, the visual perception process starts from large scales. If the cognitive task can be accomplished with a rough view, the brain perceives visual signals only at some large scales. But when the brain finds that these large-scale visual signals are not enough to meet the requirements of the visual cognitive task, it perceives more visual signals at smaller scales. Meanwhile, the visual characteristics (e.g. size, resolution and complexity) of different visual objects also influence the amount of visual signal needed for visual cognition. Consequently, visual perception is a dynamic process that can be controlled by the cognition process. This is called the top-down visual mechanism.
In rapid natural scene categorization, the human brain works in the same way. First, the visual perception system acquires spatial structure signals on a large scale. These visual signals correspond to rough spatial structures. When the natural scene categorization task is relatively simple, such as natural landscape vs. artificial environment, these rough large-scale spatial structure signals may provide enough identification information for correct categorization. In this case, the brain does not acquire any more spatial structure signals on smaller scales, since the cognitive task has been successfully completed. As the categorization task becomes more difficult, the identification information provided by the large-scale spatial structure signals may become insufficient, and more detailed information is needed as a supplement. Under such circumstances, the brain focuses more on smaller spatial scales and requires the visual perception system to acquire spatial structure signals on those scales. These small-scale spatial structure signals are combined with the large-scale ones in the visual perception system, and the multiscale signals provide more powerful identification information for the natural scene categorization task. If these multiscale spatial structure signals still cannot provide enough identification information, the brain drives the visual perception system again to perceive more signals on even smaller scales, until the new multiscale signals are sufficient for the categorization task. Note, however, that the effective identification information does not keep increasing with the number of perceptual scales. Excessive small-scale signals lower the proportion of effective identification information in the multiscale signals. This is because signals on too small a scale cannot completely describe any specific spatial structure, and they weaken the differences between various types of natural scenes. Consequently, the brain chooses a set of optimal multiscale signals by comparing the categorization accuracy of various combinations of multiscale signals.
According to this principle, we propose a cognitive feedback model that chooses the optimum number of decomposition stages of the 2D DD-DT CWT by feedback of the cognition result. The cognition result, in brief, is the categorization accuracy. The cognitive feedback is a process that constantly adjusts the scales of perception according to the cognition result and determines an optimum set of observation scales that maximizes the categorization accuracy. Our experiments show that the 2D DD-DT CWT has only one optimum number of decomposition stages. This optimum number is related only to the qualities of the natural scene images, such as size and resolution. It means that natural scene images with the same qualities share the same optimum number of decomposition stages for the 2D DD-DT CWT. Hence, with only a small natural scene dataset, we can rapidly identify the optimum number of decomposition stages for any kind of natural scene image. By adding the cognitive feedback, the visual perception model proposed in Section 2 can realize adaptive optimization, which leads to robustness and practicality. The cognitive feedback model is illustrated in Figure 6.
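The feedback loop described above can be sketched as a simple search over the number of decomposition stages. The callback `accuracy_for` is hypothetical: it stands for "extract gist features with the given number of stages on a small labeled dataset, run the classifier, and return the categorization accuracy". The early stop relies on the single-optimum behavior reported above; the accuracy values in the usage example are invented purely to exercise the code and are not experimental results.

```python
def select_stages(accuracy_for, max_stages):
    """Pick the number of decomposition stages by accuracy feedback.

    accuracy_for : hypothetical callback, stages -> categorization accuracy
    max_stages   : largest number of stages worth trying

    Starts from the coarsest setting (one stage) and adds finer scales
    while accuracy keeps improving; stops at the first drop, which assumes
    that accuracy has a single peak over the number of stages.
    """
    best_stages = 1
    best_accuracy = accuracy_for(1)
    for stages in range(2, max_stages + 1):
        accuracy = accuracy_for(stages)
        if accuracy <= best_accuracy:
            break  # finer scales no longer help; keep the previous setting
        best_stages, best_accuracy = stages, accuracy
    return best_stages, best_accuracy

# Made-up accuracies for illustration only (not results from this paper).
toy_accuracy = {1: 0.70, 2: 0.78, 3: 0.83, 4: 0.80, 5: 0.75}
chosen = select_stages(toy_accuracy.get, max_stages=5)
print(chosen)  # → (3, 0.83)
```

If the single-peak assumption is not trusted for a given dataset, the `break` can be dropped so that every candidate up to `max_stages` is evaluated.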
Figure 6 Cognitive feedback model
Human beings tend to cognize things in a progressive way, which is called the “coarse-to-fine” mode. This is particularly obvious when we categorize things. When categorizing an object, we usually prefer to first identify the major category it belongs to, and then identify the minor category it belongs to. By this successive refinement, we can narrow the scope of investigation and finally ascertain a very specific category for the target object. For instance, we use this “coarse-to-fine” mode when we try to categorize a beetle. This mode works well because many irrelevant and interfering category choices are successively discarded, which simplifies a complex categorization task and greatly raises categorization accuracy and efficiency. The “coarse-to-fine” mode, however, rests on the precondition that the target objects possess multiple category labels. This precondition is easy to satisfy, because an object can always be categorized from different semantic viewpoints. Different semantic viewpoints lead to different semantic category labels, which correspond to various semantic scales in coarse-to-fine categorization. Therefore, using multiple semantic category labels, we can achieve coarse-to-fine categorization of things. In other words, coarse-to-fine categorization is the process of identifying the multiple semantics of the target object.
It is widely believed that the same cognitive mechanism exists in the rapid visual categorization of natural scenes. In rapid natural scene categorization, humans tend to identify the major category of a scene, such as “indoor” or “outdoor”, at the first glimpse, while the detailed recognition of the specific scene category comes afterwards. A natural scene usually possesses multiple semantic labels simultaneously, which provide much more a priori knowledge than a single semantic label does. For example, a room scene can have the following semantic labels: ① artificial environment, ② indoor, and ③ office. Using the a priori knowledge about natural scenes as constraint conditions, we are able to identify the category of a specific natural scene in a more efficient and accurate way. Table 1 lists the a priori knowledge on the multiple semantics of natural scenes according to daily human experience.
Table 1 Priori knowledge on natural scene multiple semantics
According to Table 1, any natural scene image can be given multiple semantic labels. The relationships between a natural scene and its multiple semantic labels can also be regarded as constraint conditions. These constraint conditions are very important for rapid natural scene categorization because they help us eliminate many interfering possibilities in category identification. For instance, natural scenery can only be an outdoor scene rather than an indoor scene, and a street scene can only be an outdoor scene rather than an indoor scene. Humans master these semantic rules about natural scenes through long-term cognitive learning and use them in everyday visual scene cognition. This fact inspires us to design a rapid natural scene cognition model that can utilize this kind of human knowledge about natural scenes.
The key of the model is to build a set of relationship rules for multiple semantics of natural scene.Under this set of rules,each natural scene image simultaneously belongs to multiple semantic categories,and there are certain fixed relationships between these multiple semantic categories.These relationships conform to the prior knowledge about natural scene that illustrated in Table 1.However,the hierarchical relationships between different semantic categories can not be obtained from Table 1.While these hierarchical relationships are the key of the“coarse-to-fine”mode when human conduct rapid natural scene categorization.Table 2 gives the semantic scales and cognitive priority of different semantic categories.Usually,those semantic categories with large semantic scales have higher cognitive priority,while the semantic categories with small semantic scales have lower cognitive priority.As illustrated in Table 2,“natural environment”and“artificial environment”have largest semantic scales,that means this two semantic categories have the most abstract description for the natural scene images;“indoor”and“outdoor”have smaller semantic scales than the formers,that means this two semantic categories have more specific semantic description on natural scenes images;semantic categories such as coast,street and kitchen have smallest semantic scales,therefore they have most detailed semantic description on natural scenes images.
Table 2 Semantic scales and cognitive priority of various semantic categories
Thus, we can build our multiple semantics based rapid cognition model according to the relationship rules illustrated in Table 1 and Table 2. According to Table 2, the model first categorizes a natural scene image on the largest semantic scale, namely natural environment versus artificial environment; then, according to Table 1, if the image is identified as artificial environment in the first step, the model categorizes it on the middle semantic scale, namely indoor versus outdoor; finally, the model categorizes the image on the smallest semantic scale and gives it the most specific semantic description, such as coast, street or kitchen. Through this step-by-step categorization on the multiple semantics of a natural scene image, the model can identify the exact category of the image with high efficiency and accuracy. The whole categorization process conforms to the cognitive process of human rapid natural scene categorization. Figure 7 shows the multiple semantics based rapid cognition model.
Figure 7 Multiple semantics based rapid cognition model
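The coarse-to-fine procedure described above can be sketched in Python as a walk down a small label hierarchy. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: `HIERARCHY` is a hypothetical fragment assembled from categories named in the text (the full rules are those of Table 1 and Table 2), and `score` is a stand-in for whatever classifier confidence is available for each label.

```python
from typing import Callable, Dict, List

# Hypothetical fragment of the label hierarchy, built only from categories
# named in the text; it is an assumption, not a copy of Table 1 or Table 2.
HIERARCHY: Dict[str, Dict[str, List[str]]] = {
    "natural environment": {
        "outdoor": ["coast", "forest", "mountain", "open-country"],
    },
    "artificial environment": {
        "indoor": ["office", "bedroom", "kitchen", "living-room"],
        "outdoor": ["street", "highway", "tall-building"],
    },
}


def pick(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate label with the highest score."""
    return max(candidates, key=score)


def coarse_to_fine(score: Callable[[str], float]) -> List[str]:
    """Assign one label per semantic scale, from the largest to the smallest."""
    # Step 1: largest scale (natural vs. artificial environment).
    environment = pick(list(HIERARCHY), score)
    # Step 2: middle scale. A natural scene has "outdoor" as its only
    # candidate, so the constraint alone decides this step in that case.
    mid_scale = HIERARCHY[environment]
    location = pick(list(mid_scale), score)
    # Step 3: smallest scale, restricted to the categories still allowed.
    category = pick(mid_scale[location], score)
    return [environment, location, category]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {"artificial environment": 0.9, "indoor": 0.8, "kitchen": 0.7}
    print(coarse_to_fine(lambda label: scores.get(label, 0.0)))
```

Because each step only compares the labels that the previous step left open, a high score for an excluded category (for example “coast” after the image has been judged indoor) cannot affect the final decision.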
We evaluate our method on three public natural scene (NS) image datasets: the 8 categories NS dataset [13], the 13 categories NS dataset [14] and the 15 categories NS dataset [15]. The 8 categories dataset is composed of 8 categories of color images (each image is 256 × 256 pixels): coast (360 pictures), forest (328 pictures), highway (260 pictures), inside-city (308 pictures), mountain (374 pictures), open-country (410 pictures), street (292 pictures), and tall-building (356 pictures). The 13 categories dataset extends the 8 categories dataset with another 5 categories of scene images (average image size is 300 × 256 pixels): office (215 pictures), bedroom (216 pictures), kitchen (210 pictures), living-room (289 pictures), and suburb (241 pictures). The 15 categories NS dataset extends the 13 categories dataset with another 2 categories of scene images: store (315 pictures) and industrial (311 pictures). Some example pictures from the three natural scene datasets are shown below.
Figure 10 Example images of 15 categories NS dataset
In our experiments, all images from the three datasets are converted to grayscale and resized to a uniform 256 × 256 pixels. All experiments are repeated ten times with different randomly selected training and test images. In each run, 100 images per category are selected randomly for training and the rest are used for testing. Both training and testing are performed with a support vector machine (SVM) (the same as [13]). The average per-category accuracy is recorded for each run, and the final result is reported as the mean and standard deviation of the results from the individual runs (the same as [15]).
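The evaluation protocol above can be outlined in a few lines of Python. The sketch below is an assumption-laden illustration rather than the code used for the experiments: `run_once` is a hypothetical callback that trains and tests a classifier on one split and returns a per-category accuracy, and the sample standard deviation is used because the text does not say which form is reported.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Split = Dict[str, List[str]]


def split_per_category(images: Dict[str, Sequence[str]], n_train: int,
                       rng: random.Random) -> Tuple[Split, Split]:
    """Randomly pick n_train images per category for training, rest for testing."""
    train: Split = {}
    test: Split = {}
    for category, items in images.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test


def repeated_evaluation(images: Dict[str, Sequence[str]],
                        run_once: Callable[[Split, Split], Dict[str, float]],
                        n_train: int = 100, n_runs: int = 10,
                        seed: int = 0) -> Tuple[float, float]:
    """Return the mean and standard deviation of the per-run average accuracy."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_runs):
        train, test = split_per_category(images, n_train, rng)
        per_category = run_once(train, test)  # hypothetical: category -> accuracy
        # Average over categories, so every category counts equally.
        run_scores.append(statistics.mean(list(per_category.values())))
    # Sample standard deviation over runs (an assumption, see lead-in).
    return statistics.mean(run_scores), statistics.stdev(run_scores)
```

Averaging per category before averaging over runs keeps large categories such as open-country from dominating the reported figure.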
In this section, we test the performance of the cognitive feedback model on the 8 categories NS dataset. The aim is to confirm the effectiveness of the cognitive feedback model and the necessity of selecting the optimum decomposition stages of the 2D DD-DT CWT. Specifically, the experiments on the cognitive feedback model reveal the relationship between the optimum decomposition stages of the 2D DD-DT CWT and the traits of the natural scene images. There are three comparative experiments. The first experiment investigates the influence of natural scene image size on the optimum decomposition stages of the 2D DD-DT CWT. The second experiment investigates the influence of natural scene image resolution on the optimum decomposition stages of the 2D DD-DT CWT. The third experiment demonstrates that a mini natural scene dataset can substitute for its corresponding large natural scene dataset when selecting the optimum decomposition stages of the 2D DD-DT CWT. Here the assumption is that the optimum decomposition stages of the 2D DD-DT CWT depend only on the qualities of the natural scene images rather than on the number of images in the dataset. In the mini natural scene dataset, there are 30 images in each natural scene category. The results of the three experiments are as follows.
From Figure 11 we can see that the best average classification accuracy varies with the size of the natural scene images. When the images are 256 × 256 pixels, the best average classification accuracy occurs when the 2D DD-DT CWT has four decomposition stages. When the images are 128 × 128 pixels, it occurs with three decomposition stages. When the images are 64 × 64 pixels, it occurs with two decomposition stages. The optimum number of decomposition stages of the 2D DD-DT CWT therefore decreases as the size of the natural scene images decreases; in these experiments, each halving of the image side length lowers the optimum by one stage. This shows that the size of the natural scene images affects the choice of the optimum decomposition stages for the 2D DD-DT CWT. Therefore, the cognitive feedback model can determine the optimum decomposition stages for the 2D DD-DT CWT by finding, via feedback comparison, the best average classification accuracy for a specific size of natural scene images.
Figure 11 Relationship between average accuracy and decomposition stages of 2D DD-DT CWT for different sizes of natural scene images
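The feedback comparison described above amounts to trying each candidate number of decomposition stages and keeping the one with the best average accuracy. The following Python sketch shows that loop under stated assumptions: `evaluate` is a hypothetical callback that would extract features with the given number of 2D DD-DT CWT stages, train and test the classifier, and return the average accuracy; it is not an existing library function.

```python
from typing import Callable, Iterable, Tuple


def select_optimum_stages(evaluate: Callable[[int], float],
                          candidate_stages: Iterable[int]) -> Tuple[int, float]:
    """Try each decomposition depth and keep the one with the best accuracy."""
    candidates = list(candidate_stages)
    if not candidates:
        raise ValueError("at least one candidate stage count is required")
    best_stages = candidates[0]
    best_accuracy = evaluate(best_stages)
    for stages in candidates[1:]:
        accuracy = evaluate(stages)  # hypothetical: features + classifier
        # Strict comparison: on a tie, the earlier candidate is kept.
        if accuracy > best_accuracy:
            best_stages, best_accuracy = stages, accuracy
    return best_stages, best_accuracy
```

Passing the candidates in increasing order means that a tie is resolved in favor of the shallower, cheaper decomposition.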
Figure 12 shows the average classification accuracies for three groups of natural scene images. The three groups have the same size of 256 × 256 pixels but different resolutions. As shown in Figure 12, the three groups have the same variation trend in average classification accuracy and reach their best average classification accuracy at the same number of decomposition stages. This means that the resolution of the natural scene images has no influence on the relationship between the best average classification accuracy and the optimum decomposition stages of the 2D DD-DT CWT. Therefore, we only need to consider the image size, not the image resolution, when we choose the optimum decomposition stages for the 2D DD-DT CWT.
Figure 12 Relationship between average accuracy and decomposition stages of 2D DD-DT CWT for different resolutions of natural scene images
Figure 13 compares the variation trend of the average classification accuracy over different decomposition stages of the 2D DD-DT CWT between the full 8 categories NS dataset and a mini 8 categories NS dataset (a small random subset of the full 8 categories NS dataset). As we can see in Figure 13, the mini dataset has exactly the same variation trend as the full 8 categories NS dataset. This means we can conduct the cognitive feedback process on a mini natural scene dataset instead of its corresponding full dataset. This greatly improves the efficiency of the search for the best average classification accuracy and the optimum decomposition stages for the 2D DD-DT CWT, and it conforms to the principle of rapid natural scene categorization.
Figure 13 Relationship between average accuracy and decomposition stages of 2D DD-DT CWT for different sizes of natural scene datasets
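A mini dataset of the kind used in this experiment can be drawn as a random subset with a fixed number of images per category. The Python sketch below is a minimal illustration, assuming the dataset is held as a mapping from category name to a list of image identifiers; the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def make_mini_dataset(images: Dict[str, Sequence[str]],
                      per_category: int = 30,
                      seed: int = 0) -> Dict[str, List[str]]:
    """Draw per_category images at random, without replacement, from each category."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    # random.Random.sample raises ValueError if a category has too few images.
    return {category: rng.sample(list(items), per_category)
            for category, items in images.items()}
```

Sampling the same number of images from every category keeps the mini dataset balanced even though the full categories differ in size.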
In this section, we test the performance of the multiple semantics based cognition model on each of the three NS datasets. Table 3 shows the average classification accuracies of our method on these NS datasets. On the 8 categories NS dataset, the average classification accuracy of our method is comparable to the state-of-the-art accuracy of Oliva's method [13] (83.7%). On the 13 categories NS dataset, our method compares even more favorably: its average classification accuracy is superior to the state-of-the-art accuracies reported in [14] (76%) and [15] (74.7%) (all comparisons are based on the same experimental setup).
Table 3 Average accuracies of our method
Figure 14 uses confusion matrices to illustrate the detailed classification results of our method on each of the three NS datasets. In each confusion matrix, the average classification rates for the individual classes are listed along the diagonal. The entry in the ith row and jth column (i ≠ j) is the percentage of images from class i that are misidentified as class j.
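A confusion matrix with this layout can be computed from the true and predicted labels of the test images. The Python sketch below is illustrative only and uses hypothetical function names: each row is normalized to percentages, so the diagonal holds the per-class classification rates and the mean of the diagonal gives the average per-category accuracy.

```python
from typing import List, Sequence


def confusion_matrix_percent(true_labels: Sequence[str],
                             predicted_labels: Sequence[str],
                             classes: Sequence[str]) -> List[List[float]]:
    """Entry [i][j] is the percentage of class-i images predicted as class j."""
    index = {name: k for k, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no test images yields a row of zeros.
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


def average_accuracy(matrix: List[List[float]]) -> float:
    """Mean of the diagonal, i.e. the average per-class classification rate."""
    return sum(matrix[k][k] for k in range(len(matrix))) / len(matrix)
```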
From the confusion matrix for the 8 categories NS dataset in Figure 14 (a), we find that the images in ‘open-country’ are the most challenging; they are most easily confused with the images in ‘mountain’ and ‘coast’. The most easily identified categories are ‘forest’ and ‘street’. From the confusion matrix for the 13 categories NS dataset in Figure 14 (b), we find that the 13 categories NS dataset is much more challenging than the 8 categories NS dataset because of the four added indoor categories. For the 13 categories NS dataset, the most easily confused categories are ‘living-room’, ‘bedroom’, and ‘kitchen’. From the confusion matrix for the 15 categories NS dataset in Figure 14 (c), we can see that the ‘industrial’ scene is highly confused with the ‘store’ scene, while the ‘store’ scene has a good recognition rate.
We have described a biologically inspired approach for scene image classification that is composed of both visual perception and cognition models. The visual perception model extracts gist features from natural scene images. It consists of a multiscale 2D spatial signal decomposition module and a hybrid statistical feature extraction module. However, the visual perception model needs an adaptive function for practical application. The cognitive feedback model, designed to improve the visual perception model, can select the optimum decomposition stage for the visual perception model on a mini natural scene dataset. This cognitive feedback model has been shown to mimic the top-down visual mechanism in our brain. To simulate the human ability of rapid natural scene categorization, we propose the multiple semantics based cognition model, which imitates the human “coarse-to-fine” cognitive mechanism. By utilizing prior knowledge of natural scenes, the multiple semantics based cognition model can recognize the multiple semantic labels of a natural scene image in a specific priority order. This model makes the categorization of natural scene images more rapid and robust than the traditional mono-semantics cognition model does. Experiments show that the approach, which combines visual perception and cognition mechanisms, can simulate human rapid natural scene categorization very well, and its categorization accuracies on several natural scene datasets are better than those of some previous classical methods.
[1] C. Siagian, L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(2): 300-312.
[2] A. Eleyan, H. Ozkaramanli, H. Demirel. Complex Wavelet Transform-Based Face Recognition[J]. Eurasip Journal on Advances in Signal Processing, 2008(5): 202-218.
[3] T. Celik, H. Ozkaramanli, H. Demirel. Facial feature extraction using complex dual-tree wavelet transform[J]. Computer Vision and Image Understanding, 2008, 111(2): 229-246.
[4] L. W. Renninger, J. Malik. When is scene identification just texture recognition[J]. Vision Research, 2004, 44(19): 2301-2311.
[5] S. Baker, T. Sim, T. Kanade. When is the shape of a scene unique given its light-field: A fundamental theorem of 3D vision[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003, 25(1): 100-109.
[6] S. K. Shevell, F. A. A. Kingdom. Color in complex scenes[J]. Annual Review of Psychology, 2008, 59: 143-166.
[7] C. Zetzsche, U. Nuding. Natural scene statistics and nonlinear neural interactions between frequency-selective mechanisms[J]. Biosystems, 2005, 79(1-3): 143-149.
[8] G. S. Yu, J. J. Slotine. Fast Wavelet-Based Visual Classification[C]//19th International Conference on Pattern Recognition, 2008: 526-530.
[9] I. W. Selesnick. The double-density dual-tree DWT[J]. IEEE Transactions on Signal Processing, 2004, 52(5): 1304-1314.
[10] I. W. Selesnick, R. G. Baraniuk, N. G. Kingsbury. The dual-tree complex wavelet transform[J]. IEEE Signal Processing Magazine, 2005, 22(6): 123-151.
[11] I. W. Selesnick. The double-density dual-tree DWT[J]. IEEE Transactions on Signal Processing, 2004, 52(5): 1304-1314.
[12] http://taco.poly.edu/selesi/DoubleSoftware/index.htm
[13] A. Oliva, A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope[J]. International Journal of Computer Vision, 2001, 42(3): 145-175.
[14] F. F. Li, P. Perona. A Bayesian hierarchical model for learning natural scene categories[C]//Proceedings-2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. [S.l.]: [s.n.], 2005: 524-531.
[15] S. Lazebnik, C. Schmid, J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006: 2169-2178.
[16] M. C. Potter. Short-term conceptual memory for pictures[J]. Journal of Experimental Psychology: Human Learning and Memory, 1976, 2: 509-522.
[17] M. V. Peelen, L. Fei-Fei, S. Kastner. Neural mechanisms of rapid natural scene categorization in human visual cortex[J]. Nature, 2009, 460(7251): 94-97.
[18] A. Oliva, A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope[EB/OL]. [2001-06-25]. http://people.csail.mit.edu/torralba/code/spatialenvelope/.