Hao-He Liu | Si-Qi Yao | Cheng-Ying Yang | Yu-Lin Wang
Abstract: In this paper, we propose a hybrid model that maps the input noise vector to the label of the image generated by a generative adversarial network (GAN). This model mainly consists of a pre-trained deep convolution generative adversarial network (DCGAN) and a classifier. Using the model, we visualize the distribution of two-dimensional input noise that leads to a specific type of generated image after each training epoch of the GAN. The visualization reveals the distribution feature of the input noise vector and the performance of the generator. With this feature, we try to build a guided generator (GG) that can produce the fake image we need. Two methods are proposed to build GG. One is the most significant noise (MSN) method, and the other utilizes labeled noise. The MSN method can generate images precisely but with less variation. In contrast, the labeled noise method has more variation but is slightly less stable. Finally, we propose a criterion to measure the performance of the generator, which can be used as a loss function to effectively train the network.
Index Terms: Deep convolution generative adversarial network (DCGAN), deep learning, guided generative adversarial network (GAN), visualization.
Unsupervised learning is thought to be the general solution for extracting features from the vast quantities of unlabeled data [1] and for deriving a latent function that maps spatial or other features of training data to a series of labels defined in advance. After the invention of traditional generative adversarial networks (GANs) [2], many variants of GAN have emerged with improvements in performance and training stability [3].
As we know, a given input noise vector can generate only one kind of image. The generator in our work consists of fractionally-strided convolutions [4], batch normalization, and activation functions. A fractionally-strided convolution transforms something that has the shape of the output of a convolution into something that has the shape of its input, while maintaining a connectivity pattern that is compatible with that convolution [5]. The output of the batch normalization and the activation function does not change for the same batch of input data. So the relation between the input noise vector and the generated image is a one-to-one mapping, which is the prerequisite of our research.
In this paper, the generator of GAN is denoted by G(p1×n, θg) or simply G, where p1×n is the input of the generator and θg is a set of trainable parameters. The input noise, p1×n, is a row vector with n columns. Each element of the vector is randomly sampled from a given distribution, such as the Gaussian distribution. GAN has been studied extensively, but it is still a black box for most researchers and users. There is very limited research trying to understand what GANs learn and how to visualize the intermediate representations in the multiple layers of GANs [4]. Two of the most prominent questions are how the generator represents or understands the data we feed in and how to judge whether a generator is good or not. To answer these questions, we develop an approach based on the deep convolution generative adversarial network (DCGAN) [4] and the convolutional neural network (CNN) classifier [6] to explore the relation between the input noise vector and the generated image. From the training procedure of GAN, we know that the training dataset is used only to train the discriminator rather than both the generator and the discriminator, i.e., the generator does not learn information directly from the training set. Therefore, the generator itself must come to understand the pattern and structure of the training dataset. Taking a two-dimensional input noise vector, p1×2, as an example in this paper, we find that the input noise vectors leading to the same class of generated images tend to congregate in a specific manner.
Instead of the traditional approach of randomly generating a batch of images and then manually selecting the one we want, we utilize the distribution feature of the input noise vectors to let the proposed GAN generate exactly the image we expect. We name the proposed GAN the guided GAN, and the proposed generator the guided generator (GG). In order to filter out outliers, we introduce the Pearson correlation coefficient to score the similarity of two images. Experimental results show that GG performs satisfactorily.
The contributions of our paper include: 1) discovering, visualizing, and analyzing some interesting relations between input noise vectors and fake images; 2) successfully making use of these relations to realize GG; 3) introducing a criterion to measure the performance of the generator.
GAN has recently achieved impressive results in many research fields and application areas [2]. Generally, researchers are more interested in how well the generator can cover the distribution of the real images; in other words, how realistic and vivid the fake images are. One of the most notable recent studies in this field is StyleGAN [7], which can produce high-resolution and hyper-realistic faces with a style-based generator. Moreover, GAN can also be used to solve the denoising problem by training the generator to estimate the noise distribution over the input noisy images, so as to generate noise samples [8].
Although research on noise vectors is limited [9], the distribution of input noise vectors is important for the performance of a neural network. For example, by adding some latent codes and targeting salient structured semantic features of training data [1], InfoGAN can successfully disentangle the writing styles on the MNIST dataset, and the hairstyle, presence or absence of eyeglasses, and emotion on the CelebA face dataset. Based on these results, we believe there must be some latent relationships between the input noise vectors and the features of generated images.
Until now, most visualization of GAN has focused on the internal layers and filters [4], the training loss of the generator and discriminator, and the distribution similarity of fake and real images. For example, the visualization of the internal filters shows how the features learned by the kernels of the discriminator are activated on the typical parts of a scene [4]. The distribution similarity between the fake and real datasets can be visualized in GANLab [10]. As training progresses, the distribution of training data and the distribution of generated data gradually overlap. In this way, we can directly understand the learning process of the generator [10].
Without a well-trained GAN, a generator may not produce a fine projection from the input noise vector to the generated image. Among the GAN variants, DCGAN is one of the most prominent. In DCGAN, a generator generates fake examples, and a discriminator tries to decide whether the image is fake or not, just like traditional GANs. The generator and discriminator are trained alternately for adversarial purposes with different loss functions. Since DCGAN incorporates the advantages of CNN and introduces the fractionally-strided convolution for efficient up-sampling [4], DCGAN is simpler and more efficient than traditional GANs. Unlike the traditional multilayer perceptron, there are no fully connected (FC) layers in DCGAN, so the number of parameters is significantly reduced. In addition, DCGAN has advantages in network initialization and stability. Therefore, using DCGAN, we can generate images more similar to real images.
Noise variables, i.e., elements of the input vector of the generator, play a key role. In most cases, researchers interpret the role of noise variables as reducing the certainty of the model. Because random noise carries fewer “structures”, using the noise variable as the input avoids bias and assumptions in the early stage of the model.
Inspired by InfoGAN, we first build a conventional GAN trained on the CelebA and CIFAR-10 datasets [11], [12], with a p1×16 noise vector sampled from the Gaussian distribution, as shown in Fig. 1.
When the noise vector is deliberately modified, the generated image also changes in a nearly smooth and continuous way. This indicates that the spatial characteristics of two noise vectors, such as their regional or symmetric distribution, may affect the similarity of their generated images. In order to study the effect of the noise variables' values on the generated images, we adjust the element values of the input noise vector in two ways, as shown in Figs. 2 (a) and (b), respectively.
Fig. 1. Well-trained generator.
Fig. 2. Images generated by partly different input vectors of the same size of p1×16: (a) setting one element value to 0 at a time to gradually change the row vectors and (b) setting one element at a time to half of its original value to gradually change the row vectors.
Fig. 2 shows 15 images generated by 15 different input vectors with the same size of p1×16. The element values of the input vectors are shown in different colors in the heat map. Each row of the 15×16 matrix represents an input vector p1×16, whose element values are indicated by the colors of its cells and the color bar. By feeding each row into the generator, we can generate a total of 15 fake images shown on the left side of Fig. 2. The 15 input vectors in Figs. 2 (a) and (b) all come from fine-tuning the noise vector in Fig. 1.
In Fig. 2 (a), any two adjacent rows differ in the value of only one element, which is set to 0 in the lower row. In other words, the elements in the lower triangular part of this matrix are all set to zero. Feeding the pretrained generator with these 15 input vectors, we get a column of fake images, shown on the left side of Fig. 2, where the number next to each image is the Pearson correlation coefficient used to measure the similarity between the generated images and the real images. Similarly, in Fig. 2 (b), any two adjacent rows differ in the value of only one element: we change the value of one element in the current vector to half of the corresponding element in the previous vector.
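The construction of the two 15×16 matrices can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `progressive_variants` is hypothetical, the base vector is drawn at random rather than taken from Fig. 1, and elements are modified from left to right, which the text does not specify.

```python
import numpy as np

def progressive_variants(base, n_rows=15, mode="zero"):
    """Build an (n_rows, len(base)) matrix of gradually modified noise vectors.

    Row 0 is the base vector. Each later row copies the previous row and
    changes exactly one more element: it is set to 0 (mode="zero") or to
    half of its value (mode="half"). Left-to-right order is an assumption.
    """
    base = np.asarray(base, dtype=float)
    rows = [base.copy()]
    for i in range(1, n_rows):
        row = rows[-1].copy()
        if mode == "zero":
            row[i - 1] = 0.0
        elif mode == "half":
            row[i - 1] = 0.5 * row[i - 1]
        else:
            raise ValueError("mode must be 'zero' or 'half'")
        rows.append(row)
    return np.stack(rows)

rng = np.random.default_rng(0)
p = rng.standard_normal(16)            # stand-in for the noise vector of Fig. 1
variants_zero = progressive_variants(p, mode="zero")   # as in Fig. 2 (a)
variants_half = progressive_variants(p, mode="half")   # as in Fig. 2 (b)
```

Each row of `variants_zero` or `variants_half` would then be fed to the pretrained generator to obtain one fake image per row.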
The main goal of our model is to map an input noise vector, p1×n, to a generated image's label. For this purpose, we first train two models separately, DCGAN and the classifier. Then we combine them for our mapping work. The scheme is shown in Fig. 3.
Fig. 3. Proposed model exploring the relation between input noise vectors and generated images.
We train a DCGAN model on the MNIST dataset. After 20 epochs, we find that the fake images shown in Fig. 4 are clear enough, so we stop training and separate the generator. During the training session, we save the parameters of the generator at regular intervals, that is, ten times per epoch, for the follow-up learning process. In our work, we test noise variables sampled from both the uniform distribution and the Gaussian distribution. We find that it is hard to generate satisfactory images by using high-dimensional noise vectors from the uniform distribution. This is because the overall distribution is too scattered, so it is hard for the generator to find a pattern that covers the widely spread random noise points.
Fig. 4. Faked digits by DCGAN.
At the initial stage of constructing a restorer to inversely reconstruct the corresponding input noise vector from the generated image, we train a typical decoder using densely connected convolutional networks [13]. However, although a series of optimization techniques have been adopted, the training error is still too large to recover the true value of the noise vector. So we use the classifier instead of the decoder as our restorer. We choose a conventional CNN-based model as our classifier, which contains two convolution layers and an FC layer. Each convolution is followed by max pooling and the activation function (ReLU). Though our classifier is quite simple, it works well on the MNIST dataset. As shown in Fig. 5, the test accuracy is above 98.5%, which is sufficient for our purpose.
The overall structure of our system is shown in Fig. 3. We first sample a vector, p1×n, from the normal distribution, and then feed p1×n into the generator to produce a fake image. Next, we feed the fake image to the classifier to yield the label of the fake image. Now, we get a pair of (p1×n, label). For a given generator, repeating this process m times, we get m vector-label pairs:

$$\left\{\left(p_{1\times n}^{(i)},\ \mathrm{label}^{(i)}\right)\right\}_{i=1}^{m}$$

Fig. 5. Training curve of classifier.
In order to facilitate the visualization of the relationship between input noise vectors and the labels of generated images, we set the parameter n of p1×n to 2, so that the visualization can be done easily in two-dimensional space. We randomly generate 6000 noise vectors to create a training dataset. In addition, during the training session, the parameters of the generator are recorded at different intervals in different epochs.
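The pairing procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: `collect_pairs`, `toy_generator`, and `toy_classifier` are hypothetical names, and the two toy functions only stand in for the trained DCGAN generator and the CNN classifier so that the snippet runs on its own.

```python
import numpy as np

def collect_pairs(generator, classifier, m, n=2, seed=0):
    """Sample m Gaussian noise vectors of length n and label their images."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((m, n))   # one row per noise vector p
    images = generator(noise)             # fake images, one per row
    labels = classifier(images)           # predicted label per image
    return noise, labels

def toy_generator(noise):
    """Hypothetical stand-in: a 1x1 'image' holding the direction of p."""
    angle = np.arctan2(noise[:, 1], noise[:, 0])
    return angle.reshape(-1, 1, 1)

def toy_classifier(images):
    """Hypothetical stand-in: split the plane into 10 equal angular sectors."""
    angle = images.reshape(-1)
    sector = np.floor((angle + np.pi) / (2 * np.pi) * 10).astype(int)
    return np.clip(sector, 0, 9)

noise, labels = collect_pairs(toy_generator, toy_classifier, m=6000)
```

A scatter plot of `noise[:, 0]` against `noise[:, 1]`, colored by `labels`, gives the kind of picture described for the two-dimensional case. With a real generator, the same call would be repeated for each saved set of generator parameters.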
Treating the two elements of a noise vector p1×2 as two-dimensional coordinates, we visualize the input noise vectors and the labels of generated images after 1, 5, and 20 epochs, as shown in Fig. 6. Each point in Fig. 6 represents one input noise vector p1×2 with its two element values as coordinates. Noise points of different colors correspond to different output image digits. From Fig. 6, we can see that points corresponding to the same label (digit) are clustered in the same sector. For a better view of Fig. 6, please visit our GitHub page in [14].
Fig. 6. Generated digits and their input noise distributions.
At the initial stage of training, the generator cannot distinguish the features of distinct digits, so most of the digits it generates are alike or simply not digits. The classifier regards them as the same digit or as a limited number of digits. After a few more epochs, the generator starts to capture the latent features of the images, and so does our classifier. At this time, all sorts of digits are generated by our generator. However, the fake images are still not good enough because our classifier only sorts images based on their features instead of their overall shapes. That is why some clusters of points with the same color are distributed in different sectors, such as the red points in Fig. 6. Interestingly, the input noise points gather in several different sectors, and the noise points in different sectors correspond to different output digits.
At the same time, we found that the image generated by using p1×2 is literally the same as that generated by using a·p1×2, where a is a scaling factor which can take any positive value. As demonstrated in Fig. 7, the fake pictures generated by the noise values at points H, I, and G are literally the same.
Thus, we conclude that it is not the absolute value of the input noise that affects the image we get, but the relative values among the elements of the input noise. In Fig. 7, for example, the slope k of a line determines the generated image. This is why the scatter of noise points, such as in Fig. 6 and Fig. 8, is fan-shaped. For a two-dimensional noise vector, k can be calculated with

$$k = \frac{p_{1\times 2}(2)}{p_{1\times 2}(1)}$$

where p1×2(i) stands for the value of the ith dimension of vector p1×2.
Fig. 7. Demonstration of points in a sector. The horizontal and vertical axes correspond to the range where we sample the random points.
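The scale invariance can be checked numerically. The sketch below assumes that the slope is the ratio of the second element to the first; the helper names `slope` and `direction` are hypothetical. Because a slope alone cannot tell p from −p, the angle from `np.arctan2` is also shown as a direction measure that stays unchanged only for positive scaling factors.

```python
import numpy as np

def slope(p):
    """Slope k of the line from the origin through a 2-D noise vector."""
    p = np.asarray(p, dtype=float)
    return p[1] / p[0]

def direction(p):
    """Angle of p in radians; unlike the slope, it separates p from -p."""
    p = np.asarray(p, dtype=float)
    return np.arctan2(p[1], p[0])

p = np.array([0.8, 1.2])

# Scaling by any positive factor a leaves both the slope and the angle unchanged.
for a in (0.5, 2.0, 10.0):
    assert np.isclose(slope(a * p), slope(p))
    assert np.isclose(direction(a * p), direction(p))

# A negative factor keeps the slope but flips the direction.
same_slope = np.isclose(slope(-p), slope(p))
same_direction = np.isclose(direction(-p), direction(p))
```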
Fig. 8 visualizes the training results of our generator. We first choose 40 noise points, one every 2π/40 radians, on a circle with a radius of 2. After that, we feed the values of each noise point into the generator trained after 25 epochs, and get 40 representative fake images, as shown in Fig. 8.
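As a small sketch, the 40 probe points can be generated as below. The function name `circle_noise` and the choice to start at angle 0 are assumptions made for illustration.

```python
import numpy as np

def circle_noise(num_points=40, radius=2.0):
    """Return num_points 2-D noise vectors evenly spaced on a circle."""
    angles = np.arange(num_points) * (2 * np.pi / num_points)
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))

points = circle_noise()   # shape (40, 2); each row is one noise vector
```

Feeding each row of `points` to a trained generator would give one representative image per direction.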
In the above experiments, the noise points are two-dimensional, that is, each noise point contains two element values. Next, we extend the above experimental results to higher dimensions. We use t-distributed stochastic neighbor embedding (t-SNE) [15] to visualize the distribution of the noise vectors and the related generated fake images. The result shows that the above assertion is also applicable.
Fig. 8. Noise vector chosen every 2π/40 radians on a circle with a radius of 2.
We have found that the noise vectors are not randomly distributed with respect to their outputs but are grouped according to the feature or class of the images they generate. This can therefore guide us in selecting a suitable input noise vector to generate exactly the fake image we want.
In Fig. 8, the input noise points are well clustered rather than distributed in tiny divided slices, so we can select a representative noise point for each expected output label. The most effective way to select a representative noise point, or most significant noise (MSN), is simply to select any point on the middle radial line of the sector in which the desired output label is located.
For example, for a group of classified points labeled in purple in Fig. 8, assuming the slope range of radial lines in the sector is φ0 to φ1, we choose φ = (φ0 + φ1)/2 as MSN. As an example, any noise point sampled near or on line OE in Fig. 7 can generate an image more like a real digit, as shown in Fig. 9.
Fig. 9. Generated images with the MSN method.
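The midpoint rule can be sketched as follows. Two assumptions are made loudly here: the range φ0 to φ1 is measured as an angle via `np.arctan2` rather than as a raw slope, and the sector of the target label is one contiguous range that does not cross the ±π boundary. The function name `msn_point` is hypothetical.

```python
import numpy as np

def msn_point(noise, labels, target, radius=1.0):
    """Pick a noise point on the middle radial line of the target's sector."""
    noise = np.asarray(noise, dtype=float)
    mask = np.asarray(labels) == target
    if not mask.any():
        raise ValueError("no noise point carries the target label")
    angles = np.arctan2(noise[mask, 1], noise[mask, 0])
    phi0, phi1 = angles.min(), angles.max()   # angular range of the sector
    phi = 0.5 * (phi0 + phi1)                 # middle radial line
    return radius * np.array([np.cos(phi), np.sin(phi)])

# Tiny hand-made example: label 3 occupies angles 0.2 to 0.6 rad.
angles = np.array([0.2, 0.3, 0.6, 2.0, 2.5])
noise = np.column_stack((np.cos(angles), np.sin(angles)))
labels = np.array([3, 3, 3, 7, 7])
msn = msn_point(noise, labels, target=3)
```

Since only the direction matters, `radius` can be any positive value; outliers mislabeled by the classifier would widen the range, so in practice the extreme angles may need to be trimmed first.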
In most situations, when noise vectors in the same sector are fed to the generator, the generated images look alike and differ only in small details. So we try to use the labeled noise to generate the image we want. For example, if we need a fake image with the label k, we can directly pick out the noise points that are pre-labeled by the classifier with k. Then these noise points and their adjacent points can be used by the generator to generate the handwritten digit k.
However, since the classifier is not perfectly accurate, we use the Pearson correlation coefficient as a criterion to measure the similarity between the generated image and the standard image, so as to filter out those images that are not satisfactory enough, as shown in Fig. 10.
Fig. 10. Image generating and filtering.
In Fig. 10, there are ten standard images representing digits 0 to 9, respectively. Each standard image is obtained by averaging 1000 randomly selected training images of the same kind. We calculate the correlation coefficient between the generated image and its corresponding standard image as follows:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

where X = image_fake, Y = image_standard, Cov(X,Y) is the covariance of the fake image and the standard image, and σX and σY are the standard deviations of these two images, respectively.
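A minimal NumPy sketch of this coefficient for two images is given below. It assumes that each image is flattened into a vector of pixel values, and the function name `pearson` is hypothetical. Returning 0 for a constant image, where the coefficient is undefined, is a convention chosen for this sketch only.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two images of equal shape."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        return 0.0   # constant image: undefined, treated as uncorrelated here
    return float((xc * yc).sum() / denom)

rng = np.random.default_rng(0)
img = rng.random((28, 28))                       # stand-in for a fake image
other = img + 0.1 * rng.standard_normal((28, 28))  # a noisy copy
score = pearson(img, other)
```

The coefficient is unchanged by a positive rescaling or a constant shift of brightness, so it compares the pattern of the two images rather than their overall intensity.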
In Fig. 10, only those images very similar to the standard images can pass the filter, so the final result is very satisfactory. The only hyper-parameter that needs to be set is the threshold of the filter for each class of images.
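The averaging and filtering steps can be sketched together as follows. Everything in the demo is synthetic and assumed for illustration: random templates play the role of digit classes, 20 images per class are averaged instead of 1000, and the threshold of 0.5 is arbitrary. The names `standard_images` and `filter_fakes` are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two flattened images."""
    xc = np.ravel(x) - np.mean(x)
    yc = np.ravel(y) - np.mean(y)
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def standard_images(images, labels, num_classes):
    """Average all training images of each class into one standard image."""
    return np.stack([images[labels == k].mean(axis=0) for k in range(num_classes)])

def filter_fakes(fakes, fake_labels, standards, thresholds):
    """Keep a fake image only if it correlates well with its class standard."""
    scores = np.array([pearson(f, standards[k]) for f, k in zip(fakes, fake_labels)])
    return scores >= thresholds[fake_labels], scores

rng = np.random.default_rng(0)
num_classes, shape = 3, (28, 28)
templates = rng.standard_normal((num_classes, *shape))      # synthetic classes
train_labels = np.repeat(np.arange(num_classes), 20)
train_images = templates[train_labels] + 0.1 * rng.standard_normal((60, *shape))
standards = standard_images(train_images, train_labels, num_classes)

good = templates[1] + 0.05 * rng.standard_normal(shape)     # close to class 1
bad = rng.standard_normal(shape)                            # unrelated image
fakes = np.stack([good, bad])
thresholds = np.full(num_classes, 0.5)                      # one value per class
keep, scores = filter_fakes(fakes, np.array([1, 1]), standards, thresholds)
```

Raising a class threshold makes the filter stricter for that class, trading variety of the surviving images for closeness to the standard image.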
As demonstrated in Fig. 8, if we tell the generator that we need some fake pictures of digit “6”, then a bunch of handwritten digits of “6” will be generated. Compared with those images generated by the MSN method, these generated images are more like handwritten digits in the real world, with more details and styles, as seen in Fig. 11.
Fig. 11. Faked images with filtered labeled noise.
We aim at generating a picture as real as possible, so the difference between a fake image and its corresponding real one must be reduced during the training phase. Therefore, we come up with a new criterion for our GAN to measure the loss, by combining the Pearson correlation coefficient with our proposed model. The loss function is described as follows:
Fig. 12. Loss of the generator in GAN.
where n is the number of fake images chosen to evaluate the generator. As an example, n is 40 in Fig. 9. The ith fake image belonging to class k is compared with the class-k standard image used for evaluation. t denotes the total number of kinds of images the generator can generate. Generally, the more kinds of images our generator can generate, the better the generator performs. In Fig. 9, t = 10. λ is a hyper-parameter that controls the degree of the penalty on t. We use this criterion to evaluate the generator in our DCGAN model. As shown in Fig. 12, the overall loss decreases during the training session. We made an animation to demonstrate the change of the input noise distribution during the training session, and the source code is available on GitHub (https://github.com/haoheliu/Guided-GANVisualization).
In this paper, we visually reveal the relationship between the input noise and the label of the image generated by GAN. The visualization based on our proposed model illustrates the training process of the generator in a very intuitive way. We also study the relation between the performance of the generator and the visualization result. We find that the features of this result, such as the aggregation pattern, can show the capability of the tested generator.
Using the distribution characteristics of different kinds of fake images, GG can be constructed. GG can successfully generate the images we expect. The output of GG based on the MSN method is more stable but less varied. The output of GG based on labeled noise has more variation but slightly less precision.
Finally, a criterion is proposed to evaluate GAN performance. This criterion can also be used as a loss function in the training process. Since the loss function contains similarity information between the generated image and the corresponding standard image, it may greatly improve the performance of the generator.
Disclosures
The authors declare no conflicts of interest.
Journal of Electronic Science and Technology, 2022, Issue 1