ZHENG Kai-yi, ZHANG Wen, DING Fu-yuan, ZHOU Chen-guang, SHI Ji-yong, Yoshinori Marunaka, ZOU Xiao-bo*
1. School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, China 2. Department of Molecular Cell Physiology, Kyoto Prefectural University of Medicine, Kyoto 602-8566, Japan
Abstract Near-infrared (NIR) spectroscopy has been widely used in food analysis because of its low measurement cost, easy operation, and fast analysis. As an indirect analytical method, it requires a calibrated model relating spectra to concentrations. However, a model calibrated under one condition may be invalid for spectra measured under another condition. Recalibration solves this problem but costs considerable time and labor. Calibration transfer instead corrects the spectral deviation, preserving prediction precision while avoiding the expense of recalibration. In calibration transfer, the spectra used to calibrate the model are called primary spectra (A), while those that are not used for calibration but rely on the primary model are called secondary spectra (B). The procedure first selects samples from the calibration set as the transfer set of primary spectra (At), and then takes the secondary spectra of samples with the same concentrations as At as the transfer set of secondary spectra (Bt). A transfer matrix is then constructed from At and Bt. The corrected secondary spectra (Bnew) are obtained by multiplying a validation set of secondary spectra (Bv) by the transfer matrix. Finally, Bnew is substituted into the primary-spectra model for prediction. Generating the transfer set is therefore an important step. Transfer-set samples are commonly selected according to distances between spectra rather than validation errors, although transfer errors are important for estimating the power of calibration transfer. Hence, this paper proposes ensemble refinement (ER), based on model population analysis, to further refine the transfer set generated by the Kennard-Stone (KS) method. ER first generates several subsets of the transfer set and computes the validation error of each subset. Then, for each sample, the average error of the subsets that include that sample is obtained. Finally, the samples with low average errors are selected as the transfer set for calibration transfer. The corn dataset is used to examine this method. The results show that, for calibration transfer methods such as canonical correlation analysis combined with informative components extraction (CCA-ICE), direct standardization (DS), piecewise direct standardization (PDS) and spectral space transformation (SST), ER selects key samples and reduces the errors significantly compared with the KS method.
Keywords Calibration transfer; Model population analysis; Sample selection; Partial least squares; Near-infrared spectrum
Near-infrared (NIR) spectroscopy has been widely used in environmental[1], petroleum[2] and agricultural[3] areas because of its ease of operation, low measurement cost, and fast analysis. However, as an indirect analytical method, it requires a feasible model to be developed in advance. Generally, a model calibrated under one condition cannot be applied to spectra measured under different conditions. Recalibrating a new model solves this problem, but it can be uneconomical and labor-intensive. Calibration transfer offers an alternative solution.
Among the spectra batches in calibration transfer, the spectra used to construct the model are called primary spectra, while the spectra that are not used for calibration but only use the model of the primary spectra are called secondary spectra[4-5].
In recent years, several calibration transfer methods have been proposed, including direct standardization (DS)[6], piecewise direct standardization (PDS)[7-8], canonical correlation analysis (CCA)[9-10], and spectral space transformation (SST)[11]. Among these methods, CCA combined with informative components extraction (CCA-ICE) has exhibited promising results for calibration transfer. In addition to calibration transfer models, sample selection methods for transfer sets, such as the Kennard-Stone (KS) method[12], are also crucial.
However, the KS method selects the transfer set only by the distances between samples in the calibration set. Refining the transfer set generated by the KS method may therefore further reduce the prediction errors. Moreover, the transfer set may contain less informative samples that enlarge the prediction errors, so its samples should be refined further. In recent years, model population analysis (MPA) has been utilized in chemical and biochemical data analysis, for example for sample selection in multivariate calibration. Similar to multivariate calibration, transfer set generation in calibration transfer is also a sample selection procedure. Thus, in this study, a transfer set refinement method referred to as ensemble refinement (ER) is proposed, which uses the idea of MPA to optimize the samples in a transfer set.
The primary and secondary spectra are symbolized as matrices A and B, respectively. The transfer and calibration sets of spectra A are denoted At and Ac, respectively, while the transfer, validation and prediction sets of spectra B are denoted Bt, Bv and Bp, respectively. The symbol y denotes the sample concentrations. At can be obtained from Ac using the sample selection method. Meanwhile, the samples of spectra B with concentrations similar to those of At are assigned as Bt.
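To make the roles of At, Bt and Bv concrete, the following is a minimal sketch of how a transfer matrix could be built and applied. It uses direct standardization (DS), one of the methods cited above, in its usual least-squares form; it is not the authors' implementation. The function name `ds_transfer_matrix`, the synthetic data and the matrix `F_true` are assumptions made for illustration only.

```python
import numpy as np

def ds_transfer_matrix(At, Bt):
    """Least-squares F such that Bt @ F approximates At (DS-style sketch)."""
    return np.linalg.pinv(Bt) @ At

# Synthetic stand-in data (hypothetical, not the corn dataset)
rng = np.random.default_rng(0)
At = rng.normal(size=(30, 5))                        # 30 transfer samples, 5 channels
F_true = np.eye(5) + 0.1 * rng.normal(size=(5, 5))   # assumed instrument difference
Bt = At @ np.linalg.inv(F_true)                      # secondary spectra of the same samples

F = ds_transfer_matrix(At, Bt)

# Correct a validation set of secondary spectra before using the primary model
Bv = rng.normal(size=(10, 5))
Bnew = Bv @ F
```

In this noise-free toy case the estimated matrix recovers `F_true`; with real spectra the fit is only approximate and its quality depends on which samples form the transfer set.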
Similar to the procedure of MPA[13-15], the ER algorithm includes the following three parts: (1) subset sampling for the transfer set, (2) sub-model building through calibration transfer methods, and (3) random analysis of the root mean square errors of validation (RMSEV) of the generated subsets. The detailed procedure is as follows:
Consider a matrix At for m samples, with each row as a sample. Two parameters must be considered: the ratio of the selected samples to all samples (r) and the number of samplings (N). Samples are randomly selected from At to generate a subset. After N repeated samplings, N subsets of the transfer set are obtained. This procedure is illustrated in Fig.1, where m=20, r=0.6 and N=15.
Fig.1 Illustrative example of subset sampling in a transfer set. The black squares are the selected samples, while the white ones are not selected
Figure 1 shows that the first subset, which includes 12 samples, is selected from the 20 samples (20×0.6). A further 12 samples are then chosen with another sampling index. Thus, after 15 samplings, 15 subsets of the transfer set are generated. In Fig.1, the probability of each sample being selected in one sampling is 0.6, which is identical to the value of r. Thus, after N samplings, the theoretical number of times (Nt) that a sample is selected can be computed as follows
$$N_t = N \cdot r \tag{1}$$
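The sampling step and Eq. (1) can be sketched as follows, using the Fig.1 setting (m=20, r=0.6, N=15). This is only an illustration: the function name `sample_subsets`, the fixed seed and the rounding of m×r to an integer subset size are assumptions, not details given in the text.

```python
import numpy as np

def sample_subsets(m, r, N, seed=0):
    """Return an N x m boolean mask; row i marks the samples drawn in sampling i."""
    rng = np.random.default_rng(seed)
    k = int(round(m * r))                  # subset size (rounding is an assumption)
    mask = np.zeros((N, m), dtype=bool)
    for i in range(N):
        idx = rng.choice(m, size=k, replace=False)   # sampling without replacement
        mask[i, idx] = True
    return mask

mask = sample_subsets(m=20, r=0.6, N=15)
counts = mask.sum(axis=0)                  # how often each sample was selected
print(counts.mean())                       # average count equals N * r
```

Individual samples are selected more or fewer times than N×r by chance, but the average over all samples matches Eq. (1).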
A small Nt cannot extract the sample information in a transfer set, while a large Nt increases the computational burden. Thus, a suitable Nt value must be fixed. In this study, Nt is set to 100, which implies that each sample is theoretically selected 100 times. The two parameters can therefore be reduced to the single parameter r. Given the value of r, the value of N can be computed as follows
$$N = 100 / r \tag{2}$$
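As a small worked example of Eq. (2), the sketch below computes the number of samplings for a given r. Because 100/r is generally not an integer, some rounding is needed; rounding up is an assumption here, and the function name `samplings_needed` is hypothetical.

```python
import math

def samplings_needed(r, Nt=100):
    """Number of samplings N so that each sample is expected to be drawn about Nt times."""
    if not 0 < r <= 1:
        raise ValueError("r must be in (0, 1]")
    return math.ceil(Nt / r)               # rounding up is an assumption

print(samplings_needed(0.6))               # 100 / 0.6 is about 166.7
print(samplings_needed(0.5))
```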
Calibration transfer can be performed with each randomly generated subset to estimate its RMSEV (RMSEV1). For example, in Fig.1, 15 RMSEV1 values are obtained after 15 samplings.
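For reference, the root mean square error used for RMSEV (and later RMSEP) can be computed as in the sketch below. The function name `rmse` and the toy values are illustrative assumptions; in the procedure above, the predictions would come from applying the primary-spectra model to the corrected validation spectra.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted concentrations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values only
value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```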
After random sampling, each subset of the transfer set is applied to the calibration transfer, so that one RMSEV1 value is obtained per sampling. For a given sample, the RMSEV1 values of all subsets that include this sample are collected, and their average (mRMSEV1) is assigned to the sample. For example, in Fig.1, after 15 samplings, the 2nd, 4th, 6th, 8th, 9th, 10th, 12th, 13th and 15th subsets contain the first sample, and thus the mRMSEV1 of these subsets is used to evaluate the transfer power of the first sample. Similarly, the mRMSEV1 of the 1st, 2nd, 4th, 5th, 7th, 10th, 11th, 13th and 14th subsets is set as the transfer power of the second sample. In this way, the mRMSEV1 of each sample can be obtained.
After sampling, the samples with low mRMSEV1 values can be considered candidates for reducing calibration transfer errors. Thus, the samples are sorted in ascending order of their mRMSEV1 values, and the samples with low mRMSEV1 values are chosen for calibration transfer. The detailed procedure of the proposed method is given as follows:
As shown in Fig.2, the proposed method includes the following four steps: (1) random sampling, (2) obtaining RMSEV1 of each subset, (3) obtaining mRMSEV1 of each sample, and (4) selecting the samples with low mRMSEV1 values. In the proposed method, r and the number of samples in the original transfer set (m) must be adjusted in advance.
Fig.2 The procedure of the ER method
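Steps (3) and (4) of the procedure can be sketched as below, assuming the subset membership is stored as an N×m boolean mask and the RMSEV1 of each subset has already been computed by some calibration transfer method. The function names and the toy numbers are hypothetical and serve only to illustrate the averaging and ranking.

```python
import numpy as np

def mean_rmsev_per_sample(mask, rmsev1):
    """Average RMSEV1 over the subsets that contain each sample (mRMSEV1)."""
    mask = np.asarray(mask, dtype=float)       # N subsets x m samples, 1 = included
    rmsev1 = np.asarray(rmsev1, dtype=float)   # one error per subset
    counts = mask.sum(axis=0)
    totals = mask.T @ rmsev1
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, totals / counts, np.nan)  # never-drawn samples get NaN

def rank_samples(m_rmsev):
    """Sample indices in ascending order of mRMSEV1 (NaN entries sort last)."""
    return np.argsort(m_rmsev, kind="stable")

# Toy example: 3 subsets of 4 samples
mask = [[1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 1]]
rmsev1 = [0.10, 0.30, 0.22]
m_rmsev = mean_rmsev_per_sample(mask, rmsev1)
order = rank_samples(m_rmsev)
```

In this toy case the second sample (index 1) appears only in the two lowest-error subsets, so it is ranked first; the refined transfer set would keep the leading entries of `order`.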
The spectra of the corn dataset, scanned on three NIR spectrometers, were downloaded from http://www.eigenvector.com/data/Corn/index.html. Each of the three NIR spectra batches includes 80 samples, with spectra ranging from 1100 nm to 2498 nm. Among the three batches, mp6 and m5 are assigned as the primary and secondary spectra, respectively. The moisture values are set as y.
For the primary spectra with 80 samples, after sorting by the values of y, the first sample in every group of four contiguous samples is set aside (20 samples in total). The remaining 60 primary spectra samples form the calibration set of primary spectra. Among these 60 samples, certain samples are chosen as the transfer set of primary spectra using the KS method. After the transfer set of primary spectra is generated, the samples of the calibration set in the secondary spectra with similar y values are assigned as the transfer set of secondary spectra.
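The KS selection used to build the initial transfer set can be sketched as follows. This is the commonly described form of the Kennard-Stone algorithm with Euclidean distances (start from the two most distant samples, then repeatedly add the sample farthest from the already selected ones); the text does not state the distance metric, so that choice, the function name `kennard_stone` and the toy data are assumptions.

```python
import numpy as np

def kennard_stone(X, k):
    """Select k row indices of X by the Kennard-Stone max-min distance rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)   # two most distant samples
    selected = [int(i), int(j)]
    min_d = np.minimum(d[i], d[j])                   # distance to nearest selected sample
    while len(selected) < k:
        min_d[selected] = -1.0                       # exclude samples already chosen
        nxt = int(np.argmax(min_d))
        selected.append(nxt)
        min_d = np.minimum(min_d, d[nxt])
    return selected

# Toy one-dimensional example
X = np.array([[0.0], [1.0], [2.0], [10.0]])
chosen = kennard_stone(X, 3)
```

Here the two extremes are picked first, followed by the remaining point that is farthest from both.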
Moreover, for the 20 primary spectra samples set aside, the secondary spectra samples with similar y values are retained. Among these 20 secondary spectra samples, the first and second samples of every two contiguous samples are assigned to the prediction and validation sets, respectively.
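The partition described above can be sketched as index bookkeeping. The sketch assumes that both instruments measured the same samples in the same row order, so that one set of indices applies to both spectra batches; the function name `split_by_sorted_y` and the random stand-in for y are hypothetical.

```python
import numpy as np

def split_by_sorted_y(y):
    """Return (calibration, validation, prediction) indices from a y-sorted split."""
    order = np.argsort(y, kind="stable")   # sample indices sorted by concentration
    keep = np.ones(len(order), dtype=bool)
    keep[0::4] = False                     # first of every four contiguous samples is set aside
    held_out = order[~keep]
    calib = order[keep]
    pred = held_out[0::2]                  # first of each pair: prediction set
    valid = held_out[1::2]                 # second of each pair: validation set
    return calib, valid, pred

rng = np.random.default_rng(1)
y = rng.normal(size=80)                    # stand-in for the 80 moisture values
calib, valid, pred = split_by_sorted_y(y)
print(len(calib), len(valid), len(pred))
```

With 80 samples this gives 60 calibration, 10 validation and 10 prediction samples.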
For the corn dataset, the number of latent variables is optimized as nine. Additionally, the parameters m and r must be investigated. Because the sampling subset cannot execute CCA-ICE under the condition of m×r ... Figure 3 shows that, at different combinations of m and r, the RMSEV2 values of the proposed method are almost always lower than those obtained by the KS method. This indicates that the transfer set generated by the KS method can be further refined by the proposed ER method. In each of plots (c), (d), (e), (f), (g), (h) and (i), RMSEV2 displays a decreasing trend with ascending m at m<30. This is because a larger number of samples obtained by the KS method facilitates the refinement by ER. Furthermore, after m exceeds 30, RMSEV2 remains nearly constant. Since selecting many transfer samples may generate redundant information for the calibration transfer, m is set to 30.
Fig.3 RMSEV2 of the corn dataset at r from 0.2 to 0.9 (plots a to h) and m from 20 to 60. In each plot, the blue and red lines represent RMSEV2 of the KS method and the proposed method, respectively
In addition to m, r must be investigated. The RMSEV2 values at different r are shown in Fig.4.
Fig.4 RMSEV2 of the corn dataset at r ranging from 0.3 to 0.9 at m=30
In Fig.4, RMSEV2 reaches its minimum at r=0.6. Thus, r is set to 0.6. After fixing m and r, the variation in RMSEV2 with the number of retained samples w can be examined. Fig.5 indicates that, as w increases, RMSEV2 first decreases and reaches its minimum at w=28. At the largest w, RMSEV2 is 0.0942, which is the same as the result without further refinement. Thus, the subset with 28 samples and minimal RMSEV2 is set as the optimal subset. With the parameters fixed using the validation set, the RMSEP of the prediction set is then used to examine the effect of ER.
Fig.5 Variation in RMSEV2 for subsets with w from 9 to 30 at m=30 and r=0.6
The results are listed in Table 1.
Table 1 Computation errors of corn dataset by KS and ER methods
Table 1 shows that the ER method refines the transfer set of CCA-ICE with a low RMSEV2 and achieves a lower RMSEP than the KS method. Commonly used methods such as DS, PDS and SST can also be combined with the ER method. In Table 1, DS, PDS and SST with ER refinement all obtain lower RMSEV2 and RMSEP than with the KS method.
Moreover, to further analyze the power of ER, a random sampling test is performed. For each calibration transfer method (CCA-ICE, DS, PDS and SST), random sampling is repeated 100 times. In each loop, the samples are randomly divided into calibration, validation and prediction sets of sizes 60, 10 and 10, respectively. The original transfer set is then generated from the calibration set by the KS method. Subsequently, the samples in the transfer set are further refined by the ER method, with RMSEV2 of the validation set used to determine the number of samples to be retained. Finally, the refined and non-refined transfer sets are applied to transfer the prediction set. After the 100 random samplings, the RMSEP of KS and ER at different m is shown in Fig.6.
Fig.6 shows that, for each transfer method (CCA-ICE, DS, PDS and SST) and at each value of m, the RMSEP values of ER are lower than those of KS. Among the four calibration transfer methods, CCA-ICE generates low prediction errors. CCA-ICE transfers the informative components extracted by the partial least squares (PLS) model. Moreover, the backward refinement can further reduce the errors in a prediction set. For DS and SST, with increasing m, the RMSEP values of KS display a decreasing trend, while those of ER remain nearly constant.
This implies that ER can select key samples for calibration transfer through DS and SST with low errors. In Fig.6(c), although the errors of PDS obtained by KS are larger than those of CCA-ICE, DS and SST, ER can reduce the prediction errors by refining the samples.
Fig.6 Average RMSEP of corn dataset at different values of m under the transfer set generated by KS (blue line) and ER (red line), respectively
4 Conclusion
A new transfer set refinement method, ER, was proposed based on MPA. Initially, ER generated several subsets for the calibration transfer. Subsequently, for each sample, the average error of the subsets containing that sample was obtained. Finally, samples with low average errors were selected as the refined transfer set. The corn dataset was used to test the proposed method. The results indicated that the calibration transfer methods, including CCA-ICE, DS, PDS and SST, achieved lower prediction errors when the transfer set was refined by ER. Hence, ER can effectively refine the transfer set in calibration transfer.