Baojia Wang, Pingzeng Liu, Zhang Chao, Wang Junmei, Weijie Chen, Ning Cao, Gregory M.P. O’Hare and Fujiang Wen
Abstract: Garlic prices have fluctuated dramatically in recent years and are very difficult to predict. The autoregressive integrated moving average (ARIMA) model is currently the most important method for predicting garlic prices. However, the ARIMA model can only predict the linear part of garlic prices and cannot predict the nonlinear part. A method for analyzing the nonlinear characteristics of garlic prices is therefore urgently needed. After comparing the advantages and disadvantages of several major models used to forecast nonlinear time series, this paper uses the support vector machine (SVM) model to predict the nonlinear part of garlic prices and establishes an ARIMA-SVM hybrid model to forecast garlic prices. Monthly average garlic price data from 2010 to 2017 were used to test the ARIMA model, the SVM model and the ARIMA-SVM model. The experimental results show that: (1) garlic prices are affected by many factors, of which the supply and demand relationship is the most important; (2) the SVM model handles the nonlinear relationship in garlic prices well; (3) the ARIMA-SVM hybrid model predicts garlic prices more accurately than either the single ARIMA model or the single SVM model, and can be used as an effective method for predicting short-term garlic prices.
Keywords: Price forecast, machine learning, hybrid model, garlic.
China's garlic planting area and output accounted for 58.43% and 80.86% of the world's garlic acreage and total output, respectively. Its export value and export volume accounted for 72.96% and 84.25% of the world’s garlic exports, respectively. China is the world's largest garlic grower and exporter, and garlic occupies an important position among Chinese agricultural products. However, starting with the “garlic you are ruthless” episode of 2010 and the “garlic you are cheap” episode that followed, garlic prices across the country have repeatedly skyrocketed and plummeted, seriously damaging the vital interests of garlic farmers and consumers [Shi and Li (2017)]. It is therefore of great practical significance to study the characteristics and laws of garlic price fluctuations and to make accurate predictions, in order to prevent the price of garlic from soaring and plunging, to narrow price fluctuations in the garlic market, and to adopt corresponding policies to stabilize prices [Qiu (2013)].
To this end, the Department of Agriculture of Shandong Province, in conjunction with Shandong Agricultural University, Jin Xiang County Government and neighboring companies, jointly set up a big data platform for the garlic industry chain. One of the tasks of the platform is to study how to predict the trends and laws of garlic price fluctuations. Because garlic price fluctuations are affected by factors such as planting area, production, climate, storage, export volume, supply and demand, and market hot money, price forecasting is very difficult. Research on the garlic price curve is rarely reported, and the only forecasting method in use is the traditional time series method. The traditional time series forecasting model can only analyze the linear part of the garlic price. It can predict the trend, but its accuracy is often not high, so it clearly cannot meet people's expectations of garlic price forecasting. The biggest bottleneck in garlic price forecasting is that the nonlinear characteristics of the price are not considered.
With the rise of big data in recent years, more and more intelligent prediction methods based on machine learning have been used for price prediction in many fields [Liu, Guo and Shen (2012)]. Machine learning algorithms can accurately predict the prices of small agricultural products, but they are very data-dependent and have high data requirements. Researchers have therefore combined the two approaches into a hybrid model [Xie, Li and Zhou (2002)]. A hybrid model combines the prediction results of different prediction methods into a new prediction result, which can effectively improve prediction accuracy [Cheng, Chen and Jiang (2012)].
Xiong et al. [Xiong, Qi and Gao (2015)] selected the Holt-Winters seasonal exponential smoothing model, the Census X12 seasonal decomposition model, the SARIMA model, the BP neural network model, and the gray system model as models with small individual prediction errors that are suitable for short-term prediction of the three items, and used the inverse sum of squared errors method to determine the weights of a combined forecasting model. For bananas and apples, the forecasting error of the combined model was the smallest, showing that combined forecasting can improve forecasting accuracy. Li et al. [Li, Xu and Li (2010)] set up a hybrid forecasting method that assigns different weights according to the size of each model's prediction error. Their results show that the prediction error of a single model fluctuates greatly, that precision decreases as the prediction period becomes longer, and that the combined model is better than the single model. Ping et al. [Ping, Liu and Yang (2010)] combined forecasting methods such as the neural network, the gray system, and the time series forecasting model to predict pig prices in Jilin Province. A comparison of the forecast results showed that the method combining the gray system and the neural network has the best prediction accuracy. These results show that a combined model can unite the respective advantages of its component models and effectively improve prediction accuracy.
Forecasts with a horizon of less than one year are short-term forecasts, and forecasts with a horizon of more than one year are long-term forecasts [Huang and Song (2016)]. Based on the above research, this paper compares several current mainstream forecasting models and proposes a combined ARIMA-SVM model, based on big data, for short-term garlic price forecasting. The model is intended to help the garlic industry chain big data platform better predict garlic price trends and laws.
At present, the only garlic price forecasting method is the traditional time series forecasting method, of which the most used is the ARIMA model. ARIMA models assume that the present data are a linear function of past data points and past errors. They also assume that the errors are white noise, and require that the data be made stationary before a linear equation is fitted to them. The ARIMA model shows its advantages when predicting a price time series over a short horizon. It does not need to consider changes in other relevant indicators directly, so the idea of the forecasting model is clear and concise. It is mainly applicable to short-term predictions that are few in number and made at high frequency [Peng and Shen (2014)].
In the literature, ARIMA models have been applied to various time series data, such as sugar prices [Suresh and Priya (2011)], stock market data [Wang and Zhang (2012)], and wind speeds [Cadenas and Rivera (2010)], for the prediction of future values. ARIMA models can help in understanding the dynamics of the data in a given application. Before forecasting time series data, various preprocessing steps can be applied to the raw data if necessary. In Tan et al. [Tan, Zhang and Wang (2010)], a wavelet transformation was applied before forecasting electricity price data of the Spanish and PJM electricity markets. In Orhan [Orhan (2013)], new classification and feature extraction techniques were proposed for electrocardiography data. These preprocessing steps can be applied to the raw data to obtain more accurate predictions. In this paper, the basic ARIMA model was chosen as the linear prediction model for garlic prices. The current mainstream approach to big data forecasting is to fully integrate traditional statistical, econometric, and machine learning analysis methods [He (2016)].
Machine learning methods are used not only for classification but also play a significant role in numerical prediction. Time series prediction models based on machine learning are gradually showing their advantages on nonlinear and non-stationary time series. Among them, the neural network model and the support vector machine model can achieve good accuracy when predicting complex time series. They are two important methods for predicting time series with non-stationary and nonlinear characteristics [Wang, Yang and Mao (2008)]. However, the network structure of a neural network is difficult to determine and is prone to “overfitting”, whereas the optimal solution obtained by the SVM is global, which avoids the local-optimum problem that other algorithms cannot avoid [Wang and Sun (2015)].
SVM has already been used to predict complex time series. Wen et al. [Wen and Xiao (2014)] used singular spectrum analysis (SSA) to decompose the stock price into trend, market fluctuation, and noise terms with different economic features over different time horizons, and then introduced these features into the SVM to make price predictions. The results show that EEMD-SVM and SSA-SVM, which incorporate price features into the SVM, perform better than an SVM without these features, and that SSA-SVM gives the best prediction. Xie [Xie (2011)] optimized the selection of the model parameters through Particle Swarm Optimization (PSO) and applied the support vector machine to a stock price forecasting model to predict the closing price on the third day. The experimental results show that the stock price model based on the PSO-optimized support vector machine can accurately predict the closing price of the stock on the third day, so the method has high practical value.
The above research shows that SVM handles nonlinear characteristics well, so the SVM model is chosen in this paper to predict the nonlinear part of the garlic price.
A large body of theory and practice has shown that a single model cannot simultaneously capture both the linear and the nonlinear patterns of a time series [Wang, Zhang and Qin (2017)]. Therefore, following the idea of combination forecasting, this paper establishes a garlic price forecasting model based on the combination of ARIMA and SVM.
2.1.1 ARIMA model
The ARIMA (autoregressive integrated moving average) model, also known as the differential autoregressive moving average model, is a time series prediction method proposed by Box and Jenkins in the 1970s. It usually uses the past values of a time series to predict future values. Starting from the observed time series, the data characteristics are first analyzed and a black box is then selected. If the black box can convert the observed time series into a white noise sequence, that is, a series of mutually uncorrelated random numbers, then the black box is correct, and it is the ARIMA model to be selected.
The ARIMA model fits a difference-stationary sequence and is in effect a combination of the differencing operation and an ARMA model. Nonseasonal time series can be modeled with the ARIMA(p, d, q) model, which makes the sequence stationary by an appropriate d-order difference operation (d is an integer). Once the series is stationary, the ARIMA(p, d, q) model reduces to an ARMA(p, q) model. For time series with obvious seasonal factors, the ARIMA model extracts the seasonal information by differencing with a step equal to the period, so that the time series becomes stationary and its residual sequence is also stationary. The general representation of the ARIMA(p, d, q) model is:
$$\Phi(B)\,(1 - B)^{d}\,x_t = \theta(B)\,\varepsilon_t$$
In the formula, $\Phi(B)$ is the p-order AR (autoregressive) polynomial, $\theta(B)$ is the q-order MA (moving average) polynomial, $B$ is the backshift operator, $x_t$ is the observed series, and $\varepsilon_t$ is a white noise sequence.
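The way an ARIMA(p, d, q) model reduces to an ARMA model on the differenced series can be illustrated with the simplest case, ARIMA(1, 1, 0). The following Python sketch is illustrative only: the function name `forecast_arima_110` and the coefficient `phi` are assumptions made for this example, and `phi` is taken as given rather than estimated from data.

```python
def forecast_arima_110(series, phi, steps):
    """Forecast an ARIMA(1, 1, 0) process with a known AR coefficient phi.

    The first difference follows AR(1): diff[t] = phi * diff[t-1] + noise.
    Future noise terms are replaced by their expected value, zero.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations to difference")
    last_level = series[-1]
    last_diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(steps):
        last_diff = phi * last_diff           # AR(1) step on the differenced scale
        last_level = last_level + last_diff   # integrate back to the original scale
        forecasts.append(last_level)
    return forecasts


print(forecast_arima_110([1, 2, 4], phi=0.5, steps=2))
```

With `phi` between 0 and 1, each forecast step adds a shrinking increment, so the forecast levels off rather than continuing the last change indefinitely.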
2.1.2 SVM model
The support vector machine (SVM) is, in layman's terms, a two-class classification model. In some applications, to reduce redundant information and extract the most distinctive features, ROI and PCA operations are performed on the features learned by a convolutional or pooling layer, and the extracted features are fed into an SVM classifier. The basic model is a linear classifier with the maximum margin in the feature space. Its learning strategy is margin maximization, which can eventually be transformed into a convex quadratic programming problem [Dhas and Kumanan (2016)]. SVM is a supervised machine learning method that mainly studies learning from small samples. It is divided into support vector regression (SVR) and support vector classification (SVC); SVR is mostly used for numerical prediction. Based on statistical learning theory, the SVM takes structural risk minimization as its principle and constructs the optimal linear separating hyperplane in a high-dimensional feature space, which gives it good generalization ability on nonlinear, small-sample, high-dimensional pattern recognition problems. It is widely used in statistical classification, regression estimation, probability density function estimation and other fields.
The support vector machine uses a nonlinear mapping to transform samples that are linearly inseparable in the low-dimensional input space into a high-dimensional feature space in which they become linearly separable, so that nonlinear features can be analyzed linearly. Its linear regression equation is:
$$f(x) = w \cdot \varphi(x) + b$$
where $f(x)$ is the predicted value, $w$ is the weight vector, $\varphi(x)$ is the mapping function into the high-dimensional feature space, and $b$ is the adjustable bias term.
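The idea of a function that is linear in the feature space but nonlinear in the input can be shown with a toy example. The sketch below only evaluates a regression function of the form f(x) = w · φ(x) + b; it does not train an SVM. The feature map `phi`, the weights and the bias are hypothetical values chosen by hand, whereas a real SVM usually works with a kernel and never computes the map explicitly.

```python
def phi(x):
    """Hypothetical feature map: lift a scalar x into the pair (x, x^2)."""
    return (x, x * x)


def predict(x, w, b):
    """Evaluate f(x) = w . phi(x) + b, which is linear in the feature space."""
    return sum(w_i * f_i for w_i, f_i in zip(w, phi(x))) + b


# With w = (0, 1) and b = 0 the linear model in feature space is the parabola x^2.
print(predict(3, w=(0.0, 1.0), b=0.0))
```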
2.1.3 ARIMA-SVM hybrid model
First, the garlic price data are forecast with the ARIMA model. The ARIMA prediction contains the linear characteristics of the data, while the nonlinear characteristics remain in the ARIMA prediction error. The SVM model is then used to predict the ARIMA prediction error, so that the nonlinear features are captured by the SVM prediction. Finally, the predicted value of the combined forecasting model is obtained by adding the ARIMA prediction and the SVM prediction. The specific steps are as follows:
Step 1: Consider a time series L consisting of two parts, a linear part a and a nonlinear part b. The time series L can be expressed as $L_t = a_t + b_t$.
Step 2: Predict the linear part a with the ARIMA model. The predicted value is $\hat{a}_t$, and the residual between the true value and the predicted value is $e_t = L_t - \hat{a}_t$.
Step 3: Use the SVM model to predict the residual e. The prediction result is $\hat{e}_t$.
Step 4: Add the predicted values of the two models to obtain the final predicted value of the combined model, $\hat{L}_t = \hat{a}_t + \hat{e}_t$.
The schematic diagram of the ARIMA-SVM combination prediction model is shown in Fig. 1:
Figure 1: The principle of ARIMA and SVM combination model
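The bookkeeping of the four steps can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the helper names `residuals` and `combine` are assumptions, and the linear forecasts and residual forecasts are passed in as plain lists, standing in for the outputs of an ARIMA model and an SVM model.

```python
def residuals(actual, fitted):
    """Step 2: residual = true value minus the linear model's fitted value."""
    if len(actual) != len(fitted):
        raise ValueError("actual and fitted must have the same length")
    return [y - f for y, f in zip(actual, fitted)]


def combine(linear_forecast, residual_forecast):
    """Step 4: final forecast = linear forecast + forecast of the residual."""
    if len(linear_forecast) != len(residual_forecast):
        raise ValueError("both forecasts must cover the same horizon")
    return [a + e for a, e in zip(linear_forecast, residual_forecast)]


# Made-up numbers, used only to show the flow of data between the steps.
e = residuals(actual=[10, 12, 9], fitted=[9, 12, 10])
final = combine(linear_forecast=[5.0, 6.0], residual_forecast=[0.5, -1.0])
print(e, final)
```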
Shandong Province is the main garlic producing area in China. It is the distribution center for garlic across the country, and its price is the bellwether of garlic prices nationwide. Garlic produced in Shandong holds over 70% of the domestic market [Li, Qin and Zhou (2017)]. The Department of Agriculture of Shandong Province, in conjunction with Shandong Agricultural University, Jin Xiang County Government and neighboring companies, jointly established a big data platform for the garlic industry chain. The Department of Agriculture of Shandong Province provided data on the average monthly wholesale price of garlic in the province from 2010 to 2017 for analysis by the Key Lab of Smart Agriculture at Shandong Agricultural University. The data come from the garlic industry chain big data platform.
An ARIMA model is established on the monthly average price data from 2010 to June 2017 to predict the monthly average price in the second half of 2017. The residual sequence obtained from ARIMA is analyzed by SVM, and the residual of the monthly average price in the second half of 2017 is predicted. The two predicted values are added to obtain the predicted value of the combined ARIMA and SVM model. The actual monthly average prices in the second half of 2017 are used to verify the effectiveness of the combined model. Finally, the combined ARIMA and SVM model is used to predict the average monthly price in the first half of 2018.
In order to better predict the price of garlic, seasonal decomposition of the garlic price data is first used to understand its changing trend.
Figure 2: Seasonal decomposition of monthly average price of garlic
The decompose function of R is used to obtain the seasonal term, the trend term, and the random term. The trend and seasonal terms show that the price of garlic in Shandong Province first rose, then fell, and then rose again, with a cycle of one year. The price of garlic is generally highest in February and March (the price rises before and after the Spring Festival) and lowest in May and June (new garlic comes onto the market and the price drops). In the second half of 2010, the price of garlic continued to rise as a large amount of hot money poured into the garlic market to push up the price. In the second half of 2011 the price fell off a cliff, because the high price of 2010 had led garlic farmers to expand planting on a large scale. In 2016, as the temperature continued to decline, garlic crops were affected on a large scale, and the garlic price continued to rise. It can be seen that the price of garlic is greatly affected by factors such as planting area, climate, and market speculation [Shi (2017)].
Although garlic price fluctuations are affected by many factors, the most fundamental one is the supply and demand relationship. Planting area, climate, pests and diseases and other factors affect output, and output in turn affects the supply and demand relationship. When supply exceeds demand, the price falls; when demand exceeds supply, the price rises. Speculation in the garlic market also works by influencing supply and demand through warehousing. Speculators stock up on garlic, causing demand to exceed supply on the market, and sell after the price increases.
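For readers without R at hand, the classical additive decomposition into trend, seasonal and random terms can be sketched in plain Python. This is a simplified illustration in the spirit of R's decompose function, not a reimplementation of it: the function name `decompose_additive` is an assumption, and the input is a synthetic series rather than the garlic price data.

```python
def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + remainder."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Centered moving average: the two end points get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period
    # Average the detrended values for each position within the cycle.
    sums, counts = [0.0] * period, [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += series[t] - trend[t]
            counts[t % period] += 1
    figure = [s / c for s, c in zip(sums, counts)]
    mean = sum(figure) / period
    figure = [f - mean for f in figure]  # force the seasonal term to sum to zero
    seasonal = [figure[t % period] for t in range(n)]
    remainder = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder


# Synthetic example: a linear trend plus a repeating pattern of period 4.
pattern = [1, -1, 2, -2]
series = [0.5 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, remainder = decompose_additive(series, period=4)
print(seasonal[:4])
```

For monthly prices with a one-year cycle, as described in the text, `period` would be 12. The trend is undefined (`None`) for the first and last half period, because the centered window does not fit there.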
The first step in establishing an ARIMA model is to test the stationarity of the time series data. R is used to draw a line chart of the time series, from which the stationarity of the sequence is judged. A non-stationary time series is log-transformed or differenced, and the stationarity of the processed sequence is judged again. This process is repeated until the sequence becomes stationary. The number of differences taken at this point is the order d in the ARIMA(p, d, q) model. Once the series is stationary, the ARIMA(p, d, q) model reduces to the ARMA(p, q) model.
Figure 3: Monthly average price of garlic
As can be seen from the time series line graph, garlic prices show a periodic trend, so the sequence is not stationary. It must first be made stationary by differencing it once.
Figure 4: Time series of monthly average price difference of garlic
After first-order differencing, the observations of the original time series fluctuate around zero, so it can be roughly estimated that the differenced monthly average price series of garlic is stationary. The ADF unit root test is then performed on the differenced sequence using the corresponding function in an R extension package. The resulting p-value is less than 0.05, so the sequence after the first difference can be judged to be stationary.
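The log transform and the differencing operation described above are simple to express in code. The following Python sketch is illustrative; the function names `difference` and `log_transform` are assumptions, and the `lag` parameter covers both ordinary differencing (`lag=1`) and differencing with a period step.

```python
import math


def log_transform(series):
    """Take natural logs; requires strictly positive values such as prices."""
    return [math.log(x) for x in series]


def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is `lag` items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


prices = [1, 3, 6, 10]
first = difference(prices)    # first-order difference (d = 1)
second = difference(first)    # differencing again gives d = 2
print(first, second)
```

Each call shortens the series, so the order d should be kept as small as the stationarity check allows.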
Once the sequence is stationary, the model that suits it is determined from the autocorrelation coefficient and the partial autocorrelation coefficient. If the autocorrelation tails off and the partial autocorrelation cuts off, the sequence suits an AR model. If the autocorrelation cuts off and the partial autocorrelation tails off, it suits an MA model. If both the autocorrelation and the partial autocorrelation tail off, it suits an ARMA model.
Figure 5: Autocorrelation of the sequence after first-order differencing
Figure 6: Partial autocorrelation of the sequence after first-order differencing
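The two quantities plotted in Fig. 5 and Fig. 6 can be computed directly. The sketch below is a plain Python illustration, with function names chosen for this example: `acf` computes the sample autocorrelation, and `pacf_from_acf` derives the partial autocorrelation from it with the Durbin-Levinson recursion. It is not the routine used to draw the figures.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


def pacf_from_acf(r):
    """Partial autocorrelation for lags 1..len(r)-1 via Durbin-Levinson.

    r[0] must be 1 and r[k] the autocorrelation at lag k.
    """
    pacf, prev = [], []  # prev holds the AR coefficients of order k - 1
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# An autocorrelation that decays geometrically (tails off) has a partial
# autocorrelation that cuts off after lag 1, the signature of an AR(1) model.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))
```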
From the autocorrelation graph of the differenced sequence in Fig. 5 and the partial autocorrelation graph of the differenced sequence in Fig. 6, it can be seen that the sequence after the first-order difference suits the ARMA model, that is, the original sequence suits the ARIMA model. The automatic order selection function auto.arima() in the R language is used to determine the model order. auto.arima() is usually applied to the raw data; here the order is determined on the differenced data, so the differences already taken must be added to the difference order of the final model. Finally, the ARIMA(1, 1, 0)(1, 1, 0) model was fitted. After the model passes the white noise test, the forecast function is used to predict the price trend for the next 6 months.
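The text does not name the white noise test that was applied to the model residuals. One common choice is the Ljung-Box test, whose statistic is sketched below in plain Python as an assumption about what such a check could look like; the function name `ljung_box_q` is hypothetical. Under the white noise hypothesis the statistic is compared against a chi-square distribution, and a small value gives no evidence of remaining autocorrelation.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at lag k.
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# A strictly alternating sequence is strongly autocorrelated, so Q is large.
print(ljung_box_q([1, -1, 1, -1, 1, -1, 1, -1], max_lag=1))
```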
Figure 7: Monthly average price forecast in the second half of 2017
Fig. 7 shows the monthly average price forecast for the second half of 2017 obtained with the forecast function. The blue line is the ARIMA forecast of the monthly average price for the second half of 2017, the dotted line is the monthly average price from 2010 to June 2017 as fitted by the ARIMA model, and the solid black line is the true monthly average price from 2010 to June 2017. Finally, subtracting the monthly average price from 2010 to June 2017 fitted by the ARIMA model from the true monthly average price for the same period gives the residual sequence between the true values and the predicted values of the ARIMA model.
3.3.1 Convert the residual sequence into a matrix
From the residual sequence obtained above, a window length (the number of consecutive residuals used as input) is selected. The residuals in the window are taken as the input of the SVM model, and the residual that follows them is taken as the output of the SVM. The window is then moved forward one step at a time over the whole sequence. Because the data volume is not large, this paper directly selects 4 as the window length.
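The conversion of the residual sequence into input and output pairs can be sketched as a sliding window. The Python below is a minimal illustration under the reading that each group of 4 consecutive residuals is used to predict the next one; the function name `make_lagged_samples` is an assumption.

```python
def make_lagged_samples(residuals, window=4):
    """Turn a sequence into (inputs, targets) pairs using a sliding window.

    Each input row holds `window` consecutive residuals and the matching
    target is the residual that immediately follows them.
    """
    inputs, targets = [], []
    for t in range(window, len(residuals)):
        inputs.append(residuals[t - window:t])
        targets.append(residuals[t])
    return inputs, targets


X, y = make_lagged_samples([1, 2, 3, 4, 5, 6], window=4)
print(X, y)
```

A sequence of n residuals yields n minus `window` samples, so a longer window leaves fewer rows for training.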
3.3.2 Select training set and test set
Before the SVM model can make predictions, a training set must be selected to train it, and a test set is then used for cross-validation to ensure the accuracy of the SVM model. The first 72 residuals were selected as the training set and the last 18 as the test set.
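Because the residuals form a time series, the split keeps them in chronological order: the earliest observations are used for training and the most recent ones for testing, with no shuffling. A minimal Python sketch follows, with `chronological_split` as a hypothetical helper name and a placeholder list standing in for the 90 residuals.

```python
def chronological_split(samples, n_train):
    """Split an ordered sequence into a leading training part and a trailing test part."""
    if not 0 < n_train < len(samples):
        raise ValueError("n_train must leave at least one sample on each side")
    return samples[:n_train], samples[n_train:]


residuals = list(range(90))  # placeholder values standing in for 90 residuals
train, test = chronological_split(residuals, n_train=72)
print(len(train), len(test))
```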
3.3.3 SVM test
After the SVM is trained and cross-validated, predictions can be made. The cross-validation results show that, apart from a few predicted residuals that differ considerably from the real residuals, the rest are very close, so the SVM model has been trained successfully. Finally, the trained SVM model is used to predict the residuals of the monthly average price for the second half of 2017.
The ARIMA and SVM combined forecasting model adds the ARIMA predictive value and the SVM predictive value, and the final result is the predicted value of the combined model. The combined model prediction results are shown in Tab. 1.
Table 1: The forecast result in the second half of 2017 (unit: Yuan/kg)
From Tab. 1, we can see that, except for May 2017, where the prediction error of the combined model is slightly larger than that of the ARIMA model, the predictions of the combined model are better than those of the single ARIMA model. To better evaluate the prediction performance of the single models and the combined model, the SVM model alone is also used to predict the prices. The Root Mean Square Error (RMSE) is then used as the evaluation index of the models. RMSE is suitable for comparing different models on the same data set, and the results are shown in Tab. 2. The results show that the combination of ARIMA and SVM unites the advantages of both and improves the accuracy of the prediction. To study the combined ARIMA and SVM model further, the average monthly price of garlic in Shandong Province in the first half of 2018 is forecast on the basis of the existing data. The forecast allows further validation of the model and provides a reference for the Department of Agriculture of Shandong Province. The prediction results are shown in Tab. 3.
Table 2: RMSE of ARIMA, SVM and ARIMA-SVM
Table 3: Hybrid model forecast result in the first half of 2018
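The RMSE used as the evaluation index is the square root of the mean squared difference between the actual and the predicted values. A short Python sketch is given here for reference; the function name `rmse` and the example numbers are illustrative and are not taken from the tables.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```

Because RMSE is expressed in the units of the data (here Yuan/kg), it is meaningful only when the models are compared on the same data set, as noted in the text.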
Garlic is a small agricultural product whose price fluctuations are hard to predict. The only garlic price forecasting method currently in use is the traditional ARIMA model, which forecasts the trend well but often with low accuracy. With the rise of big data, time series prediction models based on machine learning are gradually showing their advantages on nonlinear and non-stationary time series. Following the idea of the hybrid model, this paper establishes an ARIMA-SVM hybrid model, based on big data, for short-term garlic price prediction. Experiments on garlic price data from a representative region, comparing the ARIMA model, the SVM model and the hybrid ARIMA-SVM model, show that: (1) garlic prices are affected by many factors, of which the supply and demand relationship is the most important; (2) the SVM model handles the nonlinear relationship in garlic prices well; (3) the ARIMA-SVM hybrid model predicts garlic prices more accurately than either the single ARIMA model or the single SVM model, and can be used as an effective method for predicting short-term garlic prices.
Acknowledgements: This work was financially supported by the following projects: (1) Shandong independent innovation and achievements transformation project (2014ZZCX07106). (2) The research project “Intelligent agricultural system research and development of facility vegetable industry chain” of Shandong Province Major Agricultural Technological Innovation Project in 2017. (3) Monitoring and statistics project of agricultural and rural resources of the Ministry of Agriculture.
Computers, Materials & Continua, 2018, Issue 11