YIN Xiao, WANG Ming-yu
Abstract: Under China's modern education system, the annual scholarship evaluation matters greatly to many college students. This paper applies the C4.5 decision tree classification algorithm, an improvement on the ID3 algorithm, and constructs a data set for the scholarship evaluation system by analyzing the attributes recorded in scholarship evaluation information. By analyzing moral education, intellectual education, and culture and physical education (PE), it also identifies several factors that play a significant role in the development of college students.
Key words: data mining; scholarship evaluation system; decision tree algorithm; C4.5 algorithm
CLC number: TP311  Document code: A  Article ID: 1009-3044(2015)09-0011-03
1 Introduction
To encourage college students to be diligent, study hard, and develop in all aspects, the Ministry of Finance and the Ministry of Education jointly established the interim measures for the management of national scholarships at regular four-year colleges and higher vocational schools. As colleges carry out two-way reforms of teaching and education one after another, it is important to keep improving the scholarship evaluation system and to guide college students better, so as to establish a fairer incentive system.
At present, as student information management systems keep improving, the scholarship evaluation system has been optimized to a certain degree, but the actual situation is less satisfactory than it appears. Human factors often interfere unduly with scholarship evaluation, which shows that the evaluation system is still immature. More importantly, the original intention of the scholarship system is to commend students who study hard and diligently or who achieve good results in competitions, yet the current evaluation system does not fully achieve this. What we can do is keep refining the evaluation system so that all college students, and society at large, can perceive education as fair.
On the other hand, after decades of effort, colleges and universities in China have caught up with the rest of the world in information management. By obtaining the essential information from the comprehensive assessment and using data mining methods to analyze the key factors that affect scholarship evaluation, we can offer practical suggestions for improving the scholarship evaluation system and promote its continued development.
2 Data Mining and Data Classification Overview
Data mining is the process of discovering information hidden in large amounts of data in databases by means of specific algorithms. Data classification is an important method of data analysis: it builds a classification model and uses it to predict future data trends. Many classification methods had already been proposed in other scientific fields before research on data mining began. Building on that research, data mining has inherited these methods and developed scalable classification techniques for big data processing that use parallel and distributed processing.
Data classification first establishes a classification model that describes a predetermined data set or set of concepts, then evaluates and verifies the model, and finally uses it to classify data tuples or objects whose class labels are unknown, as shown in Figure 1.
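The three phases described here (build a model, evaluate it, classify unlabelled tuples) can be illustrated with a minimal Python sketch. To keep it short, the model is a one-attribute majority-vote rule table rather than C4.5, and the records, the `train` and `accuracy` helpers, and the field names are made up for illustration only.

```python
from collections import Counter, defaultdict

def train(rows, attribute, target):
    """Phase 1: learn a rule table mapping each attribute value to its majority class."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(r[target] for r in rows).most_common(1)[0][0]
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda row: rules.get(row[attribute], default)

def accuracy(model, rows, target):
    """Phase 2: evaluate the model on labelled rows that were not used for training."""
    return sum(model(r) == r[target] for r in rows) / len(rows)

# Made-up records: IE is a grade (A to D), SA is the scholarship label (Y or N).
training = [
    {"IE": "A", "SA": "Y"},
    {"IE": "A", "SA": "Y"},
    {"IE": "C", "SA": "N"},
    {"IE": "D", "SA": "N"},
]
held_out = [{"IE": "A", "SA": "Y"}, {"IE": "D", "SA": "N"}]

model = train(training, "IE", "SA")
print(accuracy(model, held_out, "SA"))
# Phase 3: classify a tuple whose class label is unknown.
print(model({"IE": "C"}))
```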
At present, commonly used classification algorithms include decision trees, artificial neural networks, Bayesian algorithms, linear regression, multiple regression, and so on. With these methods, good classification models can be built for different types of data sets. This paper uses C4.5, a decision tree algorithm, to perform statistics and analysis on comprehensive assessment data, with the aim of providing reliable data for the development and improvement of the scholarship evaluation system.
3 Attribute Selection and Processing of Data
The data in this paper come from the 2014 comprehensive evaluation list of X college at Hunan Agricultural University, and the analysis is based on these data. In the Hunan Agricultural University scholarship evaluation system, undergraduate scholarships include the national scholarship, the national encouragement scholarship, the major award for outstanding students, and the minor award for outstanding students. To ensure that the scholarship does not lose its value, both grades and the comprehensive assessment results are considered in scholarship evaluation.
The data show that in 2014, 63 of the 319 students of X college, Hunan Agricultural University, received a scholarship, which accounts for 19.75 percent. The comprehensive evaluation is divided into three parts: moral education, intellectual education, and culture and physical education. Scholarship evaluation is based mainly on the comprehensive assessment and the academic-year average. This paper analyzes the data on the basis of these three attributes and investigates the important factors related to scholarship evaluation.
The initial comprehensive evaluation data contain a large amount of redundant information and PRN code, as shown in Table 1.
After preprocessing steps such as data cleaning, data conversion, data integration, and data reduction are applied to the comprehensive assessment data, we obtain the preprocessed data table shown in Table 2.
ME, IE, CP, GPA, and SA in the table stand for moral education, intellectual education, culture and physical education, grade-point average, and scholarship assessment (the same below). Each attribute takes one of four values: A, B, C, or D. Y and N stand for receiving and not receiving a scholarship.
This part of the data is used as the training set, and the test set consists of the 2013 comprehensive assessment data of one class in X college.
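The data conversion step, which turns numeric scores into the four grade values, might look like the following Python sketch. The paper does not state the cut points, so the thresholds (90, 80, 70), the sample record, and the helper names `to_grade` and `convert` are purely illustrative assumptions.

```python
from bisect import bisect_right

# Hypothetical cut points, NOT taken from the paper:
# score >= 90 -> A, >= 80 -> B, >= 70 -> C, otherwise D.
CUTS = [70, 80, 90]
GRADES = "DCBA"

def to_grade(score):
    """Map a numeric score to one of the four grade values A, B, C, D."""
    return GRADES[bisect_right(CUTS, score)]

def convert(record):
    """Discretize every numeric field of a record; the field names are assumed."""
    return {name: to_grade(score) for name, score in record.items()}

# A made-up raw record for one student.
raw = {"ME": 92, "IE": 85, "CP": 71, "GPA": 66}
print(convert(raw))
```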
4 The Selection of Decision Tree Algorithm and Constructing Decision Tree
The decision tree algorithm is a classical classification algorithm. It analyzes the attribute values of the data to construct a mapping from attribute values to classes, and thereby builds a prediction model for the data set. Decision tree algorithms mainly include CART, ID3, and C4.5. This paper uses the C4.5 algorithm for data analysis.
C4.5 is a fairly reliable classification algorithm derived from ID3 as an improvement on it. C4.5 uses the information gain rate (also known as the gain ratio) as the criterion for selecting attributes.
First, suppose data set $I$ has $s$ samples in total and contains $n$ classes, and let $p_i$ ($i \in [1, n]$) denote the proportion of the $i$-th class in the data set. The information entropy of the sample data set is:
$$E(I) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{1}$$
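Formula (1) can be sketched in Python as follows. The function name `entropy` and the sample label lists are illustrative choices, not part of the paper.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# Two equally likely classes give the maximum entropy for a binary label: 1 bit.
print(entropy(["Y", "N", "Y", "N"]))
# A skewed distribution is purer, so its entropy is lower.
print(entropy(["Y", "N", "N", "N"]))
```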
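Second, partition the data set by the values of each attribute $A$ in the attribute set and compute the information entropy after the partition:
$$E_A(I) = -\sum_{j=1}^{v} \frac{|T_j|}{s} \sum_{i=1}^{n} p_{ij} \log_2 p_{ij} \tag{2}$$
In this formula, $v$ is the number of distinct values of attribute $A$, $T_j$ ($j \in [1, v]$) is the subset of samples that take the $j$-th value, $|T_j|$ is the size of that subset, $s$ is the total number of samples, and $p_{ij}$ is the proportion of the $i$-th class within $T_j$.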
Third, calculate the information entropy of each attribute with the formula above; the smaller the entropy, the higher the purity of the subsets into which the sample set is divided. Then calculate the information gain of each attribute $A$ in the attribute set:
$$Gain(A) = E(I) - E_A(I) \tag{3}$$
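A minimal Python sketch of formulas (2) and (3), under the reading that the entropy after the split is the size-weighted average of the entropies of the subsets produced by an attribute. The helper names and the four-record toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Entropy after splitting on an attribute: each subset weighted by its share of samples."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    s = len(labels)
    return sum(len(t) / s * entropy(t) for t in subsets.values())

def gain(values, labels):
    """Information gain: entropy before the split minus entropy after it."""
    return entropy(labels) - conditional_entropy(values, labels)

# Made-up toy data: the first attribute separates the classes perfectly, the second not at all.
sa = ["Y", "Y", "N", "N"]
informative = ["A", "A", "B", "B"]
uninformative = ["A", "B", "A", "B"]
print(gain(informative, sa))
print(gain(uninformative, sa))
```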
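Fourth, calculate the split information, which measures the breadth and uniformity with which the attribute splits the data:
$$IV(A) = -\sum_{j=1}^{v} \frac{|T_j|}{s} \log_2 \frac{|T_j|}{s} \tag{4}$$
Here $|T_j|/s$ is the proportion of the $s$ samples that fall into the $j$-th subset $T_j$ produced by splitting on $A$.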
Finally, obtain the information gain rate from the following formula:
$$IGR(A) = \frac{Gain(A)}{IV(A)} \tag{5}$$
Choose the attribute with the maximum information gain rate as the root node of the decision tree, and then split the child nodes recursively in the same way until all attributes have been used.
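The attribute selection step can be sketched in Python as below. This is only the root-selection part of C4.5 under the definitions in formulas (1) to (5), not a full tree builder, and the four records and the helper name `gain_ratio` are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(items):
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain rate: Gain(A) / IV(A)."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    s = len(labels)
    after_split = sum(len(t) / s * entropy(t) for t in subsets.values())
    gain = entropy(labels) - after_split
    # IV(A) is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    # An attribute with a single value cannot split the data; treat its rate as 0.
    return gain / split_info if split_info > 0 else 0.0

# Made-up records using the paper's attribute abbreviations.
rows = [
    {"ME": "A", "IE": "A", "SA": "Y"},
    {"ME": "B", "IE": "A", "SA": "Y"},
    {"ME": "A", "IE": "B", "SA": "N"},
    {"ME": "B", "IE": "B", "SA": "N"},
]
labels = [r["SA"] for r in rows]
ratios = {a: gain_ratio([r[a] for r in rows], labels) for a in ("ME", "IE")}
root = max(ratios, key=ratios.get)
print(root, ratios)
```

In a full implementation the same selection would be repeated on each subset of records below the chosen node.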
After preprocessing, the data set contains 319 instances, of which 63 won a scholarship. Applying the rule above gives the total entropy of the comprehensive evaluation data set:
$$E(S) = -\frac{63}{319}\log_2\frac{63}{319} - \frac{256}{319}\log_2\frac{256}{319} = 0.7169$$
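The total entropy can be checked numerically from the two counts given in the text. A short Python sketch, with variable names chosen for illustration:

```python
from math import log2

total = 319      # instances after preprocessing
winners = 63     # students who won a scholarship

p_yes = winners / total
p_no = (total - winners) / total

# Binary entropy of the scholarship label, formula (1) with n = 2.
e_s = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(e_s, 4))  # 0.7169
```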
Then calculate the information entropy, the information gain, and the information gain rate of moral education, intellectual education, and culture and physical education respectively.
The attribute values of moral education fall into four categories, A, B, C, and D, containing 32, 77, 97, and 103 students respectively.
Applying the formulas above gives the information entropy, the information gain, the split information, and the information gain rate of moral education:
EM(S) = 0.5884
Gain(M) = 0.1285
IV(M) = 1.8804
IGR(M) = Gain(M)/IV(M)=0.0683
Similarly, the information entropy, the information gain, the split information, and the information gain rate of intellectual education are:
EI(S) = 0.5564
Gain(I) = 0.1605
IV(I) = 1.969
IGR(I) = 0.0815
The information entropy, the information gain, the split information, and the information gain rate of culture and physical education are:
EP(S) = 0.6512
Gain(P) = 0.0657
IV(P) = 1.9688
IGR(P)= 0.0334
Intellectual education has the maximum information gain rate, so it is chosen as the root node of the decision tree. Repeating the steps above for the remaining attributes yields the decision tree shown in Figure 2.
GS and NS stand for getting a scholarship and no scholarship respectively.
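The choice of root node can be re-derived from the entropy and split information values reported above. The Python sketch below uses only those reported figures; the dictionary layout and variable names are illustrative.

```python
# Values reported in the text for the 319-instance data set.
e_s = 0.7169
reported = {
    "ME": {"entropy": 0.5884, "iv": 1.8804},
    "IE": {"entropy": 0.5564, "iv": 1.969},
    "CP": {"entropy": 0.6512, "iv": 1.9688},
}

ratios = {}
for name, values in reported.items():
    gain = e_s - values["entropy"]        # formula (3)
    ratios[name] = gain / values["iv"]    # formula (5)

root = max(ratios, key=ratios.get)
print(root)  # IE
for name, ratio in ratios.items():
    print(name, round(ratio, 4))
```

The rounded ratios agree with the information gain rates listed for the three attributes, and intellectual education comes out highest.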
After the comprehensive assessment, related factors such as average scores are considered to determine whether a scholarship is awarded.
5 Conclusions and Recommendations
Analyzing the scholarship evaluation information with the decision tree algorithm yields the decision tree above and some important information about the development of contemporary college students:
(1) Scholarship winners generally have a good intellectual education evaluation in the comprehensive evaluation;
(2) An advantage in moral education alone in the comprehensive evaluation cannot ensure a scholarship;
(3) The culture and physical education evaluation seems to carry little weight, but quite a number of scholarship winners far surpass their classmates in it;
(4) The role of the comprehensive assessment results in scholarship evaluation depends to a certain degree on the grade-point average (GPA>=80);
(5) A comprehensive evaluation that emphasizes intellectual education often dampens college students' enthusiasm for developing in all aspects.
The analysis above shows that studying the scholarship evaluation system with the decision tree algorithm can reveal drawbacks of the current system, which helps optimize and improve the system and also plays a positive role in students' growth.