国产日韩欧美一区二区三区三州_亚洲少妇熟女av_久久久久亚洲av国产精品_波多野结衣网站一区二区_亚洲欧美色片在线91_国产亚洲精品精品国产优播av_日本一区二区三区波多野结衣 _久久国产av不卡

?

基于DBSCAN算法和多源數(shù)據(jù)的缺失公交到站數(shù)據(jù)修補(bǔ)

2019-12-23 07:19王成崔紫薇杜梓林高悅爾
計(jì)算機(jī)應(yīng)用 2019年11期

王成 崔紫薇 杜梓林 高悅爾

摘 要:針對(duì)缺失公交到站信息修補(bǔ)方法考慮因素較少、準(zhǔn)確度低、魯棒性差的現(xiàn)狀,提出了基于DBSCAN算法和多源數(shù)據(jù)的缺失公交到站數(shù)據(jù)修補(bǔ)方法。該方法使用公交全球定位系統(tǒng)(GPS)、公交集成電路卡(IC)等多源數(shù)據(jù)進(jìn)行缺失到站信息的修補(bǔ)。對(duì)于缺失的到站名稱、到站經(jīng)緯度數(shù)據(jù),用已有完整到站數(shù)據(jù)和靜態(tài)線路信息關(guān)聯(lián)分析進(jìn)行修補(bǔ)。對(duì)于缺失的到站時(shí)刻數(shù)據(jù),則按以下步驟進(jìn)行修補(bǔ):首先,對(duì)每一個(gè)缺失數(shù)據(jù)站點(diǎn)與其最近的未缺失數(shù)據(jù)站點(diǎn),將這兩站點(diǎn)間歷史完整到站數(shù)據(jù)的行程時(shí)間和班次時(shí)序進(jìn)行基于DBSCAN算法的聚類;其次,判斷研究班次的兩個(gè)相鄰的數(shù)據(jù)完整的班次所屬簇是否為同一個(gè)簇,若為同一個(gè)簇則不作改變,否則將兩個(gè)簇合并;最后,將簇中點(diǎn)對(duì)應(yīng)最大行程時(shí)間作為缺失行程時(shí)間判斷是否有乘客在該站點(diǎn)上車刷卡,若有則由乘客開(kāi)始刷卡時(shí)刻推算到站時(shí)刻,若無(wú)則將簇中點(diǎn)對(duì)應(yīng)最大、最小行程時(shí)間的均值作為缺失行程時(shí)間推算到站時(shí)刻。以廈門市公交到站數(shù)據(jù)為例,在缺失到站名稱、經(jīng)緯度修補(bǔ)中,基于GPS數(shù)據(jù)聚類的方法、基于極大概率估計(jì)的方法和所提方法皆可進(jìn)行100%的修補(bǔ);在缺失到站時(shí)刻修補(bǔ)中,所提方法的平均相對(duì)誤差比兩種對(duì)比方法分別低0.030-1%和0.000-4%,相關(guān)系數(shù)比對(duì)比方法分別高0.005和0.007-5。實(shí)驗(yàn)結(jié)果表明,所提算法在缺失公交到站數(shù)據(jù)修補(bǔ)中能有效提高修補(bǔ)的準(zhǔn)確度,降低缺失站點(diǎn)個(gè)數(shù)變化對(duì)于準(zhǔn)確度的影響。

關(guān)鍵詞:缺失到站數(shù)據(jù)修補(bǔ);DBSCAN算法;到站經(jīng)緯度;到站時(shí)刻;多源數(shù)據(jù)

中圖分類號(hào): TP311

文獻(xiàn)標(biāo)志碼:A

Repairing of missing bus arrival data based on DBSCAN algorithm and multisource data

WANG Cheng1*, CUI Ziwei1, DU Zilin1, GAO Yueer2

1.College of Computer Science and Technology, Huaqiao University, Xiamen Fujian 361021, China;

2.School of Architecture, Huaqiao University, Xiamen Fujian 361021, China

Abstract:

In order to solve the problem that the existing repair methods for missing bus arrival information have little factors considered, low accuracy and poor robustness, a method to repair missing bus arrival data based on DBSCAN (DensityBased Spatial Clustering of Applications with Noise) algorithm and multisource data was proposed. Bus GPS (Global Positioning System) data, IC (Integrated Circuit) card data and other source data were used to repair the missing arrival information. For the name, longitude and latitude data of the missing arrival station, the association analysis of complete arrival data and static line information were carried out to repair. For the missing arrival time data, the following steps were taken to repair. Firstly, for every missing data station and its nearest nonmissing data station, the travel time and schedule in the historical complete arrival data between the two stations were clustered based on DBSCAN algorithm. Secondly, whether the two adjacent runs of the studied bus with complete data belonged to the same cluster was judged, and if they belonged to the same cluster, th cluster would not change, otherwise the two clusters would be merged. Finally, the maximum travel time corresponding to the cluster midpoint was used as the missing travel time to determine whether there was a passenger swiping his card to board the bus at this station or not, if so, the arrival time was calculated from the time of swiping cards, and if not, the mean of the maximum and minimum travel time corresponding to the cluster midpoint was used as the missing travel time to calculate the arrival time. Taking Xiamen bus arrival data as examples, in the repair of name, longitude and latitude of the missing arrival station, the clustering method based on GPS data, the maximum probability estimation method and the proposed method can repair the data by 100.00%. In the repair of missing arrival time, the mean relative error of the proposed method is 0.030-1% and 0.000-4% lower than that of two comparison methods respectively, and the correlation coefficient of the proposed method is 0.005 and 0.007-5 higher than that of two comparison methods respectively. The simulation results show that the proposed method can effectively improve the accuracy of repair of missing bus arrival data, and reduce the impact of the number of missing stations on accuracy.

Key words:

bus missing arrival data repair; DensityBased Spatial Clustering of Applications with Noise (DBSCAN) algorithm; longitude and latitude of arrival station; arrival time; multisource data

0?引言

受城市建筑陰影效應(yīng)等影響,公交到站數(shù)據(jù)會(huì)出現(xiàn)缺失現(xiàn)象,嚴(yán)重影響公交車輛運(yùn)營(yíng)特征的分析、乘客上車站點(diǎn)識(shí)別等?,F(xiàn)有研究表明,3.75%的公交到站數(shù)據(jù)缺失即可造成超過(guò)25%的乘客上車站點(diǎn)匹配失敗[1],因此公交缺失到站數(shù)據(jù)修補(bǔ)是必要的,本文對(duì)于公交到站數(shù)據(jù)的到站名稱、經(jīng)緯度和時(shí)刻數(shù)據(jù)進(jìn)行修補(bǔ)。

公交到站數(shù)據(jù)的修補(bǔ)可根據(jù)使用數(shù)據(jù)類型分為僅使用公交全球定位系統(tǒng)(Global Positioning System, GPS)數(shù)據(jù)和使用多源數(shù)據(jù)兩大類。僅使用公交GPS數(shù)據(jù)方法分為基于統(tǒng)計(jì)分析的填充和基于數(shù)據(jù)挖掘的填充?;诮y(tǒng)計(jì)分析的填充通過(guò)熟悉數(shù)據(jù)集的研究者構(gòu)建擬合模型、估計(jì)統(tǒng)計(jì)參數(shù)實(shí)現(xiàn)對(duì)缺失數(shù)據(jù)的填充,常見(jiàn)方法有線性擬合填充法[2]、均值填充法[3]、概率估算法[4]等,但只用于小樣本數(shù)據(jù)集中,難以適應(yīng)交通大數(shù)據(jù)分析的需求。基于數(shù)據(jù)挖掘的填充可充分利用海量數(shù)據(jù)的特點(diǎn)、不受研究者對(duì)數(shù)據(jù)集熟悉程度的影響進(jìn)行修補(bǔ)缺失值,常見(jiàn)方法有k最 近 鄰 填 充[5-6]、雙向循環(huán)神經(jīng)網(wǎng)絡(luò)[7]、貝葉斯網(wǎng)絡(luò)[8]、聚類方法填充[9-14]等; 但僅使用公交GPS數(shù)據(jù)的方法所使用的數(shù)據(jù)少,獲得的信息不夠全面,因此修補(bǔ)準(zhǔn)確度低?;诙嘣磾?shù)據(jù)的修補(bǔ)可分為間接修補(bǔ)和直接修補(bǔ)兩種:間接修補(bǔ)首先通過(guò)對(duì)伴隨的集成電路(Integrated Circuit, IC)卡刷卡數(shù)據(jù)中缺失信息進(jìn)行修補(bǔ),以此推算缺失的GPS到站數(shù)據(jù),但沒(méi)有乘客上(下)車刷卡的站點(diǎn)無(wú)法進(jìn)行修補(bǔ),常見(jiàn)的方法有基于行程的網(wǎng)絡(luò)多模態(tài)特性研究方法[15]、基于刷卡數(shù)據(jù)分離和序列約束站點(diǎn)匹配的方法[16]、基于車輛運(yùn)營(yíng)時(shí)刻表方法[17]等;直接修補(bǔ)可從根本上對(duì)于缺失到站數(shù)據(jù)進(jìn)行修補(bǔ),對(duì)于無(wú)乘客上下車的缺失到站數(shù)據(jù)也可進(jìn)行修補(bǔ),修補(bǔ)范圍更廣,可以更好地進(jìn)行公交運(yùn)營(yíng)特征的挖掘,常見(jiàn)的方法有基于車輛運(yùn)營(yíng)時(shí)刻表、基于極大概率估計(jì)的方法[18]等,但這些方法受路段狀況等隨機(jī)因素的影響過(guò)大、沒(méi)有剔除噪聲數(shù)據(jù)的影響,因此缺失值修補(bǔ)的準(zhǔn)確度不高。

本文的主要貢獻(xiàn)如下:

1)選擇獲得信息更全面、修補(bǔ)數(shù)據(jù)范圍更廣的多源數(shù)據(jù)進(jìn)行缺失到站數(shù)據(jù)的直接修補(bǔ),使用的數(shù)據(jù)為GPS數(shù)據(jù)、IC卡數(shù)據(jù)和靜態(tài)線路信息數(shù)據(jù);

2)對(duì)于現(xiàn)有方法噪聲數(shù)據(jù)影響較大的情況,依賴同類數(shù)據(jù)對(duì)缺失數(shù)據(jù)進(jìn)行修補(bǔ)以此降低噪聲影響。獲取同類數(shù)據(jù)時(shí),采用聚類的方法,并且結(jié)合了站點(diǎn)間行程時(shí)間和完整到站班次時(shí)序,以此將后者反映的缺失數(shù)據(jù)班次所在時(shí)段內(nèi)路況、天氣等影響因素信息加入考慮,使修補(bǔ)值更接近實(shí)際;

3)同類數(shù)據(jù)常見(jiàn)的聚類方法有基于密度聚類方法[19]和基于劃分聚類方法(如Kmeans聚類算法)[20]等,根據(jù)本文數(shù)據(jù)的特點(diǎn):聚集數(shù)量不明確和噪聲點(diǎn)影響較大選擇了密度聚類中的DBSCAN(DensityBased Spatial Clustering of Applications with Noise)算法對(duì)于數(shù)據(jù)進(jìn)行聚類;

4)對(duì)于現(xiàn)有研究將站點(diǎn)間行程時(shí)間取值范圍設(shè)置為相同數(shù)值的情況,本文對(duì)任意兩站點(diǎn)間的行程時(shí)間和完整到站時(shí)序進(jìn)行聚類,使得每一對(duì)站點(diǎn)間的行程時(shí)間取值范圍具有個(gè)性化,以此獲得更準(zhǔn)確的同類數(shù)據(jù)和修補(bǔ)值,并使得修補(bǔ)數(shù)據(jù)準(zhǔn)確度受缺失站點(diǎn)個(gè)數(shù)變化的影響程度低、普適性高。

1?問(wèn)題描述及分析

1.1?公交到站數(shù)據(jù)介紹

某班次完整公交GPS到站數(shù)據(jù)包括到站編號(hào)(本文中用到站編號(hào)代替到站名稱說(shuō)明問(wèn)題)、車牌號(hào)、線路號(hào)、站點(diǎn)(東)經(jīng)度、站點(diǎn)(北)緯度、到站日期、班次和到站時(shí)刻(如表1所示);若某班次缺失1號(hào)站點(diǎn)和2號(hào)站點(diǎn)到站數(shù)據(jù),則其原始數(shù)據(jù)僅包含其余站點(diǎn)的到站數(shù)據(jù)(如表1點(diǎn)線以下部分所示)。

1.2?刷卡數(shù)據(jù)及線路數(shù)據(jù)介紹

目前,大部分城市公交乘客只在上車刷卡、下車不刷卡,因此本文使用乘客上車IC刷卡數(shù)據(jù)進(jìn)行分析,則有某班次同一站點(diǎn)乘客上車IC刷卡數(shù)據(jù)包括卡號(hào)、刷卡日期、刷卡時(shí)刻、乘坐的線路號(hào)和乘坐車輛的車牌號(hào)(如表2所示)。

某線路某運(yùn)營(yíng)車輛靜態(tài)線路信息示例包括到站編號(hào)、車牌號(hào)、線路號(hào)、站點(diǎn)(東)經(jīng)度和站點(diǎn)(北)緯度(如表3所示)。

1.3?公交到站數(shù)據(jù)的缺失情況

設(shè)S1,S2,…,Sk,…,SK為公交某班次運(yùn)行中連續(xù)缺失到站GPS數(shù)據(jù)的站點(diǎn),Sbefore為S1之前到達(dá)的最后一個(gè)數(shù)據(jù)完整站點(diǎn),Safter為SK之后到達(dá)的第一個(gè)數(shù)據(jù)完整站點(diǎn)。如圖1所示,對(duì)于某班次,公交到站數(shù)據(jù)的缺失通常分為三種情況:

缺失前半部分?jǐn)?shù)據(jù)?即公交某班次運(yùn)行中連續(xù)經(jīng)過(guò)的站點(diǎn)為S1,S2,…,Sk,…,SK,Safter。

缺失中間部分?jǐn)?shù)據(jù)?又分為連續(xù)多個(gè)缺失和單個(gè)缺失,后者可視為前者的一種特例,即公交某班次運(yùn)行中連續(xù)經(jīng)過(guò)的站點(diǎn)為Sbefore,S1,S2,…,Sk,…,SK,Safter。

缺失后半部分?jǐn)?shù)據(jù)?即公交某班次運(yùn)行中連續(xù)經(jīng)過(guò)的站點(diǎn)為Sbefore,S1,S2,…,Sk,…,SK。

對(duì)于缺失中間部分?jǐn)?shù)據(jù),可由缺失數(shù)據(jù)站點(diǎn)前(后)最近的完整到站數(shù)據(jù)站點(diǎn)后推(前推)得到缺失站點(diǎn)相關(guān)數(shù)據(jù),因此可視為缺失前(后)半部分?jǐn)?shù)據(jù)的特例,本質(zhì)都為由第一條(最后一條)完整到站數(shù)據(jù)推得未知到站數(shù)據(jù)。因此,后面僅對(duì)連續(xù)缺失前半部分到站數(shù)據(jù)的情況進(jìn)行敘述。

2?缺失公交到站數(shù)據(jù)修補(bǔ)

2.1?缺失到站名稱、經(jīng)緯度修補(bǔ)

鑒于公交運(yùn)營(yíng)線路固定、??空军c(diǎn)固定的實(shí)際情況,可結(jié)合完整到站數(shù)據(jù)與靜態(tài)線路信息修補(bǔ)缺失到站名稱、經(jīng)緯度。三者之間進(jìn)行關(guān)聯(lián)分析修補(bǔ)的方法如圖2所示,對(duì)于某班次到站數(shù)據(jù)修補(bǔ)的具體流程為:根據(jù)完整到站數(shù)據(jù)和靜態(tài)線路信息得到缺失到站數(shù)據(jù)的站點(diǎn)編號(hào),再以此對(duì)于缺失站點(diǎn)編號(hào)的缺失名稱、經(jīng)緯度進(jìn)行修補(bǔ),直至本班次全部站點(diǎn)的到站名稱、經(jīng)緯度修補(bǔ)完畢(如表4所示),此時(shí),僅有到站時(shí)刻數(shù)據(jù)未進(jìn)行修補(bǔ)。

2.2?缺失到站時(shí)刻修補(bǔ)

在連續(xù)缺失前半部分到站數(shù)據(jù)的情況下,設(shè)到達(dá)站點(diǎn)Safter時(shí)刻為Tafter,則從站點(diǎn)Sk到Safter的行程時(shí)間為tk,after,取值范圍為Tk,after。設(shè)缺失到站數(shù)據(jù)班次所在線路同一行駛方向同一天內(nèi),有N班次同時(shí)在缺失到站數(shù)據(jù)站點(diǎn)Sk與站點(diǎn)Safter有完整到站數(shù)據(jù),將N班次按照到達(dá)站點(diǎn)Safter的先后排序即有集合{l1,l2,…,ln,…,lN},對(duì)應(yīng)到達(dá)站點(diǎn)Safter的時(shí)刻為{Tl1after,Tl2after,…,Tlnafter,…,TlNafter},并且對(duì)應(yīng)從Sk到Safter的行程時(shí)間為{tl1k,after,tl2k,after,…,tlnk,after,…,tlNk,after}。某班次在第k個(gè)站點(diǎn)Sk的缺失到站時(shí)刻Tk修補(bǔ)步驟如下所示:

步驟1?將N趟班次按照到達(dá)站點(diǎn)Safter的先后順序,與對(duì)應(yīng)從Sk到Safter的行程時(shí)間,整合成為包含N趟班次的二維空間點(diǎn)集,如下所示:

P={(l1,tl1k,after),…,(ln,tlnk,after),…,(lN,tlNk,after)}(1)

步驟2?給定半徑ε,以及最小個(gè)數(shù)M,對(duì)步驟1的空間點(diǎn)集合進(jìn)行聚類得到Nsum個(gè)簇,即為{C1,C2,…,Cnsum,…,CNsum}。

步驟3?N趟班次中,缺少數(shù)據(jù)班次在Tafter到達(dá)站點(diǎn)Safter后第一個(gè)到達(dá)此站點(diǎn)的班次為:

laftern=argmin{Tlnafter|Tlnafter>Tafter,

ln∈{l1,l2,…,ln,…,lN}}(2)

缺少數(shù)據(jù)班次在Tafter到達(dá)站點(diǎn)Safter前最后一個(gè)到達(dá)此站點(diǎn)的班次為(如圖3所示):

lbeforen=argmax{Tlnafter|Tlnafter

ln∈{l1,l2,…,ln,…,lN}}(3)

步驟4?由步驟2可得,第laftern班次屬于簇Cone,則有Cone中全部點(diǎn)的縱坐標(biāo)取值范圍,即為行程時(shí)間取值范圍為TCone;第lbeforen班次屬于簇Ctwo,則有Ctwo中全部點(diǎn)的縱坐標(biāo)取值范圍,即為行程時(shí)間取值范圍為TCtwo。因此從站點(diǎn)Sk到Safter的行程時(shí)間取值范圍為:

Tk,after=TCone∪TCtwo(4)

步驟5?行程時(shí)間取值范圍Tk,after的最小值為:

tmink,after=minTk,after(5)

行程時(shí)間取值范圍Tk,after的最大值為:

tmaxk,after=maxTk,after(6)

則根據(jù)已知的到達(dá)站點(diǎn)Safter時(shí)刻為Tafter,判斷在時(shí)間段[Tafter-tmaxk,after,Tafter]內(nèi)是否有乘客刷卡:

如果有,可得第一位刷卡乘客的刷卡時(shí)刻為tICk,則有從站點(diǎn)Sk到Safter的行程時(shí)間為:

tk,after=Tafter-(tICk-Tp)(7)

其中:Tp表示乘客開(kāi)始刷卡時(shí)刻與車輛進(jìn)站的時(shí)間間隔,如圖4所示;tICk-Tp即為某班次在第k個(gè)缺失到站數(shù)據(jù)站點(diǎn)的到站時(shí)刻;如果沒(méi)有,則從站點(diǎn)Sk到Safter的行程時(shí)間為:

tk,after=tmink,after+tmaxk,after2(8)

步驟6?某班次在第k個(gè)缺失到站數(shù)據(jù)站點(diǎn)的到站時(shí)刻為:

Tk=Tafter-tk,after(9)

2.3?缺失公交到站數(shù)據(jù)修補(bǔ)流程

實(shí)際情況中,缺失到站名稱、經(jīng)緯度與缺失到站時(shí)刻同時(shí)存在,需要進(jìn)行基于DBSCAN和多源數(shù)據(jù)的缺失公交到站數(shù)據(jù)修補(bǔ),流程如圖5所示。

2.4?算法的理論分析與比較

設(shè)置基于GPS數(shù)據(jù)聚類的方法[9-14]作為對(duì)比實(shí)驗(yàn)的原因在于改變是否使用IC卡刷卡數(shù)據(jù),可以更好地說(shuō)明使用伴隨IC卡刷卡數(shù)據(jù)對(duì)于缺失到站數(shù)據(jù)修補(bǔ)的影響,因此將對(duì)比實(shí)驗(yàn)聚類方法選擇與本文一致的DBSCAN方法。現(xiàn)有研究表明,多源數(shù)據(jù)可以將多種環(huán)境或?qū)ο筮M(jìn)行綜合,可以獲得比單一信息源更高質(zhì)量、更精確、更完全的信息,因此本文方法理論上修補(bǔ)準(zhǔn)確度及普適性優(yōu)于基于GPS數(shù)據(jù)聚類的方法。在時(shí)、空復(fù)雜度方面,本文方法雖對(duì)缺失站點(diǎn)行程時(shí)間范圍內(nèi)IC卡刷卡數(shù)據(jù)進(jìn)行了排序,但此操作遠(yuǎn)低于聚類算法的時(shí)、空復(fù)雜度,因此本文方法的時(shí)、空復(fù)雜度即為對(duì)于N班次歷史行程時(shí)間及順序進(jìn)行基于DBSCAN聚類的時(shí)、空復(fù)雜度,與使用DBSCAN對(duì)GPS數(shù)據(jù)聚類方法的時(shí)、空復(fù)雜度一樣,與其他常見(jiàn)聚類修補(bǔ)數(shù)值方法的時(shí)、空復(fù)雜度比較如表5所示。

極大概率估計(jì)方法本文方法GPS、IC卡刷卡

相鄰刷卡記錄的固定分離閾值O(NIC)O(NIC)中低

ε、MO(N2)O(N)高高

!根據(jù)情況左右加

注:基于GPS數(shù)據(jù)kmeans聚類的參數(shù)表示迭代次數(shù)、表示聚類類別個(gè)數(shù);基于GPS數(shù)據(jù)DBSCAN聚類和本文方法中的參數(shù)ε表示半徑、

M表示最小個(gè)數(shù);極大概率估計(jì)方法的時(shí)、空復(fù)雜度中NIC表示某缺失數(shù)據(jù)班次的全部刷卡數(shù)據(jù)。

極大概率估計(jì)方法[18]首先擬合缺失站點(diǎn)對(duì)間的歷史行程時(shí)間分布;其次,根據(jù)相鄰刷卡記錄的固定分離閾值將同一輛車同方向各站的乘客刷卡記錄分離;然后,判斷若有乘客在缺失站點(diǎn)上車刷卡,則將缺失站點(diǎn)的第一條刷卡數(shù)據(jù)時(shí)刻作為時(shí)間戳,選擇參考點(diǎn)出站時(shí)刻到該時(shí)間戳的行程時(shí)間內(nèi),概率最大的到達(dá)站點(diǎn)即為缺失到站站點(diǎn);最后,對(duì)于無(wú)乘客上車刷卡的站點(diǎn),通過(guò)高斯分布函數(shù)等擬合參考點(diǎn)至站點(diǎn)的歷史行程時(shí)間集合,選擇概率最大的歷史行程時(shí)間作為缺失行程時(shí)間。對(duì)于有乘客上車的站點(diǎn),極大概率估計(jì)方法對(duì)缺失站點(diǎn)的第一條刷卡數(shù)據(jù)判斷準(zhǔn)確度依賴程度較高,此方法相鄰刷卡記錄的固定分離閾值無(wú)法根據(jù)不同站點(diǎn)之間的實(shí)際情況進(jìn)行調(diào)節(jié),噪聲點(diǎn)影響較大;對(duì)于無(wú)乘客上車刷卡的站點(diǎn),極大概率估計(jì)方法將全部歷史行程時(shí)間進(jìn)行擬合分布,但實(shí)際中大部分線路早、晚高峰時(shí)段的班次總數(shù)少于平峰時(shí)段的,因此概率最大的歷史行程時(shí)間最有可能為平峰時(shí)段的行程時(shí)間,不適用于發(fā)生擁堵概率最高的早晚高峰時(shí)段缺失數(shù)據(jù)的修補(bǔ)。

與極大概率估計(jì)方法相比,本文對(duì)歷史行程時(shí)間及班次順序進(jìn)行了聚類,使得相似的數(shù)據(jù)同簇,以此發(fā)現(xiàn)其內(nèi)在聯(lián)系,基于缺失數(shù)據(jù)班次的相似數(shù)據(jù)推算行程時(shí)間取值范圍,隨后再判斷行程時(shí)間取值范圍內(nèi)是否有IC卡刷卡數(shù)據(jù),若有則根據(jù)第一條刷卡數(shù)據(jù)確定缺失到站時(shí)刻,若無(wú)則根據(jù)相似行程時(shí)間數(shù)據(jù)確定缺失到站時(shí)刻。本文基于聚類后的相似數(shù)據(jù)對(duì)行程時(shí)間取值范圍個(gè)性化設(shè)置,比基于全部歷史數(shù)據(jù)的行程時(shí)間確定更符合實(shí)際,因此本文方法理論上修補(bǔ)準(zhǔn)確度及普適性優(yōu)于極大概率估計(jì)方法。在時(shí)、空間復(fù)雜度方面,由前文敘述已知,本文方法對(duì)于N班次歷史行程時(shí)間及順序進(jìn)行了基于DBSCAN的聚類,所以時(shí)間復(fù)雜度為O(N2)、空間復(fù)雜度為O(N),而極大概率方法雖無(wú)需進(jìn)行聚類且擬合用時(shí)較少,但需要對(duì)于某缺失數(shù)據(jù)班次的全部NIC條刷卡數(shù)據(jù)進(jìn)行判斷相鄰刷卡的時(shí)間間隔大小,因此時(shí)、空復(fù)雜度皆為O(NIC)。實(shí)際情況中一條線路某天某班次的刷卡數(shù)據(jù)量遠(yuǎn)大于該天的班次數(shù)量,因此本文方法的空間復(fù)雜度遠(yuǎn)低于極大概率方法的空間復(fù)雜度,并且在早晚高峰時(shí)段缺失數(shù)據(jù)班次的刷卡數(shù)量較多時(shí),本文方法的時(shí)間復(fù)雜度與極大概率方法的相近。

3?實(shí)驗(yàn)與結(jié)果

3.1?數(shù)據(jù)集介紹

本文以廈門市2018年某日為例,對(duì)于8*9路上行方向的68個(gè)班次2-584條到站數(shù)據(jù)進(jìn)行統(tǒng)計(jì)(每班次38個(gè)到站站點(diǎn)),發(fā)現(xiàn)GPS數(shù)據(jù)缺失情況如表6所示。選取到站數(shù)據(jù)完整的8*9路上行方向到達(dá)第35個(gè)站點(diǎn)時(shí)刻為19:33:07的班次,以及此班次GPS到站數(shù)據(jù)的(東)經(jīng)度、(北)緯度和乘客IC刷卡數(shù)據(jù)。

3.2?檢驗(yàn)方法和評(píng)價(jià)標(biāo)準(zhǔn)

檢驗(yàn)方法?根據(jù)第35個(gè)站點(diǎn)的到站數(shù)據(jù),推得第24~34個(gè)站點(diǎn)的到站數(shù)據(jù),并與真實(shí)到站數(shù)據(jù)比較進(jìn)行方法檢驗(yàn)。

評(píng)價(jià)標(biāo)準(zhǔn)?1)使用錯(cuò)誤修補(bǔ)的比率(False Classify Rate, FCR)[21]對(duì)修補(bǔ)的到站名稱、到站經(jīng)緯度進(jìn)行檢驗(yàn)。2)用平均相對(duì)誤差指標(biāo)與相關(guān)系數(shù)指標(biāo)對(duì)修補(bǔ)的到站時(shí)刻進(jìn)行檢驗(yàn):設(shè)平均相對(duì)誤差指標(biāo)為δave,其代表修補(bǔ)值可信程度,數(shù)值越小越說(shuō)明修補(bǔ)值與真實(shí)值差距越小、修補(bǔ)準(zhǔn)確度越高;設(shè)相關(guān)系數(shù)指標(biāo)為R,其代表修補(bǔ)值與真實(shí)值相關(guān)性的相關(guān)性系數(shù),數(shù)值越接近1越說(shuō)明修補(bǔ)方法準(zhǔn)確度受缺失站點(diǎn)個(gè)數(shù)變化的影響程度越低,即隨著缺失數(shù)據(jù)站點(diǎn)與完整到站數(shù)據(jù)站點(diǎn)的距離變化、修補(bǔ)準(zhǔn)確率變化較小。

3.3?缺失公交到站數(shù)據(jù)修補(bǔ)結(jié)果

首先,確定DBSCAN聚類算法中最小個(gè)數(shù)M的取值,對(duì)于研究的N=56班次假設(shè)缺失數(shù)據(jù)站點(diǎn)34和完整到站數(shù)據(jù)站點(diǎn)35,隨著最小個(gè)數(shù)的取值不同,簇的數(shù)量隨著半徑的變化如圖6所示:大部分情況下,簇的個(gè)數(shù)與最小個(gè)數(shù)的大小成反比。由現(xiàn)實(shí)情況可知,一天中公交兩站點(diǎn)間行程時(shí)間,可根據(jù)普遍的工作時(shí)間與休息時(shí)間理論上至少可分為早高峰、上午平峰、午高峰、下午平峰、晚高峰、晚上平峰6種,即為6個(gè)簇,則當(dāng)M=2或M=3時(shí)符合此現(xiàn)狀。但是簇的個(gè)數(shù)過(guò)多會(huì)造成聚類效果差,缺失數(shù)據(jù)補(bǔ)全時(shí)依據(jù)的歷史數(shù)據(jù)過(guò)少,使得修補(bǔ)效果差的情況,因此本文令M=3。

其次,在確定M=3后,當(dāng)缺失數(shù)據(jù)站點(diǎn)與完整數(shù)據(jù)站點(diǎn)相距不同時(shí),即為站點(diǎn)對(duì)不同時(shí),簇的數(shù)量隨半徑的變化呈現(xiàn)不規(guī)則變化,如圖7所示:當(dāng)M=3時(shí),缺失數(shù)據(jù)站點(diǎn)與完整數(shù)據(jù)站點(diǎn)相距不同所對(duì)應(yīng)的最大簇的數(shù)量是可接受的個(gè)數(shù),大于理論最小值6且沒(méi)有過(guò)大造成聚類效果差。因此,本文選擇最大簇的個(gè)數(shù)作為行程時(shí)間聚類后的簇個(gè)數(shù),但此時(shí)對(duì)應(yīng)的半徑可能為多個(gè)值。由于最小個(gè)數(shù)、簇的數(shù)量一定的情況下,半徑越大、噪聲點(diǎn)的數(shù)量越少,可以利用的歷史數(shù)據(jù)量越多,因此本文選擇簇個(gè)數(shù)最大時(shí)的最大半徑作為缺失站點(diǎn)數(shù)據(jù)進(jìn)行行程時(shí)間聚類時(shí)的半徑ε?;谏鲜龇治觯疚膶?duì)于56班次第24~34個(gè)站點(diǎn)的到站數(shù)據(jù)分別進(jìn)行修補(bǔ)時(shí),得到聚類結(jié)果如表7所示。

最后,對(duì)于2018年某日廈門市全部線路的已識(shí)別上車站點(diǎn)數(shù)據(jù),本文進(jìn)行乘客開(kāi)始刷卡與到站的時(shí)間間隔統(tǒng)計(jì):乘客開(kāi)始刷卡時(shí)刻與到站時(shí)刻間隔的均值為1.03s;有80.04%的時(shí)間間隔在1s及以內(nèi)(如表8所示)。因此,本文選擇乘客開(kāi)始刷卡與到站的時(shí)間間隔Tp=1.00s。

根據(jù)實(shí)驗(yàn)參數(shù)的設(shè)置結(jié)果,本文方法與兩種對(duì)比方法的修補(bǔ)到站時(shí)刻如圖8所示、檢驗(yàn)結(jié)果如表9所示,并且不同方法修補(bǔ)到站時(shí)刻相對(duì)誤差如圖9所示(時(shí)刻皆已轉(zhuǎn)換成為一天00:00:00為參照以秒為單位的數(shù)值)。

3.4?結(jié)果分析

1)由表9可得,對(duì)于FCR指標(biāo)代表的缺失到站名稱、經(jīng)緯度修補(bǔ),三種方法皆可以進(jìn)行100%的正確修補(bǔ)。說(shuō)明是否使用多源數(shù)據(jù)和聚類后的歷史相似班次數(shù)據(jù)進(jìn)行推算,對(duì)于缺失到站名稱、經(jīng)緯度修補(bǔ)的影響較小。

2)圖8可得,本文方法在三種方法中,大部分站點(diǎn)的修補(bǔ)到站時(shí)刻與真實(shí)到站時(shí)刻相距最近,且由表9可得本文方法平均相對(duì)誤差δave=0.113-5%是三種方法中最小、修補(bǔ)準(zhǔn)確度最高的。同時(shí)發(fā)現(xiàn),僅使用單一GPS數(shù)據(jù)源的基于DBSCAN和多源數(shù)據(jù)方法的平均相對(duì)誤差,遠(yuǎn)大于另外兩種基于多源數(shù)據(jù)的方法,說(shuō)明使用多源數(shù)據(jù)對(duì)于缺失到站時(shí)刻修補(bǔ)的準(zhǔn)確度產(chǎn)生了有益的影響;并且與本文同樣使用多源數(shù)據(jù),但基于全部歷史班次數(shù)據(jù)確定行程時(shí)間取值范圍的極大概率估計(jì)方法平均相對(duì)誤差比本文方法的大,說(shuō)明使用聚類后的相似班次數(shù)據(jù)對(duì)于缺失到站時(shí)刻修補(bǔ)的準(zhǔn)確度產(chǎn)生了有益的影響。

3)圖9可得,本文方法的相對(duì)誤差隨缺失站點(diǎn)個(gè)數(shù)的變化趨勢(shì)最平穩(wěn),且由表9可得本文方法的R值最接近1,因此是三種方法中準(zhǔn)確度受缺失站點(diǎn)個(gè)數(shù)變化的影響程度最低、普適性最高的。同時(shí)發(fā)現(xiàn),極大概率估計(jì)方法R值最小,說(shuō)明本文方法使用聚類后的相似班次數(shù)據(jù)進(jìn)行推算,對(duì)于缺失到站時(shí)刻修補(bǔ)的普適性產(chǎn)生了有益的影響,并且得到是否使用多源數(shù)據(jù)對(duì)于缺失到站時(shí)刻修補(bǔ)的普適性影響較小的結(jié)論。

綜上,本文基于DBSCAN和多源數(shù)據(jù)的缺失公交到站數(shù)據(jù)修補(bǔ)方法,在缺失到站時(shí)刻修補(bǔ)的準(zhǔn)確度和普適性方面均優(yōu)于基于GPS數(shù)據(jù)聚類的方法和極大概率估計(jì)方法。同時(shí)得到結(jié)論:多源數(shù)據(jù)對(duì)于缺失到站時(shí)刻修補(bǔ)的準(zhǔn)確度產(chǎn)生了有益的影響、但對(duì)于普適性影響較小;使用歷史數(shù)據(jù)聚類后的相似班次數(shù)據(jù)進(jìn)行推算,對(duì)于缺失到站時(shí)刻修補(bǔ)的準(zhǔn)確度和普適性都產(chǎn)生了有益的影響。

4?結(jié)語(yǔ)

本文提出了一種基于DBSACN和多源數(shù)據(jù)的缺失到站數(shù)據(jù)修補(bǔ)方法,可對(duì)缺失的公交到站經(jīng)緯度、到站時(shí)刻數(shù)據(jù)進(jìn)行修補(bǔ)。實(shí)例及檢驗(yàn)結(jié)果表明,本文方法使用了多源數(shù)據(jù)和聚類后的相似班次數(shù)據(jù)進(jìn)行推算,比現(xiàn)有的基于GPS數(shù)據(jù)聚類的方法、基于極大概率估計(jì)方法,具有更好的修補(bǔ)準(zhǔn)確度和普適性。

本方法的缺陷為無(wú)法對(duì)于整班次的到站時(shí)刻丟失數(shù)據(jù)進(jìn)行修補(bǔ),也無(wú)法對(duì)于缺失的到站時(shí)刻數(shù)據(jù)進(jìn)行實(shí)時(shí)預(yù)測(cè)或修補(bǔ),且DBSCAN聚類算法中有較多參數(shù)需要調(diào)整,與其他方法的時(shí)、空復(fù)雜度可在不同地區(qū)不同數(shù)據(jù)集上進(jìn)行應(yīng)用對(duì)比,如有條件還可將由乘客經(jīng)驗(yàn)得到的先驗(yàn)知識(shí)加入考慮,未來(lái)可對(duì)這些情況進(jìn)行深入研究。

參考文獻(xiàn) (References)

[1]LIU Y, WENG X, WAN J, et al. Exploring data validity in transportation systems for smart cities[J]. IEEE Communications Magazine, 2017, 55(5): 26-33.

[2]ZHANG H, ZHANG Y, LI Z, et al. Spatialtemporal traffic data analysis based on global data management using MAS[J]. IEEE Transactions on Intelligent Transportation Systems, 2004, 5(4): 267-275.

[3]吳昊,唐振軍.加權(quán)殼近鄰填充數(shù)學(xué)模型[J].華南師范大學(xué)學(xué)報(bào)(自然科學(xué)版),2013,45(3):45-48.(WU H, TANG Z J. Weighted shellneighbor imputation models[J]. Journal of South China Normal University (Natural Science Edition), 2013, 45(3): 45-48.)

[4]ZHOU Y, YAO L, CHEN Y, et al. Bus arrival time calculation model based on smart card data[J]. Transportation Research Part C: Emerging Technologies, 2017, 74: 81-96.

[5]CHEN J, SHAO J. Nearest neighbor imputation for survey data[J]. Journal of Official Statistics, 2000, 16(2): 113-131.

[6]郝勝軒,宋宏,周曉鋒.基于近鄰噪聲處理的KNN缺失數(shù)據(jù)填補(bǔ)算法[J]. 計(jì)算機(jī)仿真,2014,31(7):264-268. (HAO S X,SONG H,ZHOU X F. Predicting missing values with KNN based on the elimination of neighbor noise[J]. Computer Simulation,2014,31(7):264-268.)

[7]陳奔.基于雙向遞歸神經(jīng)網(wǎng)絡(luò)的軌跡數(shù)據(jù)修復(fù)[D].濟(jì)南:山東大學(xué), 2018.(CHEN B. Trajectory data repair based on bidirectional recurrent neural network [D]. Jinan: Shandong University, 2018.)

[8]QU L, ZHANG Y, HU J, et al. A BPCA based missing value imputing method for traffic flow volume data[C]// Proceedings of the 2008 IEEE Intelligent Vehicles Symposium. Piscataway: IEEE, 2008: 985-990.

[9]趙霞,張勇,尹寶才,等.基于改進(jìn)k*means 算法的不完整公交到站時(shí)間填充[J]. 北京工業(yè)大學(xué)學(xué)報(bào), 2018, 44(1): 135-143. (ZHAO X, ZHANG Y, YIN B C, et al. Imputation of incomplete bus arrival time based on the improvedk*means algorithm [J]. Journal of Beijing University of Technology, 2018, 44(1): 135-143.)

[10]王妍,王鳳桐,王俊陸,等.基于泛化中心聚類的不完備數(shù)據(jù)集填補(bǔ)方法[J]. 小型微型計(jì)算機(jī)系統(tǒng), 2017, 38(9) :2017-2021. (WANG Y, WANG F T, WANG J L, et al. Missing data imputation approach based on generalized centroids clustering algorithm[J]. Journal of Chinese Computer Systems, 2017, 38(9) :2017-2021.)

[11]冷泳林,陳志奎,張清辰,等.不完整大數(shù)據(jù)的分布式聚類填充算法[J].計(jì)算機(jī)工程, 2015, 41(5):19-25. (LENG Y L, CHEN Z K, ZHANG Q C, et al. Distributed clustering and filling algorithm of incomplete big data[J]. Computer Engineering, 2015,41(5):19-25.)

[12]趙亮,陳志奎,張清辰.基于分布式減法聚類的不完整數(shù)據(jù)填充算法[J].小型微型計(jì)算機(jī)系統(tǒng), 2015, 36(7):1409-1414. (ZHAO L, CHEN Z K, ZHANG Q C. Incomplete data imputation algorithm based on distributed subtractive clustering[J]. Journal of Chinese Computer Systems, 2015, 36(7):1409-1414.)

[13]韓飛,沈鎮(zhèn)林.基于不完備集雙聚類的缺失數(shù)據(jù)填補(bǔ)算法[J].計(jì)算機(jī)工程, 2016, 42(4): 20-26. (HAN F, SHEN Z L. Missing data filling algorithm based on incomplete set biclustering[J]. Computer Engineering, 2016, 42(4): 20-26.)

[14]劉思謙,陳志奎,蔣昆佑,等.一種混雜的多核估計(jì)數(shù)據(jù)填充方法[J].小型微型計(jì)算機(jī)系統(tǒng), 2017, 38(7):1523-1527. (LIU S Q, CHEN Z K,JIANG K Y, et al. Hybrid multikernels estimation method for data imputation[J]. Journal of Chinese Computer Systems, 2017, 38(7):1523-1527.)

[15]VIGGIANO C, KOUTSOPOULOS H N, WILSON N H M, et al. Journeybased characterization of multimodal public transportation networks[J]. Public Transport, 2017, 9(1/2): 437-461.

[16]LI M, DU B, HUANG J, et al. Public transport smart card data analysis and passenger flow distribution[EB/OL].[2019-05-20].https://ascelibrary.org/doi/abs/10.1061/9780784412442.170。.

[17]BARRY J J, FREIMER R, SLAVIN H L. Use of entryonly automatic fare collection data to estimate linked transit trips in New York city[J]. Transportation Research Record Journal of the Transportation Research Board, 2009, 2112:53-61.

[18]劉永鑫.基于多源數(shù)據(jù)融合的城市公交系統(tǒng)乘客出行模式挖掘及其應(yīng)用研究[D].廣州: 華南理工大學(xué),2018.(LIU Y X. Study on key technologies of transit passengers travel pattern mining and applications based on multiple sources of data[D]. Guangzhou: South China University of Technology,2018.)

[19]潘冬明,黃德才.基于相對(duì)密度的不確定數(shù)據(jù)聚類算法[J].計(jì)算機(jī)科學(xué),2015,42(11A): 72-88. (PAN D M, HUANG D C. Relative densitybased clustering algorithm over uncertain data[J]. Computer Science,2015,42(11A): 72-88.)

[20]張建民,姚亮,胡學(xué)鋼.一種面向數(shù)據(jù)缺失問(wèn)題的Kmeans 改進(jìn)算法[J].合肥工業(yè)大學(xué)學(xué)報(bào)(自然科學(xué)版), 2008, 31(9): 1455-1457. (ZHAN J M, YAO L, HU X G. An improvedKmeans algorithm for the datamissing problem[J]. Journal of Hefei University of Technology (Natural Science), 2008, 31(9): 1455-1457.)

[21]應(yīng)臻奕.基于AP聚類的不完備數(shù)據(jù)處理方法的研究與實(shí)現(xiàn)[D]. 北京:北京郵電大學(xué),2018.(YING Z Y. Research and implementation of incomplete data processing based on AP clustering[D]. Beijing: Beijing University of Posts and Telecommunications,2018.)

This work is partially supported by the National Natural Science Foundation of China Youth Fund (51608209), the Project of Natural Science Foundation of Fujian Province (2017J01090), the Project of Guiding Plan of Fujian Province (2019H0017), the Project of Quanzhou Science and Technology Plan (2018Z008), the Project of Postgraduate Research Innovation Ability Cultivation of Huaqiao University (17013083017).

WANG Cheng, born in 1984, Ph. D., associate professor. His research interests include traffic big data, data mining, machine learning.

CUI Ziwei, born in 1995, M. S. candidate. Her research interests include traffic big data, data mining.

DU Zilin, born in 1997. His research interests include traffic big data, data mining.

GAO Yueer, born in 1983, Ph. D., associate professor. Her research interests include traffic big data, data mining.

安塞县| 华池县| 绥芬河市| 冷水江市| 旌德县| 阿克陶县| 唐山市| 石渠县| 台安县| 灵武市| 蕲春县| 措美县| 白城市| 怀化市| 佛坪县| 饶阳县| 迁西县| 达拉特旗| 齐河县| 南昌县| 南通市| 静安区| 滦南县| 剑川县| 开封县| 金湖县| 金秀| 兴宁市| 赣榆县| 攀枝花市| 南宫市| 武宁县| 临夏市| 绥棱县| 温泉县| 卢湾区| 珠海市| 仁布县| 台南县| 崇信县| 达尔|