Pun Recognition Method Based on Pseudo-Labels and Transfer Learning

2024-05-15 21:04:51  JIANG Siyu, ZHANG Zhiheng, JIANG Libiao, MA Le, CHEN Boyuan, WANG Lianxi, ZHAO Liang

Journal of Chongqing University, 2024, Issue 2


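Abstract: To address the shortage of pun samples, this study proposes a pun recognition model based on pseudo-labels and transfer learning (pun detection based on pseudo-label and transfer learning). The model first uses contextual semantics, phoneme vectors, and an attention mechanism to generate pseudo-labels. It then combines transfer learning with confidence scores to select usable pseudo-labels. Finally, the pseudo-labeled data and the real data are mixed and fed to the network for training, and the pseudo-labeling and mixed-training steps are repeated. This alleviates, to some extent, the problem that pun samples are few and hard to obtain. Pun detection experiments on the SemEval 2017 shared task 7 and Pun of the Day datasets show that the model outperforms existing mainstream pun recognition methods.

Keywords: pun detection; pseudo-label; transfer learning

CLC number: TP391.1    Document code: A    Article ID: 1000-582X(2024)02-051-11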

Pun detection based on pseudo-label and transfer learning

JIANG Siyu1,2a, ZHANG Zhiheng1, JIANG Libiao3a, MA Le3b, CHEN Boyuan2b, WANG Lianxi1, ZHAO Liang4

(1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, P. R. China; 2a. School of Software; 2b. School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510000, P. R. China; 3a. School of Mechanical Engineering; 3b. Engineering Research Institute, Guangzhou City University of Technology, Guangzhou 510800, P. R. China; 4. College of Further Education, Guangdong Industry Polytechnic, Guangzhou 510300, P. R. China)

Abstract: To address the shortage of pun samples, this paper proposes a pun recognition model based on pseudo-labels and transfer learning (pun detection based on pseudo-label and transfer learning). First, the model uses contextual semantics, phoneme vectors, and an attention mechanism to generate pseudo-labels. Then, it combines transfer learning and confidence to select useful pseudo-labels. Finally, the pseudo-labeled data and the real data are mixed and fed into the network for training, and the pseudo-labeling and mixed-training procedures are repeated. To a certain extent, this alleviates the small sample size of puns and the difficulty of obtaining them. Using this model, we carry out pun detection experiments on both the SemEval 2017 shared task 7 dataset and the Pun of the Day dataset. The results show that the model performs better than existing mainstream pun recognition methods.

Keywords: pun detection; pseudo-label; transfer learning

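With the continued growth of social media, people have created a large amount of humorous content online. Humor often has a complex structure and depends on real-world knowledge. In natural language, the pun, a common rhetorical device, is an important form of humor. A pun blurs the true meaning of a word so that one sentence admits two or more interpretations, which gives the text varying degrees of sensitivity. Puns are a standard rhetorical source of humor in famous literature, advertising, and speeches. In advertising, for example, a pun is often used as a humorous device that leads the audience to the latent expression behind it, which both attracts attention and strengthens memory [1], and it also helps in judging the sentiment orientation of a text. Automatic pun recognition is therefore regarded as an important research topic in traditional linguistics and in the cognitive-science side of natural language processing, with broad application value.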

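Puns are classically divided into homophonic puns and semantic puns [2]. A semantic pun relies on one word with several senses. In Table 1, "What's the longest sentence in the world? Life sentence." is a semantic pun: "sentence" also means a prison term, so "life sentence" means life imprisonment. A homophonic pun relies on two different words that sound alike and both fit the context. In Table 1, "two-tyred" in "A bicycle can't stand on its own because it is two-tyred" can be heard as "too tired", which gives the sentence an entirely different meaning. Understanding puns is important for understanding complex semantics in depth.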

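With the development of deep neural networks, most existing pun recognition models are neural. For example, Diao et al. [3] proposed the WordNet-encoded collocation-attention network (WECA), which takes embeddings encoded with the English dictionary WordNet as input, combines them with context weights, and uses a neural attention network to capture the polysemy in semantic puns. Such neural approaches have two shortcomings. 1) Existing models depend on large amounts of labeled data. In practice puns are hard to collect, and accurate judgment and classification generally require annotators with rich relevant knowledge. The SemEval 2017 shared task 7 (SemEval 2017) dataset released by Miller et al. [4] contains 4,030 pun examples in total, which reflects how difficult it is to collect and annotate puns. 2) In few-shot learning, improving the generalization ability of a model remains a challenging problem.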

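We propose a pun recognition model based on pseudo-labels and transfer learning (pun detection based on pseudo-label and transfer learning, PDPTL). The model exploits the overlapping information in unlabeled data to find more general features among data of the same kind, combines transfer learning with confidence to select usable pseudo-labels, and repeats the pseudo-labeling and mixed-training process. This eases, to some extent, the scarcity of pun samples and the limited generalization ability of models. In experiments, PDPTL noticeably improves prediction on public datasets and outperforms the currently known methods.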

1 Related Work

Pun tasks cover pun recognition and generation. This study applies pseudo-label and transfer learning techniques to offer a new approach to these tasks.

1.1 Pun Recognition and Generation

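Pedersen [5] used word sense disambiguation (WSD) [6] to identify the plausible senses of the words in a sentence and thereby recognize puns. Dieke et al. [7] used external resources, such as the English dictionary WordNet, to determine the senses of pun words. Both approaches have drawbacks: the former cannot handle homophones, because homophones are spelled differently, and the latter is limited by the finite vocabulary of the knowledge base. To address these two problems, the word embedding techniques (WET) of Mikolov et al. [8] and Pennington et al. [9] provide flexible representations for puns. In practice, however, a word may have several senses depending on its context, and rare senses may also be exploited to create puns, so these static word embeddings cope poorly with such dynamic variation. To address this, Zhou et al. [10] proposed the pronunciation-attentive contextualized pun recognition model (PCPR), which applies both contextual semantic vectors and pronunciation embedding vectors to pun recognition and achieved good results. Xiu et al. [11] trained an unsupervised model based on a lexical network and word embeddings; the model relies only on semantics to detect homographic puns and ignores the rich information carried by pronunciation. Doogan Samuel et al. [12] concatenated pronunciation strings to exploit word embeddings together with phonetic information, but concatenation alone has limited effect. Long short-term memory (LSTM) networks and conditional random field (CRF) tagging have also been used to detect and locate puns jointly.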

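1.2 Pseudo-Labels

In 2013, Lee [13] introduced a simple and effective semi-supervised learning method called the pseudo-label. The idea is to train one model on a batch of labeled and unlabeled images at the same time. A model is trained in a supervised manner on the labeled and unlabeled data, it then predicts a batch of unlabeled data to generate pseudo-labels, and finally a new model is trained on the labeled and pseudo-labeled data.

Xie et al. [14] of Google AI proposed noisy student, a semi-supervised method inspired by knowledge distillation. Its core idea is to train two different models, a teacher and a student. The teacher model is first trained on labeled images and then infers pseudo-labels for unlabeled images. The labeled and pseudo-labeled images are combined, and the student model is trained on the combined data. The student then serves as the new teacher and the process iterates. Most of the unlabeled data used in that study do not follow the distribution of the target dataset. The pseudo-label methods above have mostly been applied to image processing.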

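1.3 Transfer Learning

Transfer learning aims to improve a target learner's performance on the target domain by transferring knowledge contained in different but related source domains, which reduces the dependence on large amounts of target-domain data when building the target learner [15]. According to the discrepancy between domains, transfer learning falls into two categories: homogeneous and heterogeneous transfer learning [16]. 1) In homogeneous transfer learning, some studies adapt the marginal distributions of the domains by correcting sample selection bias [17] or covariate shift [18]. These studies assume that the domains differ only in their marginal distributions, and this assumption does not hold in many cases. In sentiment classification, for example, a word can carry different polarity in different domains, a phenomenon also called context feature bias. To handle it, some studies further adapt the conditional distributions. 2) Heterogeneous transfer learning refers to knowledge transfer when the domains have different feature spaces. In addition to distribution adaptation, it requires feature-space adaptation [19], which makes it more complex than homogeneous transfer learning. This work deals with pun datasets that have similar feature spaces, so it belongs to homogeneous transfer learning.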

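2 Pronunciation-Attentive Contextualized Pun Recognition Model with Pseudo-Label Domain Adaptation

This section constructs the proposed model: PDPTL, a pun recognition model based on pseudo-labels and transfer learning.

2.1 Task Overview

Following the task definition of Zhou et al. [10], consider a text of $N$ words $\{t_1, t_2, \ldots, t_N\}$. Each word $t_i$ has $M_i$ phonemes and, according to its pronunciation, can be written as $H(t_i)=\{h_{i,1}, h_{i,2}, \ldots, h_{i,M_i}\}$, where $h_{i,j}$ is the $j$-th phoneme of the $i$-th word in the text. The phonemes are provided by the CMU Pronouncing Dictionary [16]. Pun detection is a binary classification problem whose goal is to decide whether the input text contains a pun.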
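As an illustration of the notation, the following minimal sketch maps each word $t_i$ to its phoneme sequence $H(t_i)$. The tiny lexicon `TOY_PRONUNCIATIONS` and the helper `phonemes_of` are hypothetical and exist only for this example; a real system would look words up in the full CMU Pronouncing Dictionary, whose entries also carry stress markers that are omitted here.

```python
from typing import Dict, List

# Hypothetical miniature lexicon with ARPAbet-style symbols (stress markers omitted).
# It stands in for the CMU Pronouncing Dictionary purely for illustration.
TOY_PRONUNCIATIONS: Dict[str, List[str]] = {
    "life": ["L", "AY", "F"],
    "sentence": ["S", "EH", "N", "T", "AH", "N", "S"],
}


def phonemes_of(tokens: List[str], lexicon: Dict[str, List[str]]) -> List[List[str]]:
    """Return H(t_i) for every token t_i; unknown words get an empty list."""
    return [lexicon.get(token.lower(), []) for token in tokens]


tokens = ["Life", "sentence"]
H = phonemes_of(tokens, TOY_PRONUNCIATIONS)
for token, phonemes in zip(tokens, H):
    # M_i is simply the number of phonemes of word t_i.
    print(token, len(phonemes), phonemes)
```

Returning an empty list for out-of-vocabulary words is one possible design choice; the text does not say how such words are handled.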

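2.2 PDPTL Model Framework

Base model: PDPTL uses PCPR as its base model. The model uses BERT [20] to generate the contextual semantic vector $TC_i$ (a $D_C$-dimensional vector) of each word and the overall semantics $TC_{[CLS]}$ of the text.

Each phoneme $h_{i,j}$ of word $t_i$ is projected to a $D_P$-dimensional vector $p_{i,j}$ with a Keras Embedding layer. The phoneme vectors are then weighted by a local-attention mechanism [21] to produce the pronunciation embedding vector $TP_i$: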

$$e_{i,j}=\tanh\left(F_P(p_{i,j})\right), \tag{1}$$

$$\alpha_{i,j}^{P}=\frac{e_{i,j}^{\top}e_s}{\sum_{k}e_{i,k}^{\top}e_s}, \tag{2}$$

$$TP_i=\sum_{j}\alpha_{i,j}^{P}\,e_{i,j}, \tag{3}$$

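where $F_P(\cdot)$ is a fully connected layer that outputs a $D_a$-dimensional vector; $\alpha_{i,j}^{P}$ is the importance score of $p_{i,j}$; $e_s$ is a $D_a$-dimensional vector used to assess the importance of each pronunciation embedding; and $D_a$ is the size of the local-attention mechanism defined by the model.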
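The computation in Eqs. (1)-(3) can be sketched for a single word as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the weights `W`, `b` and the vector `e_s` are random stand-ins for trained parameters, the sizes are toy values, and the scores are normalised with a softmax so that the weights are positive and sum to one, whereas Eq. (2) as printed normalises the raw scores.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pronunciation_embedding(P: Matrix, W: Matrix, b: Vector, e_s: Vector) -> Vector:
    """Compute TP_i for one word from its phoneme vectors p_{i,j} (rows of P)."""
    # Eq. (1): e_{i,j} = tanh(F_P(p_{i,j})), with F_P(p) = W p + b.
    E = [[math.tanh(dot(row, p) + bias) for row, bias in zip(W, b)] for p in P]
    # Eq. (2): importance scores; the softmax normalisation is an assumption.
    alphas = softmax([dot(e, e_s) for e in E])
    # Eq. (3): attention-weighted sum of the e_{i,j}.
    return [sum(a * e[d] for a, e in zip(alphas, E)) for d in range(len(e_s))]


def random_vector(size: int, rng: random.Random) -> Vector:
    """Random stand-in for a trained parameter or embedding."""
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


rng = random.Random(0)
D_P, D_a, M_i = 4, 3, 5  # toy sizes, chosen only for illustration
P = [random_vector(D_P, rng) for _ in range(M_i)]
W = [random_vector(D_P, rng) for _ in range(D_a)]
b = random_vector(D_a, rng)
e_s = random_vector(D_a, rng)

TP_i = pronunciation_embedding(P, W, b, e_s)
print(len(TP_i))  # the embedding has D_a components
```

Because $TP_i$ is a convex combination of tanh outputs under this normalisation, every component stays strictly between -1 and 1.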

The contextual semantic vector $TC_i$ and the pronunciation embedding vector $TP_i$ are concatenated into $TJ_i$ (a vector of dimension $D_J=D_C+D_a$), and a self-attention mechanism [22] weights these vectors to give the self-attention embedding vector $TJ_{[ATT]}$:

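$$TJ_i=[TC_i;TP_i], \tag{4}$$

$$F_S(T)=\mathrm{Softmax}\left(\frac{TT^{\top}}{\sqrt{a}}\right)T, \tag{5}$$

$$\alpha_i^{S}=\frac{\exp\left(F_S(TJ_i)\right)}{\sum_{j}\exp\left(F_S(TJ_j)\right)}, \tag{6}$$

$$TJ_{[ATT]}=\sum_{i}\alpha_i^{S}\cdot TJ_i, \tag{7}$$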

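where $F_S(T)$ is the function used to estimate attention; $\alpha_i^{S}$ is the importance score of each word $t_i$; and $a$ is a scaling factor that avoids overly small gradients. Finally, $TJ_{[ATT]}$ and $TC_{[CLS]}$ are concatenated to form the overall feature of the input text, the pronunciation-joint contextual semantic vector $TJ_{[CLS]}$:

$$TJ_{[CLS]}=[TC_{[CLS]};TJ_{[ATT]}]. \tag{8}$$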
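The shapes involved in Eqs. (4)-(8) can be checked with a small sketch. It shows one possible reading only: the text does not say how the row of $F_S(T)$ belonging to word $i$ becomes the scalar used in Eq. (6), so the sketch sums the components of that row, and it sets the scaling factor $a$ to the joint dimension $D_J$. Both choices are assumptions, and all input values are made up.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(T: Matrix, a: float) -> Matrix:
    """Eq. (5): Softmax(T T^T / sqrt(a)) T, with the softmax taken row by row."""
    width = len(T[0])
    attended = []
    for t_i in T:
        weights = softmax([sum(x * y for x, y in zip(t_i, t_j)) / math.sqrt(a) for t_j in T])
        attended.append([sum(w * t_j[d] for w, t_j in zip(weights, T)) for d in range(width)])
    return attended


def text_feature(TC: Matrix, TP: Matrix, TC_cls: Vector) -> Vector:
    """Build TJ_[CLS] from word-level vectors following Eqs. (4)-(8)."""
    TJ = [tc + tp for tc, tp in zip(TC, TP)]              # Eq. (4): concatenation
    attended = self_attention(TJ, a=float(len(TJ[0])))    # Eq. (5); a = D_J is assumed
    scores = [sum(row) for row in attended]               # assumed scalar reduction
    alphas = softmax(scores)                              # Eq. (6)
    TJ_att = [sum(al * tj[d] for al, tj in zip(alphas, TJ)) for d in range(len(TJ[0]))]  # Eq. (7)
    return TC_cls + TJ_att                                # Eq. (8): concatenation


# Toy inputs: three words, D_C = 2 and D_a = 1.
TC = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
TP = [[0.2], [-0.3], [0.4]]
TC_cls = [0.05, 0.1]

TJ_cls = text_feature(TC, TP, TC_cls)
print(len(TJ_cls))  # D_C + (D_C + D_a) components
```

The final vector has $D_C + D_J$ components, which is the input size the classification layer $F_D$ would need under this reading.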

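The predicted label is given by a fully connected layer with a softmax activation:

$$\hat{y}_i^{L}=\arg\max_{k} F_D\left(TJ_{[CLS]}\right)_k,\quad k\in\{0,1\}, \tag{9}$$

where $F_D(\cdot)_k$ produces the values of the two classes in the binary classification.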

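Pseudo-labels: Previous pseudo-label methods usually select pseudo-labels by keeping high-confidence samples. This strategy rests on the cluster assumption, namely that high-confidence samples are more likely to belong to the same class. Concretely, a confidence threshold confidence_coefficient is set, and the model adds a pseudo-labeled sample to the training data only when the probability of its pseudo-label exceeds confidence_coefficient.

The probability is given by

$$\mathrm{confidence}=\max\left(\mathrm{Softmax}\left(F_D\left(TJ_{[CLS]}\right)\right)\right). \tag{10}$$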
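Eqs. (9) and (10) together amount to taking the arg max and the max of the softmax output. The sketch below applies a fixed threshold to a few made-up logit pairs; the function names and the example logits are illustrative assumptions, and the threshold of 0.99 is chosen only to keep the example readable.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits: List[float]) -> Tuple[int, float]:
    """Return the predicted class (Eq. 9) and its confidence (Eq. 10)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


def filter_by_confidence(batch_logits: List[List[float]], threshold: float) -> List[Tuple[int, int, float]]:
    """Keep (sample index, pseudo-label, confidence) for samples above the threshold."""
    kept = []
    for index, logits in enumerate(batch_logits):
        label, confidence = predict_with_confidence(logits)
        if confidence > threshold:
            kept.append((index, label, confidence))
    return kept


# Made-up outputs of F_D for three unlabeled texts (class 0 = no pun, class 1 = pun).
batch_logits = [[4.0, -4.0], [0.2, 0.1], [-5.0, 5.0]]
selected = filter_by_confidence(batch_logits, threshold=0.99)
print([(index, label) for index, label, _ in selected])
```

The second sample is dropped because its two class probabilities are nearly equal, which is the situation the threshold is meant to exclude.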

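This strategy has two weaknesses. First, the threshold depends heavily on manual experiments. Second, it overlooks a hidden danger, the "high-confidence trap": samples that the model considers highly confident are not necessarily reliable, so high-confidence but wrong samples end up in training. To select more reliable samples, the model builds on the high-confidence strategy and additionally uses the maximum mean discrepancy (MMD) [23] distance from transfer learning to assess the reliability of pseudo-labeled samples.

MMD was proposed by Gretton et al. to measure how well the distributions of two datasets match, and it is commonly used for the two-sample problem. Its value is the distance between the two distributions in a reproducing kernel Hilbert space (RKHS): the smaller the value, the closer and the more similar the two distributions. MMD is computed as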

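$$\mathrm{MMD}(TD,PD)=\left\|\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{i'=1}^{n}k\left(TD_i,TD_{i'}\right)-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k\left(TD_i,PD_j\right)+\frac{1}{m^{2}}\sum_{j=1}^{m}\sum_{j'=1}^{m}k\left(PD_j,PD_{j'}\right)\right\|_{\mathcal{H}}. \tag{11}$$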
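A small numerical sketch can make the three kernel sums in Eq. (11) concrete. The text does not name the kernel $k$, so a Gaussian (RBF) kernel with bandwidth parameter `gamma` is assumed here; the function names and the sample points are likewise made up for illustration. The function returns the kernel expression inside the norm, which is the usual biased estimate of the squared MMD.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def rbf_kernel(x: Point, y: Point, gamma: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2); the kernel choice is assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(TD: List[Point], PD: List[Point], gamma: float = 1.0) -> float:
    """Biased estimate of the squared MMD between samples TD and PD."""
    n, m = len(TD), len(PD)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    k_tt = sum(rbf_kernel(x, x2, gamma) for x in TD for x2 in TD) / (n * n)
    k_tp = sum(rbf_kernel(x, y, gamma) for x in TD for y in PD) / (n * m)
    k_pp = sum(rbf_kernel(y, y2, gamma) for y in PD for y2 in PD) / (m * m)
    return k_tt - 2.0 * k_tp + k_pp


# Made-up feature vectors: labeled data, a similar sample, and a shifted sample.
labeled = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similar = [[0.1, 0.0], [0.9, 0.1], [0.0, 1.1]]
shifted = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]

print(mmd_squared(labeled, similar) < mmd_squared(labeled, shifted))
```

A sample drawn close to the labeled data gives a much smaller value than a shifted one, which is the property the selection strategy relies on.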

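The pseudo-label selection strategy of this model is as follows. The confidence threshold confidence_coefficient is given an initial value and is increased in steps of size speed. For each threshold, the MMD distance between the pseudo-labeled data selected under that threshold (Pseudo_label_data) and the training data (labeled_data) is computed. The threshold with the smallest MMD distance becomes the final confidence threshold, which selects the final pseudo-labeled data (Pseudo_label_data); these samples receive pseudo-labels and are added to training. So that the model learns correct knowledge as far as possible and learns enough from the labeled data, a weighted loss function is used: the weight of the pseudo-labeled data is zero before epoch T_start, then increases gradually, and from epoch T_End onward stays at the constant weight.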

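$$\mathrm{weight}(t)=\begin{cases}0, & t<T_{start},\\[4pt] \dfrac{t-T_{start}}{T_{End}-T_{start}}\times \mathrm{weight}, & T_{start}\le t<T_{End},\\[4pt] \mathrm{weight}, & t\ge T_{End}.\end{cases} \tag{12}$$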

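The loss is the cross-entropy loss. The losses on the real training data (labeled_data) and on the pseudo-labeled data (Pseudo_label_data) are computed separately and then combined with the weight into the final loss Loss:

$$\mathrm{Loss}=\mathrm{loss}(\mathrm{labeled\_data})+\mathrm{weight}(t)\cdot\mathrm{loss}(\mathrm{Pseudo\_label\_data}). \tag{13}$$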
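The weighting scheme described above (zero before T_start, a linear ramp, then the constant weight from T_End onward) and the combined loss of Eq. (13) can be sketched as follows. The function names are illustrative; the values weight=0.84, T_start=2, and T_End=4 are the ones reported later in the experimental settings, and the loss values passed in are made up.

```python
def pseudo_weight(t: int, t_start: int, t_end: int, weight: float) -> float:
    """Weight of the pseudo-labeled loss at epoch t."""
    if t < t_start:
        return 0.0
    if t < t_end:
        # Linear ramp from 0 at t_start up to `weight` at t_end.
        return (t - t_start) / (t_end - t_start) * weight
    return weight


def total_loss(loss_labeled: float, loss_pseudo: float, t: int,
               t_start: int = 2, t_end: int = 4, weight: float = 0.84) -> float:
    """Eq. (13): labeled loss plus the weighted pseudo-labeled loss."""
    return loss_labeled + pseudo_weight(t, t_start, t_end, weight) * loss_pseudo


# Schedule over seven epochs with the reported hyperparameters.
schedule = [round(pseudo_weight(t, 2, 4, 0.84), 2) for t in range(7)]
print(schedule)

# Made-up loss values, only to show how the two terms combine at epoch 3.
print(total_loss(loss_labeled=1.0, loss_pseudo=2.0, t=3))
```

The ramp keeps early training driven by the labeled data, so that unreliable pseudo-labels from a poorly trained model contribute little to the gradient.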

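PDPTL: Fig. 1 shows the overall framework of PDPTL. In summary, the model has three steps:

1) train the base model on the labeled data to obtain a trained model;

2) use the trained model to predict the unlabeled data and obtain pseudo-labeled data;

3) mix the labeled data with the selected pseudo-labeled data, use the mixture in place of the labeled data to retrain the base model, and enter the next round.

Based on the above, Algorithm 1 gives the overall procedure of PDPTL.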

Algorithm 1

times: number of rounds in which pseudo_labels are updated
Base_Model: the base model
num_train_epochs: number of training epochs
eval_data: the unlabeled data
eval(): evaluation function that takes a model and unlabeled data and returns pseudo-labeled data
confidence_coefficient: initial threshold
Best_MMD: the smallest MMD distance
Best_confidence_coefficient: the best threshold
speed: step by which the threshold increases

for index <- 0 to times:  /* rounds of pseudo-label updating */
{
    init Base_Model
    for epoch <- 0 to num_train_epochs:
    {
        train Base_Model with train_data_with_label  /* train Base_Model on the training data */
    }
    data_with_pseudo_labels <- eval(Base_Model, eval_data)  /* pseudo-label the unlabeled data */
    init train_data_with_label  /* reset the training data, i.e. remove the pseudo-labeled data added in the previous round */
    Now_confidence_coefficient = confidence_coefficient
    while Now_confidence_coefficient <= 1:
    {
        for data_with_pseudo_label in data_with_pseudo_labels:  /* visit every pseudo-labeled sample */
        {
            if probability of data_with_pseudo_label larger than Now_confidence_coefficient:
                add data_with_pseudo_label to pseudo_data_with_label  /* keep the sample for the current threshold */
        }
        MMD = getMMD(train_data_with_label, pseudo_data_with_label)  /* MMD between the current pseudo-labeled set and the training set */
        if Now_confidence_coefficient == confidence_coefficient or MMD < Best_MMD:  /* first threshold, or the distance became smaller */
        {
            Best_MMD = MMD
            Best_confidence_coefficient = Now_confidence_coefficient  /* update the best threshold and the best pseudo-labeled set */
            best_pseudo_data_with_label = pseudo_data_with_label
        }
        init pseudo_data_with_label  /* reset, i.e. empty, the current pseudo-labeled set */
        Now_confidence_coefficient = Now_confidence_coefficient + speed  /* increase by speed */
    }
    add best_pseudo_data_with_label to train_data_with_label
}
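The inner loop of Algorithm 1, which sweeps the threshold and keeps the pseudo-labeled subset closest to the labeled data, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: features are plain lists of floats, the MMD uses a Gaussian kernel (the kernel is not specified in the text), thresholds that select no sample are skipped, and all data values are made up. Model training and prediction are outside the scope of the sketch.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Candidate = Tuple[Point, int, float]  # (feature vector, pseudo-label, confidence)


def rbf_kernel(x: Point, y: Point, gamma: float = 1.0) -> float:
    """Gaussian kernel; an assumed choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(TD: List[Point], PD: List[Point]) -> float:
    """Biased estimate of the squared MMD between two non-empty samples."""
    n, m = len(TD), len(PD)
    k_tt = sum(rbf_kernel(x, x2) for x in TD for x2 in TD) / (n * n)
    k_tp = sum(rbf_kernel(x, y) for x in TD for y in PD) / (n * m)
    k_pp = sum(rbf_kernel(y, y2) for y in PD for y2 in PD) / (m * m)
    return k_tt - 2.0 * k_tp + k_pp


def select_pseudo_labels(labeled: List[Point], candidates: List[Candidate],
                         start: float, speed: float) -> Tuple[Optional[float], List[Candidate]]:
    """Sweep thresholds from `start` in steps of `speed`; keep the subset with the smallest MMD."""
    best_mmd, best_threshold, best_subset = math.inf, None, []
    step = 0
    threshold = start
    while threshold <= 1.0:
        subset = [c for c in candidates if c[2] > threshold]
        if subset:  # an empty subset has no defined distance, so it is skipped
            distance = mmd_squared(labeled, [c[0] for c in subset])
            if distance < best_mmd:
                best_mmd, best_threshold, best_subset = distance, threshold, subset
        step += 1
        threshold = start + step * speed  # recomputed from start to limit float drift
    return best_threshold, best_subset


# Made-up data: two candidates near the labeled points, two far away.
labeled = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
candidates = [([0.1, 0.0], 1, 0.95), ([0.0, 0.2], 0, 0.97),
              ([3.0, 3.0], 1, 0.75), ([4.0, 3.5], 0, 0.65)]

threshold, chosen = select_pseudo_labels(labeled, candidates, start=0.5, speed=0.1)
print(threshold, len(chosen))
```

In this toy run the two distant candidates are excluded once the threshold passes their confidence, and the resulting subset has the smallest distance to the labeled data.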

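3 Experiments

This section describes the experimental settings and compares PDPTL with other classical algorithms on two public datasets.

3.1 Experimental Settings

Datasets: The model is evaluated on the SemEval 2017 shared task 7 dataset (SemEval 2017) [4] and the Pun of the Day dataset (PTD) [24]. The SemEval 2017 task 7 dataset consists of 4,030 pun examples, each classified as a semantic pun or a homophonic pun; Table 2 gives detailed statistics. The SemEval 2017 dataset contains punning and non-punning jokes, aphorisms, and short texts written by professional humorists or collected from the web. It is currently the largest public dataset used in this research area.

The PTD dataset contains 4,826 examples; Table 3 shows its statistics. It consists of pun jokes selected from pun websites and non-humorous texts drawn from AP News, The New York Times, Yahoo! Answers, and English proverbs. Although PTD was originally created for recognizing humorous text, its composition makes it suitable for pun detection, so the model is also evaluated on it.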

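Evaluation metrics: Precision (P), recall (R), and the F1 score are used to compare PDPTL with the base model and the other baselines. TP is the number of examples containing puns that the model classifies correctly, MP is the number of examples that the model judges to contain puns, and TR is the number of examples that actually contain puns.

$$P=\frac{TP}{MP}, \tag{14}$$

$$R=\frac{TP}{TR}, \tag{15}$$

$$F1=\frac{2RP}{R+P}. \tag{16}$$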
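Eqs. (14)-(16) translate directly into code. The sketch below counts TP, MP, and TR from gold and predicted labels; the function name and the label lists are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Compute P, R, and F1 for the positive class (1 = contains a pun)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly detected puns
    mp = sum(1 for p in y_pred if p == 1)                             # predicted to contain puns
    tr = sum(1 for t in y_true if t == 1)                             # actually contain puns
    precision = tp / mp if mp else 0.0
    recall = tp / tr if tr else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels for six texts.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
P, R, F1 = precision_recall_f1(y_true, y_pred)
print(P, R, F1)
```

Here TP = 3, MP = 4, and TR = 4, so all three metrics equal 0.75.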

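Baselines: On the SemEval 2017 dataset, PDPTL is compared with seven baselines: Duluth, CRF [24], Joint [24], JU_CSE_NLP [25], PunFields [26], Fermi [27], and CPR [10]. JU_CSE_NLP classifies puns with rules. PunFields uses a thesaurus to recognize puns. Fermi applies an RNN classifier under supervised learning. CPR is the PCPR model with the pronunciation features removed, so it uses semantic features only. On the PTD dataset, the model is compared with six baselines: HAE [28], MCL [29], PAL [30], HUR [31], WECA [3], and CPR. HAE [28] applies a random forest with Word2Vec-based and human-centric features. MCL [29] uses word representations with multiple stylistic features. PAL [30] uses a CNN to learn essential features automatically. HUR [31] builds on an existing CNN model by adjusting the filter sizes and adding a highway layer.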

Implementation details: The hyperparameters are weight=0.84, T_start=2, T_End=4, times=5, num_train_epochs=7, confidence_coefficient=0.9997, and speed=0.0001; on the PTD dataset, times=3 and num_train_epochs=5. The software environment is pytorch-pretrained-bert==0.6.1, seqeval==0.0.5, torch==1.0.1.post2, tqdm==4.31.1, and nltk==3.4.5. The GPU is a Tesla V100-SXM2, and the experiments run on the Google Colab platform.

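3.2 Experimental Results

Table 4 compares PDPTL with other classical models on detecting semantic and homophonic puns in the SemEval dataset. On SemEval 2017, PDPTL performs best against the three baseline models. For semantic puns, it improves on the best baseline by 5.01%, 2.93%, and 4.04% in precision (P), recall (R), and F1, respectively. For homophonic puns, the corresponding gains over the best baseline are 9.12%, 3.77%, and 6.55%.

Table 5 compares model performance on the PTD dataset, where PDPTL improves on the best baseline by 12.01%, 5.54%, and 8.98% in precision (P), recall (R), and F1, respectively.

Figs. 2 and 3 compare PDPTL with the base model on the two datasets. On the SemEval dataset, for semantic puns, PDPTL improves on the base model by 1.51%, 0.69%, and 1.10% in precision (P), recall (R), and F1, respectively; for homophonic puns, the gains are 0.87%, 1.73%, and 1.30%. On the PTD dataset, the gains over the base model are 0.37%, 0.65%, and 0.52%. Notably, PCPR and CPR give almost the same results on PTD; CPR is PCPR without the pronunciation vectors, relying only on the contextual semantic vectors generated by BERT and the attention mechanism. The improvement of PDPTL on PTD is clearly smaller than on SemEval 2017. The PTD dataset has twice as many samples as a single SemEval subset, so this result is consistent with the hypothesis that pseudo-labels help most when labeled samples are scarce.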

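4 Conclusion

To address the small size of existing pun datasets, this work uses pseudo-labels to assist model training. Considering the difference in feature distribution between pseudo-labeled data and real data, it combines transfer learning with confidence and proposes a new pun recognition model. Pun detection experiments on the SemEval 2017 shared task 7 and Pun of the Day datasets show that PDPTL brings the feature distributions of pseudo-labeled and truly labeled data closer together and that its prediction performance exceeds that of existing mainstream pun recognition methods.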

References

[1] Xu L H, Lin H F, Qi R H, et al. Homophonic advertisement generation based on features fusion[J]. Journal of Chinese Information Processing, 2018, 32(10): 109-117. (in Chinese)

[2] Redfern W D. Guano of the mind: puns in advertising[J]. Language & Communication, 1982, 2(3): 269-276.

[3] Diao Y F, Lin H F, Wu D, et al. WECA: a WordNet-encoded collocation-attention network for homographic pun recognition[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 2507-2516.

[4] Miller T, Hempelmann C F, Gurevych I. SemEval-2017 task 7: detection and interpretation of English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 58-68.

[5] Pedersen T. Puns upon a midnight dreary, lexical semantics for the weak and weary[C]//Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada: Association for Computational Linguistics, 2017: 416-420.

[6] Ranjan Pal A, Saha D. Word sense disambiguation: a survey[J]. International Journal of Control Theory and Computer Modeling, 2015, 5(3): 1-16.

[7] Dieke O, Kilian E. Global vs. local context for interpreting and locating homographic English puns with sense embeddings[C]//Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada: Association for Computational Linguistics, 2017: 444-448.

[8] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1310.4546.pdf.

[9] Pennington J, Socher R, Manning C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. Stroudsburg, PA, USA: Association for Computational Linguistics, 2014: 1532-1543.

[10] Zhou Y C, et al. The boating store had its best sail ever: pronunciation-attentive contextualized pun recognition[EB/OL]. [2021-06-10]. https://arxiv.org/pdf/2004.14457.pdf.

[11] Xiu Y L, et al. Using supervised and unsupervised methods to detect and locate English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada: Association for Computational Linguistics, 2017: 453-456.

[12] Samuel D, Aniruddha G, Hanyang C, et al. Detection and interpretation of English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada: Association for Computational Linguistics, 2017: 103-108.

[13] Lee D. Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks[C]//Workshop on Challenges in Representation Learning, Atlanta, Georgia: International Conference on Machine Learning, 2013.

[14] Xie Q Z, Luong M T, Hovy E, et al. Self-training with noisy student improves ImageNet classification[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 13-19, 2020. Seattle, WA, USA: IEEE, 2020: 10684-10695.

[15] Zhuang F Z, Qi Z Y, Duan K Y, et al. A comprehensive survey on transfer learning[J]. Proceedings of the IEEE, 2021, 109(1): 43-76.

[16] Pan S J, Yang Q. A survey on transfer learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.

[17] Huang J Y, Smola A J, Gretton A, et al. Correcting sample selection bias by unlabeled data[M]//Advances in Neural Information Processing Systems 19. US: MIT Press, 2007: 601-608.

[18] Sugiyama M, Suzuki T, Nakajima S, et al. Direct importance estimation for covariate shift adaptation[J]. Annals of the Institute of Statistical Mathematics, 2008, 60(4): 699-746.

[19] Day O, Khoshgoftaar T M. A survey on heterogeneous transfer learning[J]. Journal of Big Data, 2017, 4(1): 1-42.

[20] Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1810.04805.pdf.

[21] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1409.0473.pdf.

[22] Ashish V, Noam S, Niki P, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017: 5998-6008.

[23] Gretton A, Borgwardt K, Rasch M, et al. A kernel two-sample test[J]. Journal of Machine Learning Research, 2012(13): 723-773.

[24] Yanyan Z, Wei L. Joint detection and location of English puns[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minnesota: Association for Computational Linguistics, 2019: 2117-2123.

[25] Aniket P, Dipankar D. Employing rules to detect and interpret English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, Canada: Association for Computational Linguistics, 2017: 432-435.

[26] Mikhalkova E, Karyakin Y. PunFields at SemEval-2017 task 7: employing Roget's thesaurus in automatic pun recognition and interpretation[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017.

[27] Indurthi V, Oota S R. Fermi at SemEval-2017 task 7: detection and interpretation of homographic puns in English language[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017: 457-460.

[28] Yang D Y, Lavie A, Dyer C, et al. Humor recognition and humor anchor extraction[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal. Stroudsburg, PA, USA: Association for Computational Linguistics, 2015: 2367-2376.

[29] Mihalcea R, Strapparava C. Making computers laugh: investigations in automatic humor recognition[C]//Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing - HLT '05. October 6-8, 2005. Vancouver, British Columbia, Canada. Morristown, NJ, USA: Association for Computational Linguistics, 2005.

[30] Chen L, Lee C M. Predicting audience's laughter using convolutional neural network[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1702.02584.pdf.

[31] Chen P Y, Soo V W. Humor recognition using deep learning[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 113-117.

(Edited by HOU Xiang)
