By Laura Hanu et al. | Translated by Guo Xiaoyang
Machine-learning systems could help flag hateful, threatening or offensive language.
Social platforms large and small are struggling to keep their communities safe from hate speech, extremist content, harassment and misinformation. One solution might be AI: developing algorithms to detect and alert us to toxic and inflammatory comments and flag them for removal. But such systems face big challenges.
The prevalence of hateful or offensive language online has been growing rapidly in recent years. Social media platforms, relying on thousands of human reviewers, are struggling to moderate the ever-increasing volume of harmful content. In 2019 it was reported that Facebook moderators were at risk of developing PTSD as a result of repeated exposure to such distressing content. Outsourcing some of this work to machine learning can help keep that volume manageable; indeed, many tech giants have been incorporating algorithms into their content moderation1 for years.
One such example is Google's Jigsaw2, a company focused on making the internet safer. In 2017, it helped create Conversation AI, a collaborative research project aiming to detect toxic comments online. However, a tool produced by that project, called Perspective, faced substantial criticism. One common complaint was that it created a general “toxicity score” that wasn't flexible enough to serve the varying needs of different platforms. Some websites, for instance, might require detection of threats but not profanity, while others might have the opposite requirements.
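To make the flexibility complaint concrete: a platform generally wants separate scores per attribute that it can threshold according to its own policy, rather than one aggregate number. The following minimal sketch illustrates the idea; the attribute names, thresholds and example scores are hypothetical and are not taken from the Perspective API.

```python
# A minimal sketch of per-platform moderation rules applied to per-attribute
# scores. Attribute names, thresholds and the example scores are hypothetical;
# any model that outputs one probability per attribute could supply them.

PLATFORM_POLICIES = {
    # A forum that cares about threats but tolerates profanity.
    "gaming_forum": {"threat": 0.5, "identity_hate": 0.5},
    # A children's site that also filters profanity and insults.
    "kids_site": {"threat": 0.3, "identity_hate": 0.3, "obscene": 0.3, "insult": 0.5},
}

def flag_comment(scores: dict[str, float], platform: str) -> bool:
    """Return True if any attribute score exceeds the platform's threshold."""
    policy = PLATFORM_POLICIES[platform]
    return any(scores.get(attr, 0.0) >= cutoff for attr, cutoff in policy.items())

example_scores = {"toxic": 0.62, "obscene": 0.81, "threat": 0.02, "insult": 0.40}
print(flag_comment(example_scores, "gaming_forum"))  # False: no threat detected
print(flag_comment(example_scores, "kids_site"))     # True: profanity over the cutoff
```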
Another issue was that the algorithm learned to conflate toxic comments with nontoxic comments that contained words related to gender, sexual orientation, religion or disability. For example, one user reported that simple neutral sentences such as “I am a gay black woman” or “I am a woman who is deaf” resulted in high toxicity scores, while “I am a man” resulted in a low score.
Following these concerns, the Conversation AI team invited developers to train their own toxicity-detection algorithms and enter them into three competitions (one per year) hosted on Kaggle, a Google subsidiary known for its community of machine learning practitioners, public data sets and challenges. To help train the AI models, Conversation AI released two public data sets containing over one million toxic and non-toxic comments from Wikipedia and a service called Civil Comments. Some comments were seen by many more than 10 annotators (up to thousands), due to sampling and strategies used to enforce rater accuracy.
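As a concrete starting point, the sketch below inspects a training file in the layout used for the first Kaggle challenge, assuming a comment_text column plus one binary column per toxicity label; the file name and column names are assumptions to verify against the actual download.

```python
# A short sketch of inspecting one of the released data sets, assuming the
# CSV layout of the first Kaggle challenge: a comment_text column plus one
# binary column per toxicity label. Path and columns should be checked
# against the actual download.
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")                # Wikipedia comments from the challenge
print(len(df), "comments")
print(df[LABELS].mean().round(4))            # fraction of comments carrying each label
print((df[LABELS].sum(axis=1) == 0).mean())  # share of fully non-toxic comments
```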
The goal of the first Jigsaw challenge was to build a multilabel toxic comment classification model with labels such as “toxic”, “severe toxic”, “threat”, “insult”, “obscene” and “identity hate”. The second and third challenges focused on more specific limitations of the Perspective API: minimizing unintended bias towards pre-defined identity groups and training multilingual models on English-only data.
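Because a single comment can carry several of these labels at once, the usual modelling pattern is one sigmoid output per label trained with binary cross-entropy, rather than a softmax over mutually exclusive classes. Below is a minimal PyTorch sketch of such a multilabel head; the 768-dimensional encoder output and the toy batch are stand-ins rather than details of any competition solution.

```python
# A minimal multilabel classification head in PyTorch: one logit per label,
# trained with binary cross-entropy so several labels can be active at once.
# The 768-dimensional "encoded_text" stands in for any sentence encoder's output.
import torch
import torch.nn as nn

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

class ToxicityHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_labels: int = len(LABELS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded_text: torch.Tensor) -> torch.Tensor:
        return self.classifier(encoded_text)        # raw logits, one per label

head = ToxicityHead()
loss_fn = nn.BCEWithLogitsLoss()                    # independent per-label losses

encoded = torch.randn(4, 768)                       # a fake batch of 4 encoded comments
targets = torch.tensor([[1, 0, 1, 0, 1, 0],         # a comment can have several labels
                        [0, 0, 0, 0, 0, 0],
                        [1, 1, 0, 1, 0, 1],
                        [0, 0, 1, 0, 0, 0]], dtype=torch.float)

logits = head(encoded)
loss = loss_fn(logits, targets)
probs = torch.sigmoid(logits)                       # per-label probabilities at inference
```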
Our team at Unitary, a content-moderation AI company, took inspiration from the best Kaggle solutions and released three different models corresponding to each of the three Jigsaw challenges. While the top Kaggle solutions for each challenge use model ensembles, which average the scores of multiple trained models, we obtained a similar performance with only one model per challenge.
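For readers unfamiliar with ensembling: averaging the scores simply means running several independently trained models on the same comment and taking the mean of their per-label probabilities. The sketch below shows the mechanics with stand-in models and made-up numbers; it is illustrative only and does not reproduce any of the actual Kaggle solutions or our released models.

```python
# Ensembling in sketch form: run several trained models on the same comment
# and average their per-label probabilities. The "models" here are stand-in
# callables returning made-up scores for the six Jigsaw labels.
import numpy as np

def model_a(text):  # placeholder for an independently trained toxicity model
    return np.array([0.92, 0.10, 0.70, 0.02, 0.55, 0.04])

def model_b(text):
    return np.array([0.88, 0.05, 0.64, 0.01, 0.61, 0.06])

def model_c(text):
    return np.array([0.95, 0.12, 0.75, 0.03, 0.50, 0.02])

def ensemble_score(text, models):
    """Mean of per-label probabilities across an ensemble of models."""
    return np.mean([m(text) for m in models], axis=0)

print(ensemble_score("example comment", [model_a, model_b, model_c]))
```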
While these models perform well in many cases, it is important to note their limitations. First, the models will work well on examples similar to the data they were trained on, but they are likely to fail when faced with unfamiliar examples of toxic language.
Furthermore, we noticed that the inclusion of insults or profanity in a text comment will almost always result in a high toxicity score, regardless of the intent or tone of the author. As an example, the sentence “I am tired of writing this stupid essay” will give a toxicity score of 99.7 percent, while removing the word “stupid” will change the score to 0.05 percent.
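This sensitivity is easy to probe. The sketch below assumes the unitary/toxic-bert checkpoint published by Unitary on the Hugging Face Hub (the article does not name the released models, so the choice of checkpoint is an assumption) together with the transformers text-classification pipeline; exact scores depend on the model and library versions, so the figures quoted above may not be reproduced exactly.

```python
# A sketch of probing lexical sensitivity with a published toxicity model.
# The model choice (unitary/toxic-bert) is an assumption; scores vary by version.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,                   # return a score for every label, not just the top one
    function_to_apply="sigmoid",  # multilabel: independent per-label probabilities
)

sentences = [
    "I am tired of writing this stupid essay",
    "I am tired of writing this essay",
]

for sentence, result in zip(sentences, classifier(sentences)):
    scores = {item["label"]: round(item["score"], 4) for item in result}
    print(sentence, "->", scores)
```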
Lastly, all three models are still likely to exhibit some bias, which can pose ethical concerns when used off-the-shelf3 to moderate content.
Although there has been considerable progress on automatic detection of toxic speech, we still have a long way to go until models can capture the actual, nuanced, meaning behind our language—beyond the simple memorization of particular words or phrases. Of course, investing in better and more representative datasets would yield incremental improvements, but we must go a step further and begin to interpret data in context, a crucial part of understanding online behavior. A seemingly benign text post on social media accompanied by racist symbolism in an image or video would be easily missed if we only looked at the text. We know that lack of context can often be the cause of our own human misjudgments. If AI is to stand a chance of replacing manual effort on a large scale, it is imperative that we give our models the full picture.
(The translator is a prizewinner of the “《英語世界》 Cup” Translation Contest.)
1 content moderation: detection technology for images, text and video that automatically screens user-uploaded material for pornographic, advertising, politically sensitive, violent or other sensitive content (including content involving sensitive figures), helping clients reduce the risk of violations.
2 Jigsaw: a technology incubator established by Google (formerly Google's think-tank unit, Google Ideas), chiefly responsible for building technical tools to reduce and curb online disinformation, harassment and other problems.
3 off the shelf: (of a product) ready-made, not custom-built; used adverbially in the text.