Contextual-Aware and Expert Data Resources for Brazilian Portuguese Hate Speech Detection

Research Square (2023)

Abstract: This paper provides data resources for hate speech detection in low-resource languages. Specifically, we introduce a large-scale, expert-annotated corpus of Brazilian Instagram comments and a context-aware offensive lexicon, which a linguist manually extracted from the proposed corpus and annotated with contextual information. We further provide native-speaker translations and adaptations of the specialized lexicon for other low-resource languages. The corpus comprises 7,000 document-level annotations across three layers: (i) a binary offensive class, (ii) offensiveness-level classes, and (iii) nine hate speech targets. The context-aware offensive lexicon holds 1,000 explicit and implicit terms and expressions with pejorative connotations, each annotated with a context-dependent or context-independent offensiveness label. Both the corpus and the lexicon were annotated by three different experts and achieved high inter-annotator agreement. Finally, we ran baseline experiments on both data resources. The results show the reliability of the proposed data, outperforming baseline dataset results for Portuguese and showing promising results for hate speech detection in other languages.
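The abstract reports high inter-annotator agreement among three experts but does not say which statistic was used. As an illustration only, the sketch below computes Fleiss' kappa, a common choice when a fixed number of annotators label every item. The metric choice, the function name `fleiss_kappa`, and the toy labels are assumptions made for this example; none of them come from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that are each rated by the same number of annotators.

    ratings: list of lists; ratings[i] holds the labels assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing annotator pairs, averaged over items.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        return 1.0  # every rating falls in one category, so agreement is trivial
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels (1 = offensive, 0 = non-offensive), three annotators per comment.
toy_labels = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
kappa = fleiss_kappa(toy_labels)
print(f"Fleiss' kappa on the toy labels: {kappa:.3f}")
```

The same function would apply to each annotation layer separately (binary class, offensiveness level, target), since Fleiss' kappa treats labels as unordered categories; an ordinal layer such as offensiveness level might instead call for a weighted statistic.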
Keywords
expert data resources, brazilian, speech, contextual-aware