Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches

  • Klaifer Garcia
  • Pedro Shiguihara
  • Lilian Berton
  • Universidade Federal de São Paulo
  • Earned a doctorate at Maryland and completed a postdoctoral fellowship at the University of Toronto. Lecturer and researcher at Universidad San Ignacio de Loyola

Research output: Contribution to journal, Article, peer-review

5 Scopus citations

Abstract

Every day, thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is widely studied in the literature; however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source media outlet that provides news from different sources. Since datasets for Portuguese are scarce, and an existing one comes from a single news channel, we aim to introduce a dataset drawn from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT), covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.

Original language: English
Article number: e0296929
Journal: PLoS ONE
Volume: 19
Issue number: 1 January
State: Published - Jan 2024
Externally published: Yes