publications | Pedro Balage

2018

SUMMA at TAC Knowledge Base Population Task 2017

Afonso Mendes, David Nogueira, Samuel Broscheit, and 5 more authors

In Proceedings of the Text Analysis Conference (TAC) KBP 2017 2018

2017

SUMMA at TAC Knowledge Base Population Task 2016

Peteris Paikens, Guntis Barzdins, Afonso Mendes, and 7 more authors

In Proceedings of the Text Analysis Conference (TAC) KBP 2016 2017

The SUMMA Platform Prototype

Renars Liepins, Ulrich Germann, Guntis Barzdins, and 43 more authors

In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics Apr 2017

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

Aspect Extraction in Sentiment Analysis for Portuguese Language

Pedro Paulo Balage Filho

Apr 2017

Aspect-based sentiment analysis is the field of study that extracts and interprets the sentiment, usually classified as positive or negative, towards some target or aspect in an opinionated text. This doctoral dissertation details an empirical study of techniques and methods for aspect extraction in aspect-based sentiment analysis, with a focus on Portuguese. Three different approaches were explored: frequency-based, relation-based and machine learning. For each one, this work presents a comparative study between a Portuguese and an English corpus and the differences found in applying the approaches. In addition, richer linguistic knowledge is explored through syntactic dependencies and semantic roles, leading to better results. This work led to the establishment of new benchmarks for aspect extraction in Portuguese.

2015

A Qualitative Analysis of a Corpus of Opinion Summaries based on Aspects

Roque Lopez, Thiago Pardo, Lucas Avanço, and 10 more authors

In Proceedings of The 9th Linguistic Annotation Workshop Jun 2015

2014

NILC_USP: An Improved Hybrid System for Sentiment Analysis in Twitter Messages

Pedro Paulo Balage Filho, Lucas Vinicius Avanço, Maria das Graças Volpe Nunes, and 1 more author

In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) 23–24 Aug 2014

This paper describes the NILC_USP system that participated in SemEval-2014 Task 9: Sentiment Analysis in Twitter, a re-run of the SemEval 2013 task under the same name. Our system is an improved version of the system that participated in the 2013 task. It adopts a hybrid classification process that uses three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. In this work, we want to verify how this hybrid approach improves with better classifiers. The improved system achieved an F-score of 65.39% in the Twitter message-level subtask on the 2013 dataset (an improvement of 9.08 percentage points) and 63.94% on the 2014 dataset.

NILC_USP: Aspect Extraction using Semantic Labels

Pedro Paulo Balage Filho, and Thiago Alexandre Salgueiro Pardo

In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) 23–24 Aug 2014

This paper details the NILC_USP system that participated in the SemEval 2014 Aspect Based Sentiment Analysis task. The system uses a Conditional Random Field (CRF) algorithm to extract the aspects mentioned in the text. Our work added semantic labels to a basic feature set in order to measure how effective they are for aspect extraction. We used the semantic roles and the highest verb frame as features for the machine learning model. Overall, our results showed that the system did not improve with the use of this semantic information, although its precision increased.

Enriquecendo o Córpus CSTNews - a Criação de Novos Sumários Multidocumento

Márcio Souza Dias, Alessandro Yovan Bokan Garay, Carla Chuman, and 14 more authors

In Proceedings of the I Workshop on Tools and Resources for Automatically Processing Portuguese and Spanish - ToRPorEsp 9 Oct 2014

This paper reports the process of creating new multi-document summaries, both extractive and abstractive, for the CSTNews corpus, a corpus aimed at multi-document processing and, in particular, automatic summarization for Portuguese. This provides more data to support new research in the area, both in the development and in the evaluation of summarization methods and systems.

Corpus Annotation of Textual Aspects in Multi-Document Summaries

Ariani Di Felippo, Lucia Helena Machado Rino, Thiago Alexandre Salgueiro Pardo, and 11 more authors

9 Oct 2014

A Large Opinion Corpus in Portuguese: Tackling Out-Of-Vocabulary Words

Nathan Siegle Hartmann, Lucas Vinicius Avanço, Pedro Paulo Balage Filho, and 4 more authors

In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC) 26–31 May 2014

Web 2.0 has allowed a never imagined communication boom. With the widespread use of computational and mobile devices, anyone, in practically any language, may post comments on the web. As such, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and the presence of internet slang, with missing diacritics, repetitions of vowels, and the use of chat-speak style abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which, so far, have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena. This analysis supports the development of a lexical normalization tool that, in future work, should enable the use of standard NLP products for web opinion mining and summarization.

2013

NILC_USP: A Hybrid System for Sentiment Analysis in Twitter Messages

Pedro Paulo Balage Filho, and Thiago Alexandre Salgueiro Pardo

In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) 14–15 Jun 2013

This paper describes the NILC_USP system that participated in SemEval-2013 Task 2: Sentiment Analysis in Twitter. Our system adopts a hybrid classification process that uses three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. Our system achieved an F-score of 56.31% in the Twitter message-level subtask.

2012

Use of Discourse Knowledge to Improve Lexicon-based Sentiment Analysis

Pedro Paulo Balage Filho

BULAG Natural Language Processing and Human Language Technology 2012 14–15 Jun 2012

Sentiment Analysis deals with the computational treatment of sentiment in texts. Discourse is a linguistic level of analysis where the author represents ideas and links concepts in a rational chain of thoughts. One important representation of discourse is the Rhetorical Structure Theory (RST). The objective of this work is to use discourse knowledge to improve a lexicon-based sentiment classifier. To achieve this goal, it presents a lexicon-based algorithm adapted to weight portions of text under particular RST relations distinctly. Two experiments are reported. The first experiment verifies whether RST improves sentiment classification. It also shows which discourse relations are most important in the process. The second experiment incorporates discourse markers into the algorithm in order to eliminate the need for an RST-annotated corpus. It uses the weights learned in the first experiment to perform the sentiment classification.

Use of Discourse Knowledge to Improve Lexicon-based Sentiment Analysis

Pedro Paulo Balage Filho

14–15 Jun 2012

Sentiment Analysis deals with the computational treatment of sentiment in texts. Interest in sentiment analysis has grown recently due to the popularity of the internet and the increase in user-generated content, such as blogs, social networks and review websites. This work treats sentiment analysis as a classification problem in which a text can be classified as positive or negative. Sentiment classifiers can be distinguished by two main approaches: machine learning and lexicon-based. The machine learning approach uses a corpus to automatically learn the best classification features. The lexicon-based approach uses a previously computed dictionary containing the sentiment lexicon. Discourse is a linguistic level of analysis where the author represents ideas and links concepts in a rational chain of thoughts. One important representation of discourse is the Rhetorical Structure Theory (RST). This theory organizes the discourse in 26 relations that hierarchically represent the text discourse. The objective of this work is to use discourse knowledge to improve a lexicon-based sentiment classifier. To achieve this goal, it proposes SO-RST, a lexicon-based algorithm that weights portions of text under particular RST relations distinctly. Two experiments are reported. The first experiment verifies whether RST improves sentiment classification. It also shows which discourse relations are most important in the process. The second experiment incorporates discourse markers into the algorithm in order to eliminate the need for an RST-annotated corpus. It uses the weights learned in the first experiment to perform the sentiment classification. The results showed which RST relations most help the lexicon-based classifier to achieve better accuracy. The discourse markers introduced in the algorithm pointed to some directions to follow and the steps necessary to study this technique further.

A Graphical User Interface for Feature-Based Opinion Mining

Pedro Paulo Balage Filho, Caroline Brun, and Gilbert Rondeau

In Proceedings of the NAACL-HLT 2012: Demonstration Session 3–8 Jun 2012

In this paper, we present XOpin, a graphical user interface that has been developed to provide smart access to the results of a feature-based opinion detection system built on top of a parser.

A Corpus Based Method for Product Feature Ranking for Interactive Question Answering Systems

Natalia Konstantinova, Constantin Orasan, and Pedro Paulo Balage

International Journal of Computational Linguistics and Applications Mar 2012

At times choosing a product can be a difficult task due to the fact that customers need to consider many features before they can reach a decision. Interactive question answering (IQA) systems can help customers in this process, by answering questions about products and initiating a dialogue with the customer when their needs are not clearly defined. For this purpose we propose a corpus-based method for weighting the importance of product features depending on how likely they are to be of interest for a user. By using this method, we hope that users can select the desired product in an optimal way. For the experiments a corpus of user reviews is used, the assumption being that the features mentioned in a review are probably more important for a person who is likely to purchase a product. To improve the method, a sentiment classification system is also employed to distinguish between features mentioned in positive and negative contexts. Evaluation shows that the ranking method that incorporates this information is one of the best performing ones.