SUMMA at TAC Knowledge Base Population Task 2017 Mendes, Afonso, Nogueira, David, Broscheit, Samuel, Aleixo, Filipe, Balage, Pedro, Martins, Rui, Miranda, Sebastião, and Almeida, Mariana S. C. In Proceedings of the Text Analysis Conference (TAC) KBP 2017 2018
SUMMA at TAC Knowledge Base Population Task 2016 Paikens, Peteris, Barzdins, Guntis, Mendes, Afonso, Ferreira, Daniel, Broscheit, Samuel, Almeida, Mariana S. C., Miranda, Sebastião, Nogueira, David, Balage, Pedro, and Martins, André F. T. In Proceedings of the Text Analysis Conference (TAC) KBP 2016 2017
The SUMMA Platform Prototype Liepins, Renars, Germann, Ulrich, Barzdins, Guntis, Birch, Alexandra, Renals, Steve, Weber, Susanne, Kreeft, Peggy, Bourlard, Herve, Prieto, João, Klejch, Ondrej, Bell, Peter, Lazaridis, Alexandros, Mendes, Alfonso, Riedel, Sebastian, Almeida, Mariana S. C., Balage, Pedro, Cohen, Shay B., Dwojak, Tomasz, Garner, Philip N., Giefer, Andreas, Junczys-Dowmunt, Marcin, Imran, Hina, Nogueira, David, Ali, Ahmed, Miranda, Sebastião, Popescu-Belis, Andrei, Miculicich Werlen, Lesly, Papasarantopoulos, Nikos, Obamuyide, Abiola, Jones, Clive, Dalvi, Fahim, Vlachos, Andreas, Wang, Yang, Tong, Sibo, Sennrich, Rico, Pappas, Nikolaos, Narayan, Shashi, Damonte, Marco, Durrani, Nadir, Khurana, Sameer, Abdelali, Ahmed, Sajjad, Hassan, Vogel, Stephan, Sheppey, David, Hernon, Chris, and Mitchell, Jeff In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics 2017
We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.
Aspect Extraction in Sentiment Analysis for Portuguese Language Balage Filho, Pedro Paulo 2017
Aspect-based sentiment analysis is the field of study that extracts and interprets the sentiment, usually classified as positive or negative, towards some target or aspect in an opinionated text. This doctoral dissertation details an empirical study of techniques and methods for aspect extraction in aspect-based sentiment analysis, with a focus on Portuguese. Three different approaches were explored: frequency-based, relation-based and machine learning. For each one, this work presents a comparative study between a Portuguese and an English corpus and the differences found in applying the approaches. In addition, richer linguistic knowledge is explored by using syntactic dependencies and semantic roles, leading to better results. This work led to the establishment of new benchmarks for aspect extraction in Portuguese.
A Qualitative Analysis of a Corpus of Opinion Summaries based on Aspects Lopez, Roque, Pardo, Thiago, Avanço, Lucas, Balage Filho, Pedro Paulo, Bokan, Alessandro, Cardoso, Paula, Dias, Márcio, Nóbrega, Fernando, Cabezudo, Marco, Souza, Jackson, Zacarias, Andressa, Seno, Eloize, and Di Felippo, Ariani In Proceedings of The 9th Linguistic Annotation Workshop 2015
NILC_USP: An Improved Hybrid System for Sentiment Analysis in Twitter Messages Balage Filho, Pedro Paulo, Avanço, Lucas Vinicius, Nunes, Maria das Graças Volpe, and Pardo, Thiago Alexandre Salgueiro In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) 2014
This paper describes the NILC_USP system that participated in SemEval-2014 Task 9: Sentiment Analysis in Twitter, a re-run of the SemEval 2013 task of the same name. Our system is an improved version of the system that participated in the 2013 task. It adopts a hybrid classification process that uses three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. In this work, we want to verify how this hybrid approach would improve with better classifiers. The improved system achieved an F-score of 65.39% in the Twitter message-level subtask on the 2013 dataset (an improvement of 9.08 percentage points) and 63.94% on the 2014 dataset.
NILC_USP: Aspect Extraction using Semantic Labels Balage Filho, Pedro Paulo, and Pardo, Thiago Alexandre Salgueiro In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) 2014
This paper details the NILC_USP system that participated in the SemEval 2014 Aspect Based Sentiment Analysis task. This system uses a Conditional Random Field (CRF) algorithm to extract the aspects mentioned in the text. Our work added semantic labels to a basic feature set in order to measure their usefulness for aspect extraction. We used the semantic roles and the highest verb frame as features for the machine learning. Overall, our results showed that this semantic information did not improve the system as a whole, although it did increase its precision.
Enriquecendo o Córpus CSTNews - a Criação de Novos Sumários Multidocumento Dias, Márcio Souza, Garay, Alessandro Yovan Bokan, Chuman, Carla, Barros, Cláudia Dias, Maziero, Erick Galani, Nobrega, Fernando Antônio Asevedo, Souza, Jackson Wilke da Cruz, Cabezudo, Marco Antonio Sobrevilla, Delege, Marina, Jorge, Maria Lucía Del Rosario Castro, Silva, Naira Licia, Cardoso, Paula Christina Figueira, Balage Filho, Pedro Paulo, Condori, Roque Enrique López, Felippo, Ariani Di, Nunes, Maria das Graças Volpe, and Pardo, Thiago Alexandre Salgueiro In Proceedings of the I Workshop on Tools and Resources for Automatically Processing Portuguese and Spanish - ToRPorEsp 2014
This paper reports the process of creating new multi-document summaries, both extractive and abstractive, for the CSTNews corpus, a corpus aimed at multi-document processing and, in particular, at automatic summarization for Portuguese. This provides more data to support new research in the area, in both the development and the evaluation of summarization methods and systems.
Corpus Annotation of Textual Aspects in Multi-Document Summaries Felippo, Ariani Di, Rino, Lucia Helena Machado, Pardo, Thiago Alexandre Salgueiro, Cardoso, Paula Christina Figueira, Seno, Eloize Rossi Marques, Balage Filho, Pedro Paulo, Rassi, Amanda Pontes, Dias, Márcio Souza, Jorge, Maria Lucía Del Rosario Castro, Maziero, Erick Galani, Zacarias, Andressa Caroline Inácio, Souza, Jackson Wilke da Cruz, Camargo, Renata Tironi, and Agostini, Verônica 2014
A Large Opinion Corpus in Portuguese: Tackling Out-Of-Vocabulary Words Hartmann, Nathan Siegle, Avanço, Lucas Vinicius, Balage Filho, Pedro Paulo, Duran, Magali, Nunes, Maria das Graças Volpe, Pardo, Thiago Alexandre Salgueiro, and Aluísio, Sandra Maria In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC) 2014
Web 2.0 has enabled a previously unimagined communication boom. With the widespread use of computational and mobile devices, anyone, in practically any language, may post comments on the web. As such, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and the presence of internet slang, with missing diacritics, repetitions of vowels, and the use of chat-speak style abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which, so far, have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena. This analysis supports the development of a lexical normalization tool that, in future work, should enable the use of standard NLP products for web opinion mining and summarization.
NILC_USP: A Hybrid System for Sentiment Analysis in Twitter Messages Balage Filho, Pedro Paulo, and Pardo, Thiago Alexandre Salgueiro In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) 2013
This paper describes the NILC_USP system that participated in SemEval-2013 Task 2: Sentiment Analysis in Twitter. Our system adopts a hybrid classification process that uses three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. Our system achieved an F-score of 56.31% in the Twitter message-level subtask.
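A hybrid pipeline of this kind can be sketched as follows. This is only an illustration, not the system described in the paper: it assumes the stages run in the order rule-based, lexicon-based, machine learning, and that each early stage abstains when it is not confident. The emoticon sets, the toy lexicon, the threshold and the `fallback` callable (standing in for a trained classifier) are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical resources, for illustration only.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}


def rule_based(text: str) -> Optional[str]:
    """Return a label when emoticons give a clear signal, otherwise abstain."""
    tokens = text.split()
    pos = sum(t in POSITIVE_EMOTICONS for t in tokens)
    neg = sum(t in NEGATIVE_EMOTICONS for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None


def lexicon_based(text: str, threshold: int = 2) -> Optional[str]:
    """Sum word polarities; abstain when the total is below the threshold."""
    score = sum(LEXICON.get(t.strip(".,!?").lower(), 0) for t in text.split())
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return None


def hybrid_classify(text: str, fallback: Callable[[str], str]) -> str:
    """Try each stage in turn; the fallback classifier decides the rest."""
    for stage in (rule_based, lexicon_based):
        label = stage(text)
        if label is not None:
            return label
    return fallback(text)
```

In this sketch, `hybrid_classify(text, model_predict)` would call a trained model (here the hypothetical `model_predict`) only for messages that neither the emoticon rules nor the lexicon can decide.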
Use of Discourse Knowledge to Improve Lexicon-based Sentiment Analysis Balage Filho, Pedro Paulo BULAG Natural Language Processing and Human Language Technology 2012 2012
Sentiment Analysis deals with the computational treatment of sentiment in texts. Discourse is a linguistic level of analysis where the author represents ideas and links concepts in a rational chain of thoughts. One important representation of discourse is the Rhetorical Structure Theory (RST). The objective of this work is to use discourse knowledge to improve a lexicon-based sentiment classifier. To achieve this goal, it presents a lexicon-based algorithm adapted to weight portions of text under particular RST relations distinctly. Two experiments are reported. The first experiment verifies whether RST improves sentiment classification. It also shows which discourse relations are most important in the process. The second experiment incorporates discourse markers into the algorithm in order to eliminate the need for an RST-annotated corpus. It uses the weights learned in the first experiment to perform the sentiment classification.
Use of Discourse Knowledge to Improve Lexicon-based Sentiment Analysis Balage Filho, Pedro Paulo 2012
Sentiment Analysis deals with the computational treatment of sentiment in texts. Interest in sentiment analysis has grown recently due to the popularity of the internet and the increase in user-generated content, such as blogs, social networks and review websites. This work treats sentiment analysis as a classification problem in which a text can be classified as positive or negative. Sentiment classifiers can be distinguished by two main approaches: machine learning and lexicon-based. The machine learning approach uses a corpus to automatically learn the best classification features. The lexicon-based approach uses a previously computed dictionary containing the sentiment lexicon. Discourse is a linguistic level of analysis where the author represents ideas and links concepts in a rational chain of thoughts. One important representation of discourse is the Rhetorical Structure Theory (RST). This theory organizes the discourse into 26 relations that hierarchically represent the text discourse. The objective of this work is to use discourse knowledge to improve a lexicon-based sentiment classifier. To achieve this goal, it proposes SO-RST, a lexicon-based algorithm that weights portions of text under particular RST relations distinctly. Two experiments are reported. The first experiment verifies whether RST improves sentiment classification. It also shows which discourse relations are most important in the process. The second experiment incorporates discourse markers into the algorithm in order to eliminate the need for an RST-annotated corpus. It uses the weights learned in the first experiment to perform the sentiment classification. The results show which RST relations most help the lexicon-based classifier achieve better accuracy. The discourse markers introduced into the algorithm indicate some directions to follow and the steps needed to study this technique further.
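The idea of weighting text portions by their RST relation can be illustrated with a small sketch. It is not the SO-RST algorithm itself: the toy lexicon, the relation labels attached to each segment and the weight values are all hypothetical (the abstract states that the weights are learned in the first experiment), and the input is assumed to be already segmented and annotated.

```python
from typing import Dict, List, Tuple

# Hypothetical sentiment lexicon and per-relation weights, for illustration only.
LEXICON: Dict[str, float] = {
    "good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0,
}
RELATION_WEIGHTS: Dict[str, float] = {
    "elaboration": 1.0, "concession": 0.5, "evaluation": 2.0,
}


def segment_score(text: str) -> float:
    """Plain lexicon score of one text segment."""
    return sum(LEXICON.get(tok.strip(".,!?").lower(), 0.0) for tok in text.split())


def weighted_orientation(
    segments: List[Tuple[str, str]],
    weights: Dict[str, float] = RELATION_WEIGHTS,
    default_weight: float = 1.0,
) -> float:
    """Sum segment scores, each scaled by the weight of its RST relation."""
    return sum(
        weights.get(relation, default_weight) * segment_score(text)
        for text, relation in segments
    )


def classify(segments: List[Tuple[str, str]]) -> str:
    """Positive when the weighted orientation is non-negative."""
    return "positive" if weighted_orientation(segments) >= 0 else "negative"
```

With the segments `[("The screen is excellent", "concession"), ("but the battery is terrible.", "evaluation")]`, the unweighted lexicon scores cancel out, while the relation weights make the second segment dominate and the text is classified as negative.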
A Graphical User Interface for Feature-Based Opinion Mining Balage Filho, Pedro Paulo, Brun, Caroline, and Rondeau, Gilbert In Proceedings of the NAACL-HLT 2012: Demonstration Session 2012
In this paper, we present XOpin, a graphical user interface that has been developed to provide smart access to the results of a feature-based opinion detection system built on top of a parser.
A Corpus Based Method for Product Feature Ranking for Interactive Question Answering Systems Konstantinova, Natalia, Orasan, Constantin, and Balage, Pedro Paulo International Journal of Computational Linguistics and Applications 2012
At times choosing a product can be a difficult task due to the fact that customers need to consider many features before they can reach a decision. Interactive question answering (IQA) systems can help customers in this process, by answering questions about products and initiating a dialogue with the customer when their needs are not clearly defined. For this purpose we propose a corpus-based method for weighting the importance of product features depending on how likely they are to be of interest for a user. By using this method, we hope that users can select the desired product in an optimal way. For the experiments a corpus of user reviews is used, the assumption being that the features mentioned in a review are probably more important for a person who is likely to purchase a product. To improve the method, a sentiment classification system is also employed to distinguish between features mentioned in positive and negative contexts. Evaluation shows that the ranking method that incorporates this information is one of the best performing ones.
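As a rough illustration of ranking product features from a review corpus, the sketch below scores each feature by the fraction of reviews that mention it. The scoring function is an assumption made for this example, not the weighting used in the paper, and the sentiment-based split between positive and negative contexts is left out. The function name `rank_features` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def rank_features(reviews: List[str], features: List[str]) -> List[Tuple[str, float]]:
    """Rank features by the share of reviews that mention them at least once."""
    counts: Counter = Counter()
    for text in reviews:
        words = set(text.lower().split())
        # Count each feature at most once per review (document frequency).
        counts.update(f for f in features if f in words)
    total = len(reviews) or 1  # avoid division by zero on an empty corpus
    scored = [(f, counts[f] / total) for f in features]
    # Highest share first; ties are broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


reviews = [
    "battery lasts long",
    "great camera and battery",
    "screen is dim",
    "battery died fast",
]
ranking = rank_features(reviews, ["battery", "camera", "screen", "zoom"])
```

An interactive question answering system could then ask about the top-ranked features first, on the assumption stated in the abstract that features mentioned in reviews are more likely to matter to a prospective buyer.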