Repository logo
 

Search Results

Now showing 1 - 10 of 12
  • Parafraseamento automático de registo informal em registo formal na Língua Portuguesa
    Publication . Barreiro, Anabela Marques; Rebelo-Arnold, Ida; Baptista, Jorge; Mota, Cristina; Garcez, Isabel
    Este artigo apresenta o processo de automatização de parafraseamento em português e conversão de construções típicas do registo informal ou da linguagem falada em construções de registo formal usadas na linguagem escrita. Ilustraremos o processo de automatização com exemplos extraídos do corpus e-PACT, que envolvem a colocação normalizada de pronomes clíticos quando co-ocorrem com compostos verbais. A tarefa consiste em parafrasear e normalizar, entre outras, construções como vou-lhe/posso-lhe fazer uma surpresa em vou/posso fazer-lhe uma surpresa. Este artigo apresenta o processo de automatização de parafraseamento em português e conversão de construções típicas do registo informal ou da linguagem falada em construções de registo formal usadas na linguagem escrita. Ilustraremos o processo de automatização com exemplos extraídos do corpus e-PACT,que envolvem a colocação normalizada de pronomesclíticos quando co-ocorrem com compostos verbais.A tarefa consiste em parafrasear e normalizar, en-tre outras, constru ̧c ̃oes comovou-lhe/posso-lhefazeruma surpresaemvou/posso fazer-lheuma surpresa,em que o pronome cl ́ıticolhemigra de uma posi ̧c ̃aoencl ́ıtica imediatamente a seguir ao primeiro verbo docomposto verbal para uma posi ̧c ̃ao encl ́ıtica a seguirao verbo principal, que ́e o verbo respons ́avel pelasele ̧c ̃ao do argumento pronominal. O primeiro verbo ́e um verbo auxiliar ou um verbo volitivo, e.g.,querer.Este ́e um procedimento padronizado no processo de revisão em português europeu. Casos como este representam fenómenos linguísticos em que os estudantes de língua portuguesa e falantes em geral se confundem ou onde “tropeçam”. O artigo enfatiza a língua padrão em que os fenómenos observados ocorrem,descreve exemplos de interesse encontrados no cor-pus e apresenta uma solução automática, baseada na aplica ̧c ̃ao de gramáticas transformacionais genéricas,que facilitam a normalização de inadequações ou falhas sintáticas (registos informais) encontradas nas construções pesquisadas em construções padronizadas típicas da escrita formal ou escrita profissional.
  • Estimating lexical availability of European Portuguese proverbs
    Publication . Reis, Sónia; Baptista, Jorge
    This paper relates data on lexical availability with data on textual frequency of proverbs in European Portuguese. Each data source should provide different perspectives on the use of proverbs in the language. This should allow an empirically well-motivated selection of proverbs aiming at the development of NLP resources, specifically for applications for learning Portuguese as a Foreign Language and for the diagnosis/therapy of speech impairments/disabilities. A large database (over 114,000 proverbs and their variants) was independently classified by two annotators, according to intuitively estimated lexical availability. Next, a random, stratified sample was selected and lexical availability was then confirmed with an online survey. Frequency data was gathered from two web browsers and a large-sized, publicly available, corpus of journalistic texts. Results from the survey, the web and the corpus by and large confirm the initial intuitive classification and a core of commonly used proverbs was defined
  • Portuguese proverbs: types and variants
    Publication . Reis, Sónia; Baptista, Jorge
    Drawing on the methodology and previous results of Rassi et al. (2014) on the automatic identification of Brazilian Portuguese proverbs, this paper reports on an extension of that experiment, but now focused on the identification of the European Portuguese proverbs and their variants. Based on a large collection of over 56 thousand Portuguese proverbs and their variants, a database of proverb types was specifically built for natural language processing, along with the finite-state tools that allow for the identification of these strings in texts. Our aim is to make these linguistic resources and language processing tools publicly available, which will undoubtedly be deemed useful assets to other paremiologic studies.
  • Authorship attribution in portuguese using character N-grams
    Publication . Markov, Ilia; Baptista, Jorge; Pichardo-Lagunas, Obdulia
    For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.
  • Os provérbios em manuais de ensino de português língua não materna
    Publication . Reis, Sónia; Baptista, Jorge
    Os provérbios apresentam uma grande variedade de estruturas e podem servir diversos propósitos comunicativos. Devido à sua riqueza cultural e linguística, prestamse ainda a múltiplos objetivos didáticos, nomeadamente no ensino de Português como Língua não Materna (PLNM). Neste trabalho, investigamos como são de facto utilizados os provérbios em manuais de PLNM, usando ferramentas e recursos de processamento computacional de linguagem natural (PLN). Os resultados são comparados com observações já feitas sobre um corpus de manuais de Português para falantes nativos.
  • Syntax Deep Explorer
    Publication . Correia, José; Baptista, Jorge; Mamede, Nuno
    The analysis of the co-occurrence patterns between words allows for a better understanding of the use (and meaning) of words and its most straightforward applications are lexicography and linguist description in general. Some tools already produce co-occurrence information about words taken from Portuguese corpora, but few can use lemmata or syntactic dependency information. Syntax Deep Explorer is a new tool that uses several association measures to quantify several co-occurrence types, defined on the syntactic dependencies (e.g. subject, complement, modifier) between a target word lemma and its co-locates. The resulting co-occurrence statistics is represented in lex-grams, that is, a synopsis of the syntactically-based co-occurrence patterns of a word distribution within a given corpus. These lex-grams are obtained from a large-sized Portuguese corpus processed by STRING [19] and are presented in a user-friendly way through a graphical interface. The Syntax Deep Explorer will allow the development of finer lexical resources and the improvement of STRING processing in general, as well as providing public access to co-occurrence information derived from parsed corpora.
  • Automated anonymization of text documents
    Publication . Dias, Francisco; Mamede, Nuno; Baptista, Jorge
    Sharing data in the form of text is important for a wide range of activities but it also raises a concern about privacy when sharing data that could be sensitive. Automated text anonymization is a solution for removing all the sensitive information from documents. However, this is a challenging task due to the unstructured form of textual data and the ambiguity of natural language. In this work, we present our implementation of an automated anonymization system, built in a modular structure, for documents written in Portuguese. Four different methods of anonymization are evaluated and compared. Two methods replace the sensitive information by artificial labels: suppression and tagging. The other two methods replace the information by textual expressions: random substitution and generalization. Evaluation showed that the use of the tagging and the generalization methods facilitates the reading of an anonymized text while preventing some semantic drifts caused by the remotion of the original information.
  • Automatic generation of exercises on passive transformation in portuguese
    Publication . Baptista, Jorge; Lourenço, Sandra; Mamede, Nuno J.
    Technology plays a very important role in education and Intelligent Computer-Assisted Language Learning (iCALL) has emerged as a complementary or even alternative method to the conventional language teaching practices. The automatic generation (and correction) of language exercises based on real texts extracted from corpora constitutes a non-trivial challenge to iCALL tutorial systems, and may involve the use of sophisticated Natural Language Processing tools and large-scale linguistic resources. This paper presents the main issues related to the automatic generation of exercises on the Passive transformation, a commonly occurring type of exercises in language textbooks, but also a very complex topic of Portuguese grammar. The paper describes the methods used to produce a large batch of passive-active sentence pairs, where the active sentence was automatically generated from naturally occurring passive sentences, taken from a large-sized, publicly available, corpus. Sentence pairs are ranked by difficulty level. A sample of randomly selected sentence pairs (40 from difficult level, 100 from medium, and 100 from easy level) was manually evaluated by an expert. Results are presented and error analysis is performed. The sentence pairs can be used as prime and correct answer for iCALL systems.
  • Assisting European Portuguese teaching: linguistic features extraction and automatic readability classifier
    Publication . Curto, Pedro; Mamede, Nuno; Baptista, Jorge
    This paper describes two automatic systems: a linguistic features extractor and a text readability classifier for European Portuguese texts. Its main goal is to assist the selection of adequate reading materials to support Portuguese teaching, especially as a second language. To the feature extraction from texts, the system uses several Natural Language Processing (NLP) tools. Currently, 52 features are extracted: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, among others. A classifier was created using these features and a corpus, previously annotated readability level, adopting the five-levels language classification official standard for Portuguese as Second Language. In a five-levels (from A1 to C1) scenario, the best-performing learning algorithm (LogitBoost) achieved an accuracy of 75.11% with a root mean square error (RMSE) of 0.269. In a three-levels (A, B and C) scenario, the best-performing learning algorithm (C4.5 grafted) achieved 81.44% accuracy, with a RMSE of 0.346.
  • Let's play with proverbs? NLP tools and resources for iCALL applications around proverbs for PFL
    Publication . Reis, Sónia; Baptista, Jorge
    Proverbs are an important form of cultural expression of a society and are related to various areas of knowledge and human experience (González Rey, 2002). While linguistic elements in widespread use, proverbs are very rich structures both from a cultural and from a linguistic point of view and can therefore contribute significantly to the teaching of languages, both native and foreign (Council of Europe, 2001). However, though there are extensive collections of Portuguese proverbs with tens of thousands of forms and its variants (Reis, in preparation), its automatic identification in texts is quite difficult, given its formal variation, both lexical and syntactic (Chacoto, 1994). Nevertheless, using real examples, where proverbs are used in a natural or spontaneous discourse context, is a more natural way to learn and teach the complex conditions and communicative situations that determine the use and meaning of these expressions. On the other hand, frequency indices associated with proverbs and its variants would allow one to select the most common expressions. These are precisely the most interesting forms from the point of view of their teaching/learning and could serve as a basis for the construction of educational games, particularly for learning Portuguese autonomously as a foreign language (PFL) assisted by computer. To make this possible, it is necessary, first of all, be able to recognize the occurrence of proverbs in the texts (Rassi et al. 2014), including the instances where these expressions are presented in a truncated or creatively modified form, for example, to better suit the communicative situation or to produce new and more expressive meanings. In this paper, we present an on-going project, which aims at automatic identification of proverbs in texts. In this interdisciplinary study, we combine natural language processing tools with questionnaires construction techniques for teaching purposes (Hoshino and Nakagawa 2005, Correia et al. 2010). This is illustrated here with different sets of formats that can be built based on the knowledge of the form and variation of proverbs, as well as their frequency in corpora.