Journal of Computer-Assisted Linguistic Research - Vol 03 (2019)

Permanent URI for this collection

https://riunet.upv.es/handle/10251/123821

Table of contents

Articles

The Role of Previous Discourse in Identifying Public Textual Cyberbullying
Sampling Techniques to Overcome Class Imbalance in a Cyberbullying Context
An Evaluation Of A Linguistically Motivated Conversational Software Agent Framework
A qualitative analysis of the Wikipedia N-Substate Algorithm's Enhancement Terms
The car pet in the carpet. On the interaction of computer-linguistic methodology and manual refinement in researching noun compounds

The car pet in the carpet. On the interaction of computer-linguistic methodology and manual refinement in researching noun compounds
(Universitat Politècnica de València, 2019-07-16) Huber, Elisabeth
[EN] Why does football combine productively with further nouns to form more complex expressions like football game, whereas seemingly comparable compounds like keyword only infrequently expand to more complex sequences? This project explores why some two-noun compounds are more readily available for forming triconstituent constructions than others. I hypothesize that the productivity of a two-noun compound in the formation of triconstituent sequences depends on the degree of entrenchment of that two-noun compound, assuming that only compounds that are entrenched to a certain degree are productive in forming more complex constructions. In order to test this hypothesis, a list of three-noun compounds in the English language needed to be compiled. The obvious thing to do would be to search for sequences of three nouns in POS-tagged corpora. However, since such automatized searches on the one hand do not allow the recall of all required instances and, on the other hand, often create results that are not precise enough, this requires substantial manual screening. Furthermore, in order to operationalize the concepts of entrenchment and productivity, it was necessary to count the usage frequencies of noun constructions. For this work, as well, the automatic elicitation of the data needed to be complemented by further manual selection in order to obtain correct usage frequencies. Both the complex automatic and manual work processes in the elicitation of the data will be presented in detail to give an impression of the extent of such a project.
A qualitative analysis of the Wikipedia N-Substate Algorithm's Enhancement Terms
(Universitat Politècnica de València, 2019-07-16) Goslin, Kyle; Hofmann, Markus
[EN] Automatic Search Query Enhancement (ASQE) is the process of modifying a user-submitted search query and identifying terms that can be added or removed to enhance the relevance of documents retrieved from a search engine. ASQE differs from other enhancement approaches as no human interaction is required. ASQE algorithms typically rely on a source of a priori knowledge to aid the process of identifying relevant enhancement terms. This paper describes the results of a qualitative analysis of the enhancement terms generated by the Wikipedia N-Substate Algorithm (WNSSA) for ASQE. The WNSSA utilises Wikipedia as the sole source of a priori knowledge during the query enhancement process. As each Wikipedia article typically represents a single topic, during the enhancement process of the WNSSA, a mapping is performed between the user’s original search query and Wikipedia articles relevant to the query. If this mapping is performed correctly, a collection of potentially relevant terms and acronyms is accessible for ASQE. This paper reviews the results of a qualitative analysis process performed for the individual enhancement terms generated for each of the 50 test topics from the TREC-9 Web Topic collection. The contributions of this paper include: (a) a qualitative analysis of generated WNSSA search query enhancement terms and (b) an analysis of the concepts represented in the TREC-9 Web Topics, detailing interpretation issues during query-to-Wikipedia article mapping performed by the WNSSA.
An Evaluation Of A Linguistically Motivated Conversational Software Agent Framework
(Universitat Politècnica de València, 2019-07-16) Panesar, Kulvinder
[EN] This paper presents a critical evaluation framework for a linguistically motivated conversational software agent (CSA). The CSA prototype investigates the integration, intersection and interface of the language, knowledge, and speech act constructions (SAC) based on a grammatical object, and the sub-model of belief, desires and intention (BDI) and dialogue management (DM) for natural language processing (NLP). A long-standing issue within NLP CSA systems is refining the accuracy of interpretation to provide realistic dialogue to support human-to-computer communication. This prototype comprises three phase models: (1) a linguistic model based on a functional linguistic theory – Role and Reference Grammar (RRG), (2) an Agent Cognitive Model with two inner models: (a) a knowledge representation model, (b) a planning model underpinned by BDI concepts, intentionality and rational interaction, and (3) a dialogue model. The evaluation strategy for this Java-based prototype is multi-approach, driven by grammatical testing (English language utterances), software engineering and agent practice. Evaluation criteria are grouped per phase model, and the testing framework aims to test the interface, intersection and integration of all phase models. The empirical evaluations demonstrate that the CSA is a proof-of-concept, demonstrating RRG’s fitness for purpose for describing and explaining phenomena, language processing and knowledge, and computational adequacy. In contrast, the evaluations identify the complexity of lower-level computational mappings of NL (agent to ontology), with semantic gaps that are further addressed by a lexical bridging solution.
Sampling Techniques to Overcome Class Imbalance in a Cyberbullying Context
(Universitat Politècnica de València, 2019-07-16) Colton, David; Hofmann, Markus
[EN] The majority of datasets suffer from class imbalance where samples of a dominant class significantly outnumber the samples available for the minority class that is to be detected. Prediction and classification machine learning models work best when there are roughly equal numbers of each class type. This paper explores sampling techniques that can be used to overcome this class imbalance problem in a cyberbullying context. A newly classified cyberbullying dataset, including detailed descriptions of the criteria used in its classification, was used to examine the feasibility of applying text mining techniques to automate the detection of cyberbullying text when the dataset shows a significant class imbalance between the positive, cyberbullying, samples and the negative, not cyberbullying, samples. In this paper, we will investigate if oversampling the minority positive class or undersampling the majority negative class affects the performance of a prediction model. A compromise solution where the positive class is partially oversampled, and the negative class is partially undersampled is also examined. Although not strictly a class imbalance solution, sampling using the most frequently observed features was also explored.
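The three sampling strategies named in this abstract (oversampling the positive class, undersampling the negative class, and a partial compromise between the two) can be illustrated with a minimal sketch. This is plain random resampling written for illustration only, not the authors' procedure; the function name `rebalance`, its parameters, and the toy labels are all hypothetical.

```python
import random


def rebalance(labels, positive, n_pos, n_neg, seed=0):
    """Return shuffled sample indices with n_pos positives and up to n_neg negatives.

    Hypothetical helper: assumes at least one positive sample exists.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]

    if n_pos > len(pos):
        # Oversampling: keep every original positive, then add random duplicates.
        pos_idx = pos + [rng.choice(pos) for _ in range(n_pos - len(pos))]
    else:
        # No duplication needed: draw a random subset without replacement.
        pos_idx = rng.sample(pos, n_pos)

    # Undersampling: random subset of the negatives without replacement.
    neg_idx = rng.sample(neg, min(n_neg, len(neg)))

    idx = pos_idx + neg_idx
    rng.shuffle(idx)
    return idx


# Toy data (invented): 2 positive and 8 negative samples.
labels = ["bully"] * 2 + ["ok"] * 8

oversampled = rebalance(labels, "bully", n_pos=8, n_neg=8)   # positives duplicated up to 8
undersampled = rebalance(labels, "bully", n_pos=2, n_neg=2)  # negatives cut down to 2
compromise = rebalance(labels, "bully", n_pos=4, n_neg=4)    # partial over- and undersampling
```

In practice, resampling of this kind would normally be applied to the training split only, so that duplicated positives do not leak into the evaluation data.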
The Role of Previous Discourse in Identifying Public Textual Cyberbullying
(Universitat Politècnica de València, 2019-07-16) Power, Aurelia; Keane, Anthony; Nolan, Brian; O'Neill, Brian
[EN] In this paper we investigate the contribution of previous discourse in identifying elements that are key to detecting public textual cyberbullying. Based on the analysis of our dataset, we first discuss the missing cyberbullying elements and the grammatical structures representative of discourse-dependent cyberbullying discourse. Then we identify four types of discourse-dependent cyberbullying constructions: (1) fully inferable constructions, (2) personal marker and cyberbullying link inferable constructions, (3) dysphemistic element and cyberbullying link inferable constructions, and (4) dysphemistic element inferable constructions. Finally, we formalise a framework to resolve the missing cyberbullying elements that proposes several resolution algorithms. The resolution algorithms target the following discourse-dependent message types: (1) polarity answers, (2) contradictory statements, (3) explicit ellipsis, (4) implicit affirmative answers, and (5) statements that use indefinite pronouns as placeholders for the dysphemistic element.
