First, this action works towards an in-depth syntactical analysis and computational implementation of Mandarin Chinese Noun Phrases (NPs). It uses an implementable framework (Head-Driven Phrase Structure Grammar, HPSG) to design and test high-precision linguistic analyses for Mandarin Chinese text. The linguistic analysis will pay particular attention to NP quantification, reference/deixis, cognitive status and modification. The computational implementation of theoretical analyses will be made available as part of the Mandarin Resource Grammar.
Second, this action will also work towards exploiting the computational implementations produced as part of this project to create a high-precision error detection parser for Mandarin Chinese. To accomplish this, we use a concept known as "mal-rules". Mal-rules detect grammatical errors by allowing the violation or relaxation of linguistic constraints. They can pinpoint, with high precision, the source of language problems, and can be used to correct ungrammatical sentences or to provide corrective feedback on them.
This action is strongly data-driven and employs an interdisciplinary methodology, integrating standard methods of formal linguistic analysis with those of software development in Computer Science. It is situated within the context of DELPH-IN, an international consortium that shares a commitment to developing open-source resources for deep linguistic processing using Head-Driven Phrase Structure Grammar and Minimal Recursion Semantics.
Luis Morgado da Costa: Luis is the ‘excellent fellow’ of this action. He is an interdisciplinary researcher with a strong background in computational linguistics. He recently completed his PhD in Using Rich Models of Language in Grammatical Error Detection, from the Interdisciplinary Graduate Program at Nanyang Technological University, Singapore. Luis brings ample expertise in computational linguistics, grammar engineering and in the development of grammatical error detection applications targeting educational contexts. lmorgado.dacosta@gmail.com / luis.morgadodacosta@upol.cz
Joanna Ut-Seong Sio:
Joanna is the supervisor of this action. She is a linguist and comedian, originally from Hong Kong. She received her PhD from Leiden University, in the Netherlands, where she worked on modification and reference in the Chinese nominal. Her research interests include Chinese languages, especially in the area of syntax and semantics, as well as the use of verbal arts in the training of communication skills. She brings her unparalleled knowledge of the syntax of the Chinese NP to this project. joannautseong.sio@upol.cz
Owsiankova Hana:
Hana is the project officer of this action. She is a project manager of national and Horizon Europe projects and mobility coordinator at the Faculty of Arts, Palacký University in Olomouc. Her main interests are science diplomacy and science business. Hana provides invaluable administrative support to the execution of this action. hana.owsiankova@upol.cz
Latest Events and Milestones
(2022.11.02-04) Invited talk at the Final Sinophone Conference: "What is Hua?" UPOL, Olomouc, Czech Republic.:
I was invited to participate in a panel on Language Diversity — Promoting Cantonese in and through the Digital World. This panel was led by Joanna Sio, my MSCA supervisor. Joanna and I gave a joint talk titled "The Cantonese Wordnet: Recent Development and Challenges". This panel also included other talks from esteemed colleagues, such as a talk by Andy Chi-on Chin (The Education University of Hong Kong) titled "From Humanities to Digital Humanities: Cantonese Studies in the Big Data Era" and a talk by Zoe Lam and Raymond Pai (University of British Columbia) on "Cantonese Language Curriculum Design for a Diverse Student Population: A Case study of a Canadian University". This was a great opportunity for networking. A joint publication summarizing the discussion we had in this panel (authored by all panelists) is planned for the near future.
(2022.11.03) Invited talk at We Connect 2022. Innovation & Business Sustainability. Ris3ok. Olomouc, Czech Republic.:
I was invited by the Innovation Center of the Olomouc Region to participate in this conference directed at researchers, businesses and policymakers. The conference aimed at discussing various areas of Innovation & Business Sustainability being actively researched/developed in the Olomouc region (where UPOL is situated). I gave a joint (bilingual) talk with Francis Bond and Frantisek Kratochvil. Our talk was titled "Machines that Understand and Teach Us" (Czech title: "Stroje, které nám rozumí a učí nás"). We received a lot of positive attention, and it was a great venue to network.
(2022.11.01) Invited Seminar for the Department of Linguistics and Modern Languages, City University of Hong Kong (CityU), China.:
I had the honor of being invited to participate in the renowned Seminar Series hosted by the Department of Linguistics and Modern Languages, City University of Hong Kong. In a joint talk with my supervisor, Joanna Sio, we discussed our goals and experience in developing the Cantonese Wordnet Project. We also discussed some key aspects of digital lexicology, such as data formats/curation/sharing in view of sustainability, corpus-based lexicography, and some potential applications of wordnets in education. It was a very productive seminar, gathering support from attendees — many of whom showed interest in future collaborations.
This project uses constraint-based linguistic language models (i.e., computational grammars) to explicitly model common grammatical errors made by learners of Mandarin Chinese. It implements a theoretical concept known as mal-rules to identify and reconstruct ungrammatical sentences with enough precision to perform grammatical error detection, and to provide clear linguistic explanations of why a given sentence is ungrammatical.
In constraint-based linguistic language models, such as HPSG, robustness is an early and ever-present concern. When compared with shallow parsing methods (i.e., statistical methods that analyze sentences without fully specifying their internal structure, or accounting for deep linguistic features such as agreement), the explicit nature of constraint-based linguistic language models tends to make these models much less robust. In other words, forms of input that were not explicitly accounted for in the grammar are simply rejected. This is not necessarily a bad thing, since constraint-based models, such as HPSG, are theorized to make an implicit grammaticality judgment when they parse or reject an input, which is usually not true of statistical parsers.
And so this rigidity, which may be considered a problem for some Natural Language Processing (NLP) applications, becomes an invaluable tool for dealing with problems concerning grammaticality.
In HPSG, mal-rules can be seen as drawing inspiration from constraint relaxation or partial constraint satisfaction. However, instead of relaxing existing constraints, mal-rules effectively perform targeted constraint relaxation by adding new rules that are less constrained than what would be expected in a prescriptive grammar – i.e., they can parse ungrammatical input which should, in principle, be rejected by the grammar.
Within implemented grammars, mal-rules can be selectively available for parsing but not for generation, or to allow certain types of errors but not others. For grammars that produce a semantic representation, as is the case in this project, mal-rules can be designed to reconstruct the semantics of ungrammatical sentences in a way that allows the generation of corrected counterparts. And, in some cases, a single ungrammatical sentence can trigger multiple parses using mal-rules, each reconstructing different semantics that define different possible intended meanings of that specific ungrammatical input.
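The idea of adding less-constrained rules that are available for parsing but not for generation can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: it is not HPSG and not how the Mandarin Resource Grammar is implemented, constraints are reduced to sets of feature labels, and every name in it (Rule, head-dependent, mal-head-dependent, licensing_rules, judge) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    required: frozenset   # features the input must carry for the rule to apply
    is_mal: bool = False  # True marks a mal-rule (used for parsing only)


# Hypothetical toy rules: a strict rule, plus a mal-rule that is a
# less-constrained counterpart (it drops the "agreement" requirement).
RULES = [
    Rule("head-dependent", frozenset({"head", "dependent", "agreement"})),
    Rule("mal-head-dependent", frozenset({"head", "dependent"}), is_mal=True),
]


def licensing_rules(features, rules, mode="parse"):
    """Return the rules that accept the input; mal-rules are active only when parsing."""
    active = [r for r in rules if mode == "parse" or not r.is_mal]
    return [r for r in active if r.required <= features]


def judge(features, rules):
    """Classify an input by which kind of rule (if any) licenses it."""
    hits = licensing_rules(features, rules, mode="parse")
    if any(not r.is_mal for r in hits):
        return "grammatical"
    if hits:
        names = ", ".join(r.name for r in hits)
        return f"ungrammatical (detected by {names})"
    return "no parse"


well_formed = frozenset({"head", "dependent", "agreement"})
ill_formed = frozenset({"head", "dependent"})

print(judge(well_formed, RULES))
print(judge(ill_formed, RULES))
print(licensing_rules(ill_formed, RULES, mode="generate"))
```

In this sketch the ill-formed input is accepted only through the mal-rule, which is what identifies the error, and the same input is licensed by nothing in generation mode, so the toy grammar never produces it.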
The implementation of mal-rules in HPSG grammars can be done through three major classes of linguistic objects: syntactic rules, lexical rules, and lexical items. Each method has some degree of specificity, making them useful in detecting different kinds of errors, but there is also some overlap in their explanatory power (i.e., similar errors can be captured in more than one way). Using different combinations of mal-rules essentially enables a grammar to offer multiple ways to correct a single sentence.
Below is an example of what this project is able to achieve. We contrast the parses the Mandarin Resource Grammar produces for two different sentences:
(1) * 我买了二只狗。
(2) 我买了两只狗。
Sentence (1) is ungrammatical. It contains a common grammatical mistake made by learners of Mandarin Chinese. Mandarin Chinese has two words for the numeral two. Sentence (1) is ungrammatical due to the incorrect use of the numeral 二 (èr, two) as a numeral quantifier. Instead, 二 (èr, two) can be seen as the cardinal version of the concept two. When used as a quantifier, the word 两 (liǎng, two) should be used instead — as shown in (2).
The Mandarin Resource Grammar is able to make this clear distinction because the words 二 (èr, two) and 两 (liǎng, two) have profoundly different representations within the computational grammar. In order to capture this common mistake, a mal-rule (named mal_card_二_j) allows the word 二 (èr, two) to behave as if it were identical to 两 (liǎng, two). Without this mal-rule, the Mandarin Resource Grammar would not provide a parse for sentence (1). Finally, using this mal-rule, the Mandarin Resource Grammar is not only able to detect the error, it is also able to provide the span of the error (i.e., which words are involved), along with a linguistically grounded explanation of why sentence (1) is ungrammatical. This information can be used to provide corrective feedback (useful for learners) or it could ultimately be used to correct the sentence automatically.
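The kind of output described above (an error label, a span and a possible correction) can be illustrated with a deliberately simplified Python sketch. It uses a surface pattern check in place of the grammar, so it is not how the Mandarin Resource Grammar works internally. Only the rule name mal_card_二_j comes from the text; the classifier list, the numeral guard and the function name are assumptions made for illustration.

```python
MAL_RULE = "mal_card_二_j"  # rule name taken from the text above

# Hypothetical toy resources, far smaller than a real lexicon.
CLASSIFIERS = {"只", "个", "本"}
NUMERAL_CHARS = set("一二三四五六七八九十百千万")


def detect_er_as_quantifier(sentence):
    """Flag 二 used directly before a classifier and suggest 两 instead.

    Spans are (start, end) character offsets, end exclusive.
    二 inside a larger numeral (e.g. 十二) is not flagged.
    """
    errors = []
    for i in range(len(sentence) - 1):
        prev_is_numeral = i > 0 and sentence[i - 1] in NUMERAL_CHARS
        if (
            sentence[i] == "二"
            and sentence[i + 1] in CLASSIFIERS
            and not prev_is_numeral
        ):
            corrected = sentence[:i] + "两" + sentence[i + 1:]
            errors.append(
                {"rule": MAL_RULE, "span": (i, i + 2), "correction": corrected}
            )
    return errors


for s in ["我买了二只狗。", "我买了两只狗。"]:
    print(s, detect_er_as_quantifier(s))
```

For sentence (1) the sketch reports one error covering 二只 and proposes sentence (2) as the correction; for sentence (2) it reports nothing. A real grammar reaches the same result through full syntactic and semantic analysis rather than character matching.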
The mal-rule named mal_card_二_j could, for example, be linked to a suitable feedback message targeting learners of Mandarin Chinese. One possible feedback message would be: “It seems you have used the character 二 (èr, two) to count something in your sentence. Please remember that Mandarin Chinese has a special form for the word 'two' that must be used when counting. Try to use 两 (liǎng) instead of 二 (èr)”. It is, however, important to note that the form and quality of corrective feedback messages are not within the scope of this MSCA project.
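One simple way to link mal-rules to feedback messages is a lookup table keyed by rule name. The Python sketch below assumes that a parse reports the names of the mal-rules it used; the table, the fallback wording and the function name are hypothetical, and only the rule name and the example message come from the text.

```python
# Hypothetical lookup table from mal-rule names to learner-facing messages.
FEEDBACK = {
    "mal_card_二_j": (
        "It seems you have used the character 二 (èr, two) to count something "
        "in your sentence. Please remember that Mandarin Chinese has a special "
        "form for the word 'two' that must be used when counting. "
        "Try to use 两 (liǎng) instead of 二 (èr)."
    ),
}


def feedback_for(triggered_rules):
    """Return one message per triggered mal-rule, with a generic fallback."""
    return [
        FEEDBACK.get(name, f"A possible error was detected (rule: {name}).")
        for name in triggered_rules
    ]


for message in feedback_for(["mal_card_二_j"]):
    print(message)
```

Keeping the messages in a separate table means they can be rewritten or translated without touching the grammar itself.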
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101028782.