Upcoming
- Funding: DFG, SPP 2556, LaSTing (final decision pending)
- PIs: Jutta Hartmann, Anke Himmelreich, Sina Zarrieß
- Project description: The central objective of this project is to develop a novel
interdisciplinary approach that leverages language models (=LMs) as
tools for cross-linguistic research and linguistic theories as tools for
systemically assessing LMs’ robustness. For this goal, we
operationalize linguistic theories to assess how robust an LM’s
“holistic” syntactic knowledge is, by moving from evaluation on single
phenomena to systemic assessments of networks of formally-related
structures (=FORESTs). A FOREST is a network of abstract
structures that share underlying syntactic properties within languages
and/or across languages. For example, ‘Who does Peter like _ best?’
and ‘What do you think that Mary bought _ ?’ share the dependency of
a filler ‘Who/What’ to a gap (_) but differ with respect to the presence
of embedding. We use such networks of frequent and grammatical
filler-gap dependencies and compare them to infrequent and
ungrammatical island-configurations as well as infrequent but
grammatical parasitic gap constructions like ‘Who did you kiss _
without knowing _?’, where an illicit gap in an island becomes well-formed
due to a gap outside the island. Based on theoretically
informed sets of FORESTs, we develop systemic assessment
procedures that test for the presence of “holistic” syntactic
knowledge in an LM. We further develop robustness scoring of these
assessments for families of models that, in the next step, allow us to test
predictions of different theoretical analyses of parasitic gaps. Current
theoretical analyses of parasitic gap structures make different
predictions as to which other structures are close members in a
network of FORESTs. We use these differences in theories to
compare the results of acceptability judgments of parasitic gaps and
related structures in humans with LMs’ performance on these
structures, by manipulating training data input to include different
FORESTs. We will first set up this procedure for a set of theoretically
well-described FORESTs and languages. Our main goal is to connect
LMs’ assessments and cutting-edge cross-linguistic research,
focusing on the theoretically challenging case of parasitic gaps.
Bringing together theoretical linguistic knowledge and computational
expertise in LMs, the project addresses the research questions of the
Priority Programme LaSTing in various ways. First, the project
contributes to robust assessment by designing benchmark materials
in a more theory-driven and generalizable way, including a cross-linguistic
perspective. Second, experiments that vary input, model
size, and architecture will lead to a better understanding of the limits
of syntax learning in LMs and their transferability to other languages.
In the long run, these insights can contribute to making LMs more
resource-efficient and sustainable. Finally, the project aims to conduct
research on foundational questions regarding the explanatory power of LMs for linguistic theory building.
Ongoing
CRC 1646: Linguistic Creativity in Communication
- Check the CRC website for CRC projects that our group is affiliated with
- Funding: DFG
SAIL: SustAInable Life-cycle of Intelligent Socio-Technical Systems (since 2023)
- Check the SAIL website for SAIL projects that our group is affiliated with
- Funding: MKW NRW
LLM4KMU: Optimierter Einsatz von Open Source Large Language Models in KMU (Optimized Use of Open-Source Large Language Models in SMEs) (since 2025)
- Check the LLM4KMU website
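- Funding: Ministerium für Wirtschaft, Industrie, Klimaschutz und Energie des Landes NRW (Ministry of Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine-Westphalia)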
Finished
INAS (2022-2025)
- Title: Interactive Argumentation Support in the Invasion Biology Domain
- Project description: Developing a good, new argument is not an easy task.
In real-world argumentation scenarios, arguments presented in texts (e.g. scientific publications) often constitute the end result of a long and tedious process.
A lot of work on computational argumentation has focused on detecting, analyzing and aggregating these products of argumentation processes, i.e. argumentative texts. In this project, we adopt a complementary perspective: we aim to develop an argumentation machine that supports users in and during the argumentation process in a scientific context, enabling them to follow ongoing argumentation in a scientific community and to develop their own arguments. To achieve this ambitious goal, we will focus on a particular phase of the scientific argumentation process, namely the initial phase of claim or hypothesis development.
In scientific argumentation, a carefully developed and thought-through hypothesis is often crucial for researchers to be able to conduct a successful study and, in the end, present a new, high-quality finding or argument.
Thus, an initial hypothesis needs to be specific enough that a researcher can test it based on data, but, at the same time, it should also relate to and extend previous general claims made in the community. In this project, we investigate how argumentation machines can (i) represent concrete and more abstract knowledge on hypotheses and their underlying concepts, (ii) automatically compute semantic relations between hypotheses made in scientific publications, and between hypotheses and datasets, and (iii) interactively support a user in developing her own hypothesis based on these resources. This project will thus combine methods from different disciplines: natural language processing, knowledge representation and the semantic web, and, as an example of a scientific domain, invasion biology.
- Funding: DFG, RATIO SPP
- PIs: Sina Zarrieß, Tina Heger, Birgitta König-Ries
NLP4VIS (Nov 2020 - Oct 2023)
- Title: A generic conversational interface for scientific data visualization
- Project description: The goal of this project is to work towards a generic Natural Language Interface (NLI) that allows users to interact with data and visualizations of data in an intuitive way, via conversational language. Given such a generic NLI, users could enter formally complex queries on their data in natural language (e.g., in a data set on exam grades: ‘find me a question that Master students answer significantly better than Bachelor students’), without needing to familiarize themselves extensively with the technical backend of the visualization tool at hand (e.g., programming in Python/matplotlib).
Compared to traditional interfaces for visualization, NLIs bear the potential to greatly improve usability of existing tools, simplifying and speeding up visual exploration and analysis of scientific data for different user groups.
- Funding: Carl Zeiss Foundation, Werkstatt project @MSCJ
- PIs: Kai Lawonn, Monique Meuschke, Sina Zarrieß
HistKI (Jan 2021 - Dec 2023)
- Project description: In many historical sciences, photographs and other images of architecture serve as a source and basis for subject- and theory-specific investigations. Although AI-based computer vision methods have developed significantly in recent years, they can support the process of source research and criticism in a rudimentary way at best, e.g. for the exploration of image repositories or the retrieval of images. HistKI aims to explore the support and modeling of image source research and criticism as a complex and fundamental historiographical working technique by multimodal AI-based methods. Related sub-questions are: How do historians and other scholars find and evaluate image sources? What generic procedures and sub-problems can be identified for this purpose? How can this be promoted with AI-based approaches? How do AI techniques impact the humanities research process? These questions will be explored using selected scenarios in which images, texts, and 3D models interact synergistically to describe architectural objects and urban ensembles for a process of analysis. With the help of machine learning methods, object sources and text sources (e.g., captions) will be linked in HistKI in order to allow a detailed contextualization and localization of photographs in the future, thus going a significant step beyond previous methods of distant viewing.
- Funding: BMBF (Förderkennzeichen: 01UG2120A)
- Coordinator: Sander Münster, Uni Jena