Upcoming

Funding: DFG, SPP 2556, LaSTing (final decision pending)
PIs: Jutta Hartmann, Anke Himmelreich, Sina Zarrieß
Project description: The central objective of this project is to develop a novel interdisciplinary approach that leverages language models (LMs) as tools for cross-linguistic research and linguistic theories as tools for systemically assessing LMs’ robustness. To this end, we operationalize linguistic theories to assess how robust an LM’s “holistic” syntactic knowledge is, by moving from evaluation on single phenomena to systemic assessments of networks of formally related structures (FORESTs). A FOREST is a network of abstract structures that share underlying syntactic properties within and/or across languages. For example, ‘Who does Peter like _ best?’ and ‘What do you think that Mary bought _ ?’ share a dependency between a filler (‘Who’/‘What’) and a gap (_) but differ with respect to the presence of embedding. We use such networks of frequent and grammatical filler-gap dependencies and compare them to infrequent and ungrammatical island configurations as well as infrequent but grammatical parasitic gap constructions like ‘Who did you kiss _ without knowing _?’, where an illicit gap in an island becomes well-formed due to a gap outside the island. Based on theoretically informed sets of FORESTs, we develop systemic assessment procedures that test for the presence of “holistic” syntactic knowledge in an LM. We further develop robustness scoring of these assessments for families of models, which, in the next step, allows us to test the predictions of different theoretical analyses of parasitic gaps. Current theoretical analyses of parasitic gap structures make different predictions as to which other structures are close members in a network of FORESTs. We use these differences between theories to compare human acceptability judgments of parasitic gaps and related structures with LMs’ performance on these structures, manipulating the training data to include different FORESTs.
We will first set up this procedure for a set of theoretically well-described FORESTs and languages. Our main goal is to connect the assessment of LMs with cutting-edge cross-linguistic research, focusing on the theoretically challenging case of parasitic gaps. Bringing together theoretical linguistic knowledge and computational expertise in LMs, the project addresses the research questions of the Priority Programme LaSTing in various ways. First, the project contributes to robust assessment by designing benchmark materials in a more theory-driven and generalizable way, including a cross-linguistic perspective. Second, experiments that vary input, model size, and architecture will lead to a better understanding of the limits of syntax learning in LMs and their transferability to other languages. In the long run, these insights can contribute to making LMs more resource-efficient and sustainable. Finally, the project aims to conduct research on foundational questions regarding the explanatory power of LMs for linguistic theory building.

Ongoing

CRC 1646: Linguistic Creativity in Communication

Check the CRC website for CRC projects that our group is affiliated with
Funding: DFG

SAIL: SustAInable Life-cycle of Intelligent Socio-Technical Systems (since 2023)

Check the SAIL website for SAIL projects that our group is affiliated with
Funding: MKW NRW

LLM4KMU: Optimierter Einsatz von Open Source Large Language Models in KMU (since 2025)

Check the LLM4KMU website
Funding: Ministerium für Wirtschaft, Industrie, Klimaschutz und Energie des Landes NRW

Finished

INAS (2022-2025)

Title: Interactive Argumentation Support in the Invasion Biology Domain
Project description: Developing a good, new argument is not an easy task. In real-world argumentation scenarios, arguments presented in texts (e.g. scientific publications) often constitute the end result of a long and tedious process. A lot of work on computational argumentation has focused on detecting, analyzing and aggregating these products of argumentation processes, i.e. argumentative texts. In this project, we adopt a complementary perspective: we aim to develop an argumentation machine that supports users in and during the argumentation process in a scientific context, enabling them to follow ongoing argumentation in a scientific community and to develop their own arguments. To achieve this ambitious goal, we will focus on a particular phase of the scientific argumentation process, namely the initial phase of claim or hypothesis development. In scientific argumentation, a carefully developed and thought-through hypothesis is often crucial for researchers to be able to conduct a successful study and, in the end, present a new, high-quality finding or argument. Thus, an initial hypothesis needs to be specific enough that a researcher can test it based on data, but, at the same time, it should also relate to and extend previous general claims made in the community. In this project, we investigate how argumentation machines can (i) represent concrete and more abstract knowledge on hypotheses and their underlying concepts, (ii) automatically compute semantic relations between hypotheses made in scientific publications, and between hypotheses and datasets, and (iii) interactively support a user in developing her own hypothesis based on these resources. This project will thus combine methods from different disciplines: natural language processing, knowledge representation and the Semantic Web, and, as an example of a scientific domain, invasion biology.
Funding: DFG, RATIO SPP
PIs: Sina Zarrieß, Tina Heger, Birgitta König-Ries

NLP4VIS (Nov 2020 - Oct 2023)

Title: A generic conversational interface for scientific data visualization
Project description: The goal of this project is to work towards a generic Natural Language Interface (NLI) that allows users to interact with data and visualizations of data in an intuitive way, via conversational language. Given such a generic NLI, users could enter formally complex queries on their data in natural language (e.g., in a data set on exam grades: “find me a question that Master students answer significantly better than Bachelor students”), without the need to extensively familiarize themselves with the technical backend of the visualization tool at hand (e.g., programming in Python/matplotlib). Compared to traditional interfaces for visualization, NLIs have the potential to greatly improve the usability of existing tools, simplifying and speeding up visual exploration and analysis of scientific data for different user groups.
Funding: Carl Zeiss Foundation, Werkstatt project @MSCJ
PIs: Kai Lawonn, Monique Meuschke, Sina Zarrieß

HistKI (Jan 2021 - Dec 2023)

Project description: In many historical sciences, photographs and other images of architecture serve as a source and basis for subject- and theory-specific investigations. Although AI-based computer vision methods have developed significantly in recent years, they can support the process of source research and criticism in a rudimentary way at best, e.g. for the exploration of image repositories or the retrieval of images. HistKI aims to explore the support and modeling of image source research and criticism as a complex and fundamental historiographical working technique by multimodal AI-based methods. Related sub-questions are: How do historians and other scholars find and evaluate image sources? What generic procedures and sub-problems can be identified for this purpose? How can this be promoted with AI-based approaches? How do AI techniques impact the humanities research process? These questions will be explored using selected scenarios in which images, texts, and 3D models interact synergistically to describe architectural objects and urban ensembles for a process of analysis. With the help of machine learning methods, object sources and text sources (e.g. captions) will be linked in HistKI in order to allow a detailed contextualization and localization of photographs in the future, thus going a significant step beyond previous methods of distant viewing.
Funding: BMBF (grant number: 01UG2120A)
Coordinator: Sander Münster, Uni Jena

Upcoming

FORESTS: Systemic Robustness Assessments of Language Models for Cross-Linguistic Research using Formally Related Structures
