MaCBench

Nawaf Alampara1, Mara Schilling-Wilhelmi1, Martiño Ríos-García1, Indrajeet Mandal2, Pranav Khetarpal3, Hargun Singh Grover3, N. M. Anoop Krishnan†3,4, Kevin Maik Jablonka†1,5,6,7
1 Friedrich Schiller University Jena (FSU-Jena) 2 Dept. of Civil Engineering, IIT Delhi 3 School of Interdisciplinary Research, IIT Delhi 4 Yardi School of Artificial Intelligence, IIT Delhi 5 CEEC-Jena, FSU-Jena 6 HIPOLE, Jena 7 JCSM, FSU-Jena
† Corresponding authors

MaCBench questions

Explore the MaCBench dataset, a comprehensive collection of over 1,100 hand-crafted question-image pairs designed to evaluate the performance of vision-language models (VLMs) in chemistry and materials science. Browse examples from the different topics of MaCBench using the navigation arrows, or let them rotate automatically.

Introduction

MaCBench is a pioneering benchmark designed to evaluate the multimodal reasoning capabilities of vision-language models (VLMs) in chemistry and materials science. While VLMs show promise in perception tasks, their ability to integrate scientific knowledge across text and images remains underexplored. MaCBench addresses this gap with a diverse suite of tasks spanning data extraction, experimental understanding, and results interpretation—mirroring real-world scientific workflows, as shown in Figure 1.

Distribution of tasks in the MaCBench dataset.
Figure 1: Distribution of tasks in the MaCBench dataset among the three scientific pillars considered.

MaCBench corpus

A key strength of MaCBench lies in its carefully curated dataset, which includes manually created visuals ranging from molecular structures to spectroscopic data, as well as photographs taken in real-world chemistry labs. This ensures that models are evaluated on authentic and scientifically relevant scenarios rather than synthetic, idealized inputs. By revealing the strengths and weaknesses of AI in chemistry and materials science, MaCBench sets the stage for more capable scientific assistants and AI-driven discoveries. All images in the MaCBench corpus can be seen in the grid at the top of the page.
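To make the structure of the corpus concrete, the sketch below shows how one such question-image pair and a simple exact-match scoring loop could be represented. This is a minimal illustration only: the field names, example items, and the `accuracy` helper are hypothetical and do not reflect the official MaCBench data format or evaluation pipeline.

```python
from dataclasses import dataclass

@dataclass
class MaCBenchItem:
    """One question-image pair (hypothetical schema, not the official format)."""
    topic: str        # e.g. "equipment", "spectroscopy"
    image_path: str   # path to the manually created visual or lab photograph
    question: str
    answer: str       # reference answer used for scoring

def accuracy(items, predictions):
    """Fraction of model predictions that exactly match the reference answer."""
    if not items:
        return 0.0
    correct = sum(
        pred.strip().lower() == item.answer.strip().lower()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)

# Two illustrative (made-up) items and model outputs:
items = [
    MaCBenchItem("equipment", "img/flask.png", "Name this glassware.", "round-bottom flask"),
    MaCBenchItem("structures", "img/mol.png", "Name this molecule.", "benzene"),
]
print(accuracy(items, ["Round-bottom flask", "toluene"]))  # → 0.5
```

In practice, benchmark scoring would also need to handle multiple-choice parsing and numeric tolerance, but the per-task aggregation follows the same pattern.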

Results

We systematically benchmark leading VLMs across a range of chemistry and materials science tasks. As shown in Figure 2, while models achieve near-perfect accuracy on tasks such as laboratory equipment identification and naming of hand-drawn molecules, their performance drops significantly on more complex challenges.

Main results in MaCBench.
Figure 2: Results of the evaluated models in the different pillars and tasks considered in MaCBench.

Ablation studies

To analyze these failure modes, MaCBench incorporates targeted ablation studies that isolate key challenges such as image-text integration, reasoning complexity, degree of guidance, and domain-specific terminology.

Ablation studies' results in MaCBench.
Figure 3: Results of the evaluated models in the ablation studies designed for MaCBench.

Figure 3 illustrates the results of these ablation studies, revealing that while models excel in basic tasks, they struggle with complex reasoning and domain-specific language. This highlights the need for further research to enhance AI's capabilities in chemistry and materials science.

BibTeX

@article{alampara2024probing,
  title   = {Probing the limitations of multimodal language models for chemistry and materials research},
  author  = {Nawaf Alampara and Mara Schilling-Wilhelmi and Martiño Ríos-García and Indrajeet Mandal and Pranav Khetarpal and Hargun Singh Grover and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.16955}
}