About
IBM is proud to sponsor the International Conference on Very Large Data Bases 2025 (VLDB 2025).
VLDB is a premier annual international forum for data management and scalable data science, bringing together database researchers, vendors, practitioners, application developers, and users. The forthcoming VLDB 2025 conference is poised to deliver a comprehensive program featuring research talks, keynote and invited talks, panels, tutorials, demonstrations, industrial tracks, and workshops. It will cover a spectrum of research topics across all aspects of data management in which systems issues play a significant role, such as data management system technology and information management infrastructures, including very-large-scale experimentation, novel architectures, and demanding applications, as well as their underpinning theory.
Visit us at our Sponsor table to learn more about our work at IBM Research. IBM speaking sessions & presentations can be found in the agenda section below.
Agenda
- Description:
Grounding LLMs for Database Exploration: Intent Scoping and Paraphrasing for Robust NL2SQL
Catalina Dragusin (ETH Zurich); Katsiaryna Mirylenka (Zalando SE); Christoph Miksovic (IBM Research); Michael Glass (IBM Research); Nahuel Defosse (IBM Research); Paolo Scotton (IBM Research); Thomas Gschwind (IBM Research)
- Description:
Bootstrapping Learned Cost Models with Synthetic SQL Queries
Michael Nidd (IBM Research); Christoph Miksovic (IBM Research); Thomas Gschwind (IBM Research); Francesco Fusco (IBM Research); Andrea Giovannini (IBM Research); Ioana Giurgiu (IBM Research)
- Description:
A knowledge graph (KG) represents a network of entities and the relationships between them. KGs are used in a wide range of applications, including semantic search and discovery, reasoning, decision making, natural language processing, machine learning, and recommendation systems. Automatic KG construction from text is an active research area. Triple (subject-relation-object) extraction from text is the fundamental building block of KG construction and has been widely studied, from early benchmarks such as ACE 2002 to more recent ones such as WebNLG 2020, REBEL, and SynthIE. A number of works in recent years have also exploited LLMs for KG construction. However, handcrafting reasonable task-specific prompts for LLMs is labour-intensive and brittle to changes in the underlying LLM. Recent work on various NLP tasks (e.g. autonomy generation) addresses this challenge with automatic prompt optimisation/engineering, which generates optimal or near-optimal task-specific prompts given input-output examples.
This empirical study explores the application of automatic prompt optimisation to the triple extraction task through experimental benchmarking. We evaluate different settings by changing (a) the prompting strategy, (b) the LLM used for prompt optimisation and task execution, (c) the number of canonical relations in the schema (schema complexity), (d) the length and diversity of the input text, (e) the metric used to drive the prompt optimisation, and (f) the dataset used for training and testing. We evaluate three automatic prompt optimisers, namely DSPy, APE, and TextGrad, on two triple extraction datasets, SynthIE and REBEL. Our main contribution is to show that automatic prompt optimisation techniques can generate prompts comparable to human-crafted ones for triple extraction and achieve improved results, with significant gains observed as text size and schema complexity increase.
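To make the setup concrete, here is a minimal sketch of what such an optimisation loop can look like with DSPy, one of the three optimisers evaluated. The signature, metric, training example, and model name below are illustrative stand-ins, not the paper's actual configuration.

```python
import dspy

# Assumed model name; any LM supported by DSPy would work here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractTriples(dspy.Signature):
    """Extract (subject, relation, object) triples from the input text."""
    text: str = dspy.InputField()
    triples: list[str] = dspy.OutputField(desc="one 'subject | relation | object' string per triple")

def triple_f1(example, prediction, trace=None):
    # Toy metric: F1 over exact string matches; real evaluations normalise entities and relations.
    gold, pred = set(example.triples), set(prediction.triples)
    if not gold or not pred:
        return 0.0
    p, r = len(gold & pred) / len(pred), len(gold & pred) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# A one-example trainset just to show the shape of the data; real runs need far more.
trainset = [
    dspy.Example(
        text="Marie Curie was born in Warsaw.",
        triples=["Marie Curie | place of birth | Warsaw"],
    ).with_inputs("text"),
]

extractor = dspy.Predict(ExtractTriples)
# MIPROv2 searches over instructions and few-shot demos to maximise the metric.
optimized = dspy.MIPROv2(metric=triple_f1, auto="light").compile(extractor, trainset=trainset)
```

APE and TextGrad expose analogous optimise-against-a-metric loops, which is what makes a cross-optimiser comparison like the one in this study possible.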
Authors: Nandana Mihindukulasooriya (IBM); Horst Samulowitz (IBM)
- Description:
Modern applications span multiple clouds to reduce costs, avoid vendor lock-in, and leverage low-availability resources in another cloud. However, standard object stores operate within a single cloud, forcing users to manually manage data placement across clouds, i.e., navigate their diverse APIs and handle heterogeneous costs for network and storage. This is often a complex choice: users must either pay to store objects in a remote cloud, or pay to transfer them over the network based on application access patterns and cloud provider cost offerings. To address this, we present SkyStore, a unified object store for cost-optimal data management across regions and clouds. SkyStore introduces a virtual object and bucket API to hide the complexity of interacting with multiple clouds. At its core, SkyStore has a novel TTL-based data placement policy that dynamically replicates and evicts objects according to application access patterns while optimizing for lower cost. Our evaluation shows that across various workloads, SkyStore reduces the overall cost by up to 6x over academic baselines and commercial alternatives like AWS multi-region buckets. SkyStore also has comparable latency, and its availability and fault tolerance are on par with standard cloud offerings.
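As a rough illustration of the idea (not SkyStore's actual implementation), a TTL-based policy can be sketched in a few lines: a read from a region without a live replica pays one cross-cloud transfer and creates a local replica, which is evicted if no further reads arrive before the TTL expires.

```python
import time

class VirtualObjectStore:
    """Toy model of TTL-based replication and eviction, for illustration only."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.expiry = {}  # (region, key) -> replica expiry timestamp

    def get(self, region: str, key: str, home_region: str) -> str:
        now = time.time()
        if self.expiry.get((region, key), 0.0) > now:
            self.expiry[(region, key)] = now + self.ttl   # hit: refresh the TTL
            return "local read"
        self.expiry[(region, key)] = now + self.ttl       # miss: replicate after one transfer
        return f"transferred from {home_region}, replica cached"

    def evict_expired(self) -> None:
        # Eviction trades ongoing storage cost against future egress cost; the TTL balances the two.
        now = time.time()
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}

store = VirtualObjectStore(ttl_seconds=3600)
print(store.get("gcp:us-central1", "model.bin", home_region="aws:us-east-1"))  # pays transfer
print(store.get("gcp:us-central1", "model.bin", home_region="aws:us-east-1"))  # local hit
```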
Authors: Shu Liu; Xiangxi Mo; Moshik Lanir Hershcovitch (IBM); Henric Zhang; Audrey Cheng; Guy Girmonsky (IBM); and 8 more
- Description:
GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research
Meng Wang (University of Chicago); Gus Waldspurger (University of Chicago); Naufal Ananda (Telkom University); Yuyang Huang (University of Chicago); Kemas Wiharja (Telkom University); John Bent (LANL); Swaminathan Sundararaman (IBM Research); Vijay Chidambaram (UT Austin); Haryadi S. Gunawi (University of Chicago)
- Description:
Abstract TBC
Speakers: Ioana Giurgiu (IBM Research)
- Description:
Robust Plan Evaluation based on Approximate Probabilistic Machine Learning
Amin Kamali (University of Ottawa); Verena Kantere (University of Ottawa); Calisto Zuzarte (IBM); Vincent Corvinelli (IBM)
- Description:
Data Analysis is described as the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Performing such tasks over large and heterogeneous collections of tabular data, as found in enterprise data lakes and on the Web, is extremely challenging and an attractive research topic in data management, AI, and related communities. The goal of this workshop is to bring together researchers and practitioners in these diverse communities that work on addressing the fundamental research challenges of tabular data analysis and building automated solutions in this space.
We aim to provide a forum for:
a) exchange of ideas between two communities: 1) an active community of data management researchers working on data integration and schema and data matching problems over tabular data, and 2) a vibrant community of researchers in the AI and Semantic Web communities working on the core challenge of matching tabular data to Knowledge Graphs as part of the ISWC SemTab Challenges;
b) presentation of late-breaking results in several emerging research areas, such as table representation learning and its applications, the use of large language models (LLMs) for tabular data analysis, and the automation of data science pipelines that rely on tabular data;
c) discussion of real-world challenges in implementing industrial-scale tabular data analysis pipelines, data lakes, and data lakehouse solutions.
Authors: Vasilis Efthymiou; Sainyam Galhotra; Ernesto Jiménez-Ruiz; Chuan Lei
- Description:
Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query’s computational requirements and an engine’s capabilities, a complex and manual task that only becomes more difficult with the emergence of new engines and workloads. In this paper, we address this limitation by proposing a cross-engine optimizer that is able to automate engine selection for diverse SQL queries by means of a learned cost model. A query plan, optimized with hints, is used for query cost prediction and routing. Cost prediction is formulated as a multi-task learning problem and multiple predictor heads, corresponding to different engines and provisionings, are used in the model architecture. This effectively eliminates the need to train engine-specific models and allows the flexible addition of new engines at a minimal fine-tuning cost.
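The multi-headed architecture described above can be sketched briefly. This is a minimal PyTorch sketch with hypothetical engine names and an assumed fixed-size plan feature vector; the paper's featurisation and training details are not shown.

```python
import torch
import torch.nn as nn

class CrossEngineCostModel(nn.Module):
    """Shared plan encoder with one regression head per engine/provisioning."""

    def __init__(self, n_features: int, engines: list[str]):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Adding a new engine later means adding a head and fine-tuning it,
        # rather than training a whole new engine-specific model.
        self.heads = nn.ModuleDict({e: nn.Linear(64, 1) for e in engines})

    def forward(self, plan_features: torch.Tensor) -> dict[str, torch.Tensor]:
        z = self.encoder(plan_features)
        return {e: head(z).squeeze(-1) for e, head in self.heads.items()}

# Hypothetical engine/provisioning names, for illustration only.
model = CrossEngineCostModel(n_features=32, engines=["engine_a_small", "engine_a_large", "engine_b"])
costs = model(torch.randn(4, 32))  # predicted cost per engine for a batch of 4 query plans
route = [min(costs, key=lambda e: costs[e][i].item()) for i in range(4)]  # cheapest engine per query
```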
Authors: Andras Strausz (IBM); Ioana Giurgiu (IBM)
- Description:
One of the major challenges in enterprise data analysis is finding joinable tables that are conceptually related and provide meaningful insights. Traditionally, joinable tables have been discovered through a search for similar columns, where two columns are considered syntactically similar if their value sets overlap, and semantically similar if their column embeddings or value embeddings are close in the embedding space. For enterprise data lakes, however, column similarity alone is not sufficient to identify joinable columns and tables: the context of the query column matters. Hence, in this work we first define context-aware column joinability. We then propose a multi-criteria approach, called TOPJoin, for joinable column search. We evaluate TOPJoin against existing join search baselines on one academic and one real-world join search benchmark, and find that it outperforms the baselines on both.
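The flavour of a multi-criteria, context-aware score can be sketched as follows; the specific criteria, embeddings, and weights here are illustrative assumptions, not TOPJoin's actual scoring function.

```python
import numpy as np

def containment(query_vals, cand_vals):
    """Syntactic criterion: how much of the query column's value set the candidate covers."""
    q, c = set(query_vals), set(cand_vals)
    return len(q & c) / len(q) if q else 0.0

def cosine(u, v):
    """Semantic criterion: similarity of two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def joinability(query_vals, cand_vals, col_emb_q, col_emb_c, tbl_emb_q, tbl_emb_c,
                weights=(0.4, 0.3, 0.3)):
    """Blend syntactic overlap, column-embedding similarity, and similarity of the
    columns' table context (the 'context-aware' part). Weights are illustrative."""
    scores = (
        containment(query_vals, cand_vals),
        cosine(col_emb_q, col_emb_c),
        cosine(tbl_emb_q, tbl_emb_c),
    )
    return sum(w * s for w, s in zip(weights, scores))
```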
Authors:+1 more view allHKAKAamod KhatiwadaIBMTPTejaswini PedapatiIBMHAHaritha AnanthakrishnanIBMOHHSHorst SamulowitzIBM - Description:
Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or in documents specific to a given organization. Building such benchmarks through manual annotation is labour-intensive and limits their size and scalability.
In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth and follows a two-stage "plan-then-execute" pipeline to synthetically generate corresponding natural-language reports. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring unit and time accuracy.
We evaluated the proposed method on 87,881 examples across 50 datasets. The results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence when producing extractable text. Notably, models preserve numerical and temporal information with high fidelity, yet this information becomes embedded in narratives that resist automated extraction.
We release a comprehensive infrastructure, including datasets, evaluation tools, and baseline extraction systems, to support continued research. Our findings highlight a critical gap: models can generate accurate text but struggle to maintain information accessibility, a key requirement for practical deployment in sectors demanding both accuracy and machine processability.
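As an illustration of the objective side of such an evaluation, a simple unit-accuracy check over generated reports might look like the following sketch; the regex, tolerance, and example facts are illustrative, not StructText's released tooling.

```python
import re

def find_value(report: str, unit: str):
    """Look for a number immediately followed by the expected unit in the generated text."""
    m = re.search(rf"([\d][\d,]*\.?\d*)\s*{re.escape(unit)}\b", report)
    return float(m.group(1).replace(",", "")) if m else None

def unit_accuracy(ground_truth, report, tol=1e-9):
    """Fraction of (value, unit) facts from the source table recoverable from the text."""
    hits = sum(
        1 for value, unit in ground_truth
        if (found := find_value(report, unit)) is not None and abs(found - value) <= tol
    )
    return hits / len(ground_truth)

facts = [(12.5, "kg"), (340, "km")]
report = "The shipment weighed 12.5 kg and travelled 340 km overnight."
print(unit_accuracy(facts, report))  # 1.0
```

A fact counts as extractable only if a mechanical parser can recover it, which is exactly the accessibility gap the abstract highlights: text can be factually correct while still resisting this kind of automated extraction.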
Authors: Nandana Mihindukulasooriya (IBM); Horst Samulowitz (IBM)
Upcoming events
IBM Quantum Developer Conference 2025
- Atlanta, Georgia, USA
Berkeley Innovation Forum 2025 at IBM Research
- San Jose, CA, USA
IBM at AI_dev Europe 2025
- Amsterdam, Netherlands