new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Nov 11

Data Scaling Laws in Imitation Learning for Robotic Manipulation

Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate data scaling can yield single-task robot policies that can be deployed zero-shot for any object within the same category in any environment. To this end, we conduct a comprehensive empirical study on data scaling in imitation learning. By collecting data across numerous environments and objects, we study how a policy's generalization performance changes with the number of training environments, objects, and demonstrations. Throughout our research, we collect over 40,000 demonstrations and execute more than 15,000 real-world robot rollouts under a rigorous evaluation protocol. Our findings reveal several intriguing results: the generalization performance of the policy follows a roughly power-law relationship with the number of environments and objects. The diversity of environments and objects is far more important than the absolute number of demonstrations; once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect. Based on these insights, we propose an efficient data collection strategy. With four data collectors working for one afternoon, we collect sufficient data to enable the policies for two tasks to achieve approximately 90% success rates in novel environments with unseen objects.

  • 6 authors
·
Oct 24, 2024 2

ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities

Through the integration of external tools, large language models (LLMs) such as GPT-4o and Llama 3.1 significantly expand their functional capabilities, evolving from elementary conversational agents to general-purpose assistants. We argue that the primary drivers of these advancements are the quality and diversity of the training data. However, the existing LLMs with external tool integration provide only limited transparency regarding their datasets and data collection methods, which has led to the initiation of this research. Specifically, in this paper, our objective is to elucidate the detailed process involved in constructing datasets that empower LLMs to effectively learn how to utilize external tools and make this information available to the public through the introduction of ToolBridge. ToolBridge proposes to employ a collection of general open-access datasets as its raw dataset pool and applies a series of strategies to identify appropriate data entries from the pool for external tool API insertions. By supervised fine-tuning on these curated data entries, LLMs can invoke external tools in appropriate contexts to boost their predictive accuracy, particularly for basic functions including data processing, numerical computation, and factual retrieval. Our experiments rigorously isolates model architectures and training configurations, focusing exclusively on the role of data. The experimental results indicate that LLMs trained on ToolBridge demonstrate consistent performance improvements on both standard benchmarks and custom evaluation datasets. All the associated code and data will be open-source at https://github.com/CharlesPikachu/ToolBridge, promoting transparency and facilitating the broader community to explore approaches for equipping LLMs with external tools capabilities.

  • 6 authors
·
Oct 8, 2024

CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems

Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks via interactions with tools and data retrievers have garnered significant interest within database and AI communities. While these systems have the potential to supplement typical analysis workflows of data analysts in enterprise data platforms, unfortunately, CASs are subject to the same data discovery challenges that analysts have encountered over the years -- silos of multimodal data sources, created across teams and departments within an organization, make it difficult to identify appropriate data sources for accomplishing the task at hand. Existing data discovery benchmarks do not model such multimodality and multiplicity of data sources. Moreover, benchmarks of CASs prioritize only evaluating end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs within a real-world setting, we propose CMDBench, a benchmark modeling the complexity of enterprise data platforms. We adapt existing datasets and benchmarks in open-domain -- from question answering and complex reasoning tasks to natural language querying over structured data -- to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance -- a 46% drop in task accuracy on average -- across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.

  • 5 authors
·
Jun 1, 2024

Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?

Extremely low-resource (XLR) languages lack substantial corpora for training NLP models, motivating the use of all available resources such as dictionaries and grammar books. Machine Translation from One Book (Tanzer et al., 2024) suggests that prompting long-context LLMs with one grammar book enables English-Kalamang translation, an XLR language unseen by LLMs - a noteworthy case of linguistics helping an NLP task. We investigate the source of this translation ability, finding almost all improvements stem from the book's parallel examples rather than its grammatical explanations. We find similar results for Nepali and Guarani, seen low-resource languages, and we achieve performance comparable to an LLM with a grammar book by simply fine-tuning an encoder-decoder translation model. We then investigate where grammar books help by testing two linguistic tasks, grammaticality judgment and gloss prediction, and we explore what kind of grammatical knowledge helps by introducing a typological feature prompt that achieves leading results on these more relevant tasks. We thus emphasise the importance of task-appropriate data for XLR languages: parallel examples for translation, and grammatical data for linguistic tasks. As we find no evidence that long-context LLMs can make effective use of grammatical explanations for XLR translation, we conclude data collection for multilingual XLR tasks such as translation is best focused on parallel data over linguistic description.

  • 5 authors
·
Sep 27, 2024

TITAN: Future Forecast using Action Priors

We consider the problem of predicting the future trajectory of scene agents from egocentric views obtained from a moving platform. This problem is important in a variety of domains, particularly for autonomous systems making reactive or strategic decisions in navigation. In an attempt to address this problem, we introduce TITAN (Trajectory Inference using Targeted Action priors Network), a new model that incorporates prior positions, actions, and context to forecast future trajectory of agents and future ego-motion. In the absence of an appropriate dataset for this task, we created the TITAN dataset that consists of 700 labeled video-clips (with odometry) captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo. Our dataset includes 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes that are organized hierarchically corresponding to atomic, simple/complex-contextual, transportive, and communicative actions. To evaluate our model, we conducted extensive experiments on the TITAN dataset, revealing significant performance improvement against baselines and state-of-the-art algorithms. We also report promising results from our Agent Importance Mechanism (AIM), a module which provides insight into assessment of perceived risk by calculating the relative influence of each agent on the future ego-trajectory. The dataset is available at https://usa.honda-ri.com/titan

  • 3 authors
·
Mar 30, 2020

Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering

Recent document question answering models consist of two key components: the vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM) that helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) The efficacy of an LLM-only approach on document question answering tasks (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder (3) thorough quantitative analysis on the feasibility of such an approach. Our comprehensive analysis encompasses six diverse benchmark datasets, utilizing LLMs of varying scales. Our findings reveal that a strategy exclusively reliant on the LLM yields results that are on par with or closely approach state-of-the-art performance across a range of datasets. We posit that this evaluation framework will serve as a guiding resource for selecting appropriate datasets for future research endeavors that emphasize the fundamental importance of layout and image content information.

  • 4 authors
·
Sep 25, 2023

Granite Embedding R2 Models

We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44\% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight. The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.

  • 20 authors
·
Aug 26

Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model's predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model's performance on others but also, counterintuitively, harms performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.

  • 4 authors
·
Jan 21

Challenges and Considerations in Annotating Legal Data: A Comprehensive Overview

The process of annotating data within the legal sector is filled with distinct challenges that differ from other fields, primarily due to the inherent complexities of legal language and documentation. The initial task usually involves selecting an appropriate raw dataset that captures the intricate aspects of legal texts. Following this, extracting text becomes a complicated task, as legal documents often have complex structures, footnotes, references, and unique terminology. The importance of data cleaning is magnified in this context, ensuring that redundant information is eliminated while maintaining crucial legal details and context. Creating comprehensive yet straightforward annotation guidelines is imperative, as these guidelines serve as the road map for maintaining uniformity and addressing the subtle nuances of legal terminology. Another critical aspect is the involvement of legal professionals in the annotation process. Their expertise is valuable in ensuring that the data not only remains contextually accurate but also adheres to prevailing legal standards and interpretations. This paper provides an expanded view of these challenges and aims to offer a foundational understanding and guidance for researchers and professionals engaged in legal data annotation projects. In addition, we provide links to our created and fine-tuned datasets and language models. These resources are outcomes of our discussed projects and solutions to challenges faced while working on them.

  • 3 authors
·
Jul 5, 2024

Scaling Laws for Downstream Task Performance of Large Language Models

Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream BLEU score with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.

  • 6 authors
·
Feb 6, 2024 4

Awareness in Practice: Tensions in Access to Sensitive Attribute Data for Antidiscrimination

Organizations cannot address demographic disparities that they cannot see. Recent research on machine learning and fairness has emphasized that awareness of sensitive attributes, such as race and sex, is critical to the development of interventions. However, on the ground, the existence of these data cannot be taken for granted. This paper uses the domains of employment, credit, and healthcare in the United States to surface conditions that have shaped the availability of sensitive attribute data. For each domain, we describe how and when private companies collect or infer sensitive attribute data for antidiscrimination purposes. An inconsistent story emerges: Some companies are required by law to collect sensitive attribute data, while others are prohibited from doing so. Still others, in the absence of legal mandates, have determined that collection and imputation of these data are appropriate to address disparities. This story has important implications for fairness research and its future applications. If companies that mediate access to life opportunities are unable or hesitant to collect or infer sensitive attribute data, then proposed techniques to detect and mitigate bias in machine learning models might never be implemented outside the lab. We conclude that today's legal requirements and corporate practices, while highly inconsistent across domains, offer lessons for how to approach the collection and inference of sensitive data in appropriate circumstances. We urge stakeholders, including machine learning practitioners, to actively help chart a path forward that takes both policy goals and technical needs into account.

  • 3 authors
·
Dec 12, 2019

SentiPers: A Sentiment Analysis Corpus for Persian

Sentiment Analysis (SA) is a major field of study in natural language processing, computational linguistics and information retrieval. Interest in SA has been constantly growing in both academia and industry over the recent years. Moreover, there is an increasing need for generating appropriate resources and datasets in particular for low resource languages including Persian. These datasets play an important role in designing and developing appropriate opinion mining platforms using supervised, semi-supervised or unsupervised methods. In this paper, we outline the entire process of developing a manually annotated sentiment corpus, SentiPers, which covers formal and informal written contemporary Persian. To the best of our knowledge, SentiPers is a unique sentiment corpus with such a rich annotation in three different levels including document-level, sentence-level, and entity/aspect-level for Persian. The corpus contains more than 26000 sentences of users opinions from digital product domain and benefits from special characteristics such as quantifying the positiveness or negativity of an opinion through assigning a number within a specific range to any given sentence. Furthermore, we present statistics on various components of our corpus as well as studying the inter-annotator agreement among the annotators. Finally, some of the challenges that we faced during the annotation process will be discussed as well.

  • 5 authors
·
Jan 23, 2018

Choosing an Appropriate Platform and Workflow for Processing Camera Trap Data using Artificial Intelligence

Camera traps have transformed how ecologists study wildlife species distributions, activity patterns, and interspecific interactions. Although camera traps provide a cost-effective method for monitoring species, the time required for data processing can limit survey efficiency. Thus, the potential of Artificial Intelligence (AI), specifically Deep Learning (DL), to process camera-trap data has gained considerable attention. Using DL for these applications involves training algorithms, such as Convolutional Neural Networks (CNNs), to automatically detect objects and classify species. To overcome technical challenges associated with training CNNs, several research communities have recently developed platforms that incorporate DL in easy-to-use interfaces. We review key characteristics of four AI-powered platforms -- Wildlife Insights (WI), MegaDetector (MD), Machine Learning for Wildlife Image Classification (MLWIC2), and Conservation AI -- including data management tools and AI features. We also provide R code in an open-source GitBook, to demonstrate how users can evaluate model performance, and incorporate AI output in semi-automated workflows. We found that species classifications from WI and MLWIC2 generally had low recall values (animals that were present in the images often were not classified to the correct species). Yet, the precision of WI and MLWIC2 classifications for some species was high (i.e., when classifications were made, they were generally accurate). MD, which classifies images using broader categories (e.g., "blank" or "animal"), also performed well. Thus, we conclude that, although species classifiers were not accurate enough to automate image processing, DL could be used to improve efficiencies by accepting classifications with high confidence values for certain species or by filtering images containing blanks.

  • 6 authors
·
Feb 4, 2022

Data Selection for Language Models via Importance Resampling

Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, has high correlation with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data with importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curated data across 8 target distributions.

  • 4 authors
·
Feb 6, 2023

Neural Databases

In recent years, neural networks have shown impressive performance gains on long-standing AI problems, and in particular, answering queries from natural language text. These advances raise the question of whether they can be extended to a point where we can relax the fundamental assumption of database management, namely, that our data is represented as fields of a pre-defined schema. This paper presents a first step in answering that question. We describe NeuralDB, a database system with no pre-defined schema, in which updates and queries are given in natural language. We develop query processing techniques that build on the primitives offered by the state of the art Natural Language Processing methods. We begin by demonstrating that at the core, recent NLP transformers, powered by pre-trained language models, can answer select-project-join queries if they are given the exact set of relevant facts. However, they cannot scale to non-trivial databases and cannot perform aggregation queries. Based on these findings, we describe a NeuralDB architecture that runs multiple Neural SPJ operators in parallel, each with a set of database sentences that can produce one of the answers to the query. The result of these operators is fed to an aggregation operator if needed. We describe an algorithm that learns how to create the appropriate sets of facts to be fed into each of the Neural SPJ operators. Importantly, this algorithm can be trained by the Neural SPJ operator itself. We experimentally validate the accuracy of NeuralDB and its components, showing that we can answer queries over thousands of sentences with very high accuracy.

  • 6 authors
·
Oct 14, 2020

Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach

Zero-shot text classification (0Shot-TC) is a challenging NLU problem to which little attention has been paid by the research community. 0Shot-TC aims to associate an appropriate label with a piece of text, irrespective of the text domain and the aspect (e.g., topic, emotion, event, etc.) described by the label. And there are only a few articles studying 0Shot-TC, all focusing only on topical categorization which, we argue, is just the tip of the iceberg in 0Shot-TC. In addition, the chaotic experiments in literature make no uniform comparison, which blurs the progress. This work benchmarks the 0Shot-TC problem by providing unified datasets, standardized evaluations, and state-of-the-art baselines. Our contributions include: i) The datasets we provide facilitate studying 0Shot-TC relative to conceptually different and diverse aspects: the ``topic'' aspect includes ``sports'' and ``politics'' as labels; the ``emotion'' aspect includes ``joy'' and ``anger''; the ``situation'' aspect includes ``medical assistance'' and ``water shortage''. ii) We extend the existing evaluation setup (label-partially-unseen) -- given a dataset, train on some labels, test on all labels -- to include a more challenging yet realistic evaluation label-fully-unseen 0Shot-TC (Chang et al., 2008), aiming at classifying text snippets without seeing task specific training data at all. iii) We unify the 0Shot-TC of diverse aspects within a textual entailment formulation and study it this way. Code & Data: https://github.com/yinwenpeng/BenchmarkingZeroShot

  • 3 authors
·
Aug 31, 2019

Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.

  • 6 authors
·
Mar 29

A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60\% of molecules proposed had high probability of being mutagenic. In this work, we introduce \ourdataset, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. \ourdataset~ consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). \ourdataset~is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with \ourdataset~can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex{huggingface.co/datasets/medexanon/Medex}, and will provide expanded versions as available literature grows.

  • 12 authors
·
Aug 14

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

This study is dedicated to assessing the capabilities of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting structured information from scientific documents in materials science. To this end, we primarily focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities. Due to the evident lack of datasets within Materials Informatics (MI), we evaluated using SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. However, a GPT-3.5-Turbo fine-tuned with the appropriate strategy for RE outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relationship extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks requiring extracting complex domain-specific entities like materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.

  • 4 authors
·
Jan 19, 2024 1

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

As large language models (LLMs) are applied to more use cases, creating high quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remain unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that choice of augmentation method and other design choices matter substantially more in low to mid data regimes than in high data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.

  • 7 authors
·
Sep 29, 2024

DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications

Exploiting the recent advancements in artificial intelligence, showcased by ChatGPT and DALL-E, in real-world applications necessitates vast, domain-specific, and publicly accessible datasets. Unfortunately, the scarcity of such datasets poses a significant challenge for researchers aiming to apply these breakthroughs in engineering design. Synthetic datasets emerge as a viable alternative. However, practitioners are often uncertain about generating high-quality datasets that accurately represent real-world data and are suitable for the intended downstream applications. This study aims to fill this knowledge gap by proposing comprehensive guidelines for generating, annotating, and validating synthetic datasets. The trade-offs and methods associated with each of these aspects are elaborated upon. Further, the practical implications of these guidelines are illustrated through the creation of a turbo-compressors dataset. The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset. It also highlights that design diversity does not equate to performance diversity or realism. By employing test sets that represent uniform, real, or task-specific samples, the influence of sample size and sampling strategy is scrutinized. Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design, thereby paving the way for more effective applications of AI advancements in the field. The code and data for the dataset and methods are made publicly accessible at https://github.com/cyrilpic/radcomp .

  • 3 authors
·
May 15, 2023

ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries

As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.

  • 5 authors
·
Jun 12

Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions

Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods.

  • 2 authors
·
Sep 2, 2024

Chat2VIS: Generating Data Visualisations via Natural Language using ChatGPT, Codex and GPT-3 Large Language Models

The field of data visualisation has long aimed to devise solutions for generating visualisations directly from natural language text. Research in Natural Language Interfaces (NLIs) has contributed towards the development of such techniques. However, the implementation of workable NLIs has always been challenging due to the inherent ambiguity of natural language, as well as in consequence of unclear and poorly written user queries which pose problems for existing language models in discerning user intent. Instead of pursuing the usual path of developing new iterations of language models, this study uniquely proposes leveraging the advancements in pre-trained large language models (LLMs) such as ChatGPT and GPT-3 to convert free-form natural language directly into code for appropriate visualisations. This paper presents a novel system, Chat2VIS, which takes advantage of the capabilities of LLMs and demonstrates how, with effective prompt engineering, the complex problem of language understanding can be solved more efficiently, resulting in simpler and more accurate end-to-end solutions than prior approaches. Chat2VIS shows that LLMs together with the proposed prompts offer a reliable approach to rendering visualisations from natural language queries, even when queries are highly misspecified and underspecified. This solution also presents a significant reduction in costs for the development of NLI systems, while attaining greater visualisation inference abilities compared to traditional NLP approaches that use hand-crafted grammar rules and tailored models. This study also presents how LLM prompts can be constructed in a way that preserves data security and privacy while being generalisable to different datasets. This work compares the performance of GPT-3, Codex and ChatGPT across a number of case studies and contrasts the performances with prior studies.

  • 2 authors
·
Feb 4, 2023

Plantation Monitoring Using Drone Images: A Dataset and Performance Review

Automatic monitoring of tree plantations plays a crucial role in agriculture. Flawless monitoring of tree health helps farmers make informed decisions regarding their management by taking appropriate action. Use of drone images for automatic plantation monitoring can enhance the accuracy of the monitoring process, while still being affordable to small farmers in developing countries such as India. Small, low cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing for detailed analysis of the well-being of the plantations. Existing methods of automated plantation monitoring are mostly based on satellite images, which are difficult to get for the farmers. We propose an automated system for plantation health monitoring using drone images, which are becoming easier to get for the farmers. We propose a dataset of images of trees with three categories: ``Good health", ``Stunted", and ``Dead". We annotate the dataset using CVAT annotation tool, for use in research purposes. We experiment with different well-known CNN models to observe their performance on the proposed dataset. The initial low accuracy levels show the complexity of the proposed dataset. Further, our study revealed that, depth-wise convolution operation embedded in a deep CNN model, can enhance the performance of the model on drone dataset. Further, we apply state-of-the-art object detection models to identify individual trees to better monitor them automatically.

  • 4 authors
·
Feb 12

MozzaVID: Mozzarella Volumetric Image Dataset

Influenced by the complexity of volumetric imaging, there is a shortage of established datasets useful for benchmarking volumetric deep-learning models. As a consequence, new and existing models are not easily comparable, limiting the development of architectures optimized specifically for volumetric data. To counteract this trend, we introduce MozzaVID - a large, clean, and versatile volumetric classification dataset. Our dataset contains X-ray computed tomography (CT) images of mozzarella microstructure and enables the classification of 25 cheese types and 149 cheese samples. We provide data in three different resolutions, resulting in three dataset instances containing from 591 to 37,824 images. While being general-purpose, the dataset also facilitates investigating mozzarella structure properties. The structure of food directly affects its functional properties and thus its consumption experience. Understanding food structure helps tune the production and mimicking it enables sustainable alternatives to animal-derived food products. The complex and disordered nature of food structures brings a unique challenge, where a choice of appropriate imaging method, scale, and sample size is not trivial. With this dataset we aim to address these complexities, contributing to more robust structural analysis models. The dataset can be downloaded from: https://archive.compute.dtu.dk/files/public/projects/MozzaVID/.

  • 7 authors
·
Dec 6, 2024

Agent-based Learning of Materials Datasets from Scientific Literature

Advancements in machine learning and artificial intelligence are transforming materials discovery. Yet, the availability of structured experimental data remains a bottleneck. The vast corpus of scientific literature presents a valuable and rich resource of such data. However, manual dataset creation from these resources is challenging due to issues in maintaining quality and consistency, scalability limitations, and the risk of human error and bias. Therefore, in this work, we develop a chemist AI agent, powered by large language models (LLMs), to overcome these challenges by autonomously creating structured datasets from natural language text, ranging from sentences and paragraphs to extensive scientific research articles. Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles, scientists, the Internet and other tools altogether. We benchmark the performance of our approach in three different information extraction tasks with various levels of complexity, including solid-state impurity doping, metal-organic framework (MOF) chemical formula, and property relations. Our results demonstrate that our zero-shot agent, with the appropriate tools, is capable of attaining performance that is either superior or comparable to the state-of-the-art fine-tuned materials information extraction methods. This approach simplifies compilation of machine learning-ready datasets for various materials discovery applications, and significantly ease the accessibility of advanced natural language processing tools for novice users in natural language. The methodology in this work is developed as an open-source software on https://github.com/AI4ChemS/Eunomia.

  • 2 authors
·
Dec 18, 2023

AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose

How human interact with objects depends on the functional roles of the target objects, which introduces the problem of affordance-aware hand-object interaction. It requires a large number of human demonstrations for the learning and understanding of plausible and appropriate hand-object interactions. In this work, we present AffordPose, a large-scale dataset of hand-object interactions with affordance-driven hand pose. We first annotate the specific part-level affordance labels for each object, e.g. twist, pull, handle-grasp, etc, instead of the general intents such as use or handover, to indicate the purpose and guide the localization of the hand-object interactions. The fine-grained hand-object interactions reveal the influence of hand-centered affordances on the detailed arrangement of the hand poses, yet also exhibit a certain degree of diversity. We collect a total of 26.7K hand-object interactions, each including the 3D object shape, the part-level affordance label, and the manually adjusted hand poses. The comprehensive data analysis shows the common characteristics and diversity of hand-object interactions per affordance via the parameter statistics and contacting computation. We also conduct experiments on the tasks of hand-object affordance understanding and affordance-oriented hand-object interaction generation, to validate the effectiveness of our dataset in learning the fine-grained hand-object interactions. Project page: https://github.com/GentlesJan/AffordPose.

  • 5 authors
·
Sep 16, 2023

PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback

Scientific data visualization is pivotal for transforming raw data into comprehensible visual representations, enabling pattern recognition, forecasting, and the presentation of data-driven insights. However, novice users often face difficulties due to the complexity of selecting appropriate tools and mastering visualization techniques. Large Language Models (LLMs) have recently demonstrated potential in assisting code generation, though they struggle with accuracy and require iterative debugging. In this paper, we propose PlotGen, a novel multi-agent framework aimed at automating the creation of precise scientific visualizations. PlotGen orchestrates multiple LLM-based agents, including a Query Planning Agent that breaks down complex user requests into executable steps, a Code Generation Agent that converts pseudocode into executable Python code, and three retrieval feedback agents - a Numeric Feedback Agent, a Lexical Feedback Agent, and a Visual Feedback Agent - that leverage multimodal LLMs to iteratively refine the data accuracy, textual labels, and visual correctness of generated plots via self-reflection. Extensive experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatPlotBench dataset, leading to enhanced user trust in LLM-generated visualizations and improved novice productivity due to a reduction in debugging time needed for plot errors.

Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation

Recent works have shown that generative data augmentation, where synthetic samples generated from deep generative models complement the training dataset, benefit NLP tasks. In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs. Due to the inherent hierarchical structure of goal-oriented dialogs over utterances and related annotations, the deep generative model must be capable of capturing the coherence among different hierarchies and types of dialog features. We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs, including linguistic features and underlying structured annotations, namely speaker information, dialog acts, and goals. The proposed architecture is designed to model each aspect of goal-oriented dialogs using inter-connected latent variables and learns to generate coherent goal-oriented dialogs from the latent spaces. To overcome training issues that arise from training complex variational models, we propose appropriate training strategies. Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation. We also discover additional benefits of our unified approach to modeling goal-oriented dialogs: dialog response generation and user simulation, where our model outperforms previous strong baselines.

  • 6 authors
·
Jan 23, 2020

Are LLMs ready to help non-expert users to make charts of official statistics data?

In this time when biased information, deep fakes, and propaganda proliferate, the accessibility of reliable data sources is more important than ever. National statistical institutes provide curated data that contain quantitative information on a wide range of topics. However, that information is typically spread across many tables and the plain numbers may be arduous to process. Hence, this open data may be practically inaccessible. We ask the question "Are current Generative AI models capable of facilitating the identification of the right data and the fully-automatic creation of charts to provide information in visual form, corresponding to user queries?". We present a structured evaluation of recent large language models' (LLMs) capabilities to generate charts from complex data in response to user queries. Working with diverse public data from Statistics Netherlands, we assessed multiple LLMs on their ability to identify relevant data tables, perform necessary manipulations, and generate appropriate visualizations autonomously. We propose a new evaluation framework spanning three dimensions: data retrieval & pre-processing, code quality, and visual representation. Results indicate that locating and processing the correct data represents the most significant challenge. Additionally, LLMs rarely implement visualization best practices without explicit guidance. When supplemented with information about effective chart design, models showed marked improvement in representation scores. Furthermore, an agentic approach with iterative self-evaluation led to excellent performance across all evaluation dimensions. These findings suggest that LLMs' effectiveness for automated chart generation can be enhanced through appropriate scaffolding and feedback mechanisms, and that systems can already reach the necessary accuracy across the three evaluation dimensions.

  • 4 authors
·
Sep 3

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of baselines, resulting inspeed-ups of over 500% in determining the best data mixture on our largest experiments. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area.

  • 5 authors
·
Aug 15

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either manually annotate or employ LLM (e.g., GPT-series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions (e.g., it costs less than $12 USD by calling GPT-3.5-turbo for generating 800K instruction tuning samples; 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting issues but generalizes to unseen tasks better. Code and data are available at https://github.com/WadeYin9712/Dynosaur.

  • 7 authors
·
May 23, 2023

Fidelity and Privacy of Synthetic Medical Data

The digitization of medical records ushered in a new era of big data to clinical science, and with it the possibility that data could be shared, to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent, as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy. That is, the ability to extract private or confidential information about an individual, in practice, renders it difficult to share data, since significant infrastructure and data governance must be established before data can be shared. Although HIPAA provided de-identification as an approved mechanism for data sharing, linkage attacks were identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as field suppression or abstraction, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is creating synthetic data that mimics the underlying data. For synthetic data to be a useful mechanism in support of medical innovation and a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) any analysis on the real data must be matched by analysis of the synthetic data (statistical fidelity) and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy preservation properties of synthetic datasets and demonstrate these metrics for synthetic data generated by Syntegra technology.

  • 2 authors
·
Jan 18, 2021

Hate Speech detection in the Bengali language: A dataset and its baseline evaluation

Social media sites such as YouTube and Facebook have become an integral part of everyone's life and in the last few years, hate speech in the social media comment section has increased rapidly. Detection of hate speech on social media websites faces a variety of challenges including small imbalanced data sets, the findings of an appropriate model and also the choice of feature analysis method. further more, this problem is more severe for the Bengali speaking community due to the lack of gold standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowd sourcing and varified by experts. All the comments are collected from YouTube and Facebook comment section and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity and TikTok & meme. A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation. Nevertheless, we have conducted base line experiments and several deep learning models along with extensive pre-trained Bengali word embedding such as Word2Vec, FastText and BengFastText on this dataset to facilitate future research opportunities. The experiment illustrated that although all deep learning models performed well, SVM achieved the best result with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of in the field of Bengali hate speech detection.

  • 4 authors
·
Dec 17, 2020

SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high quality instruction datasets for SFT. Our approach begins with a limited set of domain specific, human generated questions, which are systematically expanded using a large language model. Subsequently, domain relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)

  • 3 authors
·
Sep 12 2

A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning

With the rapid development of large language models in recent years, there has been an increasing demand for domain-specific Agents that can cater to the unique needs of enterprises and organizations. Unlike general models, which strive for broad coverage, these specialized Agents rely on focused datasets tailored to their intended applications. This research proposes a pipeline that leverages the power of LLMs and the Retrieval-Augmented Generation related framework to construct high-quality instruction datasets for fine-tuning on specific domains using custom document collections. By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions, thus effectively creating a comprehensive dataset for fine-tuning LLMs on the target domain. This approach overcomes the limitations of traditional dataset creation methods, which often rely on manual curation or web-scraping techniques that may introduce noise and irrelevant data. Notably, our pipeline offers a dynamic solution that can quickly adapt to updates or modifications in the domain-specific document collection, eliminating the need for complete retraining. Additionally, it addresses the challenge of data scarcity by enabling the generation of instruction datasets from a limited set of initial documents, rendering it suitable for unpopular or specialized domains where comprehensive datasets are scarce. As a case study, we apply this approach to the domain of psychiatry, a field requiring specialized knowledge and sensitive handling of patient information. The resulting fine-tuned LLM demonstrates showcases the viability of the proposed approach and underscores its potential for widespread adoption across various industries and domains where tailored, accurate, and contextually relevant language models are indispensable.

  • 3 authors
·
Aug 11, 2024

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

  • 9 authors
·
Jun 19, 2024

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge the large-scale multimodal model (e.g., GIT) and the language models (e.g., GPT-3). GIT can describe the image content even with scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter the low-quality conversations.Human analyses show that InfoVisDial covers informative and diverse dialogue topics: 54.4% of the dialogue rounds are related to image scene texts, and 36.7% require external knowledge. Each round's answer is also long and open-ended: 87.3% of answers are unique with an average length of 8.9, compared with 27.37% and 2.9 in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tune the model on InfoVisDial. Hopefully, our work can motivate more effort on this direction.

  • 6 authors
·
Dec 20, 2023

Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open & Industry Data

Deep learning solutions for vulnerability detection proposed in academic research are not always accessible to developers, and their applicability in industrial settings is rarely addressed. Transferring such technologies from academia to industry presents challenges related to trustworthiness, legacy systems, limited digital literacy, and the gap between academic and industrial expertise. For deep learning in particular, performance and integration into existing workflows are additional concerns. In this work, we first evaluate the performance of CodeBERT for detecting vulnerable functions in industrial and open-source software. We analyse its cross-domain generalisation when fine-tuned on open-source data and tested on industrial data, and vice versa, also exploring strategies for handling class imbalance. Based on these results, we develop AI-DO(Automating vulnerability detection Integration for Developers' Operations), a Continuous Integration-Continuous Deployment (CI/CD)-integrated recommender system that uses fine-tuned CodeBERT to detect and localise vulnerabilities during code review without disrupting workflows. Finally, we assess the tool's perceived usefulness through a survey with the company's IT professionals. Our results show that models trained on industrial data detect vulnerabilities accurately within the same domain but lose performance on open-source code, while a deep learner fine-tuned on open data, with appropriate undersampling techniques, improves the detection of vulnerabilities.

  • 3 authors
·
Sep 11 2

Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of languages that have already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.

  • 3 authors
·
Jun 7, 2023

PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis, a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, financial status, etc. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types - including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings - each embedding realistic, contextually appropriate PII and other sensitive information. We validate the utility of PANORAMA by fine-tuning the Mistral-7B model on 1x, 5x, 10x, and 25x data replication rates with a subset of data and measure PII memorization rates - revealing not only consistent increases with repetition but also variation across content types, highlighting PANORAMA's ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.

  • 2 authors
·
May 18

OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

In recent years, visual question answering (VQA) has attracted attention from the research community because of its highly potential applications (such as virtual assistance on intelligent cars, assistant devices for blind people, or information retrieval from document images using natural language as queries) and challenge. The VQA task requires methods that have the ability to fuse the information from questions and images to produce appropriate answers. Neural visual question answering models have achieved tremendous growth on large-scale datasets which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task as the answers selection task or answer classification task. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect in the VQA task by just selecting answers rather than generating them. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering) dataset, the first large-scale dataset for VQA with open-ended answers in Vietnamese, consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs). Moreover, we proposed FST, QuMLAG, and MLPAG which fuse information from images and answers, then use these fused features to construct answers as humans iteratively. Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C. The dataset is available to encourage the research community to develop more generalized algorithms including transformers for low-resource languages such as Vietnamese.

  • 4 authors
·
May 6, 2023

ARAUS: A Large-Scale Dataset and Baseline Models of Affective Responses to Augmented Urban Soundscapes

Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to extensive varieties of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding "maskers" (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO 12913-2:2018. Participants also provided relevant demographic information and completed standard psychological questionnaires. We perform exploratory and statistical analysis of the responses obtained to verify internal consistency and agreement with known results in the literature. Finally, we demonstrate the benchmarking capability of the dataset by training and comparing four baseline models for urban soundscape pleasantness: a low-parameter regression model, a high-parameter convolutional neural network, and two attention-based networks in the literature.

  • 6 authors
·
Jul 3, 2022

A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Network (GNN). Recent research on the application of GNN to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model na\"ively trained in a supervised fashion can suffer from insufficient performance or a lack of generalization on a small number of data. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach that is tailored to effectively harness spatio-temporal information within dynamic FC. Our empirical results, experimented with large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks.

  • 5 authors
·
Dec 4, 2023

LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data

Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.

  • 5 authors
·
Jun 17

Proactive Interaction Framework for Intelligent Social Receptionist Robots

Proactive human-robot interaction (HRI) allows the receptionist robots to actively greet people and offer services based on vision, which has been found to improve acceptability and customer satisfaction. Existing approaches are either based on multi-stage decision processes or based on end-to-end decision models. However, the rule-based approaches require sedulous expert efforts and only handle minimal pre-defined scenarios. On the other hand, existing works with end-to-end models are limited to very general greetings or few behavior patterns (typically less than 10). To address those challenges, we propose a new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot Interaction (TFVT-HRI). The proposed framework extracts visual tokens of relative objects from an RGB camera first. To ensure the correct interpretation of the scenario, a transformer decision model is then employed to process the visual tokens, which is augmented with the temporal and spatial information. It predicts the appropriate action to take in each scenario and identifies the right target. Our data is collected from an in-service receptionist robot in an office building, which is then annotated by experts for appropriate proactive behavior. The action set includes 1000+ diverse patterns by combining language, emoji expression, and body motions. We compare our model with other SOTA end-to-end models on both offline test sets and online user experiments in realistic office building environments to validate this framework. It is demonstrated that the decision model achieves SOTA performance in action triggering and selection, resulting in more humanness and intelligence when compared with the previous reactive reception policies.

  • 7 authors
·
Dec 8, 2020

Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries

We present an open-source benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). Using a dataset of 1156 prompts across six languages, we evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) on their ability to maintain appropriate emotional boundaries through pattern-matched response analysis. Our framework quantifies responses across seven key patterns: direct refusal, apology, explanation, deflection, acknowledgment, boundary setting, and emotional awareness. Results demonstrate significant variation in boundary-handling approaches, with Claude-3.5 achieving the highest overall score (8.69/10) and producing longer, more nuanced responses (86.51 words on average). We identified a substantial performance gap between English (average score 25.62) and non-English interactions (< 0.22), with English responses showing markedly higher refusal rates (43.20% vs. < 1% for non-English). Pattern analysis revealed model-specific strategies, such as Mistral's preference for deflection (4.2%) and consistently low empathy scores across all models (< 0.06). Limitations include potential oversimplification through pattern matching, lack of contextual understanding in response analysis, and binary classification of complex emotional responses. Future work should explore more nuanced scoring methods, expand language coverage, and investigate cultural variations in emotional boundary expectations. Our benchmark and methodology provide a foundation for systematic evaluation of LLM emotional intelligence and boundary-setting capabilities.

  • 2 authors
·
Feb 20 3

QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance

This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems to improve Question Answering (QA) tasks from a target corpus. Large Language Models (LLMs) have revolutionized the analyzing and generation of human-like text. These models rely on pre-trained data and lack real-time updates unless integrated with live data tools. RAG enhances LLMs by integrating online resources and databases to generate contextually appropriate responses. However, traditional RAG still encounters challenges like information dilution and hallucinations when handling vast amounts of data. Our approach addresses these challenges by converting corpora into a domain-specific dataset and RAG architecture is constructed to generate responses from the target document. We introduce QuIM-RAG (Question-to-question Inverted Index Matching), a novel approach for the retrieval mechanism in our system. This strategy generates potential questions from document chunks and matches these with user queries to identify the most relevant text chunks for generating accurate answers. We have implemented our RAG system on top of the open-source Meta-LLaMA3-8B-instruct model by Meta Inc. that is available on Hugging Face. We constructed a custom corpus of 500+ pages from a high-traffic website accessed thousands of times daily for answering complex questions, along with manually prepared ground truth QA for evaluation. We compared our approach with traditional RAG models using BERT-Score and RAGAS, state-of-the-art metrics for evaluating LLM applications. Our evaluation demonstrates that our approach outperforms traditional RAG architectures on both metrics.

  • 3 authors
·
Jan 5

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching

Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

  • 5 authors
·
Jan 30, 2024

Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models

LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less-studied but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the ``what is beautiful is good'' bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. We report the correlations that we find for 4 cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.

  • 3 authors
·
Sep 16, 2023

A Pressure Ulcer Care System For Remote Medical Assistance: Residual U-Net with an Attention Model Based for Wound Area Segmentation

Increasing numbers of patients with disabilities or elderly people with mobility issues often suffer from a pressure ulcer. The affected areas need regular checks, but they have a difficulty in accessing a hospital. Some remote diagnosis systems are being used for them, but there are limitations in checking a patient's status regularly. In this paper, we present a remote medical assistant that can help pressure ulcer management with image processing techniques. The proposed system includes a mobile application with a deep learning model for wound segmentation and analysis. As there are not enough data to train the deep learning model, we make use of a pretrained model from a relevant domain and data augmentation that is appropriate for this task. First of all, an image preprocessing method using bilinear interpolation is used to resize images and normalize the images. Second, for data augmentation, we use rotation, reflection, and a watershed algorithm. Third, we use a pretrained deep learning model generated from skin wound images similar to pressure ulcer images. Finally, we added an attention module that can provide hints on the pressure ulcer image features. The resulting model provides an accuracy of 99.0%, an intersection over union (IoU) of 99.99%, and a dice similarity coefficient (DSC) of 93.4% for pressure ulcer segmentation, which is better than existing results.

  • 3 authors
·
Jan 23, 2021

LOCOFY Large Design Models -- Design to code conversion solution

Despite rapid advances in Large Language Models and Multimodal Large Language Models (LLMs), numerous challenges related to interpretability, scalability, resource requirements and repeatability remain, related to their application in the design-to-code space. To address this, we introduce the Large Design Models (LDMs) paradigm specifically trained on designs and webpages to enable seamless conversion from design-to-code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. The training pipeline consists of the following: 1)Design Optimiser: developed using a proprietary ground truth dataset and addresses sub-optimal designs; 2)Tagging and feature detection: using pre-trained and fine-tuned models, this enables the accurate detection and classification of UI elements; and 3)Auto Components: extracts repeated UI structures into reusable components to enable creation of modular code, thus reducing redundancy while enhancing code reusability. In this manner, each model addresses distinct but key issues for design-to-code conversion. Separately, our inference pipeline processes real-world designs to produce precise and interpretable instructions for code generation and ensures reliability. Additionally, our models illustrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric. Comparative experiments indicated superior performance of LDMs against LLMs on accuracy of node positioning, responsiveness and reproducibility. Moreover, our custom-trained tagging and feature detection model demonstrated high precision and consistency in identifying UI elements across a wide sample of test designs. Thus, our proposed LDMs are a reliable and superior solution to understanding designs that subsequently enable the generation of efficient and reliable production-ready code.

  • 4 authors
·
Jul 21

Creating an LLM-based AI-agent: A high-level methodology towards enhancing LLMs with APIs

Large Language Models (LLMs) have revolutionized various aspects of engineering and science. Their utility is often bottlenecked by the lack of interaction with the external digital environment. To overcome this limitation and achieve integration of LLMs and Artificial Intelligence (AI) into real-world applications, customized AI agents are being constructed. Based on the technological trends and techniques, we extract a high-level approach for constructing these AI agents, focusing on their underlying architecture. This thesis serves as a comprehensive guide that elucidates a multi-faceted approach for empowering LLMs with the capability to leverage Application Programming Interfaces (APIs). We present a 7-step methodology that begins with the selection of suitable LLMs and the task decomposition that is necessary for complex problem-solving. This methodology includes techniques for generating training data for API interactions and heuristics for selecting the appropriate API among a plethora of options. These steps eventually lead to the generation of API calls that are both syntactically and semantically aligned with the LLM's understanding of a given task. Moreover, we review existing frameworks and tools that facilitate these processes and highlight the gaps in current attempts. In this direction, we propose an on-device architecture that aims to exploit the functionality of carry-on devices by using small models from the Hugging Face community. We examine the effectiveness of these approaches on real-world applications of various domains, including the generation of a piano sheet. Through an extensive analysis of the literature and available technologies, this thesis aims to set a compass for researchers and practitioners to harness the full potential of LLMs augmented with external tool capabilities, thus paving the way for more autonomous, robust, and context-aware AI agents.

  • 1 authors
·
Dec 17, 2024

Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs show diverse performance, with roughly half remain below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.

  • 10 authors
·
Oct 13

Foundation Models and Fair Use

Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

  • 6 authors
·
Mar 27, 2023 1

FinRobot: AI Agent for Equity Research and Valuation with Large Language Models

As financial markets grow increasingly complex, there is a rising need for automated tools that can effectively assist human analysts in equity research, particularly within sell-side research. While Generative AI (GenAI) has attracted significant attention in this field, existing AI solutions often fall short due to their narrow focus on technical factors and limited capacity for discretionary judgment. These limitations hinder their ability to adapt to new data in real-time and accurately assess risks, which diminishes their practical value for investors. This paper presents FinRobot, the first AI agent framework specifically designed for equity research. FinRobot employs a multi-agent Chain of Thought (CoT) system, integrating both quantitative and qualitative analyses to emulate the comprehensive reasoning of a human analyst. The system is structured around three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for robust financial integration; the Concept-CoT Agent, which mimics an analysts reasoning to generate actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a coherent investment thesis and report. FinRobot provides thorough company analysis supported by precise numerical data, industry-appropriate valuation metrics, and realistic risk assessments. Its dynamically updatable data pipeline ensures that research remains timely and relevant, adapting seamlessly to new financial information. Unlike existing automated research tools, such as CapitalCube and Wright Reports, FinRobot delivers insights comparable to those produced by major brokerage firms and fundamental research vendors. We open-source FinRobot at https://github. com/AI4Finance-Foundation/FinRobot.

  • 4 authors
·
Nov 13, 2024

Optimizing Retrieval Strategies for Financial Question Answering Documents in Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) has emerged as a promising framework to mitigate hallucinations in Large Language Models (LLMs), yet its overall performance is dependent on the underlying retrieval system. In the finance domain, documents such as 10-K reports pose distinct challenges due to domain-specific vocabulary and multi-hierarchical tabular data. In this work, we introduce an efficient, end-to-end RAG pipeline that enhances retrieval for financial documents through a three-phase approach: pre-retrieval, retrieval, and post-retrieval. In the pre-retrieval phase, various query and corpus preprocessing techniques are employed to enrich input data. During the retrieval phase, we fine-tuned state-of-the-art (SOTA) embedding models with domain-specific knowledge and implemented a hybrid retrieval strategy that combines dense and sparse representations. Finally, the post-retrieval phase leverages Direct Preference Optimization (DPO) training and document selection methods to further refine the results. Evaluations on seven financial question answering datasets-FinDER, FinQABench, FinanceBench, TATQA, FinQA, ConvFinQA, and MultiHiertt-demonstrate substantial improvements in retrieval performance, leading to more accurate and contextually appropriate generation. These findings highlight the critical role of tailored retrieval techniques in advancing the effectiveness of RAG systems for financial applications. A fully replicable pipeline is available on GitHub: https://github.com/seohyunwoo-0407/GAR.

  • 4 authors
·
Mar 19

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds (onscreen and off-screen). This is because T2A cannot generate sounds aligning with onscreen objects, while V2A cannot generate semantically complete (offscreen sounds missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pre-trained models will be released. Demo is available at: https://www.youtube.com/watch?v=QmqWhUjPkJI.

  • 2 authors
·
Dec 14, 2024

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Concept Bottleneck Models (CBMs) have recently been proposed to address the 'black-box' problem of deep neural networks, by first mapping images to a human-understandable concept space and then linearly combining concepts for classification. Such models typically require first coming up with a set of concepts relevant to the task and then aligning the representations of a feature extractor to map to these concepts. However, even with powerful foundational feature extractors like CLIP, there are no guarantees that the specified concepts are detectable. In this work, we leverage recent advances in mechanistic interpretability and propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm: instead of pre-selecting concepts based on the downstream classification task, we use sparse autoencoders to first discover concepts learnt by the model, and then name them and train linear probes for classification. Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model. We perform a comprehensive evaluation across multiple datasets and CLIP architectures and show that our method yields semantically meaningful concepts, assigns appropriate names to them that make them easy to interpret, and yields performant and interpretable CBMs. Code available at https://github.com/neuroexplicit-saar/discover-then-name.

  • 4 authors
·
Jul 19, 2024

Towards Exploiting Background Knowledge for Building Conversation Systems

Existing dialog datasets contain a sequence of utterances and responses without any explicit background knowledge associated with them. This has resulted in the development of models which treat conversation as a sequence-to-sequence generation task i.e, given a sequence of utterances generate the response sequence). This is not only an overly simplistic view of conversation but it is also emphatically different from the way humans converse by heavily relying on their background knowledge about the topic (as opposed to simply relying on the previous sequence of utterances). For example, it is common for humans to (involuntarily) produce utterances which are copied or suitably modified from background articles they have read about the topic. To facilitate the development of such natural conversation models which mimic the human process of conversing, we create a new dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie. We establish baseline results on this dataset (90K utterances from 9K conversations) using three different models: (i) pure generation based models which ignore the background knowledge (ii) generation based models which learn to copy information from the background knowledge when required and (iii) span prediction based models which predict the appropriate response span in the background knowledge.

  • 4 authors
·
Sep 21, 2018

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Tables are prevalent in real-world databases, requiring significant time and effort for humans to analyze and manipulate. The advancements in large language models (LLMs) have made it possible to interact with tables using natural language input, bringing this capability closer to reality. In this paper, we present TableGPT, a unified fine-tuned framework that enables LLMs to understand and operate on tables using external functional commands. It introduces the capability to seamlessly interact with tables, enabling a wide range of functionalities such as question answering, data manipulation (e.g., insert, delete, query, and modify operations), data visualization, analysis report generation, and automated prediction. TableGPT aims to provide convenience and accessibility to users by empowering them to effortlessly leverage tabular data. At the core of TableGPT lies the novel concept of global tabular representations, which empowers LLMs to gain a comprehensive understanding of the entire table beyond meta-information. By jointly training LLMs on both table and text modalities, TableGPT achieves a deep understanding of tabular data and the ability to perform complex operations on tables through chain-of-command instructions. Importantly, TableGPT offers the advantage of being a self-contained system rather than relying on external API interfaces. Moreover, it supports efficient data process flow, query rejection (when appropriate) and private deployment, enabling faster domain data fine-tuning and ensuring data privacy, which enhances the framework's adaptability to specific use cases.

  • 25 authors
·
Jul 17, 2023 5

Calibrated Language Models Must Hallucinate

Recent language models have a mysterious tendency to generate false but plausible-sounding text. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows shows that there is an inherent statistical reason that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucination is necessary for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.

  • 2 authors
·
Nov 24, 2023

AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by 17times while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/

  • 13 authors
·
Apr 23

Chinesewebtext: Large-scale high-quality Chinese web text extracted with effective evaluation model

During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate the research of LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpus focus mainly on English, and there is still lack of complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information of the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicit noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, we can easily utilize an appropriate threshold to select the high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text ChineseWebText, which consists of 1.42 TB and each text is associated with a quality score, facilitating the LLM researchers to choose the data according to the desired quality thresholds. We also release a much cleaner subset of 600 GB Chinese data with the quality exceeding 90%.

  • 10 authors
·
Nov 2, 2023

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

  • 7 authors
·
May 23 2

Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

CoRT: Code-integrated Reasoning within Thinking

Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4\% and 8\% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30\% fewer tokens for the 32B model and 50\% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

Efficient Model Adaptation for Continual Learning at the Edge

Most machine learning (ML) systems assume stationary and matching data distributions during training and deployment. This is often a false assumption. When ML models are deployed on real devices, data distributions often shift over time due to changes in environmental factors, sensor characteristics, and task-of-interest. While it is possible to have a human-in-the-loop to monitor for distribution shifts and engineer new architectures in response to these shifts, such a setup is not cost-effective. Instead, non-stationary automated ML (AutoML) models are needed. This paper presents the Encoder-Adaptor-Reconfigurator (EAR) framework for efficient continual learning under domain shifts. The EAR framework uses a fixed deep neural network (DNN) feature encoder and trains shallow networks on top of the encoder to handle novel data. The EAR framework is capable of 1) detecting when new data is out-of-distribution (OOD) by combining DNNs with hyperdimensional computing (HDC), 2) identifying low-parameter neural adaptors to adapt the model to the OOD data using zero-shot neural architecture search (ZS-NAS), and 3) minimizing catastrophic forgetting on previous tasks by progressively growing the neural architecture as needed and dynamically routing data through the appropriate adaptors and reconfigurators for handling domain-incremental and class-incremental continual learning. We systematically evaluate our approach on several benchmark datasets for domain adaptation and demonstrate strong performance compared to state-of-the-art algorithms for OOD detection and few-/zero-shot NAS.

  • 8 authors
·
Aug 3, 2023

Multi-Coil MRI Reconstruction Challenge -- Assessing Brain MRI Reconstruction Models and their Generalizability to Varying Coil Configurations

Deep-learning-based brain magnetic resonance imaging (MRI) reconstruction methods have the potential to accelerate the MRI acquisition process. Nevertheless, the scientific community lacks appropriate benchmarks to assess MRI reconstruction quality of high-resolution brain images, and evaluate how these proposed algorithms will behave in the presence of small, but expected data distribution shifts. The Multi-Coil Magnetic Resonance Image (MC-MRI) Reconstruction Challenge provides a benchmark that aims at addressing these issues, using a large dataset of high-resolution, three-dimensional, T1-weighted MRI scans. The challenge has two primary goals: 1) to compare different MRI reconstruction models on this dataset and 2) to assess the generalizability of these models to data acquired with a different number of receiver coils. In this paper, we describe the challenge experimental design, and summarize the results of a set of baseline and state of the art brain MRI reconstruction models. We provide relevant comparative information on the current MRI reconstruction state-of-the-art and highlight the challenges of obtaining generalizable models that are required prior to broader clinical adoption. The MC-MRI benchmark data, evaluation code and current challenge leaderboard are publicly available. They provide an objective performance assessment for future developments in the field of brain MRI reconstruction.

  • 23 authors
·
Nov 9, 2020

Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss

About 90% of the computing resources available to the LHCb experiment has been spent to produce simulated data samples for Run 2 of the Large Hadron Collider at CERN. The upgraded LHCb detector will be able to collect larger data samples, requiring many more simulated events to analyze the data to be collected in Run 3. Simulation is a key necessity of analysis to interpret signal, reject background and measure efficiencies. The needed simulation will far exceed the pledged resources, requiring an evolution in technologies and techniques to produce these simulated data samples. In this contribution, we discuss Lamarr, a Gaudi-based framework to speed-up the simulation production parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the single components of the LHCb detector, encoding within neural networks the experimental errors and uncertainties introduced in the detection and reconstruction phases. Where possible, models are trained directly on real data, statistically subtracting any background components by applying appropriate reweighing procedures. Embedding Lamarr in the general LHCb Gauss Simulation framework allows to combine its execution with any of the available generators in a seamless way. The resulting software package enables a simulation process independent of the detailed simulation used to date.

  • 1 authors
·
Mar 20, 2023

Unsupervised Manifold Linearizing and Clustering

We consider the problem of simultaneously clustering and learning a linear representation of data lying close to a union of low-dimensional manifolds, a fundamental task in machine learning and computer vision. When the manifolds are assumed to be linear subspaces, this reduces to the classical problem of subspace clustering, which has been studied extensively over the past two decades. Unfortunately, many real-world datasets such as natural images can not be well approximated by linear subspaces. On the other hand, numerous works have attempted to learn an appropriate transformation of the data, such that data is mapped from a union of general non-linear manifolds to a union of linear subspaces (with points from the same manifold being mapped to the same subspace). However, many existing works have limitations such as assuming knowledge of the membership of samples to clusters, requiring high sampling density, or being shown theoretically to learn trivial representations. In this paper, we propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership, inspired by state-of-the-art subspace clustering results. We give a parameterization of such a representation and membership, allowing efficient mini-batching and one-shot initialization. Experiments on CIFAR-10, -20, -100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods, and further learns a latent linear representation of the data.

  • 6 authors
·
Jan 4, 2023

Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths

Generative retrieval has recently emerged as a new alternative of traditional information retrieval approaches. However, existing generative retrieval methods directly decode docid when a query is given, making it impossible to provide users with explanations as an answer for "Why this document is retrieved?". To address this limitation, we propose Hierarchical Category Path-Enhanced Generative Retrieval(HyPE), which enhances explainability by generating hierarchical category paths step-by-step before decoding docid. HyPE leverages hierarchical category paths as explanation, progressing from broad to specific semantic categories. This approach enables diverse explanations for the same document depending on the query by using shared category paths between the query and the document, and provides reasonable explanation by reflecting the document's semantic structure through a coarse-to-fine manner. HyPE constructs category paths with external high-quality semantic hierarchy, leverages LLM to select appropriate candidate paths for each document, and optimizes the generative retrieval model with path-augmented dataset. During inference, HyPE utilizes path-aware reranking strategy to aggregate diverse topic information, allowing the most relevant documents to be prioritized in the final ranked list of docids. Our extensive experiments demonstrate that HyPE not only offers a high level of explainability but also improves the retrieval performance in the document retrieval task.

  • 6 authors
·
Nov 8, 2024

Let LLMs Break Free from Overthinking via Self-Braking Tuning

Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.

  • 10 authors
·
May 20 2

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

  • 16 authors
·
Apr 5, 2024 2

Statically Contextualizing Large Language Models with Typed Holes

Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate context, particularly when working with definitions not in the training data nor near the cursor. This paper demonstrates that tight integration with the type and binding structure of a language, as exposed by its language server, can address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server identifies the type and typing context of the hole being filled, even in the presence of errors, ensuring that a meaningful program sketch is always available. This allows prompting with codebase-wide contextual information not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer's goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications. These applications serve as challenge problems due to their reliance on application-specific data structures. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.

  • 4 authors
·
Sep 1, 2024 2

Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for non-English languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.

  • 6 authors
·
Dec 16, 2024

Challenges and Complexities in Machine Learning based Credit Card Fraud Detection

Credit cards play an exploding role in modern economies. Its popularity and ubiquity have created a fertile ground for fraud, assisted by the cross boarder reach and instantaneous confirmation. While transactions are growing, the fraud percentages are also on the rise as well as the true cost of a dollar fraud. Volume of transactions, uniqueness of frauds and ingenuity of the fraudster are main challenges in detecting frauds. The advent of machine learning, artificial intelligence and big data has opened up new tools in the fight against frauds. Given past transactions, a machine learning algorithm has the ability to 'learn' infinitely complex characteristics in order to identify frauds in real-time, surpassing the best human investigators. However, the developments in fraud detection algorithms has been challenging and slow due the massively unbalanced nature of fraud data, absence of benchmarks and standard evaluation metrics to identify better performing classifiers, lack of sharing and disclosure of research findings and the difficulties in getting access to confidential transaction data for research. This work investigates the properties of typical massively imbalanced fraud data sets, their availability, suitability for research use while exploring the widely varying nature of fraud distributions. Furthermore, we show how human annotation errors compound with machine classification errors. We also carry out experiments to determine the effect of PCA obfuscation (as a means of disseminating sensitive transaction data for research and machine learning) on algorithmic performance of classifiers and show that while PCA does not significantly degrade performance, care should be taken to use the appropriate principle component size (dimensions) to avoid overfitting.

  • 1 authors
·
Aug 20, 2022

BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as "hallucinations." Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to enhance the generation process, leading to more accurate and contextually appropriate responses. Despite its benefits, RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web. In this paper, we propose to identify the vulnerabilities and attacks on retrieval parts (RAG database) and their indirect attacks on generative parts (LLMs). Specifically, we identify that poisoning several customized content passages could achieve a retrieval backdoor, where the retrieval works well for clean queries but always returns customized poisoned adversarial queries. Triggers and poisoned passages can be highly customized to implement various attacks. For example, a trigger could be a semantic group like "The Republican Party, Donald Trump, etc." Adversarial passages can be tailored to different contents, not only linked to the triggers but also used to indirectly attack generative LLMs without modifying them. These attacks can include denial-of-service attacks on RAG and semantic steering attacks on LLM generations conditioned by the triggers. Our experiments demonstrate that by just poisoning 10 adversarial passages can induce 98.2\% success rate to retrieve the adversarial passages. Then, these passages can increase the reject ratio of RAG-based GPT-4 from 0.01\% to 74.6\% or increase the rate of negative responses from 0.22\% to 72\% for targeted queries.

  • 6 authors
·
Jun 2, 2024

Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation

Emotional Support Conversation (ESC) is a task aimed at alleviating individuals' emotional distress through daily conversation. Given its inherent complexity and non-intuitive nature, ESConv dataset incorporates support strategies to facilitate the generation of appropriate responses. Recently, despite the remarkable conversational ability of large language models (LLMs), previous studies have suggested that they often struggle with providing useful emotional support. Hence, this work initially analyzes the results of LLMs on ESConv, revealing challenges in selecting the correct strategy and a notable preference for a specific strategy. Motivated by these, we explore the impact of the inherent preference in LLMs on providing emotional support, and consequently, we observe that exhibiting high preference for specific strategies hinders effective emotional support, aggravating its robustness in predicting the appropriate strategy. Moreover, we conduct a methodological study to offer insights into the necessary approaches for LLMs to serve as proficient emotional supporters. Our findings emphasize that (1) low preference for specific strategies hinders the progress of emotional support, (2) external assistance helps reduce preference bias, and (3) existing LLMs alone cannot become good emotional supporters. These insights suggest promising avenues for future research to enhance the emotional intelligence of LLMs.

  • 8 authors
·
Feb 20, 2024

Understanding Disparities in Post Hoc Machine Learning Explanation

Previous work has highlighted that existing post-hoc explanation methods exhibit disparities in explanation fidelity (across 'race' and 'gender' as sensitive attributes), and while a large body of work focuses on mitigating these issues at the explanation metric level, the role of the data generating process and black box model in relation to explanation disparities remains largely unexplored. Accordingly, through both simulations as well as experiments on a real-world dataset, we specifically assess challenges to explanation disparities that originate from properties of the data: limited sample size, covariate shift, concept shift, omitted variable bias, and challenges based on model properties: inclusion of the sensitive attribute and appropriate functional form. Through controlled simulation analyses, our study demonstrates that increased covariate shift, concept shift, and omission of covariates increase explanation disparities, with the effect pronounced higher for neural network models that are better able to capture the underlying functional form in comparison to linear models. We also observe consistent findings regarding the effect of concept shift and omitted variable bias on explanation disparities in the Adult income dataset. Overall, results indicate that disparities in model explanations can also depend on data and model properties. Based on this systematic investigation, we provide recommendations for the design of explanation methods that mitigate undesirable disparities.

  • 4 authors
·
Jan 25, 2024

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at https://sites.google.com/view/mole4diffuser/.

  • 6 authors
·
Oct 30, 2024

TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations

Machine Learning (ML) models are increasingly used to make critical decisions in real-world applications, yet they have become more complex, making them harder to understand. To this end, researchers have proposed several techniques to explain model predictions. However, practitioners struggle to use these explainability techniques because they often do not know which one to choose and how to interpret the results of the explanations. In this work, we address these challenges by introducing TalkToModel: an interactive dialogue system for explaining machine learning models through conversations. Specifically, TalkToModel comprises of three key components: 1) a natural language interface for engaging in conversations, making ML model explainability highly accessible, 2) a dialogue engine that adapts to any tabular model and dataset, interprets natural language, maps it to appropriate explanations, and generates text responses, and 3) an execution component that constructs the explanations. We carried out extensive quantitative and human subject evaluations of TalkToModel. Overall, we found the conversational system understands user inputs on novel datasets and models with high accuracy, demonstrating the system's capacity to generalize to new situations. In real-world evaluations with humans, 73% of healthcare workers (e.g., doctors and nurses) agreed they would use TalkToModel over baseline point-and-click systems for explainability in a disease prediction task, and 85% of ML professionals agreed TalkToModel was easier to use for computing explanations. Our findings demonstrate that TalkToModel is more effective for model explainability than existing systems, introducing a new category of explainability tools for practitioners. Code & demo released here: https://github.com/dylan-slack/TalkToModel.

  • 4 authors
·
Jul 8, 2022

FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models

As financial institutions and professionals increasingly incorporate Large Language Models (LLMs) into their workflows, substantial barriers, including proprietary data and specialized knowledge, persist between the finance sector and the AI community. These challenges impede the AI community's ability to enhance financial tasks effectively. Acknowledging financial analysis's critical role, we aim to devise financial-specialized LLM-based toolchains and democratize access to them through open-source initiatives, promoting wider AI adoption in financial decision-making. In this paper, we introduce FinRobot, a novel open-source AI agent platform supporting multiple financially specialized AI agents, each powered by LLM. Specifically, the platform consists of four major layers: 1) the Financial AI Agents layer that formulates Financial Chain-of-Thought (CoT) by breaking sophisticated financial problems down into logical sequences; 2) the Financial LLM Algorithms layer dynamically configures appropriate model application strategies for specific tasks; 3) the LLMOps and DataOps layer produces accurate models by applying training/fine-tuning techniques and using task-relevant data; 4) the Multi-source LLM Foundation Models layer that integrates various LLMs and enables the above layers to access them directly. Finally, FinRobot provides hands-on for both professional-grade analysts and laypersons to utilize powerful AI techniques for advanced financial analysis. We open-source FinRobot at https://github.com/AI4Finance-Foundation/FinRobot.

  • 11 authors
·
May 23, 2024

Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation

Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with different image qualities, and often do not contain ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpuses without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation without using the teacher model's weights or training dataset to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate conditions under which our approach performs at par with Claude 3 Sonnet through empirical studies. We then show that the resulting model performs at par or better on the internal expense documents of a large multinational organization than state-of-the-art LMM (large multimodal model) Claude 3 Sonnet while being 85% less costly and ~5X faster, and outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason and extract information from rare formats. Finally, we illustrate the usage of our approach in overpayment prevention.

  • 2 authors
·
Nov 22, 2024

A Survey on Data Selection for Language Models

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

  • 14 authors
·
Feb 26, 2024

Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

Teachers and students are increasingly relying on online learning resources to supplement the ones provided in school. This increase in the breadth and depth of available resources is a great thing for students, but only provided they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets have been in English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. This dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on a multimodal comprehension. The case study we present as a baseline shows an approach that relies on recent techniques provides an acceptable performance level, but more work is necessary before it can reliably be used and trusted in a production setting.

  • 3 authors
·
Feb 10, 2023

Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations

There is a compelling necessity from enterprises for fine tuning LLMs (Large Language Models) o get them trained on proprietary domain knowledge. The challenge is to imbibe the LLMs with domain specific knowledge using the most optimial resource and cost and in the best possible time. Many enterprises rely on RAG (Retrieval Augmented Generation) which does not need LLMs to be ine-tuned but they are limited by the quality of vector databases and their retrieval capabilities rather than the intrinsic capabilities of the LLMs themselves. In our current work we focus on fine tuning LLaMA, an open source LLM using proprietary documents and code from an enterprise repository and use the fine tuned models to evaluate the quality of responses. As part of this work, we aim to guide beginners on how to start with fine tuning an LLM for documentation and code by making educated guesses on size of GPU required and options that are available for formatting the data. We also propose pre processing recipes for both documentation and code to prepare dataset in different formats. The proposed methods of data preparation for document datasets are forming paragraph chunks, forming question and answer pairs and forming keyword and paragraph chunk pairs. For code dataset we propose forming summary and function pairs. Further, we qualitatively evaluate the results of the models for domain specific queries. Finally, we also propose practical guidelines and recommendations for fine tuning LLMs.

  • 4 authors
·
Mar 23, 2024

A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

The World Health Organization added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.

  • 5 authors
·
Dec 19, 2023

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.

  • 12 authors
·
Mar 10, 2024

Data Justice Stories: A Repository of Case Studies

The idea of "data justice" is of recent academic vintage. It has arisen over the past decade in Anglo-European research institutions as an attempt to bring together a critique of the power dynamics that underlie accelerating trends of datafication with a normative commitment to the principles of social justice-a commitment to the achievement of a society that is equitable, fair, and capable of confronting the root causes of injustice.However, despite the seeming novelty of such a data justice pedigree, this joining up of the critique of the power imbalances that have shaped the digital and "big data" revolutions with a commitment to social equity and constructive societal transformation has a deeper historical, and more geographically diverse, provenance. As the stories of the data justice initiatives, activism, and advocacy contained in this volume well evidence, practices of data justice across the globe have, in fact, largely preceded the elaboration and crystallisation of the idea of data justice in contemporary academic discourse. In telling these data justice stories, we hope to provide the reader with two interdependent tools of data justice thinking: First, we aim to provide the reader with the critical leverage needed to discern those distortions and malformations of data justice that manifest in subtle and explicit forms of power, domination, and coercion. Second, we aim to provide the reader with access to the historically effective forms of normativity and ethical insight that have been marshalled by data justice activists and advocates as tools of societal transformation-so that these forms of normativity and insight can be drawn on, in turn, as constructive resources to spur future transformative data justice practices.

  • 10 authors
·
Apr 6, 2022

Evidence Inference 2.0: More Data, Better Models

How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25\%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at http://evidence-inference.ebm-nlp.com/.

  • 5 authors
·
May 8, 2020

BanglishRev: A Large-Scale Bangla-English and Code-mixed Dataset of Product Reviews in E-Commerce

This work presents the BanglishRev Dataset, the largest e-commerce product review dataset to date for reviews written in Bengali, English, a mixture of both and Banglish, Bengali words written with English alphabets. The dataset comprises of 1.74 million written reviews from 3.2 million ratings information collected from a total of 128k products being sold in online e-commerce platforms targeting the Bengali population. It includes an extensive array of related metadata for each of the reviews including the rating given by the reviewer, date the review was posted and date of purchase, number of likes, dislikes, response from the seller, images associated with the review etc. With sentiment analysis being the most prominent usage of review datasets, experimentation with a binary sentiment analysis model with the review rating serving as an indicator of positive or negative sentiment was conducted to evaluate the effectiveness of the large amount of data presented in BanglishRev for sentiment analysis tasks. A BanglishBERT model is trained on the data from BanglishRev with reviews being considered labeled positive if the rating is greater than 3 and negative if the rating is less than or equal to 3. The model is evaluated by being testing against a previously published manually annotated dataset for e-commerce reviews written in a mixture of Bangla, English and Banglish. The experimental model achieved an exceptional accuracy of 94\% and F1 score of 0.94, demonstrating the dataset's efficacy for sentiment analysis. Some of the intriguing patterns and observations seen within the dataset and future research directions where the dataset can be utilized is also discussed and explored. The dataset can be accessed through https://huggingface.co/datasets/BanglishRev/bangla-english-and-code-mixed-ecommerce-review-dataset.

  • 4 authors
·
Dec 17, 2024

Hubble: a Model Suite to Advance the Study of LLM Memorization

We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.

  • 10 authors
·
Oct 22

Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT

This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. Synthetic datasets present an effective solution to challenges pertaining to data privacy, scarcity, and control over variables - characteristics that make them particularly valuable for research pursuits. The utility of these datasets, however, largely depends on their quality, measured through the lenses of diversity, relevance, and coherence. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset. The experiment involved an iterative guidance of ChatGPT, progressively refining prompts and culminating in the creation of a comprehensive dataset for a hypothetical urban planning scenario in Columbus, Ohio. Upon generation, the synthetic dataset was subjected to an evaluation, focusing on the previously identified quality parameters and employing descriptive statistics and visualization techniques for a thorough analysis. Despite synthetic datasets not serving as perfect replacements for actual world data, their potential in specific use-cases, when executed with precision, is significant. This research underscores the potential of AI models like ChatGPT in enhancing data availability for complex sectors like telematics, thus paving the way for a myriad of new research opportunities.

  • 1 authors
·
Jun 23, 2023

FAIR Jupyter: a knowledge graph approach to semantic sharing and granular exploration of a computational notebook reproducibility dataset

The way in which data are shared can affect their utility and reusability. Here, we demonstrate how data that we had previously shared in bulk can be mobilized further through a knowledge graph that allows for much more granular exploration and interrogation. The original dataset is about the computational reproducibility of GitHub-hosted Jupyter notebooks associated with biomedical publications. It contains rich metadata about the publications, associated GitHub repositories and Jupyter notebooks, and the notebooks' reproducibility. We took this dataset, converted it into semantic triples and loaded these into a triple store to create a knowledge graph, FAIR Jupyter, that we made accessible via a web service. This enables granular data exploration and analysis through queries that can be tailored to specific use cases. Such queries may provide details about any of the variables from the original dataset, highlight relationships between them or combine some of the graph's content with materials from corresponding external resources. We provide a collection of example queries addressing a range of use cases in research and education. We also outline how sets of such queries can be used to profile specific content types, either individually or by class. We conclude by discussing how such a semantically enhanced sharing of complex datasets can both enhance their FAIRness, i.e., their findability, accessibility, interoperability, and reusability, and help identify and communicate best practices, particularly with regards to data quality, standardization, automation and reproducibility.

  • 2 authors
·
Apr 19, 2024

Using clarification questions to improve software developers' Web search

Context: Recent research indicates that Web queries written by software developers are not very successful in retrieving relevant results, performing measurably worse compared to general purpose Web queries. Most approaches up to this point have addressed this problem with software engineering-specific automated query reformulation techniques, which work without developer involvement but are limited by the content of the original query. In other words, these techniques automatically improve the existing query but can not contribute new, previously unmentioned, concepts. Objective: In this paper, we propose a technique to guide software developers in manually improving their own Web search queries. We examine a conversational approach that follows unsuccessful queries with a clarification question aimed at eliciting additional query terms, thus providing to the developer a clear dimension along which the query could be improved. Methods: We describe a set of clarification questions derived from a corpus of software developer queries and a neural approach to recommending them for a newly issued query. Results: Our evaluation indicates that the recommendation technique is accurate, predicting a valid clarification question 80% of the time and outperforms simple baselines, as well as, state-of-the-art Learning To Rank (LTR) baselines. Conclusion: As shown in the experimental results, the described approach is capable at recommending appropriate clarification questions to software developers and considered useful by a sample of developers ranging from novices to experienced professionals.

  • 2 authors
·
Jul 26, 2022

Question-Answering Model for Schizophrenia Symptoms and Their Impact on Daily Life using Mental Health Forums Data

In recent years, there is strong emphasis on mining medical data using machine learning techniques. A common problem is to obtain a noiseless set of textual documents, with a relevant content for the research question, and developing a Question Answering (QA) model for a specific medical field. The purpose of this paper is to present a new methodology for building a medical dataset and obtain a QA model for analysis of symptoms and impact on daily life for a specific disease domain. The ``Mental Health'' forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts of active users, who regularly participate, were extrapolated providing a new method of obtaining low-bias content and without privacy issues. Furthermore, it is shown how to pre-process the dataset to convert it into a QA dataset. The Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, RoBERTa, and BioBERT models were fine-tuned and evaluated via F1-Score, Exact Match, Precision and Recall. Accurate empirical experiments demonstrated the effectiveness of the proposed method for obtaining an accurate dataset for QA model implementation. By fine-tuning the BioBERT QA model, we achieved an F1 score of 0.885, showing a considerable improvement and outperforming the state-of-the-art model for mental disorders domain.

  • 2 authors
·
Sep 30, 2023

BARS: Towards Open Benchmarking for Recommender Systems

The past two decades have witnessed the rapid development of personalized recommendation techniques. Despite significant progress made in both research and practice of recommender systems, to date, there is a lack of a widely-recognized benchmarking standard in this field. Many existing studies perform model evaluations and comparisons in an ad-hoc manner, for example, by employing their own private data splits or using different experimental settings. Such conventions not only increase the difficulty in reproducing existing studies, but also lead to inconsistent experimental results among them. This largely limits the credibility and practical value of research results in this field. To tackle these issues, we present an initiative project (namely BARS) aiming for open benchmarking for recommender systems. In comparison to some earlier attempts towards this goal, we take a further step by setting up a standardized benchmarking pipeline for reproducible research, which integrates all the details about datasets, source code, hyper-parameter settings, running logs, and evaluation results. The benchmark is designed with comprehensiveness and sustainability in mind. It covers both matching and ranking tasks, and also enables researchers to easily follow and contribute to the research in this field. This project will not only reduce the redundant efforts of researchers to re-implement or re-run existing baselines, but also drive more solid and reproducible research on recommender systems. We would like to call upon everyone to use the BARS benchmark for future evaluation, and contribute to the project through the portal at: https://openbenchmark.github.io/BARS.

  • 8 authors
·
May 19, 2022

CEERS Epoch 1 NIRCam Imaging: Reduction Methods and Simulations Enabling Early JWST Science Results

We present the data release and data reduction process for the Epoch 1 NIRCam observations for the Cosmic Evolution Early Release Science Survey (CEERS). These data consist of NIRCam imaging in six broadband filters (F115W, F150W, F200W, F277W, F356W and F444W) and one medium band filter (F410M) over four pointings, obtained in parallel with primary CEERS MIRI observations (Yang et al. in prep). We reduced the NIRCam imaging with the JWST Calibration Pipeline, with custom modifications and reduction steps designed to address additional features and challenges with the data. Here we provide a detailed description of each step in our reduction and a discussion of future expected improvements. Our reduction process includes corrections for known pre-launch issues such as 1/f noise, as well as in-flight issues including snowballs, wisps, and astrometric alignment. Many of our custom reduction processes were first developed with pre-launch simulated NIRCam imaging over the full 10 CEERS NIRCam pointings. We present a description of the creation and reduction of this simulated dataset in the Appendix. We provide mosaics of the real images in a public release, as well as our reduction scripts with detailed explanations to allow users to reproduce our final data products. These represent one of the first official public datasets released from the Directors Discretionary Early Release Science (DD-ERS) program.

  • 37 authors
·
Nov 4, 2022

Preserving Statistical Validity in Adaptive Data Analysis

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.

  • 6 authors
·
Nov 10, 2014

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight models creators. At the same time, there researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts which have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently at a larger scale. Finally, we describe the balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.

  • 4 authors
·
Oct 29, 2024 2

MeSH Term Suggestion for Systematic Review Literature Search

High-quality medical systematic reviews require comprehensive literature searches to ensure the recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews and often involves domain (medical researchers) and search (information specialists) experts in developing the search queries. Queries in this context are highly complex, based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., MeSH), and are difficult and time-consuming to build. The use of MeSH terms, in particular, has been shown to improve the quality of the search results. However, identifying the correct MeSH terms to include in a query is difficult: information experts are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. Naturally, the full value of the MeSH terminology is often not fully exploited. This paper investigates methods to suggest MeSH terms based on an initial Boolean query that includes only free-text terms. These methods promise to automatically identify highly effective MeSH terms for inclusion in a systematic review query. Our study contributes an empirical evaluation of several MeSH term suggestion methods. We perform an extensive analysis of the retrieval, ranking, and refinement of MeSH term suggestions for each method and how these suggestions impact the effectiveness of Boolean queries.

  • 5 authors
·
Dec 1, 2021

Coping with Information Loss and the Use of Auxiliary Sources of Data: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions

Clinical trials disruption has always represented a non negligible part of the ending of interventional studies. While the SARS-CoV-2 (COVID-19) pandemic has led to an impressive and unprecedented initiation of clinical research, it has also led to considerable disruption of clinical trials in other disease areas, with around 80% of non-COVID-19 trials stopped or interrupted during the pandemic. In many cases the disrupted trials will not have the planned statistical power necessary to yield interpretable results. This paper describes methods to compensate for the information loss arising from trial disruptions by incorporating additional information available from auxiliary data sources. The methods described include the use of auxiliary data on baseline and early outcome data available from the trial itself and frequentist and Bayesian approaches for the incorporation of information from external data sources. The methods are illustrated by application to the analysis of artificial data based on the Primary care pediatrics Learning Activity Nutrition (PLAN) study, a clinical trial assessing a diet and exercise intervention for overweight children, that was affected by the COVID-19 pandemic. We show how all of the methods proposed lead to an increase in precision relative to use of complete case data only.

  • 12 authors
·
Jun 22, 2022

T2Ranking: A large-scale Chinese Benchmark for Passage Ranking

Passage ranking involves two stages: passage retrieval and passage re-ranking, which are important and challenging topics for both academics and industries in the area of Information Retrieval (IR). However, the commonly-used datasets for passage ranking usually focus on the English language. For non-English scenarios, such as Chinese, the existing datasets are limited in terms of data scale, fine-grained relevance annotation and false negative issues. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded relevance scores (fine-grained) for query-passage pairs instead of binary relevance judgments (coarse-grained). To ease the false negative issues, more passages with higher diversities are considered when performing relevance annotations, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, other auxiliary resources are also provided, such as query types and XML files of documents which passages are generated from, to facilitate further studies. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and there is still scope for improvement. The full data and all codes are available at https://github.com/THUIR/T2Ranking/

  • 11 authors
·
Apr 7, 2023

SynthRAD2023 Grand Challenge dataset: generating synthetic CT for radiotherapy

Purpose: Medical imaging has become increasingly important in diagnosing and treating oncological patients, particularly in radiotherapy. Recent advances in synthetic computed tomography (sCT) generation have increased interest in public challenges to provide data and evaluation metrics for comparing different approaches openly. This paper describes a dataset of brain and pelvis computed tomography (CT) images with rigidly registered CBCT and MRI images to facilitate the development and evaluation of sCT generation for radiotherapy planning. Acquisition and validation methods: The dataset consists of CT, CBCT, and MRI of 540 brains and 540 pelvic radiotherapy patients from three Dutch university medical centers. Subjects' ages ranged from 3 to 93 years, with a mean age of 60. Various scanner models and acquisition settings were used across patients from the three data-providing centers. Details are available in CSV files provided with the datasets. Data format and usage notes: The data is available on Zenodo (https://doi.org/10.5281/zenodo.7260705) under the SynthRAD2023 collection. The images for each subject are available in nifti format. Potential applications: This dataset will enable the evaluation and development of image synthesis algorithms for radiotherapy purposes on a realistic multi-center dataset with varying acquisition protocols. Synthetic CT generation has numerous applications in radiation therapy, including diagnosis, treatment planning, treatment monitoring, and surgical planning.

  • 9 authors
·
Mar 28, 2023

Enforcing public data archiving policies in academic publishing: A study of ecology journals

To improve the quality and efficiency of research, groups within the scientific community seek to exploit the value of data sharing. Funders, institutions, and specialist organizations are developing and implementing strategies to encourage or mandate data sharing within and across disciplines, with varying degrees of success. Academic journals in ecology and evolution have adopted several types of public data archiving policies requiring authors to make data underlying scholarly manuscripts freely available. Yet anecdotes from the community and studies evaluating data availability suggest that these policies have not obtained the desired effects, both in terms of quantity and quality of available datasets. We conducted a qualitative, interview-based study with journal editorial staff and other stakeholders in the academic publishing process to examine how journals enforce data archiving policies. We specifically sought to establish who editors and other stakeholders perceive as responsible for ensuring data completeness and quality in the peer review process. Our analysis revealed little consensus with regard to how data archiving policies should be enforced and who should hold authors accountable for dataset submissions. Themes in interviewee responses included hopefulness that reviewers would take the initiative to review datasets and trust in authors to ensure the completeness and quality of their datasets. We highlight problematic aspects of these thematic responses and offer potential starting points for improvement of the public data archiving process.

  • 4 authors
·
Oct 30, 2018

Healthsheet: Development of a Transparency Artifact for Health Datasets

Machine learning (ML) approaches have demonstrated promising results in a wide range of healthcare applications. Data plays a crucial role in developing ML-based healthcare systems that directly affect people's lives. Many of the ethical issues surrounding the use of ML in healthcare stem from structural inequalities underlying the way we collect, use, and handle data. Developing guidelines to improve documentation practices regarding the creation, use, and maintenance of ML healthcare datasets is therefore of critical importance. In this work, we introduce Healthsheet, a contextualized adaptation of the original datasheet questionnaire ~gebru2018datasheets for health-specific applications. Through a series of semi-structured interviews, we adapt the datasheets for healthcare data documentation. As part of the Healthsheet development process and to understand the obstacles researchers face in creating datasheets, we worked with three publicly-available healthcare datasets as our case studies, each with different types of structured data: Electronic health Records (EHR), clinical trial study data, and smartphone-based performance outcome measures. Our findings from the interviewee study and case studies show 1) that datasheets should be contextualized for healthcare, 2) that despite incentives to adopt accountability practices such as datasheets, there is a lack of consistency in the broader use of these practices 3) how the ML for health community views datasheets and particularly Healthsheets as diagnostic tool to surface the limitations and strength of datasets and 4) the relative importance of different fields in the datasheet to healthcare concerns.

  • 10 authors
·
Feb 25, 2022