🔭 R&D publications

ViDoRe Benchmark V3: A comprehensive evaluation of RAG in real-world use cases

✍️ G Viaud, Q Macé, A Edy, V Xing, M Faysse, A Loison, T Balough, G de Souza, B Liu | 📅 November 2025 | 🔗 Hugging Face

Discover ViDoRe V3, a benchmark designed and developed with contributions from NVIDIA and ILLUIN Technology to evaluate RAG pipelines on visually rich corporate documents. It comprises 10 datasets and 26,000 pages, with annotations verified by human experts in 6 languages...

Context is Gold to find the Gold Passage: Evaluating and Training Contextual Doc Embeddings

✍️ M Conti, M Faysse, G Viaud, A Bosselut, C Hudelot, P Colombo | 📅 May 2025 | 🔗 arXiv

Current embedding methods encode each passage of a document in isolation, often losing the surrounding context. We introduce ConTEB, a benchmark for context awareness on which state-of-the-art models fail. To remedy this, we propose InSeNT, a contrastive post-training approach combined with late-chunking pooling. It significantly improves retrieval quality, remains efficient, and makes embeddings more robust to suboptimal chunking.
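
The core idea of late-chunking pooling can be sketched as follows: encode the whole document once with a long-context model, then pool the contextualized token embeddings per chunk, so each chunk vector inherits document-wide context. A minimal sketch (random vectors stand in for an encoder's output; shapes and boundaries are illustrative, not the paper's exact recipe):

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, chunk_spans: list) -> np.ndarray:
    """Pool contextualized token embeddings into one vector per chunk.

    Unlike naive chunking (each chunk encoded in isolation), the tokens here
    were embedded with full-document attention, so every chunk vector
    carries context from the rest of the document.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in chunk_spans])

# Stand-in for a long-context encoder's output over a 12-token document.
rng = np.random.default_rng(0)
doc_tokens = rng.normal(size=(12, 8))   # (num_tokens, hidden_dim)
spans = [(0, 4), (4, 9), (9, 12)]       # token boundaries of 3 chunks
chunk_vectors = late_chunk(doc_tokens, spans)
print(chunk_vectors.shape)              # (3, 8)
```

InSeNT then post-trains the encoder contrastively on top of this pooling; the sketch only shows the pooling step.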

ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

✍️ Q Macé, A Loison, M Faysse | 📅 May 2025 | 🔗 arXiv, Google Scholar

The ViDoRe V1 benchmark reached a saturation point, with scores above 90% nDCG@5 limiting the measurement of progress. ViDoRe V2 introduces more realistic and challenging retrieval scenarios: context-blind, long, and cross-document queries, generated through a mix of synthetic and human annotation. It comprises four multilingual datasets with clear instructions. Initial results show substantial headroom for improvement, and the community is invited to enrich this living benchmark.
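
For reference, nDCG@5, the metric on which V1 saturated, rewards relevant documents by rank with a logarithmic discount and normalizes by the ideal ranking. A self-contained sketch, assuming binary relevance for simplicity:

```python
import math

def ndcg_at_k(ranked_relevances: list, k: int = 5) -> float:
    """nDCG@k for one query: DCG of the top-k results, normalized by the
    DCG of an ideal (relevance-sorted) ranking of the same documents."""
    def dcg(rels):
        # rank is 0-indexed, hence the log2(rank + 2) discount
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# The single relevant page retrieved at rank 3 out of 5:
print(round(ndcg_at_k([0, 0, 1, 0, 0]), 3))  # 0.5
```

A score above 0.9 averaged over queries therefore means relevant pages almost always appear at or very near the top.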

EuroBERT: Scaling Multilingual Encoders for European Languages (contribution)

✍️ N Boizard et al | 📅 March 2025 | 🔗 arXiv, Google Scholar

Multilingual vector representations, typically derived from bidirectional encoders, have recently been eclipsed by generative models. However, many recent advances can also benefit encoders. This work presents EuroBERT, a family of multilingual encoders covering European and other widely spoken languages. The models outperform alternatives on many tasks, handle sequences of up to 8,192 tokens, and are released together with data, checkpoints, and training framework.

MMTEB: Massive Multilingual Text Embedding Benchmark (contribution)

✍️ K Enevoldsen et al | 📅 February 2025 | 🔗 arXiv, Google Scholar

Text embeddings are often evaluated on a narrow set of tasks, limited in languages and diversity. To remedy this, MMTEB extends MTEB with over 500 quality-controlled tasks covering 250+ languages, including instruction following, long-document retrieval, and code. Large LLMs perform well, but the best-ranked public model remains multilingual-e5-large-instruct (560M parameters). MMTEB also offers optimized sampling and splits, greatly reducing computation costs while preserving model rankings.

EuroLLM: Multilingual Language Models for Europe (contribution)

✍️ P H Martins, P Fernandes, J Alves, N M Guerreiro, R Rei, D M Alves, J Pombal, A Farajian, M Faysse, M Klimaszewski, P Colombo, B Haddow, J G C de Souza, A Birch, A F T Martins | 📅 September 2024 | 🔗 arXiv, Google Scholar

Open-weight LLMs are making progress but remain English-centric. The EuroLLM project aims to build a suite of multilingual models covering all official EU languages and other key languages. The authors describe data collection and filtering, scaling, the multilingual tokenizer, and modeling choices. They release EuroLLM-1.7B and EuroLLM-1.7B-Instruct, evaluated on multilingual benchmarks and machine translation.

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

✍️ S Muller, A Loison, B Omrani, G Viaud | 📅 September 2024 | 🔗 arXiv, Google Scholar, Connected Papers

RAG is the natural choice for combining LLMs with knowledge bases, but LLM-as-a-Judge evaluation of grounded answers remains problematic. The authors identify 7 failure modes and present GroUSE, a benchmark of 144 unit tests. They show that existing frameworks, even when backed by GPT-4, miss key errors, and that open-source judge models generalize poorly. Finetuning Llama-3 on GPT-4 reasoning traces significantly improves correlation, calibration, and failure detection.

ColPali: Efficient Document Retrieval with Vision Language Models

✍️ M Faysse, H Sibille, T Wu, B Omrani, G Viaud, C Hudelot, P Colombo | 📅 September 2024 | 🔗 arXiv, Hugging Face

Documents convey information not only through text but also through page layout, tables, and fonts, elements that current retrieval systems largely ignore. To address this, the authors introduce ViDoRe, a benchmark of retrieval tasks on visually rich documents, and present ColPali, a vision-language model that generates multi-vector embeddings directly from page images. Using a late-interaction matching mechanism, it far surpasses existing pipelines while being simpler and faster.
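
The late-interaction mechanism follows the ColBERT-style MaxSim scheme: each query-token embedding is matched to its most similar page-patch embedding, and the maxima are summed into a page score. A minimal numpy sketch (random vectors stand in for real model outputs; dimensions are illustrative):

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """MaxSim scoring: for each query token vector, take the maximum
    dot product over all page vectors, then sum over query tokens."""
    sim = query_emb @ page_emb.T        # (n_query_tokens, n_page_vectors)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(6, 128))       # 6 query-token embeddings
page_a = rng.normal(size=(1030, 128))   # multi-vector embedding of one page
page_b = rng.normal(size=(1030, 128))

# Rank candidate pages by their late-interaction score for this query:
scores = {name: late_interaction_score(query, p)
          for name, p in [("page_a", page_a), ("page_b", page_b)]}
best = max(scores, key=scores.get)
```

Because page embeddings are computed offline and only the cheap MaxSim runs at query time, this stays fast despite the multi-vector representation.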

Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism (contribution)

✍️ H Gisserot-Boukhlef, M Faysse, E Malherbe, C Hudelot, P Colombo | 📅 April 2024 | 🔗 arXiv, Google Scholar, Connected Papers

Neural Information Retrieval (NIR) has outperformed heuristic approaches but still often fails to retrieve relevant documents. The authors propose a lightweight abstention mechanism, adapted to real-world constraints and targeting the reranking phase. They introduce an evaluation protocol for the black-box setting, demonstrate the effectiveness of the approach, and present a simple data-driven calibration method. The code is released as open source to facilitate replication and adoption.
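
As a hypothetical illustration of the idea (the paper's data-driven method is more general), the simplest form of reranking abstention is a threshold on the top reranker score, calibrated on a reference set: when even the best candidate looks unconvincing, return nothing rather than mislead downstream generation.

```python
def rerank_with_abstention(scores: list, threshold: float):
    """Return candidate indices ranked by score, or None (abstain) when
    even the best reranker score falls below a calibrated threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if scores[ranked[0]] < threshold:
        return None  # abstain instead of returning likely-irrelevant documents
    return ranked

print(rerank_with_abstention([0.2, 0.9, 0.4], threshold=0.5))  # [1, 2, 0]
print(rerank_with_abstention([0.2, 0.1, 0.3], threshold=0.5))  # None
```

The threshold trades coverage for precision; in a black-box setting it must be fitted on held-out (query, score, relevance) triples rather than derived from model internals.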

Copyright Traps for Large Language Models (contribution)

✍️ M Meeus, I Shilov, M Faysse, Y A de Montjoye | 📅 June 2024 | 🔗 arXiv, Google Scholar, Connected Papers

The use of copyrighted content to train LLMs is the subject of debate, and current membership-inference methods fail on medium-sized models, which memorize little. The authors propose "copyright traps": fictitious sentences inserted into original works. In a controlled protocol, they show that only long sequences repeated many times are detectable (AUC=0.75). The approach also sheds light on the memorization mechanisms of LLMs.

CroissantLLM: A Truly Bilingual French-English Language Model

✍️ M Faysse, P Fernandes, N M Guerreiro, A Loison, D M Alves, C Corro, N Boizard, J Alves, R Rei, P H Martins, A B Casademunt, F Yvon, A F T Martins, G Viaud, C Hudelot, P Colombo | 📅 March 2024 | 🔗 arXiv, Hugging Face, Google Scholar, Connected Papers

CroissantLLM is a 1.3B-parameter model pre-trained on 3T English and French tokens at a 1:1 ratio, with a dedicated tokenizer and bilingual finetuning datasets. It targets strong, fully open performance on consumer hardware. The authors release data, code, checkpoints, and derived models, as well as FrenchBench for French-language evaluation. The model satisfies 81% of the FMTI transparency criteria, far surpassing existing open initiatives and strengthening multilingual research.

Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications

✍️ M Faysse, G Viaud, C Hudelot, P Colombo | 📅 March 2024 | 🔗 ACL (EMNLP), Connected Papers

Instruction Fine-Tuning (IFT) greatly improves the zero-shot capabilities of LLMs but imposes new evaluation requirements. The authors show that LLM-based metrics meet these requirements and use them to analyze different task-specialization strategies. They quantify the associated trade-offs and give practitioners concrete guidance for the industrial deployment of IFT models.

FQuAD2.0: French Question Answering and Knowing When You Don't Know

✍️ Q Heinrich, G Viaud, W Belblidia | 📅 June 2022 | 🔗 ACL (LREC), Connected Papers

Question Answering has made great strides but remains focused on English. For French, Illuin Technology released FQuAD1.1 (60k QA pairs from Wikipedia), whose main limitation is its inability to handle unanswerable questions. FQuAD2.0 adds 17k unanswerable questions, for a total of 80k, enabling the training of models that can distinguish these cases. A fine-tuned CamemBERT-large achieves 82.3% F1 on answerability classification and 83% on reading comprehension.

Structural analysis of an all-purpose question answering model

✍️ V Micheli, Q Heinrich, F Fleuret, W Belblidia | 📅 April 2021 | 🔗 ArXiv, Google Scholar, Connected Papers

Attention is central to pre-trained language models and allows a single model to tackle multiple tasks. The authors present a new multi-task Question Answering model and find that it retains single-task performance despite low transfer between tasks. Their analysis shows that attention heads specialize by task, and that some heads are more decisive than others, in both multi-task and single-task settings.

On the importance of pre-training data volume for compact language models

✍️ V Micheli, M d'Hoffschmidt, F Fleuret | 📅 November 2020 | 🔗 ACL (EMNLP), Connected Papers

Recent language models are resource-intensive. With sustainability in mind, the authors study the impact of pre-training data volume on compact BERT-based models for French. Evaluating them on FQuAD, they show that good performance is reached with as little as 100 MB of text. Moreover, except at very low volumes, intermediate pre-training on task-specific corpora brings no noticeable gain.

FQuAD: French Question Answering Dataset

✍️ M d'Hoffschmidt, W Belblidia, Q Heinrich, T Brendlé, M Vidal | 📅 November 2020 | 🔗 ACL (EMNLP), Connected Papers

Recent advances in NLP have greatly improved reading comprehension, but mainly in English, owing to a lack of resources in other languages. The authors present FQuAD, a native French QA dataset built from Wikipedia: 25k examples for v1.0 and 60k for v1.1. A baseline model reaches 92.2% F1 and 82.1% EM. To track progress, a leaderboard is available, and v1.0 is freely accessible via the dedicated website.
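
The F1 and EM figures follow the standard SQuAD-style extractive-QA definitions: exact match of the normalized answer string, and token-overlap F1 between prediction and gold answer. A simplified sketch (the full SQuAD normalization also strips articles and punctuation, which is omitted here):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """EM: 1 if the normalized strings are identical, else 0."""
    return prediction.strip().lower() == gold.strip().lower()

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of overlapping whitespace tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", " paris "))                 # True
print(round(token_f1("in Paris France", "Paris"), 2))  # 0.5
```

Corpus-level scores such as the 92.2% F1 above are these per-question values averaged over the dataset (taking the max over the gold answer variants of each question).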