License: CC BY 4.0
arXiv:2404.16130v1 [cs.CL] 24 Apr 2024

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge1†, Ha Trinh1†, Newman Cheng2, Joshua Bradley2, Alex Chao3, Apurva Mody3, Steven Truitt2, Jonathan Larson1

1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO

{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}@microsoft.com

†These authors contributed equally to this work
Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

1 Introduction

Human endeavors across a range of domains rely on our ability to read and reason about large collections of documents, often reaching conclusions that go beyond anything stated in the source texts themselves. With the emergence of large language models (LLMs), we are already witnessing attempts to automate human-like sensemaking in complex domains like scientific discovery (Microsoft, 2023) and intelligence analysis (Ranade and Joshi, 2023), where sensemaking is defined as “a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively” (Klein et al., 2006a). Supporting human-led sensemaking over entire text corpora, however, needs a way for people to both apply and refine their mental model of the data (Klein et al., 2006b) by asking questions of a global nature.

Retrieval-augmented generation (RAG; Lewis et al., 2020) is an established approach to answering user questions over entire datasets, but it is designed for situations where these answers are contained locally within regions of text whose retrieval provides sufficient grounding for the generation task. Instead, a more appropriate task framing is query-focused summarization (QFS; Dang, 2006), and in particular, query-focused abstractive summarization that generates natural language summaries and not just concatenated excerpts (Yao et al., 2017; Baumel et al., 2018; Laskar et al., 2020). In recent years, however, such distinctions between summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document versus multi-document, have become less relevant. While early applications of the transformer architecture showed substantial improvements on the state-of-the-art for all such summarization tasks (Liu and Lapata, 2019; Laskar et al., 2022; Goodwin et al., 2020), these tasks are now trivialized by modern LLMs, including the GPT (Brown et al., 2020; Achiam et al., 2023), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, all of which can use in-context learning to summarize any content provided in their context window.

[Figure 1 pipeline diagram: Source Documents → Text Chunks (text extraction and chunking) → Element Instances (domain-tailored summarization) → Element Summaries (domain-tailored summarization) → Graph Communities (community detection) → Community Summaries (domain-tailored summarization) → Community Answers (query-focused summarization) → Global Answer (query-focused summarization); stages are split between indexing time and query time.]
Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., Leiden; Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, covariates) that the LLM can summarize in parallel at both indexing time and query time. The “global answer” to a given query is produced using a final round of query-focused summarization over all community summaries reporting relevance to that query.

The challenge remains, however, for query-focused abstractive summarization over an entire corpus. Such volumes of text can greatly exceed the limits of LLM context windows, and the expansion of such windows may not be enough given that information can be “lost in the middle” of longer contexts (Liu et al., 2023; Kuratov et al., 2024). In addition, although the direct retrieval of text chunks in naïve RAG is likely inadequate for QFS tasks, it is possible that an alternative form of pre-indexing could support a new RAG approach specifically targeting global summarization.

In this paper, we present a Graph RAG approach based on global summarization of an LLM-derived knowledge graph (Figure 1). In contrast with related work that exploits the structured retrieval and traversal affordances of graph indexes (subsection 4.2), we focus on a previously unexplored quality of graphs in this context: their inherent modularity (Newman, 2006) and the ability of community detection algorithms to partition graphs into modular communities of closely-related nodes (e.g., Louvain, Blondel et al., 2008; Leiden, Traag et al., 2019). LLM-generated summaries of these community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.

To evaluate this approach, we used an LLM to generate a diverse set of activity-centered sensemaking questions from short descriptions of two representative real-world datasets, containing podcast transcripts and news articles respectively. For the target qualities of comprehensiveness, diversity, and empowerment (defined in subsection 3.4) that develop understanding of broad issues and themes, we both explore the impact of varying the hierarchical level of community summaries used to answer queries, and compare against naïve RAG and global map-reduce summarization of source texts. We show that all global approaches outperform naïve RAG on comprehensiveness and diversity, and that Graph RAG with intermediate- and low-level community summaries shows favorable performance over source text summarization on these same metrics, at lower token costs.

2 Graph RAG Approach & Pipeline

We now unpack the high-level data flow of the Graph RAG approach (Figure 1) and pipeline, describing key design parameters, techniques, and implementation details for each step.

2.1 Source Documents → Text Chunks

A fundamental design decision is the granularity with which input texts extracted from source documents should be split into text chunks for processing. In the following step, each of these chunks will be passed to a set of LLM prompts designed to extract the various elements of a graph index. Longer text chunks require fewer LLM calls for such extraction, but suffer from the recall degradation of longer LLM context windows (Liu et al., 2023; Kuratov et al., 2024). This behavior can be observed in Figure 2 in the case of a single extraction round (i.e., with zero gleanings): on a sample dataset (HotPotQA; Yang et al., 2018), using a chunk size of 600 tokens extracted almost twice as many entity references as using a chunk size of 2400. While more references are generally better, any extraction process needs to balance recall and precision for the target activity.
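As a concrete illustration of this design decision, fixed-size chunking with overlap (as used for the evaluation datasets in subsection 3.1: 600-token chunks with 100-token overlaps) can be sketched as below. This is a minimal sketch that assumes tokenization has already been performed by some tokenizer; it is not the paper's implementation.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    # Split a token sequence into fixed-size chunks, where consecutive
    # chunks share `overlap` tokens so that no entity mention is cut
    # cleanly in half at a chunk boundary.
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the tail of the sequence
    return chunks
```

Smaller `chunk_size` values trade more LLM extraction calls for better recall, matching the tradeoff shown in Figure 2.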

[Figure 2 plot: entity references detected (y-axis, 0 to 30,000) versus number of gleanings performed (x-axis, 0 to 3), with one series each for chunk sizes of 600, 1200, and 2400 tokens.]
Figure 2: How the entity references detected in the HotPotQA dataset (Yang et al., 2018) varies with chunk size and gleanings for our generic entity extraction prompt with gpt-4-turbo.

2.2 Text Chunks → Element Instances

The baseline requirement for this step is to identify and extract instances of graph nodes and edges from each chunk of source text. We do this using a multipart LLM prompt that first identifies all entities in the text, including their name, type, and description, before identifying all relationships between clearly-related entities, including the source and target entities and a description of their relationship. Both kinds of element instance are output in a single list of delimited tuples.
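A minimal sketch of parsing such a delimited-tuple list follows. The record tags (`entity`, `relationship`) and delimiters are illustrative assumptions for exposition, not the exact output format of the paper's prompts.

```python
def parse_elements(llm_output, record_sep="\n", field_sep="|"):
    # Parse an extraction response formatted as one delimited tuple per
    # record: (entity|name|type|description) for nodes and
    # (relationship|source|target|description) for edges.
    entities, relationships = [], []
    for line in llm_output.strip().split(record_sep):
        fields = [f.strip() for f in line.strip("() ").split(field_sep)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append(tuple(fields[1:]))
        elif fields[0] == "relationship" and len(fields) == 4:
            relationships.append(tuple(fields[1:]))
        # malformed records are silently skipped; a production parser
        # would log them for prompt debugging
    return entities, relationships
```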

The primary opportunity to tailor this prompt to the domain of the document corpus lies in the choice of few-shot examples provided to the LLM for in-context learning (Brown et al., 2020). For example, while our default prompt extracting the broad class of “named entities” like people, places, and organizations is generally applicable, domains with specialized knowledge (e.g., science, medicine, law) will benefit from few-shot examples specialized to those domains. We also support a secondary extraction prompt for any additional covariates we would like to associate with the extracted node instances. Our default covariate prompt aims to extract claims linked to detected entities, including the subject, object, type, description, source text span, and start and end dates.

To balance the needs of efficiency and quality, we use multiple rounds of “gleanings”, up to a specified maximum, to encourage the LLM to detect any additional entities it may have missed on prior extraction rounds. This is a multi-stage process in which we first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision. If the LLM responds that entities were missed, then a continuation indicating that “MANY entities were missed in the last extraction” encourages the LLM to glean these missing entities. This approach allows us to use larger chunk sizes without a drop in quality (Figure 2) or the forced introduction of noise.
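The gleaning loop above can be sketched as follows. Here `llm` is a placeholder for any completion callable, and the prompts are paraphrases; the paper forces the yes/no assessment via a logit bias of 100, which is elided here since it is a property of the model API call rather than the control flow.

```python
def extract_with_gleanings(chunk, llm, max_gleanings=1):
    # Initial extraction pass over the chunk.
    records = llm(f"Extract all entities from:\n{chunk}")
    for _ in range(max_gleanings):
        # Yes/no check (logit-biased to a forced decision in the paper).
        if llm("Were all entities extracted? Answer YES or NO.") == "YES":
            break
        # Continuation asserting that entities were missed, encouraging
        # the model to glean them in an additional round.
        records += llm("MANY entities were missed in the last extraction. Add them.")
    return records
```

Capping the rounds at `max_gleanings` bounds cost, which is what allows larger chunk sizes without the recall drop shown in Figure 2.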

2.3 Element Instances → Element Summaries

The use of an LLM to “extract” descriptions of entities, relationships, and claims represented in source texts is already a form of abstractive summarization, relying on the LLM to create independently meaningful summaries of concepts that may be implied but not stated by the text itself (e.g., the presence of implied relationships). To convert all such instance-level summaries into single blocks of descriptive text for each graph element (i.e., entity node, relationship edge, and claim covariate) requires a further round of LLM summarization over matching groups of instances.

A potential concern at this stage is that the LLM may not consistently extract references to the same entity in the same text format, resulting in duplicate entity elements and thus duplicate nodes in the entity graph. However, since all closely-related “communities” of entities will be detected and summarized in the following step, and given that LLMs can understand the common entity behind multiple name variations, our overall approach is resilient to such variations provided there is sufficient connectivity from all variations to a shared set of closely-related entities.

Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks.

(a) Root communities at level 0   (b) Sub-communities at level 1
Figure 3: Graph communities detected using the Leiden algorithm (Traag et al., 2019) over the MultiHop-RAG (Tang and Yang, 2024) dataset as indexed. Circles represent entity nodes with size proportional to their degree. Node layout was performed via OpenORD (Martin et al., 2011) and Force Atlas 2 (Jacomy et al., 2014). Node colors represent entity communities, shown at two levels of hierarchical clustering: (a) Level 0, corresponding to the hierarchical partition with maximum modularity, and (b) Level 1, which reveals internal structure within these root-level communities.

2.4 Element Summaries → Graph Communities

The index created in the previous step can be modelled as a homogeneous undirected weighted graph in which entity nodes are connected by relationship edges, with edge weights representing the normalized counts of detected relationship instances. Given such a graph, a variety of community detection algorithms may be used to partition the graph into communities of nodes with stronger connections to one another than to the other nodes in the graph (e.g., see the surveys by Fortunato, 2010 and Jin et al., 2021). In our pipeline, we use Leiden (Traag et al., 2019) on account of its ability to recover hierarchical community structure of large-scale graphs efficiently (Figure 3). Each level of this hierarchy provides a community partition that covers the nodes of the graph in a mutually-exclusive, collectively-exhaustive way, enabling divide-and-conquer global summarization.
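Building the weighted graph can be sketched as below. Dividing each edge's instance count by the total count is one plausible reading of "normalized counts", not necessarily the paper's exact normalization, and the hierarchical Leiden step itself would typically be delegated to a library such as `leidenalg` or `graspologic` rather than implemented by hand.

```python
from collections import Counter

def build_graph(relationship_instances):
    # Count how often each undirected (source, target) pair was detected
    # across all extraction rounds, then normalize counts into weights.
    counts = Counter(frozenset((s, t)) for s, t, *_ in relationship_instances)
    total = sum(counts.values())
    # Keys are sorted tuples so each undirected edge has one canonical form.
    return {tuple(sorted(e)): c / total for e, c in counts.items()}
```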

2.5 Graph Communities → Community Summaries

The next step is to create report-like summaries of each community in the Leiden hierarchy, using a method designed to scale to very large datasets. These summaries are independently useful in their own right as a way to understand the global structure and semantics of the dataset, and may themselves be used to make sense of a corpus in the absence of a question. For example, a user may scan through community summaries at one level looking for general themes of interest, then follow links to the reports at the lower level that provide more details for each of the subtopics. Here, however, we focus on their utility as part of a graph-based index used for answering global queries.

Community summaries are generated in the following way:

  • Leaf-level communities. The element summaries of a leaf-level community (nodes, edges, covariates) are prioritized and then iteratively added to the LLM context window until the token limit is reached. The prioritization is as follows: for each community edge in decreasing order of combined source and target node degree (i.e., overall prominence), add descriptions of the source node, target node, linked covariates, and the edge itself.


  • Higher-level communities. If all element summaries fit within the token limit of the context window, proceed as for leaf-level communities and summarize all element summaries within the community. Otherwise, rank sub-communities in decreasing order of element summary tokens and iteratively substitute sub-community summaries (shorter) for their associated element summaries (longer) until fit within the context window is achieved.


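The leaf-level prioritization above can be sketched as follows. Covariate handling is omitted and character counts stand in for token counts; `describe` is a hypothetical callback producing the description block for an edge and its endpoint nodes.

```python
def prioritize_edges(edges, degree):
    # Order community edges by combined source and target node degree
    # (i.e., overall prominence), most prominent first.
    return sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)

def build_context(edges, degree, describe, token_limit, count_tokens=len):
    # Iteratively add edge/node descriptions in priority order until the
    # context window's token limit would be exceeded.
    context, used = [], 0
    for edge in prioritize_edges(edges, degree):
        block = describe(edge)
        if used + count_tokens(block) > token_limit:
            break
        context.append(block)
        used += count_tokens(block)
    return context
```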

2.6 Community Summaries → Community Answers → Global Answer

Given a user query, the community summaries generated in the previous step can be used to generate a final answer in a multi-stage process. The hierarchical nature of the community structure also means that questions can be answered using the community summaries from different levels, raising the question of whether a particular level in the hierarchical community structure offers the best balance of summary detail and scope for general sensemaking questions (evaluated in section 3).

For a given community level, the global answer to any user query is generated as follows:

  • Prepare community summaries. Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.


  • Map community answers. Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0-100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.


  • Reduce to global answer. Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached. This final context is used to generate the global answer returned to the user.


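The three query-time steps above can be sketched end to end. Here `llm` is a placeholder for any completion call returning an (answer, helpfulness score) pair, the prompts are illustrative, and character counts stand in for token counts.

```python
import random

def global_answer(query, community_summaries, llm, token_limit=8000,
                  count_tokens=len, seed=0):
    # Prepare: shuffle summaries and pack them into chunks of bounded size,
    # so relevant information is distributed rather than concentrated.
    summaries = list(community_summaries)
    random.Random(seed).shuffle(summaries)
    chunks, current, used = [], [], 0
    for s in summaries:
        if current and used + count_tokens(s) > token_limit:
            chunks.append("\n".join(current)); current, used = [], 0
        current.append(s); used += count_tokens(s)
    if current:
        chunks.append("\n".join(current))

    # Map: one scored intermediate answer per chunk; score-0 answers dropped.
    scored = [(score, ans) for chunk in chunks
              for ans, score in [llm(f"Answer '{query}' using:\n{chunk}")]
              if score > 0]

    # Reduce: add answers in descending helpfulness order until the limit,
    # then generate the final global answer from that context.
    scored.sort(key=lambda p: p[0], reverse=True)
    context, used = [], 0
    for score, ans in scored:
        if used + count_tokens(ans) > token_limit:
            break
        context.append(ans); used += count_tokens(ans)
    answer, _ = llm(f"Combine into a final answer to '{query}':\n" + "\n".join(context))
    return answer
```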

3 Evaluation

3.1 Datasets

We selected two datasets in the one million token range, each equivalent to about 10 novels of text and representative of the kind of corpora that users may encounter in their real world activities:

  • Podcast transcripts. Compiled transcripts of podcast conversations between Kevin Scott, Microsoft CTO, and other technology leaders (Behind the Tech; Scott, 2024). Size: 1669 × 600-token text chunks, with 100-token overlaps between chunks (∼1 million tokens).


  • News articles. Benchmark dataset comprising news articles published from September 2013 to December 2023 in a range of categories, including entertainment, business, sports, technology, health, and science (MultiHop-RAG; Tang and Yang, 2024). Size: 3197 × 600-token text chunks, with 100-token overlaps between chunks (∼1.7 million tokens).



3.2 Queries

Many benchmark datasets for open-domain question answering exist, including HotPotQA (Yang et al., 2018), MultiHop-RAG (Tang and Yang, 2024), and MT-Bench (Zheng et al., 2024). However, the associated question sets target explicit fact retrieval rather than summarization for the purpose of data sensemaking, i.e., the process through which people inspect, engage with, and contextualize data within the broader scope of real-world activities (Koesten et al., 2021). Similarly, methods for extracting latent summarization queries from source texts also exist (Xu and Lapata, 2021), but such extracted questions can target details that betray prior knowledge of the texts.

To evaluate the effectiveness of RAG systems for more global sensemaking tasks, we need questions that convey only a high-level understanding of dataset contents, and not the details of specific texts. We used an activity-centered approach to automate the generation of such questions: given a short description of a dataset, we asked the LLM to identify N potential users and N tasks per user, then for each (user, task) combination, we asked the LLM to generate N questions that require understanding of the entire corpus. For our evaluation, a value of N = 5 resulted in 125 test questions per dataset. Table 1 shows example questions for each of the two evaluation datasets.
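The nested generation loop above yields N × N × N questions (125 for N = 5). A minimal sketch, with `llm` standing in for any completion callable that returns a list and the prompts paraphrased:

```python
def generate_questions(dataset_description, llm, n=5):
    # Activity-centered generation: N users, N tasks per user, and N
    # corpus-level questions per (user, task) pair, i.e. n**3 in total.
    questions = []
    users = llm(f"Identify {n} potential users of: {dataset_description}")
    for user in users:
        tasks = llm(f"Identify {n} tasks for user: {user}")
        for task in tasks:
            questions += llm(
                f"As {user} doing {task}, pose {n} questions that require "
                f"understanding of the entire corpus.")
    return questions
```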

Dataset: Podcast transcripts
User: A tech journalist looking for insights and trends in the tech industry
Task: Understanding how tech leaders view the role of policy and regulation
Questions:
1. Which episodes deal primarily with tech policy and government regulation?
2. How do guests perceive the impact of privacy laws on technology development?
3. Do any guests discuss the balance between innovation and ethical considerations?
4. What are the suggested changes to current policies mentioned by the guests?
5. Are collaborations between tech companies and governments discussed and how?

Dataset: News articles
User: Educator incorporating current affairs into curricula
Task: Teaching about health and wellness
Questions:
1. What current topics in health can be integrated into health education curricula?
2. How do news articles address the concepts of preventive medicine and wellness?
3. Are there examples of health articles that contradict each other, and if so, why?
4. What insights can be gleaned about public health priorities based on news coverage?
5. How can educators use the dataset to highlight the importance of health literacy?

Table 1: Examples of potential users, tasks, and questions generated by the LLM based on short descriptions of the target datasets. Questions target global understanding rather than specific details.

3.3 Conditions

We compare six different conditions in our analysis, including Graph RAG using four levels of graph communities (C0, C1, C2, C3), a text summarization method applying our map-reduce approach directly to source texts (TS), and a naïve “semantic search” RAG approach (SS):

  • C0. Uses root-level community summaries (fewest in number) to answer user queries.


  • C1. Uses high-level community summaries to answer queries. These are sub-communities of C0, if present, otherwise C0 communities projected down.


  • C2. Uses intermediate-level community summaries to answer queries. These are sub-communities of C1, if present, otherwise C1 communities projected down.


  • C3. Uses low-level community summaries (greatest in number) to answer queries. These are sub-communities of C2, if present, otherwise C2 communities projected down.


  • TS. The same method as in subsection 2.6, except source texts (rather than community summaries) are shuffled and chunked for the map-reduce summarization stages.


  • SS. An implementation of naïve RAG in which text chunks are retrieved and added to the available context window until the specified token limit is reached.



The size of the context window and the prompts used for answer generation are the same across all six conditions (except for minor modifications to reference styles to match the types of context information used). Conditions only differ in how the contents of the context window are created.

The graph index supporting conditions C0-C3 was created using our generic prompts for entity and relationship extraction only, with entity types and few-shot examples tailored to the domain of the data. The graph indexing process used a context window size of 600 tokens with 1 gleaning for the Podcast dataset and 0 gleanings for the News dataset.

Question 问题 Which public figures are repeatedly mentioned across various entertainment articles?
哪些公众人物在各种娱乐文章中被反复提及?
Graph RAG 图 RAG Overview of Prominent Public Figures in Entertainment
知名娱乐界公众人物概览
The entertainment industry is vast and diverse, encompassing film, television, music, sports, and digital media. Certain public figures stand out due to their significant contributions and influence across these sectors. The following summary highlights key individuals who are repeatedly mentioned in various entertainment articles, reflecting their impact and presence within the industry.
娱乐产业庞大而多样,包括电影、电视、音乐、体育和数字媒体等领域。某些公众人物因其在这些领域的重大贡献和影响力而脱颖而出。以下概要强调了在各种娱乐文章中反复提及的关键人物,反映了他们在行业中的影响和存在。
Actors and Directors […] Public Figures in Controversy […] Musicians and Executives […] Athletes and Coaches […] Influencers and Entrepreneurs […]
The repeated mention of these figures in entertainment articles signifies their ongoing relevance and the public’s interest in their work. Their influence spans across various aspects of entertainment, from shaping cultural narratives in film and television to driving trends in music and digital media. These individuals not only contribute to their respective fields but also influence the broader cultural landscape, often becoming central figures in social discussions and public discourse.
Naïve RAG: Public figures who are repeatedly mentioned across various entertainment articles include Taylor Swift, Travis Kelce, Britney Spears, and Justin Timberlake. These individuals have been highlighted for various reasons, ranging from their professional achievements to their personal lives.
Taylor Swift […] Travis Kelce […] Britney Spears […] Justin Timberlake […]
These figures are frequently covered due to their high-profile status and the public’s interest in their careers and personal lives. Their activities, whether in music, sports, or personal relationships, have significant cultural and economic impacts, as evidenced by the media coverage and public reactions.
LLM Decision. Comprehensiveness: Winner=1 (Graph RAG)

Answer 1 is better because it provides a more comprehensive and detailed list of public figures from a wider range of entertainment sectors, including film, television, music, sports, gaming, and digital media. It also includes specific examples of their contributions and the impact they have on their respective fields, as well as mentions of controversies and their implications. Answer 2, while detailed in its coverage of a few individuals, is limited to a smaller number of public figures and focuses primarily on their personal lives and relationships rather than a broad spectrum of their professional influence across the entertainment industry.

Diversity: Winner=1 (Graph RAG)

Answer 1 is better because it provides a more varied and rich response by covering a wide range of public figures from different sectors of the entertainment industry, including film, television, music, sports, gaming, and digital media. It offers insights into the contributions and influence of these figures, as well as controversies and their impact on public discourse. The answer also cites specific data sources for each mentioned figure, indicating a diverse range of evidence to support the claims. In contrast, Answer 2 focuses on a smaller group of public figures, primarily from the music industry and sports, and relies heavily on a single source for data, which makes it less diverse in perspectives and insights.

Empowerment: Winner=1 (Graph RAG)

Answer 1 is better because it provides a comprehensive and structured overview of public figures across various sectors of the entertainment industry, including film, television, music, sports, and digital media. It lists multiple individuals, providing specific examples of their contributions and the context in which they are mentioned in entertainment articles, along with references to data reports for each claim. This approach helps the reader understand the breadth of the topic and make informed judgments without being misled. In contrast, Answer 2 focuses on a smaller group of public figures and primarily discusses their personal lives and relationships, which may not provide as broad an understanding of the topic. While Answer 2 also cites sources, it does not match the depth and variety of Answer 1.

Directness: Winner=2 (Naïve RAG)

Answer 2 is better because it directly lists specific public figures who are repeatedly mentioned across various entertainment articles, such as Taylor Swift, Travis Kelce, Britney Spears, and Justin Timberlake, and provides concise explanations for their frequent mentions. Answer 1, while comprehensive, includes a lot of detailed information about various figures in different sectors of entertainment, which, while informative, does not directly answer the question with the same level of conciseness and specificity as Answer 2.
Table 2: Example question for the News article dataset, with generated answers from Graph RAG (C2) and Naïve RAG, as well as LLM-generated assessments.

3.4 Metrics

LLMs have been shown to be good evaluators of natural language generation, achieving state-of-the-art or competitive results compared against human judgements (Wang et al., 2023a; Zheng et al., 2024). While this approach can generate reference-based metrics when gold standard answers are known, it is also capable of measuring the qualities of generated texts (e.g., fluency) in a reference-free style (Wang et al., 2023a) as well as in head-to-head comparison of competing outputs (LLM-as-a-judge; Zheng et al., 2024). LLMs have also shown promise at evaluating the performance of conventional RAG systems, automatically evaluating qualities like context relevance, faithfulness, and answer relevance (RAGAS; Es et al., 2023).

Given the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the lack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head comparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are desirable for sensemaking activities, as well as a control metric (directness) used as an indicator of validity. Since directness is effectively in opposition to comprehensiveness and diversity, we would not expect any method to win across all four metrics.

Our head-to-head measures computed using an LLM evaluator are as follows:

  • Comprehensiveness. How much detail does the answer provide to cover all aspects and details of the question?


  • Diversity. How varied and rich is the answer in providing different perspectives and insights on the question?


  • Empowerment. How well does the answer help the reader understand and make informed judgements about the topic?


  • Directness. How specifically and clearly does the answer address the question?



For our evaluation, the LLM is provided with the question, target metric, and a pair of answers, and asked to assess which answer is better according to the metric, as well as why. It returns the winner if one exists, otherwise a tie if they are fundamentally similar and the differences are negligible. To account for the stochasticity of LLMs, we run each comparison five times and use mean scores. Table 2 shows an example of LLM-generated assessment.
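The evaluation protocol above can be sketched as follows. This is an illustrative outline, not the paper's code; `judge` is a hypothetical stand-in for the LLM call, returning 1, 2, or 0 (tie) given the question, metric, and both answers, and ties are counted as half a point for each side before averaging:

```python
import statistics

def head_to_head(question, answer1, answer2, metric, judge, runs=5):
    """Score a pair of answers on one metric with a repeated LLM judge.

    Each run asks the judge which answer is better according to the
    metric; repeating and averaging smooths out LLM stochasticity,
    matching the five-run mean-score protocol described in the text.
    """
    wins1 = []
    for _ in range(runs):
        verdict = judge(question, metric, answer1, answer2)
        wins1.append(1.0 if verdict == 1 else 0.5 if verdict == 0 else 0.0)
    score1 = statistics.mean(wins1)
    return score1, 1.0 - score1  # mean win rates for answer 1 and answer 2
```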

3.5 Configuration

The effect of context window size on any particular task is unclear, especially for models like gpt-4-turbo with a large context size of 128k tokens. Given the potential for information to be “lost in the middle” of longer contexts (Liu et al., 2023; Kuratov et al., 2024), we wanted to explore the effects of varying the context window size for our combinations of datasets, questions, and metrics. In particular, our goal was to determine the optimum context size for our baseline condition (SS) and then use this uniformly for all query-time LLM use. To that end, we tested four context window sizes: 8k, 16k, 32k and 64k. Surprisingly, the smallest context window size tested (8k) was universally better for all comparisons on comprehensiveness (average win rate of 58.1%), while performing comparably with larger context sizes on diversity (average win rate = 52.4%) and empowerment (average win rate = 51.3%). Given our preference for more comprehensive and diverse answers, we therefore used a fixed context window size of 8k tokens for the final evaluation.

3.6 Results

The indexing process resulted in a graph consisting of 8564 nodes and 20691 edges for the Podcast dataset, and a larger graph of 15754 nodes and 19520 edges for the News dataset. Table 3 shows the number of community summaries at different levels of each graph community hierarchy.

[Figure 4: head-to-head win-rate heatmaps (Comprehensiveness, Diversity, Empowerment, Directness) for the Podcast transcripts and News articles datasets.]

Figure 4: Head-to-head win rate percentages of (row condition) over (column condition) across two datasets, four metrics, and 125 questions per comparison (each repeated five times and averaged). The overall winner per dataset and metric is shown in bold. Self-win rates were not computed but are shown as the expected 50% for reference. All Graph RAG conditions outperformed naïve RAG on comprehensiveness and diversity. Conditions C1-C3 also showed slight improvements in answer comprehensiveness and diversity over TS (global text summarization without a graph index).
          Podcast Transcripts                          News Articles
          C0      C1      C2      C3      TS           C0      C1      C2       C3       TS
Units     34      367     969     1310    1669         55      555     1797     2142     3197
Tokens    26657   225756  565720  746100  1014611      39770   352641  980898   1140266  1707694
% Max     2.6     22.2    55.8    73.5    100          2.3     20.7    57.4     66.8     100
Table 3: Number of context units (community summaries for C0-C3 and text chunks for TS), corresponding token counts, and percentage of the maximum token count. Map-reduce summarization of source texts is the most resource-intensive approach requiring the highest number of context tokens. Root-level community summaries (C0) require dramatically fewer tokens per query (9x-43x).

Global approaches vs. naïve RAG. As shown in Figure 4, global approaches consistently outperformed the naïve RAG (SS) approach in both comprehensiveness and diversity metrics across datasets. Specifically, global approaches achieved comprehensiveness win rates between 72-83% for Podcast transcripts and 72-80% for News articles, while diversity win rates ranged from 75-82% and 62-71% respectively. Our use of directness as a validity test also achieved the expected results, i.e., that naïve RAG produces the most direct responses across all comparisons.
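Win-rate percentages like those in Figure 4 can be aggregated from the pairwise judgements as sketched below. This is an illustrative reconstruction (the tie-counts-as-half convention and 50% diagonal are assumptions drawn from the figure's description, not the paper's published code):

```python
from collections import defaultdict

def win_rate_matrix(comparisons, conditions):
    """Aggregate pairwise outcomes into a win-rate matrix.

    `comparisons` is a list of (condition_a, condition_b, winner)
    tuples, where winner is condition_a, condition_b, or None for a
    tie. Returns rates[a][b] = fraction of a-vs-b comparisons won by
    a, with ties worth 0.5; the diagonal is the 50% reference value.
    """
    points = defaultdict(float)
    totals = defaultdict(int)
    for a, b, winner in comparisons:
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        if winner is None:
            points[(a, b)] += 0.5
            points[(b, a)] += 0.5
        else:
            loser = b if winner == a else a
            points[(winner, loser)] += 1.0
    rates = {c: {} for c in conditions}
    for a in conditions:
        for b in conditions:
            if a == b:
                rates[a][b] = 0.5
            else:
                n = totals[(a, b)]
                rates[a][b] = points[(a, b)] / n if n else 0.0
    return rates
```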

Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, community summaries generally provided a small but consistent improvement in answer comprehensiveness and diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text summarization: for low-level community summaries (C3), Graph RAG required 26-33% fewer context tokens, while for root-level community summaries (C0), it required over 97% fewer tokens. For a modest drop in performance compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity (62% win rate) over naïve RAG.

Empowerment. Empowerment comparisons showed mixed results for both global approaches versus naïve RAG (SS) and Graph RAG approaches versus source text summarization (TS). Ad-hoc LLM use to analyze LLM reasoning for this measure indicated that the ability to provide specific examples, quotes, and citations was judged to be key to helping users reach an informed understanding. Tuning element extraction prompts may help to retain more of these details in the Graph RAG index.

4 Related Work

4.1 RAG Approaches and Systems

When using LLMs, RAG involves first retrieving relevant information from external data sources, then adding this information to the context window of the LLM along with the original query (Ram et al.,, 2023). Naïve RAG approaches (Gao et al.,, 2023) do this by converting documents to text, splitting text into chunks, and embedding these chunks into a vector space in which similar positions represent similar semantics. Queries are then embedded into the same vector space, with the text chunks of the nearest k vectors used as context. More advanced variations exist, but all solve the problem of what to do when an external dataset of interest exceeds the LLM’s context window.
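The nearest-k retrieval step just described can be sketched with exhaustive cosine similarity over a toy in-memory index. This is a minimal illustration under simplifying assumptions: real systems use a learned embedding model and typically an approximate nearest-neighbour index rather than brute-force search:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunk_index, k=2):
    """Return the k chunks whose embeddings are nearest the query.

    `chunk_index` maps chunk text -> embedding vector; the retrieved
    chunks would then be concatenated into the LLM's context window.
    """
    scored = sorted(chunk_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```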

Advanced RAG systems include pre-retrieval, retrieval, and post-retrieval strategies designed to overcome the drawbacks of naïve RAG, while modular RAG systems include patterns for iterative and dynamic cycles of interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem; Cheng et al., 2024) for generation-augmented retrieval (GAR; Mao et al., 2020) that facilitates future generation cycles, while our parallel generation of community answers from these summaries is a kind of iterative (Iter-RetGen; Shao et al., 2023) or federated (FeB4RAG; Wang et al., 2024) retrieval-generation strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID; Su et al., 2020) and multi-hop question answering (ITRG; Feng et al., 2023; IR-CoT; Trivedi et al., 2022; DSP; Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings (RAPTOR; Sarthi et al., 2024) or generating a “tree of clarifications” to answer multiple interpretations of ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the kind of self-generated graph index that enables Graph RAG.

4.2 Graphs and LLMs

Use of graphs in connection with LLMs and RAG is a developing research area, with multiple directions already established. These include using LLMs for knowledge graph creation (Trajanoska et al., 2023) and completion (Yao et al., 2023), as well as for the extraction of causal graphs (Ban et al., 2023; Zhang et al., 2024) from source texts. They also include forms of advanced RAG (Gao et al., 2023) where the index is a knowledge graph (KAPING; Baek et al., 2023), where subsets of the graph structure (G-Retriever; He et al., 2024) or derived graph metrics (Graph-ToolFormer; Zhang, 2023) are the objects of enquiry, where narrative outputs are strongly grounded in the facts of retrieved subgraphs (SURGE; Kang et al., 2023), where retrieved event-plot subgraphs are serialized using narrative templates (FABULA; Ranade and Joshi, 2023), and where the system supports both creation and traversal of text-relationship graphs for multi-hop question answering (Wang et al., 2023b). In terms of open-source software, a variety of graph databases are supported by both the LangChain (LangChain, 2024) and LlamaIndex (LlamaIndex, 2024) libraries, while a more general class of graph-based RAG applications is also emerging, including systems that can create and reason over knowledge graphs in both Neo4J (NaLLM; Neo4J, 2024) and NebulaGraph (GraphRAG; NebulaGraph, 2024) formats. Unlike our Graph RAG approach, however, none of these systems use the natural modularity of graphs to partition data for global summarization.

5 Discussion

Limitations of evaluation approach. Our evaluation to date has only examined a certain class of sensemaking questions for two corpora in the region of 1 million tokens. More work is needed to understand how performance varies across different ranges of question types, data types, and dataset sizes, as well as to validate our sensemaking questions and target metrics with end users. Comparison of fabrication rates, e.g., using approaches like SelfCheckGPT (Manakul et al.,, 2023), would also improve on the current analysis.

Trade-offs of building a graph index. We consistently observed Graph RAG achieve the best head-to-head results against other methods, but in many cases the graph-free approach to global summarization of source texts performed competitively. The real-world decision about whether to invest in building a graph index depends on multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value obtained from other aspects of the graph index (including the generic community summaries and the use of other graph-related RAG approaches).

Future work. The graph index, rich text annotations, and hierarchical community structure supporting the current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports before employing our map-reduce summarization mechanisms. This “roll-up” operation could also be extended across more levels of the community hierarchy, as well as implemented as a more exploratory “drill down” mechanism that follows the information scent contained in higher-level community summaries.
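The exploratory “drill down” mechanism could, for example, recursively descend into child communities whose summaries look relevant to the query. The sketch below is speculative (this future-work mechanism is not the published implementation); `relevant` and the community-node shape are hypothetical stand-ins for an embedding-similarity or LLM relevance test over the community hierarchy:

```python
def drill_down(community, query, relevant, max_depth=3):
    """Follow the information scent down a community hierarchy.

    Starting from a root community, descend into children whose
    summaries pass the `relevant(summary, query) -> bool` test,
    collecting every summary encountered along the way. Communities
    are dicts with "summary" and "children" keys.
    """
    collected = [community["summary"]]
    if max_depth > 0:
        for child in community.get("children", []):
            if relevant(child["summary"], query):
                collected.extend(drill_down(child, query, relevant, max_depth - 1))
    return collected
```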

6 Conclusion

We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. Initial evaluations show substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce source text summarization. For situations requiring many global queries over the same dataset, summaries of root-level communities in the entity-based graph index provide a data index that is both superior to naïve RAG and achieves competitive performance to other global methods at a fraction of the token cost.

An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

Acknowledgements

We would also like to thank the following people who contributed to the work: Alonso Guevara Fernández, Amber Hoak, Andrés Morales Esquivel, Ben Cutler, Billie Rinaldi, Chris Sanchez, Chris Trevino, Christine Caggiano, David Tittsworth, Dayenne de Souza, Douglas Orbaker, Ed Clark, Gabriel Nieves-Ponce, Gaudy Blanco Meneses, Kate Lytvynets, Katy Smith, Mónica Carvajal, Nathan Evans, Richard Ortega, Rodrigo Racanicci, Sarah Smith, and Shane Solomon.

References

  • Achiam et al., (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Anil et al., (2023) Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Baek et al., (2023) Baek, J., Aji, A. F., and Saffari, A. (2023). Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136.
  • Ban et al., (2023) Ban, T., Chen, L., Wang, X., and Chen, H. (2023). From query tools to causal architects: Harnessing large language models for advanced causal discovery from data.
  • Baumel et al., (2018) Baumel, T., Eyal, M., and Elhadad, M. (2018). Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704.
  • Blondel et al., (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008.
  • Brown et al., (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Cheng et al., (2024) Cheng, X., Luo, D., Chen, X., Liu, L., Zhao, D., and Yan, R. (2024). Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems, 36.
  • Dang, (2006) Dang, H. T. (2006). Duc 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pages 48–55.
  • Es et al., (2023) Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2023). Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217.
  • Feng et al., (2023) Feng, Z., Feng, X., Zhao, D., Yang, M., and Qin, B. (2023). Retrieval-generation synergy augmented large language models. arXiv preprint arXiv:2310.05149.
  • Fortunato, (2010) Fortunato, S. (2010). Community detection in graphs. Physics reports, 486(3-5):75–174.
  • Gao et al., (2023) Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
  • Goodwin et al., (2020) Goodwin, T. R., Savery, M. E., and Demner-Fushman, D. (2020). Flight of the pegasus? comparing transformers on few-shot and zero-shot multi-document abstractive summarization. In Proceedings of COLING. International Conference on Computational Linguistics, volume 2020, page 5640. NIH Public Access.
  • He et al., (2024) He, X., Tian, Y., Sun, Y., Chawla, N. V., Laurent, T., LeCun, Y., Bresson, X., and Hooi, B. (2024). G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630.
  • Jacomy et al., (2014) Jacomy, M., Venturini, T., Heymann, S., and Bastian, M. (2014). Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PLoS ONE 9(6): e98679. https://doi.org/10.1371/journal.pone.0098679.
  • Jin et al., (2021) Jin, D., Yu, Z., Jiao, P., Pan, S., He, D., Wu, J., Philip, S. Y., and Zhang, W. (2021). A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering, 35(2):1149–1170.
  • Kang et al., (2023) Kang, M., Kwak, J. M., Baek, J., and Hwang, S. J. (2023). Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846.
  • Khattab et al., (2022) Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., and Zaharia, M. (2022). Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024.
  • Kim et al., (2023) Kim, G., Kim, S., Jeon, B., Park, J., and Kang, J. (2023). Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. arXiv preprint arXiv:2310.14696.
  • Klein et al., (2006a) Klein, G., Moon, B., and Hoffman, R. R. (2006a). Making sense of sensemaking 1: Alternative perspectives. IEEE Intelligent Systems, 21(4):70–73.
  • Klein et al., (2006b) Klein, G., Moon, B., and Hoffman, R. R. (2006b). Making sense of sensemaking 2: A macrocognitive model. IEEE Intelligent Systems, 21(5):88–92.
  • Koesten et al., (2021) Koesten, L., Gregory, K., Groth, P., and Simperl, E. (2021). Talking datasets–understanding data sensemaking behaviours. International journal of human-computer studies, 146:102562.
  • Kuratov et al., (2024) Kuratov, Y., Bulatov, A., Anokhin, P., Sorokin, D., Sorokin, A., and Burtsev, M. (2024). In search of needles in a 11m haystack: Recurrent memory finds what llms miss.
  • LangChain, (2024) LangChain (2024). Langchain graphs. https://python.langchain.com/docs/use_cases/graph/.
  • Laskar et al., (2020) Laskar, M. T. R., Hoque, E., and Huang, J. (2020). Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. In Advances in Artificial Intelligence: 33rd Canadian Conference on Artificial Intelligence, Canadian AI 2020, Ottawa, ON, Canada, May 13–15, 2020, Proceedings 33, pages 342–348. Springer.
  • Laskar et al., (2022) Laskar, M. T. R., Hoque, E., and Huang, J. X. (2022). Domain adaptation with pre-trained transformers for query-focused abstractive text summarization. Computational Linguistics, 48(2):279–320.
  • Lewis et al., (2020) Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  • Liu et al., (2023) Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  • Liu and Lapata, (2019) Liu, Y. and Lapata, M. (2019). Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164.
  • LlamaIndex, (2024) LlamaIndex (2024). LlamaIndex Knowledge Graph Index. https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/KnowledgeGraphDemo.html.
  • Manakul et al., (2023) Manakul, P., Liusie, A., and Gales, M. J. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  • Mao et al., (2020) Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., and Chen, W. (2020). Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553.
  • Martin et al., (2011) Martin, S., Brown, W. M., Klavans, R., and Boyack, K. (2011). Openord: An open-source toolbox for large graph layout. SPIE Conference on Visualization and Data Analysis (VDA).
  • Microsoft, (2023) Microsoft (2023). The impact of large language models on scientific discovery: a preliminary study using GPT-4.
  • NebulaGraph, (2024) NebulaGraph (2024). Nebulagraph launches industry-first graph rag: Retrieval-augmented generation with llm based on knowledge graphs. https://www.nebula-graph.io/posts/graph-RAG.
  • Neo4J, (2024) Neo4J (2024). Project NaLLM. https://github.com/neo4j/NaLLM.
  • Newman, (2006) Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582.
  • Ram et al., (2023) Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.
  • Ranade and Joshi, (2023) Ranade, P. and Joshi, A. (2023). Fabula: Intelligence report generation using retrieval-augmented narrative construction. arXiv preprint arXiv:2310.13848.
  • Sarthi et al., (2024) Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. (2024). Raptor: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059.
  • Scott, (2024) Scott, K. (2024). Behind the Tech. https://www.microsoft.com/en-us/behind-the-tech.
  • Shao et al., (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., and Chen, W. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
  • Su et al., (2020) Su, D., Xu, Y., Yu, T., Siddique, F. B., Barezi, E. J., and Fung, P. (2020). CAiRE-COVID: A question answering and query-focused multi-document summarization system for COVID-19 scholarly information management. arXiv preprint arXiv:2005.03975.
  • Tang and Yang, (2024) Tang, Y. and Yang, Y. (2024). MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391.
  • Touvron et al., (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Traag et al., (2019) Traag, V. A., Waltman, L., and Van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1).
  • Trajanoska et al., (2023) Trajanoska, M., Stojanov, R., and Trajanov, D. (2023). Enhancing knowledge graph construction using large language models. arXiv preprint arXiv:2305.04676.
  • Trivedi et al., (2022) Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.
  • Wang et al., (2023a) Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu, J., Qu, J., and Zhou, J. (2023a). Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.
  • Wang et al., (2024) Wang, S., Khramtsova, E., Zhuang, S., and Zuccon, G. (2024). Feb4rag: Evaluating federated search in the context of retrieval augmented generation. arXiv preprint arXiv:2402.11891.
  • Wang et al., (2023b) Wang, Y., Lipka, N., Rossi, R. A., Siu, A., Zhang, R., and Derr, T. (2023b). Knowledge graph prompting for multi-document question answering.
  • Xu and Lapata, (2021) Xu, Y. and Lapata, M. (2021). Text summarization with latent queries. arXiv preprint arXiv:2106.00104.
  • Yang et al., (2018) Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Yao et al., (2017) Yao, J.-g., Wan, X., and Xiao, J. (2017). Recent advances in document summarization. Knowledge and Information Systems, 53:297–336.
  • Yao et al., (2023) Yao, L., Peng, J., Mao, C., and Luo, Y. (2023). Exploring large language models for knowledge graph completion.
  • Zhang, (2023) Zhang, J. (2023). Graph-ToolFormer: To empower LLMs with graph reasoning ability via prompt augmented by ChatGPT. arXiv preprint arXiv:2304.11116.
  • Zhang et al., (2024) Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wang, C. (2024). Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301.
  • Zheng et al., (2024) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.