这是用户在 2024-10-15 15:53 为 https://moosie.us/parade_db_ecto 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Putting a full power search engine in Ecto
在 Ecto 中集成全功率搜索引擎

Posted: 9/25/2024, 11:14 AM
Updated: 9/25/2024, 01:20 PM
Changelog

Mosquito PR MK XVI

Preface: Existing options for search
前言:现有的搜索选项

I'd like to start by not-so-briefly outlining the existing approaches to text search and their use-cases, primarily around Postgres.
我想先不那么简略地概述一下现有的文本搜索方法及其应用场景,主要围绕 Postgres 展开。

Basic string comparisons 基本字符串比较

These methods are ideal for searching short strings such as product codes, addresses, or names:
这些方法非常适合搜索产品代码、地址或名称等短字符串:

  • LIKE and ILIKE - Operators that provide basic string and sub-string matching.
    LIKEILIKE - 提供基本字符串和子字符串匹配的运算符。
  • Fuzzy matching / fuzzystrmatch - Including Soundex, Metaphone, and Levenstein Distance.
    模糊匹配 / fuzzystrmatch - 包含 Soundex、Metaphone 和 Levenstein 距离。
  • Trigrams / pg_tgrm - Measures the similarity of words based on 3-character segments.
    基于 3 字符片段测量词语相似度。

They all have certain advantages when it comes to handling partial matches, misspellings, and phonetically similar names, but aren't suitable for searching longer text passages.
在处理部分匹配、拼写错误和发音相似的名称时,它们都具有一定的优势,但不适用于搜索较长的文本段落。

Postgres Full Text Search
PostgreSQL 全文搜索

Postgres' Full Text Search is best characterized by its convenience-to-capability ratio. It accomodates basic document retrieval without the need to synchronize or Extract-Transform-Load (ETL) your data to another service, but languishes in terms of capability and ranking of results.
Postgres 的全文搜索以其便利性与功能性的比例为最佳特征。它无需同步或进行数据提取-转换-加载(ETL)至其他服务,即可实现基本的文档检索,但在功能性和结果排序方面则显得力不从心。

Dedicated search engines 专用搜索引擎

Dedicated search engines are standalone applications characterized by the following:
专用搜索引擎是独立应用程序,具有以下特点:

  • They implement BM25 text search and ranking, the gold standard for search results.
    他们采用 BM25 文本搜索与排序,这是搜索结果的黄金标准。
  • Support a variety of data types for querying, sorting, and analysis.
    支持多种数据类型进行查询、排序和分析。
  • Make it reasonably straightforward to perform complex queries.
    使执行复杂查询变得合理简单。
  • Maintain fast response times over massive datasets using specialized data structures.
    使用专门的数据结构,在海量数据集上保持快速响应时间。
  • Offer some sort of horizontal scaling or sharding.
    提供某种水平扩展或分片。

Incumbent offerings in the search engine space include Apache Solr, Elasticsearch, and more recently OpenSearch. All of these are built on top of Apache Lucene and the JVM. A newer wave of search engines has also come about, built on Tantivy and Rust. Two prominent examples include Meilisearch (focuses on simplicity) and Quickwit (focused on logs and object storage backing).
搜索引擎领域的现有产品包括 Apache Solr、Elasticsearch,以及最近出现的 OpenSearch。这些引擎均基于 Apache Lucene 和 JVM 构建。新一代搜索引擎也应运而生,它们基于 Tantivy 和 Rust 构建。其中两个突出的例子是 Meilisearch(专注于简洁性)和 Quickwit(专注于日志和对象存储支持)。

A major drawback of search engines is that they must be kept synchronized with your database:
搜索引擎的一个主要缺点是它们必须与您的数据库保持同步:

  • Your application must reflect any creation, update, or deletion in your database to the search engine.
    您的应用程序必须将数据库中的任何创建、更新或删除操作反映到搜索引擎中。
  • Small and large data pushes to search engines often require separate code paths. Small updates call incremental procedures, whereas large updates often necessitate rebuilding search indexes entirely.
    向搜索引擎推送小数据和大数据通常需要不同的代码路径。小更新调用增量过程,而大更新通常需要完全重建搜索索引。
  • Dissonance often occurs when either system provides some query functionality you need to utilize in the same context.
    当两个系统在同一上下文中提供您需要使用的某些查询功能时,不协调现象常常发生。
  • You must manage deploying and hosting any search engine, in addition to your existing infrastructure.
    您必须管理搜索引擎的部署和托管,此外还要管理现有的基础设施。

Introducing ParadeDB (in Ecto)
介绍 ParadeDB(在 Ecto 中)

ParadeDB is a set of extensions that add pretty amazing search and analytics features to Postgres. In particular, ParadeDB embeds Tantivy as an extension via pgrx.
ParadeDB 是一组扩展,为 Postgres 增添了极为出色的搜索和分析功能。特别是,ParadeDB 通过 pgrx 嵌入了 Tantivy 作为扩展。

A few weeks ago I began exploring what it’d take to support ParadeDB in Ecto. My primary goal was to figure out how much work it'd take, and what the developer experience would feel like. I've created an initial implementation and sample project to demonstrate. The sample project contains transcripts of publicly available police scanner calls generated by OpenAI's Whisper.
几周前,我开始探索在 Ecto 中支持 ParadeDB 所需的条件。我的主要目标是弄清楚需要多少工作量,以及开发者体验会是怎样的。我已经创建了一个初步的实现和示例项目来展示。该示例项目包含了由 OpenAI 的 Whisper 生成的公开可用的警方扫描仪通话记录。

A simple search query for all calls with the word "drive" or "driving" in them would look like this:
包含“drive”或“driving”的所有通话的简单搜索查询如下所示:

from(
  c in Call,
  search: parse(c, "transcript:drive")
) |> Repo.all()

The above example uses ParadeDB's mini query language, which can be accessed via parse/2. More advanced and structured queries are also available.
上述示例使用了 ParadeDB 的迷你查询语言,可通过 parse/2 访问。更高级和结构化的查询同样可用。

We could limit our search query to calls between 5 and 10 seconds long, and retrieve fewer fields using Ecto's existing select/3:
我们可以将搜索查询限制在 5 到 10 秒长的通话范围内,并利用 Ecto 现有的 select/3 检索更少的字段:

from(
  c in Call,
  search: parse(c, "transcript:drive"),
  search: int4range(c.call_length, 5, 10, "[]"),
  select: {c.id, c.transcript}
) |> Repo.all()
[
  {423,
   "Ambulance 725 probably was responding interperson from a fall. PLS, 15,320 Pine Orchard Drive, unit number 1 at Foxtrot."},
  {172,
   "Charlie here for truck inspection. We have on track transportation driving a white van with no logo. This is the first time. Please look out for us."},
  {37,
   "Morning, Sam. Go ahead. I have a hit and run on Woodfield and Deanna Drive and the Mary 2. I only have Mary 33 right now. All right. You can hold it."}
]

For a more maximal demonstration, here's a rather complex query:
为了更极致的展示,这里有一个相当复杂的查询:

slop = 1

query =
  from(
    c in Call,
    search: boolean([
      must: [
        parse(c, "transcript:ambulance"),
        disjunction_max([
          boost(parse(c, "transcript:(BLS ALS)"), 10),
          parse(c, "transcript:(fall)"),
          phrase(c.transcript, ["routine", "response"], ^slop)
        ])
      ],
    ])
  )

Repo.all(query)

This query will: 此查询将:

  • Return calls with the word "ambulance" in them
    返回包含 "ambulance" 的通话记录
  • Include at least one of: "ALS", "BLS", "fall", or "routine response" within a word of one another.
    在一个词中至少包含以下之一: "ALS""BLS""fall""routine response"
  • Rank calls that mention "ALS" or "BLS" higher, having them appear first.
    将提及 "ALS""BLS" 的排名调高,使其优先显示。
  • Map the results to %Call{} structs as like any other Ecto query.
    将结果映射到 %Call{} 结构体,如同其他 Ecto 查询一样。

To show how the search query can compose with the rest of SQL and Ecto, let's narrow the results somewhat to suit this post:
为了展示搜索查询如何与 SQL 和 Ecto 的其他部分组合,让我们稍微缩小结果范围以适应这篇文章:

query
|> join(:inner, [c], tg in assoc(c, :talk_group))
|> where([_, tg], ilike(tg.tag, "%dispatch%"))
#  work-alike to: where([c, _], c.call_length <= 20)
|> search([c, _], int4range(c.call_length, nil, 20, "(]"))
|> select([c, tg], %{
  id: c.id,
  text: c.transcript,
  duration: c.call_length,
  talk_group: tg.description
})
|> Repo.all()
[
  %{
    id: 170,
    text: "Trader Joe's, 6831 Wisconsin Avenue, Suite 400, Cross Street, Stanford Street, Decrease LLC, ALS 1, Paramedic in 706, Ambulance 741, Bravo, Response 7, Alpha 1, Box 061, 1524.",
    duration: 13,
    talk_group: "7A2 Dispatch"
  },
  %{
    id: 375,
    text: "Are the mailboxes 4302 Randolph Road, Crosstown, Veras, Miller Road, Interperson, Firm of Fall, BOS, Ambulance 705, Response 701, Vox Air 21-2, 1545.",
    duration: 11,
    talk_group: "7A2 Dispatch"
  }
]

If you're absolutely curious what the original audio sounds like, here you are: 170 and 375. Be aware that they start with loud alert tones.
如果你绝对好奇原始音频听起来是什么样子,这里就是:170 和 375。请注意,它们以响亮的警报声开始。

Overall I'm quite pleased with the results thus far!
总的来说,我对迄今为止的结果相当满意!

  • Search indexes are handily created in migration files. ParadeDB transparently updates the search index when rows are inserted, updated, or deleted, while retaining ACID compliance. This obviates the need to maintain any synchronization or ETL pipeline between your database and search service.
    搜索索引在迁移文件中便捷地创建。ParadeDB 在插入、更新或删除行时透明地更新搜索索引,同时保持 ACID 合规性。这消除了在数据库与搜索服务之间维护同步或 ETL 管道的需要。
  • Search queries compose quite fluently with the rest of SQL. This flexibility does introduce some performance considerations, but it's nice to even have the discretion available.
    搜索查询与 SQL 的其他部分流畅地组合在一起。这种灵活性确实带来了一些性能方面的考虑,但能够拥有这种选择权还是很不错的。
  • Ecto provides tons of features search engine clients often lack, providing an unparalleled (to my knowledge) developer experience.
    Ecto 提供了大量搜索引擎客户端通常缺乏的功能,提供了无与伦比的(据我所知)开发者体验。
  • Although not demonstrated here, it's possible to perform search queries across separate indexes and JOIN them as you would any other SQL query.
    尽管这里没有展示,但可以在不同的索引之间执行搜索查询,并将它们像其他 SQL 查询一样进行 JOIN 操作。

Current implementation and next steps
当前实现与下一步计划

The current Ecto implementation needs a lot more work before it’s production-ready. ParadeDB already has several especially lucrative features that I’ve yet to expose:
当前的 Ecto 实现还需要更多的工作才能达到生产就绪状态。ParadeDB 已经具备多个极具吸引力的特性,我尚未将其公开:

  • BM25 scores can be returned from ranked queries.
    BM25 分数可以从排序查询中返回。
  • Searches also support hybrid search when used with pg_vector and AI generated embeddings.
    搜索还支持与 pg_vector 和 AI 生成的嵌入结合使用的混合搜索。
  • Aggregates and facets are available, although limited to enterprise customers for now. (This will probably change at some point)
    聚合和分面功能可用,但目前仅限企业客户使用。(这一点可能会在某个时候发生变化)
  • ParadeDB 0.10.0 just dropped, and adds lots of optimizations and features I've not been able to touch yet.
    ParadeDB 0.10.0 刚刚发布,新增了大量优化和功能,我尚未有机会深入探讨。

There’s also lots more getting the implementation 'generally correct', well tested, and working with all of Ecto’s existing features.
还有许多工作需要确保实现“基本正确”,经过充分测试,并与 Ecto 的所有现有功能兼容。

Looking forward 期待着

My tentative plan's to continue working on this with the goal of maintaining a downstream fork of :ecto and :ecto_sql others can use. That said, I’m just one person, so it’ll probably be a while before I’m able to fully realize that. If any folks are interested in contributing to accelerate the process, drop a PR, or message me on the Elixir Slack or Discord, @Moosieus.
我初步计划是继续致力于此,目标是维护一个 :ecto:ecto_sql 的下游分支,供其他人使用。话虽如此,我毕竟只是一个人,所以可能需要一段时间才能完全实现这一目标。如果有朋友对此感兴趣并希望加速进程,欢迎提交 PR,或在 Elixir Slack 或 Discord 上联系我,@Moosieus。

ParadeDB itself is only about a year old, so it's early days yet. The core team's small but talented, and they've fostered a community with lots of active contributions. Tantivy itself is also seeing continued improvement which will benefit ParadeDB in turn.
ParadeDB 本身仅有一年左右的历史,因此目前仍处于早期阶段。核心团队虽小但才华横溢,他们培养了一个充满活跃贡献者的社区。Tantivy 本身也在持续改进,这将反过来使 ParadeDB 受益。

The code I'm working on is located here: moosieus/ecto and moosieus/ecto_sql.
我正在处理的代码位于这里:moosieus/ecto 和 moosieus/ecto_sql。

Q&A 问答

Why not just use fragments?
为什么不直接使用片段?

ParadeDB introduces the need to compose its own queries within Ecto, which is a more comprehensive problem than fragments can address.
ParadeDB 引入了在 Ecto 中编写自定义查询的需求,这比片段所能解决的问题更为复杂。

Doesn't Postgrex support extensions?
Postgrex 不支持扩展吗?

Postgrex extensions allow users to encode and decode additional data types. ParadeDB's API uses Postgres' existing data types, so the extension API isn't of use here.
Postgrex 扩展允许用户编码和解码额外的数据类型。ParadeDB 的 API 使用了 Postgres 现有的数据类型,因此扩展 API 在此处并不适用。

Wouldn't a fork not get updates from Ecto?
叉子不会从 Ecto 获取更新吗?

It'd be a downstream fork, meaning changes from Ecto would be merged in and published regularly. The versions might have to look a bit funny though, something like 3.10.2-paradedb-v1. I haven't fully thought this part out.
这将是一个下游分支,意味着 Ecto 的变更将被合并并定期发布。版本号可能需要看起来有点特别,比如 3.10.2-paradedb-v1 。我还没有完全想清楚这部分。