pip install text-dedup
or
pip install git+https://github.com/ChenghaoMou/text-dedup
This repository contains a collection of text deduplication scripts that are ready to use, or modify based on your needs:
- RETSim/UniSim, embedding-based near deduplication (WIP)
- MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
- 64 or 128 bit SimHash
- SuffixArray Substring
- Bloom Filter
- Exact Hash (document-level, line-level/ccnet)
I also have big plans for the future:
- Memory benchmark for streaming processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
- A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching
However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the pypi package as well. The reason behind it is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you can understand what is at stake when using them. You can use them to bootstrap your own script, or just use them as a reference.
This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- Gaoya (MIT)
Native PySpark
MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!
Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process the files with your local compute by:
export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
--driver-memory 20g \
--executor-cores 3 \
--num-executors 2 \
--packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
text_dedup/minhash_spark.py \
--input "./temp-data" \
--output "./temp-output" \
--column "text" \
--threshold 0.7
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id : bigint
DEBUG __main__ - text : string
DEBUG __main__ - meta : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__ : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before: 88803
DEBUG __main__ - Number of rows after: 44092
DEBUG __main__ - Percentage of rows kept: 49.65%
DEBUG __main__ - Output: ./temp-output
DEBUG __main__ - Time: 68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
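The `Using B=25, R=10` line above refers to the LSH banding parameters: the `--num_perm` MinHash values are split into B bands of R rows each, and any two documents that collide in at least one band become a candidate duplicate pair. As a rough sketch (not the exact code in `text_dedup/minhash_spark.py`), such parameters are commonly chosen datasketch-style, by minimizing weighted false-positive/false-negative areas under the S-curve:

```python
# Sketch of the standard band/row search (assumed here, not copied from the script):
# pick (B, R) for a Jaccard threshold by minimizing weighted false-positive and
# false-negative areas under P(candidate | similarity s) = 1 - (1 - s^R)^B.
from scipy.integrate import quad


def optimal_param(threshold: float, num_perm: int,
                  fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    def proba(s: float, b: int, r: int) -> float:
        return 1.0 - (1.0 - s ** r) ** b

    best, best_error = (1, num_perm), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            fp, _ = quad(lambda s: proba(s, b, r), 0.0, threshold)        # false positives: below threshold
            fn, _ = quad(lambda s: 1.0 - proba(s, b, r), threshold, 1.0)  # false negatives: above threshold
            error = fp * fp_weight + fn * fn_weight
            if error < best_error:
                best, best_error = (b, r), error
    return best


print(optimal_param(0.7, 250))  # the run above logged B=25, R=10 for threshold=0.7 and num_perm=250
```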
Or take a look at bigcode-v2/run.sh on how to run the job with GCP DataProc.
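The `Edges`/`Vertices`/`Assignment` numbers in the log come from turning band collisions into duplicate pairs and then clustering those pairs into connected components, which is why the `spark-submit` command pulls in `graphframes`. Here is a minimal, pure-Python union-find sketch of that clustering idea (illustrative only; the Spark script does this distributedly):

```python
# Union-find over duplicate pairs: each document ends up labeled with the smallest
# id in its connected component, and one representative per component is kept.
def cluster(edges: list[tuple[int, int]]) -> dict[int, int]:
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id becomes the component label

    for a, b in edges:
        union(a, b)
    return {x: find(x) for x in parent}


assignment = cluster([(0, 1), (1, 2), (5, 6)])  # -> {0: 0, 1: 0, 2: 0, 5: 5, 6: 5}
kept = set(assignment.values())                 # one representative id per duplicate cluster
```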
UniSim (WIP)
Based on Google's RETSim model (Github, Arxiv), it is an embedding-based near-deduplication method.
For a large dataset, it would require GPU(s) for fast inference.
python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question
Output:
INFO Load Dataset : 5.56s
INFO Index Dataset : 8.13s
INFO Clustering : 8.72s
INFO Filtering : 0.35s
INFO Saving : 0.01s
INFO Cleaning : 0.00s
INFO Total : 22.77s
INFO Before : 817
INFO After : 788
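Conceptually, embedding-based near deduplication embeds every document, searches for close neighbors, and drops documents that are too similar to something already kept. The sketch below only illustrates that idea; it is not the UniSim API, `embed` is a stand-in for a real model such as RETSim, and a real run would use an ANN index instead of brute-force comparison:

```python
# Embedding-based near-deduplication sketch (not the UniSim API).
import hashlib
import numpy as np


def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding: a deterministic random vector per text. A real model goes here."""
    vectors = np.stack([
        np.random.default_rng(int.from_bytes(hashlib.blake2b(t.encode(), digest_size=4).digest(), "big"))
        .normal(size=256)
        for t in texts
    ]).astype("float32")
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def near_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    vectors = embed(texts)
    kept_vectors: list[np.ndarray] = []
    kept_texts: list[str] = []
    for text, vec in zip(texts, vectors):
        # cosine similarity against everything kept so far (an ANN index in practice)
        if kept_vectors and float(np.max(np.stack(kept_vectors) @ vec)) >= threshold:
            continue  # near-duplicate of something already kept
        kept_vectors.append(vec)
        kept_texts.append(text)
    return kept_texts


print(near_dedup(["same question", "same question", "a different question"]))
```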
Suffix Array Substring Exact Deduplication
# input
python -m text_dedup.suffix_array \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/suffix_array/oscar_gl_dedup" \
--column "text" \
--google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
--use_auth_token true
# output
INFO Loading : 2.75 seconds
INFO Preprocessing : 4.78 seconds
INFO SuffixArray : 98.29 seconds
INFO SelfSimilar : 4.24 seconds
INFO Restore : 0.25 seconds
INFO Deduplicate : 6.23 seconds
INFO Saving : 8.91 seconds
INFO Total : 125.45 seconds
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
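The heavy lifting in this script is done by Google's deduplicate-text-datasets repository (hence `--google_repo_path`). As a toy illustration only, the core suffix-array idea of finding long repeated byte spans looks roughly like this:

```python
# Toy suffix-array illustration (the real script shells out to the Rust code in
# deduplicate-text-datasets). Sort all suffixes of the corpus; a long common prefix
# between lexicographically adjacent suffixes is a substring occurring at least twice.
def repeated_substrings(corpus: bytes, min_length: int = 50) -> set[bytes]:
    # naive O(n^2 log n) construction: fine for tiny inputs, hopeless for real corpora
    suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    repeats: set[bytes] = set()
    for a, b in zip(suffix_array, suffix_array[1:]):
        x, y = corpus[a:], corpus[b:]
        lcp = 0
        while lcp < min(len(x), len(y)) and x[lcp] == y[lcp]:
            lcp += 1  # length of the longest common prefix of adjacent suffixes
        if lcp >= min_length:
            repeats.add(x[:lcp])
    return repeats


print(repeated_substrings(b"the cat sat on the mat. the cat sat on the sofa.", min_length=10))
```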
MinHash Near Deduplication
# input
python -m text_dedup.minhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/minhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token true
# output
INFO Loading : 2.62 seconds
INFO MinHashing : 0.08 seconds
INFO Clustering : 2.20 seconds
INFO Filtering : 0.53 seconds
INFO Saving : 9.86 seconds
INFO Total : 15.29 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 44124 (49.69%)
INFO Duplicate Number : 44679 (50.31%)
INFO 🤗 Happy Deduplicating 🤗
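For intuition, here is a compact pure-Python sketch of the MinHash + LSH idea behind this script. The real implementation is vectorized and uses datasketch-style permutations plus the banding parameters discussed in the Spark section, so treat the hashing choices below as illustrative:

```python
# MinHash + LSH sketch: hash word n-grams with several seeded hash functions, keep
# the minimum per seed as the signature, then bucket signature bands so that
# documents sharing any band become candidate duplicate pairs.
import hashlib
from collections import defaultdict


def ngrams(text: str, n: int = 5) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def minhash(text: str, num_perm: int = 64) -> list[int]:
    shingles = ngrams(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=seed.to_bytes(2, "little")).digest(), "big"
            )
            for s in shingles
        )
        for seed in range(num_perm)  # one seeded hash function per signature slot
    ]


def candidate_pairs(docs: list[str], num_perm: int = 64, bands: int = 16) -> set[tuple[int, int]]:
    rows = num_perm // bands
    buckets: dict[tuple[int, tuple[int, ...]], list[int]] = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        signature = minhash(doc, num_perm)
        for band in range(bands):
            buckets[(band, tuple(signature[band * rows:(band + 1) * rows]))].append(doc_id)
    return {(a, b) for ids in buckets.values() for a in ids for b in ids if a < b}


docs = [
    "the quick brown fox jumps over the lazy dog " * 3,
    "the quick brown fox jumps over the lazy dog " * 3 + "plus one extra clause",
    "a completely different document about something else entirely " * 3,
]
print(candidate_pairs(docs))  # near-duplicates 0 and 1 will very likely share a band
```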
SimHash Near Deduplication
# input
python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/simhash/oscar_gl_dedup" \
--column "text" \
--batch_size 10000 \
--use_auth_token true
# output
INFO Loading : 2.60 seconds
INFO SimHashing : 0.04 seconds
INFO Indexing : 28.88 seconds
INFO Filtering : 0.88 seconds
INFO Saving : 10.41 seconds
INFO Total : 42.80 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 46163 (51.98%)
INFO Duplicate Number : 42640 (48.02%)
INFO 🤗 Happy Deduplicating 🤗
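A minimal 64-bit SimHash looks like the sketch below (illustrative only; the actual script is more optimized and indexes fingerprints for fast Hamming-distance lookups):

```python
# 64-bit SimHash sketch: hash each feature (character trigrams here), add +1/-1 per
# bit position, and take the sign of each column. Near-duplicate documents produce
# fingerprints that differ in only a few bits (small Hamming distance).
import hashlib


def simhash64(text: str) -> int:
    weights = [0] * 64
    features = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]  # character trigrams
    for feature in features:
        h = int.from_bytes(hashlib.blake2b(feature.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))  # near-identical texts should land within a few bits of each other
```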
Exact Hash Exact Deduplication
# input
python -m text_dedup.exact_hash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/exact_hash/oscar_gl_dedup" \
--column "text" \
--batch_size 1000 \
--use_auth_token true
# output
INFO Loading : 2.95s
INFO Processing : 3.79s
INFO Filtering : 0.10s
INFO Saving : 2.89s
INFO Total : 9.72s
INFO Before : 88803
INFO After : 47049
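Document-level exact hashing is the simplest of the scripts: hash each document's text and keep only the first occurrence of each digest. A sketch of that idea follows (the whitespace normalization is an illustrative choice, not necessarily what the script does; the line-level/ccnet variant applies the same idea per line):

```python
# Exact deduplication sketch: one digest per (lightly normalized) document,
# keeping only the first document seen for each digest.
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[bytes] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.md5(" ".join(doc.split()).encode("utf-8")).digest()  # collapse whitespace, then hash
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


print(exact_dedup(["hello  world", "hello world", "goodbye"]))  # -> ['hello  world', 'goodbye']
```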
Bloom Filter Exact Deduplication
# input
python -m text_dedup.bloom_filter \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output "output/bloom_filter/oscar_gl_dedup" \
--error_rate 1e-5 \
--column "text" \
--use_auth_token true --batch_size 1000
# output
INFO Loading : 2.72s
INFO Processing : 4.84s
INFO Filtering : 0.10s
INFO Saving : 2.88s
INFO Total : 10.54s
INFO Before : 88803
INFO After : 47045
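The Bloom filter variant trades a small, configurable false-positive rate (`--error_rate`) for bounded memory: unique documents are occasionally mistaken for duplicates, but memory does not grow with a per-document hash set. A self-contained sketch of the idea (not the script's actual implementation, which relies on a Bloom filter library):

```python
# Bloom-filter deduplication sketch: k seeded hash functions set k bits per document;
# if all k bits are already set, the document is treated as a (probable) duplicate.
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 1e-5):
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)  # number of bits
        self.k = max(1, round(self.m / capacity * math.log(2)))                  # number of hash functions
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        for seed in range(self.k):
            digest = hashlib.blake2b(item.encode(), digest_size=8, salt=seed.to_bytes(2, "little")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> bool:
        """Add item; return True if it was (probably) seen before."""
        positions = list(self._positions(item))
        present = all(self.bits[p // 8] >> (p % 8) & 1 for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return present


bloom = BloomFilter(capacity=100_000, error_rate=1e-5)
kept = [doc for doc in ["a", "b", "a"] if not bloom.add(doc)]  # -> ['a', 'b'], the duplicate is dropped
```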
Note
The Spark implementation has some overhead for small datasets, so I recommend using that script only when you have a large dataset and enough compute resources.
pinecone/core-2020-05-10-deduplication
See tests/benchmark_core.py for reproduction.
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| UniSim | 0.9307 | 0.8924 | 0.9055 | 0.9394 | 0.9181 | 0.9054 | 1305.79s |
| MinHash Spark | 0.957 | 0.9445 | 0.9471 | 0.959 | 0.952 | 0.9202 | 691.77s |
| MinHash | 0.9594 | 0.9445 | 0.9474 | 0.9616 | 0.9534 | 0.924 | 18.88s |
| SimHash | 0.9042 | 0.721 | 0.792 | 0.9329 | 0.8481 | 0.8321 | 644.36s |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
| Exact Title Matching 1 | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching 1 | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity 1 | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method 1 | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE 2 | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE 2 | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base 2 | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH 2 | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup 2 | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup 2 | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
NEWS-COPY
See tests/benchmark_news.py for reproduction.
Adjusted Rand Index (ARI) on NEWS-COPY dataset:
| Model/Algorithm | ARI |
|---|---|
| SimHash | 0.612 |
| MinHash (Spark) | 0.740 |
| MinHash | 0.742 |
| RETSim Near-Dup + ANN* | 0.051 |
| n-gram 3 | 0.440 |
| SimHash 2 | 0.695 |
| MinHash 3 | 0.737 |
| MinHash 2 | 0.783 |
| Multilingual USE 2 | 0.730 |
| Multilingual E5-Base 2 | 0.742 |
| S-BERT 3 | 0.700 |
| RETSim Partial-Dup 2 | 0.831 |
| RETSim Near-Dup 2 | 0.704 |
| Re-ranking 3 | 0.937 |
| Bi-encoder 3 | 0.915 |
*: I can't seem to reproduce the results from the paper.
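For reference, the Adjusted Rand Index compares a predicted clustering of the documents with the gold duplicate clusters while correcting for chance agreement (1.0 means identical clusterings, around 0.0 means no better than random). With scikit-learn it is a single call; the labels below are made up for illustration:

```python
from sklearn.metrics import adjusted_rand_score

gold_clusters      = [0, 0, 1, 1, 2, 2, 2]  # reference duplicate clusters
predicted_clusters = [0, 0, 1, 2, 2, 2, 2]  # clusters produced by some dedup method

print(adjusted_rand_score(gold_clusters, predicted_clusters))
```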
Generally, you can cite this repository as:
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and
Chris Ha and
Kenneth Enevoldsen and
Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:
@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}