Background and challenges

The Institute of Automation, part of the Chinese Academy of Sciences, has world-leading technical strength in artificial intelligence, pattern recognition, and automation control. As challenges to academic integrity grow more severe, the Institute has taken on the mission of promoting research integrity and is committed to keeping the academic ecosystem healthy through technological innovation. In building an active monitoring platform for academic misconduct in scientific research papers, the Institute faced several complex technical challenges, stemming both from the complexity of the business requirements and from the difficulty of the technical implementation.

The platform must detect academic misconduct along four dimensions: "anti-copying, anti-plagiarism, anti-generation, and anti-tampering", with monitoring that spans the text level, the semantic level, and the image level. Multi-dimensional detection means the system has to process text, image, and semantic data at the same time, which places very high performance demands on the storage and retrieval systems. In addition, the massive volume of paper data from SCI academic journals calls for a technical architecture that supports large-scale data storage, efficient retrieval, and real-time analysis.

Technical solutions

The platform is built on a multi-modal, multi-type, multi-collaboration academic integrity model and provides "anti-copying, anti-plagiarism, anti-generation" detection capabilities. It is divided into a portal side and a management side. The portal provides user registration and login, paper detection, and a personal center, so that users can easily operate the system and view information. The management side provides system settings, paper detection, feature database management, expert review, and information management, and is used for system administration, data analysis, and monitoring.

To meet these complex business needs, the platform uses several specialized databases, each for what it does best. Qdrant is a high-performance vector database designed for high-dimensional vector data. The platform uses its vector storage and similarity search capabilities to store and quickly retrieve high-dimensional dense vectors, which provides the technical basis for semantic similarity comparison of text and images. Elasticsearch is a distributed search engine built on Lucene. It gives the platform full-text search, word segmentation, and aggregation analysis, and its query performance and analysis functions meet the complex retrieval needs of a massive corpus of scientific papers. PostgreSQL, a mature relational database, stores the structured metadata of papers (title, author, publication date, abstract, and so on), and its JSON data type also covers the storage of semi-structured data.
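To make the idea of similarity search over dense vectors concrete, here is a minimal sketch in plain Python. It is not the platform's code and does not use the Qdrant client: it ranks vectors by brute-force cosine similarity, whereas a vector database such as Qdrant relies on an approximate index to avoid scanning every vector. The paper IDs and the 3-dimensional embeddings are hypothetical, chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(paper_id, cosine_similarity(query, vec)) for paper_id, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real text or image embeddings have hundreds of dimensions.
corpus = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
}
results = top_k_similar([1.0, 0.0, 0.0], corpus, k=2)
for paper_id, score in results:
    print(f"{paper_id}: {score:.4f}")
```

A submitted paper whose embedding scores close to 1.0 against a stored paper would be a candidate for closer review. The threshold for flagging is a policy decision that this sketch does not cover.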

To handle the O&M burden of running several specialized databases (Qdrant, Elasticsearch, and PostgreSQL), the Institute of Automation chose KubeBlocks as a unified database management platform. Normally, an O&M team has to master the operational skills specific to each database, which is costly to learn and creates system risk when operations are performed inconsistently. KubeBlocks provides standardized deployment templates and a unified interface, so O&M personnel can deploy, scale, back up, and monitor the different databases without learning the technical details of each one. This unified management model reduces the learning burden and, more importantly, standardizes O&M operations and keeps them consistent, which significantly lowers the risk of human error and improves overall management efficiency.

The platform runs on Kubernetes 1.27, with KubeBlocks 0.9.2 managing the databases. The cluster has more than 12 TB of memory and more than 100 TB of storage, which provides ample resources for storing and computing on vectorized data. The platform also has dedicated GPU nodes with a large number of RTX 4090 graphics cards, which supply the computing power for model training and inference. This layered hardware design keeps the database services stable and gives the algorithm models an independent hardware environment in which to run efficiently.

All database components are deployed and managed in a standardized way through KubeBlocks. The PostgreSQL cluster runs version 14.8.0 with primary-standby replication and automatic failover to keep business data highly available. Elasticsearch runs version 8.8.2 as a distributed search cluster that supports fast retrieval over massive document sets and complex aggregation analysis. The Qdrant vector database runs version 1.10.0, with its vector indexing tuned to support efficient similarity search.
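As an illustration of the full-text retrieval plus aggregation workload described for the Elasticsearch cluster, the sketch below builds a search request body in the Elasticsearch Query DSL as a plain Python dictionary. It only constructs and prints the JSON and does not contact a cluster. The field names `abstract` and `publication_date`, the aggregation name `papers_per_year`, and the helper `build_search_body` are assumptions made for this example, not the platform's actual schema.

```python
import json
from typing import Any, Dict


def build_search_body(text: str, size: int = 10) -> Dict[str, Any]:
    """Build a Query DSL body: full-text match plus a per-year histogram."""
    return {
        "size": size,
        # Full-text match against the (hypothetical) abstract field.
        "query": {"match": {"abstract": {"query": text}}},
        # Count matching papers per publication year.
        "aggs": {
            "papers_per_year": {
                "date_histogram": {
                    "field": "publication_date",
                    "calendar_interval": "year",
                }
            }
        },
    }


body = build_search_body("graph neural network", size=5)
print(json.dumps(body, indent=2))
```

The resulting JSON is the kind of body that would be sent to an index's `_search` endpoint. The index name and the client used to send it are left out on purpose, since the source does not describe them.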

Application results

The platform delivers integrated detection that combines text-level anti-copying, semantic-level anti-plagiarism, and image-level anti-generation, with detection accuracy and efficiency at an industry-leading level. Introducing the Qdrant vector database made large-scale vector similarity retrieval several times faster: similarity comparison over massive vector data completes in milliseconds, which provides strong technical support for real-time detection. Elasticsearch's distributed architecture supports real-time retrieval over massive document sets and maintains excellent query performance even when processing tens of millions of papers. Under the unified management of KubeBlocks, each database service reaches 99.99% availability, which supports stable 24/7 operation of the platform and gives academic institutions and journals a reliable service.
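To put the 99.99% availability figure in perspective, the short calculation below converts an availability percentage into the downtime it allows over a period. The function name and the choice of a 365-day year and a 30-day month are assumptions for illustration. The source reports only the percentage, not measured downtime.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


yearly = allowed_downtime_minutes(99.99)                   # 365-day year
monthly = allowed_downtime_minutes(99.99, period_days=30)  # 30-day month
print(f"per year:  {yearly:.2f} minutes")
print(f"per month: {monthly:.2f} minutes")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, or about 4.3 minutes in a 30-day month, for each database service.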

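With KubeBlocks automation, the database systems that used to require 3-4 specialist DBAs can now be managed day to day by a single part-time O&M engineer, cutting labor costs by 70%. Deploying a new database instance now takes a few hours instead of several days, and configuration changes and scaling operations can be completed with one click, which greatly improves O&M efficiency. Standardized O&M procedures and automated operations reduce human error, and the system failure rate is 80% lower than under the traditional management approach, which significantly improves stability and reliability.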

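The platform is now in use in the paper review workflows of several major academic journals, where it has significantly improved both the accuracy of identifying academic misconduct and the efficiency of handling it. By curbing misconduct through technical means, it provides strong technical support for safeguarding research integrity and improving the academic ecosystem. Its successful implementation also offers technical standards and best practices for the field of academic integrity detection and has advanced the technology of the industry as a whole.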

Summary and outlook

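The successful implementation of the Institute of Automation's active monitoring platform for academic misconduct in scientific research papers demonstrates the strength of KubeBlocks in managing complex database environments. The project succeeded because several key factors came together. A sound multi-modal, multi-type technical architecture took full account of the complexity and diversity of academic misconduct detection. The most suitable database technology was chosen for each data type, so that each database could be used for what it does best. Most importantly, KubeBlocks, as the unified management platform, both reduced O&M complexity and improved the reliability and maintainability of the whole system. The unified platform met the technical goals while also substantially lowering O&M costs and raising overall efficiency.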

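Looking ahead, the platform will keep investing in technical innovation and wider adoption. As AI technology develops rapidly, the platform will continue to optimize its detection algorithms and improve its ability to identify new forms of academic misconduct, in particular the detection of AI-generated content. There are also plans to extend the platform to more academic institutions and journals, forming a larger academic integrity detection network and promoting a unified industry standard for academic integrity. Building on the experience gained from the platform, the team will also take an active part in drafting industry standards and technical specifications for academic integrity detection, contributing to integrity across the academic community. This case shows the important role that advanced technology plays in building academic integrity and offers implementation experience and a technical reference for similar projects.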
