BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication

Author

Wagner, Gerit (ORCID: 0000-0003-3926-7717)

Published

2024

DOI

10.21105/JOSS.06318

Keywords

literature-review

Summary

BibDedupe is a Python library developed for bibliographic record deduplication in meta-analysis and research synthesis. It is constructed with a focus on four requirements: (1) Zero false positives: The primary objective is to prevent incorrectly merging distinct entries. This focus on zero false positives is crucial to ensure trustworthiness and prevent biased conclusions in the analysis. (2) Reproducibility: BibDedupe implements fixed rules to produce consistent results, in line with the scientific standard of reproducibility. (3) Efficiency: The library is also tuned for low false-negative rates and rapid processing, to ensure scalability of the duplicate identification process. (4) Continuous evaluation and improvement: It is continuously evaluated on over 160,000 records from 10 datasets to ensure its effectiveness, especially in follow-up refinements. Unlike general-purpose deduplication tools, BibDedupe is specifically designed for the unique requirements of bibliographic data in meta-analysis and research synthesis. In this context, BibDedupe aims to provide a Python library that improves the effectiveness and efficiency of duplicate identification, potentially benefitting review papers across scientific disciplines.
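
The precision-first, fixed-rule idea described in the summary can be illustrated with a minimal sketch. This is not BibDedupe's implementation or API: the function names, the two rules, the 0.95 title threshold, and the sample records (one taken from this page, one made up) are all assumptions chosen for illustration, and the sketch relies only on the Python standard library.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_duplicate(a: dict, b: dict, title_threshold: float = 0.95) -> bool:
    """Apply fixed, deterministic rules (hypothetical, not BibDedupe's own)."""
    doi_a = a.get("doi", "").lower()
    doi_b = b.get("doi", "").lower()
    # Rule 1: when both records carry a DOI, the DOIs alone decide.
    if doi_a and doi_b:
        return doi_a == doi_b
    # Rule 2: otherwise require the same year and a near-identical title.
    # The high threshold trades missed duplicates for fewer wrong merges.
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(
        None, normalize(a["title"]), normalize(b["title"])
    ).ratio()
    return ratio >= title_threshold


def find_duplicate_pairs(records: list) -> set:
    """Compare every pair of records and return the IDs of matching pairs."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if is_duplicate(a, b)
    }


def evaluate(predicted: set, actual: set) -> dict:
    """Count wrong merges (false positives) and missed duplicates (false negatives)."""
    return {
        "false_positives": len(predicted - actual),
        "false_negatives": len(actual - predicted),
    }


# Sample data: r1 and r2 describe the paper on this page; r3 is made up.
records = [
    {
        "id": "r1",
        "title": "BibDedupe: An Open-Source Python Library for "
                 "Bibliographic Record Deduplication",
        "year": "2024",
        "doi": "10.21105/JOSS.06318",
    },
    {
        "id": "r2",
        "title": "BibDedupe: an open-source Python library for "
                 "bibliographic record deduplication.",
        "year": "2024",
        "doi": "",
    },
    {
        "id": "r3",
        "title": "An invented example record about something else",
        "year": "2024",
        "doi": "",
    },
]

pairs = find_duplicate_pairs(records)
metrics = evaluate(pairs, actual={("r1", "r2")})
```

Because the rules are fixed and involve no randomness or training step, running the sketch twice on the same input yields the same pairs, which is the reproducibility property the summary refers to. Comparing predicted pairs against a labelled set, as `evaluate` does, mirrors in miniature the kind of benchmark evaluation the summary mentions.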

Article / DOI link: https://doi.org/10.21105/JOSS.06318

Additional resources

Code / source: https://github.com/CoLRev-Environment/bib-dedupe

Open access PDF

https://joss.theoj.org/papers/10.21105/joss.06318.pdf

Citation (APA style)

Wagner, G. (2024). BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication. Journal of Open Source Software, 9(97), 6318. https://doi.org/10.21105/JOSS.06318

Citation: BibTeX

@article{Wagner2024,
  doi        = {10.21105/JOSS.06318},
  author     = {Wagner, Gerit},
  journal    = {Journal of Open Source Software},
  title      = {BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication},
  year       = {2024},
  volume     = {9},
  number     = {97},
  pages      = {6318},
  url        = {https://joss.theoj.org/papers/10.21105/joss.06318},
  abstract   = {BibDedupe is a Python library developed for bibliographic record deduplication in meta-analysis and research synthesis. It is constructed with a focus on four requirements: (1) Zero false positives: The primary objective is to prevent incorrectly merging distinct entries. This focus on zero false positives is crucial to ensure trustworthiness and prevent biased conclusions in the analysis. (2) Reproducibility: BibDedupe implements fixed rules to produce consistent results, in line with the scientific standard of reproducibility. (3) Efficiency: The library is also tuned for low false-negative rates and rapid processing, to ensure scalability of the duplicate identification process. (4) Continuous evaluation and improvement: It is continuously evaluated on over 160,000 records from 10 datasets to ensure its effectiveness, especially in follow-up refinements. Unlike general-purpose deduplication tools, BibDedupe is specifically designed for the unique requirements of bibliographic data in meta-analysis and research synthesis. In this context, BibDedupe aims to provide a Python library that improves the effectiveness and efficiency of duplicate identification, potentially benefitting review papers across scientific disciplines.},
  fulltext   = {https://joss.theoj.org/papers/10.21105/joss.06318.pdf},
  news_announced = {2026-02-22}
}

Citation: RIS

TY  - JOUR
AU  - Wagner, Gerit
TI  - BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication
T2  - Journal of Open Source Software
PY  - 2024
VL  - 9
IS  - 97
SP  - 6318
DO  - 10.21105/JOSS.06318
UR  - https://joss.theoj.org/papers/10.21105/joss.06318
ER  -
