GitPedia

Document similarity algorithms experiments

Document similarity algorithms experiment - Jaccard, TF-IDF, Doc2vec, USE, and BERT.

From massanishi · Updated June 9, 2026

Document similarity comparison using 5 popular algorithms: Jaccard, TF-IDF, Doc2vec, USE, and BERT. The project is written primarily in Python and was first published in 2020. Key topics include algorithm, bert, deep-learning, document-similarity, and jaccard.

Document Similarity Algorithms Experiment

33,914 New York Times articles are used for the experiment. It aims to show which algorithm yields the best result out of the box in 2020.
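Of the five algorithms, Jaccard is the simplest to state: the size of the intersection of two token sets divided by the size of their union. The following is a minimal sketch, assuming plain lowercase word tokens; the tokenization and the names `tokenize` and `jaccard_similarity` are illustrative assumptions, not taken from the repository.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(doc_a, doc_b):
    """Jaccard similarity: |A intersect B| / |A union B| over the token sets."""
    a, b = tokenize(doc_a), tokenize(doc_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


# Toy sentences (made up for illustration, not real article text).
score = jaccard_similarity(
    "Nadal wins the Australian Open",
    "Thiem beats Nadal at the Australian Open",
)
print(score)
```

Because it only compares which words appear, Jaccard ignores word frequency and word order, which is one reason it serves as a baseline next to the other four approaches.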

Purpose

  1. By running multiple algorithms on the same data set, including some of the newest and most popular in the NLP community, it shows which algorithms give the best results and by how much.
  2. By using full-length articles from a popular media publication as input, it simulates a real-world similarity/recommendation use case.
  3. By following the URLs, you can see and compare the quality of the document similarity results yourself.
  4. By using only publicly available pretrained models, you can set it up and start implementing document similarity on your own with very little prior knowledge, while expecting similar results.

Data Setup

33,914 New York Times articles from 2018 to June 2020 were selected. The data was mostly collected from their RSS feed.

Five articles were chosen as the basis for comparison, each representing a different category.

  1. How My Worst Date Ever Became My Best (lifestyle)
  2. A Deep-Sea Magma Monster Gets a Body Scan (science)
  3. Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled (business)
  4. Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal (sports)
  5. 2020 Democrats Seek Voters in an Unusual Spot: Fox News (politics)
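For each basis article, the result tables further down list the three articles an algorithm ranked as most similar. The ranking step can be sketched with TF-IDF weights and cosine similarity in pure Python. This is a sketch under stated assumptions, not the repository's implementation: the smoothed IDF formula, the tokenizer, the toy corpus, and the names `tfidf_vectors`, `cosine`, and `most_similar` are all illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(docs, basis_index, top_n=3):
    """Return indices of the top_n documents most similar to the basis."""
    vectors = tfidf_vectors(docs)
    basis = vectors[basis_index]
    scored = [
        (cosine(basis, vec), i) for i, vec in enumerate(vectors) if i != basis_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_n]]


# Toy corpus (made up for illustration, not the New York Times data).
corpus = [
    "volcano magma scan under the sea",
    "a volcano eruption and magma flows",
    "tennis final at the australian open",
    "election voters and cable news",
]
ranking = most_similar(corpus, basis_index=0, top_n=2)
print(ranking)
```

The same `most_similar` shape applies to Doc2vec, USE, or BERT: only the step that turns a document into a vector changes, while the cosine ranking stays the same.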

Comparison Criteria

Overlapping tags, sections, subsections, writing format, and subjective judgement are considered. For a more detailed description, please follow this blog post.
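The objective criteria (tag, section, and subsection overlap) could be computed mechanically, as in the sketch below. The article dictionaries, their field names, and the tag values are hypothetical and made up for illustration; style overlap and the subjective judgement are manual assessments and are not modeled here.

```python
def normalize(tags):
    """Compare tags case-insensitively and ignore surrounding whitespace."""
    return {tag.strip().lower() for tag in tags}


def compare(basis, candidate):
    """Build one comparison row for a candidate article against the basis."""
    shared = normalize(basis["tags"]) & normalize(candidate["tags"])

    def flag(field):
        # "Y" only when the basis has a value and the candidate matches it.
        value = basis.get(field)
        return "Y" if value is not None and value == candidate.get(field) else "N"

    return {
        "tag_overlap": len(shared),
        "section_overlap": flag("section"),
        "subsection_overlap": flag("subsection"),
    }


# Hypothetical metadata, not taken from the data set.
basis = {
    "tags": ["Tennis", "Australian Open", "Nadal"],
    "section": "sports",
    "subsection": "tennis",
}
candidate = {
    "tags": ["tennis", "Australian Open ", "Djokovic"],
    "section": "sports",
    "subsection": None,
}
row = compare(basis, candidate)
print(row)
```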

Algorithm That Wins Overall

TF-IDF. It produced the best matches in 2.5 out of 5 comparisons: two outright wins (science and business) plus a shared win on the sports article.
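The source does not spell out how the 2.5 figure is computed. One reading that reproduces it is to count an outright win as 1 and a shared win as 0.5. The sketch below tallies the per-article winners listed further down under that assumption; the `shared_credit` parameter and the `tally` function are illustrative names.

```python
from collections import defaultdict

# Winners per basis article, as listed in the per-article results.
winners_by_article = {
    "lifestyle": ["BERT"],
    "science": ["TF-IDF"],
    "business": ["TF-IDF"],
    "sports": ["Jaccard", "TF-IDF", "USE"],
    "politics": ["USE"],
}


def tally(winners, shared_credit=0.5):
    """Score 1 for an outright win and shared_credit for a shared win.

    The 0.5 credit for a shared win is an assumption, not stated in the source.
    """
    scores = defaultdict(float)
    for algorithms in winners.values():
        credit = 1.0 if len(algorithms) == 1 else shared_credit
        for name in algorithms:
            scores[name] += credit
    return dict(scores)


scores = tally(winners_by_article)
print(scores)
```

Under this assumption TF-IDF scores 2.5, ahead of USE at 1.5, BERT at 1.0, and Jaccard at 0.5.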

The detailed breakdowns of how each algorithm predicted can be found in the algorithm folders.

Winning Algorithm for Each Article

How My Worst Date Ever Became My Best

Winner: BERT

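| Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
| --- | --- | --- | --- | --- | --- | --- |
| Why Are All the Exes Texting? | 1 | Y | Y | Y | Dating | Related |
| When Love Seems Too Easy to Trust | 1 | Y | Y | Y | Dating | Related |
| He Saved His Last Lesson for Me | 1 | Y | Y | Y | Dating | Related |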

A Deep-Sea Magma Monster Gets a Body Scan

Winner: TF-IDF

| Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
| --- | --- | --- | --- | --- | --- | --- |
| A 3D Encounter With a Violent Volcano’s Underbelly | 4 | Y | N | Y | 3D Mapped Volcano | Highly Related |
| Pressure, and Mystery, on the Rise | 1 | Y | Y | Y | Iceland's Volcano | Related |
| It’s Not Just Hawaii: The U.S. Has 169 Volcanoes That Could Erupt | 2 | N | N | Y | Volcanos | Related |

Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled

Winner: TF-IDF

| Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
| --- | --- | --- | --- | --- | --- | --- |
| Nissan CEO Says 'No Merit' in Merger With Renault-Nikkei | 3 | N | N | Y | Nissan and Renault | Related |
| Carlos Ghosn and the Roots of Nissan’s Decline | 3 | N | N | N | Carlos Ghosn | Related |
| Renault Chooses Volkswagen Executive as New C.E.O. | 5 | Y | Y | Y | Renault CEO | Very Related |

Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal

Winner: Jaccard, TF-IDF, and USE

(Jaccard results shown)

| Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
| --- | --- | --- | --- | --- | --- | --- |
| Djokovic vs. Federer, a Rivalry for the Ages, Is One-Sided This Time | 3 | Y | Y | Y | Australian Open | Related |
| Novak Djokovic Wins the Australian Open | 4 | Y | Y | Y | Dominic vs Novak in Australian Open | Very Related |
| With Rome Title, Nadal Back on Track Entering French Open | 1 | N | N | Y | French Open | Unrelated |

2020 Democrats Seek Voters in an Unusual Spot: Fox News

Winner: USE

| Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
| --- | --- | --- | --- | --- | --- | --- |
| Bernie Sanders Had a Problem With MSNBC. Then Came Super Tuesday. | 7 | N | N | Y | Sanders and MSNBC | Very Related |
| Democrats, Don’t Abandon Fox News | 7 | N | N | N | Democrats and Fox News | Very Related |
| Candidates Running Against, and With, Cable News | 4 | Y | Y | Y | Fox, MSNBC and politics | Related |

This article is auto-generated from massanishi/document_similarity_algorithms_experiments via the GitHub API. Last fetched: 6/28/2026