Mini TF-IDF search engine over documentation
Indexes the Markdown files in a docs/ folder, computes a homemade TF-IDF score (no dependencies) and ranks the most relevant documents for a free-text query.
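Reading the code below, the score it computes for a query q against a document d can be written as follows. The symbols are notation introduced here for illustration, not the author's: tf(t, d) is the number of occurrences of term t in d, |d| is the number of tokens in d, N is the number of indexed files, and df(t) is the number of files containing t. Query terms that appear in no file are skipped.

```latex
\mathrm{score}(q, d) = \sum_{\substack{t \in q \\ \mathrm{df}(t) > 0}} \frac{\mathrm{tf}(t, d)}{|d|} \, \ln\frac{N}{\mathrm{df}(t)}
```

A term present in every file contributes ln(N/N) = 0, so very common words do not affect the ranking.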
Prerequisites
Python 3.9+ (standard library only)
Python
import math
import re
from collections import Counter
from pathlib import Path

# Tokenize each Markdown file into lowercase words of 3+ characters.
docs = {p.name: re.findall(r"\w{3,}", p.read_text(encoding="utf-8").lower())
        for p in Path("docs").glob("*.md")}
N = len(docs)
# Document frequency: number of files that contain each word.
df = Counter(mot for mots in docs.values() for mot in set(mots))

def score(requete, mots):
    """Sum of TF * IDF over the query terms that occur in the corpus."""
    if not mots:  # a file with no tokens would otherwise divide by zero
        return 0.0
    tf = Counter(mots)
    return sum(tf[t] / len(mots) * math.log(N / df[t])
               for t in requete.lower().split() if df.get(t))

requete = "configuration proxy timeout"
classement = sorted(((score(requete, mots), nom)
                     for nom, mots in docs.items()), reverse=True)
print(f"Search for '{requete}' across {N} documents")
print(f"{'score':>8} document")
for s, nom in classement[:4]:
    print(f"{s:>8.4f} {nom}")
Result
Search for 'configuration proxy timeout' across 187 documents
   score document
  0.0412 reseau-proxy-entreprise.md
  0.0287 troubleshooting-api.md
  0.0151 deploiement-prod.md
  0.0093 faq-integration.md
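To try the scoring without a docs/ folder, here is a minimal sketch of the same computation wrapped in a function that takes an in-memory corpus. The names `tokenize` and `rank` and the three-file corpus are hypothetical, introduced only for this example. It also makes one assumption that differs from the snippet: the query is tokenized with the same regex as the documents, so punctuation in the query does not prevent a match.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of 3+ characters, the same rule the snippet uses for files."""
    return re.findall(r"\w{3,}", text.lower())


def rank(texts, query, top=4):
    """Rank a {name: text} mapping by TF-IDF against a free-text query."""
    docs = {name: tokenize(text) for name, text in texts.items()}
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for words in docs.values() for word in set(words))
    terms = tokenize(query)

    def score(words):
        if not words:  # empty document: score 0 instead of dividing by zero
            return 0.0
        tf = Counter(words)
        return sum(tf[t] / len(words) * math.log(n / df[t])
                   for t in terms if df.get(t))

    ranked = sorted(((score(words), name) for name, words in docs.items()),
                    reverse=True)
    return ranked[:top]


# Hypothetical three-file corpus, for illustration only.
corpus = {
    "proxy.md": "Proxy configuration: set the proxy timeout value.",
    "deploy.md": "Deploy the service, then check the configuration.",
    "empty.md": "",
}
for s, name in rank(corpus, "proxy timeout"):
    print(f"{s:>8.4f} {name}")
```

Only proxy.md contains the query terms, so it is the only file with a non-zero score; the empty file is ranked with a score of zero rather than raising an error.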
TF-IDF, Search, NLP, Markdown