pgvector

Journaux liées à cette note :

En étudiant l'article Wikipedia "Base de données vectorielle", je découvre la liste de différents algorithmes Approximate Nearest Neighbor.

#JaiDécouvert feature extraction algorithms.

These feature vectors may be computed from the raw data using machine learning methods such as feature extraction algorithms, word embeddings or deep learning networks. The goal is that semantically similar data items receive feature vectors close to each other.

source

J'apprends :

In recent benchmarks, HNSW-based implementations have been among the best performers.

source

Je lis :

Databases that use HNSW as search index include:

Apache Lucene Vector Search

Chroma

Qdrant

pgvector

Clickhouse

…

source

En interrogeant Claude Sonnet 4, j'apprends :

Benchmark indicatif (1M vecteurs 768D) :

Métrique Qdrant pgvector Elasticsearch

Temps indexation 15 min 45 min 25 min

Requête/sec 2000+ 500-800 800-1200

RAM utilisée 4 GB 6 GB 8 GB+

Précision @10 0.95 0.92 0.94

Date création 2021 2021 2022 (support HNSW)

Langage Rust C Java

Open Source Open Source Open Source

Claude Sonnet 4

Métrique	Qdrant	pgvector	Elasticsearch
Temps indexation	15 min	45 min	25 min
Requête/sec	2000+	500-800	800-1200
RAM utilisée	4 GB	6 GB	8 GB+
Précision @10	0.95	0.92	0.94
Date création	2021	2021	2022 (support HNSW)
Langage	Rust	C	Java
	Open Source	Open Source	Open Source

Deux amis m'ont partagé un thread Hacker News : Postgres.new: In-browser Postgres with an AI interface.

Je viens de prendre le temps de tester postgres.new.

Voici une vidéo officielle : https://www.youtube.com/watch?v=ooWaPVvljlU

#Jadore ! Je trouve l'UX très bonne, j'aime l'onglet "Migrations", les explications données dans la colonne de droite.

Le projet est 100% Open source, voici le dépôt GitHub : https://github.com/supabase-community/postgres-new

Très beau travail !

Je me demande combien de temps ce projet a été implémenté 🤔.

1 mois et demi d'après la page contributors.
Mais je constate que le premier commit est plutôt conséquent, je pense que le projet était initialement intégré dans un mono repository.

Concernant l'implémentation, je lis :

All queries in postgres.new run directly in your browser. There’s no remote Postgres container or WebSocket proxy.

👍️

How is this possible? PGlite, a WASM version of PostgreSQL that can run directly in your browser. Every database that you create spins up a new instance of PGlite that exposes a fully-functional Postgres database. Data is stored in IndexedDB so that changes persist after refresh.

La partie LLM n'est pas mentionnée, #JeMeDemande comment elle est implémentée 🤔.

Je pense avoir trouvé ma réponse ici :

We pair PGlite with an LLM (currently GPT-4o) and give it full reign over the database with unrestricted permissions. (from)

Je lis :

RAG / pgvector: PGLite supports pgvector, so you can ask the LLM to create embeddings for RAG. The site uses transformers.js to create embeddings inside the browser.

Je n'ai pas tout compris 🤔.

#JaiDécouvert transformers.js.

J'ai lu ce commentaire :

It is a neat tech demo but it clearly shows the limits of AI:

I got it to generate invalid SQL resulting in errors - it merely generates reasonable SQL, but in my case it generated to disjoint set of tables…. - In practice you have tot review all code - It can point you into the wrong direction. Novel systems often have something smart/abstract in there. This system creates mostly Straightforward simple systems. That’s not where the value is

All in all, it’s not worth it to me. Writing code myself is easier than having to review LLM code

Within our organization we have forbidden full LLM merge request because more often than not the code was suboptimal. And had sneaky bugs/mistakes.

I’m not saying these can’t be overcome. But not with current LLM design. They mostly generate stuff they have seen and are bad as truly new stuff.

Personnellement, cela ne me surprend pas et cela ne remet pas en question, à mes yeux, l'intérêt de cet outil.

Je pense l'utiliser pour concevoir une ébauche de base de données.
Je pense qu'il pourra me fournir de bonnes suggestions pour les noms de tables et de champs, et même inclure des champs auxquels je n'aurais peut-être pas pensé.