Vector databases - a look at the AI database market

⭐ What are vector databases? ⭐ What do you need them for? ⭐ Who is in the market?

Includes a comparison matrix of vector database options like Pinecone, Milvus, Vespa, Vald, Chroma, Marqo AI, Weaviate, and Qdrant

With 350M+ USD invested in AI / vector databases in recent months, one thing is clear: the vector database market is hot 🔥 Everyone, not just investors, is interested in the booming AI market. While AI applications have dominated the news for quite some time, the infrastructure software that supports these applications, such as vector databases, is finally gaining attention too. In the following, we’ll have a look at why vector databases are gaining attention and compare current vector database alternatives.

What is a vector database? 

A vector database stores vectors, or more precisely vector embeddings. A vector database therefore is a specialised type of database designed to store and manage large sets of vectors efficiently. However, the challenge and value are not derived from simply being able to store vectors. The value is created by the type of computations that can be run over the stored vector data and the speed with which these computations can be run, e.g. similarity searches. 

Vector databases are essentially an important piece of the AI tech stack. They can be used e.g. to give LLMs (Large Language Models) – or more broadly speaking, AI applications – a long-term memory and faster search and querying capabilities. Another important use case is RAG (Retrieval-Augmented Generation).

To give some context: The most traditional databases, SQL databases, store data in rows and columns; graph databases store graphs and object databases store objects.

Because Large Language Models and AI applications rely on vector embeddings, vector databases are especially well suited to supporting AI applications.

Accordingly, vector databases are becoming a critical layer in the AI tech stack; they are sometimes also called “AI databases”. However, databases tend to converge over time, meaning that many databases end up supporting several different database models.

What is a vector embedding?

A vector embedding is a list of numbers that represents objects and their relationships, allowing unstructured data (such as images) to be searched and used. Typically, Large Language Models (more precisely, the underlying Machine Learning (ML) algorithms) are used to create these vectors. The ML algorithms analyse large amounts of data to learn how to represent complex / unstructured data in a lower dimensional space (as vectors).

What does it have to do with nearest neighbour search?

Searchability (making unstructured data usable) is at the heart of this concept. The distance between vector embeddings expresses the similarity of the vectors (and thus of the represented objects). Since you are searching for the most similar data, the so-called “nearest neighbour search” is a key concept in vector databases, and the time required to find the nearest neighbours is essential.
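To make the idea concrete, here is a minimal sketch of an exact (brute-force) nearest neighbour search in plain Python. The three-dimensional vectors and their labels are made up for illustration; real embeddings come from an ML model and typically have hundreds of dimensions. Cosine similarity is used here as one common choice of similarity measure, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, vectors, k=2):
    """Exact search: score every stored vector and return the keys of the k best."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical toy embeddings (assumption: not produced by any real model).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}

print(nearest_neighbours([0.85, 0.15, 0.05], embeddings, k=2))  # -> ['cat', 'kitten']
```

This approach compares the query against every stored vector, so the cost grows linearly with the size of the data set. That is exactly why the time to find the nearest neighbours becomes the central problem at scale.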

Do we need special vector databases?

There is an ongoing discussion about whether specialised vector databases are needed and warrant a new category in the database landscape, or whether vector extensions of traditional databases could support the AI market instead. Both are reasonable expectations, and time will tell. Notable databases that have already added a vector extension include Redis and Elasticsearch. Additionally, more and more databases now allow storing vector types.

What does the vector database landscape look like?

To get a picture of the current market, we compare the options with the most traction, excluding established players that have added vector capabilities to their existing database offering. Generally speaking, we see a lot of very young companies, some companies that pivoted from their original specialization, and massive funding rounds. Please note: the table is not optimized to be readable on mobile or small screens (there is a trade-off between providing the information and making it readable on every device).

If you’re on mobile, use this link to view a version that is readable on mobile.

| Vector database | Open Source | License | GitHub stars | Developed in (language) | Summary | Business Model | Embeds / Uses | Founding date / first released date | In-memory support | Sharding | Index Types | Consistency Model | Benchmarks (Performance?) | Queries per second (using text nytimes-256-angular) | Latency, ms (Recall/Percentile 95 (millis), nytimes-256-angular) | Approximate Nearest Neighbor (ANN) | Funding | Who's behind it | HQ in |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Marqo AI | Y | Apache-2.0 | 2.8k ⭐ | Python | A tensor-based cloud-native commercial Open Source search and analytics engine. | Open SaaS | Tensor-based | ❔ | | Y | HNSW | | - | ❔ | ❔ | | undisclosed preseed in May 2022 | S2Search Australia Pty Ltd | 🇦🇺 |
| Weaviate | Y | BSD | 5.6k ⭐ | Assembly, C++, GoLang | Weaviate is a commercial Open Source cloud-native vector database that stores both objects and vectors. | Open SaaS | ❔ | started in 2018 as a traditional graph database, first released in 2019 | N | Y, static sharding | a custom HNSW PQ algorithm that supports CRUD | Eventual Consistency | not comparative, just evaluating their own performance | 791 | 2 | Y (multiple ANN algorithms as long as they support full CRUD) | 67.7M USD, series B | SeMI Technologies | 🇪🇺 |
| Chroma | Y | Apache-2.0 | 4.4k ⭐ | Python & Typescript | Chroma is a Commercial Open Source vector database | Preparing a (Partly Open) SaaS model* [Commercial Open Source] | HNSW lib, DuckDB; based on ClickHouse | looks like 2022 | N | Dynamic segment placement | | | | ❔ | ❔ | Y | 20.3M USD, seed | Chroma Inc. | 🇺🇸 |
| Qdrant | Y | Apache-2.0 | 6.6k ⭐ | Rust | Qdrant is a Commercial Open Source vector similarity search engine and vector database | Open SaaS | RocksDB | first released: 2021 | Y | Y, static sharding | HNSW (SQ & PQ) | Eventual Consistency, tunable consistency | compares to weaviate, milvus, elastic (note: redis took too long to complete) | 326 | 4 | Y | 9.8M € | Qdrant Solutions GmbH | 🇪🇺 |
| Milvus | Y | Apache-2.0 | 18k ⭐ | GoLang & Python | Milvus is a cloud-native Commercial Open Source vector database | (Partly Open) SaaS* [Commercial Open Source] | Initial blog post from them said SQLite, but meanwhile they said RocksDB - exchanged? They also have a ChatGPT-Cache that is built on SQLite and say "Milvus uses SQLite or MySQL to manage metadata" | founded 2017, first released: 2019 | N | Dynamic segment placement | ANNOY; HNSW; IVF_PQ; IVF_SQ8; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG | Strong, bounded staleness, session, and eventually. The default consistency level in Milvus is bounded staleness. | not comparative | 2406 | 1 | Y | 113M USD, series B | Zilliz | 🇺🇸 |
| Vespa | Y | Apache-2.0 | 4.4k ⭐ | Java & C++ | Vespa is a Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data | Open SaaS | ❔ | Originally a web search engine (alltheweb), acquired by Yahoo! in 2003 and later open sourced as Vespa in 2017; since Oct 2023 spinoff, raised series A in Nov 2023 | maintains disk and memory structures for documents | Y | Custom HNSW (Multi-vector hybrid HNSW-IF) | Eventual Consistency | not comparative | ❔ | ❔ | Y | Spinoff from Yahoo! in Oct 2023, then raised a 31M USD series A | Yahoo! | 🇺🇸 |
| Vald | Y | Apache-2.0 | 1.2k ⭐ | GoLang | Vald is a cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine | Community project, currently looks like no commercial interests are pursued | uses the vector search engine NGT | Technology incubation at Yahoo! Japan Corporation, development was started in 2019 | ❔ | N/A | N/A | N/A | not comparative, but Vald performance only | ❔ | ❔ | Y (NGT) | - | Yusuke Kato (Yahoo Japan Corporation), Kiichiro Yukawa (Yahoo Japan Corporation) | 🇯🇵 |
| Pinecone | N | Proprietary | NA | | Pinecone is a fully managed vector database that specializes in enabling semantic search capabilities | SaaS | built on top of Faiss | first released in 2019 | N | Y | proprietary | Eventual Consistency | more programming language comparison for vector databases | 150 (for p2, but more pods can be added) | 1 (batched search, 0.99 recall, 200k SBERT) | Y (proprietary), plus KNN (with Faiss) | 138M, series B | Pinecone Systems Inc | 🇺🇸 |

Want to know more about the vector database market?

Here are some more questions answered for anyone interested

What is an "Open SaaS" business model?

Software as a service (SaaS) refers to software that is managed / hosted for the client and is essentially “rented.” The open in Open SaaS refers to the open source software that is being offered as such a service.

This frequently implies that not all code is open source, particularly that which is part of the managed service / hosting and associated value-adding features. Note: The open source software offered in this manner may or may not be provided by the company providing the software as a service. This has caused some friction in the open source community, as original creators often struggle to make a living, and/or maintainers struggle to keep maintaining the software – while other companies profit. Most famously, huge cloud providers have taken advantage of this option, leading to new licenses that keep the source open but restrict others from hosting as a service without donating the whole source code back to the community.

Why should I care about index types?

Indexes are essentially a way to speed up searching a database. There are several established index types for vector databases and they affect the performance of the database, e.g. the time it takes a query to complete.
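As an illustration of how an index trades a little accuracy for speed, here is a toy sketch in the spirit of the IVF (inverted file) family that appears in the table above (e.g. IVF_FLAT). It is not how any particular database implements it: the class name `TinyIVFIndex`, the fixed centroids, and the sample points are all assumptions made for this example. Real systems learn the centroids from the data (typically with k-means) and add many optimisations.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyIVFIndex:
    """Toy inverted-file index: each vector is bucketed by its closest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {}  # centroid position -> list of (key, vector)

    def _closest_centroids(self, vector, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: squared_distance(vector, self.centroids[i]),
        )
        return order[:n]

    def add(self, key, vector):
        bucket = self._closest_centroids(vector, 1)[0]
        self.buckets.setdefault(bucket, []).append((key, vector))

    def search(self, query, k=1, n_probe=1):
        """Scan only the n_probe buckets closest to the query.

        Returns the k best keys and the number of vectors actually compared.
        """
        candidates = []
        for bucket in self._closest_centroids(query, n_probe):
            candidates.extend(self.buckets.get(bucket, []))
        candidates.sort(key=lambda item: squared_distance(query, item[1]))
        return [key for key, _ in candidates[:k]], len(candidates)

# Hypothetical centroids and data points, chosen by hand for the example.
index = TinyIVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
for key, vec in {"a": [1.0, 1.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [11.0, 10.0]}.items():
    index.add(key, vec)

print(index.search([8.0, 9.0], k=1, n_probe=1))  # -> (['c'], 2)
```

With `n_probe=1` only two of the four stored vectors are compared against the query. The price is that a true nearest neighbour sitting in a bucket that was not probed would be missed, which is why such searches are called approximate. Raising `n_probe` scans more buckets and moves the result back towards exact search.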

What about benchmarks?

If you review the benchmarks listed in the table above, you will see that results typically vary. Benchmarks are difficult to do, and neutral benchmarks even more so. Certain use cases may favor certain solutions. Ideally, therefore, you benchmark based on your specific use case. As a first evaluation, though, try to understand the basic influencing factors and have a look at a handful of benchmarks and explanations. Having said all this: there is a benchmarking tool available for approximate nearest neighbor (ANN) search algorithms. If you use it, you can compare the performance of different databases (with regard to ANN search) for the same setup, based on the same approach. Also, the underlying libraries often used by databases (like NGT and HNSW, see above) have already been benchmarked with it, so you can compare against these directly.
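For a first, rough evaluation on your own data, the two figures that show up in the table above, recall and queries per second, can be measured with a few lines of Python. This is only a sketch under simple assumptions: the helper names `recall_at_k` and `queries_per_second` are made up for this example, `search_fn` stands for whatever client call your database offers, and a serious benchmark would also control for warm-up, concurrency, and hardware.

```python
import time

def recall_at_k(approximate_ids, exact_ids):
    """Share of the true nearest neighbours that the approximate search returned."""
    return len(set(approximate_ids) & set(exact_ids)) / len(exact_ids)

def queries_per_second(search_fn, queries):
    """Run every query once and divide the query count by the elapsed wall-clock time."""
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    elapsed = time.perf_counter() - start
    return len(queries) / max(elapsed, 1e-9)  # guard against a zero reading

# The approximate search found 3 of the 4 true neighbours.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4]))  # -> 0.75
```

The two numbers only make sense together: an index can usually be tuned to answer faster by accepting a lower recall, so compare throughput at the same recall level.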

Why is the market so hot, how can companies raise so much money?

AI is hot, everyone agrees that data and its management will be key to future success, and the database market is interesting: it is a long-established market with many players, yet it still shows consistently good growth (e.g. 17% in 2020). Database market history also shows that, from time to time, a new type of database comes up and with it a new market category. In such a market, the market creator typically “takes all” (not quite literally, but such a significant share, definitely the vast majority, that all other players are not attractive from a VC perspective). Such a market could easily be worth 100M+ in ARR. Examples from the last 20 years: MongoDB (NoSQL databases), Cockroach (NewSQL databases), Neo4J (Graph databases), Influx (Time-Series databases). So, VCs are looking to find the next new type of database that can create a market… Maybe it will be vector databases? However, the database market has also shown that it can take 10+ years for players to become profitable, so expect a long-term game. The race is still on for Edge Databases, we think 🙂

Want to know more about the database market?

We recommend checking out db-engines. The website compares all relevant systems and has tons of data from the last 20 years. Note: They only add databases once they have some traction and notability, not every hobby project. Accordingly, not all databases in the above comparison have been added to the website yet.