Vector databases - a look at the AI database market
What are vector databases? What do you need them for? Who is in the market? Includes a comparison matrix of vector database options like Pinecone, Milvus, Vespa, Vald, Chroma, Marqo AI, Weaviate, and Qdrant.
With 350M+ USD invested in AI / vector databases in recent months, one thing is clear: the vector database market is hot 🔥 Everyone, not just investors, is interested in the booming AI market. While AI applications have dominated the news for quite some time, the infrastructure software that supports these applications, such as vector databases, is finally gaining attention too. In the following, we’ll look at why vector databases are gaining attention and compare the current vector database alternatives.
What is a vector database?
Vector databases can be used to give LLMs (Large Language Models) – or more broadly speaking, AI applications – a long-term memory and faster search and querying capabilities.
A vector database stores vectors, or more precisely vector embeddings. A vector database therefore is a specialised type of database designed to store and manage large sets of vectors efficiently. However, the challenge and value are not derived from simply being able to store vectors. The value is created by the type of computations that can be run over the stored vector data and the speed with which these computations can be run, e.g. similarity searches.
To give some context: traditional relational (SQL) databases store data in rows and columns, graph databases store graphs, and object databases store objects.
Because AI applications typically rely on vector embeddings, vector databases are especially well suited to supporting AI applications.
Accordingly, vector databases are becoming a critical layer in the AI tech stack; they are sometimes also called “AI databases”. However, databases tend to converge over time, meaning that many databases support several different database models.
What is a vector embedding?
A vector embedding is a list of numbers that represents objects and their relationships, allowing unstructured data (such as images) to be searched and used. Typically, Machine Learning (ML) algorithms are used to create these vectors. The ML algorithms analyse large amounts of data to learn how to represent complex / unstructured data in a lower-dimensional space (as vectors).
What does it have to do with nearest neighbour search?
Searchability (making unstructured data usable) is at the heart of this concept. The distance between two vector embeddings expresses how similar the underlying data is: the smaller the distance, the more similar the items. Searching for the most similar data therefore means searching for the vectors closest to a query vector, the so-called “nearest neighbour search”. This makes nearest neighbour search a key concept in vector databases, and the time required to find the nearest neighbours is essential.
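To make this concrete, here is a minimal sketch of an exact (brute-force) nearest neighbour search in plain Python. The three-dimensional embeddings and their labels are made up for illustration; real embeddings come from an ML model and usually have hundreds of dimensions. Cosine similarity is used here as one common similarity measure, which is an assumption and not something prescribed by any database in the comparison below.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, embeddings, k=2):
    """Compare the query against every stored vector and return the k most similar labels."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical, hand-written embeddings (illustration only).
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

print(nearest_neighbours([0.85, 0.15, 0.05], EMBEDDINGS, k=2))
```

Because every stored vector is compared with the query, the cost grows linearly with the size of the data set. This is why the speed of the nearest neighbour search becomes the central problem at scale.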
Do we need special vector databases?
There is already a discussion going on about whether dedicated vector databases are needed and warrant a new category in the database landscape, or whether vector extensions of traditional databases could support the AI market instead. Both are reasonable expectations, and time will tell. Notable databases that have already added a vector extension include Redis and Elasticsearch. Additionally, more and more databases now allow storing vector types.
What does the vector database landscape look like?
To look at the current market situation, we compare the choices with the most traction, but exclude established players that have added vector capabilities to their existing database offering. Generally speaking, we see a lot of very young companies, some companies that pivoted from their original specialization, and massive funding rounds. Please note: the table is not optimized to be readable on mobile or small screens (there is simply a trade-off between providing the information and making it readable on every device).
Name | Open Source? | License | GitHub stars or other traffic numbers | Developed in (language) | Summary | Business Model | Embeds / Uses | Founding / first released date | In-memory support | Sharding | Index Types | Consistency Model | Benchmarks (Performance?) | Approximate Nearest Neighbor (ANN) | Funding | Who's behind it | HQ in |
Marqo AI | Y | Apache-2.0 | 2.8k stars | Python | A tensor-based cloud-native commercial Open Source search and analytics engine. | Open SaaS | Tensor-based | ❔ | Y | HNSW | - | undisclosed preseed in May 2022 | S2Search Australia Pty Ltd | Australia | |||
Weaviate | Y | BSD | 5.6k stars | Assembly, C++, GoLang | Weaviate is a commercial Open Source cloud-native vector database that stores both objects and vectors. | Open SaaS | ❔ | started in 2018 as a traditional graph database, first released in 2019 | N | Y | a custom HNSW algorithm that supports CRUD | Eventual Consistency | not comparative, just evaluating their own performance | Y (It can support multiple ANN algorithms as long as they support full CRUD) | 67.7M USD, series B | SeMI Technologies | EU |
Chroma | Y | Apache-2.0 | 4.4k stars | Python & Typescript | Chroma is a Commercial Open Source vector database | Preparing a (Partly Open) SaaS model* [Commercial Open Source] | HNSW lib, DuckDB; based on ClickHouse | looks like 2022 | N | HNSW | - | Y | 20.3M USD, seed | Chroma Inc. | US | ||
Qdrant | Y | Apache-2.0 | 6.6k stars | Rust | Qdrant is a Commercial Open Source vector similarity search engine and vector database | Open SaaS | RocksDB | first released: 2021 | Y | Y | HNSW | Eventual Consistency, tunable consistency | compares to weaviate, milvus, elastic (note: redis took too long to complete) | Y | 9.8M € | Qdrant Solutions GmbH | EU (D) |
Milvus | Y | Apache-2.0 | 18k stars | GoLang & Python | Milvus is a cloud-native Commercial Open Source vector database | (Partly Open) SaaS* [Commercial Open Source] | Initially a blog post from them said SQLite, but they have since said RocksDB (possibly replaced?); they also have a ChatGPT-Cache that is built on SQLite and say "Milvus uses SQLite or MySQL to manage metadata" | founded 2017, first released: 2019 | N | Y | ANNOY; HNSW; IVF_PQ; IVF_SQ8; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG | Milvus supports four consistency levels: strong, bounded staleness, session, and eventually. The default consistency level in Milvus is bounded staleness. | not comparative | Y | 113M USD, series B | Zilliz | US |
Vespa | Y | Apache-2.0 | 4.4k stars | Java & C++ | Vespa is a Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data | Open SaaS | Vespa does not wrap its core data structures around any libraries and engines | originally a web search engine (alltheweb), acquired by Yahoo! in 2003 and later developed into and open sourced as Vespa in 2017 | maintains disk and memory structures for documents | Y | HNSW; BM25 | Eventual Consistency | not comparative | Y | Yahoo! | Yahoo! | US |
Vald | Y | Apache-2.0 | 1.2k stars | GoLang | Vald is a cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine | Community project, currently looks like no commercial interests are pursued | Vald implements the NGT approximate nearest neighbor search algorithm | Technology incubation at Yahoo! Japan Corporation, development was started in 2019 | ❔ | Y | Distributed Index, asynchronous indexing | NGT | not comparative | Y (NGT) | Yahoo Japan Corporation | Yusuke Kato and Kiichiro Yukawa (Yahoo Japan Corporation) | Japan |
Pinecone | N | Proprietary | NA | Pinecone is a fully managed vector database that specializes in enabling semantic search capabilities | SaaS | built on top of Faiss | first released in 2019 | N | Y | proprietary | Eventual Consistency | more programming language comparison for vector databases | Y (proprietary), plus KNN (with Faiss) | 138M, series B | Pinecone Systems Inc | US |
Want to know more about the vector database market?
Here are some more questions answered for anyone interested
What is an "Open SaaS" business model?
Software as a service (SaaS) refers to software that is managed / hosted for the client and is essentially “rented.” The open in Open SaaS refers to the open source software that is being offered as such a service.
This frequently implies that not all code is open source, particularly that which is part of the managed service / hosting and associated value-adding features. Note: The open source software offered in this manner may or may not be provided by the company providing the software as a service. This has caused some friction in the open source community, as original creators often struggle to make a living, and/or maintainers struggle to keep maintaining the software – while other companies profit. Most famously, huge cloud providers have taken advantage of this option, leading to new licenses that keep the source open but restrict others from hosting as a service without donating the whole source code back to the community.
Why should I care about index types?
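Indexes are essentially a way to speed up searching a database. Instead of comparing a query against every stored vector, an index organizes the vectors so that only a small part of them needs to be examined. There are several established index types for vector databases (see the “Index Types” column above), and the choice affects the performance of the database, e.g. the time it takes a query to complete.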
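As a rough illustration of why the index type matters, the sketch below implements a toy inverted-file (IVF-style) index in plain Python: vectors are grouped into buckets around fixed centroids, and a query only scans the buckets whose centroids are closest to it. The class name `ToyIVFIndex`, the `nprobe` parameter name, the hand-picked centroids, and the random data are all assumptions made for this example. It is not how any of the databases above implement their indexes, and real IVF indexes learn their centroids from the data.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyIVFIndex:
    """Toy inverted-file index: one bucket of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _closest_centroids(self, vector, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: euclidean(vector, self.centroids[i]),
        )
        return order[:n]

    def add(self, vector):
        # Store the vector in the bucket of its nearest centroid.
        bucket = self._closest_centroids(vector, 1)[0]
        self.buckets[bucket].append(vector)

    def search(self, query, k=1, nprobe=1):
        # Only scan the nprobe buckets closest to the query.
        candidates = []
        for bucket in self._closest_centroids(query, nprobe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda v: euclidean(query, v))
        return candidates[:k], len(candidates)

# Hypothetical data: two well-separated clusters of 100 vectors each.
random.seed(0)
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [
    (cx + random.uniform(-1, 1), cy + random.uniform(-1, 1))
    for cx, cy in centroids
    for _ in range(100)
]

index = ToyIVFIndex(centroids)
for v in vectors:
    index.add(v)

query = (0.5, 0.5)
nearest, scanned = index.search(query, k=1, nprobe=1)
print(f"compared {scanned} of {len(vectors)} vectors")
```

With `nprobe=1` the query is compared with only one of the two buckets, so half of the data is skipped. The trade-off is that a true nearest neighbour sitting in a bucket that was not scanned would be missed, which is why such searches are called approximate. Raising `nprobe` trades speed for accuracy.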
What about benchmarks?
If you review the benchmarks listed in the comparison table above, you will see that results typically vary. Benchmarks are difficult to do, and neutral benchmarks even more so. Certain use cases may favor certain solutions. Ideally, therefore, you benchmark based on your specific use case. As a first evaluation, though, try to understand the basic influencing factors and have a look at a handful of benchmarks and explanations. Having said all this: there is a benchmarking tool available for approximate nearest neighbor (ANN) search algorithms. If you use it, you can compare the performance of different databases (with regard to ANN search) for the same setup, based on the same approach. Also, the underlying libraries and algorithms often used by databases (like NGT and HNSW, see above) have already been benchmarked with it, so you can compare against these directly.
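If you do run your own comparison, two numbers are commonly looked at together for ANN search: how many of the true nearest neighbours the approximate search actually returned (recall), and how many queries per second it handled. The helpers below are a minimal sketch of how both could be measured in Python. The function names and the `search_fn` callable are hypothetical placeholders for whatever client call your database provides; they are not part of any benchmarking tool.

```python
import time

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    exact_top = set(exact_ids[:k])
    hits = sum(1 for item in approx_ids[:k] if item in exact_top)
    return hits / k

def queries_per_second(search_fn, queries):
    """Run search_fn once per query and return (results, throughput)."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    throughput = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, throughput

# Example: the approximate search found 3 of the 4 true nearest neighbours.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))  # 0.75
```

A fast result with low recall and a slow result with perfect recall are not directly comparable, so it helps to report both values for the same data set and the same hardware.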
Why is the market so hot, how can companies raise so much money?
AI is hot, everyone agrees that data and its management will be key to future success, and the database market is interesting: it is a long-established market with many players that still shows consistently good growth (e.g. 17% in 2020). Database market history also shows that from time to time a new type of database emerges, and with it a new market category. In such a market, the market creator typically “takes all” (not quite literally, but such a significant share, definitely the vast majority, that all other players are unattractive from a VC perspective). Such a market could easily be worth 100M+ in ARR. Examples from the last 20 years: MongoDB (NoSQL databases), Cockroach (NewSQL databases), Neo4j (graph databases), Influx (time-series databases). So VCs are looking for the next new type of database that can create a market. Maybe it will be vector databases? However, the database market has also shown that it can take 10+ years for players to become profitable, so expect a long-term game. The race is still on for Edge Databases, we think 🙂
Want to know more about the database market?
We recommend checking out db-engines. The website compares all relevant systems and has tons of data from the last 20 years. Note: they only add databases once they have some traction and notability, not every hobby project. Accordingly, not all databases in the above comparison have been added to the website yet.