Vector embeddings (multi-dimensional vectors) are a central building block for AI applications. And accordingly, the ability to store vectors to add long-term memory to your AI applications (e.g. via vector databases) is gaining importance. Sounds fancy, but for the basic use cases, this simply boils down to “arrays of floats” for developers. And this is exactly what ObjectBox database now supports natively. If you want to use vectors on the edge, e.g. in a mobile app or on an embedded device, when offline, independent from an Internet connection, removing the unknown latency, try it…
Another typical use case is the embedding of certain types of data, like text, audio or images, as vector coordinates. To store such a vector embedding, in the following example we store the floating point coordinates that were computed by a machine learning model for an image together with a reference to the actual image:
// Link to the actual image, e.g. on Cloud storage
// The coordinates computed for this image (vector embedding)
Ready to go?
To update to this release, change the version of objectbox-gradle-plugin to 3.6.0.
Vector databases - a look at the AI database market
What are vector databases? What do you need them for? Who is in the market? Includes a comparison matrix of vector database options like Pinecone, Milvus, Vespa, Vald, Chroma, Marqo AI, Weaviate, and Qdrant
With 350M+ USD invested in AI / vector databases in the last months, one thing is clear: The vector database market is hot 🔥 Everyone, not just investors, is interested in the booming AI market. While AI applications have dominated the news for quite some time, the infrastructure software that supports these applications, such as vector databases, is finally gaining attention too. In the following, we’ll have a look at why vector databases are gaining attention and compare current vector database alternatives.
What is a vector database?
Vector databases can be used to give LLMs (Large Language Models) – or more broadly speaking, AI applications – a long-term memory and faster search and querying capabilities.
A vector database stores vectors, or more precisely vector embeddings. A vector database therefore is a specialised type of database designed to store and manage large sets of vectors efficiently. However, the challenge and value are not derived from simply being able to store vectors. The value is created by the type of computations that can be run over the stored vector data and the speed with which these computations can be run, e.g. similarity searches.
To give some context: The most traditional databases, SQL databases, store data in rows and columns; graph databases store graphs and object databases store objects.
Because AI applications typically rely on vector embeddings, vector databases are especially apt at supporting AI applications.
Accordingly, vector databases are becoming a critical layer in the AI tech stack; they are sometimes also called “AI databases”. However, databases tend to converge over time, meaning that many databases support several different database models.
What is a vector embedding?
A vector embedding is a list of numbers that represent objects and relationships, allowing unstructured data (such as images) to be searched and used. Typically, Machine Learning (ML) algorithms are used to create these vectors. The ML algorithms analyse large amounts of data to learn how to represent complex / unstructured data in a lower dimensional space (as vectors).
What does it have to do with nearest neighbour search?
Searchability (making unstructured data usable) is at the heart of this concept. The nearest neighbour search is therefore a key concept in vector databases. The distance between vector embeddings expresses similarity. Therefore, as you are searching for the most similar data, the so-called “nearest neighbour search” is a key concept and the time required to find the nearest neighbours is essential.
Do we need special vector databases?
There is already a discussion going on about whether special vector databases are needed or do not warrant a new category in the database landscape. Instead, vector extensions of traditional databases could be supporting the AI market. Both are reasonable expectations, and time will tell. Notable databases that have already added a vector extension include e.g. redis and elasticsearch. Additionally, more and more databases now allow storing vector types.
How does the vector database landscape look like?
To have a look at the current market situation, we are comparing the choices with the most traction, but excluding established players that have added vector capabilities to their existing database offering. Generally speaking we see a lot of very young companies, some companies that did pivot from their original specialization, and massive fundings. Please note: the table is not optimized to be readable on mobile or small screens (there just is a trade-off between providing the information and making it readable on every device).
Milvus is a cloud-native Commercial Open Source vector database
(Partly Open) SaaS* [Commercial Open Source]
Initially a blog post from them said SQLite, but meanwhile they said RocksDB; was maybe exchanged? they also have a ChatGPT-Cache that is build on SQLite and say "Milvus uses SQLite or MySQL to manage metadata"
Want to know more about the vector database market?
Here are some more questions answered for anyone interested
What is an "Open SaaS" business model?
Software as a service (SaaS) refers to software that is managed / hosted for the client and is essentially “rented.” The open in Open SaaS refers to the open source software that is being offered as such a service.
This frequently implies that not all code is open source, particularly that which is part of the managed service / hosting and associated value-adding features. Note: The open source software offered in this manner may or may not be provided by the company providing the software as a service. This has caused some friction in the open source community, as original creators often struggle to make a living, and/or maintainers struggle to keep maintaining the software – while other companies profit. Most famously, huge cloud providers have taken advantage of this option, leading to new licenses that keep the source open but restrict others from hosting as a service without donating the whole source code back to the community.
Why should I care about index types?
Indexes are essentially a way to speed up searching a database. There are several established index types for vector databases and they affect the performance of the database, e.g. the time it takes a query to complete.
What about benchmarks?
You will see, if you review the benchmarks given at the top, that results typically vary. Benchmarks are difficult to do and neutral benchmarks even more so. Certain use cases may favor certain solutions. Therefore, ideally you benchmark based on your specific use case…. but as a first evaluation, try to understand the basic influencing factors and have a look at a handful of benchmarks and explanations. Having said all this: There is a benchmarking tool available for approximate nearest neighbor (ANN) algorithms search. If you use this, you can compare the performance of different databases (with regards to the ANN search) for the same setup, based on the same approach. Also: The underlying libs often used by databases (like NGT and HNSW, see above) have already been benchmarked with it and you can compare to these directly.
Why is the market so hot, how can companies raise so much money?
AI is hot, everyone agrees that data and its management will be key to future success, and the database market is interesting: It is a long established market with many players, yet still demonstrating continually good growth (e.g. 17% in 2020). And the database market history shows that from time to time a new type of database comes up, and with it, the creation of a new market category. In such a market, typically the market creator “takes all” (not quite literally, but such a significant share, definetely the vast majority, that all other players are not attractive from a VC-perspective). Such a market could easily be worth 100M+ in ARR. Examples from the last 20 years: MongoDB (NoSQL databases), Cockroach (NewSQL databases), Neo4J (Graph databases), Influx (Time-Series databases). So, VCs are looking to find the next new type of database that can create a market… Maybe it will be vector databases? However, the database market has also shown to take 10 years+ for players to become profitable, so expect a longterm game. The race is still on for Edge Databases we think 🙂
Want to know more about the database market?
We recommend checking out db-engines. The website compares all relevant systems and has tons of data from the last 20 years. Note: They do only add databases once they have some traction and notability, not any hobby project. Accordingly not all databases of the above comparison have been added to the website yet.