Vector types (aka arrays) added with ObjectBox Java 3.6 release

Vector types (aka arrays) added with ObjectBox Java 3.6 release

Vector embeddings (multi-dimensional vectors) are a central building block for AI applications. And accordingly, the ability to store vectors to add long-term memory to your AI applications (e.g. via vector databases) is gaining importance. Sounds fancy, but for the basic use cases, this simply boils down to “arrays of floats” for developers. And this is exactly what ObjectBox database now supports natively. If you want to use vectors on the edge, e.g. in a mobile app or on an embedded device, when offline, independent from an Internet connection, removing the unknown latency, try it…

See the release notes for all new features this release brings.

Code Examples

Let’s start with a simple example: let’s assume some shapes that use a palette of RGB colors. An entity for this might look like this:

We can now create a query to find all shapes that use a certain color:

Another typical use case is the embedding of certain types of data, like text, audio or images, as vector coordinates. To store such a vector embedding, in the following example we store the floating point coordinates that were computed by a machine learning model for an image together with a reference to the actual image:

Ready to go?

To update to this release, change the version of objectbox-gradle-plugin to 3.6.0.

To add ObjectBox database to your JVM or Android project read our Getting Started guide.
As always, we look forward to your feedback on GitHub or via our anonymous feedback form and hope you have a great time building apps with ObjectBox! ❤️

Vector databases – a look at the AI database market with a comprehensive comparison matrix

Vector databases – a look at the AI database market with a comprehensive comparison matrix

Vector databases - a look at the AI database market

What are vector databases? What do you need them for? Who is in the market? Includes a comparison matrix of vector database options like Pinecone, Milvus, Vespa, Vald, Chroma, Marqo AI, Weaviate, and Qdrant

With 350M+ USD invested in AI / vector databases in the last months, one thing is clear: The vector database market is hot 🔥 Everyone, not just investors, is  interested in the booming AI market. While AI applications have dominated the news for quite some time, the infrastructure software that supports these applications, such as vector databases, is finally gaining attention too.  In the following, we’ll have a look at why vector databases are gaining attention and compare current vector database alternatives.

What is a vector database? 

Vector databases can be used to give LLMs (Large Language Models) – or more broadly speaking, AI applications – a long-term memory and faster search and querying capabilities. 

A vector database stores vectors, or more precisely vector embeddings. A vector database therefore is a specialised type of database designed to store and manage large sets of vectors efficiently. However, the challenge and value are not derived from simply being able to store vectors. The value is created by the type of computations that can be run over the stored vector data and the speed with which these computations can be run, e.g. similarity searches. 

To give some context: The most traditional databases, SQL databases, store data in rows and columns; graph databases store graphs and object databases store objects.

Because AI applications typically rely on vector embeddings, vector databases are especially apt at supporting AI applications. 

Accordingly, vector databases are becoming a critical layer in the AI tech stack; they are sometimes also called “AI databases”. However, databases tend to converge over time, meaning that many databases support several different database models.

What is a vector embedding?

A vector embedding is a list of numbers that represent objects and relationships, allowing unstructured data (such as images) to be searched and used. Typically, Machine Learning (ML) algorithms are used to create these vectors. The ML algorithms analyse large amounts of data to learn how to represent complex / unstructured data in a lower dimensional space (as vectors).

What does it have to do with nearest neighbour search?

Searchability (making unstructured data usable) is at the heart of this concept. The nearest neighbour search is therefore a key concept in vector databases. The distance between vector embeddings expresses similarity. Therefore, as you are searching for the most similar data, the so-called “nearest neighbour search” is a key concept and the time required to find the nearest neighbours is essential. 

Do we need special vector databases?

There is already a discussion going on about whether special vector databases are needed or do not warrant a new category in the database landscape. Instead, vector extensions of traditional databases could be supporting the AI market. Both are reasonable expectations, and time will tell. Notable databases that have already added a vector extension include e.g. redis and elasticsearch. Additionally, more and more databases now allow storing vector types.

How does the vector database landscape look like?

To have a look at the current market situation, we are comparing the choices with the most traction, but excluding established players that have added vector capabilities to their existing database offering. Generally speaking we see a lot of very young companies, some companies that did pivot from their original specialization, and massive fundings. Please note: the table is not optimized to be readable on mobile or small screens (there just is a trade-off between providing the information and making it readable on every device).

  Open Source? License GitHub stars or other traffic numbers Developed in (language) Summary Business Model Embeds / Uses founding / first released date In-memory support Sharding Index Types Consistency Model Benchmarks (Performance?) Approximate Nearest Neighbor (ANN)  Funding Who's behind it HQ in 
Marqo AI Y Apache-2.0 2.8k stars Python A tensor-based cloud-native commercial Open Source search and analytics engine. Open SaaS Tensor-based   Y HNSW   -   undisclosed preseed in May 2022 S2Search Australia Pty Ltd Australia
Weaviate Y BSD 5.6k stars Assembly, C++, GoLang Weaviate is a commercial Open Source cloud-native vector database that stores both objects and vectors. Open SaaS started in 2018 as a traditional graph database, first released in 2019 N Y a custom HNSW algorithm that supports CRUD Eventual Consistency not comparative, just evaluating their own performance Y (It can support multiple ANN algorithms as long as they support full CRUD) 67.7M USD, series B SeMI Technologies EU
Chroma Y Apache-2.0 4.4k stars Python & Typescript Chroma is a Commercial Open Source vector database Preparing a (Partly Open) SaaS model* [Commercial Open Source] HNSW lib, DuckDB; based on ClickHouse looks like 2022 N   HNSW   - Y 20.3M USD, seed Chroma Inc. US
Qdrant Y Apache-2.0 6.6k stars Rust Qdrant is a Commercial Open Source vector similarity search engine and vector database Open SaaS RocksDB first released: 2021 Y Y HNSW Eventual Consistency, tunable consistency compares to weaviate, milvus, elastic (note: redis took too long to complete) Y 9.8M € Qdrant Solutions GmbH EU (D)
Milvus Y Apache-2.0 18k stars GoLang & Python Milvus is a cloud-native Commercial Open Source vector database (Partly Open) SaaS* [Commercial Open Source] Initially a blog post from them said SQLite, but meanwhile they said RocksDB; was maybe exchanged?
they also have a ChatGPT-Cache that is build on SQLite
and say "Milvus uses SQLite or MySQL to manage metadata"
founded 2017, first released: 2019 N Y ANNOY; HNSW; IVF_PQ; IVF_SQ(; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG Milvus supports four consistency levels: strong, bounded staleness, session, and eventually. The default consistency level in Milvus is bounded staleness.  not comparative Y 113M USD, series B Zilliz US
Vespa Y Apache-2.0 4.4k stars Java & C++ Vespa is a Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data Open SaaS Vespa does not wrap its core data structures around any libraries and engines originally a web search engine (alltheweb), acquired by Yahoo! in 2003 and later developed into and open sourced as Vespa in 2017 maintains disk and memory structures for documents Y HNSW; BM25 Eventual Consistency not comparative Y Yahoo! Yahoo! US
Vald Y Apache-2.0 1.2k GoLang Vald is a cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine Community project, currently looks like no commercial interests are pursued Vald implements the NGT approximate nearest neighbor search algorithm Technology incubation at Yahoo! Japan Corporation, development was stared in 2019 Y Distributed Index, asynchronous indexing NGT not comparitive Y (NGT) Yahoo Japan Corporation Yusuke Kato and Kiichiro Yukawa (Yahoo Japan Corporation) Japan
Pinecone
N Proprietary NA   Pinecone is a fully managed vector database that specializes in enabling semantic search capabilities SaaS built on top of Faiss first released in 2019 N Y proprietary Eventual Consistency more programming language comparison for vector databases Y (proprietary), plus KNN (with Faiss) 138M, series B Pinecone Systems Inc US

Want to know more about the vector database market?

Here are some more questions answered for anyone interested

What is an "Open SaaS" business model?

Software as a service (SaaS) refers to software that is managed / hosted for the client and is essentially “rented.” The open in Open SaaS refers to the open source software that is being offered as such a service.

This frequently implies that not all code is open source, particularly that which is part of the managed service / hosting and associated value-adding features. Note: The open source software offered in this manner may or may not be provided by the company providing the software as a service. This has caused some friction in the open source community, as original creators often struggle to make a living, and/or maintainers struggle to keep maintaining the software – while other companies profit. Most famously, huge cloud providers have taken advantage of this option, leading to new licenses that keep the source open but restrict others from hosting as a service without donating the whole source code back to the community.

Why should I care about index types?

Indexes are essentially a way to speed up searching a database. There are several established index types for vector databases and they affect the performance of the database, e.g. the time it takes a query to complete.

What about benchmarks?

You will see, if you review the benchmarks given at the top, that results typically vary. Benchmarks are difficult to do and neutral benchmarks even more so. Certain use cases may favor certain solutions. Therefore, ideally you benchmark based on your specific use case…. but as a first evaluation, try to understand the basic influencing factors and have a look at a handful of benchmarks and explanations. Having said all this: There is a benchmarking tool available for approximate nearest neighbor (ANN) algorithms search. If you use this, you can compare the performance of different databases (with regards to the ANN search)  for the same setup, based on the same approach. Also: The underlying libs often used by databases (like NGT and HNSW, see above) have already been benchmarked with it and you can compare to these directly.

Why is the market so hot, how can companies raise so much money?

AI is hot, everyone agrees that data and its management will be key to future success, and the database market is interesting: It is a long established market with many players, yet still demonstrating continually good growth (e.g. 17% in 2020). And the database market history shows that from time to time a new type of database comes up, and with it, the creation of a new market category. In such a market, typically the market creator “takes all” (not quite literally, but such a significant share, definetely the vast majority, that all other players are not attractive from a VC-perspective). Such a market could easily be worth 100M+ in ARR. Examples from the last 20 years: MongoDB (NoSQL databases), Cockroach (NewSQL databases), Neo4J (Graph databases), Influx (Time-Series databases). So, VCs are looking to find the next new type of database that can create a market… Maybe it will be vector databases? However, the database market has also shown to take 10 years+ for players to become profitable, so expect a longterm game. The race is still on for Edge Databases we think 🙂

Want to know more about the database market?

We recommend checking out db-engines. The website compares all relevant systems and has tons of data from the last 20 years. Note: They do only add databases once they have some traction and notability, not any hobby project. Accordingly not all databases of the above comparison have been added to the website yet.