Top Small Language Models (SLMs) and how local vector databases add power

Can Small Language Models (SLMs) really do more with less? In this article, we discuss the unique strengths of SLMs, the top SLMs, their integration with local vector databases, and how SLMs + local vector databases are shaping the future of AI, prioritizing privacy, immediacy, and sustainability.

The Evolution of Language Models

In the world of artificial intelligence (AI), bigger models were once seen as better. Large Language Models (LLMs) amazed everyone with their ability to write, translate, and analyze complex texts. But they come with big problems too: high costs, slow processing, and huge energy demands. For example, OpenAI’s latest o3 model can cost up to $6,000 per task. GPT-3.5’s annual energy consumption is equivalent to powering more than 4,000 US households. That’s a huge price to pay, both financially and environmentally.

Now, Small Language Models (SLMs) are stepping into the spotlight, enabling sophisticated AI to run directly on devices (local AI) like your phone, laptop, or even a smart home assistant. These models not only reduce costs and energy consumption but also bring the power of AI closer to the user, ensuring privacy and real-time performance.

What Are Small Language Models (SLMs)?

LLMs are designed to understand and generate human language. Small Language Models (SLMs) are compact versions of LLMs, so the key difference between SLMs and LLMs is their size. While LLMs like GPT-4 are designed with hundreds of billions of parameters, SLMs use only a fraction of that. There is no strict definition of SLM vs. LLM yet: current SLMs range from single-digit millions of parameters up to several billion. Some authors suggest 8B parameters as the upper limit for SLMs. In our view, however, that raises the question of whether we also need a definition for Tiny Language Models (TLMs).

Advantages and disadvantages of SLMs

Small Language Models (SLMs) bring a range of benefits, particularly for local AI applications, but they also come with trade-offs.

Benefits of SLMs

  • Privacy: By running on-device, SLMs keep sensitive information local, eliminating the need to send data to the cloud.
  • Offline Capabilities: Local AI powered by SLMs functions seamlessly without internet connectivity.
  • Speed: SLMs require less computational power, enabling faster inference and smoother performance.
  • Sustainability: With lower energy demands for both training and operation, SLMs are more environmentally friendly.
  • Accessibility: Affordable training and deployment, combined with minimal hardware requirements, make SLMs accessible to users and businesses of all sizes.

Limitations of SLMs

The main disadvantage is the flexibility and quality of SLM responses: SLMs typically cannot handle as broad a range of tasks as LLMs at the same quality. However, in certain areas, they already match their larger counterparts. For example, the Artificial Analysis AI Review 2024 highlights that GPT-4o mini (July 2024) has a similar Quality Index to GPT-4 (March 2023), while being roughly 100x cheaper.

Small Language Models vs LLMs

Overcoming limitations of SLMs

A combination of SLMs with local vector databases is a game-changer. With a local vector database, you can not only broaden the range of tasks an SLM can handle and improve the quality of its answers, but do so specifically for the areas that matter to the use case you are solving. For example, you can add internal company knowledge, specific product manuals, or personal files to the SLM. In short, via a local vector database you provide the SLM with context and additional knowledge that was not part of its training. With this combination, an SLM can already today be as powerful as an LLM for your specific case and context (your tasks, your apps, your business). We’ll dive into this a bit more later.

In the following, we’ll have a look at the current landscape of SLMs – including the top SLMs – in a handy comparison matrix.

Top SLM Models

| Model Name | Size (Parameters) | Company/Team | License | Source | Quality claims |
|---|---|---|---|---|---|
| DistilBERT | 66 M | Hugging Face | Apache 2.0 | Hugging Face | "40% less parameters than google-bert/bert-base-uncased, runs 60% faster while preserving over 95% of BERT’s performances" |
| MobileLLM | 1.5 B | Meta | Pre-training code for MobileLLM open sourced (Attribution-NonCommercial 4.0 International) | Arxiv.org | "2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models"; "close correctness to LLaMA-v2 7B in API calling tasks" |
| TinyGiant (xLAM-1B) | 1.3 B | Salesforce | Training set open sourced (Creative Commons Public Licenses); trained model will be open sourced | Announcement; related research on Arxiv.org | "outperforming models 7x its size, including GPT-3.5 & Claude" |
| Gemma 2B | 2 B | Google | Gemma license (not open source per definition, but seemingly pretty much unrestricted use); training data not shared | Huggingface | "The Gemma performs well on the Open LLM leaderboard. But if we compare Gemma-2b (2.51 B) with PHI-2 (2.7 B) on the same benchmarks, PHI-2 easily beats Gemma-2b." |
| Phi-3 | 3.8 B, 7 B | Microsoft | MIT License | Microsoft News | iPhone 14: Phi-3-mini processing speed of 12 tokens per second. The H2O Danube3 benchmarks show Phi-3 with top performance compared to similar-size models, oftentimes beating Danube3. |
| OpenELM | 270M, 450M, 1.1B, 3B | Apple | Apple License (reads much like a permissive OSS license, though e.g. the Apple logo may not be used) | Huggingface; GitHub | OpenELM 1.1 B shows 1.28% (zero-shot tasks), 2.36% (OpenLLM Leaderboard), and 1.72% (LLM360) higher accuracy compared to OLMo 1.2 B, while using 2x less pretraining data. |
| H2O Danube3 | 500M, 4B | H2O.ai | Apache 2.0 | Arxiv.org; Huggingface | "competitive performance compared to popular models of similar size across a wide variety of benchmarks including academic benchmarks, chat benchmarks, as well as fine-tuning benchmarks" |
| GPT-4o mini | ~8B (rumoured) | OpenAI | Proprietary | Announcement | Scores 82% on MMLU and currently outperforms GPT-4 on chat preferences in the LMSYS leaderboard. Surpasses GPT-3.5 Turbo and other small models on academic benchmarks across both textual intelligence and multimodal reasoning, and supports the same range of languages as GPT-4o. |
| Gemini 1.5 Flash 8B | 8B | Google | Proprietary | Announcement on Google for Developers | Smaller and faster variant of 1.5 Flash with half the price, twice the rate limits, and lower latency on small prompts compared to its forerunner. Nearly matches 1.5 Flash on many key benchmarks. |
| Llama 3.1 8B | 8B | Meta | Llama 3.1 Community | Huggingface; Artificial Analysis | MMLU score of 69.4% and a Quality Index across evaluations of 53. Faster than average, with an output speed of 157.7 tokens per second. Low latency (0.37s TTFT), 128k context window. |
| Mistral-7B | 7B | Mistral | Apache 2.0 | Huggingface; Artificial Analysis | MMLU score 60.1%. Mistral 7B significantly outperforms Llama 2 13B on all metrics and is on par with Llama 34B (since Llama 2 34B was not released, results are reported on Llama 34B). It is also vastly superior in code and reasoning benchmarks. Was the best model for its size in autumn 2023. |
| Ministral | 3B, 8B | Mistral | Mistral Research License | Huggingface; Artificial Analysis | Claimed (by Mistral) to be the world's best edge models. Ministral 3B: MMLU score of 58%, Quality Index across evaluations of 51. Ministral 8B: MMLU score of 59%, Quality Index of 53. |
| Granite | 2B, 8B | IBM | Apache 2.0 | Huggingface; IBM Announcement | Granite 3.0 8B Instruct matches leading similarly-sized open models on academic benchmarks while outperforming those peers on benchmarks for enterprise tasks and safety. |
| Qwen 2.5 | 0.5B, 1.5B, 3B, 7B | Alibaba Cloud | Apache 2.0 (0.5B, 1.5B, 7B); Qwen Research (3B) | Huggingface; Qwen Announcement | Models specializing in coding and solving math problems. For the 7B model: MMLU score 74.2%, 128k context window. |
| Phi-4 | 14 B | Microsoft | MIT License | Huggingface; Artificial Analysis | Quality Index across evaluations of 77, MMLU 85%. Supports a 16K token context window, ideal for long-text processing. Outperforms Phi-3 and outperforms or is comparable with Qwen 2.5 and GPT-4o-mini on many metrics. |

SLM Use Cases – best choice for running local AI

SLMs are perfect for on-device or local AI applications. On-device / local AI is needed in scenarios that involve hardware constraints, demand real-time or guaranteed response rates, require offline functionality, or must comply with strict data privacy and security requirements. Here are some examples:

  • Mobile Applications: Chatbots or translation tools that work seamlessly on phones even when not connected to the internet.
  • IoT Devices: Voice assistants, smart appliances, and wearable tech running language models directly on the device.
  • Healthcare: Embedded in medical devices, SLMs allow patient data to be analyzed locally, preserving privacy while delivering real-time diagnostics.
  • Industrial Automation: SLMs process language on edge devices, increasing uptime and reducing latency in robotics and control systems.

By processing data locally, SLMs not only enhance privacy but also ensure reliable performance in environments where connectivity may be limited.

On-device Vector Databases and SLMs: A Perfect Match

Imagine a digital assistant on your phone that goes beyond generic answers, leveraging your company’s (and/or your personal) data to deliver precise, context-aware responses – without sharing this private data with any cloud or AI provider. This becomes possible when Small Language Models are paired with local vector databases. Using a technique called Retrieval-Augmented Generation (RAG), SLMs access the additional knowledge stored in the vector database, enabling them to provide personalized, up-to-date answers. Whether you’re troubleshooting a problem, exploring business insights, or staying informed on the latest developments, this combination ensures tailored and relevant responses.

Key Benefits of using a local tech stack with SLMs and a local vector database

  • Privacy. SLMs inherently provide privacy advantages by operating on-device, unlike larger models that rely on cloud infrastructure. To maintain this privacy advantage when integrating additional data, a local vector database is essential. ObjectBox is a leading example of a local database that ensures sensitive data remains local. 
  • Personalization. Vector databases give you a way to enhance the capabilities of SLMs and adapt them to your needs. For instance, you can integrate internal company data or personal device information to offer highly contextualized outputs.
  • Quality. Using additional context-relevant knowledge reduces hallucinations and increases the quality of the responses.
  • Traceability. As long as you store metadata alongside the vector embeddings, every piece of knowledge retrieved from the local vector database can be traced back to its source.
  • Offline-capability. Deploying SLMs directly on edge devices removes the need for internet access, making them ideal for scenarios with limited or no connectivity.
  • Cost-Effectiveness. By retrieving and caching the most relevant data to enhance the response of the SLM, vector databases reduce the workload of the SLM, saving computational resources. This makes them ideal for edge devices, like smartphones, where power and computing resources are limited.

Use case: Combining SLMs and local Vector Databases in Robotics

Imagine a warehouse robot that organizes inventory, assists workers, and ensures smooth operations. By integrating SLMs with local vector databases, the robot can process natural language commands, retrieve relevant context, and adapt its actions in real time – all without relying on cloud-based systems.

For example:

  • A worker says, “Can you bring me the red toolbox from section B?”
  • The SLM processes the request and consults the vector database, which stores information about the warehouse layout, inventory locations, and specific task history.
  • Using this context, the robot identifies the correct toolbox, navigates to section B, and delivers it to the worker.

The future of AI is – literally – in our hands

AI is becoming more personal, efficient, and accessible, and Small Language Models are driving this transformation. By enabling sophisticated local AI, SLMs deliver privacy, speed, and adaptability in ways that larger models cannot. Combined with technologies like vector databases, they make it possible to provide affordable, tailored, real-time solutions without compromising data security. The future of AI is not just about doing more – it’s about doing more where it matters most: right in your hands.

The Embedded Database for C++ and C

After 6 years and 21 incremental “zero dot” releases, we are excited to announce the first major release of ObjectBox, the high-performance embedded database for C++ and C. As a faster alternative to SQLite, ObjectBox delivers more than just speed – it’s object-oriented, highly efficient, and offers advanced features like data synchronization and vector search. It is the perfect choice for on-device databases, especially in resource-constrained environments or in cases with real-time requirements.

What is ObjectBox?

ObjectBox is a free embedded database designed for object persistence. With “object” referring to instances of C++ structs or classes, it is built for objects from scratch with zero overhead — no SQL or ORM layer is involved, resulting in outstanding object performance.

The ObjectBox C++ database offers advanced features, such as relations and ACID transactions, to ensure data consistency at all times. Store your data privately on-device across a wide range of hardware, from low-profile ARM platforms and mobile devices to high-speed servers. It’s a great fit for edge devices, iOS or Android apps, and server backends. Plus, ObjectBox is multi-platform (any POSIX will do, e.g. iOS, Android, Linux, Windows, or QNX) and multi-language: e.g., on mobile, you can work with Kotlin, Java or Swift objects. This cross-platform compatibility is no coincidence, as ObjectBox Sync will seamlessly synchronize data across devices and platforms.

Why should C and C++ Developers care?

ObjectBox deeply integrates with C and C++. Persisting C or C++ structs is as simple as a single line of code, with no need to interact with unfamiliar database APIs that disrupt the natural flow of C++. There’s also no data transformation (e.g. SQL, rows & columns) required, and interacting with the database feels seamless and intuitive.

As a C or C++ developer, you likely value performance. ObjectBox delivers exceptional speed (at least we haven’t tested against a faster DB yet). Handling several hundred thousand CRUD operations per second on commodity hardware is no sweat. Our unique advantage is that, if you want to, you can read raw objects from “mmapped” memory (directly from disk!). This offers true “zero copy” data access without any throttling layers between you and the data.

Finally, CMake support makes integration straightforward, starting with FetchContent support so you can easily get the library. But there’s more: we offer code generation for entity structs, which takes only a single CMake command.

“ObjectBox++”: A quick Walk-Through

Once ObjectBox is set up for CMake, the first step is to define the data model using FlatBuffers schema files. FlatBuffers is a building block within ObjectBox and is also widely used in the industry. For those familiar with Protocol Buffers, FlatBuffers are its parser-less (i.e., faster) cousin. Here’s an example of a “Task” entity defined in a file named “task.fbs”:
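The original post showed this schema as an image; the sketch below reconstructs what a minimal Task schema could look like (field names follow the ObjectBox getting-started guide and are illustrative):

```
// task.fbs – FlatBuffers schema describing the Task entity
table Task {
    id:   ulong;   // object ID, assigned and managed by ObjectBox
    text: string;  // the task description
}
```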

And with that file, you can generate code using the following CMake command:
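A sketch of that command, assuming the add_obx_schema() helper from the ObjectBox CMake integration (check the current docs for the exact signature):

```cmake
# Attach code generation for task.fbs to the "myapp" target (target name is illustrative).
add_obx_schema(
    TARGET myapp
    SCHEMA_FILES task.fbs
)
```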

Among other things, code generation creates a C++ struct for Task data, which is used to interact with the ObjectBox API. The struct is a straightforward C++ representation of the data model:
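Roughly, the generated struct looks like the sketch below (the real generated header contains additional glue code):

```cpp
// Generated C++ representation of the Task entity (simplified sketch)
struct Task {
    obx_id      id;    // 0 means "not persisted yet"; ObjectBox assigns an ID on put()
    std::string text;  // the task description
};
```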

The code generation also provides some internal “glue code” including the method create_obx_model() that defines the data model internally. With this, you can open the store and insert a task object in just three lines of code:
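A minimal sketch of those three lines, assuming the generated header is named task.obx.hpp (names follow the getting-started guide and may differ in your project):

```cpp
#include "objectbox.hpp"  // ObjectBox C++ API
#include "task.obx.hpp"   // generated entity struct and create_obx_model()

int main() {
    obx::Store store(create_obx_model());       // open (or create) the database
    obx::Box<Task> box(store);                  // typed access to Task objects
    obx_id id = box.put({.text = "Buy milk"});  // insert; ObjectBox assigns the ID (C++20 designated init)
}
```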

And that’s all it takes to get a database running in C++. This snippet essentially covers the basics of the getting started guide and this example project on GitHub.

Vector Embeddings for C++ AI Applications

Even if you don’t have an immediate use case, ObjectBox is fully equipped for vectors and AI applications. As a “vector database,” ObjectBox is ready for use in high-dimensional vector similarity searches, employing the HNSW algorithm for highly scalable performance beyond millions of vectors.

Vectors can represent semantics within a context (e.g. objects in a picture) or even documents and paragraphs to “capture” their meaning. This is typically used for RAG (Retrieval-Augmented Generation) applications that interact with LLMs. Basically, RAG allows AI to work with specific data, e.g. documents of a department or company and thus individualizes the created content.

To quickly illustrate vector search, imagine a database of cities including their location as a 2-dimensional vector. To enable nearest neighbor search, all you need to do is define an HNSW index on the location property, which enables the nearestNeighbors query condition used like this:
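Here is a sketch modeled on the city example from the vector search docs; treat the exact identifiers (City_, findWithScores) as assumptions and check the docs for the current API:

```cpp
obx::Box<City> cityBox(store);

// Query vector: Madrid's coordinates (latitude, longitude)
std::vector<float> madrid = {40.416775F, -3.703790F};

// Build a query for the 2 cities nearest to Madrid
obx::Query<City> query = cityBox.query(
        City_::location.nearestNeighbors(madrid, 2)
    ).build();

// "Find with scores" returns each matching City together with its distance to the query vector
auto results = query.findWithScores();
```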

For more details, refer to the vector search doc pages or the full city vector search example on GitHub.

store.close(); // Some closing words

This release marks an important milestone for ObjectBox, delivering significant improvements in speed, usability, and features. We’re excited to see how these enhancements will help you create even better, feature-rich applications.

There’s so much to explore! Please follow the links to dive deeper into topics like queries, relations, transactions, and, of course, ObjectBox Sync.

As always, we’re here to listen to your feedback and are committed to continually evolving ObjectBox to meet your needs. Don’t hesitate to reach out to us at any time.

P.S. Are you looking for a new job? We have a vacant C++ position to build the future of ObjectBox with us. We are looking forward to receiving your application! 🙂

The rise of small language models (“small LLMs”)

As artificial intelligence (AI) continues to evolve, companies, researchers, and developers are recognizing that bigger isn’t always better. Therefore, the era of ever-expanding model sizes is giving way to more efficient, compact models, so-called Small Language Models (SLMs). SLMs offer several key advantages that address both the growing complexity of AI and the practical challenges of deploying large-scale models. In this article, we’ll explore why the race for larger models is slowing down and how SLMs are emerging as the sustainable solution for the future of AI.

 

 

From Bigger to Better: The End of the Large Model Race

Up until 2023, the focus was on expanding models to unprecedented scales. But the era of creating ever-larger models appears to be coming to an end. Many newer models like Grok or Llama 3 are smaller in size yet maintain or even improve performance compared to models from just a year ago. The drive now is to reduce model size, optimize resources, and maintain power.

The Plateau of Large Language Models (LLMs)

 


Why Bigger No Longer Equals Better

As models become larger, developers are realizing that the performance improvements aren’t always worth the additional computational cost. Breakthroughs in knowledge distillation and fine-tuning enable smaller models to compete with and even outperform their larger predecessors in specific tasks. For example, medium-sized models like Llama with 70B parameters and Gemma-2 with 27B parameters are among the top 30 models in the chatbot arena, outperforming even much larger models like GPT-3.5 with 175B parameters.

 

The Shift Towards Small Language Models (SLMs)

In parallel with the optimization of LLMs, the rise of SLMs presents a new trend (see figure). These models require fewer computational resources, offer faster inference times, and have the potential to run directly on devices. In combination with an on-device database, this enables powerful local GenAI and on-device RAG apps on all kinds of embedded devices, like mobile phones, Raspberry Pis, commodity laptops, IoT devices, and robots.

 

Advantages of SLMs

Despite the growing complexity of AI systems, SLMs offer several key advantages that make them essential in today’s AI landscape:

 


Efficiency and Speed
SLMs are significantly more efficient, needing less computational power to operate. This makes them perfect for resource-constrained environments like edge computing, mobile phones, and IoT systems, enabling quicker response times and more real-time applications. For example, studies show that small models like DistilBERT can retain over 95% of the performance of larger models in some tasks while being 60% smaller and faster to execute.

Accessibility
As SLMs are less resource-hungry (lower hardware, CPU, memory, and power requirements), they are more accessible for companies and developers with smaller budgets. Because the model and data can be used locally, on-device / on-premise, there is no need for cloud infrastructure, and they are also usable for use cases with high privacy requirements. All in all, SLMs democratize AI development and empower smaller teams and individual developers to deploy advanced models on more affordable hardware.

Cost Reduction and Sustainability
Training and deploying large models require immense computational and financial resources and come with high operational costs. SLMs drastically reduce the cost of training, deployment, and operation as well as the carbon footprint, making AI more financially and environmentally sustainable.


Specialization and Fine-tuning
SLMs can be fine-tuned more efficiently for specific applications. They excel in domain-specific tasks because their smaller size allows for faster and more efficient retraining, which makes them ideal for sectors like healthcare, legal document analysis, or customer service automation. For instance, using the ‘distilling step-by-step’ mechanism, a 770M parameter T5 model outperformed a 540B parameter PaLM model using 80% of the benchmark dataset, showcasing the power of specialized training techniques with a much smaller model size.


On-Device AI for Privacy and Security
SLMs are becoming compact enough for deployment on edge devices like smartphones, IoT sensors, and wearable tech. This reduces the need for sensitive data to be sent to external servers, ensuring that user data stays local. With the rise of on-device vector databases, SLMs can now handle use-case-specific, personal, and private data directly on the device. This allows more advanced AI apps, like those using RAG, to interact with personal documents and perform tasks without sending data to the cloud. With a local, on-device vector database, users get personalized, secure AI experiences while keeping their data private.

The Future: Fit-for-Purpose Models – From Tiny to Small to Large Language Models

The future of AI will likely see the rise of models that are neither massive nor minimal but fit-for-purpose. This “right-sizing” reflects a broader shift toward models that balance scale with practicality. SLMs are becoming the go-to choice for environments where specialization is key and resources are limited. Medium-sized models (20-70 billion parameters) are becoming the standard choice for balancing computational efficiency and performance on general AI tasks. At the same time, SLMs are proving their worth in areas that require low latency and high privacy.

Innovations in model compression, parameter-efficient fine-tuning, and new architecture designs are enabling these smaller models to match or even outperform their predecessors. The focus on optimization rather than expansion will continue to be the driving force behind AI development in the coming years.

 

 Conclusion: Scaling Smart is the New Paradigm

 

As the field of AI moves beyond the era of “bigger is better,” SLMs and medium-sized models are becoming more important than ever. These models represent the future of scalable and efficient AI. They serve as the workhorses of an industry that is looking to balance performance with sustainability and efficiency. The focus on smaller, more optimized models demonstrates that innovation in AI isn’t just about scaling up; it’s about scaling smart.

Local AI – what it is and why we need it

Artificial Intelligence (AI) has become an integral part of our daily lives in recent years. However, it has been tied to running in huge, centralized cloud data centers. This year, “local AI”, also known as “on-device AI” or “Edge AI”, is gaining momentum. Local vector databases, efficient language models (so-called Small Language Models, SLMs), and AI algorithms are becoming smaller, more efficient, and less compute-heavy. As a result, they can now run on a wide variety of devices, locally.

Figure 1. Evolution of language model size over time. Large language models (LLMs) are marked as celadon circles, and small language models (SLMs) as blue ones.

What is Local AI (on-device AI, Edge AI)?

Local AI refers to running AI applications directly on a device, locally, instead of relying on (distant) cloud servers. Such on-device AI works in real-time on commodity hardware (e.g. old PCs), consumer devices (e.g. smartphones, wearables), and other types of embedded devices (e.g. robots and point-of-sale (POS) systems used in shops and restaurants). Interest in local Artificial Intelligence is growing (see Figure 2).

Figure 2. Interest over time according to Google Trends.

Why use Local AI: Benefits

Local AI addresses many of the concerns and challenges of current cloud-based AI applications. The main drivers behind the advancement of local AI are privacy, accessibility, and sustainability, which we cover in the following sections.

On top of that, local AI reduces:

  • latency, enabling real-time apps
  • data transmission and cloud costs, enabling commodity business cases

In short: By leveraging the power of Edge Computing and on-device processing, local AI can unlock new possibilities for a wide range of applications, from consumer applications to industrial automation to healthcare.

Privacy: Keeping Data Secure

In a world where data privacy concerns are increasing, local AI offers a solution. Since data is processed directly on the device, sensitive information remains local, minimizing the risk of breaches or misuse of personal data. There is no need for data sharing, and data ownership is clear. This is the key to using AI responsibly in industries like healthcare, where sensitive data needs to be processed and used without being sent to external servers. For example, medical data analysis or diagnostic tools can run locally on a doctor’s device and be synchronized to other on-premise, local devices (e.g. PCs, on-premise servers, specific medical equipment) as needed. This ensures that patient data never leaves the clinic, and data processing is compliant with strict privacy regulations like GDPR or HIPAA.

Accessibility: AI for Anyone, Anytime

One of the most significant advantages of local AI is its ability to function without an internet connection. This opens up a world of opportunities for users in remote locations or those with unreliable connectivity. Imagine having access to language translation, image recognition, or predictive text tools on your phone without needing to connect to the internet. Or a point-of-sale (POS) system in a retail store that operates seamlessly, even when there’s no internet. These AI-powered systems can still analyze customer buying habits, manage inventory, or suggest product recommendations offline, ensuring businesses don’t lose operational efficiency due to connectivity issues. Local AI makes this a reality. Combined with its low hardware requirements, it makes AI accessible to anyone, anytime. Local AI is therefore an integral ingredient in making AI more inclusive and democratizing it.

Sustainability: Energy Efficiency

Cloud-based AI requires massive server farms that consume enormous amounts of energy. Despite strong efficiency improvements, in 2022, data centers globally consumed between 240 and 340 terawatt-hours (TWh) of electricity. To put this in perspective, data centers now use more electricity than entire countries like Argentina or Egypt. This growing energy demand places considerable pressure on global energy resources and contributes to around 1% of energy-related CO2 emissions.

The rise of AI has amplified these trends. According to McKinsey, the demand for data center capacity is projected to grow by over 20% annually, reaching approximately 300GW by 2030, with 70% of this capacity dedicated to hosting AI workloads. Gartner even predicts that by 2025, “AI will consume more energy than the human workforce”. AI workloads alone could drive a 160% increase in data center energy demand by 2030, with some estimates suggesting that AI could consume 500% more energy in the UK than it does today. By that time, data centers may account for up to 8% of total energy consumption in the United States.

In contrast, local AI presents a more sustainable alternative, e.g. by leveraging Small Language Models, which require less power to train and run. Since computations happen directly on the device, local AI significantly reduces the need for constant data transmission and large-scale server infrastructure. This not only lowers energy use but also helps decrease the overall carbon footprint. Additionally, integrating a local vector database can further enhance efficiency by minimizing reliance on power-hungry data centers, contributing to more energy-efficient and environmentally friendly technology solutions.

When to use local AI: Use case examples

Local AI enables an infinite number of new use cases. Thanks to advancements in AI models and vector databases, AI apps can be run cost-effectively on less capable hardware, e.g. commodity PCs, without the need for an internet connection and data sharing. This opens up the opportunity for offline AI, real-time AI, and private AI applications on a wide variety of devices. From smartphones and smartwatches to industrial equipment and even cars, local AI is becoming accessible to a broad range of users. 

  • Consumer Use Cases (B2C): Everyday apps like photo editors, voice assistants, and fitness trackers can integrate AI to offer faster and more personalized services (local RAG), or integrate generative AI capabilities. 
  • Business Use Cases (B2B): Retailers, manufacturers, and service providers can use local AI for data analysis, process automation, and real-time decision-making, even in offline environments. This improves efficiency and user experience without needing constant cloud connectivity.

Conclusion

Local AI is a powerful alternative to cloud-based solutions, making AI more accessible, private, and sustainable. With Small Language Models and on-device vector databases like ObjectBox, it is now possible to bring AI onto everyday devices. From the individual user who is looking for convenient, always-available tools to large businesses seeking to improve operations and create new services without relying on the cloud – local AI is transforming how we interact with technology everywhere.

First on-device Vector Database (aka Semantic Index) for iOS

Easily empower your iOS and macOS apps with fast, private, and sustainable AI features. All you need is a Small Language Model (SLM; aka “small LLM”) and ObjectBox – our on-device vector database built for Swift apps. This gives you a local semantic index for fast on-device AI features like RAG or GenAI that run without an internet connection and keep data private.

The recently demonstrated “Apple Intelligence” features are precisely that: a combination of on-device AI models and a vector database (semantic index). Now, ObjectBox Swift enables you to add the same kind of AI features easily and quickly to your iOS apps right now.

Not developing with Swift? We also have a Flutter / Dart binding (works on iOS, Android, desktop), a Java / Kotlin binding (works on Android and JVM), or one in C++ for embedded devices.

Enabling Advanced AI Anywhere, Anytime

Typical AI apps use data (e.g. user-specific data, or company-specific data) and multiple queries to enhance and personalize the quality of the model’s response and perform complex tasks. And now, for the very first time, with the release of ObjectBox 4.0, this will be possible locally on restricted devices.

 

Local AI Tech Stack Example for on-device RAG

Swift on-device Vector Database and search for iOS and macOS

With the ObjectBox Swift 4.0 release, it is possible to create a scalable vector index on floating point vector properties. It’s a very special index that uses an algorithm called HNSW. It’s scalable because it can find relevant data within millions of entries in a matter of milliseconds.

Let’s pick up the cities example from our vector search documentation. Here, we use cities with a location vector and want to find the closest cities (a proximity search). The Swift class for the City entity shows how to define an HNSW index on the location:
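A sketch of that entity (the original snippet was an image; the annotation syntax follows the ObjectBox Swift docs, so double-check the exact spelling there):

```swift
// objectbox: entity
class City {
    var id: Id = 0
    var name: String?

    // objectbox:hnswIndex: dimensions=2
    var location: [Float]?

    init() {}
    init(name: String, location: [Float]) {
        self.name = name
        self.location = location
    }
}
```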

Inserting City objects with a float vector and HNSW index works as usual; the indexing happens behind the scenes:
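For example (a sketch; the store setup and the initializer are assumptions based on the entity sketch above):

```swift
let store = try Store(directoryPath: "cities-db")
let box = store.box(for: City.self)

// Put a few cities; the HNSW index on `location` is maintained automatically.
try box.put([
    City(name: "Barcelona", location: [41.385063, 2.173404]),
    City(name: "Nairobi",   location: [-1.292066, 36.821945]),
    City(name: "Salzburg",  location: [47.809490, 13.055010])
])
```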

To then find the cities closest to a location, we do a nearest neighbor search using the new query condition and the “find with scores” methods. The nearest neighbor condition accepts a query vector, e.g. the coordinates of Madrid, and a count to limit the number of results of the nearest neighbor search; here we want at most 2 cities. The “find with scores” methods are like a regular find, but additionally return a score, which is the distance of each result to the query vector. In our case, it is the distance of each city to Madrid.
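A sketch of that query (method names follow the ObjectBox Swift 4.0 announcement; treat the exact signatures as assumptions):

```swift
let madrid: [Float] = [40.416775, -3.703790]  // query vector

// The 2 nearest neighbors to Madrid, returned together with their distance scores
let query = try box
    .query { City.location.nearestNeighbors(queryVector: madrid, maxCount: 2) }
    .build()

let results = try query.findWithScores()  // ordered by distance (closest first)
for result in results {
    print("\(result.object.name ?? "?"): \(result.score)")
}
```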

The ObjectBox on-device vector database empowers AI models to seamlessly interact with user-specific data — like texts and images — directly on the device, without relying on an internet connection. With ObjectBox, data never needs to leave the device, ensuring data privacy.

Thus, it’s the perfect solution for developers looking to create smarter apps that are efficient and reliable in any environment. It enhances everything from personalized banking apps to robust automotive systems.

ObjectBox: Optimized for Resource Efficiency

At ObjectBox, we specialize in efficiency that comes from optimized code. Our hearts beat for creating highly efficient and capable software that outperforms alternatives on small and big hardware. ObjectBox maximizes speed while minimizing resource use, extending battery life, and reducing CO2 emissions.

With this expertise, we took a unique approach to vector search. The result is not only a vector database that runs efficiently on constrained devices but also one that outperforms server-side vector databases (see first benchmark results; on-device benchmarks coming soon). We believe this is a significant achievement, especially considering that ObjectBox still upholds full ACID properties (guaranteeing data integrity).

 Cloud/server vector databases vs. On-device/Edge vector databases

Also, keep in mind that ObjectBox is a fully capable database. It allows you to store complex data objects along with vectors. Thus, you have the full feature set of a database at hand. It empowers hybrid search, traceability, and powerful queries.

Use Cases / App ideas

ObjectBox can be used for a million different things, from empowering generative AI features in mobile apps to predictive maintenance on ECUs in cars to AI-enhanced games. For iOS apps, we expect to see the following on-device AI use cases very soon:

  • Across all categories we’ll see Chat-with-files apps:
      • Travel: Imagine chatting with your favorite travel guide offline, anytime, anywhere. No need to carry bulky paper books or scroll through a long PDF on your mobile.
      • Research: Picture yourself chatting with all the research papers in your field. Easily compare studies and findings, and quickly locate original quotes.
  • Lifestyle:
      • Health: Apps offering personalized recommendations based on scientific research, your preferences, habits, and individual health data. This includes data tracked from your device, lab results, and doctors’ diagnoses.
  • Productivity: Personal assistants for all areas of life.
      • Family Management: Interact with assistants tailored to specific roles. Imagine a parent’s assistant that monitors school channels, chat groups, emails, and calendars. Its goal is to automatically add events like school plays, remind you about forgotten gym bags, and even suggest birthday gifts for your child’s friends.
      • Professional Assistants: Imagine being a busy sales rep on the go, juggling appointments and travel. A powerful on-device sales assistant can do more than just automation. It can prepare contextual and personalized follow-ups instantly, for example by summarizing talking points, attaching relevant company documents, and even suggesting who to CC in your emails.
  • Educational: Apps featuring “chat-with-your-files” functionality for learning materials and research papers. Going beyond that, they can generate quizzes and practice questions to help people solidify knowledge.

Run the local AI Stack with a Language Model (SLM, LLM)

Recent Small Language Models (SLMs) already demonstrate impressive capabilities while being small enough to run on, e.g., mobile phones. To run a model on-device on an iPhone or a macOS computer, you need a model runtime. On Apple Silicon, the best choice in terms of performance is typically MLX, a framework brought to you by Apple machine learning research. It uses the hardware very efficiently, supporting CPU, GPU, and unified memory.

To summarize, you need these three components to run on-device AI with a semantic index:

  • ObjectBox: vector database for the semantic index
  • Models: an embedding model and a language model that match your requirements
  • MLX as the model runtime

Start building next generation on-device AI apps today! Head over to our vector search documentation and Swift documentation for details.

Retrieval Augmented Generation (RAG) with vector databases: Expanding AI Capabilities

What is RAG?

Retrieval Augmented Generation (RAG) is a technique to enhance the intelligence of large language models (LLMs) with additional knowledge, such as reliable facts from specific sources, private or personal information not available to others, or just fresh news, to improve their answers. Typically, in RAG, the additional knowledge is provided to the model from a vector database. For example, you can add internal data from your company, the latest news, or data from your personal devices to get responses that use your context. It can truly help you like an expert instead of giving generalized answers. This technique also reduces hallucinations.

Why RAG?

Let’s take a look at the key benefits that RAG in general offers:

  • Customization and Adaptation: RAG helps LLMs tailor responses to specific domains or use cases by using vector databases to store and retrieve domain-specific information. It turns general intelligence into expert intelligence.
  • Contextual Relevance: By incorporating information retrieved from a large corpus of text, RAG models can generate contextually relevant responses, improving the quality of generated responses compared to traditional generation models.
  • Accuracy and Diversity: Incorporating external information also helps to generate more informative and accurate responses and keeps the LLM up-to-date. This also helps to avoid repetitive or generic responses and allows for more diverse and interesting conversations.
  • Cost-effective Implementation: RAG requires less task-specific training data compared to fine-tuning the foundation models. When comparing retrieval augmented generation vs. fine-tuning, RAG’s ability to use external knowledge stands out. While fine-tuning requires lots of labeled data, RAG can rely on external sources. This is particularly beneficial in scenarios where annotated training data is limited or expensive to obtain, making RAG a cost-effective choice.
  • Transparency: RAG models provide transparency in their responses by explicitly indicating the source of retrieved information. This allows users to understand how the model arrived at its response and helps enhance trust in the generated output.

Therefore, RAG is suitable for applications where access to a vast amount of specialized data is necessary. For example, a customer support bot that pulls details from FAQs and generates coherent, conversational responses. Another example is an email drafting tool that fetches information about recent meetings and generates a personalized summary.

How retrieval augmented generation works

Let’s discuss the mechanics of how RAG operates with a vector database, covering its main stages from dataset creation to response generation (see figure).

Retrieval augmented generation diagram


  • DB creation: Creation of the external dataset
    Before real use, the vector database has to be created. The new data that lies outside the LLM’s training dataset (e.g. up-to-date or domain-specific information) is identified and added to the dataset. This dataset is then turned into vector embeddings via an AI model (an embedding model) and stored in the vector database.
  • DB in use: Retrieval of relevant information
    Once a query comes in, it is also converted into a vector embedding, which is then used to retrieve the most relevant results from the database. To achieve this, RAG uses semantic search techniques, also known as vector search, to understand the user’s query and/or context, retrieving contextually relevant information from a large dataset. Vector search goes beyond keyword matching and focuses on semantic relationships, improving the quality of the retrieved information and the overall performance of the RAG system in generating contextually relevant responses.
  • DB in use: Augmentation
    At this stage, the user’s query is augmented by adding the relevant data retrieved in the previous stage. Often, only the top responses from vector search are considered relevant data. Many databases have additional filtering techniques in place here.
  • Generation
    The augmented query is sent to the LLM to generate an accurate answer.
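To make these four stages concrete, here is a deliberately simplified sketch in Swift. The embedText and generateAnswer functions are hypothetical placeholders for whatever local embedding model and language model you run, and the “database” is just an in-memory array with brute-force cosine similarity; a real setup would store the embeddings in an on-device vector database instead:

```swift
// Hypothetical placeholders for a local embedding model and a local (S)LM.
func embedText(_ text: String) -> [Float] { /* run the embedding model */ return [] }
func generateAnswer(prompt: String) -> String { /* run the language model */ return "" }

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot   = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return dot / max(normA * normB, 1e-9)
}

// 1. DB creation: embed the external documents and store text + embedding together.
struct Entry { let text: String; let embedding: [Float] }
let documents = ["Product manual, chapter 1 ...", "Internal FAQ ..."]
let index = documents.map { Entry(text: $0, embedding: embedText($0)) }

// 2. Retrieval: embed the query and fetch the most similar entries
//    (a vector database does this step efficiently, e.g. via an HNSW index).
let question = "How do I reset the device?"
let queryVector = embedText(question)
let topChunks = index
    .sorted { cosineSimilarity($0.embedding, queryVector) > cosineSimilarity($1.embedding, queryVector) }
    .prefix(2)

// 3. Augmentation: prepend the retrieved context to the user's question.
let prompt = "Context:\n" +
    topChunks.map(\.text).joined(separator: "\n") +
    "\n\nQuestion: \(question)"

// 4. Generation: the (S)LM answers using the augmented prompt.
print(generateAnswer(prompt: prompt))
```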

The Role of Long Context Windows

The rise of new LLMs with long (1+ million tokens) context windows, like Gemini 1.5, raised the discussion of whether long context windows will replace RAG. A long context window enables users to directly incorporate huge amounts of data into a query, giving the LLM more context to improve its answers.

Long context windows and RAG each have pros and cons, and neither will kill the other. Rather than being mutually exclusive, they complement each other. Large context windows can enhance RAG applications by expanding the margin of precision and accommodating vast amounts of data. However, a model’s capability to take a long context does not mean that it can efficiently leverage all the information: if the relevant information is located in the middle of the context window, the LLM’s ability to recall it is worse than for information located at the beginning. To use RAG with a long context window, reranking (e.g. with a cross-encoder) should be used. The reranking model first calculates a matching score between a given query and vectors in the database (e.g. representing documents), and then rearranges the vector search results so that the most relevant ones are prioritized.

Future Directions of RAG

While RAG offers numerous benefits, there are still opportunities for improvement. Researchers are exploring ways to enhance RAG by combining it with other techniques, including fine-tuning (RAFT) or long context windows (in combination with reranking). Another direction of research is expanding RAG’s capabilities by advancing data handling (including multimodal data), evaluation methodologies, and scalability. Finally, RAG also benefits from new advances in optimizing LLMs to run locally on restricted devices (mobile, IoT), along with the emergence of the first on-device vector database. Now, RAG can be performed directly on your mobile device, prioritizing privacy, low latency, and offline capabilities.