Can Small Language Models (SLMs) really do more with less? In this article, we discuss the unique strengths of SLMs, the top SLMs, their integration with local vector databases, and how SLMs + local vector databases are shaping the future of AI, prioritizing privacy, immediacy, and sustainability.
The Evolution of Language Models
In the world of artificial intelligence (AI), bigger models were once seen as better. Large Language Models (LLMs) amazed everyone with their ability to write, translate, and analyze complex texts. But they come with big problems too: high costs, slow processing, and huge energy demands. For example, OpenAI's o3 model can reportedly cost up to $6,000 per task, and GPT-3.5's annual energy consumption is estimated to be equivalent to powering more than 4,000 US households. That's a huge price to pay, both financially and environmentally.
Now, Small Language Models (SLMs) are stepping into the spotlight, enabling sophisticated AI to run directly on devices (local AI) like your phone, laptop, or even a smart home assistant. These models not only reduce costs and energy consumption but also bring the power of AI closer to the user, ensuring privacy and real-time performance.
What Are Small Language Models (SLMs)?
LLMs are designed to understand and generate human language, and Small Language Models (SLMs) are compact versions of them. So, the key difference between SLMs and LLMs is their size: while LLMs like GPT-4 are designed with hundreds of billions of parameters, SLMs use only a fraction of that. There is no strict definition separating SLMs from LLMs yet. At the moment, SLM sizes range from single-digit millions of parameters up to several billion. Some authors suggest 8B parameters as the upper limit for SLMs; in our view, that raises the question of whether we also need a definition for Tiny Language Models (TLMs).
Advantages and disadvantages of SLMs
Small Language Models (SLMs) bring a range of benefits, particularly for local AI applications, but they also come with trade-offs.
Benefits of SLMs
- Privacy: By running on-device, SLMs keep sensitive information local, eliminating the need to send data to the cloud.
- Offline Capabilities: Local AI powered by SLMs functions seamlessly without internet connectivity.
- Speed: SLMs require less computational power, enabling faster inference and smoother performance.
- Sustainability: With lower energy demands for both training and operation, SLMs are more environmentally friendly.
- Accessibility: Affordable training and deployment, combined with minimal hardware requirements, make SLMs accessible to users and businesses of all sizes.
Limitations of SLMs
The main disadvantage is the flexibility and quality of SLM responses: SLMs typically cannot tackle the same broad range of tasks as LLMs with the same quality. However, in certain areas they are already matching their larger counterparts. For example, the Artificial Analysis AI Review 2024 highlights that GPT-4o mini (July 2024) has a Quality Index similar to GPT-4 (March 2023), while being roughly 100x cheaper.
Overcoming limitations of SLMs
A combination of SLMs with local vector databases is a game-changer. With a local vector database, the variety of tasks an SLM can handle and the quality of its answers can not only be enhanced, but enhanced specifically for the areas that are relevant to the use case you are solving. For example, you can add internal company knowledge, specific product manuals, or personal files to the SLM. In short, via a local vector database you provide the SLM with context and additional knowledge that was not part of its training. In this combination, an SLM can already today be as powerful as an LLM for your specific case and context (your tasks, your apps, your business). We'll dive into this a bit more later.
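As a very rough first illustration of what "context at prompt time" looks like: the snippet below simply prepends documents to the user's question before it is handed to the SLM. The documents and the question are made up for illustration; in practice they would be retrieved from a local vector database, which we get to below.

```python
def build_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble an augmented prompt: local knowledge first, then the user question."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Illustrative documents that could come from a local vector database
docs = [
    "The X200 controller is reset by holding the power button for 10 seconds.",
    "Firmware updates for the X200 require the maintenance app, version 2.3 or later.",
]
print(build_prompt("How do I reset the X200 controller?", docs))
```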
In the following, we’ll have a look at the current landscape of SLMs – including the top SLMs – in a handy comparison matrix.
Top SLM Models
| Model Name | Size (Parameters) | Company / Team | License | Source | Quality claims |
|---|---|---|---|---|---|
| DistilBERT | 66M | Hugging Face | Apache 2.0 | Hugging Face | "40% less parameters than google-bert/bert-base-uncased, runs 60% faster while preserving over 95% of BERT's performances" |
| MobileLLM | 1.5B | Meta | Pre-training code for MobileLLM open sourced (Attribution-NonCommercial 4.0 International) | Arxiv.org | "2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models"; "close correctness to LLaMA-v2 7B in API calling tasks" |
| TinyGiant (xLAM-1B) | 1.3B | Salesforce | Training set open sourced (Creative Commons Public Licenses); trained model to be open sourced | Announcement, related research on Arxiv.org | "outperforming models 7x its size, including GPT-3.5 & Claude" |
| Gemma 2B | 2B | Google | Gemma license (not open source by definition, but seemingly pretty much unrestricted use); training data not shared | Hugging Face | Performs well on the Open LLM Leaderboard, but comparing Gemma-2b (2.51B) with Phi-2 (2.7B) on the same benchmarks, Phi-2 easily beats Gemma-2b |
| Phi-3 | 3.8B, 7B | Microsoft | MIT License | Microsoft News | Phi-3-mini runs at about 12 tokens per second on an iPhone 14. The H2O Danube3 benchmarks show Phi-3 delivering top performance among similarly sized models, often beating Danube3 |
| OpenELM | 270M, 450M, 1.1B, 3B | Apple | Apple license; reads much like a permissive OSS license (though you may not use Apple's logo) | Hugging Face, GitHub | OpenELM 1.1B shows 1.28% (zero-shot tasks), 2.36% (OpenLLM Leaderboard), and 1.72% (LLM360) higher accuracy than OLMo 1.2B, while using 2x less pretraining data |
| H2O Danube3 | 500M, 4B | H2O.ai | Apache 2.0 | Arxiv.org, Hugging Face | "competitive performance compared to popular models of similar size across a wide variety of benchmarks including academic benchmarks, chat benchmarks, as well as fine-tuning benchmarks" |
| GPT-4o mini | ~8B (rumoured) | OpenAI | Proprietary | Announcement | Scores 82% on MMLU and currently outperforms GPT-4 on chat preferences on the LMSYS leaderboard; surpasses GPT-3.5 Turbo and other small models on academic benchmarks across both textual intelligence and multimodal reasoning, and supports the same range of languages as GPT-4o |
| Gemini 1.5 Flash 8B | 8B | Google | Proprietary | Announcement on Google for Developers | Smaller, faster variant of 1.5 Flash with half the price, twice the rate limits, and lower latency on small prompts compared to its forerunner; nearly matches 1.5 Flash on many key benchmarks |
| Llama 3.1 8B | 8B | Meta | Llama 3.1 Community License | Hugging Face, Artificial Analysis | MMLU score of 69.4% and a Quality Index across evaluations of 53. Faster than average, with an output speed of 157.7 tokens per second and low latency (0.37s TTFT); 128k context window |
| Mistral-7B | 7B | Mistral | Apache 2.0 | Hugging Face, Artificial Analysis | MMLU score of 60.1%. Significantly outperforms Llama 2 13B on all metrics and is on par with Llama 34B (since Llama 2 34B was not released, results are reported on Llama 34B); also vastly superior in code and reasoning benchmarks. Was the best model for its size in autumn 2023 |
| Ministral | 3B, 8B | Mistral | Mistral Research License | Hugging Face, Artificial Analysis | Claimed (by Mistral) to be the world's best edge models. Ministral 3B has an MMLU score of 58% and a Quality Index across evaluations of 51; Ministral 8B has an MMLU score of 59% and a Quality Index of 53 |
| Granite | 2B, 8B | IBM | Apache 2.0 | Hugging Face, IBM Announcement | Granite 3.0 8B Instruct matches leading similarly sized open models on academic benchmarks while outperforming those peers on benchmarks for enterprise tasks and safety |
| Qwen 2.5 | 0.5B, 1.5B, 3B, 7B | Alibaba Cloud | Apache 2.0 (0.5B, 1.5B, 7B); Qwen Research (3B) | Hugging Face, Qwen Announcement | Includes variants specializing in coding and math. The 7B model has an MMLU score of 74.2% and a 128k context window |
| Phi-4 | 14B | Microsoft | MIT License | Hugging Face, Artificial Analysis | Quality Index across evaluations of 77, MMLU 85%, 16K token context window, ideal for long-text processing. Outperforms Phi-3 and matches or outperforms Qwen 2.5 and GPT-4o mini on many metrics |
SLM Use Cases – the Best Choice for Running Local AI
SLMs are perfect for on-device or local AI applications. On-device / local AI is needed in scenarios that involve hardware constraints, demand real-time or guaranteed response rates, require offline functionality, or need to comply with strict data privacy and security requirements. Here are some examples:
- Mobile Applications: Chatbots or translation tools that work seamlessly on phones even when not connected to the internet.
- IoT Devices: Voice assistants, smart appliances, and wearable tech running language models directly on the device.
- Healthcare: Embedded in medical devices, SLMs allow patient data to be analyzed locally, preserving privacy while delivering real-time diagnostics.
- Industrial Automation: SLMs process language on edge devices, increasing uptime and reducing latency in robotics and control systems.
By processing data locally, SLMs not only enhance privacy but also ensure reliable performance in environments where connectivity may be limited.
On-device Vector Databases and SLMs: A Perfect Match
Imagine a digital assistant on your phone that goes beyond generic answers, leveraging your company’s (and/or your personal) data to deliver precise, context-aware responses – without sharing this private data with any cloud or AI provider. This becomes possible when Small Language Models are paired with local vector databases. Using a technique called Retrieval-Augmented Generation (RAG), SLMs access the additional knowledge stored in the vector database, enabling them to provide personalized, up-to-date answers. Whether you’re troubleshooting a problem, exploring business insights, or staying informed on the latest developments, this combination ensures tailored and relevant responses.
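To make this more concrete, here is a minimal sketch of such a RAG loop in Python. It uses sentence-transformers for local embeddings and a plain in-memory numpy array in place of a real local vector database such as ObjectBox; the documents, the question, and the commented llama-cpp-python model path are illustrative assumptions, not a specific product API.

```python
# Minimal local RAG loop: embed the question, find the closest documents in a
# local store (here just an in-memory numpy array), and pass them as context
# to an on-device SLM.
import numpy as np
from sentence_transformers import SentenceTransformer  # small local embedding model

documents = [
    "The X200 controller is reset by holding the power button for 10 seconds.",
    "Warranty claims must be filed within 24 months of purchase.",
    "The default admin password is set during first boot.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

question = "How do I reset the X200 controller?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Generation with a local SLM, e.g. via llama-cpp-python and a quantized GGUF
# model (the model file path is an assumption -- use whichever SLM you run locally):
# from llama_cpp import Llama
# llm = Llama(model_path="phi-3-mini-4k-instruct-q4.gguf")
# print(llm(prompt, max_tokens=200)["choices"][0]["text"])
print(prompt)
```

In a production setup, the in-memory array would be replaced by a persistent on-device vector database, so the embeddings survive restarts and can be updated incrementally.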
Key Benefits of using a local tech stack with SLMs and a local vector database
- Privacy. SLMs inherently provide privacy advantages by operating on-device, unlike larger models that rely on cloud infrastructure. To maintain this privacy advantage when integrating additional data, a local vector database is essential. ObjectBox is a leading example of a local database that ensures sensitive data remains local.
- Personalization. Vector databases give you a way to enhance the capabilities of SLMs and adapt them to your needs. For instance, you can integrate internal company data or personal device information to offer highly contextualized outputs.
- Quality. Using additional context-relevant knowledge reduces hallucinations and increases the quality of the responses.
- Traceability. As long as you store metadata alongside the vector embeddings, every piece of knowledge retrieved from the local vector database can be traced back to its source (see the sketch after this list).
- Offline-capability. Deploying SLMs directly on edge devices removes the need for internet access, making them ideal for scenarios with limited or no connectivity.
- Cost-Effectiveness. By retrieving and caching the most relevant data to enhance the response of the SLM, vector databases reduce the workload of the SLM, saving computational resources. This makes them ideal for edge devices, like smartphones, where power and computing resources are limited.
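To illustrate the traceability point from the list above: if each stored chunk carries its source metadata next to its embedding, every answer can cite where its knowledge came from. The record layout below is a made-up illustration, not a specific database schema.

```python
# Store source metadata next to each embedding so answers can cite their origins.
from dataclasses import dataclass

@dataclass
class KnowledgeRecord:
    text: str                # the chunk of knowledge itself
    embedding: list[float]   # its vector embedding
    source: str              # e.g. file path or document title
    updated: str             # when the chunk was last refreshed

records = [
    KnowledgeRecord(
        text="The X200 controller is reset by holding the power button for 10 seconds.",
        embedding=[0.12, -0.53, 0.88],  # placeholder vector for illustration
        source="manuals/x200_controller.pdf",
        updated="2024-11-02",
    ),
]

# After a vector search, the matching records carry their sources with them,
# so the final answer can list exactly which documents it was grounded in.
for r in records:
    print(f"{r.text}  (source: {r.source}, updated: {r.updated})")
```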
Use case: Combining SLMs and Vector Databases
Imagine a warehouse robot that organizes inventory, assists workers, and ensures smooth operations. By integrating SLMs with local vector databases, the robot can process natural language commands, retrieve relevant context, and adapt its actions in real time – all without relying on cloud-based systems.
For example:
- A worker says, “Can you bring me the red toolbox from section B?”
- The SLM processes the request and consults the vector database, which stores information about the warehouse layout, inventory locations, and specific task history.
- Using this context, the robot identifies the correct toolbox, navigates to section B, and delivers it to the worker.
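A minimal sketch of this flow is shown below, with made-up warehouse records and a stand-in for the on-device SLM call; in practice the context lines would be retrieved from the local vector database by similarity search, as in the earlier RAG example, and generate_action is a hypothetical placeholder for your SLM runtime.

```python
# Turn a worker's request plus locally stored warehouse context into a prompt
# that asks the on-device SLM for the robot's next action.
warehouse_context = [
    "Inventory: red toolbox stored in section B, shelf 3.",
    "Layout: section B is 12 m north of the charging dock.",
    "Task history: the red toolbox was last used at workstation 4.",
]

request = "Can you bring me the red toolbox from section B?"
prompt = (
    "You are a warehouse robot. Using the context, output the next action.\n\n"
    "Context:\n" + "\n".join(warehouse_context) + f"\n\nRequest: {request}\nAction:"
)

# action = generate_action(prompt)  # hypothetical stand-in for the local SLM call
print(prompt)
```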
The future of AI is in the hands of the users
AI is becoming more personal, efficient, and accessible, and Small Language Models are driving this transformation. By enabling sophisticated local AI, SLMs deliver privacy, speed, and adaptability in ways that larger models cannot. Combined with technologies like vector databases, they make it possible to provide affordable, tailored, real-time solutions without compromising data security. The future of AI is not just about doing more – it’s about doing more where it matters most: right in your hands.