Vector types (aka arrays) added with ObjectBox Java 3.6 release

Vector embeddings (multi-dimensional vectors) are a central building block for AI applications. Accordingly, the ability to store vectors and give your AI applications long-term memory (e.g. via vector databases) is gaining importance. It sounds fancy, but for basic use cases this simply boils down to “arrays of floats” for developers. And this is exactly what the ObjectBox database now supports natively. If you want to use vectors on the edge, e.g. in a mobile app or on an embedded device, offline and independent of an Internet connection, without unpredictable network latency, try it…

See the release notes for all new features this release brings.

Code Examples

Let’s start with a simple example: let’s assume some shapes that use a palette of RGB colors. An entity for this might look like this:

We can now create a query to find all shapes that use a certain color:

Another typical use case is the embedding of certain types of data, like text, audio or images, as vector coordinates. To store such a vector embedding, in the following example we store the floating point coordinates that were computed by a machine learning model for an image together with a reference to the actual image:
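The two data shapes described above can be sketched in plain Java. This is a minimal illustration that deliberately uses no ObjectBox annotations or query API: the class names, field names and the `shapesUsingColor` helper are assumptions made for this sketch, and the in-memory loop only stands in for a database query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins only; no ObjectBox annotations or query API are used here.
class VectorTypesSketch {

    /** A shape with a palette of packed RGB colors (hypothetical field names). */
    static class Shape {
        long id;
        int[] palette;

        Shape(long id, int... palette) {
            this.id = id;
            this.palette = palette;
        }
    }

    /** A reference to an image stored together with its embedding coordinates. */
    static class ImageEmbedding {
        long id;
        String url;
        float[] embedding;

        ImageEmbedding(long id, String url, float[] embedding) {
            this.id = id;
            this.url = url;
            this.embedding = embedding;
        }
    }

    /** In-memory stand-in for "find all shapes that use a certain color". */
    static List<Shape> shapesUsingColor(List<Shape> shapes, int rgb) {
        List<Shape> result = new ArrayList<>();
        for (Shape shape : shapes) {
            if (Arrays.stream(shape.palette).anyMatch(c -> c == rgb)) {
                result.add(shape);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.asList(
                new Shape(1, 0xFF0000, 0x00FF00),
                new Shape(2, 0x0000FF));
        System.out.println("Shapes using red: " + shapesUsingColor(shapes, 0xFF0000).size());

        // Placeholder URL and coordinates; a real embedding would come from an ML model.
        ImageEmbedding image = new ImageEmbedding(1, "https://example.com/image.png",
                new float[] {0.12f, -0.48f, 0.93f});
        System.out.println("Embedding dimensions: " + image.embedding.length);
    }
}
```

In a database-backed version, the array fields would be persisted as they are and the color lookup would become a query condition instead of a loop.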

Ready to go?

To update to this release, change the version of objectbox-gradle-plugin to 3.6.0.

To add ObjectBox database to your JVM or Android project read our Getting Started guide.
As always, we look forward to your feedback on GitHub or via our anonymous feedback form and hope you have a great time building apps with ObjectBox! ❤️

Vector databases – a look at the AI database market with a comprehensive comparison matrix

Vector databases - a look at the AI database market

⭐ What are vector databases? ⭐ What do you need them for? ⭐ Who is in the market? (Updated Oct 2024)

Includes a comparison matrix of vector database options like Pinecone, Milvus, Vespa, Vald, Chroma, Marqo AI, Weaviate, and Qdrant

In 2023 we saw record funding rounds for vector database players. Since then, almost every general-purpose database (like MongoDB, Elastic, Oracle MySQL etc.) has added vector search and related features, basically making all of them vector databases too. There is an ongoing discussion about whether pure players are superior, but as always, the right answer is: “it depends”. Anyway, the vector database market is still very hot in Q4 of 2024 🔥

Of course, everyone, not just investors, is interested in the booming AI market. While AI applications have dominated the news for quite some time, the infrastructure software that supports these applications, such as vector databases, has finally gained more of the spotlight. In the following, we’ll look at why vector databases are gaining attention and compare current vector database alternatives.

What is a vector database? 

A vector database stores vectors, or more precisely vector embeddings. A vector database therefore is a specialised type of database designed to store and manage large sets of vectors efficiently. However, the challenge and value are not derived from simply being able to store vectors. The value is created by the type of computations that can be run over the stored vector data and the speed with which these computations can be run, e.g. similarity searches. 

Vector databases are essentially an important piece of the AI tech stack. They can be used e.g. to give LLMs (Large Language Models) – or more broadly speaking, AI applications – a long-term memory and faster search and querying capabilities. Another important use case is RAG (Retrieval-Augmented Generation).

To give some context: The most traditional databases, SQL databases, store data in rows and columns; graph databases store graphs and object databases store objects.

Because Large Language Models and AI applications rely on vector embeddings, vector databases are especially apt at supporting AI applications. 

Accordingly, vector databases are becoming a critical layer in the AI tech stack; they are sometimes also called “AI databases”. However, databases tend to converge over time, meaning that many databases support several different database models.

What is a vector embedding?

A vector embedding is a list of numbers that represents objects and relationships, allowing unstructured data (such as images) to be searched and used. Typically, Large Language Models (more precisely the underlying Machine Learning (ML) algorithms) are used to create these vectors. The ML algorithms analyse large amounts of data to learn how to represent complex / unstructured data in a lower-dimensional space (as vectors).

What do vector databases have to do with nearest neighbour search?

Searchability (making unstructured data usable) is at the heart of this concept. The distance between vector embeddings expresses the similarity of the vectors (and thus of the represented objects). Since you are searching for the most similar data, the so-called “nearest neighbour search” is a key concept in vector databases, and the time required to find the nearest neighbours is essential.
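To make the idea of “distance expresses similarity” concrete, here is a minimal Java sketch of an exact nearest neighbour search using squared Euclidean distance. It is a brute-force linear scan over all stored vectors, written for illustration only; the class and method names are assumptions, and it is not how any particular vector database implements search.

```java
class NearestNeighbourSketch {

    /** Squared Euclidean distance; smaller means more similar. */
    static double squaredDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the index of the stored vector closest to the query, or -1 if there are none. */
    static int nearestIndex(float[][] vectors, float[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < vectors.length; i++) {
            double distance = squaredDistance(vectors[i], query);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] stored = {{0f, 0f}, {1f, 1f}, {5f, 5f}};
        float[] query = {0.9f, 1.2f};
        System.out.println("Nearest stored vector: index " + nearestIndex(stored, query));
    }
}
```

The scan touches every stored vector, so its cost grows linearly with the data set. This is why search time matters so much and why vector databases typically rely on approximate nearest neighbour (ANN) indexes such as HNSW instead.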

Do we need special vector databases?

There is an ongoing discussion about whether specialised vector databases are needed, or whether they do not warrant a new category in the database landscape and vector extensions of traditional databases could support the AI market instead. Both are reasonable expectations, and time will tell. Notable databases that have already added a vector extension include e.g. Redis and Elasticsearch. Additionally, more and more databases now allow storing vector types.

What does the vector database landscape look like?

To look at the current market situation, we compare the choices with the most traction, excluding established players that have added vector capabilities to their existing database offering. Generally speaking, we see a lot of very young companies, some companies that pivoted from their original specialization, and massive funding rounds. Please note: the table is not optimized to be readable on mobile or small screens (there just is a trade-off between providing the information and making it readable on every device).

If you’re on mobile, use this link to view a version that is readable on mobile.

Name Open Source License GitHub stars  Developed in (language) Summary Business Model Embeds / Uses founding date / first released date In-memory support Sharding Index Types Consistency Model Benchmarks (Performance?) Approximate Nearest Neighbor (ANN) Vector Databases Funding Who's behind it HQ in
ObjectBox Y Apache-2.0   C++, supports native language APIs in Java, Flutter / Dart, Swift, Python, GoLang, and C++ ObjectBox is an on-device vector database for Edge AI on Mobile, IoT, Embedded and other commodity devices Free to use; paid Data Sync HNSW built and optimized from scratch for efficiency / speed on devices with limited resources development of the initial on-device database started in 2015; released the vector search to become the first on-device vector database for productive use early in 2024 Y N HNSW Transactionally safe, ACID   Y Seed in 2018 ObjectBox 🇪🇺
Marqo AI Y Apache-2.0 2.8k ⭐ Python A tensor-based cloud-native commercial Open Source search and analytics engine. Open SaaS Tensor-based   Y HNSW   -   undisclosed preseed in May 2022 S2Search Australia Pty Ltd 🇦🇺
Weaviate Y BSD 5.6k ⭐ Assembly, C++, GoLang Weaviate is a commercial Open Source cloud-native vector database that stores both objects and vectors. Open SaaS started in 2018 as a traditional graph database, first released in 2019 N Y, static sharding a custom HNSW PQ algorithm that supports CRUD Eventual Consistency not comparative, just evaluating their own performance  Y (multiple ANN algorithms as long as they support full CRUD) 67.7M USD, series B SeMI Technologies 🇪🇺
Chroma Y Apache-2.0 4.4k ⭐ Python & Typescript Chroma is a Commercial Open Source vector database Preparing a (Partly Open) SaaS model* [Commercial Open Source] HNSW lib, DuckDB; based on ClickHouse looks like 2022 N Dynamic segment placement       Y 20.3M USD, seed Chroma Inc. 🇺🇸
Qdrant Y Apache-2.0 6.6k ⭐ Rust Qdrant is a Commercial Open Source vector similarity search engine and vector database Open SaaS RocksDB first released: 2021 Y Y, static sharding HNSW (SQ & PQ) Eventual Consistency, tunable consistency compares to weaviate, milvus, elastic (note: redis took too long to complete) Y 9.8M € Qdrant Solutions GmbH 🇪🇺
Milvus Y Apache-2.0 18k ⭐ GoLang & Python Milvus is a cloud-native Commercial Open Source vector database (Partly Open) SaaS* [Commercial Open Source] Initial blog post from them said SQLite, but meanwhile they said RocksDB - exchanged? They also have a ChatGPT-Cache that is built on SQLite and say "Milvus uses SQLite or MySQL to manage metadata" founded 2017, first released: 2019 N Dynamic segment placement ANNOY; HNSW; IVF_PQ; IVF_SQ(; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG Strong, bounded staleness, session, and eventually. The default consistency level in Milvus is bounded staleness.  not comparative Y 113M USD, series B Zilliz 🇺🇸
Vespa Y Apache-2.0 4.4k ⭐ Java & C++ Vespa is a Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data Open SaaS Originally a web search engine (alltheweb), acquired by Yahoo! in 2003 and later open sourced as Vespa in 2017; since Oct 2023 a spinoff, raised series A in Nov 2023 maintains disk and memory structures for documents Y Custom HNSW (Multi-vector hybrid HNSW-IF) Eventual Consistency not comparative  Y Spinoff from Yahoo! in Oct 2023, then raised a 31M USD series A Yahoo! 🇺🇸
Vald Y Apache-2.0 1.2k ⭐ GoLang Vald is a cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine Community project, currently looks like no commercial interests are pursued uses the vector search engine NGT Technology incubation at Yahoo! Japan Corporation, development was started in 2019 N/A N/A N/A not comparative, but Vald performance only Y (NGT) - Yusuke Kato (Yahoo Japan Corporation), Kiichiro Yukawa (Yahoo Japan Corporation) 🇯🇵
Pinecone N Proprietary NA   Pinecone is a fully managed vector database that specializes in enabling semantic search capabilities SaaS built on top of Faiss first released in 2019 N Y proprietary Eventual Consistency more programming language comparison for vector databases Y (proprietary), plus KNN (with Faiss) 138M, series B Pinecone Systems Inc 🇺🇸

Want to know more about the vector database market?

Here are some more questions answered for anyone interested

What is an "Open SaaS" business model?

Software as a service (SaaS) refers to software that is managed / hosted for the client and is essentially “rented.” The open in Open SaaS refers to the open source software that is being offered as such a service.

This frequently implies that not all code is open source, particularly that which is part of the managed service / hosting and associated value-adding features. Note: The open source software offered in this manner may or may not be provided by the company providing the software as a service. This has caused some friction in the open source community, as original creators often struggle to make a living, and/or maintainers struggle to keep maintaining the software – while other companies profit. Most famously, huge cloud providers have taken advantage of this option, leading to new licenses that keep the source open but restrict others from hosting as a service without donating the whole source code back to the community.

Why should I care about index types?

Indexes are essentially a way to speed up searching a database. There are several established index types for vector databases and they affect the performance of the database, e.g. the time it takes a query to complete.

What about benchmarks?

If you review the benchmarks given above, you will see that results typically vary. Benchmarks are difficult to do, and neutral benchmarks even more so. Certain use cases may favor certain solutions. Therefore, ideally you benchmark based on your specific use case. As a first evaluation, though, try to understand the basic influencing factors and have a look at a handful of benchmarks and explanations. Having said all this: there is a benchmarking tool available for approximate nearest neighbor (ANN) search algorithms. If you use it, you can compare the performance of different databases (with regard to ANN search) for the same setup, based on the same approach. Also, the underlying libs often used by databases (like NGT and HNSW, see above) have already been benchmarked with it, and you can compare against these directly.

Why is the market so hot, how can companies raise so much money?

AI is hot, everyone agrees that data and its management will be key to future success, and the database market is interesting: it is a long-established market with many players, yet it still shows consistently good growth (e.g. 17% in 2020). Database market history also shows that from time to time a new type of database comes up, and with it a new market category. In such a market, the market creator typically “takes all” (not quite literally, but such a significant share, definitely the vast majority, that all other players are not attractive from a VC perspective). Such a market could easily be worth 100M+ in ARR. Examples from the last 20 years: MongoDB (NoSQL databases), Cockroach (NewSQL databases), Neo4j (Graph databases), Influx (Time-Series databases). So, VCs are looking to find the next new type of database that can create a market… Maybe it will be vector databases? However, the database market has also shown that it can take 10+ years for players to become profitable, so expect a long-term game. The race is still on for Edge Databases, we think 🙂

Want to know more about the database market?

We recommend checking out db-engines. The website compares all relevant systems and has tons of data from the last 20 years. Note: they only add databases once they have some traction and notability, not every hobby project. Accordingly, not all databases in the above comparison have been added to the website yet.

What is an Edge Database, and why do you need one?

Edge Databases – from trends to use cases

Data is decentralized. Cloud computing is centralized. Forcing the decentralized world into the centralized cloud topology is not only inefficient, but also economically, ecologically and socially wasteful – and sometimes simply impossible.

To drive digitization and extract value from decentralized data, we need to give the cloud an edge, or more precisely add Edge Computing. Edge computing is a decentralized topology for storing and processing data as close as possible to the data source, i.e., the place where the data is produced, at the edge of the network.

Valuable data is increasingly generated in a decentralized manner – outside traditional and centralized data centers and cloud environments. The dominance of centralized cloud computing approaches slows down digitization and the use of this existing decentralized data. Therefore, according to Gartner (2023) “Edge computing is integral to digital transformation”, and we need infrastructure technologies for the edge that enable developers to quickly and reliably work with decentralized edge data.

Edge Database (Foundation for Edge Data Management) is a new type of database that addresses these requirements. Developers need fast local data persistence and decentralized data flows (Data Sync) to implement edge solutions. Edge Databases solve these core edge functionalities out-of-the-box, allowing application developers to quickly implement edge solutions.

Megatrend to decentralized Edge Computing

By 2030, 30+ billion IoT devices will be creating ~4.6 trillion GB of data per day. The growing number of devices, the volume, variety, and velocity of data, and bandwidth infrastructure limitations make it infeasible to store and process all data in a centralized cloud. On top of that, new use cases come with new requirements that a centralized cloud infrastructure cannot meet, for example soft and hard response time requirements, offline functionality, and security and data protection regulations.

These trends accelerate the shift away from centralized cloud computing to a decentralized edge computing topology. Edge computing refers to decentralized data processing at the “edge” of the network. For example, in a car, on a machine, on a smartphone, or in a building. Hardware specifications do not capture the definition of an “edge device”. The crucial point is rather the decentralized use of data at, or as close as possible to, the data source.

Edge computing itself is not a technology but a topology, and according to McKinsey, one of the top growing trends in tech in 2021. The technologies needed to implement the edge computing topology are still inadequate. More specifically, there is a gap in basic “core” edge technologies, so-called “software infrastructure”. This gap is one of the main reasons for the failure of edge projects.

Needed: Infrastructure Software for Edge Computing

With computing shifting to the edge of the network, the needs of this decentralized topology become clear:

Need for fast local data storage

→ e.g. a machine on the factory floor collects data on stiffness, friction, and pressure points. There is limited space on the device, and typically no connection to the Internet. Even with an Internet connection, high data rates quickly push the available bandwidth, as well as the associated networking / cloud costs, to the limit. To be able to use this data, it must be persisted in a structured manner at the edge, e.g. stored locally in a database.

Need for reliable on-device data flows

→ e.g. a car is an edge device consisting of many control units. Therefore, data must be stored on multiple control units. In order to access and use the data within several of the control units of the car, the data must be selectively synchronized between the devices. A centralized structure and thus a single point of failure is unthinkable.

Need for edge-to-edge-to-cloud data flows

→ e.g. in a manufacturing hall: typically, you will find any number of diverse devices, from sensors to brownfield and greenfield devices, and no Internet connectivity. At the same time, there are diverse employee devices such as tablets or smartphones, as well as central PCs, and a cloud. To extract value from the data, it must be available in raw, aggregated, or summary form in different places. This means it needs to be synchronized efficiently and selectively, with possible conflicts resolved.

Need for flexible edge data management

→ e.g. with the rise of IoT, time-series data have become common. However, time series data alone is usually not sufficient, and needs to be combined with other data structures (like objects) to add value. At the same time, a push to standardize data formats in industries (e.g. VSS in automotive or Umati in Industrial IoT) requires that the database supports flexible data structures.

Developing solutions without software infrastructure on an individual level is possible, but has many drawbacks:

  • Custom in-house implementations are cumbersome, slow, costly, and typically scale poorly.
  • Oftentimes, applications or certain feature sets become unfeasible to deliver because of the lack of core software infrastructure.
  • Legacy code and individual workarounds create problems over the lifetime of a product.
  • Instead of a thriving ecosystem, only a few big players are able to implement edge solutions.
  • Innovation and creativity are limited.

An edge database is part of the solution and enables the entire edge ecosystem to build edge applications faster, cheaper and more efficiently.

What is an Edge Database?

An Edge Database is a type of database specifically tailored to the unique requirements of the Edge Computing topology. Edge Databases run directly on-device, locally, and make it easy for app developers to access decentralized data from edge devices when and where needed. Using an Edge Database removes the burden of implementing ways to synchronize data, which is non-trivial, time-consuming, risky, and brings ongoing maintenance needs. Let’s look at this in more detail:

First, an Edge Database is optimized for resource efficiency (CPU, memory, …) and performance on resource-constrained devices (embedded devices, IoT, mobile). It has a small footprint of a few megabytes max. Traditional databases such as MySQL or MongoDB are too large and slow for typical edge devices, making them unsuitable for computing at the edge. Nevertheless, with integrations like the one between ObjectBox and MongoDB, developers can now combine the on-device efficiency and offline-first capabilities of an Edge Database with MongoDB’s scalable cloud platform to enable seamless, bi-directional synchronization between the edge and the cloud.

An edge device without data flows to/from other devices is just a data island with very limited utility. Accordingly, an Edge Database must support the management of decentralized data flows. There is no more efficient way than at the database level. This ideally includes a range of conflict resolution strategies due to the decentralized and multi-directional structure of the Edge. 
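As an illustration of what a conflict resolution strategy can look like, here is a minimal Java sketch of “last writer wins”, one of the simplest strategies. It is an assumption-laden example, not a description of how ObjectBox Data Sync or any other product resolves conflicts: the `Versioned` type, its fields and the tie-break rule are hypothetical.

```java
class ConflictResolutionSketch {

    /** A value together with the metadata needed to resolve conflicts (hypothetical type). */
    static final class Versioned {
        final String value;
        final long updatedAtMillis;
        final String deviceId;

        Versioned(String value, long updatedAtMillis, String deviceId) {
            this.value = value;
            this.updatedAtMillis = updatedAtMillis;
            this.deviceId = deviceId;
        }
    }

    /**
     * Last writer wins: the more recent update is kept.
     * Ties are broken by device id so that every device picks the same winner.
     */
    static Versioned lastWriterWins(Versioned local, Versioned remote) {
        if (local.updatedAtMillis != remote.updatedAtMillis) {
            return local.updatedAtMillis > remote.updatedAtMillis ? local : remote;
        }
        return local.deviceId.compareTo(remote.deviceId) >= 0 ? local : remote;
    }

    public static void main(String[] args) {
        Versioned tablet = new Versioned("pressure=4.1", 1_000L, "tablet-7");
        Versioned sensor = new Versioned("pressure=4.3", 2_000L, "sensor-2");
        System.out.println("Winner: " + lastWriterWins(tablet, sensor).value);
    }
}
```

Last writer wins is easy to reason about but silently discards the older update and depends on reasonably synchronized clocks, which is one reason why a range of strategies is desirable.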

Last but not least, data security is of growing importance, and data in motion needs to be protected. From a database perspective, data at rest is often protected by the OS and is therefore less of a concern for most applications.

 

When do you need an Edge Database?

Most IoT applications need to store and synchronize data. An Edge Database is always useful when functions / applications are planned that:

  • should work offline and independent of an internet connection
  • need to guarantee fast response times
  • work with large volumes of data, possibly at high frequency
  • need to serve many devices at the same time
  • need historical data

In addition, developers often decide to use an Edge Database to save time and effort, or to be able to react quickly and flexibly to future requirements.

Edge Database Use Case Example in Manufacturing

Today, you can find everything from low-frequency brownfield devices to high-frequency greenfield devices on a factory floor. As a rule, the machine controllers in use are not designed to store or transmit data. They usually lack not only the functionality, but also the resources to support this. Therefore, additional edge devices are often needed to collect, analyze and interpret the huge amounts of data that each machine produces on site. For such an edge device, rapid data persistence and ingestion, and efficient data flow from edge-to-edge and edge-to-cloud are at the heart of value creation. The clear separation of machine control and edge data processing unit ensures that there is no risk of unintentional interference with the machine controller. An edge device with a powerful edge database can support multiple use cases on the shop floor today:

1. Operational efficiency

Process optimization along the line to increase quality and reduce damage. When the first machine in a production line uses a new batch of material, e.g. in sheet metal processing, one of the first steps is to cut a sheet to the required size. At this stage, the machine can already detect differences in the metal compared to a previous batch (deviations are allowed within the DIN standard). With an edge device, this data can be evaluated and the relevant information passed on to the next machine. With this data, machines further down the line can avoid damage to / breakpoints in the material.

2. Condition monitoring

Continuous machine condition monitoring reduces downtime and increases maintenance efficiency. A constant stream of high-frequency machine data is compared against the fingerprint of the machine. Any slight deviation is immediately detected and reported. Catching deviations early reduces down-times and costly repairs.
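The comparison against a machine “fingerprint” can be sketched as a simple statistical check. In this minimal Java example, the fingerprint is assumed to be the mean and standard deviation of baseline readings, and a reading counts as a deviation when it lies more than `k` standard deviations from the mean. Both the fingerprint definition and the threshold rule are assumptions for illustration; real condition monitoring typically uses richer models.

```java
class ConditionMonitoringSketch {

    /** Arithmetic mean of the baseline readings. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation of the baseline readings around the given mean. */
    static double stdDev(double[] values, double mean) {
        double sumOfSquares = 0;
        for (double v : values) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumOfSquares / values.length);
    }

    /** True if the reading is more than k standard deviations away from the baseline mean. */
    static boolean deviates(double reading, double mean, double stdDev, double k) {
        return Math.abs(reading - mean) > k * stdDev;
    }

    public static void main(String[] args) {
        // Hypothetical baseline readings that make up the "fingerprint".
        double[] baseline = {10.0, 10.2, 9.8, 10.1, 9.9};
        double mean = mean(baseline);
        double stdDev = stdDev(baseline, mean);

        for (double reading : new double[] {10.1, 11.0}) {
            System.out.println(reading + " deviates: " + deviates(reading, mean, stdDev, 3.0));
        }
    }
}
```

Running such a check on the edge device means each high-frequency reading can be evaluated locally, and only detected deviations need to be reported further.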

3. Historical Data

Historical data is stored for learning and training to optimize the production line. With an Edge Database, the data is persisted and thus available in the event of faulty behavior. In case of an error, the data preceding the incident can be analyzed and used to find the causes and predict, or even avoid, such an error in the future. Chances are that “fuzzy expert knowledge” already available at the production site can be translated into deterministic rules when tested with these data sets.

The future of Edge Databases 

Edge computing provides numerous benefits and enables many applications and functionalities that are only possible with edge computing. However, only a few (usually large) players have been able to create value in edge computing projects, gaining competitive advantages. One reason is a lack of basic edge software. A thriving edge ecosystem necessitates edge software infrastructure that addresses the fundamental recurring needs of edge projects. Edge databases are a critical component in the development of such an ecosystem.

Looking ahead, the emergence of on-device vector databases, coupled with small language models (SLMs), is transforming the landscape of AI applications. These technologies enable AI apps to run directly on edge devices, providing long-term memory, improving performance, and significantly reducing resource consumption. By processing data locally, they eliminate the need for constant cloud connectivity, enhancing privacy and efficiency. Companies like Apple have already embraced on-device AI (Apple Intelligence), showcasing its potential to deliver advanced functionalities seamlessly. This shift represents a game-changer, making AI more sustainable, scalable, and integrated into everyday use.

Green Coding: Developing Sustainable Software for a Greener Future

Digitization helps to save CO₂ – many experts agree on that. But things are not that simple, because the creation of software and its use contribute to greenhouse gas emissions too. All code creates a carbon footprint. Software development and use affect the environment from the energy consumed while running to the associated electronic device waste. Choosing a sustainable software architecture matters, but every developer also can make a difference by applying green coding principles. 

This article will explore the importance of green software development and its main principles.

Green Software Development: Balancing Digitization and Environmental Sustainability

In this section, we’ll first define some important terms in environmentally conscious software development. Then, we’ll discuss why it is relevant and what broader benefits come from adopting green coding practices.

What does sustainability in software development mean?

In our view, sustainability in software development (also “green software development”) entails developing and maintaining software in a way that is not only environmentally, but also socially and economically responsible. So, what really counts is the long-term bottom-line value from a general societal perspective, not an “individual balance sheet”.

There are many trade-offs in such an ambition, and therefore sustainable software development is rather a set of guiding principles than hands-on measures that are truly the same for everyone. Let’s dive a bit into how sustainable software development can contribute to all three aspects:

Environmental aspects

Since software is a significant source of direct greenhouse gas emissions, it is becoming more important to create software that reduces resource use as much as possible. As the world becomes more reliant on technology, energy consumption and carbon footprint of software will continue to grow. By adopting green software development practices, software developers can help to mitigate these environmental impacts.

Broader Economic contribution

If one piece of software uses less energy and fewer resources to accomplish the same tasks as another, its users can reduce their operating costs and improve their bottom line. Increasing the longevity of hardware (less wear, but also lower hardware requirements that extend the usability of existing hardware) also yields direct economic savings for the software users (companies as well as individuals). On a broader level, this compounds significantly over the number of users and with time and thus contributes to economic welfare. What sounds like a small contribution does add up tremendously in the end…

Social impact

Sustainable software development includes responsibility for the social impact of the software created. As a result, sustainable software aims to be transparent, inclusive, and offer data sovereignty. By giving individuals and organizations greater control over their own data, software empowers them and protects their privacy. At the same time, it promotes greater accountability and transparency in data-driven decision-making.

Overall, sustainability in software development involves taking a holistic approach. On top, sustainable software companies take steps to minimize negative impacts and promote positive ones over the long term.

This is why it has been one of our core values since we started ObjectBox:

Be Sustainable in every respect – we apply sustainability to our technology, as well as to our people and small everyday decisions. ObjectBox aims to be the most resourceful data management solution for connected devices. We strive to save resources (energy, CO₂, bandwidth, time, etc.), but also always choose the sustainable path (recycled paper, saving energy, etc.), and support our employees to lead balanced and sustainable lives.

What is green coding / green software development?

Recently, the term “green coding” has emerged to describe the practice of creating and writing code (aka software) in a way that minimizes its environmental impact. This can involve using efficient code that consumes less energy, optimizing data usage, and reducing electronic waste.

What is the difference between Green IT and Green Coding?

Green IT is primarily about the hardware and the optimization of data centers. Today, it often actually is about optimizing cloud usage. The code decides whether this hardware is used efficiently. By contrast, green coding is about making the code more efficient, so that running the code (e.g. using an app on the smartphone, or using an email program) uses less resources and less electricity, thus producing less CO₂. 

Why is it time for developers to prioritize environmental sustainability?

Various studies estimate the carbon footprint of the digital economy to be between 2.3% and 3.7% of global CO₂ emissions 😱 [1]. Although the impact of software on the environment may not yet be as dramatic as that of manufacturing, it keeps growing rapidly each year. By making sustainable decisions in software development, we can make it part of the carbon solution of the future.

Every line of code – scaled up to hundreds, thousands, or even millions of devices (desktops, smartphones, tablets…) worldwide – has the potential to significantly reduce energy consumption and CO₂ emissions.

How to put sustainable software development into practice?

We believe two key aspects of developing sustainable software that creates bottom-line value are:

  • minimize the resource consumption of software, especially during operation, where most resources are consumed – be diligent about that; it compounds
  • keep data as much as possible where it is produced, used and belongs (e.g. with the end users) and avoid unnecessary data transfers, superfluous cloud use, and unnecessarily storing data in the cloud

Both measures have significant environmental, social, and economic impact, short- and long-term.

It’s time we as developers start thinking about our impact on the planet and make sustainability a part of our everyday coding mindset. We can make a difference by incorporating sustainability into every action and decision we take when developing software. Carefully measuring and optimizing resource use along the way is also important. The welcome side effect: fast software that is cheap to run and fun to use 🙂

For example, at ObjectBox, we’re all about maximizing the use of computing resources and minimizing the resource waste of every line of code (LOC). This makes ObjectBox not only environmentally sustainable, but at the same time superfast, usable on low-end devices with low hardware requirements, and cheap in operational costs 🤯

💚 Responsible development practices pay off in several respects and we really cannot see a huge tradeoff. All it costs is spending more time and brainpower on optimizations, benchmarking, and diligently applying this approach to every line of code.

💚 As a developer tool, our impact is broader than a developer’s impact on end-users. So, we’re committed to using resources efficiently and reducing waste at every stage of the game.

Guidelines to start making your code more sustainable

Some more tips on how to put sustainable software development into practice:

  • Energy efficiency: Developing software that is energy-efficient can help to reduce its environmental impact by minimizing the amount of energy required to run software. 
  • Responsible sourcing: Using responsibly sourced hardware, software, and other materials can help to reduce the environmental impact of software development.
  • Longevity: Developing software that is designed to last can help to reduce waste and promote sustainability by reducing the need for frequent updates and replacements.
  • Accessibility: Making software accessible to a wide range of users can help to promote social sustainability by ensuring that everyone has access to the benefits of technology.
  • Data sovereignty, privacy and security: Protecting user data and maintaining strong cybersecurity measures can help to promote sustainability by preventing data breaches and other security incidents that can have negative social and economic impacts.

Examples of sustainable coding: More impactful than you would expect

1. How can a millisecond be worth 2 days?

Real-world example: By reducing the resolution of images in a banking app with 500,000 users who, on average, opened it daily, developers saved more than 2 days of total operational time (uptime) [2].
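The headline figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a saving of one millisecond per app launch, one launch per user per day, and a period of one year. These inputs are assumptions for illustration (only the user count comes from the example) and are not taken from the cited source.

```python
# Back-of-the-envelope estimate: total time saved by a tiny per-launch speed-up.
# Only the user count comes from the example; the other inputs are assumptions.
USERS = 500_000                    # app users (from the example above)
LAUNCHES_PER_USER_PER_DAY = 1      # assumption: each user opens the app once a day
SAVED_SECONDS_PER_LAUNCH = 0.001   # assumption: 1 millisecond saved per launch
DAYS = 365                         # assumption: savings summed over one year


def total_days_saved(users, launches_per_day, saved_seconds, days):
    """Return the accumulated time saved across all users, in days."""
    total_seconds = users * launches_per_day * saved_seconds * days
    return total_seconds / 86_400  # 86,400 seconds per day


days_saved = total_days_saved(
    USERS, LAUNCHES_PER_USER_PER_DAY, SAVED_SECONDS_PER_LAUNCH, DAYS
)
print(f"Total time saved per year: {days_saved:.2f} days")
```

Under these assumptions, a single millisecond per launch adds up to a little over two days per year.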

2. How can 2 grams of CO₂ savings / hour be worth 330,000 t CO₂?

Theoretical consideration: Netflix states that streaming its content produces 55 grams of CO₂ per hour [3]. This gives us 40 kilograms of CO₂ per year for daily streaming of two hours per person [4]. With about 230 million Netflix users, a reduction would have an enormous scaling factor [5]. Assuming a Netflix developer reduces the 55 grams to 53 grams, you get roughly 330 kt of CO₂ in potential savings per year. Note: This is a highly theoretical example, just to demonstrate the thinking.
Either way, individuals cannot save that much as easily. That’s the impact you as a programmer have!
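The arithmetic behind this thought experiment can be reproduced in a few lines. The sketch below uses only the figures quoted above; the function and variable names are made up for illustration.

```python
# Reproduces the Netflix thought experiment with the figures quoted above.
GRAMS_PER_HOUR = 55         # CO2 per streaming hour, as stated by Netflix
HOURS_PER_DAY = 2           # daily streaming time per person
USERS = 230_000_000         # approximate number of Netflix users
SAVED_GRAMS_PER_HOUR = 2    # hypothetical reduction from 55 g to 53 g
DAYS_PER_YEAR = 365


def yearly_grams(grams_per_hour, hours_per_day, days=DAYS_PER_YEAR):
    """CO2 in grams per person and year for a given hourly footprint."""
    return grams_per_hour * hours_per_day * days


# Footprint of one person streaming two hours a day, in kilograms.
per_person_kg = yearly_grams(GRAMS_PER_HOUR, HOURS_PER_DAY) / 1_000

# Savings across all users if each streaming hour emits 2 g less, in tonnes.
total_saved_tonnes = (
    yearly_grams(SAVED_GRAMS_PER_HOUR, HOURS_PER_DAY) * USERS / 1_000_000
)

print(f"Per person: {per_person_kg:.2f} kg CO2 per year")
print(f"Potential savings: {total_saved_tonnes:,.0f} t CO2 per year")
```

With these inputs the result is about 40 kg per person and roughly 336,000 t in total, which is in the same ballpark as the rounded 330 kt mentioned above.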

3. How much CO₂ can local storage save in 1 million cars?

Sending and storing 1 GB of data in the cloud needs about 5 kWh of electricity, while local storage only needs about 0.000005 kWh, which is a million times lower. Making the switch to local storage in 1 million cars would lead to saving 905 kg of CO₂ every second. If you want to know what that actually means, you can translate that into equivalents: CO₂ equivalencies or the CO₂ calculator.

👉 These examples clearly illustrate the potential impact of shifting towards an environmentally conscious mindset when developing software. Now that we know the why, it’s time to discuss the how.

Sustainable Edge Data Management with ObjectBox – a ready-made developer tool

ObjectBox is a free Edge Database that can help reduce the environmental impact of apps. It is optimized for computing resource efficiency and empowers developers to store and use data locally and create offline-first apps. Unless the data is really needed in the cloud, this is way more energy-efficient and sustainable compared to a cloud setup. On top, it works independently of an Internet connection and is superfast while saving battery, making it an ideal choice for apps that prioritize sustainability.

What is an Edge Database?

An Edge Database is a type of database that is used on the “edge” of a network, closer to the data sources and devices generating data. Traditional databases, on the other hand, are usually set up in centralized data centers or in the cloud.

Edge databases are essential when devices need to work offline, when response times must be guaranteed, when speed is of the essence, when Internet connectivity is limited, in mission-critical scenarios, or when handling high-frequency data. By processing data locally on the edge, Edge Databases can reduce latency and improve performance while also reducing the amount of data transferred over the network.

Edge databases have a small footprint and are designed to run on restricted devices such as routers, IoT gateways, mobile phones, and other embedded systems. They typically incorporate features needed in distributed systems, such as data synchronization, caching, and offline support to ensure that data remains available even in the event of network outages or other disruptions.

ObjectBox Sync is a highly efficient and sustainable data synchronization solution. It reduces the amount of energy used by having as little overhead as possible when sending data combined with solid compression, avoiding data transformations, and only syncing data changes instead of sending all data to the cloud all the time. Developers have control over what data is synced when.
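To illustrate the idea of syncing only data changes instead of all data, here is a minimal sketch in Python. It is not how ObjectBox Sync is implemented and does not use its API: the function name `compute_delta` and the record layout are hypothetical, and plain dictionaries stand in for a database.

```python
# Minimal sketch of "sync only what changed" (delta sync).
# Hypothetical helper for illustration; not the ObjectBox Sync API.


def compute_delta(last_synced, current):
    """Compare two snapshots (dicts of id -> record) and return the changes."""
    upserts = {
        key: record
        for key, record in current.items()
        if last_synced.get(key) != record  # new or modified records
    }
    deletes = [key for key in last_synced if key not in current]
    return {"upserts": upserts, "deletes": deletes}


last_synced = {
    1: {"name": "sensor-a", "value": 20},
    2: {"name": "sensor-b", "value": 21},
    3: {"name": "sensor-c", "value": 22},
}
current = {
    1: {"name": "sensor-a", "value": 20},  # unchanged: not sent
    2: {"name": "sensor-b", "value": 25},  # modified: sent
    4: {"name": "sensor-d", "value": 19},  # new: sent
}                                          # record 3 was deleted locally

delta = compute_delta(last_synced, current)
print(delta)
```

In this example only two records and one deletion would go over the network, while the unchanged record is never transferred.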

Overall, ObjectBox DB + Sync is a powerful tool for building fast apps that prioritize consuming less energy and saving device resources. By storing data locally and only syncing when and where needed, developers can ensure that their apps are as sustainable as possible, and save on cloud costs along the way. 

What is Data Synchronization + How to Keep Data in Sync

What is Data Sync / Data Synchronization in app development?

Data Synchronization (Sync) is the process of establishing consistency and consolidation of data between different devices, including offline data sync to ensure accessibility even without a constant internet connection. It is fundamental to most IT solutions, especially in IoT and Mobile. Data Sync entails the continuous harmonization of data over time and typically is a complex, non-trivial process. Even large corporations struggle with its implementation and have had to roll back Data Sync solutions due to technical challenges.

The question Data Sync answers is

How do you keep data sets from two (or more) data stores / databases – separated by space and time – mirrored with one another as closely as possible, in the most efficient way?

Data Sync challenges include asynchrony, conflicts, slow bandwidth, flaky networks, third-party applications, and file systems that have different semantics.

Data Sync versus Data Replication in Databases

Data replication is the process of storing the same data in several locations to prevent data loss and improve data availability and accessibility. Typically, data replication means that all data is fully mirrored / backed up / replicated on another instance (device/server). This way, all data is stored at least twice. Replication typically works in one direction only (unidirectional); there is no additional logic to it and no possibility of conflicts.

In contrast, Data Sync typically relates to a subset of the data (selection) and works in two directions (bi-directional). This adds a layer of complexity, because now conflicts can arise. Of course, if you select all data for synchronization in one direction, it will yield the same result as replication. However, replication cannot replace synchronization.
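The difference can be sketched with two in-memory stores. This is a simplified illustration under stated assumptions: the function names are hypothetical, each record carries a plain integer timestamp, and conflicts are resolved with "last write wins", which is only one of many possible strategies and not necessarily what any particular Sync product does.

```python
# Sketch: one-way replication vs. two-way sync on in-memory stores.
# Each store maps key -> (value, timestamp). All names are hypothetical.


def replicate(source, target):
    """Unidirectional: the target becomes a full copy of the source."""
    target.clear()
    target.update(source)


def sync(store_a, store_b, selected_keys):
    """Bidirectional: merge a selected subset; the newer write wins.

    Last-write-wins is only one possible conflict strategy, chosen for brevity.
    """
    for key in selected_keys:
        a, b = store_a.get(key), store_b.get(key)
        if a is None and b is None:
            continue  # nothing to sync for this key
        # Pick the version with the later timestamp; a missing record loses.
        if b is None or (a is not None and a[1] >= b[1]):
            winner = a
        else:
            winner = b
        store_a[key] = winner
        store_b[key] = winner


phone = {"note": ("buy milk", 10), "draft": ("private text", 12)}
server = {"note": ("buy oat milk", 11), "todo": ("call Bob", 9)}

# Two-way sync of a selection: "draft" is not selected and stays on the phone.
sync(phone, server, selected_keys=["note", "todo"])
print(phone)
print(server)

# One-way replication: the backup mirrors the phone completely.
backup = {"stale": ("old", 1)}
replicate(phone, backup)
print(backup == phone)
```

The conflicting "note" record ends up with the newer server version on both sides, the "todo" record travels from the server to the phone, and the unselected "draft" never leaves the phone.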

Why do you need to keep data in sync?

Think about it – if clocks were not in sync, everyone would live on a different time. While I can see an upside to this, it would result in many inefficiencies as you could not rely on schedules. When business data is not in sync (up-to-date everywhere), it harms the efficiency of the organization due to:

  • Isolated data silos
  • Conflicting data / information states
  • Duplicate data / double effort
  • Outdated information states / incorrect data

In the end, the members of such an organization would not be able to communicate and collaborate efficiently with each other. They would instead be spending a lot of time on unnecessary work and “conflict resolution”. On top, management would miss an accurate overview and data-driven insights to prioritize and steer the company. The underlying mechanism that keeps data up-to-date across devices is a technical process called data synchronization (Sync), which often requires offline data sync capabilities to maintain consistency even when devices are offline. And while we expect these processes to “just work”, someone needs to implement and maintain them, which is a non-trivial task.

Growing data masses and shifts in data privacy requirements call for sensible usage of network bandwidth and the cloud. Edge computing with selective data synchronization is an effective way to manage which data is sent to the cloud, and which data stays on the device. Keeping data on the edge and synchronizing selected data sets reduces the data volume that is transferred via the network and stored in the cloud. Accordingly, this means lower mobile networking and cloud costs. On top, it also enables higher data security and data privacy, because it makes it easy to store personal and private data with the user. When data stays with the user, data ownership is clear too.

Unidirectional Data Replication

Bidirectional Data Synchronization

Out-of-the-box Sync magic: Syncing is hard

Almost every Mobile or IoT application needs to sync data, so every developer is aware of the basic concept and challenges. This is why many experienced developers appreciate out-of-the-box solutions. While JSON / REST offers a great concept to transfer data, there is more to Data Sync than what it looks like at a glance. Of course, the complexity of Sync varies widely depending on the use case. For example, the amount of data, data changes, synchronous / asynchronous sync, and number of devices (connections), and what kind of client-server or peer-to-peer setup is needed, all affect the complexity.

What looks easy in practice hides a complex bit of coding and opens a can of worms for testing. For an application to work seamlessly across devices – independent of the network, which can be offline, flaky, or only occasionally connected – an app developer must anticipate and handle a host of local and network failures to ensure data consistency. Offline sync capabilities help applications continue to function in these scenarios, ensuring reliable performance. Moreover, for devices with restricted memory, battery and/or CPU resources (i.e. Mobile and IoT devices), resource sensitivity is also essential. Data storage and synchronization solutions must be both effective / efficient, and sustainable.

How to Keep Data in Sync Without the Headache?

Thankfully, there are out-of-the-box data synchronization solutions available on the market, which solve data syncing for developers. They fall broadly into two categories: cloud-dependent data synchronization, and independent, “edge” data synchronization. Cloud-based solutions, like Firebase, require a connection to the internet to function. Data is sent to and requested from the cloud constantly. Edge solutions, like ObjectBox, also offer “Offline Sync”: data is stored in an efficient on-device database, synchronization on and between edge devices can run continually without an Internet connection, and Data Sync with a cloud or a backend that is not located on-premise occurs once the device(s) go online. Below, we summarize the most popular market offerings for data synchronization (offline and cloud-based):
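The edge approach can be pictured as a local store plus a queue of pending changes that is flushed once a connection becomes available. The following sketch is a simplified illustration of that idea only: the class `EdgeStore` and its methods are hypothetical and unrelated to the actual ObjectBox or Firebase APIs, and a plain dictionary stands in for the remote backend.

```python
# Sketch of offline-first writes with a pending-changes queue ("outbox").
# Hypothetical class for illustration only; not a real Sync API.
from collections import deque


class EdgeStore:
    def __init__(self):
        self.local = {}        # on-device data, always readable and writable
        self.outbox = deque()  # changes not yet sent to the backend
        self.backend = None    # set once a connection is available
        self.online = False

    def put(self, key, value):
        """Write locally first, so the app keeps working while offline."""
        self.local[key] = value
        self.outbox.append((key, value))
        if self.online:
            self.flush()

    def go_online(self, backend):
        """Called when a connection becomes available."""
        self.backend = backend
        self.online = True
        self.flush()

    def flush(self):
        """Send queued changes in order, oldest first."""
        while self.outbox:
            key, value = self.outbox.popleft()
            self.backend[key] = value


store = EdgeStore()
store.put("temp", 21.5)       # offline: stored locally and queued
store.put("temp", 22.0)
pending_before = len(store.outbox)

cloud = {}                    # a dict stands in for the remote backend
store.go_online(cloud)        # queued changes are sent now
store.put("humidity", 40)     # online: sent right away
print(pending_before, len(store.outbox), cloud)
```

A real solution additionally has to deal with failed transfers, conflicts, and changes arriving from other devices, which is where most of the complexity lies.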

Couchbase

Couchbase is a Cloud DB, Edge DB and Sync offering that requires the use of Couchbase servers.

Firebase

Firebase is a Backend as a Service (BaaS) offering that was acquired by Google. Google offers it as a cloud-hosted solution for mobile developers.

Mongo Realm

Realm was acquired by MongoDB in 2019; the Mongo Realm Sync solution (Atlas Device Sync) used Realm DB on edge devices and synchronized with a MongoDB instance hosted in the cloud. However, MongoDB recently announced end-of-life for it.

ObjectBox

ObjectBox is a DB for any device, from restricted edge devices to servers, and offers an out-of-the-box Sync solution with offline sync capabilities, enabling reliable data access even without an internet connection. ObjectBox enables self-hosting on-premise / in the cloud, as well as Offline Sync.

Parse

Parse is a BaaS offering that Facebook acquired and shut down. Facebook open sourced the code. The GitHub repository is not officially maintained. You can host Parse yourself or use a Parse hosting service.

Data Sync, Edge Computing, and the Future of Data

There is a megashift happening in computing from centralized cloud computing to Edge Computing. Edge computing is a decentralized topology that entails storing and using data as close to the source of the data as possible, i.e. directly on edge devices. Accordingly, the market is growing rapidly, with projections estimating continued growth at a 34% CAGR for the next five years. The move from the cloud to the edge is strongly driven by new use cases and growing data masses.

Edge data persistence and Data Sync (managing decentralized data flows), especially “Offline Sync”, are the key technologies needed for Edge Computing. Using edge data persistence, data can be stored and processed on the edge. This means applications always work, independent of a network connection, even offline, and faster response times can be guaranteed. With Offline Sync, data can be synchronized between several edge devices in any location, independent of an Internet connection. Once a connection becomes available, selected data can be synchronized with a central server. By exchanging less data with the cloud or a central instance, data synchronization reduces the burden on the network. This brings down mobile network and cloud costs, and reduces the amount of energy used: a win-win-win solution. It also enables data privacy by design.