Nine factors that influence the choice of a database in a project

Andrei Rogalenko · Published in Level Up Coding · May 30, 2023 · 18 min read


When choosing a database for your current project (or replacing one that no longer meets your needs), the number of options is enormous. That is both good and bad: with so many candidates, you need filtering criteria to narrow the field.

There are many more databases today than ever before. In December 2012, when DB-Engines.com first started ranking databases, its list held 73 systems (already a significant increase from the very first list of 18). Ten years later, as of December 2022, there were almost four hundred systems on the list. The past decade has seen a veritable Cambrian explosion of database technology. You have to navigate a wide range of options: SQL, NoSQL, and many “multi-model” databases that combine SQL and NoSQL, or several NoSQL data models (two or more of documents, key-value, wide column, graph, and so on).

Also, don’t confuse popularity with suitability for your case. While the network effect has its benefits (“everyone uses X, so I can’t go wrong if I choose it”), it can also lead to groupthink and stifle innovation and competition.

Let’s look at nine factors that users should consider first when choosing and comparing databases.

The nine factors:

  1. Software Architecture — Does the database use the most efficient data structures, flexible data models, and rich query languages to support your workloads and query patterns?
  2. Hardware utilization — Can it exploit the full capabilities of modern hardware platforms, or will a significant share of CPU resources sit idle?
  3. Integration capabilities — how easy is it to integrate into your development environment? Does it support the programming languages, frameworks, and projects you need? Has it been designed to integrate with your microservices and event streaming architecture?
  4. RASP — does it have the necessary qualities: Reliability, Availability, Scalability, Serviceability and, of course, Performance?
  5. Deployment — Does the database only work in a limited environment, such as only a company’s facilities, or only one data center, or only one cloud service provider? Or can it be deployed in any place on Earth that you need?
  6. Data consistency and integrity: Maintaining data accuracy and ensuring consistent results, even in a distributed environment, is vital for business-critical applications.
  7. Security: Robust security mechanisms, including access controls, encryption, and auditing capabilities, are necessary to protect sensitive data and comply with relevant regulations.
  8. Developer ecosystem and tooling: A thriving developer community, comprehensive documentation, and a rich set of tools and libraries can greatly facilitate development, debugging, and maintenance processes.
  9. Cost: Consideration of licensing fees, operational costs, and potential future scalability expenses is crucial to ensure that the chosen database aligns with the project’s budget and financial constraints.

Software architecture

The most critical question here is: “Does the database use the most efficient data structures, flexible data models, and rich query languages to support your workloads and query patterns?”

  • Workloads — Do you need to handle write-heavy transactional workloads, mixed read/write workloads, or primarily analytical, read-heavy workloads? Will you need hybrid workloads that combine transactions and analytics? Will the loads be real-time, batch, or mixed? Will there be a steady stream of events per second, predictable and smooth rises and falls throughout the day, or do you need to plan for stochastic traffic spikes (for example, breaking news or a post suddenly going viral)?
  • Data model — Do you work with key-value pairs? Wide columns (key-key-value data)? Columnar data? Documents? Graphs? An RDBMS (with tables and JOINs)? Or something else entirely? Do you really have the time and the need to fully normalize your data, or will the database be ingesting unstructured data so fast that normalization is impractical and a denormalized model serves you better? There is no single “correct” answer to these questions; it always depends on your specific circumstances.
  • Query language — The query language is a more subjective matter: even if your data engineering team can mask or hide the backend query model, many users have their own inclinations and preferences. This is one of the main reasons why SQL remains the near-default choice. At the same time, there are newer query languages. Some are SQL-like, such as the Cassandra Query Language (CQL) used by Cassandra and ScyllaDB; SQL users will find it vaguely familiar, but don’t be fooled — there are no table JOINs! There are also newer, JSON-based query interfaces, such as the one used by Amazon DynamoDB; ScyllaDB supports this JSON query model as well through its DynamoDB-compatible Alternator interface. Whatever you are leaning towards, think about the query language before making the final choice of database (a small illustration follows this list).
  • Transactions / Operations / CAP — Which matters more to you: fully consistent ACID transactions, or high-performance, highly available, simple CRUD operations? The CAP theorem says you can pick only two of three properties: consistency, availability, and partition tolerance. Since distributed databases must always be partition-tolerant, you effectively have to choose between so-called CP-mode systems (consistency-oriented) and AP-mode systems (availability-oriented). Implementation details matter within these modes, too. For example, how strong consistency is achieved in a distributed system can vary greatly, and you also need to consider the choice of consensus algorithms used to ensure linearizability, such as Paxos, Raft, or ZooKeeper’s ZAB. Beyond the differences between algorithms, each implementation can differ significantly from another.
  • Data distribution — What exactly is meant by a “distributed system”? A local cluster in a single data center? Clustering across several data centers? How do updates propagate between data centers: are they treated as one logical cluster, or do they require cross-cluster synchronization? How is data locality handled and, for example, GDPR compliance ensured?
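
To make the query-language point concrete, here is a minimal, purely illustrative sketch of the same point lookup expressed through a CQL driver and through a DynamoDB-style JSON API. The host, keyspace, table, and column names are hypothetical, and the connection details assume a locally running cluster (or a local DynamoDB/Alternator endpoint).

```python
# Purely illustrative: keyspace, table, and column names are hypothetical.
# pip install cassandra-driver boto3
from cassandra.cluster import Cluster   # CQL: SQL-like syntax, but no JOINs
import boto3                             # DynamoDB-style JSON/key-value API

# --- CQL (Cassandra / ScyllaDB): a familiar-looking SELECT, keyed on the partition key ---
session = Cluster(["127.0.0.1"]).connect("shop")
row = session.execute(
    "SELECT user_id, name, email FROM users WHERE user_id = %s", ("42",)
).one()

# --- DynamoDB-style API (also served by ScyllaDB's Alternator): JSON in, JSON out ---
table = boto3.resource(
    "dynamodb",
    endpoint_url="http://127.0.0.1:8000",   # local DynamoDB or Alternator endpoint (assumed)
    region_name="us-east-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
).Table("users")
item = table.get_item(Key={"user_id": "42"}).get("Item")

# Both express the same point lookup. Neither model offers server-side JOINs,
# so any "joining" happens in the application or at data-modeling time.
```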

Hardware utilization

We are witnessing a hardware revolution that keeps expanding what software can do. Yet many software applications, and many databases in particular, still carry decades-old roots, architectures, and assumptions.

  • CPU Efficiency/Utilization — Most software starts to perform poorly once CPU usage rises above, say, 40% or 50%. In practice the software runs inefficiently, leaving roughly half of the machine’s resources permanently idle; in essence, you are paying for twice as much infrastructure as you actually use. So find out how your database distributes processing across all available cores.
  • RAM Efficiency/Utilization — Does your database run into memory limits all the time? Is caching too aggressive or too bloated (for example, are there multiple layers of caching), causing unnecessary data to be stored in memory? How does the database optimize read and write paths?
  • Storage Efficiency/Utilization — What storage format does your database use? Does it have compact mutable tables that might require heavyweight file locking mechanisms? Or does it use immutable tables that can provide fast writes, but at the cost of more space and more reads? Does it have tiered storage capability? How does it handle concurrency? Are the files stored in rows (good for transaction-heavy scenarios) or in columns (good for analytics and highly repetitive data)? Here, too, there is no single “correct” answer. Each solution optimizes the system for different application scenarios.
  • Network Efficiency/Utilization — Here it is worth thinking about the efficiency of both client-to-cluster communications and intra-cluster communications. Client-server interaction can be made more efficient with concurrency, connection pooling, and so on (a minimal pooling sketch follows this list). Intra-cluster communications include both regular operational/transactional exchange (replicating data on an update or write) and administrative tasks such as streaming and rebalancing data between nodes when the topology changes.
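
As a minimal illustration of the client-side pooling idea, here is a sketch of a tiny connection pool built only from the Python standard library. Real drivers usually ship their own, more capable pooling, so treat this purely as a demonstration of the concept.

```python
# A minimal client-side connection pool sketch (standard library only).
# It illustrates reusing connections instead of paying setup cost on every request.
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())   # open connections once, up front

    @contextmanager
    def connection(self):
        conn = self._pool.get()         # borrow a connection (blocks if the pool is exhausted)
        try:
            yield conn
        finally:
            self._pool.put(conn)        # return it for reuse

# Usage: four pooled connections (each to its own in-memory SQLite database, for demonstration).
pool = ConnectionPool(lambda: sqlite3.connect(":memory:", check_same_thread=False))
with pool.connection() as conn:
    conn.execute("SELECT 1")
```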

Integration capabilities

No database is isolated. How easy is it to integrate it into your development environment? Does it support your programming languages, frameworks and projects? Has it been designed to integrate with your microservices and event streaming architecture?

  • Programming Languages/Frameworks — Companies are often heavily invested in a preferred programming language or framework. If the database lacks a client, SDK, library, ORM, and/or other packages for that language, it might as well not exist for them. Note that the explosive growth of databases has run in parallel with the explosive growth of programming languages, so client language support is worth examining closely. Keep in mind that the client language need not be the same language the database itself is written in (which affects its software architecture and efficiency); the question here is solely which languages backend applications can use to connect to the database.
  • Event Streaming/Message Queuing — The database may be your single source of truth, but it is not the only system running in your company. In fact, different databases may be handling transactions and analytics over different parts of the company’s data. Event streaming is an increasingly popular way to avoid creating data silos; these days, a database’s usefulness is partly defined by how well it integrates with real-time event streaming and message queuing technologies. Can your database act as both a sink and a source of data? Does it have Change Data Capture (CDC)? Can it connect to event streaming and message queuing technologies like Apache Kafka, Apache Pulsar, or RabbitMQ (see the sketch after this list)?
  • API — Does the database support one or more APIs, such as a RESTful or GraphQL interface, to make it easier to integrate the database into application and microservice architectures? Does it have an administrative API so that it can be managed programmatically and not just through the GUI? Using the GUI may seem convenient only at first, until you have to manage and automate deployment systems.
  • Other integrations — What about CI/CD toolchains? Observability platforms? What about using the database as a pluggable storage engine or as a low-level building block of a larger architecture? How well does it serve as infrastructure in its own right, and how well does it fit into the infrastructure you already use?
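
As an illustrative sketch of consuming a database’s CDC stream, here is a minimal Kafka consumer using the kafka-python package. The topic name, broker address, and event shape are assumptions that depend entirely on your CDC connector.

```python
# A minimal sketch of consuming a database's CDC stream from Kafka.
# The topic name and event shape are hypothetical and depend on your CDC connector.
# pip install kafka-python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.cdc",                              # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    change = message.value                     # e.g. {"op": "UPDATE", "before": {...}, "after": {...}}
    print(change["op"], change.get("after"))   # feed downstream systems: caches, search indexes, analytics
```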

RASP

This acronym was coined decades ago and is commonly used in the context of hardware. It stands for Reliability, Availability, Serviceability (or Scalability), and Performance. Together, these qualities determine how easy the system is to operate: when choosing a database, they tell you how much manual labor will be needed to keep it running and stable, and how well it can take care of itself under normal conditions and even handle failures.

  • Reliability — An indicator of how much effort must be invested so that the system does not fall apart or lose data. How resilient is it? What anti-entropy mechanisms does it have to bring a cluster back into sync? How good are its backup systems? And, more importantly, how good are its recovery systems? Are there safeguards to prevent users from accidentally destroying the system?
  • Availability — What does the database do during short-lived network partitions and transient node unavailability? What happens when a node fails completely? What happens if a network failure lasts for several hours?
  • Serviceability — The concept of “observability” is popular today and covers three categories: logging, tracing, and metrics. Of course, observability must be built into the database. But serviceability is much more than that: how easy is it to upgrade without downtime? How convenient are routine maintenance operations?
  • Scalability — Some databases are easy to get started with but eventually hit a ceiling. Scalability means not having to worry about hitting limits on total data volume, operations per second, or geography, such as when moving from a single data center to a worldwide deployment. Also keep in mind the distinction between horizontal scalability (adding more nodes to the cluster) and vertical scalability (running the database on servers with ever more CPUs, RAM, and storage).
  • Performance — If the database cannot meet its SLAs for latency or throughput, it simply will not survive in production. Also, regarding scalability: many databases appear to meet your performance requirements at small scale or on a static benchmark with test data, but when faced with real production workloads they cannot keep up with the increased frequency, variability, and complexity of requests. So performance needs to hold up at scale, ideally close to linearly (a small latency-measurement sketch follows this list).
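
As a minimal illustration of checking latency against an SLA, here is a small sketch that measures client-side latencies and reports percentiles; run_query is a placeholder for whatever call actually exercises your database under a realistic workload.

```python
# A small sketch for sanity-checking latencies against an SLA during a load test.
# `run_query` is a placeholder for whatever call exercises your database.
import time
import statistics

def run_query():
    time.sleep(0.002)   # stand-in for a real database call

latencies_ms = []
for _ in range(1000):
    start = time.perf_counter()
    run_query()
    latencies_ms.append((time.perf_counter() - start) * 1000)

cuts = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")  # compare against your SLA targets
```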

Deployment

All of the above should be done where you need to. Does the database only work in a limited environment, such as only a company’s facilities, or only one data center, or only one cloud service provider? Or can it be deployed in any place on Earth that you need? Ask the following questions:

  • Limitations — Can it run on-premises? Is it restricted to on-premises deployment only? Is it tied to a single cloud provider, or can it run with the provider of your choice? What are the hybrid-cloud or multi-cloud options?
  • Command/control — Can it only be run as a self-managed database, or is it also offered as a Database-as-a-Service (DBaaS)? The first option gives you full control over the system, while the second takes the administration burden off your teams; both have their pros and cons. Are you locked into one option, or does the database let users switch between these two models?
  • Automation and instrumentation — does it have a Kubernetes Operator to support it in production? Terraform and Ansible scripts? Although this is the last item on the list, it should not be left for later when choosing a system for production.

Data consistency and integrity

Data consistency and integrity are crucial aspects of any database system, especially in a high-load corporate project. Let’s delve into the topic in more detail.

Data consistency refers to the accuracy and validity of data stored in the database. It ensures that data remains reliable and coherent throughout the system, even in the presence of concurrent transactions or system failures. Inconsistent data can lead to incorrect analysis, inaccurate reporting, and flawed decision-making processes.

To maintain data consistency, databases typically employ various techniques such as ACID (Atomicity, Consistency, Isolation, Durability) properties. ACID ensures that database transactions are executed in an “all-or-nothing” manner, meaning that either all changes within a transaction are applied, or none of them are.

Atomicity guarantees that a transaction is treated as a single, indivisible unit of work. Consistency ensures that the database moves from one valid state to another after a transaction is completed. Isolation ensures that concurrent transactions do not interfere with each other, preventing issues like dirty reads, non-repeatable reads, and phantom reads. Durability guarantees that once a transaction is committed, its changes are permanently stored and will survive any subsequent failures.
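
To make the all-or-nothing behavior tangible, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for any ACID-capable database; the accounts table and amounts are purely illustrative.

```python
# A minimal all-or-nothing transfer; table and amounts are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # the `with` block commits on success and rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        # Enforce a business rule; raising here aborts the whole transaction.
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass

# Neither the debit nor the credit was applied: the transaction rolled back as a unit.
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())  # [(1, 100), (2, 0)]
```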

Data integrity, on the other hand, focuses on the accuracy and validity of data at the individual record or entity level. It ensures that data conforms to defined rules, constraints, and relationships specified in the database schema. Data integrity measures include primary key constraints, unique constraints, foreign key constraints, check constraints, and triggers, among others.

Primary key constraints ensure that each record in a table is uniquely identifiable. Unique constraints ensure that values in a column or combination of columns are unique. Foreign key constraints establish relationships between tables, ensuring data integrity across related entities. Check constraints define rules for acceptable values in columns. Triggers allow the enforcement of additional business rules or validations during data modification operations.
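
As a quick illustration of these constraint types, here is a small schema sketch, again using Python’s standard-library sqlite3 module; the table and column names are hypothetical.

```python
# Illustrative schema showing the constraint types described above; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,                 -- primary key: unique row identity
    email       TEXT NOT NULL UNIQUE                 -- unique constraint
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),   -- foreign key: cross-table integrity
    amount      NUMERIC NOT NULL CHECK (amount > 0)  -- check constraint: business rule on values
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    # Violates the foreign key: customer 999 does not exist.
    conn.execute("INSERT INTO orders VALUES (1, 999, 10)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```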

Ensuring data consistency and integrity requires a combination of proper database design, transaction management, and the enforcement of constraints and rules. Additionally, careful consideration should be given to handling data conflicts, implementing appropriate locking and concurrency control mechanisms, and addressing issues related to data replication and synchronization in distributed or replicated environments.

Regular data validation, integrity checks, and periodic backups are also essential to identify and mitigate any potential issues. Data consistency and integrity are critical not only for accurate reporting and decision-making but also for maintaining trust in the system and meeting compliance requirements.

In summary, data consistency and integrity are fundamental aspects of a database system. They ensure that data remains accurate, valid, and reliable throughout the system, enabling reliable operations, analysis, and decision-making processes in a high-load corporate project.

Security

Security is a critical factor to consider when selecting a database for a high-load corporate project. Let’s dive into database security in more detail.

Database security encompasses a range of measures designed to protect the confidentiality, integrity, and availability of data stored in a database system. It involves safeguarding data from unauthorized access, protecting against data breaches, ensuring data privacy, and complying with relevant regulations and industry standards.

Here are some key aspects of database security:

  • Access Controls: Database systems provide access control mechanisms to manage user privileges and permissions. Role-based access control (RBAC) is commonly used to assign and enforce fine-grained access privileges to users or groups. It ensures that only authorized individuals can access and manipulate specific data based on their roles and responsibilities.
  • Authentication and Authorization: Strong authentication mechanisms, such as username/password combinations, multi-factor authentication (MFA), or biometric authentication, help verify the identity of users before granting access to the database. Authorization mechanisms control what actions users are allowed to perform once authenticated, ensuring that they can only access the data they are authorized to view or modify.
  • Encryption: Encryption protects data at rest (stored data) and data in transit (data being transmitted between systems). Encryption algorithms transform data into an unreadable format, and only authorized users with the decryption keys can access the original data. Strong encryption protocols, such as Transport Layer Security (TLS) for network encryption and Advanced Encryption Standard (AES) for data encryption, are commonly used to protect sensitive information (a minimal sketch of a TLS-protected connection follows this list).
  • Auditing and Logging: Database systems often provide auditing and logging capabilities to track and record activities performed on the database. Audit logs capture details such as user logins, data modifications, and access attempts. Regular monitoring and analysis of audit logs can help detect suspicious activities, identify security breaches, and ensure compliance with regulatory requirements.
  • Data Masking and Anonymization: In certain cases, sensitive data needs to be masked or anonymized to protect privacy. Data masking techniques replace sensitive information with realistic but fictitious data, ensuring that sensitive data is not exposed during development, testing, or when accessed by unauthorized individuals.
  • Database Activity Monitoring (DAM): DAM systems continuously monitor and analyze database activities, providing real-time alerts for suspicious behavior or policy violations. DAM helps detect and prevent unauthorized access attempts, SQL injection attacks, and other security threats by monitoring user activities, SQL statements, and data access patterns.
  • Patch Management: Regularly applying patches and updates to the database software is essential to address security vulnerabilities and protect against known exploits. Timely patching helps mitigate the risk of unauthorized access or data breaches resulting from software vulnerabilities.
  • Compliance with Regulations: Depending on the industry and geographical location, databases must adhere to specific regulations and compliance requirements. Examples include the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI DSS), and many others. Compliance involves implementing security controls, privacy measures, data retention policies, and audit capabilities to meet legal and regulatory obligations.
  • Database Backup and Recovery: Regularly backing up databases and implementing disaster recovery mechanisms are vital for security. In the event of a system failure, data corruption, or a security incident, reliable backups ensure that data can be restored to a known good state, minimizing data loss and downtime.
  • Security Testing and Vulnerability Assessments: Regular security testing, including vulnerability assessments and penetration testing, helps identify and remediate security weaknesses in the database system. Conducting such tests periodically helps ensure that security measures are effective and up to date.
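
As one concrete illustration of encryption in transit, here is a minimal sketch of connecting to a PostgreSQL server over TLS with certificate verification using psycopg2. The hostname, credentials, and certificate path are placeholders, and the same idea applies to most drivers that expose TLS options.

```python
# A minimal sketch of enforcing encryption in transit with psycopg2 (PostgreSQL);
# hostnames, credentials, and certificate paths below are placeholders.
# pip install psycopg2-binary
import psycopg2

conn = psycopg2.connect(
    host="db.example.internal",
    dbname="appdb",
    user="app_user",
    password="change-me",          # in practice, load secrets from a vault or environment, not code
    sslmode="verify-full",         # require TLS *and* verify the server certificate and hostname
    sslrootcert="/etc/ssl/certs/internal-ca.pem",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT current_user")   # runs over the encrypted connection
    print(cur.fetchone())
```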

Database security requires a layered approach, combining technical controls, robust policies and procedures, ongoing monitoring, and employee awareness and training. It is crucial to assess the security features and capabilities of the database management system (DBMS) being considered for the high-load corporate project to ensure it meets the organization’s specific security requirements.

In summary, database security is essential for protecting the confidentiality, integrity, and availability of data in a high-load corporate project. By implementing comprehensive security measures, organizations can mitigate risks, prevent data breaches, and maintain compliance with relevant regulations.

Developer ecosystem and tooling

The “Developer ecosystem and tooling” factor plays a crucial role in the selection of a database for a high-load corporate project. Let’s explore it in detail.

The developer ecosystem refers to the community, resources, and support available to developers working with a particular database. It encompasses a range of aspects, including:

  • Documentation: Comprehensive and well-organized documentation is essential for developers to understand the features, functionalities, and APIs of the database. It should include tutorials, guides, reference manuals, and examples that help developers effectively utilize the database’s capabilities.
  • Libraries and Frameworks: A robust developer ecosystem offers a rich set of libraries, frameworks, and SDKs (Software Development Kits) that simplify database integration and enhance developer productivity. These tools often provide pre-built functions, abstraction layers, and best practices, reducing the amount of code developers need to write and accelerating application development.
  • Community Support: A thriving developer community is invaluable for sharing knowledge, best practices, and solutions to challenges encountered while working with the database. Online forums, discussion boards, and community-driven platforms allow developers to seek assistance, contribute their expertise, and collaborate with others, fostering a sense of community and support.
  • Training and Education: Educational resources, such as online courses, tutorials, webinars, and workshops, contribute to the developer ecosystem. They enable developers to acquire the necessary skills and knowledge to effectively utilize the database’s features and optimize performance.
  • Developer Events and Conferences: Developer-focused events and conferences provide opportunities for networking, learning, and staying up to date with the latest advancements in database technologies. They often feature keynote speeches, technical sessions, hands-on workshops, and expert-led discussions, enhancing developers’ understanding and expertise.

Now, let’s turn our attention to the tooling aspect of this factor. Tooling refers to the set of software tools, utilities, and IDE (Integrated Development Environment) support available for developers working with the database. Here are some key aspects of tooling in the context of a database:

  1. Database Management Tools: These tools provide user-friendly interfaces to manage databases, configure settings, execute queries, monitor performance, and analyze data. They offer features such as query optimization, data visualization, schema management, and backup and recovery capabilities.
  2. Query Editors and IDEs: Specialized query editors and IDEs provide a dedicated environment for developers to write, test, and debug database queries and code. They often include features like syntax highlighting, code completion, query execution plans, and debugging tools, enhancing developer productivity and efficiency.
  3. Data Modeling and Design Tools: Data modeling tools allow developers to visually design and create database schemas, define relationships between entities, and generate SQL scripts. These tools facilitate the process of database design, ensuring proper organization and structure of data.
  4. Version Control and Collaboration Tools: Version control systems, such as Git, and collaboration platforms enable developers to manage and track changes to database scripts, schemas, and configurations. They support team collaboration, code reviews, and merging of changes, ensuring proper versioning and minimizing conflicts.
  5. Testing and Profiling Tools: Testing tools help developers create and execute automated tests to verify the correctness and performance of database operations, while profiling tools assist in identifying performance bottlenecks, optimizing queries, and monitoring resource usage (a small automated-test sketch follows this list).
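
As a small illustration of automated database testing, here is a sketch using pytest against an in-memory SQLite database; the schema is hypothetical, but the pattern carries over to any database you can spin up for tests.

```python
# A small automated-test sketch (pytest + sqlite3) verifying that a constraint holds.
import sqlite3
import pytest

@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
    yield conn
    conn.close()

def test_duplicate_emails_are_rejected(db):
    db.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    with pytest.raises(sqlite3.IntegrityError):      # the second insert must violate UNIQUE(email)
        db.execute("INSERT INTO users (email) VALUES ('a@example.com')")
```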

Having a strong developer ecosystem and robust tooling in place offers several benefits to developers and the project as a whole:

  • Developers can leverage existing resources, libraries, and frameworks, reducing development time and effort.
  • Access to comprehensive documentation and community support accelerates learning and problem-solving.
  • Well-designed tooling enhances productivity, streamlines development workflows, and improves code quality.
  • Regular updates and new releases from the database vendor ensure developers can leverage the latest features and enhancements.

Cost

The “Cost” factor is a significant consideration when selecting a database for a high-load corporate project. Let’s explore the topic of cost in detail.

When assessing the cost of a database, it’s essential to consider the following aspects:

  • Licensing Fees: Some databases require the payment of licensing fees based on factors such as the number of users, CPUs, or data storage capacity. The cost structure can vary, including one-time upfront fees, annual subscriptions, or per-user or per-instance fees. Understanding the licensing model and associated costs is crucial for budget planning.
  • Operating Costs: Apart from licensing fees, there are ongoing operating costs to consider. These include expenses related to hardware infrastructure, maintenance, database administration, and technical support. The scale of the project and the database’s resource requirements can significantly impact these operating costs.
  • Scalability Expenses: As the project grows and the database needs to handle increased load, there may be additional costs associated with scaling the database. Some databases offer built-in scaling options, while others may require additional licenses or hardware investments to accommodate growth. Understanding the cost implications of scaling is important to ensure future scalability within the project’s budget.
  • Vendor Support: The level of support provided by the database vendor can impact the cost. Higher-tier support packages typically come at an additional cost but may offer benefits such as faster response times, dedicated support personnel, and extended coverage hours. Evaluating the project’s support requirements and aligning them with the available support options is important to manage costs effectively.
  • Training and Education: Investing in training and education for developers and administrators is essential for efficient database usage and maintenance. Training programs, certifications, and workshops may incur costs, but they can enhance the skills and productivity of the project team, leading to long-term cost savings.
  • Total Cost of Ownership (TCO): It’s crucial to assess the overall TCO of the database over its lifecycle. This includes upfront costs, ongoing operating costs, scalability expenses, support costs, and any potential costs associated with migrating to or integrating with other systems. Comparing the TCO of different database options allows for a more comprehensive evaluation of cost-effectiveness (a back-of-the-envelope TCO sketch follows this list).
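
To show how such a comparison might be structured, here is a back-of-the-envelope TCO sketch. Every figure in it is a made-up placeholder; the only point is to compare options over the same time horizon.

```python
# A back-of-the-envelope TCO sketch; all numbers below are made-up placeholders.
YEARS = 3

def total_cost_of_ownership(license_per_year, infra_per_month, admin_hours_per_month,
                            hourly_rate, one_time_migration):
    licensing = license_per_year * YEARS
    infrastructure = infra_per_month * 12 * YEARS
    operations = admin_hours_per_month * hourly_rate * 12 * YEARS
    return licensing + infrastructure + operations + one_time_migration

options = {
    "self-managed OSS":     total_cost_of_ownership(0,      4_000, 60, 90, 25_000),
    "managed DBaaS":        total_cost_of_ownership(0,      9_000, 15, 90, 10_000),
    "commercial + support": total_cost_of_ownership(50_000, 4_000, 40, 90, 25_000),
}

for name, tco in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} ~${tco:,.0f} over {YEARS} years")
```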

While cost is an important factor, it’s equally important to consider the value and benefits provided by the database. A lower-cost database may not always be the best choice if it lacks the necessary features, scalability, or performance required by the project. Striking the right balance between cost and value is crucial to ensure the project’s success.

It’s worth noting that cost considerations should be aligned with the project’s budget and financial constraints. A cost-benefit analysis, taking into account the specific requirements, expected growth, and long-term goals of the project, can help make an informed decision while considering the cost factor.

In summary, when evaluating the cost of a database for a high-load corporate project, it’s important to consider licensing fees, ongoing operating costs, scalability expenses, vendor support options, training and education, and the total cost of ownership. Balancing cost with the required features, performance, scalability, and long-term goals of the project is essential to make a cost-effective decision.
