The success of an application and the business that uses it depends mostly on the selection of the right database. Instead of depending on some recommendations of vendors or using a database just because you already have it, the most practical approach is to consider the requirements and fundamental purpose of the data store.
You can start by asking a few questions like what could be the peak data volume, the number of users to access the database, and what would be the requirements of availability, latency, scalability, data consistency and throughput. Also, you must consider the geographic distribution of users, the frequency of changing data schemas, whether the application requires online transaction processing (OLTP) or analytic query (OLAP), or a combination of both. Besides, the natural shape of your data, the expected ratio of reads to writes in production, the need for full-text queries or geographic queries, and the preferred programming languages are also critical. Lastly, you must consider any legal restrictions for data storage.
We will now take a closer look at these factors that aid database selectionOpens a new window .
The volume of data to store
To determine the size of the database, you must estimate how much data you will store. If the data volume fits within gigabytes or less, then any database like in-memory databases should suffice. However, if the data volume is in the terabytes range, which is equivalent to thousands of gigabytes, there are many more options available. As the data volume grows and touches petabytes, which are millions of gigabytes, then the options of databases shrink, and the cost goes up many times because it involves huge capital expenditure to create an infrastructure for on-premise storage or meet the expenses of operation for cloud storage.
The larger the database is more complicated is its maintenance and management, which requires the services of a database administrator attached to a company like RemoteDBAOpens a new window that involves additional expenditure.
Number of simultaneous users
As a part of server sizing exercise before installing your production database, you must consider the load for multiple simultaneous users. Remember that most databases cannot handle thousands of users querying a huge volume of data in terabytes or petabytes because of limitations in scalability. It is easier to estimate the number of simultaneous users of a database used by the employees of an organization than that of a public database. For public databases, it might require scaling up to several servers for seasonal or unexpected loads. Many databases do not support horizontal scaling that involves manual sharding of large tables, which is a time-consuming process.
Availability, latency, scalability, data consistency
Availability is a top requirement for transactional databases as many applications must be available 99.999% times round the clock. Some cloud bases that run in multiple availability zones offer five-nines availability. You can configure on-premise databases for high availability beyond scheduled maintenance periods if you can set up a pair of active-active servers.
Latency refers to the end-to-end response time of the application and the response time of the database. Ideally, the response time for user action is usually a sub-second, which means that the database response time must be under 100 milliseconds for each simple transaction. But analytic queries can take seconds or minutes.
Transactions per second denote the throughput of an OLTP database, and databases with high performance can support many users simultaneously.
SQL databases have strong data consistency because all reads return the latest data. The data consistency for SQL databases can range between eventual and strong.
Stability of database schemas
SQL databases are good if the database schemas remain unchanged for a considerable time, and you want most fields to be consistent from record to record. Otherwise, schema-free NoSQL databases could be a better option for your application. Indeed, there are exceptions as some databases like Rockset allows for SQL queries without imposing consistent types on the data it imports or without imposing a fixed schema.
Geographic distribution of users
If database users are spread across the world, database latency is set at a lower limit for users due to the speed of light unless additional servers are provided in their regions. Some databases use distributed servers for read-write while some offer distributed read-only servers. That compels the routing of writes through a master server. The trade-off between latency and consistency becomes harder due to geographic distribution.
Storing strongly typed data in rectangular tables with rows and columns is typical of SQL databases as they depend on the relations defined between use indexes and tables to speed up selected queries.To query multiple tables simultaneously, it uses JOINS. Storing weakly typed JSON that may include nested documents and arrays is typical of document databases. Graph databases either store edges and vertices or triples or quads. Key-value stores and columnar stores are the other categories of the NoSQL database.
Sometimes the data acquires a shape during the generation that also works for analysis, but when it does not happen, a transformation is necessary. Sometimes one kind of database juxtaposes another like the key-value store can underlie any database.
OLTP or OLAP or HTAP
Consider the intended use of the database that depends on the purpose for which the application needs the database, which can be either for transactions, analysis, or both. For speedy transactions, it is necessary to have minimal indexes and fast write speed. For analysis, it requires lots of indexes and fast read speed.
Read/ write ratio
Some databases are faster at writes while others are faster at reads and queries. The number of expected reads and writes from the application is a factor to consider for database selection as well for guiding you in your efforts in benchmarking. The optimal choice of index types varies between write-heavy applications and read-heavy applications.
Preferred programming language
Your budget matters in database selection because some databases are free, and there is even gradationed among the paid versions. Moreover, how capable the database is in complying with the legal requirements is also a point to consider.