Database Sharding – Scaling data in a multi tenant environment

A multi tenant SaaS product should be efficient enough to scale seamlessly without compromising on Reliability, Availability and Performance. Large scale applications which are built with the intention of handling thousands of users accessing in a concurrent fashion is to be well equipped and architected to handle a medium sized customer with few hundreds of users to a large customer with fairly huge number of active users. When the number of tenants and users grow, the number of IOs which hits the database will also increase which is susceptible for performance degradation.

While introducing a new SaaS product, the application has tenant data stored in separate databases plus an extra database for the list of tenants as well as data shared by every application [Generally called as Meta Database]. For Example Financial application such as Accounts Management, Expense Management, Payroll applications, the shared information would include items like currency exchange and different VAT rates for different states and other information such as colour themes, language preference that are identical for each tenant. Everything fits into a single DB server and life will be good. But over some time when the business picks up and more tenants buy the product, the application provider we’ll be pushed to choose different database or different schema else splitting the single server into multiple server for the shared data plus application database servers.

At time, the customers might want to run some complex analytical reports, which have a tendency to consume lot of resources of the tenant database servers. Again the application provider has to set up separate scaled up servers with optimized hardware to load data dynamically from tenant databases. And this is just the beginning of additional servers as the SaaS adds more customers and add new services. Sometimes, it is natural for successful SaaS application owners to run 100 or more DBMS servers.

One effective way of scaling a database is through partitioning, dividing the data into smaller divisions in order to improve the efficiency of queries and updates. On the long run the application providers has to come up with data partitioning strategies to determine the best way to partition the customers data. The partition decision can be based on different aspects such as

Geographic based data separation
Usage Patterns
Customer Growth
Data security and Policy
Data isolation
Cost efficiency

Data Sharding

An important quote about database is “Small Databases are always faster, large database is not, so keep your database as small as possible”. Vertical scaling of the database server can be useful only up to a certain point. Beyond this, the database itself has to be partitioned. Following are the three key techniques which could be considered to ensure scalability in a Shared DB-Shared Schema [SDB-SS] Model.

• Multiple Database Instances
• Partitioned Table and Indexes
• File Group Allocation

In most situations, it is likely that database size will keep growing. Therefore, it is also important to have dynamic repartitioning strategies in place, to ensure that already-partitioned data can be repartitioned in order to keep up with performance and scale metrics.

Table Partitioning and File Group Allocation

When tables and indexes grow large, partitioning the data will help us split the data in to smaller and manageable sections. When the data in the table is partitioned, the SQL server can often eliminate searching the data in the irrelevant partitions and read the data directly from those partitions which contains the requested data – this is called as partition elimination.

Database Partitioning based on Tenant ID

No only select queries, DML statements such as Update, Delete etc. also get considerable performance improvement while modifying the data from partitioned tables.

Note: Partition elimination will work only if the partitioned column is present in the where clause of the query.

Although the partitioned data can reside in a single File Group, it makes sense to save it in a multiple file groups to achieve better performance, intelligent partition elimination and boosted IO Performance. Data can be backed up using file group backups separately. Another important benefit of storing the file groups into multiple file groups is, if the data from any one of the partition is old and if it is considered only for Auditing, then those partitions can be marked as Read-only, if the data has to be pulled from multiple partitioned tables, SQL server might use multiple process to read the data from multiple partitions.

Important: SQL Server 2012 supports up to 15,000 partitions by default. In earlier versions, the number of partitions was limited to 1,000 by default. On x86-based systems, creating a table or index with more than 1000 partitions is possible, but is not supported.

Refer: http://msdn.microsoft.com/en-us/library/ms190787.aspx

Approach1
All the tenants’ metadata are stored in a DB1 and application data in DB2. This option will be ideal for many applications with fewer loads of users and IO Operations. The maintenance is simpler and cost efficient. For start-ups, it is advisable to kick start with this approach and as the business grows, one can scale-out using the other approaches.

Approach2
When a tenant is insisting on complete isolation of their data in a separate database, a separate instance for that tenant could be created. The metadata of multiple tenants can reside in a common database, but the transaction data will be in a separate instance of database.

Approach3
Separate schema for each tenant can be considered for some applications. It gives a greater level of isolation and independence between multiple tenants, but it is easy to maintain as it shares the same database server instance.

Approach4
Usage pattern and volume might vary across tenants. In such a scenario, sharing schema / database between two or three tenants and having a separate schema / database for large tenants could be considered.

Approach 5
A separate database is used for each module. All the tenants share the same database with respect to that module. This is also called Vertical Partitioning.

Various Types of Data Sharding

Approach 6
This approach is a combination of Vertical and Horizontal Partitioning. Data from different application Modules are grouped and stored in different databases. The data is also partitioned based on Tenant ID.

Data Connection Abstraction and Connection String management

In a multi tenant environment where almost all kind of data scaling possibilities are given, there has to be a strong, comprehensive and centralized approach to manage the multiple connections across the application for each tenant. The three important aspects of connection management is

1. Security
2. Administration
3. Control

Storing connection strings in a simple plain text format may not be advisable in all situations. So provision should be made to encrypt them when necessary. Administration of Connection strings might not be an easy task when there are 100s of tenants connecting to multiple databases for each of their modules.
Proper Control over user credentials should be in place. For example developers should not be given access to the databases in Production environment etc.

Data Connection Abstraction in Techcello

In a Multi tenant model, the application should be capable of connecting to various different databases, schemas to save and fetch the application data. To serve this kind of dynamic data management, the underlying framework has to have a connection abstraction mechanism to handle connections without really hard coding them into the application for different tenants.

Connection strings management is an often overlooked feature within applications. Generally, a developer must hard-code the connection string within the application or must write the connection string in a clear-text file on each client’s machine in a single tenant model. TechCello solves these problems by providing a powerful, easy to use connection string management APIs that allows the developer to design the application without the concern of end-user connection string management.

Techcello’s connection string management APIs stores the connection string in an encrypted format in the database. Which database to connect is decided on the fly from the context of the logged in user and connected to the appropriate database to fetch the data. If a connection string is not found for a particular tenant, then the global connection string will be applied and data will be fetched from the shared database. This also allows a tenant to have many number of connection strings stored per tenant

CelloSaaS-Connection String Management Entity

Summary

As everyone says, “One size doesn’t fit for all” which is absolutely true for a multi tenant SaaS Application. Each tenant users are varies in terms of usage and their needs are completely different. So techcello framework has been designed to support different types of data partitioning according to the specific requirements of each application and its tenants. Developers get the freedom, flexibility and control, while at the same time, save a lot of time and effort during development. They can focus on the business functionalities and leave the scalability, security and data isolation issues to a proven and tested platform.

Tenants and their users do not really care how the application has been built and how the internal architecture works. They care ONLY about how the application works: its robustness, performance, feature set, and cost. It is the responsibility of the Product seller to provide a reliable and trust worthy service and allow them to concentrate only on their business.

admin