In the first post of this three-part series, we discussed the strengths and weaknesses of various types of NoSQL databases. In the second post, we compared the features of the two most popular NoSQL databases: MongoDB and Cassandra. In this final article, we will discuss best practices for hosting NoSQL databases on Amazon Elastic Compute Cloud (Amazon EC2).
Amazon EC2 offers multiple compute and storage options that cater to the varied requirements of NoSQL workloads. Using Amazon EC2 with other services in the Amazon Web Services (AWS) ecosystem, such as Amazon CloudWatch and 1-Click Launch from AWS Marketplace, provides additional advantages.
Recommended Best Practices
Some of the best practices recommended for hosting NoSQL databases on Amazon EC2 are:
Multiple Deployment Options
With the help of AWS Regions and Availability Zones, Amazon EC2 offers multiple deployment options for running highly available workloads. However, enabling high availability requires planning and configuration at the network and security levels. Replicating data across zones or regions also adds latency to write operations (or relies on eventual consistency) and increases cost.
Single Region and Multiple Availability Zones
Setting up a MongoDB cluster in a new Amazon Virtual Private Cloud (Amazon VPC) on AWS requires the following deployment and configuration steps:
- The Amazon VPC must be configured across three Availability Zones with the required public and private subnets.
- NAT gateways in the public subnets should provide outbound communication for instances in the private subnets.
- Bastion hosts must be configured to allow secure communication with Amazon EC2 instances in both the public and private subnets.
- AWS Identity and Access Management (IAM) roles must be created to provide the access required for deployment.
- MongoDB clusters should be set up in the private subnets, with the replica set members placed in different Availability Zones.
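The subnet layout described in the list above can be planned before anything is created. The following is a minimal sketch using only Python's standard `ipaddress` module. The VPC CIDR block, the zone names, and the `/24` subnet size are assumptions chosen for illustration; they do not come from an AWS template.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix=24):
    """Assign one public and one private subnet to each Availability Zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields equal-sized, non-overlapping blocks inside the VPC range.
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for zone in zones:
        plan[zone] = {
            "public": str(next(blocks)),   # would hold the NAT gateway and bastion host
            "private": str(next(blocks)),  # would hold the MongoDB replica set member
        }
    return plan


# Hypothetical VPC range and zone names, for illustration only.
layout = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"])
for zone, subnets in layout.items():
    print(zone, subnets)
```

The resulting CIDR blocks could then be passed to whichever provisioning tool you use to create the actual subnets.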
In this deployment, one Amazon EC2 node acts as the primary node while the others act as secondary nodes.
Multiple Regions
In the multi-region deployment model, data is replicated across multiple regions so that if one region is down, another region can take over to serve user requests. For better control, the node in the second region can be assigned a lower priority. This ensures that it only takes over when the first region is down, and not when there is merely an issue with the primary Availability Zone in the first region. To allow communication and data transfer between the two regions, VPC peering must be enabled.
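The priority scheme can be expressed in a MongoDB replica set configuration document, where each member carries a `priority` value and higher values are preferred in elections. The sketch below only builds that document as a Python dictionary. The host names, the replica set name, and the specific priority values are hypothetical, and the helper functions are illustrative rather than part of any MongoDB driver.

```python
def build_replica_set_config(name, members):
    """Build a replica set configuration document.

    members is a list of (host, priority) tuples; members with a higher
    priority are preferred when a new primary is elected.
    """
    return {
        "_id": name,
        "members": [
            {"_id": index, "host": host, "priority": priority}
            for index, (host, priority) in enumerate(members)
        ],
    }


def preferred_hosts(config):
    """Return hosts ordered from most to least preferred as primary."""
    ordered = sorted(config["members"], key=lambda m: -m["priority"])
    return [member["host"] for member in ordered]


# Hypothetical hosts: three members in the first region, two in the second.
config = build_replica_set_config(
    "rs0",
    [
        ("mongo-a.region1.example.internal:27017", 2),
        ("mongo-b.region1.example.internal:27017", 1),
        ("mongo-c.region1.example.internal:27017", 1),
        ("mongo-a.region2.example.internal:27017", 0.5),
        ("mongo-b.region2.example.internal:27017", 0.5),
    ],
)
print(preferred_hosts(config))
```

A document of this shape is what the `rs.initiate()` shell helper accepts. Note that priority only influences which eligible member is preferred; an election still requires a majority of voting members to be reachable.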
The multi-region deployment model is more expensive than other models, as you have to maintain infrastructure in two regions and bear the cost of data transfer between them. Cross-region replication can also introduce high latency, depending on which region you use for replication.
Amazon EC2 Instance Types
Amazon EC2 offers a variety of instance types that are suitable for deploying NoSQL databases. These types include:
I3 instances use SSD storage and are well suited to I/O-intensive workloads such as NoSQL databases and data warehousing applications. The largest I3 instance (i3.16xlarge) offers 64 vCPUs and 488 GB of memory, along with 3.3 million read IOPS and 1.4 million write IOPS. I3 instances support enhanced networking, making them ideal for applications that require low jitter and a high packet transfer rate.
D2 instances are Amazon EC2 dense-storage instances that use HDD-based storage. The largest D2 instance (d2.8xlarge) offers 36 vCPUs and 244 GB of memory. D2 instances also offer enhanced networking with Intel interfaces. These instances are designed for applications that require high sequential read/write access to large datasets, such as Apache Hadoop workloads and log analysis applications.
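As a rough illustration of the choice between the two families, the sketch below records the figures quoted above and picks an instance by access pattern. The `InstanceSpec` class, the `suggest_instance` function, and the two pattern labels are hypothetical names introduced here; a real sizing decision would weigh many more factors, such as cost and dataset size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpus: int
    memory_gb: int
    storage: str  # "SSD" or "HDD"


# Figures for the largest size in each family, as quoted in this article.
SPECS = [
    InstanceSpec("i3.16xlarge", vcpus=64, memory_gb=488, storage="SSD"),
    InstanceSpec("d2.8xlarge", vcpus=36, memory_gb=244, storage="HDD"),
]


def suggest_instance(access_pattern):
    """Map a coarse access pattern to a storage type, then to an instance."""
    wanted = {"random": "SSD", "sequential": "HDD"}.get(access_pattern)
    if wanted is None:
        raise ValueError(f"unknown access pattern: {access_pattern!r}")
    return next(spec for spec in SPECS if spec.storage == wanted)


print(suggest_instance("random").name)
print(suggest_instance("sequential").name)
```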
Amazon EC2 allows for both vertical and horizontal scaling of NoSQL databases. Vertical scaling can be achieved by auto-deployment, which replaces an existing instance with a larger instance without any data impact or downtime. Horizontal scaling adds nodes to a cluster so that data is distributed evenly across them. For example, MongoDB uses sharding to distribute workloads across multiple nodes. With sharding, read and write operations are spread across the cluster to achieve high throughput.
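The idea behind distributing data evenly can be illustrated with a simplified hash-based routing function. This is a sketch of the general technique only, not MongoDB's actual hashed sharding implementation; the `shard_for` function, the key format, and the shard count are assumptions made for the example.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a key to a shard index by hashing it and taking the remainder."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


# Route 10,000 hypothetical user keys across 4 shards and count the result.
counts = Counter(shard_for(f"user-{i}", 4) for i in range(10_000))
for shard, total in sorted(counts.items()):
    print(f"shard {shard}: {total} keys")
```

Because the hash spreads keys roughly uniformly, each shard receives a similar share of the data, and therefore a similar share of reads and writes. A plain modulo scheme like this one remaps most keys when the shard count changes, which is one reason production systems use more elaborate chunk or range assignments.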
High Performance Storage Options
The two storage options used to host NoSQL workloads on Amazon EC2 instances are:
Amazon Elastic Block Store (Amazon EBS)
Amazon EBS is a block storage service from AWS that is automatically replicated within its Availability Zone to provide resilience and high availability. While there are multiple Amazon EBS volume types, the best choices for NoSQL workloads are Provisioned IOPS SSD (io1) and General Purpose SSD (gp2). At the time of writing, Provisioned IOPS SSD (io1) has a maximum of 32,000 IOPS per volume, and General Purpose SSD (gp2) has a maximum of 10,000 IOPS per volume. Both options offer single-digit millisecond latencies.
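The per-volume limits interact with volume size. The sketch below estimates the IOPS available for a given size, using the caps quoted in this article as defaults. The 3 IOPS per GiB baseline with a floor of 100 for gp2, and the 50:1 IOPS-to-GiB ratio for io1, are assumptions based on the AWS documentation for these volume types; check the current documentation before relying on them, since limits change over time.

```python
def gp2_baseline_iops(size_gib, max_iops=10_000):
    """Estimate gp2 baseline IOPS: 3 per GiB, at least 100, capped per volume."""
    return min(max(100, 3 * size_gib), max_iops)


def io1_max_provisionable_iops(size_gib, max_iops=32_000):
    """Estimate the most IOPS that can be provisioned on an io1 volume (50 per GiB)."""
    return min(50 * size_gib, max_iops)


for size in (20, 500, 5000):
    print(size, "GiB gp2 baseline:", gp2_baseline_iops(size))
for size in (100, 1000):
    print(size, "GiB io1 maximum:", io1_max_provisionable_iops(size))
```

Under these assumptions, a small gp2 volume may not deliver enough baseline IOPS for a busy NoSQL node, while an io1 volume lets you provision IOPS independently of size up to the ratio limit.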
Ephemeral Storage
Ephemeral storage (also called instance store) is the local storage of Amazon EC2 instances, and its IOPS depends on the Amazon EC2 instance type. Its data does not persist when the instance is stopped or terminated, and because you can’t create a snapshot of ephemeral storage volumes, you need to schedule a custom backup. One option is to create a new Amazon EBS volume and move data from ephemeral storage to the new volume. A typical use case is backing up MySQL data from instance store storage to EBS and then performing automated backups with EBS.
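A custom backup of this kind can be as simple as archiving the data directory onto a mounted EBS volume on a schedule. The following is a minimal sketch using only the Python standard library. The `backup_directory` function and the directory names are hypothetical, and the demonstration uses a temporary directory in place of real instance store and EBS mount points.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def backup_directory(source_dir, backup_dir):
    """Archive source_dir into a timestamped .tar.gz file under backup_dir."""
    source = Path(source_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    archive = target_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(str(source), arcname=source.name)
    return archive


# Demonstration with temporary directories standing in for the real mounts.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"            # stand-in for the ephemeral data directory
    data.mkdir()
    (data / "records.txt").write_text("example")
    archive = backup_directory(data, Path(tmp) / "ebs-backups")  # stand-in for the EBS mount
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()
print(names)
```

In practice the database should be flushed or paused so that the files are consistent while they are archived, and the function could be run from a scheduler such as cron. Once the archive is on the EBS volume, the volume itself can be protected with EBS snapshots.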
AWS is a simple cloud-based platform that offers Amazon EC2 compute services for running NoSQL applications. These services offer features such as high scalability, high availability, and excellent performance. AWS is the best choice for hosting NoSQL applications due to its 1-Click Launch option (which reduces deployment time to just a few minutes) and the flexibility of its pay-as-you-go model. When hosting NoSQL databases on Amazon EC2, follow the best practices outlined in this post to ensure efficiency.