5 Challenges In Setting Up Hybrid Clouds
Here is a simple use case. Let's say that you have an application deployed in your local data center that stays idle most of the time and peaks for only 3 hours a day, say from 4pm to 7pm. To keep it cost efficient, you want to run as few nodes as possible most of the time and, once load peaks, automatically detect that and bring up a few more nodes. The new nodes that you bring up to help your existing cluster may be in a totally different data center or a different cloud (e.g. Amazon EC2 or GoGrid), yet you want them to join your cluster and participate in load balancing, job collision resolution, job execution, fail-over, etc.
Here are some challenges to consider when setting up hybrid clouds:
1. On Demand Startup and Shutdown
Your infrastructure must be able to start up and shut down cloud nodes on demand. Usually you should have a policy that monitors some of your application characteristics and reacts to them by starting or stopping cloud nodes. In the simplest case, you can react to CPU utilization: start up new nodes if the main cloud gets overloaded and stop nodes if it gets underloaded.
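To make the simplest case concrete, here is a minimal sketch of a CPU-based policy. The class name `CpuScalingPolicy`, the thresholds and the node limits are all assumptions chosen for illustration, not GridGain APIs. The gap between the two thresholds keeps the policy from starting and stopping nodes on every small fluctuation.

```python
from dataclasses import dataclass


@dataclass
class CpuScalingPolicy:
    """Hypothetical policy: add a node above `high`, remove one below `low`."""
    high: float = 0.80   # assumed overload threshold (fraction of CPU in use)
    low: float = 0.30    # assumed underload threshold
    min_nodes: int = 2   # nodes that always stay up in the local data center
    max_nodes: int = 10  # cap on total nodes, to bound cloud cost

    def decide(self, avg_cpu: float, current_nodes: int) -> int:
        """Return the desired node count for the observed average CPU load."""
        if avg_cpu > self.high and current_nodes < self.max_nodes:
            return current_nodes + 1
        if avg_cpu < self.low and current_nodes > self.min_nodes:
            return current_nodes - 1
        return current_nodes


policy = CpuScalingPolicy()
nodes = 2
# Simulated average CPU readings over one afternoon peak.
for load in [0.20, 0.85, 0.90, 0.95, 0.50, 0.25, 0.10, 0.10]:
    nodes = policy.decide(load, nodes)
    print(f"cpu={load:.2f} -> nodes={nodes}")
```

In this run the cluster grows from 2 to 5 nodes during the peak and shrinks back to 2 afterwards. A cost-based policy could reuse the same shape and replace the CPU reading with a price or budget signal.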
2. Cloud-based Node Discovery
The main challenge in setting up regular discovery protocols on clouds is that IP Multicast is not enabled by most cloud vendors (including Amazon and GoGrid), so your node discovery protocol has to work over TCP. However, you do not know the IP addresses of the new nodes started on the cloud in advance either. To work around that, you can use the cloud storage infrastructure, like S3 or SimpleDB on Amazon, to store the IP addresses of new nodes for automatic node detection.
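The idea can be sketched roughly as follows. `SharedStore` is an in-memory stand-in for a cloud store such as S3 or SimpleDB, and `StoreBasedDiscovery`, the TTL and the addresses are hypothetical names and values used only for illustration. Each node writes its address with a timestamp, and peers ignore entries that have not been refreshed recently, so crashed nodes eventually drop out of the list.

```python
import time


class SharedStore:
    """In-memory stand-in for a cloud key-value store (hypothetical)."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def delete(self, key):
        self._items.pop(key, None)

    def items(self):
        return dict(self._items)


class StoreBasedDiscovery:
    """Nodes find each other through the shared store instead of multicast."""

    def __init__(self, store, ttl_seconds=30.0, clock=time.time):
        self.store = store
        self.ttl = ttl_seconds
        self.clock = clock

    def announce(self, node_id, address):
        """Write (or refresh) this node's address together with a timestamp."""
        self.store.put(node_id, (address, self.clock()))

    def leave(self, node_id):
        """Remove this node's entry on graceful shutdown."""
        self.store.delete(node_id)

    def peers(self, self_id):
        """Return addresses of other nodes whose entry is still fresh."""
        now = self.clock()
        return sorted(
            addr
            for nid, (addr, seen) in self.store.items().items()
            if nid != self_id and now - seen <= self.ttl
        )


store = SharedStore()
local = StoreBasedDiscovery(store)
cloud = StoreBasedDiscovery(store)
local.announce("local-1", "192.0.2.10:47500")     # example addresses only
cloud.announce("cloud-1", "198.51.100.7:47500")
print(local.peers("local-1"))  # ['198.51.100.7:47500']
```

Once a node has the peer addresses, it can open ordinary TCP connections to them; the store is used only for the initial lookup.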
3. One-Directional Communication
One of the challenges in big enterprises is opening up new ports in firewalls for connectivity with clouds. Quite often you will be allowed to make only outgoing connections to a cloud, and your middleware should support such cases. On top of that, you may sometimes run into a scenario of *disconnected clouds*, where cloud A can talk to cloud B and cloud B can talk to cloud C, but cloud A cannot talk to cloud C directly. Ideally, in such a case cloud A should be able to talk to cloud C through cloud B.
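One way to picture the disconnected clouds case is as a path search over the allowed connections. The sketch below is an assumption about how a middleware layer might pick relays, not a description of how GridGain does it; the `links` map and the `find_route` function are hypothetical. It uses breadth-first search, so it returns the route with the fewest hops.

```python
from collections import deque


def find_route(links, source, target):
    """Breadth-first search for the shortest relay path, or None if unreachable.

    `links` maps each cloud to the clouds it is allowed to connect to.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# A can reach B, B can reach A and C, but A has no direct link to C.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(find_route(links, "A", "C"))  # ['A', 'B', 'C']
```

Because the map is directed, it also covers the outgoing-only firewall case: a link from A to B does not imply that B may open a connection back to A.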
4. Latency
Communication between clouds may take longer than communication between nodes within the same cloud. Often, even communication within the same cloud is significantly slower than communication within a local data center. Your middleware layer should react to and handle such delays properly, without breaking up the cluster into pieces.
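One possible way to tolerate such delays is to give each kind of link its own heartbeat timeout, so that a slow cross-cloud link is not mistaken for a dead node. The sketch below is only an illustration: the `FailureDetector` class, the link categories and the timeout values are assumptions, and real values would have to come from measured latency.

```python
class FailureDetector:
    """Marks a node as suspect only after its link-specific timeout elapses."""

    # Assumed timeouts in seconds; tune them from measured latency.
    TIMEOUTS = {"local": 2.0, "same_cloud": 5.0, "cross_cloud": 15.0}

    def __init__(self):
        self._last_seen = {}  # node_id -> (timestamp, link_type)

    def heartbeat(self, node_id, link_type, now):
        """Record that a heartbeat from `node_id` arrived at time `now`."""
        self._last_seen[node_id] = (now, link_type)

    def suspects(self, now):
        """Return nodes that have been silent longer than their link allows."""
        return sorted(
            node
            for node, (seen, link) in self._last_seen.items()
            if now - seen > self.TIMEOUTS[link]
        )


fd = FailureDetector()
fd.heartbeat("local-1", "local", now=0.0)
fd.heartbeat("ec2-1", "cross_cloud", now=0.0)
print(fd.suspects(now=6.0))   # ['local-1']
print(fd.suspects(now=20.0))  # ['ec2-1', 'local-1']
```

After 6 seconds of silence only the local node is suspected, while the remote node gets the full 15 seconds before the cluster considers dropping it.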
5. Reliability and Atomicity
Many operations on the cloud are unreliable and non-transactional. For example, if you store something on Amazon S3 storage, there is no guarantee that another application can read the stored data right away. There is also no way to ensure that data is not overwritten, or to implement some sort of file locking. The only way to provide such functionality is at the application or middleware layer.
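As a small illustration of handling this at the application layer, a reader can poll until the value it expects becomes visible instead of trusting the first read. Everything in the sketch below is hypothetical: `EventuallyConsistentStore` is a toy model in which a write shows up only after a few stale reads, and `read_expected` is a helper invented for this example, not part of any storage API.

```python
import time


class EventuallyConsistentStore:
    """Toy store (hypothetical): a write becomes visible after `lag` stale reads."""

    def __init__(self, lag=2):
        self._visible = {}
        self._pending = {}  # key -> [value, remaining stale reads]
        self._lag = lag

    def put(self, key, value):
        self._pending[key] = [value, self._lag]

    def get(self, key):
        if key in self._pending:
            entry = self._pending[key]
            if entry[1] == 0:
                # The write has finally propagated.
                self._visible[key] = entry[0]
                del self._pending[key]
            else:
                entry[1] -= 1
        return self._visible.get(key)


def read_expected(store, key, expected, attempts=5, delay=0.1, sleep=time.sleep):
    """Poll until the store returns `expected` or the attempts run out."""
    for attempt in range(attempts):
        if store.get(key) == expected:
            return True
        sleep(delay * (2 ** attempt))  # exponential backoff between retries
    return False


store = EventuallyConsistentStore(lag=2)
store.put("job-42/result", "done")
print(store.get("job-42/result"))                                # None
print(read_expected(store, "job-42/result", "done", delay=0.0))  # True
```

Polling only addresses visibility. Protecting against overwrites or implementing locks needs coordination among the writers themselves, which again belongs in the application or middleware layer.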
There are certainly other things that could go wrong, but these turned out to be the main challenges we had to resolve while working on GridGain 3.0. Some of the cool features we plan to support are On Demand Startup and Shutdown Policies (including Cost-based policies) and Disconnected Clouds. GridGain 3.0 is planned for release this summer, so stay tuned!