Cloud Zone is brought to you in partnership with:

Dmitriy Setrakyan manages daily operations of GridGain Systems and brings over 12 years of experience to GridGain Systems which spans all areas of application software development from design and architecture to team management and quality assurance. His experience includes architecture and leadership in development of distributed middleware platforms, financial trading systems, CRM applications, and more. Dmitriy is a DZone MVB and is not an employee of DZone and has posted 42 posts at DZone. You can read more from them at their website. View Full User Profile

PDC Theorem – The Underlying Principle of Real-Time Distributed Processing

10.21.2011
| 1186 views |
  • submit to reddit

Few months ago I blogged about real-time processing in enterprise systems of all stripes, and since then I’ve talked quite a bit about the underlying principle of such processing. I call it a PDC theorem as in “Processing, Data, and Co-Location”.

The idea behind PDC is pretty simple. Real-time response in a highly distributed system is not achievable unless the following 3 rules are followed:

  • Processing must be distributable for in-memory computation
  • Data storage must be distributable (i.e. partitioned) for in-memory storage
  • Co-location must be ensured between processing and data units to provide locality of remote operations

Few important notes:

  • We, of course, are talking about business or perceptual real-time (a.k.a. Near Real-Time or nRT) and not about hardware real-time. Perceptual real-time response is not well defined but you can conceptually visualize it as the time the user of the system willing to wait for the response that he or she expects right away… In most cases it means few seconds or less. In rarer cases like FOREX trading, for example, the real-time would mean microseconds.
  • It is critically important that your processing supports algorithmic parallelization. Not all tasks can be parallelized and therefore not all tasks can be optimized for real-time processing. However, many of the typical business and social graph tasks can be split into multiple sub-tasks executing in parallel – and therefore are trivially parallelizable.
  • Data have to be partitioned and stored in-memory. Any outside calls to get data from NoSQL storage, file systems like HDFS, or traditional SQL storage renders any real-time attempts useless in most cases. This is one of the most critical design element and it is often overlooked. In other words – in no time the remote processing should escape the boundaries of the local JVM it is executing on.
  • Co-location of the processing and data (a.k.a affinity-based routing referring to the fact that should be an affinity between the computation and the data this computation needs) is the main mechanism to ensure that there is no noise data transfer between remote nodes where a task is being processed. Such unnecessary data transfer will violate the locality principle of the remote operations making real-time processing often unachievable.

It’s also quite obvious that PDC theorem doesn’t guarantee the real-time processing – it merely states these three rules are necessary but not enough on their own for a real-time response. Latencies of the atomic remote operations will often dictate whether or not real-time response is achievable in practice.

It is interesting to note that combination of PDC and CAP theorems really defines the fundamentals of high performance distributed programming today.

 

Source: http://gridgaintech.wordpress.com/2011/10/19/pdc-theorem-the-underlying-principle-of-real-time-distributed-processing/

Published at DZone with permission of Dmitriy Setrakyan, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags: