An Interview with Pervasive's Mike Hoskins from JavaOne 2007
by Frank Sommers and Bill Venners
July 2, 2007
Pervasive Software recently introduced a new Java parallel dataflow framework, DataRush. In this interview with Artima, Pervasive's CTO Mike Hoskins explains how dataflow parallelism helps speed up data processing by taking advantage of multi-core CPUs.
As parallelism is taking center stage in response to multi-core processors and ever-increasing amounts of data and transaction volume, time-tested parallel processing techniques are enjoying a renaissance. One such technique is dataflow parallelism. Based on the concept of interrelated values where change in one value can trigger changes in related values, dataflow is perhaps best-known as a popular technique used in spreadsheet engines.
One advantage of dataflow is that is helps create highly decoupled systems. As a result, dataflow parallelism, or pipeline parallelism, has been used in high-performance computing systems since at least the 1970s.
Recently, Pervasive Software developed a Java-based dataflow framework, DataRush. In an interview with Artima, Mike Hoskins, CTO of Pervasive, describes how DataRush helps take advantage of multi-core processors:
Data flow technology has been around for decades. It's one of the popular ways to produce data-intensive applications when you went to employ parallelism. When we looked around to decide how [to make data processing faster on parallel systems], we realized that dataflow was a highly efficient way to do data-intensive programming.
The user doesn't have to be aware, but under the covers we enjoy the pipeline parallelism that's inherent in all dataflow engines. Think of a visual environment, where you are building a flow from left to right, [and] your flow consists of a series of steps, which we call operators. There could be low-level atomic operators, or there could be high-level composite operators. Our component framework allows you to create assemblies, which is the name we give to those high-level composite operators. You are [that way] able to stitch together a series of operators that operate on the data as it's flowing through the pipeline.
We implemented ... extensions to dataflow principles... We further implement two kinds of partitioning, in addition to the pipeline parallelism: horizontal and vertical partitioning. The operators that we have written already are highly parallelized internally. They dispatch the actual workload on multiple threads, [something the system] can calculate at runtime. Horizontal partitioning is rather traditional—imagine a round-robin partitioner. As we're reading data, we can start to partition that data, and consume multiple threads on multiple cores, as many as you give us.
The real leg, though, comes with our third type of parallelism, vertical partitioning. Imagine you can bring in a large transaction, or a large row, or an EDI retail message, and you can decompose [that] into its individual field elements, its column fragments. Provided there are no data dependencies, you could launch those autonomously on the directed dataflow graphs. Even though you [may] see that as a problem of reading in a million rows and processing, maybe many operations on multiple columns, we actually fracture the data, and enjoy pipeline parallelism, round-robin horizontal partitioning, and even vertical partitioning, so you get maximum use of all the cores.
Hoskins noted in the interview that such highly parallelized processing can be done only on data with relatively few interdependencies between data elements:
That means you have to know certain things about your data to really be able to exploit DataRush. Think of rowset data, traditional structured data, which can be pretty loose, such as a weblog, or it could be flat files, or EDI transactions, or proprietary COBOL layouts that are fixed-length with multiple record types, or straightforward CSVs, or XML, or JDBC—because databases, of course, are big owners of rowset data. If you have anything that looks like rows and columns, records and fields, then we can do reasonable well on that data. If there is data that isn't very structured, it may not be that appropriate [for DataRush]. If your operations involve huge amounts of dependencies, then it's hard for any kind of parallelism to work.
What do you think is the biggest challenge today for a Java developer wanting to benefit from multi-core processors?
Post your opinion in the discussion forum.
Frank Sommers is Editor-in-Chief of Artima Developer. He also serves as chief editor of the IEEE Technical Committee on Scalable Computing's newsletter, and is an elected member of the Jini Community's Technical Advisory Committee. Prior to joining Artima, Frank wrote the Jiniology and Web services columns for JavaWorld.