One of the most competitive features built into Ab Initio's flow model is its data parallelism. For those of you who are new to this concept, here is a short primer (those with experience can skip to the next part).
You are accustomed to stringing components together with flows. By default, each component receives data from its upstream neighbor, does its job and passes the data downstream - often as a single in-memory operation. Components in a branch create a "pipeline" of data, where downstream components can start working on data before upstream components have finished. Multiple pipelines create branch-parallelism or component-parallelism, where disparate branches process the data simultaneously.
However, simply by applying file system anchors to a component's Layout and adding a Partitioner component at the beginning of a pipeline, the Layout will manufacture multiple instances of that component, each with its own portion of the flow (a partition). The Partitioner will faithfully take its inbound flow and convert it to the outbound partitions, feeding the data in parallel to multiple downstream instances of the partitioned components.
Think of it this way - a common example - a Sort component is partitioned 10 ways in parallel, yet on the graph canvas it appears as only one component. At run time, 10 instances of the Sort are manufactured, each being fed roughly 1/10th of the key-based data. For a million-record flow, the individual Sort operations then execute against about 100k records each. It is certainly more efficient for a Sort to crunch 100k records rather than 1 million. Divide and conquer.
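If the mechanics still feel abstract, here is a minimal sketch in plain Python - not Ab Initio code, and the field names and hash scheme are only illustrative - of what a Partitioner feeding ten parallel Sort instances is conceptually doing:

# A conceptual sketch only: one logical Sort becomes NUM_PARTITIONS instances,
# each sorting just its own slice of the data.
from multiprocessing import Pool

NUM_PARTITIONS = 10

def partition_of(key) -> int:
    # The Partitioner's job: route a key to one of the partitions.
    return hash(key) % NUM_PARTITIONS

def sort_partition(records):
    # One Sort "instance": it only ever sees roughly 1/10th of the flow.
    return sorted(records, key=lambda r: r["customer_id"])

if __name__ == "__main__":
    records = [{"customer_id": i % 50_000, "amount": i} for i in range(1_000_000)]

    # Partition the inbound flow by key.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for rec in records:
        partitions[partition_of(rec["customer_id"])].append(rec)

    # Ten Sort instances work in parallel, each on about 100k records.
    with Pool(NUM_PARTITIONS) as pool:
        sorted_partitions = pool.map(sort_partition, partitions)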
Moving right along
Parallelism has some very important guidelines that are sometimes overlooked, and breaking them can cause unexplained and unidentifiable behavior - sort of like chasing rabbits - where the anomalies appear only because we've broken a basic rule of applying parallelism. Many folks will take a stab at parallelism, watch these anomalies and then back out everything they've done. They understand serial processing; it's safe, predictable and effective. Parallelism is uncharted water for them - and awfully frightful. Was that a dragon I saw in the mist?
Not to fear. As I have noted in prior entries, you need to think in flow-based terms first rather than translating what you know about transactional or set-based processing into the equation. These are units of work in a flow. They are simple to track and understand.
A common means to track and verify a partitioning model is to extend the flow's record definition with an integer and fill it with the value of the this_partition() function from the Expression Editor. It is handy for many reasons, but invaluable for troubleshooting a bothersome flow. Simply add one more column to your DML definition - an integer - and at the first possible opportunity place the this_partition() value into the column. As you process the data, you can reference this number either by eyeball or by a component-level process to make sure your data isn't being redirected to the wrong partition - or to an unexpected partition. For clarity: Ab Initio's partitioners are rock-solid and etched in stone, but sometimes developers misuse or misunderstand them, and so data gets redirected.
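As a rough Python analogue of that troubleshooting trick (this is not Ab Initio DML, and the field names are hypothetical), the idea is simply: stamp each record with the partition it was assigned to, then check downstream that nothing wandered.

NUM_PARTITIONS = 8

def tag_with_partition(rec: dict) -> dict:
    # Stand-in for filling the extra integer column with this_partition().
    rec = dict(rec)
    rec["partition_id"] = hash(rec["customer_id"]) % NUM_PARTITIONS
    return rec

def verify_partition(expected_partition: int, records) -> None:
    # Downstream check (by eyeball or by process): anything carrying the wrong
    # number was redirected somewhere you did not expect.
    strays = [r for r in records if r["partition_id"] != expected_partition]
    if strays:
        print(f"partition {expected_partition}: {len(strays)} misrouted records")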
Teams put data parallelism into play mostly for key-based operations such as Joins, Sorts and Rollups. The main reason: when one flow is partitioned by key as a branch feeding a Join, and another flow is partitioned by the same key as another Join input, it is critical that the same keys land in the same partition, or the Join instance for that partition will produce incorrect results. When they don't, it is usually a symptom of partitioning on the wrong key or of inconsistent key formats (a partition-by-key on a decimal(8) will not necessarily land records in the same partitions as a partition-by-key on an integer(4), even though both may represent the same value).
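A tiny Python illustration of that last point - an assumed example, not Ab Initio's actual hashing - shows why the key types on both Join inputs must match before partitioning:

NUM_PARTITIONS = 8

def partition_of(key) -> int:
    return hash(key) % NUM_PARTITIONS

key_as_text    = "00001234"   # e.g. a decimal(8) key carried as zero-padded text
key_as_integer = 1234         # e.g. an integer(4) key holding the same value

# These will almost always land in different partitions, because the hash of a
# string is not the hash of the numerically equal integer.
print(partition_of(key_as_text), partition_of(key_as_integer))

# Normalize both sides to one type first and the problem disappears:
print(partition_of(int(key_as_text)) == partition_of(key_as_integer))   # True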
Here is another "tricky" scenario: you partition one branch by transaction_id, likewise another branch by transaction_id, and join them together for one integrated output. You then want to Rollup by customer_id. Not so fast - your data is partitioned by transaction_id, not customer_id. Now is the time to re-partition on the new key - customer_id - and the Rollup will work just fine. This seems like a simple example, but it's amazing how quickly it gets lost.
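In Python-sketch terms (illustrative field names only, not the real graph), the fix is nothing more than one extra re-partition pass before the Rollup:

from collections import defaultdict

NUM_PARTITIONS = 4

def partition_of(key) -> int:
    return hash(key) % NUM_PARTITIONS

def repartition(partitions, key_field):
    # Move records from their current (transaction_id-based) partitions into
    # new partitions keyed on key_field.
    new_partitions = [[] for _ in range(NUM_PARTITIONS)]
    for part in partitions:
        for rec in part:
            new_partitions[partition_of(rec[key_field])].append(rec)
    return new_partitions

def rollup_by_customer(partition):
    # Safe only because every record for a given customer now shares a partition.
    totals = defaultdict(float)
    for rec in partition:
        totals[rec["customer_id"]] += rec["amount"]
    return dict(totals)

# joined is partitioned by transaction_id; re-key it, then roll up per partition.
# by_customer = repartition(joined, "customer_id")
# results = [rollup_by_customer(p) for p in by_customer]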
Troubleshooting: take a close look at all the partitioners involved and how many degrees of parallelism each one runs with. This is important because some of your clever developers may mix and match degrees of parallelism in the same graph (say, running a 10-way Sort into a 4-way Rollup). Also look for inconsistency in the partitioning. For a partition-by-key, are all the keys the same? Have all the keys been converted to a consistent data type? (You can't partition by key on a string in one flow and an integer in another.) Are you partitioning by round-robin for a key-based operation? (It won't work.)
Confused yet? Don't be - partitioning isn't any kind of magical operation, and Ab Initio makes it a lot easier to pull off than any other environment does. So how do you avoid the confusion and avoid chasing all of the rabbits as they pop out of their holes?
First, when data arrives in your graph, commit to transforming it to a consistent, graph-facing form. The lazy man's way of data processing is to simply use the DML provided to you (say, from a file format or database-generated DML) and run with it.
The initial part of your flow should:
() convert all delimited data to fixed length
() place keys at the front of the record
() convert string/decimal keys to integer if they are otherwise numeric (see the sketch after this list)
() by default, use fixed length as the type to extract from a database - it is almost always more efficient than delimited because it reduces the overall data size.
() Partition at your earliest opportunity
() Partition-by-key before beginning the key-based flow where it will be used. Use a subgraph, if necessary, to functionally mark the difference.
() Re-partitioning is a very cheap operation. Serializing to repartition is not.
() Don't use a direct file-system anchor for your partitioning layout - use a $variable that is controlled in the sandbox or the EME environment (not in the graph-level parameters)
() Don't serialize the data unnecessarily - for example if you need to move from partition by transaction_id to partition by customer_id, just use another partition-by-key and move on
() Control degrees of parallelism on a per-project basis - such that several multi-file systems exist and each project can choose the one(s) they need without affecting other projects.
() Use multiples of 2 in your degrees of parallelism - better yet, use 2 as a multiplier to expand - for example, 2, 4, 8, 16, 32, etc. This layout pattern will decrease or eliminate the probability that your graph will fail on partition boundaries.
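As promised in the list above, here is a minimal sketch of the "graph-facing form" idea - plain Python with hypothetical field names, not DML: normalize the keys once, at the front door, so every later partition-by-key agrees.

def to_graph_facing(raw: dict) -> dict:
    # Applied at the earliest opportunity, before any partition-by-key.
    rec = dict(raw)
    # String/decimal keys that are really numeric become integers, once, here.
    rec["customer_id"] = int(rec["customer_id"])
    rec["transaction_id"] = int(rec["transaction_id"])
    return rec

def partition_of(rec: dict, key_field: str, num_partitions: int) -> int:
    # Every branch downstream partitions on an already-normalized key, so both
    # sides of any Join agree on where a given key value lands.
    return hash(rec[key_field]) % num_partitions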
The above measures, if applied consistently, will eliminate many of the parallelism rabbits, to wit:
You won't accidentally partition on keys of different types (decimal and string, for example) because all data has been converted to the graph-facing type.
You will have boosted the performance of the graph (fixed length DML, integer keys and keys at the beginning of the record all boost processing).
You will minimize the total partitioners on the graph.
You will experience portability as the graph migrates from development to test and production zones.
Your administrators will be able to expand the graph's degrees of parallelism administratively through the sandbox rather than change the graph itself.
Most importantly, you will be able to focus on the functional flow of the data processing and not sweat the underlying parallelism model.