Tuesday, September 02, 2008

Abinitio Glossary

AB INITIO GLOSSARY

Ad-hoc Multifile: A parallel dataset created by naming a set of serial files as its partitions. These partitions can be named by explicitly listing the serial files, or by using a shell expression that expands at runtime to a list of serial files.

Co>Operating System: The Co>Operating System is core software that unites a network of computing resources-CPUs, storage disks, programs, datasets-into a production-quality data processing system with scalable performance and mainframe reliability. The Co>Operating System is layered on top of the native operating systems of a collection of servers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.

Component parallelism: A graph with multiple processes running simultaneously on separate data uses component parallelism.

Checkpoint is a phase that acts as an intermediate stopping point in a graph to safeguard against failures. By assigning phases with checkpoints to a graph, you can recover completed stages of the graph if failure occurs.

Control partition is the file in a multifile that contains the locations (URLs) of the multfile's data partitions.

Data parallelism: A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio provides Partition components to segment data, and Departition components to merge segmented data back together.

Deadlock occurs when a program cannot progress. It depends on the patterns of the data and typically occurs in graphs with data flows that split and then join.

A graph carries a potential deadlock when flows diverge and converge within the same phase. If the flows converge at a component that reads its input flows in a particular order, that component may wait for records to arrive on one flow even as the unread data accumulates on others because components have a limited buffering capacity.

DML is an acronym for Data Manipulation Language. It is the Ab Initio programming language you use to define record formats (which are kinds of types), expressions, transform functions, and key specifiers. Expressions include a large number of built-in functions. Files with .dml extensions contain record format definitions.

Export: You can export or pass on properties (parameters, layouts, or ports) from components to graphs or subgraphs. This creates a graph level parameter that is referenced by the original parameter in the component.

If you export properties from two different components and give them the same name, the properties will have the same value. A typical use is to export a key parameter from a Partition by Key and a Sort, so they have the same value.

Fan-in flows connect components with a large number of partitions to components with a smaller number of partitions. The most common use of fan-in is to connect flows to Departition components. This flowpattern is used to merge data divided into many segments back into a single segment, so other programs can access the data.

When you connect a component running in parallel to any component via a fan-in flow, the number of partitions of the original component must be a multiple of the number of partitions of the component receiving the fan-in flow. For example, you can connect a component running 9 ways parallel to a component running 3 ways parallel, but not to a component running 4 ways parallel.

To deal with the latter case, Repartition the data by inserting one of the Partition components between the components. This turns the fan-in-flow into an All-to-all flow, allowing a record from any partition of one component to flow into any partition of the other component.

Fan-out flows connect components with a small number of partitions to components with a larger number of partitions. The most common use of fan-out is to connect flows from partition components. This flow pattern is used to divide data into many segments for performance improvements.

When you connect a Partition component running in parallel to another component running in parallel via a fan-out flow, the number of partitions of the component receiving the fan-out flow must be a multiple of the number of partitions of the Partition component. For example, you can connect a Partition component with 3 partitions via a fan-out flow to a component with 9 partitions, but not to a component with 10 partitions.

To deal with the later case, Repartition the data by inserting a Departition component after the Partition component. This turns the fan-out flow into an All-to-all flow, allowing a record from any partition of the Partition component to flow into any partition of the target component.

Flow: A flow carries a stream of data between components in a graph. Flows connect components via ports. Ab Initio supplies four kinds of flows with different patterns: straight, fan-in, fan-out, and all-to-all.

Graph is a diagram that defines the various processing stages of a task and the streams of data as they move from one stage to another. Visually, stages are represented by components and streams are represented by flows. The collection of components and flows comprise an Ab Initio graph. You can create graphs from the main menu of the GDE.

GDE: The Graphical Development Environment (GDE) provides a graphical user interface into the services of the Co>Operating System.

Layout: A layout is a list of host and directory locations, usually given by the URL of a multifile. If the locations are not in a multifile, the layout is a list of URLs called a custom layout.

A program component's layout lists the hosts and directories in which the component runs. A dataset component's layout lists the hosts and directories in which the data resides. Layouts are set on the Properties Layout tab.

The layout defines the level of parallelism. Parallelism is achieved by partitioning data and computation across processors.

Ab Initio uses layout markers to show the level of parallelism on components. If the GDE can determine the level of parallelism, it uses that level as the marker. For example, non-parallel files have a marker of 1. If the GDE cannot determine the level of parallelism, it abbreviates layouts as L1, L2, and so on. An asterisk next to a layout marker (L1*) indicates a propagated layout.

.mdc file: A file with an .mdc file extension represents a Dataset or custom dataset component.

Multifile is a parallel file that is composed of individual files on different disks and/or nodes. The individual files are partitions of the multifile. Each multifile contains one control partition and one or more data partitions. Multifiles are stored in distributed directories called multidirectories.

The data in a multifile is usually divided across partitions by one of these methods:
Random or roundrobin partitioning
Partitioning based on ranges or functions, or
Replication or broadcast, in which each partition is an identical copy of the serial data.

A partition is a file that is a portion of a multifile. A partition is a segment of a parallel computation.

Phase is a stage of a graph that runs to completion before the start of the next stage. By dividing a graph into phases, you can save resources, avoid deadlock, and safeguard against failures.

If a graph has deadlock potential, for example in a merge-split situation, use different phases in the upstream and downstream components to stagger the reading and writing to disk.

To protect a graph, all phases are checkpoints by default. A checkpoint is a special kind of phase that saves status information and allows you to recover from failures.

Pipeline Parallelism: A graph with multiple components running simultaneously on the same data uses pipeline parallelism.

Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel.

Record format is either a DML file or a DML string that describes data. You set record formats on the Properties Parameters tab of the Properties dialog box. A record format is a type applied to a port.

Repartition data is to change the degree of parallelism or the grouping of partitioned data. For instance, if you have divided skewed data, you can repartition using a Partition by Key connected to a Gather with an All-to-all flow.

Sandbox: A sandbox is a collection of graphs and related files that are stored in a single directory tree, and treated as a group for purposes of version control, navigation, and migration. A sandbox can be a file system copy of a repository project.

Skew refers to lopsided data storage or program execution. Causes of skew range from uneven input data to different loads on different processors. Repartitioning often alleviates skewed input data. Modifying the layouts often repairs skewed program execution. Skew for a data storage partition is defined as:
(N- AVERAGE) /MAX
Where: N is the number of bytes in that partition
AVERAGE is the total number of bytes in all the partitions divided by number of partitions
MAX is the number of bytes in the partition with the most bytes.

Skew for program execution is similar, using CPU seconds instead of number of bytes.

Straight flows: connect components that have the same number of partitions. Partitions of components connected with straight flows have a one-to-one correspondence. Straight flows are the most common flow pattern.

Subgraph is a graph fragment. Just like graphs, subgraphs can contain components and flows. Subgraphs are useful for grouping a graph into subtasks and reusing them.

Watcher: A watcher lets you view the data that has passed through a Flow.
To use a watcher, do the following:
Turn on debugging mode.
Add a watcher on a flow.
Run the graph.
View the data.

No comments: