Differences between revisions 2 and 4 (spanning 2 versions)
Revision 2 as of 2005-08-30 21:39:07
Size: 5087
Editor: JohnVk
Comment: added open issue about cost of over-populating the TERFS
Revision 4 as of 2005-09-09 22:05:48
Size: 9895
Editor: KenSchalk
Comment: My comments and responses

Bridgeless Vesta

AKA Thin Bridge

The theory is that if you didn't have to learn about bridges, and needed to learn only minimal SDL, the ramp to learning Vesta would be much shorter.

Thus I propose a bridgeless methodology.

In this mode, users would populate Vesta with the tools they need, in a directory structure copied from what they already use on Linux. It should then be trivial to convert that structure into a binding with the same hierarchy and put it in root for the temporary encapsulated root filesystem (TERFS), which then has the same hierarchy. The point here is that once they start executing run_tool() functions, the TERFS environment looks the same to them as their regular UNIX environment.

Then there's some relatively trivial code to move input files into the TERFS and result files out, but that's boilerplate.
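
The idea of turning an existing directory layout into a binding with the same hierarchy can be pictured with a small Python model. This is only a sketch: the nested dict stands in for an SDL binding, and the function name tree_to_binding and the sample paths are made up for illustration. It is not Vesta code.

```python
import os
import tempfile


def tree_to_binding(top, rel=""):
    """Mirror the directory tree under `top` as nested dicts.

    Directories become dicts and files map to their path relative to `top`.
    The dict is only a stand-in for an SDL binding (an assumption made for
    illustration); real SDL would bind names to file values instead.
    """
    binding = {}
    for name in sorted(os.listdir(os.path.join(top, rel))):
        child = name if not rel else rel + "/" + name
        if os.path.isdir(os.path.join(top, child)):
            binding[name] = tree_to_binding(top, child)
        else:
            binding[name] = child
    return binding


# Build a tiny Linux-like layout in a scratch directory and mirror it.
with tempfile.TemporaryDirectory() as top:
    for rel_path in ("bin/sh", "usr/bin/perl"):
        full = os.path.join(top, rel_path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    demo_binding = tree_to_binding(top)

print(demo_binding)
```

The resulting structure has one nested level per directory, which is the property the proposal relies on: a tool looking for /usr/bin/perl finds it at the same relative place.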

I am explicitly suggesting that we give up the benefit of portability that we get with bridges (but see below) to lessen, perhaps dramatically, the ramp to get these benefits of Vesta:

  1. reproducibility
  2. cached results
  3. shared build products
  4. auto-snapshot work areas.
  5. explicit versioning (no "ground pulled out from under you" syndrome)

That seems like a lot!

Steps to adding a flow in Bridgeless Vesta

  1. pick a location in the repository, e.g. /vesta/my.domain.com/play/$USER/mynewflow/ (for instance /vesta/mmdc1.intel.com/play/jvkumpf/mynewflow/).

  2. mkdir /vesta/mmdc1.intel.com/play/jvkumpf/mynewflow/.

  3. vcreate /vesta/mmdc1.intel.com/play/jvkumpf/mynewflow/pkg/.

  4. mvco /vesta/mmdc1.intel.com/play/jvkumpf/mynewflow/pkg/.

  5. in /vesta-work copy build.ves from the template

  6. add tools to /vesta/mmdc1.intel.com/play/bridgelessvesta/path/bin/, i.e. /vesta/mmdc1.intel.com/play/bridgelessvesta/path/bin/newtool/pkg/.

    • under bridgelessvesta/path/bin/ are /bin, /usr/bin, /usr/local/bin, etc.

    • every tool gets its own pkg. So the hierarchy is just like UNIX, but with an extra pkg/N level inserted to do the versioning

  7. boilerplate imports of things from /bin including perl, awk, grep, sh, csh, whatever

  8. files clause to include input files

  9. files clause to include any user-specific scripts

  10. assemble command line from strings, directly calling tools
  11. call _run_tool() or thin bridge version of run_tool()

  12. extract results from _run_tool()'s return binding (boilerplate)

Open Issues

1. Should we open up people's existing flow scripts, pull out the steps, and write them as separate _run_tool() calls, or just call the multi-step perl, csh, or sh script with a single _run_tool() call? Why or why not?

  • adv separate: each tool call can be cached separately
  • adv single: less SDL to write

2. Should this template/boilerplate build.ves file call _run_tool() directly, or should it call a thin bridge which just takes its inputs and then calls _run_tool()?

  • adv thin bridge: the thin bridge version would allow passing a single string instead of a list of strings; requiring a list forces users to learn at least a little of the SDL list syntax.
  • adv directly: the thin bridge would complicate the template because the thin bridge would have to be imported. If this is all boilerplate, then maybe this disadvantage doesn't matter.
  • adv thin bridge: the thin bridge gives an opportunity for porting or other help later on

3. (Someone knows the answer to this, just not me.) What is the "cost" of putting a lot of files, say thousands, in the TERFS? I do not believe they are copied in there; rather, they are just Vesta API references to the permanent sid storage. So the cost of, for example, automatically populating every _run_tool() call with everything in the typical user's path would be the cost of creating (OK, now we're at) 10k of these references (I like to call them soft links, but I know they're not).

Portability

Do we really give up portability? Perhaps not. A build.ves file which contains explicit calls to _run_tool() with explicit strings, switches, and pathnames, coupled with some knowledge that it runs on a particular version of linux/gnu/gcc/etc, becomes an abstract description of what that build.ves file should do. I.e., a specific detailed script applied to a specific version of the platform describes an abstract step.

It should be theoretically possible, then, to port this build.ves file to another platform, or, better said, to run this build.ves file on another platform without editing it, by passing it through a translation layer that interprets its fixed strings against its original target platform, infers from that the meaning and goal of each command and option, and creates the equivalent on the new platform.

If this idea is implemented with a "thin bridge" which just takes its inputs and calls _run_tool() directly, then the thin bridge is a great place to intercept the call and translate to another platform.

What's so hard about writing bridges?

There are several things which make writing bridges hard:

  1. Setting up the encapsulated filesystem. 99% of UNIX users have never set up a chroot environment. It's not easy. Just getting your head around the idea of chroot, the fact that you have to provide all the files and that you can put in any files in any arrangement you want, is difficult for most people. Even once they understand that, they may not know all the files they need, and discovering the complete set can be difficult.

  2. Setting up the environment variables. 99% of UNIX users don't think about which environment variables a given command they type depends upon. They just have some shell startup scripts that set some for them, and they get inherited by every process they run. Starting with no environment variables forces you to know every one you need. These dependencies are often poorly documented and otherwise difficult to find out.

  3. SDL the language. It's peculiar to Vesta. It's functional, which many people find strange. It has data types which people aren't familiar with (bindings). It treats functions as first-class values, which people also often find strange.

  4. Abstraction and generalization. UNIX users are used to thinking in terms of individual command lines they type at a shell prompt. However, most build processes that users would write up in SDL they'll want to re-use at some point. That means abstracting them in some way, making it possible to substitute in different pieces (different source files, different command-line arguments, different versions of the executable software). That's what a "bridge" is: an abstracted, generalized way of running one or more processes.
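
The effect described in item 2 can be tried outside Vesta. The following Python sketch starts a child process with an explicitly supplied environment instead of the inherited one. It assumes a UNIX-like host with /usr/bin/env available, and the PATH value is a hypothetical example, not something Vesta prescribes.

```python
import subprocess

# Hypothetical minimal environment; the variable and its value are illustrative.
minimal_env = {"PATH": "/usr/bin:/bin"}

# With env={} the child inherits nothing that shell startup scripts set up.
empty = subprocess.run(
    ["/usr/bin/env"], env={}, capture_output=True, text=True
)

# Every variable the tool needs has to be listed explicitly.
explicit = subprocess.run(
    ["/usr/bin/env"], env=minimal_env, capture_output=True, text=True
)

empty_vars = empty.stdout.splitlines()
explicit_vars = explicit.stdout.splitlines()
print(empty_vars)
print(explicit_vars)
```

A tool that silently relied on an inherited variable fails in the first case, which is the situation a bridge author has to diagnose.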

How do the suggestions above simplify this?

People do write SDL like what's described above (straight-line code that sets up filesystems and environment variables and calls _run_tool one or more times with hard-coded sources/executables). That's usually the first step towards writing a bridge.

It seems to me (KenSchalk) that the only effort the approach this page is talking about saves is item #4 on the list above (abstraction and generalization). Setting up the filesystem and environment variables may still be hard (depending on the tool). Learning SDL is still a hurdle, though maybe you can get away with doing less in the language if you're just writing straight-line code.

Specific responses

  • In this mode, what these people would be doing is populating vesta with the tools they need -- in a directory structure copied from what they already do in linux. then it should be something trivial to convert that into a binding w/ the same hier and put that in root to the temporary encapsulated root filesystem (TERFS) w/ the same hier.

Sure, that's easy:

{{{
% cd /vesta-work/jsmith/foo
% mkdir root
[copy files into the root subdirectory]
% cat > build.ves
files
  root;
{
  . ++= [ root ];
  // ...
}
}}}

  • the thin bridge version would allow the passing of strings, instead of making the user pass a list of strings, forcing them to learn, at least a little, the SDL list syntax.

How do you parse that into separate strings to make the command line? What if the user tries to put a string in quotes on the command line? Do you pass the command-line to a shell to be interpreted? If so which one (csh, tcsh, bash, sh, etc.)?
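
To make the ambiguity concrete, here is a short Python sketch comparing two ways a thin bridge might turn one string into an argument list. The command string is a made-up example. shlex.split follows POSIX sh quoting rules, which is only one of the possible answers to the "which shell?" question; csh-style quoting would behave differently.

```python
import shlex

# Hypothetical command string a user might hand to a thin bridge.
command = 'grep "hello world" input.txt'

# Naive whitespace splitting breaks the quoted argument into two pieces
# and leaves the quote characters in place.
naive = command.split()

# POSIX-sh-style splitting keeps the quoted argument together
# and removes the quotes.
posix = shlex.split(command)

print(naive)
print(posix)
```

Whichever rule a thin bridge picked, it would have to be documented, since users coming from different shells would expect different results.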

  • What is the "cost" of putting a lot of files, like 1000s in the TERFS.

You're correct, the files are not copied in any way. They're not like symbolic links (ln -s), they're more like hard links (ln), but that's really just an analogy.

The temporary filesystem only exists in the memory of the evaluator until the tool accesses it. Whenever the tool tries to access some file/directory, the repository calls back the evaluator to get information about the file/directory being accessed.

So, the "costs" to consider are:

  1. Time to construct the ./root binding in SDL (there are some ways that are faster than others).

  2. Memory in the evaluator to hold the ./root binding.

  3. Number and latency of network transactions to access each file/directory: from the host running the tool to the repository (NFS), and from the repository back to the evaluator (SRPC).

Also, how you organize files in the ./root may matter in some cases. Suppose, for example, that you have a tool which lists the entire contents of the current working directory (typically ./root/.WD in SDL). If you put lots of extra files in there, the network transactions to list the directory will take longer.
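
A toy Python model can illustrate why unused files are cheap while a crowded working directory is not. Everything here is an assumption made for illustration: the ToyEvaluator class, its counters, and the idea that a listing costs in proportion to the number of entries returned. It does not reproduce the actual NFS or SRPC traffic.

```python
class ToyEvaluator:
    """Holds the root binding in memory and answers lookups on demand."""

    def __init__(self, root):
        self.root = root
        self.callbacks = 0       # how many times the evaluator was called back
        self.entries_listed = 0  # directory entries returned by listings

    def _resolve(self, path):
        node = self.root
        for part in path.split("/"):
            if part:
                node = node[part]
        return node

    def lookup(self, path):
        self.callbacks += 1
        return self._resolve(path)

    def listdir(self, path):
        self.callbacks += 1
        names = sorted(self._resolve(path))
        self.entries_listed += len(names)
        return names


# 1000 tools sit in bin/ but the tool never touches them.
tools = {"tool%d" % i: "file" for i in range(1000)}
lean = ToyEvaluator({"bin": tools, ".WD": {"main.c": "file"}})
lean.lookup(".WD/main.c")
lean.listdir(".WD")

# Same tool behaviour, but 1000 extra files were placed in the working directory.
extras = {"extra%d" % i: "file" for i in range(1000)}
crowded = ToyEvaluator({"bin": tools, ".WD": {**extras, "main.c": "file"}})
crowded.lookup(".WD/main.c")
crowded.listdir(".WD")

print(lean.callbacks, lean.entries_listed)
print(crowded.callbacks, crowded.entries_listed)
```

In this model the untouched files under bin/ add nothing to either counter, while the extra files in the working directory inflate the listing. That matches the point above about organizing ./root with the tool's access pattern in mind.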