Accommodating pipes in programs.py #387
Conversation
@hannes-ucsc — Ready for review
Can you describe a use case for the pipe-in-single-container mode? It seems to me that the pipe-of-containers mode is more useful. If we can eliminate the former mode, I might have a suggestion on how to assemble the pipe without passing …

Note that @jvivian, @arkal and @jpdna had a very difficult time piping out of a container. They had to disable a particular logging configuration in Docker to make it work.
I believe, and I consulted with @jvivian on this, that running a pipe in a container would be more efficient if the pipe only required that one container's set of binaries to run. For instance, many steps of my vg pipeline are piped commands that contain only vg calls; it is simply faster to run the whole piped command through one vg container than it is to run one part of the pipe through a vg container, pipe its output into the input of another vg container run, and so on. Both options are supported by my addition to docker_call regardless: to use the pipe-of-containers functionality you'd just supply a list of docker container names in the 'tool' parameter, which is a little redundant if it's the same container every time. I don't know the exact context in which @jvivian, @arkal, or @jpdna had trouble piping out of a container, but I believe I've resolved that here, as I've tested the functionality on my vg pipeline and it works.
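To make the two modes concrete, here is a hypothetical sketch (not the PR's actual implementation) of how a list of piped sub-commands might map onto containers. The helper name `build_shell_pipeline` is invented for illustration; only the two-mode idea comes from the discussion above.

```python
# Hypothetical sketch of the two piping modes discussed in this PR.
# `commands` is one shell fragment per pipe segment; `tools` is either a
# single image name (pipe-in-single-container) or one image per segment
# (pipe-of-containers).

def build_shell_pipeline(commands, tools):
    if isinstance(tools, str):
        # pipe-in-single-container: run the whole pipeline in one container
        return "docker run %s /bin/sh -c '%s'" % (tools, ' | '.join(commands))
    # pipe-of-containers: each segment gets its own container; `-i` keeps
    # stdin open so data can flow between containers
    assert len(tools) == len(commands), 'need one image per pipe segment'
    return ' | '.join('docker run -i %s %s' % (t, c)
                      for t, c in zip(tools, commands))
```

With a repeated image name, the pipe-of-containers form degenerates into several single-purpose containers of the same image, which is the redundancy mentioned above.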
Fair enough. We need tests for both modes. The tests should pipe considerable amounts of data (tens of GiB), since that is where we saw the issues in the pipe-of-containers mode.
Does the machine that runs the unit tests support tens of GiB of memory? I'm guessing my unit test would need to generate a file of that size at run time.
No, but the very point of pipes is that their memory overhead is O(1), not O(n), with n being the number of bytes going through them. As we'll see, there is a problem with pipes coming out of a docker container, but we'll cross that bridge when we get there. You also don't need to pre-generate a file since that, again, contradicts the point of pipes. You can read arbitrary amounts of random data from /dev/urandom. Here's an idea:
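A minimal sketch of that idea, assuming a Unix system with gzip and gunzip on the PATH: stream data from /dev/urandom through a gzip | gunzip pipe and compare digests on both ends. Memory use stays constant no matter how many bytes flow through, so the same code works for tens of GiB.

```python
import hashlib
import subprocess
import threading

def roundtrip_digest(num_bytes, chunk=1 << 16):
    # gzip | gunzip connected by an OS pipe; neither side buffers the stream.
    gz = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE)
    gun = subprocess.Popen(['gunzip', '-c'], stdin=gz.stdout,
                           stdout=subprocess.PIPE)
    gz.stdout.close()  # let gunzip see EOF once gzip exits
    in_md5, out_md5 = hashlib.md5(), hashlib.md5()

    def feed():
        # Read random data chunk by chunk, hashing it on the way in.
        with open('/dev/urandom', 'rb') as src:
            left = num_bytes
            while left:
                block = src.read(min(chunk, left))
                in_md5.update(block)
                gz.stdin.write(block)
                left -= len(block)
        gz.stdin.close()

    # Feed in a thread so reading and writing overlap (avoids pipe deadlock).
    t = threading.Thread(target=feed)
    t.start()
    for block in iter(lambda: gun.stdout.read(chunk), b''):
        out_md5.update(block)
    t.join()
    gz.wait()
    gun.wait()
    return in_md5.hexdigest(), out_md5.hexdigest()
```

The feeder thread is what keeps the memory bound at O(1): no more than one chunk plus the kernel's pipe buffer is ever in flight.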
Force-pushed from 2026a1c to bb7ab62
@hannes-ucsc — Ready for review
    Running 'pipe-of-containers' mode for command
    'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
    command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
Is this right? A four-element array?
It would also be helpful if there were less duplication, so the difference stands out more.
The `-` isn't necessary for gzip and gunzip, I think. It is also inconsistent with the example above.
The `-` is necessary in order to make the piping between docker containers work. The four-element list is correct: one tool for each sub-command in the pipe. The main use case would be to run the whole pipe within that one container, but just as an example it shows how the pipe between containers works.
On Jul 28, 2016 6:38 PM, "Hannes Schmidt" [email protected] wrote:

> In src/toil_scripts/lib/programs.py (#387 (comment)):
>
>     :param bool mock: Whether to run in mock mode. If this variable is unset, its value will be determined by the environment variable.
>
>     Piping docker command examples:
>
>     Running 'pipe-in-single-container' mode for command
>     'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
>     command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
>     work_dir=curr_work_dir
>     docker_tools=['ubuntu']
>     stdout = docker_call(work_dir=docker_work_dir, parameters=command, tool=docker_tools, check_output=True)
>
>     Running 'pipe-of-containers' mode for command
>     'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
>     command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
>
> The `-` isn't necessary for gzip and gunzip, I think. It also is inconsistent with the example above.
I added and then deleted comments, so disregard the emails you may have gotten and instead look at GitHub directly. Instead of overloading …

It's important that we get the function signature right because it will be hard to change later.
Force-pushed from fb79203 to 887292f
@hannes-ucsc — Ready for review
@fnothaft — ready for review
@@ -15,7 +15,8 @@ def mock_mode():
     return True if int(os.environ.get('TOIL_SCRIPTS_MOCK_MODE', '0')) else False

-def docker_call(tool,
+def docker_call(tool='',
You should never use mutable types as default values for keyword arguments. It's also unnecessary: `tool` and `tools` should default to None, and since bool(None) is False, the truthiness tests below will still work.
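A quick illustration of the pitfall being pointed out here, with invented function names standing in for docker_call's parameters. The mutable default is created once, at function definition time, and is then shared across every call.

```python
# Anti-pattern: the default list is evaluated once and reused between calls.
def bad_call(parameters=[]):
    parameters.append('x')
    return parameters

# The usual fix: default to None. Since bool(None) is False, truthiness
# checks like `if parameters:` behave the same as for '' or [].
def good_call(parameters=None):
    parameters = [] if parameters is None else parameters
    parameters.append('x')
    return parameters
```

Calling `bad_call()` repeatedly grows the shared list, while `good_call()` gets a fresh list each time.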
     _log.debug("Calling docker with %s." % " ".join(base_docker_call + [tool] + parameters))
     docker_call = base_docker_call + [tool] + parameters
+    if bool(tools) == bool(tool):
Use require(). Search the code base or ask @jvivian about it.
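A sketch of that exclusivity check written with a require()-style helper. The real helper lives in the toil-scripts code base; the stand-in below assumes a `(condition, message)` signature, which may differ from the actual one.

```python
# Minimal stand-in for the code base's require() helper (signature assumed).
def require(expression, message):
    if not expression:
        raise ValueError(message)

def check_modes(tool=None, tools=None):
    # Exactly one of `tool` / `tools` must be supplied. The original code
    # raised when bool(tools) == bool(tool); expressed positively that is
    # an exclusive-or, hence `!=` here.
    require(bool(tool) != bool(tools),
            "Either 'tool' or 'tools' must be given, but not both.")
```

The require() form states the valid condition directly instead of hand-rolling an `if`/`raise` pair at every call site.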
    try:
        if outfile:
-           subprocess.check_call(docker_call, stdout=outfile)
+           subprocess.check_call(docker_call, stdout=outfile, shell=shell_flag)
        elif check_output and return_stderr:
Can be rewritten with nested `if`s to only examine each flag once.
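To illustrate the suggested restructuring, here is a hedged sketch (not the PR's code): each flag is tested once, with the `return_stderr` check nested under `check_output` instead of re-tested in a flat `elif` chain. The parameter names mirror the diff above; `shell` and the command string are placeholders.

```python
import subprocess

def run(docker_command, outfile=None, check_output=False,
        return_stderr=False, shell=False):
    # Nested-if dispatch: each flag is examined at most once.
    if outfile:
        return subprocess.check_call(docker_command, stdout=outfile,
                                     shell=shell)
    if check_output:
        if return_stderr:
            # fold stderr into the captured output
            return subprocess.check_output(docker_command, shell=shell,
                                           stderr=subprocess.STDOUT)
        return subprocess.check_output(docker_command, shell=shell)
    return subprocess.check_call(docker_command, shell=shell)
```

Compared with `elif check_output and return_stderr:`, the nesting makes the flag hierarchy explicit and avoids repeating `check_output` in later branches.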
Still -1. I thought we could make …

Whatever we do, the signature is going to be ugly, which is why I hesitate to merge this PR, even after the other issues are resolved. There is also the issue that @fnothaft is the merge master now. He will have to make that decision.
Force-pushed from 887292f to 7dc5122
Force-pushed from 7dc5122 to aae9530
@fnothaft — ready for review
I'm in -1 agreement with @hannes-ucsc, for several reasons, primarily:

RE: … This is true, but I don't think that overloading … is the right way to go.

This proposed approach is just clunky.
The problem with running commands contained in files passed into a mounted directory of a single container is that you can't run a pipeline command that uses multiple tools living in different Docker containers. A workaround would be to create a new Docker container that contains all of the necessary tools, but that wouldn't be granular or practical.
Do we currently have a use case for this? I thought the use case we currently had was running multiple vg commands through a pipe. I agree with your point, but I prefer not to solve problems before they exist.
Charlie, right now you need just one of the two modes, the one where the entire pipeline runs inside a single container, correct? Do you want to take a stab at implementing just that and see if it is less invasive? In the long run we need to clean up …
@hannes-ucsc Yeah, I can take a stab at it. It shouldn't be hard, since I'm basically just removing the pipe-of-containers functionality.
Any PRs targeting the common lib, like this one, are going to be complicated by DataBiosphere/toil#922.
@jvivian Should future PRs like this one be moved to toil-lib instead?
Ya, soon the common lib won't exist in toil-scripts, which will instead list toil-lib as a dependency. That should happen soon; I just have a couple of things blocking me that I'm trying to get resolved.
As a hobby, I've been dabbling with a composite approach to constructing a docker client invocation, also using the Builder pattern. Not quite happy with it yet, though.
Commits:
- Adding in docker pipes unit test
- Added docs to docker_call and fixed docker_call unit test
- Changed datasize for docker_call pipes unit test to 1GB
- Polished docker_call functionality
- Added stderr to file handle support in docker_call
- Factored out pipe_of_containers in docker_call
Force-pushed from aae9530 to 32ce17d
@hannes-ucsc Ready for review.
I pushed a preliminary version of my efforts towards an improved Docker wrapper to the new toil-lib project. Take a look at the doctests for pipe_to() for an example of how to pipe containers together without the use of a shell.
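The core idea behind shell-less piping can be sketched with plain subprocess plumbing. To be clear, this stand-in is not pipe_to()'s actual API (see toil-lib for that); it only shows the mechanism: connect each process's stdout directly to the next one's stdin via OS pipes, so no `'|'` string is ever handed to a shell.

```python
import subprocess

def pipeline(*cmds):
    # Chain commands stdout -> stdin without invoking a shell.
    procs = []
    prev_stdout = None
    for cmd in cmds:
        p = subprocess.Popen(cmd, stdin=prev_stdout,
                             stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy so SIGPIPE propagates if a reader exits early.
            prev_stdout.close()
        prev_stdout = p.stdout
        procs.append(p)
    out = procs[-1].communicate()[0]
    for p in procs:
        p.wait()
    return out
```

For docker, each `cmd` would be a `['docker', 'run', '-i', image, ...]` list; avoiding the shell sidesteps quoting problems and lets each container's exit status be checked individually.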
Adding support for pipes in the docker_call function of the 'programs.py' module. This is needed to accommodate more streamlined and efficient runs of piped bash commands that use docker containers.
This is done in one of two ways: either the entire pipe runs inside a single container ('pipe-in-single-container' mode), or each sub-command of the pipe runs in its own container ('pipe-of-containers' mode).
This change is backwards compatible with old usage of docker_call.