Accommodating pipes in programs.py #387
Conversation
@hannes-ucsc — Ready for review
Can you describe a use case for the pipe-in-single-container mode? It seems to me that the pipe-of-containers mode is more useful. If we can eliminate the former mode, I might have a suggestion on how to assemble the pipe without passing …

Note that @jvivian, @arkal and @jpdna had a very difficult time piping out of a container. They had to disable a particular logging configuration in Docker to make it work.
I believe, and I consulted with @jvivian on this, that running a pipe in a container would be more efficient if the pipe only required that one container's set of binaries to run. For instance, many steps of my vg pipeline are piped commands that contain only vg calls; it is simply faster to run the whole piped command through one vg container than it is to run one part of the pipe through a vg container, pipe its output into the input of another vg container run, and so on. Both options are supported by my addition to docker_call regardless: to use the pipe-of-containers functionality you'd just supply a list of docker container names in the 'tool' parameter, which is a little redundant if it's the same container every time. I don't know the exact context in which @jvivian, @arkal, or @jpdna had trouble piping out of a container, but I believe I've resolved that here, as I've tested the functionality on my vg pipeline and it works.
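To make the two modes concrete, here is a hypothetical sketch (not the PR's actual implementation) of how a list of piped sub-commands might map onto containers. The helper name `build_shell_pipeline` is invented for illustration; only the two-mode idea comes from the discussion above.

```python
# Hypothetical sketch of the two piping modes discussed in this PR.
# `commands` is one shell fragment per pipe segment; `tools` is either a
# single image name (pipe-in-single-container) or one image per segment
# (pipe-of-containers).

def build_shell_pipeline(commands, tools):
    if isinstance(tools, str):
        # pipe-in-single-container: run the whole pipeline in one container
        return "docker run %s /bin/sh -c '%s'" % (tools, ' | '.join(commands))
    # pipe-of-containers: each segment gets its own container; `-i` keeps
    # stdin open so data can flow between containers
    assert len(tools) == len(commands), 'need one image per pipe segment'
    return ' | '.join('docker run -i %s %s' % (t, c)
                      for t, c in zip(tools, commands))
```

With a repeated image name, the pipe-of-containers form degenerates into several single-purpose containers of the same image, which is the redundancy mentioned above.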
Fair enough. We need tests for both modes. The tests should pipe considerable amounts of data (tens of GiB), since that is where we saw the issues in the pipe-of-containers mode.
Does the machine that runs the unit tests support tens of GiB of memory? I'm guessing my unit test would need to generate a file of that size at run time.
No, but the very point of pipes is that their memory overhead is O(1), not O(n), with n being the number of bytes going through them. As we'll see, there is a problem with pipes coming out of a docker container, but we'll cross that bridge when we get there. You also don't need to pre-generate a file since that, again, contradicts the point of pipes. You can read arbitrary amounts of random data from /dev/urandom. Here's an idea:
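A minimal sketch of that idea, assuming a Unix system with gzip and gunzip on the PATH: stream data from /dev/urandom through a gzip | gunzip pipe and compare digests on both ends. Memory use stays constant no matter how many bytes flow through, so the same code works for tens of GiB.

```python
import hashlib
import subprocess
import threading

def roundtrip_digest(num_bytes, chunk=1 << 16):
    # gzip | gunzip connected by an OS pipe; neither side buffers the stream.
    gz = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE)
    gun = subprocess.Popen(['gunzip', '-c'], stdin=gz.stdout,
                           stdout=subprocess.PIPE)
    gz.stdout.close()  # let gunzip see EOF once gzip exits
    in_md5, out_md5 = hashlib.md5(), hashlib.md5()

    def feed():
        # Read random data chunk by chunk, hashing it on the way in.
        with open('/dev/urandom', 'rb') as src:
            left = num_bytes
            while left:
                block = src.read(min(chunk, left))
                in_md5.update(block)
                gz.stdin.write(block)
                left -= len(block)
        gz.stdin.close()

    # Feed in a thread so reading and writing overlap (avoids pipe deadlock).
    t = threading.Thread(target=feed)
    t.start()
    for block in iter(lambda: gun.stdout.read(chunk), b''):
        out_md5.update(block)
    t.join()
    gz.wait()
    gun.wait()
    return in_md5.hexdigest(), out_md5.hexdigest()
```

The feeder thread is what keeps the memory bound at O(1): no more than one chunk plus the kernel's pipe buffer is ever in flight.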
Force-pushed from 2026a1c to bb7ab62
@hannes-ucsc — Ready for review
    Running 'pipe-of-containers' mode for command
    'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
    command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
Is this right? A four-element array?
It would also be helpful if there were less duplication, so the difference stands out more.
The `-` isn't necessary for gzip and gunzip, I think. It is also inconsistent with the example above.
The `-` is necessary in order to make the piping between docker containers work. The four-element list is correct: one tool for each sub-command in the pipe. The main use case would be to run the whole pipe within that one container, but just as an example it shows how the pipe between containers works.
On Jul 28, 2016 6:38 PM, "Hannes Schmidt" [email protected] wrote:

> In src/toil_scripts/lib/programs.py (#387 (comment)):
>
>     :param bool mock: Whether to run in mock mode. If this variable is unset, its value will be determined by the environment variable.
>
>     Piping docker command examples:
>
>     Running 'pipe-in-single-container' mode for command
>     'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
>     command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
>     work_dir=curr_work_dir
>     docker_tools=['ubuntu']
>     stdout = docker_call(work_dir=docker_work_dir, parameters=command, tool=docker_tools, check_output=True)
>
>     Running 'pipe-of-containers' mode for command
>     'head -c 1M </dev/urandom | tee >(md5sum 1>&2) | gzip | gunzip | md5sum 1>&2':
>     command= ['head -c 1M </dev/urandom | tee >(md5sum 1>&2)', 'gzip -', 'gunzip -', 'md5sum - 1>&2']
>
> The `-` isn't necessary for gzip and gunzip, I think. It also is inconsistent with the example above.
I added and then deleted comments, so disregard the emails you may have gotten and instead look at GitHub directly. Instead of overloading …

It's important that we get the function signature right because it will be hard to change later.
Force-pushed from fb79203 to 887292f
@hannes-ucsc — Ready for review
@fnothaft — ready for review
@@ -15,7 +15,8 @@ def mock_mode():
     return True if int(os.environ.get('TOIL_SCRIPTS_MOCK_MODE', '0')) else False

-def docker_call(tool,
+def docker_call(tool='',
You should never use mutable types as default values for keyword arguments. It's also unnecessary: `tool` and `tools` should default to None, and since bool(None) is False, the truthiness tests below will still work.
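A quick illustration of the pitfall being pointed out here, with invented function names standing in for docker_call's parameters. The mutable default is created once, at function definition time, and is then shared across every call.

```python
# Anti-pattern: the default list is evaluated once and reused between calls.
def bad_call(parameters=[]):
    parameters.append('x')
    return parameters

# The usual fix: default to None. Since bool(None) is False, truthiness
# checks like `if parameters:` behave the same as for '' or [].
def good_call(parameters=None):
    parameters = [] if parameters is None else parameters
    parameters.append('x')
    return parameters
```

Calling `bad_call()` repeatedly grows the shared list, while `good_call()` gets a fresh list each time.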
     _log.debug("Calling docker with %s." % " ".join(base_docker_call + [tool] + parameters))
     docker_call = base_docker_call + [tool] + parameters
+    if bool(tools) == bool(tool):
Use require(). Search the code base or ask @jvivian about it.
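A sketch of that exclusivity check written with a require()-style helper. The real helper lives in the toil-scripts code base; the stand-in below assumes a `(condition, message)` signature, which may differ from the actual one.

```python
# Minimal stand-in for the code base's require() helper (signature assumed).
def require(expression, message):
    if not expression:
        raise ValueError(message)

def check_modes(tool=None, tools=None):
    # Exactly one of `tool` / `tools` must be supplied. The original code
    # raised when bool(tools) == bool(tool); expressed positively that is
    # an exclusive-or, hence `!=` here.
    require(bool(tool) != bool(tools),
            "Either 'tool' or 'tools' must be given, but not both.")
```

The require() form states the valid condition directly instead of hand-rolling an `if`/`raise` pair at every call site.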
    try:
        if outfile:
-           subprocess.check_call(docker_call, stdout=outfile)
+           subprocess.check_call(docker_call, stdout=outfile, shell=shell_flag)
        elif check_output and return_stderr:
Can be rewritten with nested `if`s to only examine each flag once.
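To illustrate the suggested restructuring, here is a hedged sketch (not the PR's code): each flag is tested once, with the `return_stderr` check nested under `check_output` instead of re-tested in a flat `elif` chain. The parameter names mirror the diff above; `shell` and the command string are placeholders.

```python
import subprocess

def run(docker_command, outfile=None, check_output=False,
        return_stderr=False, shell=False):
    # Nested-if dispatch: each flag is examined at most once.
    if outfile:
        return subprocess.check_call(docker_command, stdout=outfile,
                                     shell=shell)
    if check_output:
        if return_stderr:
            # fold stderr into the captured output
            return subprocess.check_output(docker_command, shell=shell,
                                           stderr=subprocess.STDOUT)
        return subprocess.check_output(docker_command, shell=shell)
    return subprocess.check_call(docker_command, shell=shell)
```

Compared with `elif check_output and return_stderr:`, the nesting makes the flag hierarchy explicit and avoids repeating `check_output` in later branches.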
Still -1. I thought we could make …

Whatever we do, the signature is going to be ugly, which is why I hesitate to merge this PR, even after the other issues are resolved. There is also the issue that @fnothaft is the merge master now. He will have to make that decision.
Force-pushed from 887292f to 7dc5122
Force-pushed from 7dc5122 to aae9530
@fnothaft — ready for review
I'm in -1 agreement with @hannes-ucsc, for several reasons, primarily:

RE: … This is true, but I don't think that overloading … is the right way to go.

This proposed approach is just clunky.
The problem with running commands contained in files passed into a mounted directory of a single container is that you can't run a pipeline command that uses multiple tools living in different Docker containers. A workaround would be to create a new Docker container that contains all of the necessary tools, but that wouldn't be granular or practical.
Do we currently have a use case for this? I thought the use case we currently had was running multiple vg commands through a pipe. I agree with your point, but I prefer not to solve problems before they exist.
Charlie, right now you need just one of the two modes, the one where the entire pipeline runs inside a single container, correct? Do you want to take a stab at implementing just that and see if it is less invasive? In the long run we need to clean up …
@hannes-ucsc Yeah, I can take a stab at it. It shouldn't be hard, since I'm basically just removing the pipe-of-containers functionality.
Any PRs targeting the common lib, like this one, are going to be complicated by DataBiosphere/toil#922.
@jvivian Should future PRs like this one be moved to toil-lib instead?
Ya, soon the common lib won't exist in toil-scripts, which will instead list toil-lib as a dependency. That should happen soon; I just have a couple of things blocking me that I'm trying to get resolved.
As a hobby, I've been dabbling with a composite approach to constructing a docker client invocation, also using the Builder pattern. Not quite happy with it yet, though.
Commits:
- Adding in docker pipes unit test
- Added docs to docker_call and fixed docker_call unit test
- Changed datasize for docker_call pipes unit test to 1GB
- Polished docker_call functionality
- Added stderr to file handle support in docker_call
- Factored out pipe_of_containers in docker_call
Force-pushed from aae9530 to 32ce17d
@hannes-ucsc Ready for review.
I pushed a preliminary version of my efforts towards an improved Docker wrapper to the new toil-lib project. Take a look at the doctests for pipe_to() for an example of how to pipe containers together without the use of a shell.
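The core idea behind shell-less piping can be sketched with plain subprocess plumbing. To be clear, this stand-in is not pipe_to()'s actual API (see toil-lib for that); it only shows the mechanism: connect each process's stdout directly to the next one's stdin via OS pipes, so no `'|'` string is ever handed to a shell.

```python
import subprocess

def pipeline(*cmds):
    # Chain commands stdout -> stdin without invoking a shell.
    procs = []
    prev_stdout = None
    for cmd in cmds:
        p = subprocess.Popen(cmd, stdin=prev_stdout,
                             stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy so SIGPIPE propagates if a reader exits early.
            prev_stdout.close()
        prev_stdout = p.stdout
        procs.append(p)
    out = procs[-1].communicate()[0]
    for p in procs:
        p.wait()
    return out
```

For docker, each `cmd` would be a `['docker', 'run', '-i', image, ...]` list; avoiding the shell sidesteps quoting problems and lets each container's exit status be checked individually.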
Adding support for pipes in the docker_call function of the 'programs.py' module. This is needed to accommodate more streamlined and efficient runs of piped bash commands that use docker containers.
This is done in one of two ways: either the entire pipe runs inside a single container ('pipe-in-single-container' mode), or each sub-command of the pipe runs in its own container ('pipe-of-containers' mode).
This change is backwards compatible with old usage of docker_call.