Rethinking the shell pipeline

I’ve been playing with an idea for extending the traditional Unix shell pipeline. Historically, shell pipelines have worked by passing free-form text around. This is very flexible and easy to work with: for instance, it’s easy to debug or incrementally construct such a pipeline, since the output of each individual step is easily readable.

However, a purely textual format is problematic in many cases, because every tool has to re-interpret the data before it can work on it. Even something as basic as numerically sorting on a column gets quite complicated.

There have been a few projects trying to generalize the shell pipeline to solve these issues by streaming objects through the pipeline, for instance the HotWire shell and Microsoft PowerShell. Although these are cool projects, I think they step too far away from the traditional interactive shell pipeline, getting closer to “real” programming, with stricter interfaces (rather than free-form ones) and little compatibility with existing Unix shell tools.

My approach is a kind of middle ground between free-form text and objects. Instead of passing free-form text through the pipeline, it uses typed data in the form of GLib GVariants. GVariant is a size-efficient binary data format with a powerful recursive type system and a reasonably pleasant textual form. Additionally, its type system is a superset of the DBus type system, which makes it easier to integrate DBus calls with the shell.
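
To make this concrete, here is a minimal sketch (not the actual dps code) of how one record could be built and printed in GVariant’s textual form using GLib; the handful of field names simply mirror part of the dps output shown below.

#include <glib.h>

int
main (void)
{
  GVariantBuilder builder;
  GVariant *record;
  gchar *text;

  /* Build an a{sv} dictionary: string keys, variant-wrapped values. */
  g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
  g_variant_builder_add (&builder, "{sv}", "pid", g_variant_new_uint32 (1));
  g_variant_builder_add (&builder, "{sv}", "user", g_variant_new_string ("root"));
  g_variant_builder_add (&builder, "{sv}", "rss", g_variant_new_uint64 (24408));

  /* Wrap the dictionary in an outer variant, as in the dps output. */
  record = g_variant_ref_sink (g_variant_new_variant (g_variant_builder_end (&builder)));

  /* Print the annotated textual form, e.g. <{'pid': <uint32 1>, ...}> */
  text = g_variant_print (record, TRUE);
  g_print ("%s\n", text);

  g_free (text);
  g_variant_unref (record);
  return 0;
}

This builds against glib-2.0 (for example with gcc sketch.c $(pkg-config --cflags --libs glib-2.0)) and prints one annotated variant, much like a single line of the dps output further down.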

Additionally, I created a format-negotiation system for pipes: for “normal” pipes or other kinds of output we emit textual data, one variant per line, but if the destination process declares that it supports it, we pass the data in raw binary form instead.
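
I won’t go into the negotiation mechanism itself here, so the sketch below only illustrates the two output modes it chooses between: the annotated one-variant-per-line text form versus GVariant’s native serialized bytes. The emit_record helper and the TTY check are just my illustration, not the prototype’s actual mechanism, and a real binary stream also needs per-record framing plus the type information.

#include <glib.h>
#include <stdio.h>
#include <unistd.h>

/* Write one record either as annotated text (one variant per line) or as
 * GVariant's native serialized bytes.  Deciding "use_binary" is what the
 * format negotiation is for; here we simply refuse binary when stdout is a
 * terminal.  The raw bytes alone are not self-describing, so a real stream
 * would also carry the type string and record boundaries. */
static void
emit_record (GVariant *record, gboolean use_binary)
{
  if (use_binary && !isatty (STDOUT_FILENO))
    {
      fwrite (g_variant_get_data (record), 1, g_variant_get_size (record), stdout);
    }
  else
    {
      gchar *text = g_variant_print (record, TRUE);
      g_print ("%s\n", text);
      g_free (text);
    }
}

int
main (void)
{
  GVariant *record = g_variant_ref_sink (g_variant_new_variant (g_variant_new_uint32 (42)));

  emit_record (record, TRUE);   /* still prints text when run on a terminal */
  g_variant_unref (record);
  return 0;
}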

Then I wrote some standard tools that work on this format, so you can sort, filter, limit, and display variant streams. For example, to get some sample data I wrote a “dps” tool, similar to ps, that gives typed output.

Running it prints something like:

$ dps
 <{'pid': <uint32 1>, 'ppid': <uint32 0>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'systemd'>, 'cmdline': <'/usr/lib/systemd/systemd'>, 'cmdvec': <['/usr/lib/systemd/systemd']>, 'state': <'S'>, 'utime': <uint64 38>, 'stime': <uint64 138>, 'cutime': <uint64 3867>, 'cstime': <uint64 1273>, 'time': <uint64 1344635046>, 'start': <uint64 1>, 'vsize': <uint64 61488>, 'rss': <uint64 24408>}>
 <{'pid': <uint32 2>, 'ppid': <uint32 0>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'kthreadd'>, 'cmdline': <'[kthreadd]'>, 'state': <'S'>, 'utime': <uint64 0>, 'stime': <uint64 1>, 'cutime': <uint64 0>, 'cstime': <uint64 0>, 'time': <uint64 1344635046>, 'start': <uint64 1>, 'vsize': <uint64 0>, 'rss': <uint64 0>}>
 ...

Not super-readable, but it’s a textual format that you can still combine with traditional tools like grep and awk.

But with the type information we can do more interesting things. For instance, we can filter using a numeric comparison, say finding all processes running under system uids:

$ dps | dfilter euid \< 1000
 <{'pid': <uint32 1>, 'ppid': <uint32 0>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'systemd'>, 'cmdline': <'/usr/lib/systemd/systemd'>, 'cmdvec': <['/usr/lib/systemd/systemd']>, 'state': <'S'>, 'utime': <uint64 38>, 'stime': <uint64 139>, 'cutime': <uint64 4290>, 'cstime': <uint64 1318>, 'time': <uint64 1344635266>, 'start': <uint64 1>, 'vsize': <uint64 61488>, 'rss': <uint64 24408>}>
 <{'pid': <uint32 2>, 'ppid': <uint32 0>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'kthreadd'>, 'cmdline': <'[kthreadd]'>, 'state': <'S'>, 'utime': <uint64 0>, 'stime': <uint64 1>, 'cutime': <uint64 0>, 'cstime': <uint64 0>, 'time': <uint64 1344635266>, 'start': <uint64 1>, 'vsize': <uint64 0>, 'rss': <uint64 0>}>
 ...
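
With typed records a filter never has to guess column positions; it just looks up a key and compares values. Here is a rough sketch of that idea for the textual one-variant-per-line mode, with the euid < 1000 test hard-coded (the real dfilter takes the field, operator, and value as arguments, as above).

#include <glib.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char line[65536];

  /* Read one textual variant per line; keep records whose euid is below 1000. */
  while (fgets (line, sizeof line, stdin) != NULL)
    {
      GError *error = NULL;
      GVariant *value, *record;
      guint32 euid;

      line[strcspn (line, "\n")] = '\0';
      if (line[0] == '\0')
        continue;

      value = g_variant_parse (NULL, line, NULL, NULL, &error);
      if (value == NULL)
        {
          g_printerr ("skipping unparsable line: %s\n", error->message);
          g_clear_error (&error);
          continue;
        }

      /* The outer <...> parses as a variant wrapper; unwrap it to reach the a{sv} dict. */
      if (g_variant_is_of_type (value, G_VARIANT_TYPE_VARIANT))
        record = g_variant_get_variant (value);
      else
        record = g_variant_ref (value);

      if (g_variant_lookup (record, "euid", "u", &euid) && euid < 1000)
        g_print ("%s\n", line);

      g_variant_unref (record);
      g_variant_unref (value);
    }

  return 0;
}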

Then we can add numerical sorting:

$ dps | dfilter euid \< 1000 | dsort rss
 <{'pid': <uint32 1>, 'ppid': <uint32 0>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'systemd'>, 'cmdline': <'/usr/lib/systemd/systemd'>, 'cmdvec': <['/usr/lib/systemd/systemd']>, 'state': <'S'>, 'utime': <uint64 38>, 'stime': <uint64 139>, 'cutime': <uint64 4290>, 'cstime': <uint64 1318>, 'time': <uint64 1344635365>, 'start': <uint64 1>, 'vsize': <uint64 61488>, 'rss': <uint64 24408>}>
 <{'pid': <uint32 769>, 'ppid': <uint32 745>, 'euid': <uint32 0>, 'egid': <uint32 0>, 'user': <'root'>, 'cmd': <'Xorg'>, 'cmdline': <'/usr/bin/Xorg :0 -background none -logverbose 7 -seat seat0 -nolisten tcp vt1'>, 'cmdvec': <['/usr/bin/Xorg', ':0', '-background', 'none', '-logverbose', '7', '-seat', 'seat0', '-nolisten', 'tcp', 'vt1']>, 'state': <'S'>, 'utime': <uint64 1602>, 'stime': <uint64 3145>, 'cutime': <uint64 22>, 'cstime': <uint64 9>, 'time': <uint64 1344634285>, 'start': <uint64 1081>, 'vsize': <uint64 108000>, 'rss': <uint64 16028>}>
 ...

And, finally, a nice display:

$ dps | dfilter euid \< 1000 | dsort rss | dhead 4 | dtable pid user rss vsize cmdline
 pid     user      rss    vsize  cmdline
   1   'root'    24408    61488 '/usr/lib/systemd/systemd'
 769   'root'    16028   108000 '/usr/bin/Xorg :0 -background none -logverbose 7 -seat seat0 -nolisten tcp vt1'
 608   'root'    15076   255312 '/usr/bin/python /usr/sbin/firewalld --nofork'
 747   'root'     8276   452604 '/usr/sbin/libvirtd'

Note how we do two type-sensitive operations (filtering by numerical comparison and numerical sorting) without problems, and that the “head” operation limits the output length without affecting the table header. We also filter on a column (euid) which is not displayed. And all the data that flows through the pipeline is in binary form (since every target supports it), so we don’t waste time re-parsing it at each step.
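
The display end works the same way: a dtable-like tool only has to look up the requested keys in each record and print whatever it finds, so it never needs to know a record’s full schema. The helper below is just an illustration of that lookup (parsing the records and printing the header row are left out), not the real dtable.

#include <glib.h>

/* Print one table row: look up each requested column in an a{sv} record and
 * print its textual form, tab-separated.  Missing keys are shown as "-". */
static void
print_row (GVariant *record, const gchar * const *columns)
{
  for (gsize i = 0; columns[i] != NULL; i++)
    {
      GVariant *field = g_variant_lookup_value (record, columns[i], NULL);
      gchar *text;

      /* Values in an a{sv} dict may come back still wrapped in a variant. */
      if (field != NULL && g_variant_is_of_type (field, G_VARIANT_TYPE_VARIANT))
        {
          GVariant *inner = g_variant_get_variant (field);
          g_variant_unref (field);
          field = inner;
        }

      text = field != NULL ? g_variant_print (field, FALSE) : g_strdup ("-");
      if (i > 0)
        g_print ("\t");
      g_print ("%s", text);

      g_free (text);
      if (field != NULL)
        g_variant_unref (field);
    }
  g_print ("\n");
}

int
main (void)
{
  const gchar * const columns[] = { "pid", "user", "rss", NULL };
  GVariantBuilder builder;
  GVariant *record;

  g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
  g_variant_builder_add (&builder, "{sv}", "pid", g_variant_new_uint32 (1));
  g_variant_builder_add (&builder, "{sv}", "user", g_variant_new_string ("root"));
  g_variant_builder_add (&builder, "{sv}", "rss", g_variant_new_uint64 (24408));
  record = g_variant_ref_sink (g_variant_builder_end (&builder));

  print_row (record, columns);   /* prints: 1  'root'  24408 (tab-separated) */
  g_variant_unref (record);
  return 0;
}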

Additionally, I think this is a pretty nice example of the Unix idea of “do one thing well”. Rather than having the “ps” app provide lots of ways to specify how the output should be sorted/limited/displayed, we have separate reusable apps for those parts. Of course, it’s a lot more typing than “ps aux”, which makes it less practical in real life.

I’ve got some code which, while quite rudimentary, does show that this could work. However, it needs a lot of fleshing out to be actually useful. I’m interested in what people think about this. Does it seem useful?

51 thoughts on “Rethinking the shell pipeline”

  1. Interesting. I wanted to do something like this myself, except using something like YAML as the output/interchange.

    YAML is implicitly (but strongly) typed and can represent data in a couple of different structures (I have not looked at GVariant, though).

    Bonus points for defining some sort of iteration syntax, or making a shell which shares the language (vs stringly-typed bash).

    Actually, I just want to rewrite all of the Linux userland; this is why I had to stop: I don’t know where to stop.

  2. The big problem with this is always the “isn’t ps aux easier” question. I think you realize that with your last comment.

    Currently shell users have to deal with complexity in the tools they use to manipulate the data (sed/awk/grep/echo/tail). What you propose may remove the complexity of those tools and replace it instead with a more complex pipeline.

    I’m not sure that’s better, and it’s also really hard to get off the ground, because if you can’t support all the utilities that I use from day to day, then I still have to learn all the previous tools to deal with the 1% you can’t deal with. Once I’ve learned those tools, I’m going to ask “why should I use your tools if the pipeline is longer, and I still have to use BASH?”

    IMHO (and I hate to rain on your parade), it always made more sense to me to try to replace BASH entirely. If you had a way to replace BASH with something like python that was a bit more interactive-friendly, then you might be able to make a really compelling case. Things you would want to do:
    1) Be able to recall the output of a previous command without rerunning it. Imagine running something like “ps” and getting a list of processes that you could then examine without having to constantly refine the pipe. Something like “ps; filter($LAST,pid<1000); grep($LAST,command=/*init*/);" Realize that's not what you want, so "grep($LAST[1],command=/*dbus*/);" etc…
    1.1) Have some way to convert that final accumulation of filters into a single useful line of code.
    2) No more dealing with BASH's odd escaping or quoting issues, because everything is just a python function.
    3) Ability to load libraries to deal with the proliferation of non-text files like Word or Excel documents.
    4) Perhaps a sane way to handle complex hierarchical file formats like XML, since PERL6 seems to be far, far away.

    I would be looking at things that have been successful in interactive data manipulation (like R) and trying to identify what made them successful and how that might be applied to build a modern shell which is not going to be constrained by all the limitations in memory and CPU that led to BASH in the first place. Unfortunately, I think trying to make the CLI tools simpler is actually the opposite of what is really desirable, because it doesn't address the limitations of BASH. What makes R so great is that the core language supports these very powerful filtering and splicing mechanisms (dps[dps$euid<1000,c("pid","user","rss","vsize","cmdline")] would cover everything in yours but the sorting and dhead), but you can't do that in BASH because it would go nuts and be completely unable to parse properly with any whitespace.

  3. Yes and no. It is easy to see the benefits of your system.

    However, what is preventing the existing system from doing these things? Why couldn’t we have a `ps` that outputs CSV? GNU `sort` can sort by an arbitrary column of CSV, a `dfilter`-like program would be simple, and cut+column would fill the role of dtable.

    That said, I can look at any one line of output in your format, and know what each “column” does. It reminds me of a similar proposal to use lisp instead of XML or plain text. I think it was on the one programming blog with a picture of a boat, but I can’t think of the name at the moment.

    I’d also be interested in your negotiation system, you seem to have glossed over it.

  4. Very interesting, and it definitely seems useful. A suggestion is to make the commands start with the name of the base command: for example, where you write “dps” and “dsort” I would personally prefer “psd” and “sortd”, so the commands alphabetize together, stem together, and can be read faster, IMHO. (The “d” may not be the best choice of letter because it typically means daemon… maybe some other suffix?) Good luck with your project!

  5. Brilliant!!! You just re-invented dynamic typing and data serialisation. This changes everything!!

  6. Like the idea. My coworker and I wrote jgrep.org as a similar start, but around JSON.

    I have built on this, and some of my other tools have PowerShell-like object passing over the shell pipes using JSON, which combines well with jgrep.

  7. “Mathematicians stand on each other’s shoulders while computer scientists stand on each other’s toes.”

  8. So the advantage is that commands output typed fields, so you can choose to use other commands to sort, filter, etc., with all the advantages that having types provides, i.e. it is easy to sort numerical data. The disadvantage seems to be the long commands.

    Maybe the use case for this is to make the internals of these commands more maintainable? As you say, instead of “ps” having many ways to format the output, it makes calls to the sort/limit/etc commands internally whilst still offering the control through the command line options. That could be a candidate for internal refactoring so you get cleaner code but still have the ability to get typed output as a user.

  9. JSON? Have you considered using JSON / e4x-ish type stuff instead of a gratuitous new serialization format?

    ps -json | json-grep foo=bar | json-table foo=Foo bar="Some Bar" baz=Other

    Foo Some Bar Other

    1 2 3

    4 5 6

    –Robert

  10. For the record, I vote for a format that doesn’t include explicit integer-size annotations – they’re almost certainly more harm than good in an interactive pipeline.

  11. “Even something as basic as numerical sorting on a column gets quite complicated.” What about “sort -k”?

    Please read the man-pages, buy yourself a good book about shell script programming, and stop wasting your time trying to reinvent the wheel.

    The traditional shell pipes are just perfect. We don’t want (nor do we need) anything different.

  12. Indeed, I think the unix philosophy is not something that unix itself has followed. Just look at any modern variant of the old unix commands and count the number of options and special operators they have, basically because of the limits imposed by human-readable text output. Adding type and structure to command output is really the way it should be; the pretty-printing should be left to specialized formatting commands, just as you have done.

  13. This looks interesting. There are several questions, though:

    the format: difficult to read. Difficult to type – for that reason alone it won’t be something people will want to use. Don’t forget, the shell is for typing. People will end up cursing you out for inflicting this on them. The guy who came up with the Web, Tim Berners-Lee, said he was sorry about wasting the huge amount of ink for printouts that include “http://” since it is superfluous. Type annotations should be completely eliminated except in situations of ambiguity, and actually – if things are ambiguous maybe they just should be eliminated. It should be possible to implement the literals in sh, Bash, and ZSH, at the very least. That’ll give you enough to specify the format in itself.

    dbus: Who cares about dbus? I understand you’re a desktop programmer and everything you do is in dbus. However, in the “hardcore” shell world, no one ever hears of DBus. Unless they want to interface with Gnome, and then only briefly, because using it from the command line is a pain. Is that really what people want? I do not know and I can’t answer. However, the idea of using C-ish things could be useful. However, let’s be honest with each other: the people using dbus don’t care about the shell for the work they use dbus in; and people using the shell for their work don’t care about using dbus. This is forced. I understand you have a hammer in your hand which you know and like, but not everything is a nail, and I’m sure you know all this, but I just feel the need to remind you.

    The format being passed around: ideally, the shell should be able to tell if one of the programs has to produce or some other has to receive objects (and don’t fool yourself, your GVariants are still objects). At the point of initial execution the shell should be able to check which applications can support this. If the pipeline cannot be set up intelligently, it should exit with “Broken pipe” or something like that (there are people better with shells than I, who could say more about this). You could do this by requiring every program that supports object i/o to only do it if it’s passed --object. What’s more, you should have the shell typecheck the object schemata or whatever we want to call it. In normal pipes, you use text and that’s all. There’s no way to have “bad text”. You can, however, have “bad objects”, or incompatible objects. For example, asking for fields that don’t exist. I am not sure what should be happening at this point. A part of me says “use exceptions and go freeform”, another says “have contracts for greater security and robustness”. Going freeform is often annoying in that it messes up in very unpredictable ways. However, if this should lift off, then there should be a way to extend the standard posix toolkit – by this I mean awk, bash, maybe sed – to report the types required by their programs. I am not sure if that is easy.

    Utilities: you want the standard utilities to start supporting some object format in the end. Having “dps” and “dbash” and “dawk” and “dsed” and “deverything” will annoy people to no end, so this should be transitional. “Do one thing well” wasn’t about input/output formats, man: it’s about having a single logical concept in the program, as opposed to many different ways in which a program can *process data*. It doesn’t talk about ways in which a program can *access data*. Look at the recent trend in NUL-separated lists (e.g. look at find, xargs).

    Input/output: why are we streaming around text? For actual usability, speed, and as little mangling as possible, you want to stream around shared memory addresses in hex. Then, if someone wants, he can print out what’s in those addresses with the use of some tool. Do you really want to pump so much text through our prototypical object-enabled awk? Think about it, you can have objects in programs that are self-referent. That’d never fly. And even if you disable this (how? you can’t *really* predict cycles!) then you will not be able to work with huge objects, which is exactly the point at which you actually want to start using objects rather than text. Furthermore: what if the data I want to expose isn’t actual data – what if it’s generative? What if I want to expose a search algorithm, which displays fields that can be accessed, but it only does so in a generative manner as they get accessed? What if Google had to pre-generate all of its search results before displaying the first page – that would never fly. Not everything is a static dictionary, which is what I understand your GVariants to be. Most things can, however, be made to display themselves in the form of a dynamic dictionary.

    Methods: static objects are fairly fucking boring. I would like to condemn them at this point and remind you that you yourself use a lot of different, dynamic abilities every day, and that not being able to call methods isn’t a large enough step. The shell is for programming just like any other language you use, so I don’t see why it should be archaically reduced to the capabilities of cobol.

    Other languages: not everything is C. Of course, Python, Java, Qt, JavaScript, what have you are just C. But the future seems to be bright for very different languages. Ask people who program in Erlang, Haskell, Clojure, Scheme, even Prolog, see what they think. Ask them how they would string their programs together, and ask what strengths their language could use to make this process of pipelining better and more powerful. I’m sure things like C, Python or JS can follow suit through sophisticated libraries. What I’m trying to say is, don’t use the least common denominator to get your ideas. This is important. Look at the best points of each language you can come up with, come up with an idea, and then the ones who don’t support it will work towards supporting it.

    Methods again: If you don’t allow methods you disable an important form of discovery, that is, you make sure programmers can’t learn what things do by doing them. If you have methods, you can ask an object to return its type, schema, or documentation. More and more languages that support this in their shells show how crucial this is and what a great way of facilitating development you would be giving up if you didn’t do it.

    Regarding standard utils: There’s definitely potential for the standards being extended. Example with made-up syntax below. I use a dot, it might already mean something in awk, I don’t know. This one prints processes of uid less than 1000 in the form of cmd, pid sorted by pid.

    ps | awk '/$0.uid < 1000/{yield {$0.cmd, $0.pid} }' | sort -n1 -F 'pid' | awk '{print}'

    Notice we use "yield" to stay in the hex-address representation world, then use print to print the actual content out.

    This one finds flash instances by searching for Firefoxes plugin-container program:

    ps | grep 'plugin-container' -F 'cmd' | awk '{print $0.pid}' | xargs kill -15

    This one finds flash instances too:

    ps | grep 'plugin-container' -rF | awk '{print $0.pid}' | xargs kill -15

    Maybe awk could even do something like $pid instead of $0.pid. Not sure that's necessary, though.

    Notice I haven't specified --object in either of those. awk or grep can report that it needs object input, so the shell can make the other programs follow suit.

  14. This looks great. I’ve been meaning to do something similar myself, but never got the time.

    A few related things I’d been thinking about:

    – What if applications could specify a stylesheet (e.g. XSLT) in the output? Then ps could suggest a suitable rendering and the shell could display it in table form by default (or a GUI could render it as a GTK table, etc). grep and sort would just pass the stylesheet hint through.

    – It would be super useful if “filename” was a separate type (not just “string”). The shell could render files and directories in the right colours, adjust paths if you cd, let you double-click to open them, etc.

    – If applications could specify the content-type of the output, we wouldn’t need to have binary data dumped to the terminal. The shell could splice binary output data to a temporary file, or select a suitable viewer application.

    – Would be nice if applications could output progress reports. e.g. find could say which file it was looking at, du how far it had got. The shell could display these temporarily if things were taking a long time.

    Basically, how extensible is the format?

    1. Thomas:

      I’d like to avoid any form of header describing the data, as that doesn’t work well with record-level filtering which seems to be a very common operation in shell. Filenames in dbus/gvariant are typically bytestrings to avoid non-utf8 problems, but they have no specific type other than that.

      The format is just GVariant, but the negotiations could be extended to support more formats or format details.

  15. For certain types of unix piping, I have found it useful to pipe from a tool to CSV, and then let sqlite process the data using SQL statements. SQL solves many of the sorting, filtering, and joining things that you can do with unix pipes too, but with a syntax that is broadly known. Especially the joining I have found hard to do well with shell/piping.

    I think a sqlite-aware shell would be awesome, especially if common tools had a common output format (like CSV with a header) that also included the schema / data format.

  16. @unix lover: “sort -k” is nice as long as you don’t end up with a first column containing strings which can contain spaces themselves… then you fall into the nightmare of having to format the output of the various commands to use another separator, and tell the various commands to parse their input with that separator too.

    There is definitely room for a better shell pipeline; I’m not sure this blog post’s proposition is the best solution, but I find it nice to see some discussion of the problem.

  17. Seriously after 3 years of PowerShell, it’s a really bad idea. It’s more complicated and slow. Making the output of the tools we have to use more parseable with cut/awk is a better option.

  18. I did some experimenting with the SQL approach a few years ago, you can see an example of my idea here:

    https://github.com/claes/osql/blob/master/example.py

    but without better integration with the shell it would not be convenient.

    However, I think it would be very convenient if at the shell prompt you could do something like

    select * from $processes where pid < 1000 sort by rss limit 4

  19. Have you considered “sniffing” the output stream and doing the right thing (i.e. applying a full dtable treatment) when the output is a terminal? (Quite similar to ls colouring or git’s EDITOR usage.)

    If you add back some of the “field selection” parameters to dps, then it means

    $ dps | dfilter euid \< 1000 | dsort rss | dhead 4 | dtable pid user rss vsize cmdline

    could be shortened to

    $ dps aux | dfilter euid \< 1000 | dsort rss | dhead 4

    Commands would then just do the right thing when outputting to a terminal, and have more flexible behaviour when outputting to another pipe.

  20. Hell yes.

    Ignore the nay-sayers. They’re idiots.

    We need something like this. PowerShell has proven it is a good idea and we should have done this eons ago.

    My recommendation is to make sure there is no way to accidentally guess wrong about what kind of pipe is in use. Maybe if the program is in /usr/dbin, assume the program is aware of the new pipe format. That way dsort becomes sort, etc.

    You might consider making the language compatible with PowerShell. Yes, that sucks for political reasons, but they have spent a LOT of time thinking out how to do it through trial and error. You won’t have to repeat their trials.

    Tom

  21. I left a comment on HN [1] so I’m not gonna repeat it here, but I’ve been working on exactly the same thing as a side project. We have very similar ideas (typed streams, a couple of basic tools that do one thing well) and I think this is an awesome way of exploring how the unix interface can be improved in the future.

    One thing I’m not sure about is the binary format and the “output format depending on pipe consumer” feature – in my eyes, a textual representation like json + json schema for the types is a more consumable and widespread structured data format; regarding the actual tool output, isn’t a single configurable tool like dtable enough?

    Also, by outputting plain json, one could leverage a number of available filtering / sorting json cli tools available [2] – even if untyped, they are still useful.

    [1] http://news.ycombinator.com/item?id=4370036
    [2] https://github.com/ddopson/underscore-cli

  22. That is a great idea! It’s easy either to create wrapper programs like dls for ls to output this data, or for the program authors to add a --dout switch or something to allow both backwards and forwards compatibility.

    It would be even better if multiple message formats could be used. The shell could then handle translating from one format to another, or automatically adding the --dout switch / calling an enabled stub program between pipes.

    There are many formats and many scripting languages out there. It’s only natural that the shell should have support for them to easily talk to one another. Ruby has YAML, JavaScript/CoffeeScript have JSON, Java has XML, Flash has AMF, and there’s also the interesting MsgPack, among others.

    The added benefit is that unix shell programs could then easily become extensible as web services 😀

  23. The idea of passing more structured data or objects in a pipeline comes up every so often. On the surface, it looks like a great idea, but it inevitably fails in the end because it limits flexibility and adds complexity. (I recently gave a presentation on the topic of simplicity, citing precisely this example: http://www.slideshare.net/jschauma/fancy-pants-13484018/50)

    In order to use the new pipeline, *all* your tools need to understand the new data model; otherwise you break the very important “filter” function of unix command-line tools. Imposing a requirement of structure on the tools that consume the data implies that you anticipate (correctly!) all possible uses of the data; anything your data model assumes about the use or hides from the user is thus written in stone.

    But common unix tools are so successful and useful precisely because they do not pretend to be able to anticipate all possible uses. The data is as flexible as possible, allowing any user to come up with any possible use case and manipulate it as they deem fit.

    The unix philosophy is not only “do one thing well”, it’s:

    ‘Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.’ (Douglas McIlroy, inventor of the pipe)

    The second and third sentences are equally important.

  24. @Jan Schaumann: the point here is to use *structured* text streams. I fail to see how this is a bad idea. Instead of having ls output data in 50 different ways depending on command line flags, just let it output a JSON stream with *everything* contained in it. Then you can slice and dice the data however you want, filter it, project (map) it, sort it, and convert it to whatever format you want, including plain machine-unreadable text.

    How is this restricting the possible use cases of anything? If in the future ls needs an additional field for something, it wouldn’t break existing tools. Even if the output format changes, it wouldn’t break anything; at most people would have to adapt their filter / sorting / etc. scripts. However, this would be equivalent to the actual current unix ls command changing its output, and I don’t see that happening – and if it did, go and adapt your awk and sed scripts that have to parse some arbitrary text.

    Current plain text output is the most “non-composable”, rigid way of doing things – sure we make do by painfully parsing columns and rows, but it can be much, much easier.

  25. Eh, half of the benefit of textual data representation is ease of reading and manipulation by human operators. If you want something more amenable to machine parsing and manipulation, it would be better to just use a database or at least some standard format such as JSON.

    If you intend to send large amounts (GBs) of data between programs, command line piping will be much too slow anyway. Better to use a well tested system of exchanging data, such as a database, or for real time exchange, perhaps a queriable server.

  26. I made LoseThos without pipes or streams. Yeah, I support pictures at the command-line. Those could be serialized. I have a neat feature where you can put call-backs for text — my help, command-line source editor, forms all can have text call backs and you can do text indirection for form letters.

    Viva la difference. I decided nothing in LoseThos is for printers. Documents are for the screen, with scrolling marquees and stuff.

  27. I think a variant of this has been thoroughly implemented in the Scheme shell, scsh. It should still be around. Their ambition was also to have a sane programming language for the shell. Take a look; I never got into it, but I remember it was well documented.

  28. Hi,
    I love bash, although I haven’t coded many scripts yet, and the output == return principle is awesome. To make your structured pipe more usable, I think it would be really helpful to have a convert program which does the reverse of dtable, so that you could import the output of the normal ps aux into the GVariant format. Even if this might not be too easy, I think it would unleash the power of this tool.

    x

  29. @philip:

    > @Jan Schaumann: the point here is to use *structured” text streams. I fail to
    > see how this is a bad idea. Instead of having ls output data in 50 different
    > ways depending on command line flags, just let it output a json stream with
    > *everything* contained in it. Then you can slice and dice the data however
    > you want, filter it, project (map) it, sort it, convert to whatever format you
    > want, including plain machine unreadable text.

    What about the human consumer? Many tools have as a primary (or one of the main) use case that of an actual human reading and understanding the results. I don’t think I’d like “ls” to spit out json, and I also don’t think I’d want to have to invoke a “turn json into human friendly format” filter (which would have to differ for about every command invoked).

    Secondly, sometimes you do not _want_ all possible output. In your example of ls, this means that it will always look up all additional attributes of a file. If your NIS server is down, it will hang because it can’t resolve UID->username (which you can work around with ‘-n’).

    > How is this restricting the possible use cases of anything? If in the future ls
    > needs an additional field for something, it wouldn’t break existing tools.

    No, but any possible tool I wish to write will have to be able to read your format (say, json). If the format passes objects, then the filter cannot continue processing until the entire object is received and the entire json input is interpreted. If my initial tool generates a few million lines of output, this isn’t going to make the filter very happy.

    > Even if the output format changes, it wouldn’t break anything, at most
    > people would have to adapt their filter / sorting / etc scripts.

    Off-loading this task to the people writing a filter is a pretty good example of what I mean. Suppose you have ‘dls’ and I wrote two filters ‘dsort’ and ‘dhead’. ‘dls’ adds new fields in the output, but ‘dsort’ and ‘dhead’ are not aware of these fields. They will simply ignore them. Now in order for me to use ‘dls’, I need to go and update these filters (and whatever other tools I might wish to use to post-process ‘ls’).

    Compare that to the current situation: even if somebody decided to change the output of ‘ls’, this would not require an update to any other binary to be able to post-process the output.

    > However, this would equal to the actual current unix ls command changing
    > its output and I don’t see that happening – and if it did, go and adapt your
    > awk and sed scripts that have to parse some arbitrary text.

    That is not quite the analogy. The correct analogy would be that if you update ‘ls’ to add/change output, you also would have to build a new ‘awk’ and a new ‘sed’.

    I’m quite convinced that the current stream interface is a good example of ‘Worse is Better’; yes, conceptually the idea of passing objects is appealing, and it may be more “correct”, but it’s not actually “better”. Simplicity always wins.

  30. Got this comment by mail from Alex Elsayed (blog issues):

    One thing I think is important and many of the commenters are missing is
    that this is more a ‘structured data’ shell pipeline as opposed to an ‘object’
    pipeline, and that is a VERY GOOD THING.

    Objects encapsulate what operations you can do on them as well as the data,
    whereas structured data simply adds context.

    To those commenting about how this ‘limits’ what can be done with the data,
    your comments are completely true and insightful… for object pipelines. As
    this is a structured data pipeline, they really don’t quite match reality.

    Note how the examples are all about running a filter that looks for values,
    rather than calling methods on what is passing through. This is not dependent
    on what the creator ‘anticipated’ – it’s more akin to having all data
    producers be set to ‘verbose’ (ps -eo $everything) and using a key/value
    output format (avoiding field termination issues and such, no more bugs due to
    running find -print instead of -print0) as opposed to calling the ‘getpid’
    method on ‘process’ objects.

    One thing I would suggest:

    Have a ‘dformat’ (or ‘dfmt’ for lazy people like me) command, that takes a
    normal shell commandline (like [ls, /tmp]) as its ARGV, looks up a formatter
    plugin corresponding to the executable in /usr/share/dpipeline or similar, and
    uses that to format the output. That removes the need to define large numbers
    of commands like dps and such, and avoids namespace pollution. At that point,
    all you need are dfmt and the filter/display commands. It also makes it easier
    to ship formatters with the applications they belong to, modulo friendly
    upstreams.

    I would recommend using a different data format, though. Using GVariant ties it
    to glib, and for something like this being tied to a specific library may not
    be the best idea. I’d second the suggestion to use JSON – perhaps have the
    first entry be a ‘schema’ that defines the fields and their datatypes, and then
    all subsequent entries are just the fields and their values.

    I’m not sure you would really gain all that much by using a binary
    representation anyway, mainly just because of use cases, but that would be
    something that requires empirical verification before I’d put money on either
    side.

  31. Yes, it does. Look useful, I mean. I’m not so sure about the serialization format (it looks rather awkward to write and parse), but /any kind/ of common serialization/structured data format would be a big step forward if we could just make all command-line tools understand it.

  32. More from Alex Elsayed:

    @Jan Schaumann, there are a few ways around the issues you bring up.

    First of all, regarding the human consumer, see my earlier comment about
    ‘dfmt’ and embedding a simple schema – using the same architecture, a ‘dunfmt’
    that uses a similar system (or even the same plugins) to reformat in a type-
    specific way would be trivial, and could even be plugged into a d* command
    automagically if the output is a TTY.

    With regard to not wanting all possible output, you have a point. However,
    nothing says that a tool is forbidden from only returning a subset of what it
    is capable of (possibly with -n and similar flags), or that the system couldn’t
    be extended so that bidirectional negotiation similar to GStreamer caps could
    be used to only read the data the user wants anyway (I seem to remember there
    being a trick where accept() on a pipe and some further dancing got you a
    bidirectional communications channel).

    This would, however, probably be better implemented in a dedicated shell
    mechanism of some sort rather than by abusing pipes.

    Third, yes, any tool you write will have to be able to parse it… Except that
    you could always use dtable or the like, and turn it into NULL-separated
    values or whatever. And JSON isn’t particularly difficult, either, since
    essentially every language has tools to parse it. Even Bash has such a library
    in the form of ‘JSHON’.

    In addition, most JSON parsers I’ve encountered support reading and writing
    “objects” (the JS term for hashes/dictionaries) simply concatenated with each
    other in addition to having them be encapsulated in an array. If you make each
    ‘line’ a single object (as would usually make sense for CLI programs like ps
    or ls) then they can in fact be parsed incrementally, and dealt with as they
    come rather than as a big dump all at once. An example would be Perl’s JSON
    module and the “incremental parsing” functions therein.

    Fourth, if the system does as I suggest in my third point and uses one entry
    per line, ‘dhead’ never needs to know what is inside – it just reads N objects
    and then terminates. Similarly, ‘dsort’ has no need to understand the object –
    it merely needs the ability to take some sort of identifier as an argument to
    tell it what field to sort by. If an entry has a value named ‘foofoobar’ that I
    want to sort by, I just run ‘dsort -k foofoobar’. It then looks up the
    ‘foofoobar’ key in each element and sorts by the result. No intrinsic
    knowledge ever needs to be included into dsort itself.

    Your example about alterations to the output of ‘ls’ not breaking clients is
    also factually untrue – there are distros which default to different ways of
    formatting times in ls. This breaks scripts that parse the output. With this,
    it can simply be stored as a Unix timestamp and formatted at the display end.
    That entire class of problem comes from mixing mechanism (provide a time) and
    policy (But how do we format it?). This system allows separating them.

    And the fact that dsort merely taking a key name, looking it up, and sorting
    by the result removes any intrinsic knowledge of the data structures means
    that it really *is* a good parallel with changing your awk script, and not
    rebuilding the awk binary.

  33. @alexl

    > nothing says that a tool is forbidden from only returning a subset of what it
    > is capable of (possibly with -n and similar flags), or that the system couldn’t
    > be extended so that bidirectional negotiation similar to GStreamer caps could
    > be used to only read the data the user wants anyway (I seem to remember
    > there being a trick where accept() on a pipe and some further dancing got
    > you a bidirectional communications channel).

    That sounds like adding significant complexity. Adding flags to select a subset of things to output then sounds like it would negate the benefits of allowing other tools to operate on the object output and do their filtering. At the very least, it duplicates capabilities.

    > This would, however, probably be better implemented in a dedicated shell
    > mechanism of some sort rather than by abusing pipes.

    I can see that. Of course that means that in order to take advantage of these new pipes, users have to start using a new shell (which for many people is a significant hurdle). Some people prefer one shell over another, have extensive customizations and scripts, functions etc. they rely on.

    Building a new shell for this object model will raise the barrier to entry significantly, I think.

    > Your example about alterations to the output of ‘ls’ not breaking clients is
    >also factually untrue – there are distros which default to different ways of
    > formatting times in ls. This breaks scripts that parse the output.

    Yes, but it does not break the tools invoked in the script. Changing output format willy nilly is a very bad idea, I very much agree on that. But even if my various scripts break, I do not have to update or rebuild sed(1), awk(1), grep(1) etc.

    Oh, talking about grep(1)… how would grep(1) work in this model? It seems to me that grep(1) needs to be able to parse regular plain text streams for its most common use case. But it’s also a filter, so it would need to understand the new object format, too and decide when to use which?

    Ie, “grep <file" needs to work — would the new shell implement "<" to read a file and turn each line into an object "{ line: text, linenum: N }" or something like that?

    How would regular expression matching work? Ie, in a text stream, I can create regexes that match multiple fields; if I'm operating on objects with unordered fields, how do I match "^[a-z123]+ .*(foo|bar).*" across fields?

    1. From Alex Elsayed:

      @Jan Schaumann

      (fyi, alexl is posting these comments for me because I can’t submit comments
      due to a weird 403 forbidden error; I’m “Alex Elsayed” or “eternaleye”)

      You wouldn’t *need* a new shell to make use of bidirectional communication –
      as I said, I seem to remember there being some sort of ‘accept() on a pipe’
      trick that basically turns it into a socket for all intents and purposes other
      than namespace. I just said that it would probably be *better* to do it that
      way, and have the shell set up bidirectional file descriptors alongside the
      standard streams vaguely similar to the TermKit design so that control is out-
      of-band from bulk data.

      Regarding grep, I think it would still work. For one thing, JSON ‘objects’ are
      iterable – you can walk the keys. Sure, they can contain other objects, but
      some sort of depth-first search isn’t out of the picture. By default, maybe
      have grep do a regex match on the values of every field that way; the user
      could use -k to restrict which fields to search, or even use –keys or
      similar to search keys instead of values. If all else fails, you can still
      reformat it into text anyway for a more traditional grep.

      That said, I do think that grep may be the wrong conceptual model for such a
      filter, though. I think something that behaves more like ‘test’/[[ ]] might fit
      better, since it’s checking values rather than parsing strings. That doesn’t
      even exclude regexes – IIRC, Bash supports regexes in [[ ]], and I know Zsh
      does.

      Therefore, perhaps some sort of ‘dvalfilter’ that either takes a comparator
      string (“/foo/bar/0/qux >= 72”) or if that interface is icky something more
      like -k /foo/bar/0/qux -ge -v 72. (here /foo/bar/0/qux is equivalent to the
      perl $obj->{foo}->{bar}->[0]->{qux})

      To match across multiple fields, just specify more than one using ‘and’ & ‘or’
      operations – think of it like passing -e to grep multiple times.

  34. I had an idea like this years ago that I never did anything with.

    Except, instead of having alternate versions of commands that output xml/json/gvariant/whatever, I was going to modify the existing commands, having them still output the same stuff to stdout as they do now, and output something like your format to a separate “stdmetaout” fd (which would also include information about which columns of the textual output correspond to which fields in the structured output. Maybe in some cases it wouldn’t fully duplicate the data, but would just say, eg “the PID field can be parsed out of columns 10-15 of stdout”).

    So as long as the data was being piped between metadata-aware processes, all the metadata would stick around, but once it got passed to a non-metadata-aware process (or reached the shell’s stdout), the metadata would be dropped and you’d be left with just the text.

    So, you could do stuff like

    ps aux | awk '$RSS > 1024' | sort -k START

    which would print all the processes using more than 1M, sorted by start time (and would also print the “ps” header at the top still, because the metadata would indicate that that was header information, so it shouldn’t be moved by “sort”).

    That is about the extent of how far I got in thinking about it…

  35. What would be great would be a wrapper for common unix utilities to convert from the standard formatting to object formatting. That way you could do something like

    $ dwrap ps

    and that would interpret the results of the standard ps call and convert it to an object notation. I’ve long wanted to make one of these that would convert to JSON, but never really got around to starting a project that immense.

  36. Perhaps, a better idea is to resurrect some functional shell. The idea is quite old, see e.g. http://stuff.mit.edu/afs/sipb.mit.edu/user/yandros/doc/es-usenix-winter93.html

    Nowadays, the best option is to use Haskell, or, better yet, a language based on it but with slightly different syntax adapted for shell needs. This way, we’ll have both strong typing and laziness. The latter will solve the problem of ‘not wanting all possible output’, and will also allow for loop fusion, eliminating unnecessary copying/serialization/deserialization.

    Something of that sort exists already: http://hackage.haskell.org/packages/archive/HSH/2.0.3/doc/html/HSH.html
