Reformatting

A discussion on d-d-l today reminded me of a thought I had a few weeks ago. At the moment, source code control works on lines, as though we were still using punched cards. This means that you can’t reformat your source code (say, to use a different number of spaces for indentation) without every reformatted line showing up as a change in diffs and blame.

What I would like is a wrapper around checkout and commit scripts which understood how to tokenise various programming languages. Then when you checked something out, it would look in a configuration file to see how you prefer your source code formatted, and do all the indentation as necessary. When you checked it back in, it would undo that and store one token per line. That way, there would never be an argument about indentation again, and more importantly, blame would show who last modified each token, not each line. It’s not as though lines had any major significance in many programming languages.
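
As a rough illustration of the wrapper described above, here is a minimal Python sketch. It assumes a tiny C-like language; the names `TOKEN_RE`, `to_canonical` and `to_preferred` are made up for this example, and the tokeniser ignores comments and preprocessor lines, which a real tool would have to handle.

```python
import re

# Hypothetical tokeniser for a tiny C-like language: string literals,
# words/numbers, a few two-character operators, then single punctuation.
TOKEN_RE = re.compile(
    r'"(?:\\.|[^"\\\n])*"|\w+|==|!=|<=|>=|&&|\|\||->|\+\+|--|[^\s\w]'
)

def to_canonical(source: str) -> str:
    """Commit side: store one token per line, discarding all layout."""
    return "\n".join(TOKEN_RE.findall(source)) + "\n"

def to_preferred(canonical: str, indent: str = "    ") -> str:
    """Checkout side: lay the tokens out again with the user's indent."""
    lines, current, depth = [], [], 0

    def flush():
        # Emit the tokens collected so far as one indented line.
        if current:
            lines.append(indent * depth + " ".join(current))
            current.clear()

    for tok in canonical.splitlines():
        if tok == "{":
            current.append(tok)
            flush()
            depth += 1
        elif tok == "}":
            flush()
            depth = max(depth - 1, 0)
            lines.append(indent * depth + tok)
        elif tok == ";":
            current.append(tok)
            flush()
        else:
            current.append(tok)
    flush()
    return "\n".join(lines) + "\n"
```

Two differently formatted copies of the same code produce the same stored form, and `to_preferred` can lay that stored form out with any indent string. The spacing it produces is crude (every token is separated by a single space); a real formatter would need to know the language properly.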

I wonder whether I could persuade the bzr people to consider this.

Published by

Thomas Thurman

Mostly themes, triaging, and patch review.

16 thoughts on “Reformatting”

Joe Buck says:

August 18, 2008 at 6:50 pm

Since the VCS is storing (directly or indirectly) every version of the file, it isn’t really true that they are line-oriented (except in the sense that some use lines as a compaction mechanism).

What you seem to be after is a way to distinguish changes that only affect formatting from other changes, to get an accurate “blame”. But that can be done without modifying anything in the VCS other than the blame command or equivalent.
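
A formatting-insensitive blame of this kind could be sketched roughly as follows. This is only an illustration, not how any existing VCS implements blame: `token_blame` and its crude tokeniser are hypothetical, and the history is passed in as a plain list of `(author, text)` pairs rather than read from a repository.

```python
import difflib
import re

# Crude tokeniser (an assumption for this sketch): words and single
# punctuation characters; whitespace never becomes a token.
TOKEN_RE = re.compile(r"\w+|[^\s\w]")

def tokens(text):
    return TOKEN_RE.findall(text)

def token_blame(revisions):
    """revisions: list of (author, source_text), oldest first.

    Returns a list of (token, author) pairs for the newest revision,
    where author is whoever last changed that token.
    """
    blamed = []
    for author, text in revisions:
        new = tokens(text)
        old = [tok for tok, _ in blamed]
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        result = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep their earlier attribution.
                result.extend(blamed[i1:i2])
            else:
                # Inserted or replaced tokens belong to this revision.
                result.extend((tok, author) for tok in new[j1:j2])
        blamed = result
    return blamed
```

Because whitespace never reaches the token list, a revision that only re-indents or re-wraps the file leaves every attribution untouched.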
Jonathan Pryor says:

August 18, 2008 at 6:51 pm

You would need to change not just checkout and commit but also diff as well, otherwise `bzr diff`/`git-diff`/etc. would be ~unusable.

Plus any other places that produce/consume diff output.

And what about mailing patches to a mailing list for review?

And what about tools support? Eclipse, etc. have their own diff viewers, and may need some sort of support for this.

Which isn’t to say that this isn’t a good idea (it very well may be), but implementing it will require a lot of work among varied groups of people, which may hinder progress and uptake. Plus, sensible answers are needed for all of the above questions (and more).
Eduardo Padoan says:

August 18, 2008 at 6:58 pm

I wouldn’t go so far as token-level, but some diff tools (like google’s diff-match-patch) work at character-level, which would be cool to have on a vcs.
Hans Petter Jansson says:

August 18, 2008 at 7:00 pm

I’ve always wanted to be able to store parse trees (with a few extensions for formatting, comments and unparseable blocks), and let the editor take care of presentation. With the parse tree always available behind the scenes, you can do a lot of nifty stuff. Smart revision control, cool visualization and intelligent search/replace are but tips of the proverbial iceberg, especially with a programmable editor.
Name says:

August 18, 2008 at 7:17 pm

This is not likely to happen for any mainstream VCS. Currently all a VCS needs to do is store a blob of bytes, without regard to what’s in it. If you wanted to store a parse tree, the VCS would need to have custom storage formats for each language.

Second problem is that parse trees change. For example, Python’s syntax has changed a great deal over its lifetime (new keywords, new blocks, etc). You’d have to update your VCS every time a new version of the language was released. And Python’s a simple example — can you imagine writing a parser for C++ or Perl?

Lastly, pre-processors complicate matters because now there are multiple languages mixed into one file. How do you store a parse tree of a C/C++ file with heavy macro use? Even basic GObject declarations would, I suspect, be very difficult to parse.

If you want to change indentation width, just use tabs. Then everybody can have their own width without the source code being mangled.
Hans Petter Jansson says:

August 18, 2008 at 7:48 pm

@Name: Sure, and those are some of the reasons it hasn’t happened yet. Still, it’s just hard – not impossible. The VCS could store and diff arbitrary trees of string tokens instead of the current list of lines, so you wouldn’t have to update the VCS with the programming language. Preprocessing is a bigger problem – you’d need to write a combined preprocessor/parser.

Oh, and indenting with tabs really means “indenting with tabs and spaces” since you often need extra alignment for multi-line statements. When different people’s editors display tabs at different widths, those statements become misaligned. I don’t really care that much about how other people indent, but since you brought it up – that’s a decent argument for avoiding tabs.
Name says:

August 18, 2008 at 8:07 pm

I think there is a difference between indentation and alignment. Tabs are for indentation. Avoiding tabs because they can be used for the wrong purpose just makes life more difficult for whoever has to read your code.
Hans Petter Jansson says:

August 18, 2008 at 8:20 pm

@Name: Right, but all editors I’ve used do the wrong thing in indent-with-tabs mode, because they don’t differentiate indentation and alignment – and users aren’t likely to notice, because tabs and spaces tend to look the same.
Henrique says:

August 18, 2008 at 9:06 pm

Or make diffs include modelines, like Vim does… :P
Thomas Thurman says:

August 18, 2008 at 10:10 pm

Henrique: It’s a partial solution, but only a partial solution.
Jerome Haltom says:

August 19, 2008 at 2:12 am

That’s not the proper place for it. The proper place for it is in your development environment. IDE or whatever. There are two settings that are relevant here. One is the ‘native style’ of the code: what should be committed. The other is your personal style, what you should see. When the IDE opens a file, it should display your personal style. When it saves a file, it should write the native. That’s it. Simple.
Thomas Thurman says:

August 19, 2008 at 2:20 am

Jerome: That really isn’t going to work without some level of support from version control. Suppose one of the things that differs is the amount of whitespace; what if my personal style said “four spaces”, the native style said “eight spaces”, and I actually put in five? What would happen then?

And besides, Occam is coming for you: why do we need a “native style” at all in the world you suggest?

That said, do you actually know any IDEs which can behave this way?
Matthew W. S. Bell says:

August 19, 2008 at 3:16 am

Source Code in Database; roughly.
Thomas Thurman says:

August 19, 2008 at 3:22 am

Indeed.
Steven Walter says:

August 19, 2008 at 6:20 pm

You could achieve something similar to this using git’s “smudge” and “clean” filters. The clean filter gets run during commit, and it outputs the “canonical” source-code representation (which could be whatever you wanted). Then, during checkout, git runs the smudge filter, which would format the code according to your preferences.
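
As a hedged sketch of what such a filter pair might look like, here is a small Python script that converts between tab indentation (stored) and a personal number of spaces (working tree). The script name `reindent.py`, the filter name `reindent` and the `INDENT_WIDTH` environment variable are all invented for this example. Note that git applies the clean filter when a file is staged, and the smudge filter when it is checked out.

```python
#!/usr/bin/env python3
"""Hypothetical filter script, e.g. saved as reindent.py.

Assumed wiring (the name "reindent" and the path are illustrative):
    git config filter.reindent.clean  "python3 reindent.py clean"
    git config filter.reindent.smudge "python3 reindent.py smudge"
    echo '*.c filter=reindent' >> .gitattributes
"""
import os
import sys

def clean(text: str, width: int) -> str:
    """Working tree -> repository: every `width` leading spaces become a tab."""
    out = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" ")
        levels, rest = divmod(len(line) - len(body), width)
        out.append("\t" * levels + " " * rest + body)
    return "".join(out)

def smudge(text: str, width: int) -> str:
    """Repository -> working tree: every leading tab becomes `width` spaces."""
    out = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip("\t")
        out.append(" " * (width * (len(line) - len(body))) + body)
    return "".join(out)

if __name__ == "__main__" and len(sys.argv) > 1:
    # INDENT_WIDTH is an assumed environment variable, not a git setting.
    width = int(os.environ.get("INDENT_WIDTH", "4"))
    action = clean if sys.argv[1] == "clean" else smudge
    sys.stdout.write(action(sys.stdin.read(), width))
```

This only handles leading indentation; a token-level canonical form like the one proposed in the post would need a language-aware tokeniser in place of `clean` and a formatter in place of `smudge`.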
Thomas Thurman says:

August 19, 2008 at 6:30 pm

I like that idea very much.
