The information contents of programs

I was recently asked by an executive what computer language he should use to write a "million line" program. This is of course a very reasonable question. A million lines just means "large project", and the number of practical choices boils down to a short list: C#, Java, or perhaps Ada or C++. With so few choices, my opinion on this issue (which is C#) cannot be particularly valuable.

But the executive was a friend, a former programmer, and I decided to tease him a little bit first. So I asked him back:

"Do you really have a million line problem? You know very well that a million lines is 20 volumes of 1000 pages with 50 lines per page – like an Encyclopedia Britannica.”

He knew that something was up when I changed “program” to “problem”, so he was smiling when he answered:

“I am quite sure that the program will be of the order of a million lines. Other similar projects have established this beyond any doubt. But you are right about the problem, it is much smaller. We have a couple of hundred pages of notes from the team of Subject Matter Experts describing the problem, but those people either do not want to or are not able to understand the software issues, so these notes are not actionable except as guidance to the software team. And the worst thing is that they keep changing them.”

“So you agree that the parts of the problem that are not software issues are worth about 10,000 lines. Now my new question is, do you really face unique software issues that need 990,000 lines to express them?”

“Sure, I must. Systems of this sort are very complicated. The considerations are tricky, the senior people spend a lot of time arguing, modeling, learning about standards and interfaces, specifying the configurations. I’ve seen this many times and so have you.”

Now I had to clarify things because our discussion had become predictably confused:

“Wait. We are talking about three different things here. Let me label them:

A. the basis for the decisions – the experience, the learning, the standards, the contents of the “bag-of-tricks” that the engineers have, and the rejected solutions. But this seldom gets into the code, even though it is recommended to put some of this “process” information into comments. These cannot be the lines we talk about.

B. the result of programming, the source code. We agree that this is one million lines, but it cannot be wholly (or even 99%) ascribed to just software issues. We know from experience that one line in the problem causes the programmers to write many lines. The problem knowledge is thoroughly intermixed with the implementation. So this too is not what I was referring to.

C. the “expression of the software decisions” – the results of the implementation decisions that were made specifically for this project. I am thinking of parts of the classes, the choices of implementation types, the names of the algorithms, the implementation terminology, the caching rules, what part of the problem goes with which module, and so on.

You have heard of generative programming. That is what I am driving at. The “expression of software decisions” is the generator that would process the 10,000 lines and output the expected million lines. What I am really asking is, assuming that the program could be generated, how large do you think the generator would be?”
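To make the idea of a generator concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two-line "spec", the field types, and the decision to turn each domain fact into a small class. It is not how any real generator works; it only shows the shape of the argument, namely that the problem statement and the software decisions can be kept separate and combined mechanically into a longer program.

```python
# Hypothetical domain notes: one line per fact, roughly as a
# subject matter expert might write them down.
SPEC = """\
customer: name text, credit_limit number
invoice: amount number, due_date text
"""

# Hypothetical "software decisions": which implementation type
# each domain type maps to in this project.
TYPE_MAP = {"text": "str", "number": "float"}


def generate(spec: str) -> str:
    """Expand each spec line into the source of a small Python class."""
    out = []
    for line in spec.splitlines():
        if not line.strip():
            continue
        entity, fields = line.split(":")
        pairs = [field.split() for field in fields.split(",")]
        out.append(f"class {entity.strip().capitalize()}:")
        args = ", ".join(f"{name}: {TYPE_MAP[kind]}" for name, kind in pairs)
        out.append(f"    def __init__(self, {args}):")
        for name, kind in pairs:
            out.append(f"        self.{name} = {TYPE_MAP[kind]}({name})")
        out.append("")
    return "\n".join(out)


generated = generate(SPEC)
spec_lines = len(SPEC.splitlines())
generated_lines = len(generated.splitlines())
print(generated)
print(f"{spec_lines} spec lines became {generated_lines} generated lines")
```

Changing a line of the spec or an entry in the type map and rerunning the generator regenerates every affected class, which is the property the conversation is after.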

The former programmer turned executive was ready to show his knowledge:

“OK, I’ll play along. There is no way to know exactly. But each line of the generator potentially contributes to every line of the problem, so generation is roughly order N*M, and the rough inverse is order square root. To believe that the generator is 99% of the result size is to believe that the problem statement has absolutely no repetitive patterns of any kind. I do not believe that. Just on the basis of symmetry of “one line going into (million/10,000 = hundred) lines”, I would guesstimate a generator of 10,000 lines.”

“I agree. Of course we count only lines that specifically depend on your project. We count the expression of the decision to use some well-known implementation, but we do not count the code for the generic implementation itself. For example, hash table implementations are already part of the language runtime, and this trend to move more implementation code into the language will continue. This is what Sergey Dmitriev calls Language Oriented Programming.”

“You are telling me that instead of a million line program, I could have a 20,000 line problem: 10,000 from the specs, and 10,000 from the expression of the software decisions, combined by running the generator. That would be cool because my costs of maintenance and the costs due to bugs are all proportional to the number of lines. I could even keep up with all the changes. Who is doing this?”
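The numbers in the conversation so far can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 200 pages stand in for "a couple of hundred pages", and the 10,000-line generator is the executive's guess, not a measurement.

```python
# Figures quoted in the conversation.
LINES_PER_PAGE = 50
PAGES_PER_VOLUME = 1000
VOLUMES = 20

program_lines = VOLUMES * PAGES_PER_VOLUME * LINES_PER_PAGE

# Assumption: "a couple of hundred pages" of notes is taken as 200 pages.
SPEC_PAGES = 200
spec_lines = SPEC_PAGES * LINES_PER_PAGE

# How many program lines each line of the problem turns into.
expansion = program_lines // spec_lines

# The executive's guess for the generator, by symmetry with the spec.
generator_lines = 10_000

maintained_lines = spec_lines + generator_lines
reduction = program_lines / maintained_lines

print(f"program:    {program_lines:>9,} lines")
print(f"spec:       {spec_lines:>9,} lines")
print(f"expansion:  {expansion:>9,}x")
print(f"maintained: {maintained_lines:>9,} lines")
print(f"reduction:  {reduction:>9.0f}x")
```

If maintenance and bug costs really are proportional to line count, the last figure is the factor by which they would shrink under these assumptions.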

We were about to land, so I decided to tease him once more:

“The most sophisticated product in the universe uses generative techniques.”

“What???”

“Life. DNA. It is often called a blueprint but that is a bad metaphor. It is really a generator. Ever wonder how a few measly gigabits can describe a human being, and a few percent of that can describe a human brain? It is because the genome describes how a brain is grown, that is, how it is generated, instead of detailing what it has in it and how those things are connected, like you would have in a blueprint or in source code. It is a proof of your square root rule. A few dozen megabits in the genome can direct the growing of terabits worth of synapses in the brain. Lucky for us, because evolution would never have worked with a code that is order n squared bigger than what we have! Evolution would have needed a billion squared years to get the project done.”

“So your point is that generative programming will take only a billion years. But seriously, what technologies are missing to make this happen?”

“We must pay much more attention to the “domains” where the 20,000 lines we talked about come from. We need a very flexible and general database for representing the domain knowledge, and we need an editor to capture this knowledge that is acceptable for your subject matter team and also for the programmers. Unfortunately the experience with source code editors is not entirely applicable because the subject matter domain representation is much denser. The domain representation must also be easily processable by the generators.

There is a lot to do. Keep watching the Intentional Software web site and blog. Meanwhile good luck with your project!”