Mere Code

Diverse Topics of General Interest to the Practicing Programmer

Big or small?

I’ve been thinking a bit about whether it is better to have one big code base that has a lot of different components and features, or whether there should be many small code bases that each do one thing well.

I don’t have any answers, but perhaps these half-formed thoughts will help: positive, negative and interesting things about having many small related projects. These thoughts are mostly inspired by working on a bunch of different testing-related projects.

Positive

  • “Do one thing and do it well”
  • Enforces a certain kind of interface discipline
  • Avoids/postpones scaling problems with big projects
    • test suite run times
    • documentation navigation
    • bug triage
    • forking mailing lists etc.
  • Newcomers only need to “buy in” to one idea at a time
  • Aligns with conceptual understanding of the problem
  • Better separation of commit privs, etc.

Negative

  • Release overhead
  • Duplication of infrastructure
    • buildbot / hudson / pqm
    • bug tracker
    • mailing list
  • Duplication of license / copyright games
  • Harder for newcomers to see the big picture
  • Problems caused by interactions between different versions
  • Depending on multiple libraries is a pain on many platforms
  • Lag with commit privs, etc.

Interesting

  • Perhaps smaller & self-contained means easier to upstream
  • Some projects prefer adding only small dependencies; others prefer adding only a few dependencies

Do these make sense? What would you add?

Twisted is in a sense the opposite of the small/many paradigm, in that it includes a great many extra features along with its core.

Comments

jml on 2011-01-25 21:29
Good answers both! I'm sorry I haven't responded sooner.

Glyph, I wonder how things would be different if the cost of duplicating infrastructure was substantially lower.

glyph on 2010-11-30 21:52
I don't think that you can actually model the distinction between different types of projects as "small" vs. "big". Granted, it's a popular dichotomy, but that doesn't mean it's not a false one :).

For example: for many projects, "smaller" is better. But Twisted's attempt to go from "big" to "small", as spiv notes, was a disaster (although not an unqualified one). The further we pushed it, the more duplication of process was created. If we had continued to really break out all the subprojects fully, I suspect that most of them would be completely moribund now. For example: if a change to core which broke conch didn't alert us immediately to that fact, it would be years before anybody got around to fixing it, probably past the point where there was any hope of fixing it at all. Does this really have anything to do with big vs. small? Not really; it's more a question of shared vs. duplicated build infrastructure.
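
(To make the shared-vs-duplicated point concrete, here is a minimal sketch, with made-up subproject names, of the check that shared infrastructure gives you for free: every change to core runs every subproject's tests, so a break in conch surfaces immediately. It assumes Twisted's trial runner is on PATH, and is illustrative rather than our actual buildbot config.)

```python
# Minimal sketch of shared build infrastructure: on each change to core,
# run every subproject's test suite so breakage surfaces immediately.
# Subproject names are illustrative; assumes the "trial" runner is on PATH.
import subprocess
import sys

SUBPROJECTS = ["twisted.conch", "twisted.web", "twisted.mail"]

def tests_pass(package):
    """Run one subproject's tests under trial; True if they all pass."""
    return subprocess.call(["trial", package]) == 0

broken = [pkg for pkg in SUBPROJECTS if not tests_pass(pkg)]
if broken:
    sys.exit("core change broke: " + ", ".join(broken))
print("all subprojects still pass")
```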

Another way to look at this problem is: projects aren't small or big, it's that they fit into some conceptual model and possibly also provide one themselves. The real question of how to organize things is more about what those relationships are than about the absolute size of the project.

For example: "small tools that do one thing well" is frequently extolled as a key part of the UNIX philosophy. Yet, many of those "small" tools have intimate dependencies on a gigantic pile of infrastructure that the huge lower-level projects, such as the kernel, X.org, glibc, et. al. provide.

The best projects are the ones which provide a firm core and strong conventions upon which a wide variety of tools can be built, tools that follow those conventions and integrate seamlessly according to them. This is what Twisted aims to be; its 'bigness' is mostly a result of the fact that 'smallness' requires a per-project duplication of effort which, with a small development team, we have empirically demonstrated we can't afford.

spiv on 2010-11-24 22:14
Perhaps a related question is how easy it is to take many small code bases and bundle them as one later, versus separating one big code base into many small ones? I think small has a slight advantage over big here: in general the pain I've observed in e.g. building bzr's Windows installer or various core+plugins PPAs has probably been a little less than Twisted's pain in producing releases of individual subpackages. But I'm not sure, and perhaps the tradeoffs vary quite a lot between projects.

I'm frankly deeply frightened by “scaling problems with big projects” though. To focus on just one aspect of that, I'm increasingly feeling that increased complexity brings massive, perhaps exponential, increase in cost. Just compare how easy it is to do a quick hack and feel that it is good enough in a small project versus a large one. For a mini case study, look at ControlDir.sprout in current bzrlib: over time we've added features like separate metadirs, shared repositories, stacking policies, hardlinking, reusing transport objects, subtrees, optimisations to the way we do a fetch into a newly sprouted repository… and each of those things has taken a toll on this one function. All nice things to have, but now it alarmingly difficult to make further improvements to this function because complexity is fragile — happily I mostly trust the test suite here, which helps, but only the full test suite. And of course I *do* want to improve that function…

In the case of ControlDir.sprout I'm hopeful that some refactorings may ease the pain a little, but I think truly radical surgery is required if it is going to be anything other than hideous.
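
(An aside on why that accretion is so costly: the burden grows with the combinations of options, not their count. A toy sketch, with made-up flag names loosely echoing the features above:)

```python
# Toy illustration of why piling options onto one function is fragile:
# the paths to exercise grow with the *combinations* of flags, not their
# count. Flag names are made up, loosely echoing the features above.
from itertools import product

FLAGS = ["force_new_repo", "shared_repo", "stacked",
         "hardlink", "reuse_transport", "subtrees"]

# Each boolean flag doubles the possible behaviours, so exhaustively
# exercising a sprout-like function means 2 ** len(FLAGS) paths.
combinations = list(product([False, True], repeat=len(FLAGS)))
print("%d flags -> %d paths to reason about" % (len(FLAGS), len(combinations)))
# prints: 6 flags -> 64 paths to reason about
```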

So I guess I'd lean towards many small over one big, but try to find ways to reduce the duplication of release effort and project infrastructure.