During a code review, it’s not uncommon for a reviewer to comment, “it looks like there’s some duplication here; can you please DRY this code up?”
We often assume it’s clear what this comment means. But it’s worth examining what the reviewer is asking for, and why it’s important.
DRY stands for “Don’t Repeat Yourself.” It is often colloquially understood to mean “don’t duplicate code,” but it’s intended to mean much more than that. According to the programmers who coined the term, “DRY says that every piece of system knowledge should have one authoritative, unambiguous representation.”
Another name for this concept is “the SPOT rule:” Single Point Of Truth.
This means a lot more than just “avoid copying and pasting code.” It means that each idea, each rule, each concept, each piece of documentation, each constant, and each behavior the code performs should be represented in your system exactly once. Avoiding duplicated code is clearly part of this. Simply deduplicating similar code is a step in the right direction, but that step alone, performed without thought or consideration, doesn’t automatically make your code healthier.
This same theme shows up often in discussions of refactoring code. A primary goal of that process is that every declaration behavior should occur “Once And Only Once.” Restructuring code to follow this rule typically leads to simpler software that’s easier to maintain.
Speaking of maintenance…
Duplicate declarations of behavior, constants, and other knowledge create room for inconsistency and therefore room for bugs as the system is modified.
When declarations are duplicated like this, it’s easy for developers — especially those who aren’t very familiar with the code — to miss a duplicate declaration. As requirements change and bugs are fixed, developers might not make the required changes in all the places that the behavior is duplicated. These logical inconsistencies lead to bugs and inconsistent behavior which can be difficult and time-consuming to track down.
Software containing duplicate declarations of some behavior, especially if they’ve evolved to contradict each other, is more complex than software that follows the DRY principle. This complexity makes it increasingly difficult to make changes to the software, and even simple maintenance tasks take longer as this complexity gets worse.
Software that has a single point of truth for a given property is much easier to test. It’s easier to check that some invariant holds when that invariant is specified in only one place. Declaring some behavior in multiple places means you have to have to test it in multiple places — or, worse, your test might only cover one of the two places that behavior is declared, giving you a false sense of security as inconsistency creeps in over time.
The word “orthogonality” is sometimes used when discussing the DRY principle:
In a purely orthogonal design, operations do not have side effects; each action (whether it’s an API call, a macro invocation, or a language operation) changes just one thing without affecting others. There is one and only one way to change each property of whatever system you are controlling.
This is a desirable property that’s only found in systems which don’t duplicate behavior. Having exactly one point where you can change some piece of knowledge in a system makes maintenance easier, and avoiding side effects also helps make the components of your software more testable in isolation.
So, next time you’re asked to clean up some duplicated code in a pull request, consider what’s really being asked of you. First, identify the action, knowledge, policy, or algorithm that’s being duplicated.
It might not jump out at you immediately when looking at the code, but it’s important that you identify the actual idea or concept that’s being duplicated. Deduplicating lines of code without careful thought can lead to software that’s more confusing than before — it’s important that you extract a concept or idea, not just the code that happens to implement it right now.
(This might mean you’ll need to rewrite some of that code, to more directly represent the underlying concept. That’s a good thing!)
The resulting helper function/helper method should do one thing. This accomplishes all of our goals: it provides a single, unambiguous place in the software for this knowledge to live, allows easier testing of that behavior, and helps developers make consistent and correct changes in the future.
When there’s a subtle behavioral difference between two pieces of mostly-duplicated code, it is tempting to extract a helper function that implements both behaviors and allows the caller to select one by passing an argument to the new function. Don’t do this. This couples these two behaviors together, and coupling between unrelated pieces of knowledge in software has several disadvantages.
Functions that do different things based on a parameter are more difficult to understand, reuse, and test. It’s important both for testability and reusability that a function does one well-defined thing, but parameters that change a function’s behavior indicate that the function does more than one thing.
Functions that do more than one thing are more difficult to use correctly. They require you to carefully read the documentation, and maybe the underlying code, to ensure you’re passing arguments that will yield the behavior you want.
They also make maintenance and refactoring harder. When a function’s behavior changes based on its arguments, refactoring and other maintenance work will often require looking at all the call sites to understand exactly how each caller is using the function. This is tedious and time-consuming work.
Finally, once you have a function like this it’s tempting to keep adding little functionality changes that are controlled by parameters. This leads to functions which are unwieldy and impossible to understand. There are stories out there of methods that take 300 parameters — this is a slippery slope, and it’s best to avoid it entirely.
Instead, extract multiple functions that each do one thing. Each call site can then use one or both functions as needed, it’s still clear what each of the new functions does, and both functions are reusable and testable.
Picking names and types for the new function you’ve extracted is important. The new function should be self-documenting: its name and return type, and the names and types of its parameters, should be all that another developer needs to know to use the helper function correctly.
The types you choose should express meaning. Rather than returning an opaque value like a string or bitmask, try to return something like a URL or domain-specific sum type (also known simply as an enum).
I drew on these resources for this article, and if you have time I recommend you read each of them:
- Software Engineering Stack Exchange: Is it wrong to use a boolean parameter to determine behavior?
- Wikipedia: Cohesion
- Coding Horror: Curly’s Law: Do One Thing
- Orthogonality and the DRY Principle
- Once And Only Once
- Compactness and Orthogonality
- Don’t Repeat Yourself
- The Extract Method Refactoring
As always, I welcome discussion and feedback; I’m @cdzombak on Twitter.