Worst practice: Duplicating code

Many developers are taught early on that copy-and-paste is a bad idea. Literally copying code from elsewhere in an application is bad because it creates a maintenance nightmare: Finding a bug or changing the functionality requires that you find all the copies and fix them all. Copies are also bad because they make a program needlessly larger.
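To make the maintenance problem concrete, here is a minimal sketch of the alternative to copying. The class `InvoiceTotals` and its method names are hypothetical and invented for this illustration. The summing loop could have been pasted into both `totalWithTax` and `totalWithDiscount`; instead it lives in one shared method, so a fix to the loop is made once.

```java
import java.util.List;

public class InvoiceTotals {

    // The shared logic lives in exactly one place. Had this loop been
    // pasted into both callers, any bug fix would be needed twice.
    static double sum(List<Double> amounts) {
        double total = 0.0;
        for (double amount : amounts) {
            total += amount;
        }
        return total;
    }

    // Hypothetical caller 1: adds tax to the shared sum.
    static double totalWithTax(List<Double> amounts, double taxRate) {
        return sum(amounts) * (1.0 + taxRate);
    }

    // Hypothetical caller 2: subtracts a flat discount from the shared sum.
    static double totalWithDiscount(List<Double> amounts, double discount) {
        return sum(amounts) - discount;
    }

    public static void main(String[] args) {
        List<Double> amounts = List.of(10.0, 20.0, 30.0);
        System.out.println(totalWithTax(amounts, 0.5));
        System.out.println(totalWithDiscount(amounts, 5.0));
    }
}
```

If the summing rule ever changes (say, to skip negative amounts), only `sum` needs editing, and both callers pick up the change.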

Many IDEs have “extract method” or “introduce method” refactoring functions that take existing code and turn it into a new Java method. If you create a method instead of copying and pasting, your code will be shorter, clearer, and cleaner, as well as easier to debug and maintain.

CPD, the copy-and-paste detector from the PMD Open Source Project, is a useful tool for finding where copy-and-paste has been applied. It uses a clever algorithm to find duplicated tokens, and by default it looks for a run of 100 or more tokens, most of which must be identical to be declared a copy. A token is an element such as a keyword, literal, operator, separator, or identifier.
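The idea of a "run of tokens" can be illustrated with a deliberately naive sketch. This is not CPD's algorithm or its API: the class `TokenRunSketch`, the regex tokenizer, and the threshold of 10 are all assumptions made for a tiny demo. The sketch also demands exact token matches, which is stricter than a rule where only most tokens must be identical. It splits two code fragments into tokens and uses a simple dynamic-programming pass to find the longest run of consecutive tokens they share.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRunSketch {

    // Simplified tokenizer (an assumption, not a real Java lexer):
    // identifiers/keywords, integer literals, or any single non-space char.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\S");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Length of the longest run of identical consecutive tokens shared by
    // both lists (longest-common-substring dynamic programming).
    static int longestSharedRun(List<String> a, List<String> b) {
        int best = 0;
        int[] prev = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] curr = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    // Extend the run that ended at the previous token pair.
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    public static void main(String[] args) {
        String first = "int total = 0; for (int x : xs) { total += x; }";
        String second = "long n = 1; for (int x : xs) { total += x; } return n;";
        int run = longestSharedRun(tokenize(first), tokenize(second));
        int minimumTokens = 10; // hypothetical threshold for this tiny demo
        System.out.println("Longest shared run: " + run + " tokens");
        System.out.println(run >= minimumTokens ? "Reported as duplicate" : "Below threshold");
    }
}
```

Comparing every pair of positions like this is quadratic and would be far too slow for a whole repository, which is one reason a production tool needs a cleverer algorithm.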

CPD is distributed as part of PMD, which is an extensible cross-language static code analyzer.

One of my open source GitHub repositories contains all the code examples from my Java Cookbook plus many other code samples. Unfortunately, some of the examples not used in the book do not get the regular maintenance they deserve.

(In my defense, sometimes a developer does copy a code example for legitimate reasons that wouldn’t apply when building a real application.)

While writing this article, I ran CPD against my repository, and it found several issues. Here are two of them.


The first one is interesting. It is obviously an editing error. In the vi editor, a number typed before an insert command causes the inserted text to be repeated that many times. However, a number followed by the letter G (for go) jumps to the line with that number.

My guess is that I typed a number to jump to a line, forgot the G, and typed a line to be inserted at that location, causing the line to be erroneously inserted many times. Strangely, this mistake has been in my public repository since 2003, and nobody has ever reported it to me.

The second issue identified an 18-line (184 tokens) duplication in the following files:


Both files contained the same program, which demonstrates the use of regular expressions to parse the common Apache log file format. It seems I somehow accidentally created the same file with two different names, perhaps while merging files into this repository from another.

Here I am, rightfully busted by a tool that I often recommend. I shall have to use CPD more often.