CamelCase vs underscores: Scientific showdown

In the odd case that you are an experienced programmer who doesn’t have a preference for either camel case or underscores in identifiers, try making up your mind now. Try to choose independently of (language) convention, habit, or the type of the identifiers. If you are a Lisper and like dashes, just vote for your next favorite.

if ( thisLooksAppealing )
{
    youLikeCamelCase = true;
    votePoll( camelCaseFormatting );
}
else if ( this_looks_appealing )
{
    you_like_underscores = true;
    vote_poll( underscore_formatting );
}

Did you vote? Good! Now it’s my turn to do some work: I will try to take you through a semi-scientific explanation of which formatting is better suited for programming.

I wouldn’t have written this post if I hadn’t read Koen’s Tao of Coding. As an ex-colleague, he converted me to the underscores camp. The trigger to write this post was a reply I read in a discussion on formatting.

“honestly the code is easier to read” Opinion or fact?

It inspired me to look for scientific resources. Surely studies must have been done on this, right? As it turns out, not too many, but I found one. But first, in case you have never had this discussion: the usual opinions, and rebuttals. If you are only looking for the facts, skip to round 3.

Round 1: The opinions

Pro underscores

  • Underscores resemble natural writing the most, and thus are more readable. Spaces are simply replaced with underscores. Extreme example: isIllicitIgloo vs is_illicit_igloo.
  • Consistency with constants. Underscores are still needed in all-caps naming conventions. E.g.: THIS_IS_A_CONSTANT
  • Abbreviations could still be kept uppercase easily. E.g.: TCP_IP_connection vs tcpIpConnection
  • Classes can be kept camel case, giving a clearer difference between them and identifiers/functions. E.g.: CamelRider.ride_camel() vs CamelRider.rideCamel().

Thank you, Yossi Kreinin, for the last two points, as discussed in IHateCamelCase.

Pro CamelCase

  • Camel case is easier to type, and underscores are hard to type.
  • Camel case makes paragraphs easier to read. my_first_variable=my_second_variable-my_third_variable vs myFirstVariable=mySecondVariable-myThirdVariable
  • Camel case is shorter.
  • Camel case is used by convention in a lot of major languages and libraries. (You weren’t allowed to use this argument when voting!)

Round 2: Rebuttals

Anti underscores

  • Underscores are ugly, camel case is more elegant.

Anti CamelCase

  • Underscores aren’t that hard to type. Seriously, as a programmer it is your duty to learn to touch type with all ten fingers. Learn QWERTY, and save yourself the trouble of having to use the exotic AltGr key.
  • Use whitespace and an IDE with color coding to easily see the difference between operators and identifiers.

Round 3: The facts

When reading the abstract of the research paper, it seems science is on the camel case side.

Results indicate that camel casing leads to higher accuracy among all subjects regardless of training, and those trained in camel casing are able to recognize identifiers in the camel case style faster than identifiers in the underscore style.

Existing research

Natural language research in psychology found that replacing spaces with Latin letters, Greek letters or digits has a negative impact on reading. However, shaded boxes (similar to underscores) have essentially no effect on reading times or on recognition of individual words. Removing spaces altogether slows down reading by 10-20%.

Experiment setup

An empirical study of 135 programmers and non-programmers. Subjects have to correctly identify a matching phrase (at most three words long) out of four similar phrases. The important variables researched:

  • Correctness: whether the subject identified the correct phrase.
  • Find time: time taken to identify the phrase.
  • Training: how being a programmer affects the performance.

Results

  1. Camel casing has a larger probability of correctness than underscores (the odds are 51.5% higher).
  2. On average, camel case took 0.42 seconds longer to find, which is 13.5% longer.
  3. Training has no statistically significant impact on how style influences correctness.
  4. Those with more training were quicker at identifiers in the camel case style.
  5. Training in one style negatively impacts the find time for other styles.

The paper concludes:

Considering all four hypotheses together, it becomes evident that the camel case style leads to better all around performance once a subject is trained on this style. Training is required to quickly recognize such an identifier.

Discussion

Personally, I find the conclusion flawed for a couple of reasons.

Correctness isn’t of much importance when programming. Correctness here refers to being able to correctly see the difference between similar identifiers, e.g. startTime vs startMime. This is not a common scenario when programming. Additionally, modern IDEs provide auto-completion and indicate when a written identifier doesn’t exist. This makes me believe results (1) and (3) are irrelevant. As a side note, I believe the higher correctness of camel casing is due to the slowness of reading it: when you need to take more time to read something, you will read it more accurately.

When discussing possible threats to validity, they mention the following: “Essentially all training was with camel casing, it would be interesting to replicate the study with subjects trained using underscores.” Results (4) and (5) just seem unfair when taking this into account. Isn’t it obvious that people who are used to camel case are better at it? Additionally, this training has a proven negative impact on the find time for underscores.

So, only the slowness of reading camel case (2) remains. It takes 13.5% longer on average to read a camel case identifier than an underscore identifier. Multiply this across entire code blocks, and you have my semi-scientific opinion on the war between camel case and underscores!

For those brave enough to stick around until the end: what is your opinion now? Again, try choosing independently of convention, habit or the type of the identifiers.

P.S.: If you still believe camel casing to be more appropriate for programming, it would be interesting to leave a comment with your argumentation. 😉 I could update “Round 2: Rebuttals” to include your comments, to make the article more balanced.

Update: I’ve discussed a follow-up study in a new post. It reproduced the original study and measured that reading a camel case identifier takes 20% longer on average; additionally, using eye tracking, it found that camel case identifiers require a higher average duration of fixations.

Interpolation no more

Mathematics isn’t my forte. That’s why each time I have to implement some form of interpolation, I feel a headache coming up. Even simple linear interpolation has been bugging me, as it’s the same thing over and over again. One of the beauties of software development is that you can write something once, and as long as it works, you don’t have to worry about it anymore. Why then have I already implemented interpolation for around five different data types? With my latest need for interpolation, I finally decided to put some effort into this bugger and solve it once and for all.

Armed with a recently discovered library which allows using generics for calculations in C#, I managed to come up with the following classes.

Abstract interpolation class diagram.

Implementations of AbstractInterpolation decide on the type of interpolation between certain key points. How key points should be interpreted is decided by the implementation of AbstractKeyPointCollection. The difference between the two implementations will be demonstrated later. In order to know what to interpolate (the TValue template parameter), a trivial implementation of AbstractTypeInterpolationProvider is required for the type which needs to be interpolated. All classes have a base class AbstractBasicArithmetic, which allows basic arithmetic operations to be performed on the specified template type TMath. This type determines the value type to use, and to return, for the calculations of the interpolation. These operations are done through a ‘Calculator’ object, as specified in the generic calculations article. Normally you have to pass this calculator object yourself, but I created a CalculatorFactory which creates the correct calculator for known value types like double.

Perhaps the design is easier to understand through a code sample. Consider the desired result pictured in the following image. Of course, better solutions exist for simply drawing lines, but for demonstration purposes this works really well. The following code sets up some points to interpolate between, and the interpolation which will be used.

CumulativeKeyPointCollection<Point, double> keyPoints =
    new CumulativeKeyPointCollection<Point, double>( new PointInterpolationProvider() );

keyPoints.Add( new Point( 50, 50 ) );
keyPoints.Add( new Point( 100, 100 ) );
...
keyPoints.Add( new Point( 300, 330 ) );

LinearInterpolation<Point, double> linear = new LinearInterpolation<Point, double>( keyPoints );

CumulativeKeyPointCollection is chosen instead of AbsoluteKeyPointCollection because every key point added to the collection ‘continues’ from the previous point: the line needs to go through the points in the order in which they were added. When using AbsoluteKeyPointCollection, every key point has an absolute position, and it doesn’t matter in which order the key points are added. E.g. interpolating between known locations at specified times, where time is used as the absolute position.
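
To illustrate the difference, interpolating between locations known at specified times could look as follows. This is a hypothetical sketch: it assumes an Add overload taking an explicit position, which may differ from the actual API.

AbsoluteKeyPointCollection<Point, double> locations =
    new AbsoluteKeyPointCollection<Point, double>( new PointInterpolationProvider() );

// An Add( position, value ) overload is assumed here for illustration only.
locations.Add( 0.0, new Point( 50, 50 ) );   // Location at time 0.
locations.Add( 2.5, new Point( 100, 80 ) );  // Location at time 2.5.
locations.Add( 1.0, new Point( 70, 60 ) );   // Order of addition doesn't matter.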

Next, some interpolated points are calculated as follows.

List<Point> interpolatedPoints = new List<Point>();
double curPercentage = 0;
while ( curPercentage < 1 )
{
    interpolatedPoints.Add( linear.Interpolate( curPercentage ) );
    curPercentage += 0.001;
}

And that’s how easy interpolation can be! To achieve the result pictured earlier, I just draw line segments between all the calculated points, and draw rectangles on top of the key points. All that remains for the code to work is the type provider for Point which was passed to the key point collection. As you can see, it’s just a simple implementation of AbstractTypeInterpolationProvider.

    /// <summary>
    ///   Allows AbstractInterpolation to interpolate over a list of points.
    /// </summary>
    public class PointInterpolationProvider : AbstractTypeInterpolationProvider<Point, double>
    {
        /// <summary>
        ///   Create a new provider to allow AbstractInterpolation to interpolate
        ///   between a list of points.
        /// </summary>
        public PointInterpolationProvider()
            : base( 2 )  // The number of dimensions this type has.
        {
        }

        protected override double GetDimensionValue( Point value, int dimension )
        {
            switch ( dimension )
            {
                case 0:
                    return value.X;
                case 1:
                    return value.Y;
                default:
                    throw new NotSupportedException(
                        "Unsupported dimension while doing interpolation on type " +
                        typeof(Point) + "." );
            }
        }

        public override double RelativePosition( Point from, Point to )
        {
            // Pythagoras to get distance.
            return Math.Sqrt( Math.Pow( to.X - from.X, 2 ) + Math.Pow( to.Y - from.Y, 2 ) );
        }

        public override Point CreateInstance( double[] interpolated )
        {
            return new Point( interpolated[ 0 ], interpolated[ 1 ] );
        }
    }

To show the flexibility of this approach: cardinal spline interpolation is possible by simply swapping the LinearInterpolation class for CardinalSplineInterpolation.

CardinalSplineInterpolation<Point, double> spline =
    new CardinalSplineInterpolation<Point, double>( keyPoints );

Every time I need to implement something a bit more mathematically challenging and search for help, I seem to end up on web pages created during the Middle Ages of the internet. Aren’t there any reusable libraries out there which support e.g. interpolation out of the box, on the actual types you want to perform it on, like Point in the sample above? In my experience, many libraries offer concrete implementations instead of generic solutions. For the interpolation to work I had to implement a generic binary search algorithm, because SortedList (used to store the key points) doesn’t support binary search out of the box. To indicate intervals of values I had to create a generic Interval class, which I could immediately use in other source files as well.
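
That binary search could look roughly like the sketch below: a minimal version, assuming the key points are stored as the keys of a SortedList, and not necessarily the library’s actual implementation.

/// <summary>
///   Find the index of the last key smaller than or equal to the given position.
/// </summary>
static int FindIntervalStart<TValue>( SortedList<double, TValue> keyPoints, double position )
{
    IList<double> keys = keyPoints.Keys;  // Allows indexed access into the sorted keys.
    int low = 0;
    int high = keys.Count - 1;
    while ( low <= high )
    {
        int mid = low + ( high - low ) / 2;
        if ( keys[ mid ] <= position )
        {
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return high;  // -1 when the position lies before the first key point.
}

All this, of course, brings some overhead with it. Instead of Donald Knuth’s famous “premature optimization is the root of all evil” quote, I find Bill Harlan’s succinct expression even better.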

It is easier to optimize correct code than to correct optimized code. – Bill Harlan

Source code can be found as part of my FCL Extension library in the Whathecode.System assembly and namespace Whathecode.System.Arithmetic.Interpolation.

Abstraction is everything

Recently I had a discussion with someone in which I bluntly stated that I didn’t like virtual functions. Now that I’m reading more about software design principles, I can formulate my position a bit better. After a few quick Google searches I couldn’t find the following argumentation anywhere, so I thought it might be relevant to post about it.

First, to rephrase myself: “Prefer pure virtual functions over non-pure virtual functions where possible.” As a quick refresher: pure virtual functions are required to be implemented by a derived class, while non-pure ones aren’t. Classes containing pure virtual functions are called abstract classes.
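
In C# terms, this boils down to the difference between abstract and virtual members. A minimal illustration:

public abstract class Base
{
    // 'Pure virtual' in C++ terms: a derived class is required to override this.
    public abstract void Required();

    // Non-pure virtual: a default implementation exists, so overriding is optional.
    public virtual void Optional()
    {
        // Default behavior goes here.
    }
}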

In good tradition I’ll start by quoting a design principle to which my statement relates.

Open/Closed Principle: Software entities should be open for extension, but closed for modification. – Bertrand Meyer

More specifically, the “Polymorphic Open/Closed Principle”. This definition advocates inheritance from abstract base classes. The existing interface is closed to modifications and new implementations must, at a minimum, implement that interface.

Although most people understand abstract classes, and how they relate to the template method pattern, I often see implementations relying on overridable functions instead. I rarely use non-pure virtual methods. Only after considering all other possibilities might I find an overridable virtual method to be the cleanest approach.

Consider the following C# example, taken from the MSDN documentation on override:

public class Square
{
    public double x;

    public Square( double x )
    {
        this.x = x;
    }

    public virtual double Area()
    {
        return x * x;
    }
}

class Cube : Square
{
    public Cube( double x ) : base( x )
    {
    }

    public override double Area()
    {
        return 6 * base.Area();
    }
}

I understand that the sole intent of the example is to demonstrate the usage of the override keyword, so I’m not criticizing it. It just serves as a nice example where I find another approach to be a better implementation. I also added additional functionality to highlight the Open/Closed Principle.

public abstract class AbstractShape
{
    public abstract double Area();

    public override string ToString()
    {
        return GetType().ToString() + ": area " + Area();
    }
}

public class Square : AbstractShape
{
    private double _sideLength;

    public Square( double sideLength )
    {
        _sideLength = sideLength;
    }

    public override double Area()
    {
        return _sideLength * _sideLength;
    }
}

public abstract class AbstractCompositeShape : AbstractShape
{
    private List<AbstractShape> _childShapes;

    protected AbstractCompositeShape( List<AbstractShape> shapes )
    {
        _childShapes = shapes;
    }

    public override double Area()
    {
        double totalArea = 0;
        foreach ( AbstractShape shape in _childShapes )
        {
            totalArea += shape.Area();
        }
        return totalArea;
    }
}

public class Cube : AbstractCompositeShape
{
    private Cube( List<AbstractShape> sides ) : base( sides )
    {
    }

    public static Cube FromSideLength( double sideLength )
    {
        List<AbstractShape> sides = new List<AbstractShape>();
        for ( int i = 0; i < 6; ++i )
        {
            sides.Add( new Square( sideLength ) );
        }
        return new Cube( sides );
    }
}

Of course, this example shouldn’t be considered a real-world implementation. An implementation mainly depends on its usage. This example could be useful where unique identification of every face is important, and where many other shapes are expected. Just to state a few other things that would also be worth considering: generics, localization, performance, …

The second sample prevents misuse and promotes reuse, which are advantages of following the Open/Closed Principle like this. You can’t forget to implement a required function. It also demonstrates that there are more alternatives to virtual functions than just abstract functions: composition was used here to achieve the same goal as in the first example.

As a set of guidelines I would specify:

  • When a function always needs a custom implementation in an extending class, never use non-pure virtual functions.
  • When it is possible to group a certain default implementation in a subclass, do so instead of defining it as a default non-pure virtual function. (e.g. Area() function of AbstractCompositeShape in the example above.)
  • Question yourself why the behavior of a function needs to be changed. Can this new behavior be encapsulated? (e.g. Usage of composition in AbstractCompositeShape.)
  • Not overriding a certain virtual method shouldn’t have major functional consequences. (e.g. ToString method.)

The lead architect of C#, Anders Hejlsberg, explains why non-virtual is default in C#, as opposed to Java. He also advises caution when making functions virtual.

UPDATE
A more Java-oriented argumentation can be found here.

OO VS Procedural

Consider this part 2 of a critical review of the book “Clean Code” by Robert C. Martin. Previously, I mentioned that I don’t agree with some of the ideas found in this book on how to write clean code. I argued that following some of his ideas results in the exact opposite: unmaintainable code. More specifically, his wording of the principle:

Functions should do one thing. They should do it well. They should do it only. – Uncle Bob (Robert C. Martin)

I argued that the main reason for creating a function should be reuse, not readability. As a result, I also disagree with his statement that comments should be replaced by functions wherever possible.

It’s important to note that I do agree with most of the other concepts mentioned in the book (as far as I have read it), and I even gained some new insights.

I will not go into detail about all pros and cons of procedural programming versus OOP, but I do want to point out that his argument against OOP seems flawed.

Procedural code (code using data structures) makes it easy to add new functions without changing the existing data structures. OO code, on the other hand, makes it easy to add new classes without changing existing functions.
Procedural code makes it hard to add new data structures because all the functions must change. OO code makes it hard to add new functions because all the classes must change. – Uncle Bob (Robert C. Martin)

To help you visualize, here is the procedural example from the book. I’m quite sure you’ll be able to figure out the OO example yourself.

public class Geometry {
    public double area(Object shape) throws NoSuchShapeException {
        if (shape instanceof Square) {
            Square s = (Square)shape;
            return s.side * s.side;
        }
        else if (shape instanceof Rectangle) {
            Rectangle r = (Rectangle)shape;
            return r.height * r.width;
        }
        else if (shape instanceof Circle) {
            Circle c = (Circle)shape;
            return PI * c.radius * c.radius;
        }
        throw new NoSuchShapeException();
    }
}

John already mentions one argument in his post, relating to what “hard” and “easy” exactly mean in this statement. By replacing these words with what he is actually referring to, it becomes obvious the statement doesn’t say anything about whether or not one is better than the other:

Procedural code allows you to extend functionality by adding one function which supports behavior for all the required data structures. OO code, on the other hand, requires you to add the new function to all the different data structures.
Procedural code makes you adjust all functions to support a new data structure. OO code makes you implement a new function in all data structures.

I would even say it becomes clear why this is actually a pro argument for OO:

  • Handling all data structures inside one function breaks the cohesion principle, which, ironically, as John already stated, is a principle on which Robert C. Martin himself based his Single Responsibility Principle.
  • When adding a new data structure in procedural code, you need to make sure that every function supports it. OO forces you to do all the required implementations.

Based on his previous quotation, Martin concludes:

The idea that everything is an object is a myth. Sometimes you really do want simple data structures with procedures operating on them. – Uncle Bob (Robert C. Martin)

Everything can be represented as an object (concepts too), so I don’t see how that is a myth. I do agree with the conclusion, but in my opinion it has nothing to do with the first sentence. Instead of the geometry example, a better example would be different functions operating on a set of Point objects to draw various shapes.
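
Such an example could look roughly as follows, with the actual drawing logic left out. This is a hypothetical sketch, not code from the book.

// Point is a simple data structure; separate procedures operate on sets of points.
public struct Point
{
    public readonly double X;
    public readonly double Y;

    public Point( double x, double y )
    {
        X = x;
        Y = y;
    }
}

public static class Draw
{
    public static void Polygon( IList<Point> points )
    {
        // Connect consecutive points with straight line segments.
    }

    public static void Spline( IList<Point> points )
    {
        // Interpolate smoothly between the given points.
    }

    // New drawing procedures can be added here without ever touching Point.
}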

Function Hell

Recently I bought the book “Clean Code” by Uncle Bob, which discusses, well …, clean code. I’m very picky when it comes to clean code, which is why I wanted to read the opinions of others, to perhaps broaden or refine some of the practices I use myself.

I just read through a chapter on functions, of which a presentation by the author himself can be found online. So far I have generally agreed with most statements made in the book, but the extent to which the following rule is applied bothers me, … a lot, … and then some more.

Functions should do one thing. They should do it well. They should do it only. – Uncle Bob

To demonstrate, consider the following “good practice” code taken from his blog post about the subject.

class SymbolReplacer {
    protected String stringToReplace;
    protected List alreadyReplaced = new ArrayList();
    private Matcher symbolMatcher;
    private final Pattern symbolPattern = Pattern.compile("\\$([a-zA-Z]\\w*)");

    SymbolReplacer(String s) {
      this.stringToReplace = s;
      symbolMatcher = symbolPattern.matcher(s);
    }

    String replace() {
      replaceAllSymbols();
      return stringToReplace;
    }

    private void replaceAllSymbols() {
      for (String symbolName = nextSymbol(); symbolName != null; symbolName = nextSymbol())
        replaceAllInstances(symbolName);
    }

    private String nextSymbol() {
      return symbolMatcher.find() ? symbolMatcher.group(1) : null;
    }

    private void replaceAllInstances(String symbolName) {
      if (shouldReplaceSymbol(symbolName))
        replaceSymbol(symbolName);
    }

    private boolean shouldReplaceSymbol(String symbolName) {
      return getSymbol(symbolName) != null && !alreadyReplaced.contains(symbolName);
    }

    private void replaceSymbol(String symbolName) {
      alreadyReplaced.add(symbolName);
      stringToReplace = stringToReplace.replace(
        symbolExpression(symbolName),
        translate(symbolName));
    }

    private String symbolExpression(String symbolName) {
      return "$" + symbolName;
    }

    protected String translate(String symbolName) {
      return getSymbol(symbolName);
    }
}

This code explains the title of this post. Due to the sheer number of functions, the overview of this trivial procedural task is lost. The irony is that the decomposition of the code into so many functions is supposed to make it more readable.

A lot of the replies to his post share my sentiment that over-extracting like this isn’t favorable. The replies of “Angry Man”, albeit harsh, describe the problems perfectly, and provide a fun read in case you don’t get the point I’m trying to get across. The question that arises is: “When should a separate function be created?”

I’ll try to define my approach by a few rules of thumb, where the emphasis is on reusability, encapsulation and global readability, instead of Uncle Bob’s specification, which sadly seems based solely on creating small methods as a means to achieve local readability. What I mean exactly by this is explained later, along with the argumentation.

  • Prevent creating new methods that are guaranteed to only be called from one location.
  • Group strongly associated code together (cohesion/coupling principle), and put a comment above it mentioning its intent. Separate these blocks from other blocks by using e.g. newlines. (A sketch follows below.)
  • For every ‘block’ of code written, think whether it might be used somewhere else. When it’s reusable, encapsulate it inside a suitable function/class.
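
As a hypothetical sketch of the second rule, compare the following to extracting each commented block into its own single-use method:

public double AverageWordLength( string text )
{
    // Split the text into words, ignoring empty entries.
    char[] separators = { ' ', '\t', '\n' };
    string[] words = text.Split( separators, StringSplitOptions.RemoveEmptyEntries );

    // Compute the average word length.
    double totalLength = 0;
    foreach ( string word in words )
    {
        totalLength += word.Length;
    }
    return words.Length == 0 ? 0 : totalLength / words.Length;
}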

Every experienced programmer probably follows similar guidelines. Notice how, instead of blindly following a rule, you actually think about how the code will be used later on. I would specify this as a design principle:

Always develop towards an API.

Argumentation

One of the main design principles I’d like to use as an argument is “encapsulation”. Creating a new function within a class makes it available to the entire class, and possibly to extending classes. When a certain piece of code isn’t reused, other scopes shouldn’t be bothered by it.

By following the rule of placing code in logical blocks and documenting their intent, an entire block can be interpreted just as easily as an extracted function, and even better: its intent can be described in plain English. This is what I defined as local readability. Additionally, global readability is maintained, as you immediately see where, and how often, a block of code is used. Thanks to encapsulation, you are only bothered by the definitions relevant to the current scope.

It’s a well-known fact that when a method becomes too long, this is a strong indication it should be refactored, following the Separation of Concerns principle. By blindly following the “extract till you drop” rule of Uncle Bob, you attempt to solve a problem without reflecting on what you are actually trying to solve. Often it just moves the problem to another scope, as too many methods in a class is also an indication that refactoring is needed.

UPDATE

While continuing to read the book, I came across a perfect example demonstrating this sad fact. Consider the following extracted function, which can be found in Chapter 5 on Formatting, page 83.

private String getPageNameOrDefault(Request request, String defaultPageName)
{
    String pageName = request.getResource();
    if (StringUtil.isBlank(pageName))
        pageName = defaultPageName;

    return pageName;
}

Instead of immediately extracting this code into a function, he might have realized this behavior is pretty common. You could consider writing a “StringUtil.ensureValue(string value, string default)” function. Actually, a quick Google check brings up “StringUtils.defaultIfEmpty()” in the Apache Commons library. You don’t want to write a new function for every string needing a default value. And this behavior is even more common than that, as you might also need default values for types other than strings. For small conditional checks like this, some languages (including Java) empower you with conditional operators. All you really need is the following:

String pageName = request.getResource();
pageName = StringUtil.isBlank(pageName) ? defaultPageName : pageName;
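
And should you need default values for other types, such a generic helper is trivial to write. A sketch in C# (the ensureValue idea from above; note this hypothetical version only covers null, not the blank-string check):

public static T EnsureValue<T>( T value, T defaultValue )
    where T : class
{
    // Fall back to the default when no value is present.
    return value ?? defaultValue;
}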

The thing that amazes me is that his trigger-happy extraction rule violates so many good rules which are mentioned in the book as well. Besides the already mentioned encapsulation problem, it’s not hard to see how the example above violates the “Don’t Repeat Yourself” principle. If you would argue that you might need to call this code more than once, don’t: it should be called exactly once, with the result stored in a variable. Which of the following two is more readable after that first call: “getPageNameOrDefault(request, defaultName)” or “pageName”?

As far as I’ve read through the book (chapter 5), a better name for it would be “Readable Code” instead of “Clean Code”. I do hope the remainder contains more valuable tips. I am afraid, however, of the second section, full of code samples, which probably apply this nonsensical “extract till you drop” rule.

Language Oriented Programming

Yesterday I spent several hours thinking about a concept I have been toying with for several days. Every programming language I have ever used has its own limitations. The past few months I have been running into those a lot, which is part of the reason why I started this blog. It made me wonder whether a programming language could be created which can adjust itself at any given time. One important result is that this would allow you to create new keywords for sets of operations that are used regularly, solving problems for which now only workaround solutions exist. Another cool aspect I realized is that when you start this process from the byte level, the programming language could be its own compiler at runtime.

As a thought experiment, I tried to come up with a basic set of rules allowing for this required flexibility. If I didn’t make any logic errors, the rules I came up with can be implemented with a surprisingly small implementation, which, for now, I’ll attempt to do in C#. If this turns out fruitful, the next step would be to do the same in assembly. The C# implementation and the assembly implementation would then be able to run the same code.

While thinking of a name for the programming paradigm I had in mind (read: I needed to give the newly created solution in Visual Studio a name), I checked Wikipedia for existing paradigms, but didn’t find any similar ones at first. Only through a subsequent Google search, with some keywords I found appropriate, did I stumble upon Language Oriented Programming, and I’m really excited about it! The article reflects a lot (and more) of what I was thinking about, and does a really good job of explaining it. As a nice brainfuck quotation to end with:

Language Oriented Programming will not just be writing programs, but also creating the languages in which to write our programs. – Sergey Dmitriev, cofounder and CEO of JetBrains Inc.