Friday, December 9, 2011

Intro to Functional Programming (with examples in Java) - Part 1

What is functional programming?

Quoting Wikipedia, "In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data."

As someone with a math background, I like the simple, axiomatic ideas of the lambda calculus and the fact that you can derive fairly rich constructs from simpler ones (all using a relatively simple collection of operations).

Of course, like many developers these days, my programming experience has almost entirely been in the imperative, object-oriented world. Over the past year, I've done a lot of reading on functional programming in Scala, Clojure, Haskell, and Erlang, and have been solving online puzzles in Scala. Functional programming now makes sense to me from a practical standpoint. At the same time, whenever I try to explain some cool, short functional code to colleagues, their eyes glaze over, for what seem to be the following two reasons:

  • The syntax. Haskell or Lisp (including Clojure) look really different from languages with a C heritage. While Scala can be made to look a lot like Java, doing so tends to involve using its more object-oriented flavour, which defeats the purpose of a functional programming example. In my opinion, idiomatic Scala feels more like Haskell.
  • They are also usually not terribly excited, since they could generally do the same thing with some slightly more verbose imperative code, so they don't really see the point.

In this post (and a likely followup or two), I'll try to address the first point, by building up some functional programming constructs in Java. For the second point (at a later date), I'll try to show some examples where functional programming lets you focus more on what your code is doing, and less on the details of what the computer is doing.

First-class Functions

One thing that seems to be needed for any functional programming is the idea that functions are objects that can be passed around. In Java, of course, objects are instances of classes, so let's create our first function class in Java:

This is a single-purpose function class that, given a string will return its length. We can pass this around, have it as a member, store it as a map value, etc. Unfortunately, it only does one thing. Let's make a base class that generalizes this a bit further to all functions that take a String as input and return an Integer:

This is more useful. We could imagine having a method that acts on a Collection of Strings and wants to produce an Integer for each one. We could then pass it any FunctionStringToInteger as a parameter, independent of how the Integers are produced from the Strings.

Let's generalize this further, using Java generics to handle any function of one argument. In this case, we want to generalize on the input type T1 and the return type R. (I'm using T1 instead of T, since we'll soon be looking at functions that take multiple parameters, which could each be of different type.)

While the generic parameters can be a little messy, we can now model any function of one parameter as a subclass of this Function1 class. Here's a simple example:

Of course, having the function as an object and invoking it directly doesn't really say anything about first-class functions (and mostly just looks like a much more verbose way of calling a method). Lets write a method that takes a function as a parameter, so the first-classedness becomes more useful:

Now we can call transformList with a list of elements and a function that transforms individual elements and get back a list of all of the transformed elements. This is actually a fairly common pattern, so transformList is part of the standard library in a lot of functional languages (though it's usually called "map" -- we'll switch to calling it map in part 2).

We can also define classes for functions that take more than one parameter:

That's cool, but isn't it annoying that you need to make FunctionN classes for each number of parameters that you want? Yes, but you define the classes once and then you use them forever, or ideally they're already defined in a library (e.g. Functional Java defines classes F, F2, ..., F8, and the Scala standard library defines classes Function1, ..., Function22). For this blog, we'll do things the hard way, and implement everything ourselves. (In particular, I will be adding more functionality to the base FunctionN classes in part 2.)

Higher-order Functions

Once you have first-class functions, it's not hard to imagine creating functions that take functions as parameters. Above, we defined the transformList method that took a function as one of its parameters, but methods aren't functions as we've defined them (that is, methods are not first-class in Java). Here's a silly example of a higher-order function with a very long name:

Why don't we turn transformList into a function? Then we could pass it to other functions/methods to build up more complicated things. Unfortunately, this is an area where Java makes things more of a pain for functional language programming. I think it's related to Java not having "higher-kinded" types.

The basic problem is that, while you can write a generic Function2 that has a Function1 as a type parameter, you don't necessarily know the R, T1 type parameters for that Function1. What we would really like to do would be to write a type-safe Function2 that can take any Function1 as a parameter, regardless of what that Function1's types are, and without necessarily having that Function1 when creating the Function2. Unfortunately, I don't think that's entirely possible in Java, but we're going to come up with a way of coming close.

Let's try to define a generic transformList in a naive way:

Well, that's not going to work. If you paste that code snippet into an IDE (as I just did), all of the Rs and T1s should be highlighted in red. There is no context to define what these type parameters should be in the general case. So, what can we do to get around this? One approach, if you're into taking risks, is to just drop the unbound type parameters:

That will compile, and you get the correct result if you pass the right types of parameters. Unfortunately, if you call apply with a list of Integers and a function that takes something else (e.g. String) as a parameter, you won't notice your mistake until runtime (when a ClassCastException is thrown). You will get a compiler warning regarding unchecked calls within the definition of the apply method of rawTransformList, but that is it.

If we cannot directly define a generic, statically-typed higher-order function for transformList, and a raw version is not safe, maybe we can define a generic factory method that builds these statically-typed functions by providing a "hint" of the type parameters:

This is looking better. We don't need to know exactly which Function1 will be passed to transformList, but we do need to know the input and return types for that Function1. Unfortunately, this will create a new Function2 every time we call it. We could hold onto transformList instances for every input/return type pair that we use, but that's still wasteful (especially since the type parameters will be erased by the compiler, and the stateless instances will all be identical at runtime).

I believe the best approach is to combine the above two approaches. Use a private static raw instance singleton, which is returned by a generic public factory method:

This gives you a Function2 that statically type-checks its inputs and outputs, despite the fact that it's the same one being used for all type parameters. Because of Java's compile-time type erasure, it makes no difference at runtime.

What's next?

I think this post is getting long enough. We've gone over the basics of first-class and higher-order functions. In the next posts, I'll add currying to the functions and talk about the idea of everything (conceptually) being a function.

The code for this series of blog posts is available in my public github repository.

Wednesday, March 30, 2011

Immutable Binary Trees (Part 2)

On Monday, I introduced a basic implementation of immutable binary trees in Java. At the end of that post, I wrote:

There are several problems with this implementation that I plan to address in a future post:

  • The remove method creates a new node at each level, even if the element to remove does not exist.

  • The use of null to indicate the lack of a left or right child opens us up to the potential of NullPointerExceptions and wastes space from those fields. (In particular, in a balanced binary tree, half the elements will be leaf nodes.)

  • Currently, there is no way to create an empty tree, since every node has a value. If you look at the ImmutableTreeRunner implementation above, I had to generate my first random value before I could create the ImmutableBinaryTree.



This is that future post. I'll present a much cleaner, more memory-efficient implementation that solves these problems. That said, this is still not a good binary tree implementation to use. I'm not implementing any balancing, so it's generally going to be quite slow. It's just a more elegant unusable solution.

The first problem was largely laziness on my part, I suppose. (It would have been possible to better handle remove in the old implementation.) The other two problems can be addressed by accepting that not all immutable binary trees should be created equal, and establishing a class hierarchy:



That may look overly complicated, but all of the logic will actually degenerate into handling empty trees and handling non-empty trees. The NonEmptyTree subclasses only add fields and provide one-line implementations for NonEmptyTree's abstract methods.

Note that I gave the EmptyTree rounded edges in the diagram to distinguish it -- this will be implemented as a singleton. Conceptually, there is only one empty immutable tree). Also, this makes for some simpler code, as we can use == to see if a child is the empty tree.

First, let's look at the implementation of the ImmutableBinaryTree abstract base class:



We specialize the declared return type of add and remove, while still satisfying the ImmutableSet interface I put up on Monday. We declare an internal addAll helper method (though this could also be added to the ImmutableSet interface). Finally, we create two factory methods for immutable binary trees -- one that returns the empty tree and another that constructs a series of trees until all the given elements have been added.

Next we implement all of the abstract methods for the empty tree. These are all one-liners:



We also store the (untyped) singleton EmptyTree instance and provide a (typed) static accessor for it. In this case, working without the T type is perfectly legitimate -- there is no state influenced by T in instances of this class. That said, the Java compiler apparently can't tell that this is safe (which is fair, since usually it's not safe). So, I've used the @SuppressWarnings annotations to reassure the compiler that I know what I'm doing. (Note that a similar approach is used at least in the Apache Harmony implementation of Collections.emptyList(), which is solving almost exactly the same problem.)

Before thinking about the behaviour of non-empty trees, I think it makes sense to think about their state. Firstly, all non-empty trees represent a node with a value. So, NonEmptyTree will have a field of type T for that value. Additionally, non-empty trees have 0, 1, or 2 children. If they have 1 child, it is either a left or right child. This motivates the NonEmptyTree hierarchy presented above. By isolating the state (that is, the fields) into subclasses, we ensure that memory is not wasted on fields that are unused (that is, are permanently null). It also makes it easier to reason about the state being used in each subclass.

Fortunately, by ensuring that all of the subclasses provide a consistent exposure of state, that is, they supply a left and right child on demand (where they may be the singleton EmptyTree), we can safely implement all of the logic in NonEmptyTree:



Why did I declare emptyTree local variables to capture the empty tree instance before checking equality against passed parameters? It turns out that the Java type inferenceer (or at least the one used in my local version of Eclipse on a Mac) wouldn't allow me to check e.g. tree == EmptyTree.instance(). It's possible that I messed up with the covariance of the type parameter. Regardless, adding the local variable was an easy of making my intentions clear to the type inferencer.

The general flow is as follows:

  • add is a special case of addAll, unless the current node already holds the value being added.

  • remove either removes the current node (at which point, we merge the left and right children), or replaces the relevant child with the result of the remove operation applied to it.

  • addAll either merges the newly added tree's children with this node's children (if they have the same root value), or replaces the relevant child after applying the addAll operation to it.

  • replaceChildren runs through the five relevant cases for replacing the children of a given node:

    • If both children are equal to the current value, then nothing has changed. Return this.

    • If both new children are the EmptyTree singleton (and weren't before, or they would have been caught by the previous check), then we return a new Leaf.

    • If one new child is the EmptyTree, we return a new LeftBranch or RightBranch.

    • Otherwise, both children are non-empty so we return a new DualBranch, which is guaranteed to be different from the current node (by the first check).



  • The toList and contains methods are fairly self-explanatory.


Finally, we have all of the specializations of NonEmptyTree, which are a series of constructors and one-line method implementations:











There are a couple of points that I found interesting working on this code:

  • There is not a single null check in this code. Instead, that role has been absorbed by EmptyTree, which I like to think of as a sort of "type-safe" null. Instead of checking for null and deciding how to react in every situation, I've defined a "null-like" element, and specified how it should implement my interface in a consistent way. I did still check for EmptyTree in replaceChildren, to ensure that the most efficient concrete NonEmptyTree is returned. I could have just returned a new DualBranch, and the correctness of the algorithms would still hold.

  • A Java object, as I understand it, occupies enough memory to hold a reference to its runtime class, and the space occupied by its fields. Since EmptyTree is a singleton, its memory consumption doesn't really matter, but should effectively be this single reference to its class. Leaf is a concrete class with a reference to its class and a reference to the contained value of the node. On the other extreme, DualBranch represents four references to: the class, the value, a left child, and a right child. Thanks to the immutable nature of this code, we are required to return new objects, which don't necessarily need to be of the same type as our current (runtime) class. We can make those objects as efficient as possible. If I had implemented a proper balancing strategy, half the nodes would be Leafs, and would occupy roughly half the memory of corresponding DualBranches (disregarding the space occupied by the values themselves, aside from their references).



While there's nothing new or clever in these posts, I've found them to be a good learning exercise, and have helped me reinforce some functional programming ideas. In particular, while the whole "type-safe null" idea expresseed by EmtpyTree is pretty standard fare in Haskell, it's still pretty novel to me as a (mostly) Java developer. Also, the particular refactoring in this post was motivated by the elegant simplicity of Scala's immutable List, which I feel that I now understand more clearly.

Of course, I did need to add those two @SuppressWarnings annotations in EmptyTree, which I believe Scala avoids by having Nil (the singleton empty list) have a type parameter of Nothing (which inherits from everything -- if this sounds weird but intriguing, read up on some Scala, it actually all makes sense). I don't believe it's possible to do the same in Java, so I'm happy to at least be consistent with Apache Harmony.

Monday, March 28, 2011

Immutable Binary Trees

I have been reading a fair bit about functional programming recently, and the advantages you can get in a concurrent execution environment by using immutable data types (or persistent data structures).

While it's not too hard to wrap your head around an immutable list structure with an efficient prepend operation, I wanted to try implementing a slightly more complicated immutable structure (but not as complicated as a trie). So, I decided to try implementing an immutable binary tree in Java, and then work on improving it a bit.

Note that this is not a "good" implementation. In particular, I'm not going to bother rebalancing the tree, so it's very possible for many operations to be O(n) instead of the O(log n) that a balanced tree would provide.

The main idea that I had trouble fully grasping when learning about immutable data structures is the idea that you usually don't need to copy everything when adding/removing an element. So, let's look at an example of a binary tree in pictures and consider what elements need to be replaced and what can be reused when we add a new element to the tree.

Consider the following tree:



The rules are simple: at each node, values less than the current node's value are stored in the left subtree, while values greater are stored in the right subtree.

Now, suppose we want to add the number 11 to the tree. It should be added as a right child of the 10 node. Since the current 10 node has no children (and is itself an immutable tree), we must allocate a new 10 node with 11 as a right child. The new 10 node will be the left child of a new 13 node, which can reuse the old 16 node as its right child (along with the children of 16). Finally, a new 8 node must be allocated with its right child pointing to the new 13 node and its left child pointing to the existing 4 node.

Here is a picture with the newly-allocated nodes highlighted in red:



In total, four new nodes were allocated: one for each level in the tree. For a balanced tree, this generalizes to O(log n) node allocations, or the same runtime efficiency of a mutable binary tree. The remainder of the tree continues to point to the old values. Assuming we aren't holding onto a reference to the old tree root, the original 8, 13, and 10 nodes would now be eligible for garbage collection. (Though if another thread were accessing the old version of the tree, it would not be affected by this change, avoiding a potential race condition.)

Now, let's take a look at how to implement a simple immutable binary tree in Java.

To start with, I'll create a simple ImmutableSet interface (which doesn't follow the java.util.Set interface, since that is fundamentally based on mutability -- e.g. the "normal" add method has a void return type):



I've included a toList method for testing purposes, since it was easier than implementing the Iterable interface.

The fields and constructor of the ImmutableBinaryTree implementation are as follows:


All fields must be final in an immutable data structure.

Here is the implementation of the add method:


We create a leaf node for the new element and add it as a subtree to the existing tree. The addSubtree method is reused in the implementation of remove below, and could be used to implement a public addAll method that allows merging of two ImmutableBinaryTrees.



The remove method has two base cases (when we don't recurse into child branches). If the element is not in the tree, we return the unmodified this. If the current element is to be removed, we merge the left and right subtrees, arbitrarily choosing to add the right subtree as a descendent of the left (assuming the left subtree is not null, or else we simply return the right subtree).

The contains method is effectively the same as it would be in a mutable binary tree:


The toList method does an in-order traversal of the tree to return an ordered list:


For debugging purposes, and to visualize the tree depth, I decided to override toString:


Finally, here is a runner I used to compare the correctness of my implementation for addition/removal against the standard mutable TreeSet from the Java standard library:


There are several problems with this implementation that I plan to address in a future post:

  • The remove method creates a new node at each level, even if the element to remove does not exist.

  • The use of null to indicate the lack of a left or right child opens us up to the potential of NullPointerExceptions and wastes space from those fields. (In particular, in a balanced binary tree, half the elements will be leaf nodes.)
  • Currently, there is no way to create an empty tree, since every node has a value. If you look at the ImmutableTreeRunner implementation above, I had to generate my first random value before I could create the ImmutableBinaryTree.

Tuesday, January 11, 2011

Functional Programming Videos

I received a link from an email on the Scala mailing list to some excellent videos on functional programming:

http://cufp.org/videos

So far, I've watched "Naïveté vs Experience" and "Real world Haskell". Excellent viewing!