The Y combinator is a somewhat magical aspect of the untyped lambda calculus. Many people have tried to explain this magic, but I find most explanations somewhat unsatisfactory.
Most explanations show the combinator at the beginning and then simply demonstrate that the given combinator is correct.
Some other explanations contain lines like “Why don’t we try passing it to itself? And look, this turns out to be exactly what we want!”
The problem with these is that they don’t give an intuition for how one could constructively derive the combinator. What on earth makes you think that passing it to itself is a good idea? As such, this post, intended for those who already know the Y combinator but are not satisfied with the existing explanations, attempts to explain how one could constructively derive the Y combinator.
Introduction #
In lambda calculus, there’s no primitive support for defining recursive functions. Can we define recursive functions in lambda calculus regardless? And how?
While our question focuses on lambda calculus, to make it easier to follow, we will use an actual programming language. Here, I choose to use Racket, though other programming languages would work as well.
As mentioned above, this post is intended for those who know basic programming language theory. However, I will try to provide as many links as possible to accommodate the non-target audience.
Enabling recursion #
In Racket, we can write code to compute $3! + 4!$ easily.
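A sketch of what such a program could look like (the exact code may have differed):

```racket
;; recursive factorial via letrec, then compute 3! + 4!
(letrec ([fact (λ (n)
                 (if (zero? n) 1 (* n (fact (- n 1)))))])
  (+ (fact 3) (fact 4)))
```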
30
Defining this recursive `fact` function is possible because `letrec` allows us to refer to `fact` inside the lambda function that will be `fact` itself.
Of course, lambda calculus has no `letrec`. We might ask how Racket implements `letrec`, and the short answer is that it does so by using mutation, which lambda calculus doesn’t have either. Therefore, to simulate lambda calculus, we will write the factorial function in Racket without using `letrec` or mutation.
Here’s our first straightforward attempt:
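A sketch of the attempt, assuming the same factorial body as before:

```racket
;; changing letrec to let; the inner fact is no longer in scope
(let ([fact (λ (n)
              (if (zero? n) 1 (* n (fact (- n 1)))))])  ; fact is unbound here
  (+ (fact 3) (fact 4)))
```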
fact: unbound identifier in: fact
We simply change `letrec` to `let`. While `let` and `letrec` are very similar, `let` doesn’t have the ability that allows us to refer to `fact` inside that lambda function. In fact, `(let ([‹x› ‹v›]) ‹e›)` doesn’t add any expressive power to lambda calculus at all, because its semantics is completely equivalent to that of `((λ (‹x›) ‹e›) ‹v›)`, a lambda term. Some programming languages simply treat `let` as syntactic sugar, desugaring `(let ([‹x› ‹v›]) ‹e›)` to `((λ (‹x›) ‹e›) ‹v›)`.
As expected, our straightforward attempt doesn’t work because the highlighted `fact` is unbound. How can we fix it? Let’s take a look at another example:
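A hypothetical example of the same flavor (the names `f` and `c` are my invention):

```racket
;; c is unbound inside the lambda
(let ([f (λ (x) (* c x))])
  (let ([c 10])
    (f 10)))
;; => c: unbound identifier in: c
```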
c: unbound identifier in: c
The above code has a similar problem: the highlighted `c` is unbound. One way to fix the problem is to pass `c` explicitly to the lambda function, so that the highlighted `c` becomes bound:
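Continuing the hypothetical example, `c` becomes a parameter:

```racket
;; pass c explicitly so it is bound inside the lambda
(let ([f (λ (c x) (* c x))])
  (let ([c 10])
    (f c 10)))
```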
100
Our solution to the unbound `fact` problem is going to be the same. While `fact` is unbound inside the lambda function, it is bound in the body of the `let`, so we can pass `fact` explicitly to the lambda function!
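A sketch of this step (note that the inner recursive call is left unchanged for now):

```racket
;; fact now takes itself as an extra argument, but the inner call
;; (fact (- n 1)) still supplies only one argument
(let ([fact (λ (fact n)
              (if (zero? n) 1 (* n (fact (- n 1)))))])
  (+ (fact fact 3) (fact fact 4)))
;; => fact: arity mismatch
```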
fact: arity mismatch;
the expected number of arguments does not match the given number
expected: 2
given: 1
arguments...:
There’s still a problem: `fact` is now a function of arity 2, so the highlighted function application results in an arity mismatch error. What does `fact` need as its first argument? The answer is `fact` itself! So we can fix the code accordingly:
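A sketch of the fixed code:

```racket
;; every call to fact now passes fact itself along
(let ([fact (λ (fact n)
              (if (zero? n) 1 (* n (fact fact (- n 1)))))])
  (+ (fact fact 3) (fact fact 4)))
```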
30
With this trick, we seem to be able to write any recursive function $f$ in a language without `letrec` by the following method, which we will call method A:

1. Change `letrec` to `let`.
2. Add an extra parameter, also named $f$, in front of the function’s existing parameters.
3. At every call site, pass $f$ itself as the first argument; that is, change each application `(f ‹arg› ...)` to `(f f ‹arg› ...)`, both inside the function body and in the body of the `let`.
But is it always correct? Consider the following code:
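A hedged guess at what this code could look like: `fact` appears in a non-function position (as a plain `let`-bound value, named `my-fact` here for illustration), yet the program still computes $3! + 4!$:

```racket
(letrec ([fact (λ (n)
                 (if (zero? n) 1 (* n (fact (- n 1)))))])
  (let ([my-fact fact])   ; fact used in a non-function position
    (+ (my-fact 3) (my-fact 4))))
```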
30
is transformed to:
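Sketched with the same hypothetical `my-fact`:

```racket
;; method A leaves the non-function occurrence of fact unchanged,
;; so my-fact is now the two-argument function and (my-fact 3) fails
(let ([fact (λ (fact n)
              (if (zero? n) 1 (* n (fact fact (- n 1)))))])
  (let ([my-fact fact])
    (+ (my-fact 3) (my-fact 4))))
;; => fact: arity mismatch
```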
fact: arity mismatch;
the expected number of arguments does not match the given number
expected: 2
given: 1
arguments...:
As we can see, method A is not always correct, because when $f$ is used in a non-function position, no change is made. Fortunately, it is possible to force every occurrence of the identifier $f$ to be in a function position! If $f$ is a function of arity $n$, then $f$ is observationally equivalent to $\lam{x_1 x_2 \dots x_n}{f x_1 x_2 \dots x_n}$ (when none of $x_1, x_2, \dots, x_n$ is equal to $f$). Hence, if we insert another step between method A’s step 2 and step 3 in order to replace all variable occurrences that are bound to $f$ with $\lam{x_1 x_2 \dots x_n}{f x_1 x_2 \dots x_n}$, what we get is method B, which works for any program.
Let’s try using it!
is transformed to:
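Sketched with the same hypothetical `my-fact`, the non-function occurrence is now $\eta$-expanded:

```racket
(let ([fact (λ (fact n)
              (if (zero? n) 1 (* n (fact fact (- n 1)))))])
  (let ([my-fact (λ (n) (fact fact n))])  ; η-expanded occurrence of fact
    (+ (my-fact 3) (my-fact 4))))
```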
30
Cool, it works now!
Omega #
Let’s digress a little bit here. What will happen if we perform the syntax transformation on the following function, which loops forever?
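A sketch (the function is returned rather than invoked here, since invoking `(diverge)` would never return):

```racket
;; a function that loops forever when called
(letrec ([diverge (λ () (diverge))])
  diverge)
```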
Because every bound occurrence of `diverge` is in a function position, we can use method A. The above code is transformed to:
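A sketch of the transformed code (again returning the function, since evaluating `(diverge diverge)` loops forever):

```racket
(let ([diverge (λ (diverge) (diverge diverge))])
  diverge)
```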
To obtain a lambda term, we simply desugar `let` as described above:
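Written out, the desugared term is the following (shown as a comment, because evaluating it never terminates):

```racket
;; ((λ (x) (x x)) (λ (x) (x x)))
;; each β-reduction step yields the same term again, forever
```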
Recall that this is known as the $\Omega$ combinator, a classic example of a lambda term whose evaluation doesn’t terminate.
Deriving a family of Y combinators #
While we now know how to write recursive functions in lambda calculus, the transformed function doesn’t look like the original function that we would write in the `letrec` version. It would be nice if there were a term `‹make-recursion›` such that:
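A hypothetical shape of this snippet (`make-recursion` stands for the term we are after):

```racket
(make-recursion
 (λ (n)
   (if (zero? n) 1 (* n (fact (- n 1))))))  ; fact is unbound here
```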
produces the actual factorial function. However, this clearly won’t work, because the highlighted identifiers are going to be unbound. To fix this, we simply make them bound by wrapping a lambda function around the body.
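The wrapped version could look like this (still hypothetical, since `make-recursion` is yet to be found):

```racket
(make-recursion
 (λ (fact)               ; this argument is what we call a recursion maker
   (λ (n)
     (if (zero? n) 1 (* n (fact (- n 1)))))))
```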
We will call the lambda function that is the argument of `‹make-recursion›` a recursion maker. What we want, then, is to find `‹make-recursion›` such that for any `‹recursion-maker›`, `(‹make-recursion› ‹recursion-maker›)` produces the actual recursive function.
How should we approach this? We don’t know what `‹make-recursion›` looks like, but for the factorial function, we do know what `‹fact-maker›` looks like. The strategy, then, is to morph a piece of code that uses method B to implement the factorial function:
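A sketch of the starting code:

```racket
(let ([fact (λ (fact n)
              (if (zero? n) 1 (* n (fact fact (- n 1)))))])
  (+ (fact fact 3) (fact fact 4)))
```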
into a piece of code that uses `‹fact-maker›`:
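A sketch of the target shape, with `‹???›` marking the hole:

```racket
(let ([fact (‹???› (λ (fact)
                     (λ (n)
                       (if (zero? n) 1 (* n (fact (- n 1)))))))])
  (+ (fact 3) (fact 4)))
```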
If we can do this, the hole `‹???›` is going to be the `‹make-recursion›` that we are looking for.
As a first step, notice that `fact` and `n` are both formal parameters in the transformed code, whereas in `‹fact-maker›` they are separated. Straightforwardly, we can use currying to separate the two. Of course, we also need to change how we call the function: every call `(fact fact ‹arg›)` becomes `((fact fact) ‹arg›)`.
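A sketch after currying:

```racket
(let ([fact (λ (fact)
              (λ (n)
                (if (zero? n) 1 (* n ((fact fact) (- n 1))))))])
  (+ ((fact fact) 3) ((fact fact) 4)))
```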
For the next step, notice that we would like to use `fact` naturally at call sites, but in our current code we have `(λ (v) ((fact fact) v))` instead. What we will do is shadow `fact` with that value, so that it appears that we can use `fact` normally.
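A sketch of the shadowing step; the $\eta$-expansion `(λ (v) ((fact fact) v))` is what keeps call-by-value evaluation from looping here:

```racket
(let ([fact (λ (fact)
              (let ([fact (λ (v) ((fact fact) v))])  ; shadow fact
                (λ (n)
                  (if (zero? n) 1 (* n (fact (- n 1)))))))])
  (let ([fact (λ (v) ((fact fact) v))])              ; shadow fact again
    (+ (fact 3) (fact 4))))
```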
The first highlighted code is now very similar to `‹fact-maker›`, and the second highlighted code exactly matches the client-side code. What we want to do next is to somehow convert `(let ([fact (λ (v) ((fact fact) v))]) ‹body›)` into `(λ (fact) ‹body›)`, where `‹body›` is the first highlighted code, and also move `(λ (v) ((fact fact) v))` somewhere else. That’s simply desugaring `let`!
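A sketch after desugaring both inner `let`s:

```racket
(let ([fact (λ (fact)
              ((λ (fact)                     ; ← this is ‹fact-maker›
                 (λ (n)
                   (if (zero? n) 1 (* n (fact (- n 1))))))
               (λ (v) ((fact fact) v))))])
  ((λ (fact) (+ (fact 3) (fact 4)))          ; ← client-side code
   (λ (v) ((fact fact) v))))
```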
We managed to make `‹fact-maker›` appear! Next, we can abstract `‹fact-maker›` out as a variable named `fact-maker`:
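A sketch with `fact-maker` abstracted out:

```racket
(let ([fact-maker (λ (fact)
                    (λ (n)
                      (if (zero? n) 1 (* n (fact (- n 1))))))])
  (let ([fact (λ (fact) (fact-maker (λ (v) ((fact fact) v))))])
    ((λ (fact) (+ (fact 3) (fact 4)))
     (λ (v) ((fact fact) v)))))
```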
Eventually, we will want `‹make-recursion›` as a lambda term. However, there’s no `let` in lambda calculus, so we will reduce the number of `let`s by fusing the two nested “let `fact` be ...” together in a straightforward way:
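One plausible shape for the fused code (my reconstruction; the original fusion may have been arranged slightly differently):

```racket
(let ([fact-maker (λ (fact)
                    (λ (n)
                      (if (zero? n) 1 (* n (fact (- n 1))))))])
  ((λ (fact) (+ (fact 3) (fact 4)))
   ((λ (fact) (λ (v) ((fact fact) v)))
    (λ (fact) (fact-maker (λ (v) ((fact fact) v)))))))
```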
As a final step, `‹make-recursion›` should consume `fact-maker` as an argument, so we can abstract `fact-maker` out:
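Continuing the reconstruction from the previous step, the result could read:

```racket
(let ([fact-maker (λ (fact)
                    (λ (n)
                      (if (zero? n) 1 (* n (fact (- n 1))))))])
  ((λ (fact) (+ (fact 3) (fact 4)))
   ((λ (fact-maker)                         ; ← ‹make-recursion›
      ((λ (fact) (λ (v) ((fact fact) v)))
       (λ (fact) (fact-maker (λ (v) ((fact fact) v))))))
    fact-maker)))
```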
30
And we are done! The highlighted code is the `‹make-recursion›` that we are looking for.
Notice that there’s nothing inherent about the factorial function here. Had we used any other recursive function as a starting point, we would have ended up with a term that is $\alpha$-equivalent to this `‹make-recursion›` as well. We will highlight this fact by $\alpha$-converting the variable names to remove all references to the factorial function:
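One possible $\alpha$-converted form, consistent with the reconstruction above (the exact variable names and placement of $\eta$-expansions in the original may differ):

```racket
(define make-recursion
  (λ (f)
    ((λ (x) (λ (v) ((x x) v)))
     (λ (x) (f (λ (v) ((x x) v)))))))
```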
Actually, we can simplify our `‹make-recursion›` a little bit further. Recall from Enabling recursion that if $e$ evaluates to a function of one argument, then $e$ is observationally equivalent to $\lam{x}{e x}$ (provided that $x$ does not freely occur in $e$). The transformation back and forth between $e$ and $\lam{x}{e x}$ is known as $\eta$-conversion. Here are some terms that we can derive from the above `‹make-recursion›`:

- From `‹make-recursion›`, by $\eta$-converting the following highlighted code:
we obtain:
which is formally known as the Z combinator ($Z$).

- From $Z$, by $\eta$-converting the following highlighted code:
we obtain:
which is formally known as the Y combinator ($Y$).
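Written as Racket definitions, the two terms are (using their standard forms, which match the definition of $Z$ used in the proof below):

```racket
(define Z
  (λ (f)
    ((λ (x) (f (λ (v) ((x x) v))))
     (λ (x) (f (λ (v) ((x x) v)))))))

(define Y
  (λ (f)
    ((λ (x) (f (x x)))
     (λ (x) (f (x x))))))
```

Note that merely defining `Y` is harmless; it is applying it that loops under Racket’s evaluation strategy, as discussed below.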
The original `‹make-recursion›`, $Z$, and $Y$ are fixed-point combinators. The word combinator means that there is no free variable in these terms, and the words fixed point mean that the application of a fixed-point combinator to $f$ is a solution to the equation $x = f(x)$ for any function $f$. For example, this property says that $Z(f) = f(Z(f))$. Why is this fixed-point identity important? And how are `‹make-recursion›`, $Z$, and $Y$ different? These questions will be answered in the next sections.
Fixed point #
Why is a fixed-point identity such as $Z(f) = f(Z(f))$ true, and why is it important? This section attempts to answer these questions, using the Z combinator as the primary example.
First, we will take a look at recursion makers from another perspective. In Deriving a family of Y combinators, recursion makers arise naturally, but we only know that `(Z ‹recursion-maker›)` results in a recursive function. What exactly are their role and properties? Let’s take `‹fact-maker›` as a concrete example:
What can we pass into $\text{fact-maker}$? It obviously consumes the factorial function! But given that we don’t have the real factorial function yet, the best we can do is to provide some other value instead. Say we pass in a nonsensical value, $42$. This results in:
What we have here is a factorial function that works only on $n = 0$. For $n > 0$, the highlighted $42$ will be used as a function value, which results in an error. We will call this function $\text{fact}_0$.
What else can we pass into $\text{fact-maker}$? We just obtained $\text{fact}_0 = \text{fact-maker}(42)$, so we can use it. This results in:
Again, what we have here is a factorial function that works on $n \le 1$. Let’s call it $\text{fact}_1$. Next, we can pass $\text{fact}_1 = \text{fact-maker}(\text{fact-maker}(42))$ into $\text{fact-maker}$ to get $\text{fact}_2$. We can clearly see the trend here: $\text{fact-maker}^k(42) = \text{fact}_{k-1}$, the factorial function that works on $n \le k - 1$.
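The first few levels of this tower can be checked directly (a sketch, with `fact-maker` written with `if`; the original may have used a different conditional form):

```racket
(define fact-maker
  (λ (fact)
    (λ (n)
      (if (zero? n) 1 (* n (fact (- n 1)))))))

(define fact0 (fact-maker 42))      ; works for n = 0 only
(define fact1 (fact-maker fact0))   ; works for n <= 1

(fact0 0)
(fact1 1)
;; (fact0 1) would error: 42 is not a procedure
```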
In general, any recursion maker $f$ propels the computation one more level deeper.
To get the actual factorial function, which should work for any natural number $n$, we evidently need an infinitely high tower of $\text{fact-maker}$s. That is, $\text{fact} = \text{fact}_\infty = \text{fact-maker}(\text{fact-maker}(\text{fact-maker}(\dots)))$.
Note that:
$$\begin{align*}
\text{fact} &= \text{fact-maker}(\text{fact-maker}(\dots))\\
\text{fact-maker}(\text{fact}) &= \text{fact-maker}(\text{fact-maker}(\text{fact-maker}(\dots))) & \text{applying fact-maker}\\
&= \text{fact}\\
\end{align*}$$
which agrees with our definition of $\text{fact-maker}$: applying $\text{fact-maker}$ to $\text{fact}$ produces the actual $\text{fact}$ function.
Recall that the Z combinator makes recursion possible: $\text{fact} = Z(\text{fact-maker})$. Substituting this into the above equation, we obtain $\text{fact-maker}(Z(\text{fact-maker})) = Z(\text{fact-maker})$.
In general, for any recursion maker $f$, $Z(f) = f(Z(f))$. The importance of this identity is that $Z(f) = f(Z(f)) = f(f(Z(f))) = f(f(f(Z(f)))) = \dots$, which creates an infinite tower of the recursion maker $f$, making recursion possible.
Yet another proof
We can also show that $Z(f) = f(Z(f))$ directly using the definition of $Z$:
$$\begin{align*}
Z &= \lam{f}{(\lam{x}{f (\lam{v}{(x x) v})})(\lam{x}{f (\lam{v}{(x x) v})})} & \\
Z(f) &= (\lam{\color{red}{x}}{f (\lam{v}{(\color{red}{x x}) v})})(\lam{x}{f (\lam{v}{(x x) v})}) & \\
&= f (\lam{\color{blue}{v}}{((\lam{x}{f (\lam{v}{(x x) v})}) (\lam{x}{f (\lam{v}{(x x) v})})) \color{blue}{v}}) & \beta\text{-reduction}\\
&= f ((\lam{x}{f (\lam{v}{(x x) v})}) (\lam{x}{f (\lam{v}{(x x) v})})) & \eta\text{-conversion}\\
&= f (Z(f)) & \\
\end{align*}$$
Y vs. Z #
How are `‹make-recursion›`, $Z$, and $Y$ different? We can test them empirically:

- The original `‹make-recursion›`: 30
- The Z combinator: 30
- The Y combinator: non-terminating
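A sketch of the empirical test for $Z$ ($Y$ is omitted here, because under Racket’s evaluation `(Y fact-maker)` loops forever):

```racket
(define Z
  (λ (f)
    ((λ (x) (f (λ (v) ((x x) v))))
     (λ (x) (f (λ (v) ((x x) v)))))))

(define fact-maker
  (λ (fact)
    (λ (n)
      (if (zero? n) 1 (* n (fact (- n 1)))))))

(+ ((Z fact-maker) 3) ((Z fact-maker) 4))
```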
Uh oh.
We derived the Y combinator from the Z combinator via $\eta$-conversion, which intuitively should preserve semantics. Yet something clearly goes wrong when we try to use the Y combinator to compute factorials, since the computation doesn’t terminate (while the Z combinator works as expected). In fact, the problem is much worse: any attempt to use `Y` will always result in an infinite loop in Racket!
Recall the following:
if $e$ evaluates to a function of one argument, then $e$ is observationally equivalent to $\lam{x}{e x}$ (provided that $x$ does not freely occur in $e$).
and how we convert $Z$:
to $Y$:
In Racket, when we evaluate, say, `(Y fact-maker)`, does `(x x)` evaluate to a function of one argument? We can manually step through the evaluation to see what’s going on:
```
1. (Y fact-maker)
   ==> ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) fact-maker)

2. ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) fact-maker)
   ==> ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) (λ (fact) ...))

3. ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) (λ (fact) ...))
   ==> ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x))))

4. ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x))))
   ==> ((λ (fact) ...) ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x)))))

5. ((λ (fact) ...) ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x)))))
   ==> ((λ (fact) ...) ((λ (fact) ...) ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x))))))

6. ((λ (fact) ...) ((λ (fact) ...) ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x))))))
   ==> ((λ (fact) ...) ((λ (fact) ...) ((λ (fact) ...) ((λ (x) ((λ (fact) ...) (x x))) (λ (x) ((λ (fact) ...) (x x)))))))

...
```
The answer is no! Hence, in Racket, our $\eta$-conversion turning $Z$ into $Y$ is unsound!
While Racket evaluates the term `(Y fact-maker)` exactly as described above, we can imagine another way to evaluate the term:
```
1. (Y fact-maker)
   ==> ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) fact-maker)

2. ((λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))) fact-maker)
   ==> ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x))))

3. ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x))))
   ==> (fact-maker ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x)))))

4. (fact-maker ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x)))))
   ==> ((λ (fact) ...) ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x)))))

5. ((λ (fact) ...) ((λ (x) (fact-maker (x x))) (λ (x) (fact-maker (x x)))))
   ==> (λ (n)
         (case n
           [(0) 1]
           [else (* n (((λ (x) (fact-maker (x x)))
                        (λ (x) (fact-maker (x x))))
                       (- n 1)))]))
```
In this evaluation trace, `(Y fact-maker)` successfully evaluates to the factorial function! The difference between the two evaluation traces is due to evaluation strategies, which determine which reducible expression (redex) should be $\beta$-reduced next.
Most programming languages, including Racket, employ the first evaluation strategy presented above, which always evaluates the arguments of a function application before performing the application itself. This strategy is known as call-by-value. As another example of how this strategy works, it would evaluate `(double (double 1))`, where `double` is defined as `(λ (x) (+ x x))`, in the following way:
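A sketch of the call-by-value trace (my reconstruction):

```racket
;; call-by-value: evaluate the argument first (4 steps)
;;   (double (double 1))
;;   ==> (double (+ 1 1))
;;   ==> (double 2)
;;   ==> (+ 2 2)
;;   ==> 4
(define double (λ (x) (+ x x)))
(double (double 1))
```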
On the other hand, the second strategy presented above performs the function application without waiting for the arguments to be evaluated first. This strategy is known as call-by-name. It would evaluate `(double (double 1))` in the following way:
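A sketch of the call-by-name trace (my reconstruction; the final value is the same, but it takes more steps):

```racket
;; call-by-name: substitute the unevaluated argument (6 steps)
;;   (double (double 1))
;;   ==> (+ (double 1) (double 1))
;;   ==> (+ (+ 1 1) (double 1))
;;   ==> (+ 2 (double 1))
;;   ==> (+ 2 (+ 1 1))
;;   ==> (+ 2 2)
;;   ==> 4
(define double (λ (x) (+ x x)))
(double (double 1))
```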
We have seen that different evaluation strategies might not evaluate a term to the same result. This is disconcerting. Fortunately, the Church-Rosser theorem guarantees that, given a term $t$, if two strategies successfully evaluate $t$ to values, then the two resulting values are equal, as we can see from the `(double (double 1))` example. The only case in which two evaluation strategies evaluate a term to different results is when one of them doesn’t terminate, which is exactly what happens with the Y combinator under the call-by-value strategy.
There are advantages and disadvantages to each evaluation strategy. For instance, the call-by-name strategy has a beautiful property: if any evaluation strategy can evaluate a term to a value, then call-by-name can evaluate that term to the value as well. This is why the strategy successfully evaluates `(Y fact-maker)`. The disadvantages are that (1) it’s not very efficient, taking 6 steps to evaluate `(double (double 1))` while the call-by-value strategy needs only 4, and (2) it doesn’t interact well with mutation, a feature that most programming languages have. This is why most programming languages use call-by-value instead.
Due to the call-by-value strategy, we can’t run the Y combinator in the standard Racket language. One cool feature of Racket, however, is its ability to create and use a new programming language via the hash lang line. In fact, the creators of Racket even call Racket a Programmable Programming Language. It is very easy to create a call-by-name language with Racket. Even better, someone has done it already, and we can just use it!
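A sketch using Racket’s bundled `lazy` language (technically call-by-need rather than call-by-name, but either way arguments are not evaluated eagerly, so the Y combinator runs):

```racket
#lang lazy

(define Y (λ (f) ((λ (x) (f (x x))) (λ (x) (f (x x))))))

(define fact
  (Y (λ (fact)
       (λ (n)
         (if (zero? n) 1 (* n (fact (- n 1))))))))

(+ (fact 3) (fact 4))
```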
30
The final remark in this section is that the Z combinator is also known as the call-by-value Y combinator, since it is the Y combinator for call-by-value languages.
Y and me #
I learned the Y combinator two years ago in the PL class taught by Shriram Krishnamurthi. One great explanation (which Shriram pointed us to) is Matthias Felleisen’s A Lecture on the Why of Y, which similarly attempts to derive the Y combinator. However, the part of that explanation corresponding to the “Enabling recursion” section goes at a much slower pace and implicitly contains some elements from the “Fixed point” section. The strength of that approach, in my opinion, is that it crucially uses the fixed-point identity $Y(f) = f(Y(f))$ to justify the “self-application trick” in the derivation right away, so it motivates really well why $Y(f) = f(Y(f))$ is important. The approach I use, on the other hand, simply focuses on where things are bound and how we can direct values to the appropriate places, which in my opinion results in a simpler explanation. I also make heavy use of `let`, which I think helps with readability. The weakness is that I also need to talk about desugaring.
I didn’t discover this approach myself. The core insight comes from a student’s homework that I graded in the current iteration of the PL class. Part of the homework asks students to write random programs in a language that is very similar to Racket but has no `letrec` and no mutation. I didn’t expect that students would be able to write a really meaningful program unless they already knew a fixed-point combinator... or so I thought. It turns out that one student wrote a program to calculate $1 + 2 + \dots + n$ for arbitrary $n$ without knowing any fixed-point combinator! Here’s the program in Racket:
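A hedged reconstruction of the student’s program (the original is not shown here; this version uses method-A-style self-application, written with `let` and uncurried arguments):

```racket
;; sum 1 + 2 + ... + n by passing the function to itself
(let ([sum (λ (sum n)
             (if (zero? n) 0 (+ n (sum sum (- n 1)))))])
  (sum sum 10))
```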
55
Granted, this is simply a rewrite of one of Matthias’s slides into the `let` form with uncurrying, but it really opened my eyes.
Acknowledgements #
Since our grading policy dictates that we can’t look at students’ names, I don’t know who this student is. Regardless, I would like to thank them for this incredible insight. I would also like to thank Shriram and Matthias for their teaching. Lastly, I thank James Wilcox for his various suggestions on this post.