Memoizing

How to Take Good Notes

By Matt Neuburg

Matt Neuburg is the author of REALbasic: The Definitive Guide, and a member of the editorial board of REALbasic Developer. This article was originally published in REALbasic Developer Issue 1.5 (April/May 2003).

The term “memoizing” is not a misspelling of “memorizing,” but the two notions are related. Memoizing is a technique that can come in handy in programming situations where you’re performing a calculation that has input, and the same input always yields the same result. The motivation for memoizing arises when the calculation is lengthy and is likely to be performed on the same input more than once. Memoizing itself is a cross between simply performing the calculation, on the one hand, and looking up the answer in a table, on the other.

Perhaps before going any further I should pause to point out that a lookup table is itself an important programming technique that is often neglected by beginners — perhaps because it seems too easy, or too obvious. For example, suppose you had to implement REALbasic’s Hex function, which translates a decimal integer to a hex string. What this really boils down to is translating a value between 0 and 15 into a hex digit, so let’s concentrate on that. Clearly there are two cases. If the value is between 0 and 9, we can use it as an offset from the ascii value of the character "0". If it’s between 10 and 15, we can subtract 10 and use the result as an offset from the ascii value of the character "A".

Hexify By Calculation:

Function hexify(i as integer) As string
   dim base as integer
   if i < 10 then
      base = asc("0")
      return chr(base + i)
   else
      base = asc("A")
      return chr(base + i - 10)
   end
End Function

This works, but it is what a more experienced programmer might refer to as “over-engineered”. There are only sixteen discrete input values, and we know the right answer in every case, so why not cut right to the chase? We start with a table where the input value functions as an index to look up the output value, and we just look up the answer. In the following implementation, the table is expressed as a string.

Hexify By Lookup

Function hexify(i as integer) As string
   dim table as string
   table = "0123456789ABCDEF"
   return mid(table,i+1,1)
End Function

If you think this sort of brute-force solution is beneath your dignity as an ingenious programmer, think again. Lookup tables are a perfectly valid form of implementation, and are used more often than you might think because they can be much, much faster than calculation.

Memoization, as I said before, is a cross between calculation and a lookup table. You typically use it when you don’t know the right answers off the top of your head, and you’re not willing to have your program waste any time performing the calculations ahead of time to construct the table. Instead, you maintain a table which is empty to start with, and then, when the time comes to perform the calculation, you stop and look at the table. If the table doesn’t hold the value you need, because the program has never performed the calculation for the given input, you perform the calculation and enter it in the table before returning it. This is the “memo” referred to by the term “memoization”; you not only calculate the answer, you also write it down. That way, the next time you have to perform the calculation for the same input, you see that it’s already in the table and just return the answer immediately by looking it up.

Let’s start with an example that’s only a little more involved than the one we just used. Instead of returning the hex string for an input between 0 and 15, we’ll return the hex string for an input between 0 and 255. Why isn’t this as simple as using the built-in Hex function? Because we want to make sure that the resulting string is two characters long. This is called “enhexing”, and is common when we have to communicate a byte-string across the Internet. For example, the REALbasic 5 MD5 function generates a sequence of bytes. This sequence is inappropriate for visual representation, and impossible to send in an email. Therefore we enhex the byte-string into a series of character pairs, where every character is a hex digit.

Our task, then, is to cycle through a byte-string, translate each byte into a two-digit hex string, concatenate all those hex strings, and return the result. The problem is described on p. 191 of my book, and we’ll start with the same solution. To maximize speed, we avoid the overhead of string concatenation by depositing the two-digit hex strings in a memoryblock.

Enhexing a Byte-String

Function doEnhex(s as string) As string
   dim m,m2 as memoryblock
   dim i,j,u as integer
   u = lenb(s)
   m = newmemoryBlock(u + 1)
   m2 = newmemoryblock(2*u+1)
   m.cstring(0) = s
   u = u - 1
   for i = 0 to u
      m2.cstring(j) = hexabyte(m.byte(i))
      j = j + 2
   next
   return m2.cstring(0)
End Function

The question is how to implement the Hexabyte function. One way is to calculate the two-digit hex string from scratch.

Hexabyte by Calculation

Function hexabyte(i as integer) As string
   dim ss as string
   ss = hex(i)
   while lenb(ss) < 2
      ss = "0" + ss
   wend
   return ss
End Function

This works fine, but if we’re looking to maximize speed, it seems clumsy. Most of the values returned by Hex are two digits; yet we’re wasting time testing all of them for length. And we’re wasting even more time prefixing "0" to the ones that need it. Also, the overhead of calling Hex is completely unnecessary, because it must be one of just 255 distinct values. We may as well just use a lookup table. Let’s suppose we’ve constructed this table beforehand. Then we can simply use i as an index.

Hexabyte by Lookup

Function hexabyte(i as integer) As string
   return table(i)
End Function

There is, however, a middle road — memoization. We build the table on demand, for any particular input values we happen to be given. That way, we waste no time constructing the table beforehand, and we avoid the calculation altogether for any input values we happen never to be given. But once we’ve done the calculation for any particular input value, we never have to do it again. The implementation is the same as our original calculation, but with some additional steps to check the table at the start and help build it at the end.

Hexabyte by Memoization

Function hexabyte(i as integer) As string
   dim ss as string
   ss = table(i)
   if ss <> "" then // it’s already in the table!
      return ss // just use the table entry
   end
  // it’s not in the table, so we must calculate it
   ss = hex(i)
   while lenb(ss) < 2
      ss = "0" + ss
   wend
   table(i) = ss // help build the table before returning
   return ss
End Function

This illustrates one of the paradoxes of memoization: on the surface, it looks longer and more involved, and less elegant, and even less legible than simple calculation. Well, that’s the way it is in the world of computers. Sometimes there are things more important than the surface elegance or legibility of an implementation — like making the best use at runtime of those most precious computing resources, time and space.

It happens that in this particular example there isn’t much motivation to memoize; but it isn’t hard to imagine, from the example, what the motivation would be in real life. It might be, for instance, that the calculation takes too long to perform in advance for all the possible input values, or that the answers simply cannot be obtained in advance.

A style of problem where this motivation comes immediately into play is where the calculation involves recursion, and especially when the recursion on its own threatens to become dangerously inefficient. Take, for instance, the Knapsack problem. Here it is.

We have an array of items of class Thing, where every Thing has a Size and a Value, which are integers. Let’s call the array Shoebox, but keep in mind that these Things represent types, not individual entities. We imagine that we have a knapsack of limited capacity. The object is to fill the knapsack in such a way as to maximize the value of its contents. We may use as many as we like of each Thing in the Shoebox (that’s why I say they are ultimately types), so long as the total of their Sizes is less than the capacity of the knapsack, and we wish to maximize the resulting total of their Values.

An obvious solution is to use recursion, which stems from the observation that if we have some items in the knapsack, the problem is exactly the same except that the capacity of the knapsack is less. This means that if we pretend that the problem is solved for this smaller capacity, then all we have to do is try every single Thing in the Shoebox, looking for the one that gives the maximum resulting Value total without causing the Size total to exceed the capacity.

Here’s an implementation of this recursion. Shoebox is assumed to be global. We call Solve, handing it the capacity of the empty knapsack, and we get back the maximum value achievable with the Things in the Shoebox.

Recursive Solution to Knapsack Problem

Function solve(capacity as integer) As integer
   dim t as thing
   dim maxValue, total as integer
   dim remainingCapacity as integer
   for each t in shoebox
      remainingCapacity = capacity - t.size
      if remainingCapacity >= 0 then
         total = t.value + solve(remainingCapacity)
         if total > maxValue then
            maxValue = total
         end
      end
   next
   return maxValue
End Function

There are, however, two drawbacks. One is that we still don’t really know the solution; we know the maximum achievable Value total, but we don’t know what Things were used to achieve it. The other is that the recursion threatens to get out of hand very quickly as the number of Things or the capacity of the knapsack become large.

The solution to both drawbacks is memoization. The input to Solve is the knapsack’s capacity, so the empty knapsack’s capacity is how big our array will have to be; we will be indexing on each given capacity. Actually, we’ll keep two arrays: one to hold the maxValue calculated for each capacity, and the other to hold a pointer to the best Thing for each capacity, this being the Thing that yielded the maxValue. These arrays are called MaxValueForCapacity and BestThingForCapacity, and are assumed to be global.

The solution is a huge improvement; it turns exponential processing time into linear processing time. This is not the place for a rigorous proof, but just think about what the recursion does. We start by asking to Solve the whole capacity of the knapsack. But this calculation is immediately suspended while we Solve some smaller capacity. But this second calculation is suspended too, and so on. Finally, at the bottom of the recursion, we do an actual calculation for an actual (small) capacity, and memoize it. So the very first calculation we actually do is memoized, and furthermore this first calculation is for a small capacity, so there is a very good chance that it will be needed again — and when it is, the answer will be looked up, not calculated. Now generalize: All the smallest capacity values are the ones that are calculated and memoized first, so our efficiency is vastly increased right from the start. But what we’re doing is also vastly better than calculating a lookup table beforehand, because we waste no time calculating any capacity values that aren’t actually needed for this particular configuration of Things and initial capacity.

Recursive Memoizing Solution to Knapsack Problem

Function solve(capacity as integer) As integer
   dim t as thing
   dim bestThing as thing
   dim maxValue, total as integer
   dim remainingCapacity as integer
   if bestThingForCapacity(capacity) <> nil then
      return maxValueForCapacity(capacity)
   end
   for each t in shoebox
      remainingCapacity = capacity - t.size
      if remainingCapacity >= 0 then
         total = t.value + solve(remainingCapacity)
         if total > maxValue then
            maxValue = total
            bestThing = t 
         end
      end
   next
   maxValueForCapacity(capacity) = maxValue 
   bestThingForCapacity(capacity) = bestThing
   return maxValue
End Function

After the recursion ends, we can learn the solution as follows. Start with the capacity of the knapsack and look at the BestThingForCapacity for that capacity. That Thing is in the solution. Now subtract that Thing’s Size from the capacity, and look at the BestThingForCapacity for the remaining capacity. That Thing is in the solution too. And so on.

In these examples, the inputs were integers. That, of course, is what made it so easy to implement memoization by use of an array (because an array has integral indices). Memoization is harder if the inputs aren’t integers; it might, in fact, be impossible. But not necessarily! Remember, we learned about hashing in an earlier installment of this column. As long as the input values can be used as keys in hashing, we can use them to store and retrieve the corresponding outputs. Thus, you can often memoize using REALbasic’s Dictionary class even when you can’t use an array. Similarly, the fact that a function might have more than one input parameter is not necessarily any bar to the use of memoization, since you might be able to use a multi-dimensional array, or a Dictionary of Dictionaries.