I love exploring weird ideas. Among my colleagues, I'm rather known for it. I like finding unusual solutions to problems in code, and more than a few have panned out.
Today's crazy idea wasn't one of those.
As developers, we often love to share our successes. We'll even venture to show what we don't know by asking questions publicly. But we rarely share our failures.
Honestly, I wonder why. Maybe it's our impostor syndrome talking, or maybe we're just afraid of shaming and destructive criticism from other developers.
Whatever the reason, it really is a pity. I've made my share of mistakes, and I can say that I learn far more when things go wrong than when they go right!
So I humbly offer my own misadventure into an ill-conceived algorithm from this afternoon.
It all started when I was chatting with my programmer friend Richard on IRC about some code I'd just finished reviewing. He asked for the link, which I shared, and he pointed out a couple of improvements he could see. This led to a brief discussion of iterators and how to best remove elements from arrays, which segued into my mentioning FlexArray.
An idea started forming in my head. You know how, when you remove an element from an array, you have to shift all the elements after it to the left one position? That's one of those annoyingly expensive things that we can never really get around in sequential arrays.
I thought, "hm, wouldn't it be nice if we could somehow just mark an element as dead, sort of like you do in an object pool, and work around it? You could put off all those expensive removals until when you wanted to deal with them, if ever in some cases."
In theory, it makes perfect sense. If you lay out 20 playing cards on a table and remove one from the middle, you don't actually have to slide the rightmost 10 over to fill the spot. You can just decide to leave the spot blank. It saves a lot of time!
But, SPOILER ALERT, computers don't think like people. This is a fundamental fact that none of us can ever quite get through our heads. Sooner or later, we will all make the assumption that a computer is capable of the same mental shortcuts we are, simply because we aren't aware of those shortcuts ourselves half the time.
For example, any of us can glance into a room with 30 people and pick out the tallest person within moments. A computer, however, has to run a fairly complex algorithm to do the same thing. Our brains work faster than computers on certain problems. Shoot, our brains automatically solve differential equations to calculate how long it will take for a passing car to reach a given point, and we never even know it! We also have considerably larger memory caches, faster I/O, far more inputs, and infinitely better multiprocessing.
Just think of it - you're walking around with a three-pound, 100TB RAM, ultra-parallel processing supercomputer in your skull. No wonder AI is taking so darn long to catch up.
Of course, I hadn't yet realized my own onboard supercomputer had majorly borked.
One thing I have learned is to plan EVERYTHING before writing major code. Out came the whiteboard, and I began to hash out how this could work.
Actual Index:  0   1   2   3   4   5   6   7
Offset Index:  0   1       2   3       4   5
(X = ignored) [ ] [ ] [X] [ ] [ ] [X] [ ] [ ]
I figured that if I were to mark an element as dead, perhaps with a magic value like "0xDEADBEEF" or a boolean flag, I'd need to be able to offset indices to account for all those elements we were ignoring. So, if the user wanted index 2 in the above example, they ACTUALLY wanted internal index 3. How could I predictably offset the indices so the user always accessed the intended element? That was the essential problem. If I could beat this, I'd be able to do the rest easily!
(By the way - don't EVER use magic values in user-facing data structures that store arbitrary data. 0xDEADBEEF may be a perfectly valid value to the user. That kind of prestidigitation is only acceptable if you absolutely control all the data going in.)
The quick-and-dirty way would obviously be to increment one element at a time, and if the element is marked as ignored (or "dead"), jump over it. However, I knew this approach would considerably slow down FlexArray, which currently accesses elements using pointer arithmetic.
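Just to make that concrete, here's a rough sketch of what that naive skip-over lookup might look like. This is purely illustrative - the dead flag array and the linear_offset function are my own inventions for the example, not FlexArray internals:

    #include <cstddef>

    // Hypothetical sketch: walk the array one element at a time, skipping
    // anything flagged as dead, until we've passed over i live elements.
    // 'dead' is an imaginary parallel array of flags, one per slot.
    std::size_t linear_offset(const bool* dead, std::size_t capacity, std::size_t i)
    {
        std::size_t live_seen = 0;
        for (std::size_t n = 0; n < capacity; ++n)
        {
            if (dead[n])
            {
                continue;  // jump over the dead element
            }
            if (live_seen == i)
            {
                return n;  // n is the internal index for user index i
            }
            ++live_seen;
        }
        return capacity;  // i was out of range
    }

A loop like that is O(n) per access in the worst case, compared to the single pointer-arithmetic step FlexArray does now.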
Then it struck me: if I'm storing a single true/false value ("it's dead, Jim") per element, I simply need a bitset the same length as the internal array. Then I can just work in binary, which is always going to expose the most efficient possible solution to a problem. I'm reasonably comfortable in binary, so I was confident I could work out a perfectly brilliant solution to this problem.
It didn't take me long to work out the math for reliably accessing the offset index every time.
I'd start with the bits representing whether the element was usable (0) or dead (1). I decided to name it db (for "DeadBeef", not "DataBase"). I would also have the index (i), provided by the user, of the desired element.
db = 00100100
i = 5
For the math to work reliably, I'd need to work in 0-based numbers, so I would need to offset my length (l), basically making it the same as the largest possible index.
l = 7
I knew I would need to ignore all the elements after the index we were trying to access. A simple bitshift would throw away any unneeded information, if there were any dead elements after the index.
db2 = db << (l - i)
Now I only needed to count the number of 1s in db2, and I'd know exactly how much to add to i to get the internal index! I was so close, I could taste it.
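Plugging the whiteboard values into that formula (my own worked example, reading the leftmost bit as index 0, as in the diagram above):

    db  = 00100100          (dead elements at internal indices 2 and 5)
    i   = 5                 (user-facing index)
    l   = 7                 (largest internal index)

    db2 = db << (l - i)     shifting by 7 - 5 = 2 throws away any bits after index 5
    count of 1s in db2 = 2  (the dead elements at indices 2 and 5)
    internal index = i + count = 5 + 2 = 7

And internal index 7 is exactly where user-facing index 5 lands in the whiteboard diagram. On paper, at least, the math held up.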
So, how do you count the number of 1s in binary? A quick search on DuckDuckGo revealed this to be known as "population count", also sometimes called a Hamming weight.
Thankfully, I own a nifty little book called Hacker's Delight by Henry S. Warren Jr., which contains detailed discussions about how to perform complex binary math efficiently. I flipped to the chapter about population counts (Chapter 5, if you have the book) and started reading.
It turned out, there was a lot more involved in counting 1-bits than just a quick bit of math. See, to get the population count of a single byte is fairly trivial - just a few instructions, and we're good! However, if we're counting within more than just one byte, things get trickier. This is because the obvious approach - get the population for each byte and add it to an accumulator - is actually one of the least efficient ways to do it. Instead, we can use a carry-save adder (or CSA) to speed this up.
In short, if you have the CSA adding things in groups of 8, you'll spend 8 instructions per byte to get the population count of an array of any size. (That would be O(n), for those of you following along with the algorithmic stuff.) I knew it was possible to make this slightly more efficient by having the CSA process in groups of 16 (7 instructions per byte) or groups of 32 (6.5 instructions per byte), but 8 made for easy math (as you'll see shortly) and a pretty close approximation.
(If you want to know more about how this works, I highly recommend picking up a copy of Hacker's Delight.)
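For the curious, here's roughly what that "obvious" baseline looks like: a textbook divide-and-conquer count for a single 32-bit word (along the lines of the formulations Hacker's Delight discusses), accumulated across an array one word at a time. This is my own sketch of the naive approach the book improves on, not the CSA version:

    #include <cstddef>
    #include <cstdint>

    // Classic divide-and-conquer population count for one 32-bit word.
    unsigned pop32(std::uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);                 // 2-bit sums
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // 4-bit sums
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 // 8-bit sums
        return (x * 0x01010101u) >> 24;                   // add the four byte sums
    }

    // The "obvious" approach: count each word and add it to an accumulator.
    // This is the baseline that the carry-save adder trick beats.
    std::size_t pop_array(const std::uint32_t* words, std::size_t n)
    {
        std::size_t total = 0;
        for (std::size_t w = 0; w < n; ++w)
        {
            total += pop32(words[w]);
        }
        return total;
    }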
Still, something wasn't quite right, and I began to suspect that this idea might fall apart.
This expense of 8n instructions was incurred on every element access. Data caching wouldn't really overcome it, because I'd have to store the offset for every conceivable index (a space expense of n bytes for the whole data structure), and even then, I'd still have to run the operation the first time any given index was accessed. If I added or removed elements, I'd have to reevaluate everything, more or less.
Still, I decided to take this experiment as far as it would go. Perhaps the expense would still be less than the traditional shift-everything situation? Maybe it would be worth it in certain situations?
When copying an element from position n to position n-1, you typically have two instructions: (1) copy from n to cache, and (2) copy from cache to n-1. (This is because many systems cannot do straight memory-to-memory copies.)
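As a hypothetical sketch (not FlexArray's actual code), the traditional removal boils down to a loop like this, paying roughly those two instructions per shifted element:

    #include <cstddef>

    // Sketch of a traditional sequential-array removal: shift every element
    // after 'index' back by one slot, then shrink the array by one.
    template <typename T>
    void remove_at(T* data, std::size_t& size, std::size_t index)
    {
        for (std::size_t n = index + 1; n < size; ++n)
        {
            T tmp = data[n];    // (1) copy element n out to a temporary
            data[n - 1] = tmp;  // (2) copy it back into slot n - 1
        }
        --size;
    }

For index 0 in a 1000-element array, that loop runs 999 times, which is where the 1998-instruction figure below comes from.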
Interestingly, both the "DEADBEEF" access-element operation and the traditional remove-element operation share the same worst-case scenario (removing element 0), which made comparison easier.
Given an array with 1000 elements, removing element 0 would require 999 elements to be shifted back one position, for a total of 1998 instructions.
Given the same array and a "dead" element 0, accessing any element would require the offset to be calculated, which would cost 8 instructions per byte. Each bit represented one element, and there were 8 bits in a byte, so the instruction count would be 8 * (n/8), or (conveniently) just n. In other words, to calculate the index offset on a 1000-element array, I'd need 1000 instructions, plus a few more for the rest of the math.
Or, basically, I'd be looking at roughly HALF of the instruction count for removing an element, on every single access. Ow.
Now, a few C++ programmers are probably screaming at me right now, because std::bitset already has a count() function!
Don't worry - I figured the standard library probably had something along those lines from the start, but I always work these sorts of problems the same way: shed abstraction, work it out low-level, and then compare the available abstractions to the ideal algorithm.
And, as expected, std::bitset::count benchmarked right where I suspected it would, given that the aforementioned binary math was the fastest known way to do a population count. (I think they use groups of 32, based on the benchmark you'll see in a moment.)
Here's my code to test my math and assumptions.
#include <bitset>
#include <iostream>

#include "pawlib/flex_array.hpp"

// Calculate the internal index for user-facing index i, given the "dead"
// bitset db and the array length l.
int offset_index(int i, std::bitset<1024> db, int l)
{
    // Mark two elements as dead for the sake of the test.
    db.set(200, 1);
    db.set(400, 1);

    // Work with the largest possible index rather than the length.
    l = l - 1;

    // Throw away the bits after index i, then count the remaining 1s.
    std::bitset<1024> db2 = db << (l - i);
    return i + db2.count();
}

int main()
{
    std::bitset<1024> db;
    int l = 1024;
    int i = 512;

    // Fill a FlexArray with 1024 elements.
    pawlib::FlexArray<int> arr;
    for (int n = 0; n < l; ++n)
    {
        arr.push(n);
    }
    std::cout << arr.getSize() << std::endl;

    // The two operations being compared: calculating the offset index
    // versus actually removing an element (and shifting everything after it).
    int oi = offset_index(i, db, l);
    arr.yank(i);
    std::cout << oi << std::endl;

    return 0;
}
Compiling that, running it through valgrind --tool=cachegrind, and then viewing the Estimated Cycles for the functions yank() (which removes the element at the given index) and offset_index() confirmed my suspicions.
pawlib::FlexArray<int>::yank(unsigned int): 476 cycles
offset_index(int, std::bitset<1024ul>, int): 261 cycles
The numbers put the final nail in the coffin. This idea was a bad one.
And, there we have it: the story of how I came up with an allegedly brilliant idea, tested it, and then blew it to smithereens, Mythbusters-style.
So, what have we learned today?
1) Computers don't think like we do, no matter how hard we try to make them.
2) Often, the most obvious solution isn't the fastest. I'd never have thought that just adding up the population counts for a bunch of bytes was one of the slowest ways of attacking the problem!
3) Shedding abstractions and working at a lower level gives a whole new perspective to a problem...and potentially saves you hours, days, weeks, or months of pain and heartache. Just imagine if I had plowed ahead with this idea, without first inquiring into its inner workings!
4) Premature optimization may be the root of all [kinds of] evil, but failure to account for optimization yields just as much. Strike a healthy balance.
I should also caution you about something that is NOT a lesson here! Some would say, "You should assume that the standard library authors know what they're doing, and you'll never be faster." There was a time I might have believed that, but after I watched FlexArray actually beat std::vector last year, I learned that isn't always the case. People always come up with new and better ideas.
This idea may not be new. It certainly wasn't better. But it's there, and maybe others can learn from it!
Your turn - what programming mistakes have you made that you've learned from?