I've occasionally received questions and comments about that post in other forums, but today I got a couple of comments from two well-intentioned commenters that highlighted some very interesting, and actually very common, misunderstandings about how modern computers and compilers work. Unfortunately, a couple of their comments have since been removed (I didn't remove them; I only remove obvious spam comments), so I will summarize the questions after I briefly describe what I built.
What I built
In C, I wrote a handful of lines of code that took a sorted sequence of 4-byte integers representing a Twitter user's followers, and intersected that sequence against thousands of similar sequences for other Twitter users. By being careful, we went from taking roughly 7 seconds to intersect that data in Redis, to taking 6 milliseconds to intersect it with our custom software. On the memory side of things, we reduced our memory requirements by roughly 20x with our method over the standard Redis method.

Let me be clear on this: Redis is a great tool. I still use it every day, even though I've left Adly for ChowNow, and I'm writing Redis in Action for Manning Publications (I should be working on it now, but I wanted to blog on a Friday night). But for some problems, like the one we had, custom software can be a huge win. On with the questions!
Question: Why not use the high-performance C++ STL set intersection code?
Updated Answer: The STL is a pain to work with, and my experience on the C++ side of things (6 months professionally) told me that I would find neither performance nor simplicity by using it. On top of that, we didn't actually need the list of common followers, only the count, so even if we had gone through the pain of writing it in C++, we'd have computed a result we didn't need and would have had to write our own intersection code in the end anyway.
I originally started with code very similar to the C++ STL set intersection algorithm and was dissatisfied with the results. By observing that the longer sequence will skip over more items than the shorter sequence, I wrote a 2-level loop that skips over runs of items in the longer sequence before performing comparisons against the shorter sequence. That got me roughly a 2x performance advantage over the STL-style variant.
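For the curious, here is a minimal sketch of that idea in C. This is not the original code (the names and structure are mine), and it assumes both arrays are sorted and duplicate-free, which holds because they are sets:

    #include <stddef.h>
    #include <stdint.h>

    /* Count the intersection of two sorted arrays of 4-byte user ids.
       The inner loop skips runs of entries in the longer sequence that
       are strictly less than the current entry of the shorter one. */
    size_t intersect_count(const uint32_t *small, size_t ns,
                           const uint32_t *large, size_t nl)
    {
        size_t i = 0, j = 0, count = 0;
        while (i < ns && j < nl) {
            while (j < nl && large[j] < small[i])
                j++;                      /* tight, sequential skip */
            if (j < nl && large[j] == small[i]) {
                count++;
                j++;
            }
            i++;
        }
        return count;
    }

Because the inner loop does nothing but compare and increment over sequential memory, it produces exactly the kind of predictable access pattern the rest of this post is about.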
There have been a few comments about my STL performance concerns, which I wanted to respond to.

Probably mostly wrong portion of the original answer:
My original post mentioned performance issues with using vtables in modern C++. My experience from ~4 years ago was using a combination of the STL and other C++ classes. I may have been incorrectly attributing the relative low performance of some operations to method dispatch, instead of some of the quality of containers we were using. Let me explain.
In 2004-2005, I built a hash table as part of a search engine in C, which was used as a set (no values, only keys). It used 64-bit hashes and 64-bit keys on a 32-bit machine. Using this hash set, I was able to add/update/remove items at a rate of 8 million items/second on a mobile 1.2 GHz Intel Core Solo.
In 2007-2008, I used std::hash_set to store a set of 64-bit integers, and I was surprised to find that it could only add/update/remove 3 million items/second, despite running on a 3.0 GHz Intel P4 (enabling or disabling hyperthreading only affected performance by 5-10% on that problem). I shrugged it off and called it a vtable problem*, which was a naive conclusion that I'd not revisited until now. Rethinking it, it was more likely the combination of the inefficiency of the P4 architecture (specifically its pipeline depth: 11-13 stages on the Core Solo vs. 20+ on the P4) and the specifics of the hash table implementation itself.
* I had been willing to quickly jump to the conclusion that it was the language/compiler because I was used to that on the Python side of things. No, really. Why is standard Python slow compared to C/C++/Java? It's simply a matter of time and money not having been spent researching and implementing JIT and/or static compilers. Look at Boo (which uses static types), PyPy (which is built on a restricted subset of Python), or Unladen Swallow (which used an LLVM-backed JIT); once time and money are spent trying to improve Python performance, it happens. Incidentally, that's also the main reason the JVM is so fast: billions of dollars have been spent making it faster over the last 15+ years.
Have you ever used the STL? I did, for about 6 months professionally, and I'm glad that I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the painful syntax of C++ templates (and their difficult-to-parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables. If your code has to go through a vtable to dispatch a call, you are spending time on dispatch that you could be spending on real work. For most software, that doesn't matter. But when we are talking about nanoseconds (and we will shortly), every bit of work costs you.
Even ignoring the vtable cost, let's pretend that all of the set code was inlined. Now all of the iterator manipulation and boundary checking sits inside your core intersection loop, which introduces more branches and more branch mispredictions, and that ultimately slows down the intersection code.
Question: Why not use a bloom filter?
Answer: Bloom filters are slower and use more memory.

DRAM is actually really slow. When you make a request to read DRAM, it returns a 64-byte cache line from that region in about 15 nanoseconds (in the best case). That's around 4.3 gigs/second for random access. If you are accessing memory in a predictable pattern, the second block you request is returned in about 10 nanoseconds, and subsequent blocks can be returned in as little as 3-4 nanoseconds. So, after the first few "slow" requests, you can read at a peak rate of roughly 21.3 gigs/second for predictable access patterns on high-end machines.
Back to our case. We intersected a set of 5 million items against a set of 7.5 million items in 6 milliseconds. Doing the math... 4 bytes * (5m + 7.5m) / 6ms -> 8.3 gigs/second. So, we were within a factor of ~2-2.5x of maxing out the memory bandwidth available to the entire system with our 6ms intersection.
Let's pretend that we had already built a bloom filter, and let's even say that the math says we only need 7 probes (the reality would probably be closer to 16 or so). At 15ns per probe (bloom filter probes are effectively random accesses), 7 probes for each of the 5 million users in our input set works out to 525 milliseconds, which is 87 times slower than our intersection method.
But here's the other nasty bit of reality: constructing the bloom filter for that 5 million entry set (because we need a bloom filter for every set) takes at least twice as long as querying it. Why? Because to update 1 bit in the filter, you first need to read the cache line, then update the bit in a register, then write the data back. I know what you are thinking, you are thinking that your processor's write-back caches will help you. But you are wrong. Because you are writing randomly all over your bloom filter, you run out of cache very quickly, and the caching layers within your processor will start blocking it from reading or writing to memory, because they have a backlog of data to flush that is limited to 15ns per random write. So with a (lower-bounded) 525ms to query, that's roughly 1.05 seconds to construct the bloom filter in the best case.
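To make that access pattern concrete, here is a sketch of the per-element work being described. The filter size and the mixing function below are hypothetical, chosen only so the example is self-contained; the point is that each of the 7 probes lands on an effectively random cache line:

    #include <stdint.h>

    #define NBITS (1u << 27)   /* hypothetical 16MB filter (2^27 bits) */
    #define K 7                /* probes per element */

    /* Illustrative mixing function to vary the hash per probe;
       not from the original post. */
    static uint32_t rehash(uint32_t x, uint32_t round)
    {
        x ^= round * 0x9e3779b9u;
        x *= 0x85ebca6bu;
        x ^= x >> 16;
        return x;
    }

    void bloom_add(uint8_t *bits, uint32_t user)
    {
        for (uint32_t r = 0; r < K; r++) {
            uint32_t h = rehash(user, r) & (NBITS - 1);
            bits[h >> 3] |= (uint8_t)(1u << (h & 7)); /* random read-modify-write */
        }
    }

    int bloom_query(const uint8_t *bits, uint32_t user)
    {
        for (uint32_t r = 0; r < K; r++) {
            uint32_t h = rehash(user, r) & (NBITS - 1);
            if (!(bits[h >> 3] & (1u << (h & 7))))
                return 0;   /* definitely not present */
        }
        return 1;           /* probably present */
    }

Note that bloom_add touches just as many random cache lines as bloom_query, plus the write-back traffic, which is where the "at least twice as long" figure comes from.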
Meanwhile, the slow O(n log n) standard C qsort() finishes in 1.2 seconds, which is within spitting distance of the theoretically optimal bloom filter construction we just described. We also only sort once, but intersect thousands of times, so even if the bloom filter did offer a modest construction-time advantage, it wouldn't be worth the performance penalty during intersection.
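For reference, that sort is nothing exotic; the comments below confirm it was C's qsort over the raw 4-byte ids. A minimal version looks like this (a sketch, not the production code):

    #include <stdint.h>
    #include <stdlib.h>

    /* Compare unsigned ids without the overflow risk of subtraction. */
    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a;
        uint32_t y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    void sort_users(uint32_t *users, size_t n)
    {
        qsort(users, n, sizeof(uint32_t), cmp_u32);
    }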
Another drawback of the bloom filter is memory. I know what you are thinking, you are thinking "well, with a 7-probe bloom filter, maybe you only need 2-3 bytes per user, instead of 4 bytes per user". But in order to perform the intersection, you still need your list of users (sorted or not) in addition to the bloom filter, leaving you with 6-7 bytes per user.
Conclusion: bloom filters would be 80-160 times slower to intersect, would use more memory, don't offer a significant construction-time performance advantage (a 16-probe bloom filter would probably be closer to 2x slower than sorting), and are only approximately correct.
The lesson you should learn
The reason our custom solution was so much faster than Redis comes down primarily to memory access patterns and memory efficiency. Large sets in Redis have roughly a 60-80 byte per-entry overhead on a 64-bit machine, most of which needs to be read to intersect one item with another. And because Redis builds sets out of hash buckets of linked lists, intersection can touch even more than those 60-80 bytes per item, scattered across memory. Ours did so well because we reduced our space to 4 bytes per user (from 60-80), only ever read those 4 bytes, and ensured that all of our memory access patterns were sequential. Given the practical realities of modern processors and memory, there is no faster way to do single-processor set intersections.

If you want to see more posts like this, you can buy my book, Redis in Action, from Manning Publications today!
I guess you used the standard C++ std::sort algorithm instead of C's qsort.
Nope, we used C's qsort.
I'd try a skip list.
And you would perform worse. A skiplist offers basically zero advantages over other binary tree structures, at the cost of more pointers. Even in the best case, with a binary tree of one kind or another, you still have 2 pointers per stored value; that's 12 bytes per user on a 32-bit machine (two 4-byte pointers plus the 4-byte value), compared to our 4 bytes.
You would need to examine effectively all of the data during the intersections (3x as much data), and the code for handling the structures itself would be slower than a simple tight loop.
I also wonder what the relation between the left and right set sizes has to be before using a binary_search to find the next element in the remaining data becomes faster?
The larger set would need to be at least 15/3 * log2(n) times larger to make it worthwhile (because binary search behaves, in practice, like a random memory traversal). Basically, you would need a smaller sorted set of roughly 65k items vs. the 7.5m items before binary search could theoretically win in terms of memory access times. You may or may not win on algorithmic efficiency; a good binary search actually has a pretty tight loop (sketched below).
I used this method about 7 years ago in a search engine, and IIRC, we were able to intersect lists of 100k items against each other in about 5 milliseconds on 2.4 GHz P4s (which were horrible from an efficiency standpoint). I wasn't as careful then as I am now, so I wouldn't use that as a comparison to what we did, as that algorithm supported (A u B u C ...) ^ (D u E u F ...) ^ (G u H u I ...) ^ ...
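For reference, the binary-search variant being discussed looks roughly like this (a sketch, under the same assumptions as the earlier one: sorted, duplicate-free arrays):

    #include <stddef.h>
    #include <stdint.h>

    /* For each item in the much smaller set, lower-bound binary search
       in the larger one. Each search costs ~log2(nl) dependent,
       effectively random reads, hence the 15/3 * log2(n) break-even
       estimate above. */
    size_t intersect_count_bsearch(const uint32_t *small, size_t ns,
                                   const uint32_t *large, size_t nl)
    {
        size_t count = 0;
        for (size_t i = 0; i < ns; i++) {
            size_t lo = 0, hi = nl;        /* search [lo, hi) */
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if (large[mid] < small[i])
                    lo = mid + 1;
                else
                    hi = mid;
            }
            if (lo < nl && large[lo] == small[i])
                count++;
        }
        return count;
    }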
This is kinda the best article I've read these past two months. Thanks for delving that deep into C <3
You are welcome :)
I don't get your focus on vtables when talking about drawbacks of the STL. AFAIK there's hardly any polymorphism in it.
But while we are wailing about indirections: using qsort introduces an inner-loop function call to your comparison function. std::sort can inline it.
Anyways, the take-home message from the post is a valid and valuable one: sometimes it's worth the time to custom build.
I've updated the post.
Also, having just read http://radiospiel.org/sorting-in-c-3-times-faster-than-c , you are probably right on the sorting side of things. I'll give it a shot if I ever need to build it again. Thank you.
This comment has been removed by a blog administrator.
This comment has been removed by the author.
Your statement about Redis using linked list + bucket for sets is a bit incorrect, given that you are storing integers.
You are using 32-bit numbers, so Redis would use an IntSet. An IntSet is essentially a sorted array searched via binary search. The elements are stored in the array as either 16-, 32- or 64-bit integers; in your case, it would have used 32 bits per user. The overhead is exactly 8 bytes per set, of which 4 bytes store the number of elements in the set (see the struct excerpted after the reference link below).
Since IntSets are actually sorted in memory, the algorithm Redis uses for intersection is pretty much what you ended up doing.
Totally like the approach you described, but I don't think Redis' performance would have been as bad as you suggest. You are among the most active users on the Redis mailing lists, so I know I don't have to sell Redis to you :-) But perhaps you could do a PoC using intsets in Redis?
Reference: https://github.com/antirez/redis/blob/unstable/src/intset.c
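For readers following along, the layout behind that file (declared in intset.h in the same tree) is just a small header over a flat, sorted array, which is where the 8 bytes of per-set overhead mentioned above come from:

    /* From Redis' intset.h: a packed, sorted array of integers. */
    typedef struct intset {
        uint32_t encoding;   /* 16-, 32- or 64-bit elements */
        uint32_t length;     /* number of elements */
        int8_t contents[];   /* the sorted elements themselves */
    } intset;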
First, intsets are sorted arrays of integers, exactly what we were using; but intsets weren't available in a stable Redis release when I was writing the intersection code.
Second, intsets are typically limited to small sets, on the order of a few hundred or a few thousand elements, and were primarily meant as a way of minimizing overhead for small sets, not as a way to improve performance for large sets. When you go past that limit to the *millions* of elements like we had, Redis converts the intsets into regular hash-table-backed sets.
Third, Redis' algorithm for intersection iterates over one set, then searches in the other set using binary search. See my reply to an earlier comment about using binary search instead of the intersection method I describe and when it may pay off.
Fourth, if intsets had been available, and if they had been efficient for intersecting large sets, we might have tried them. But again, Redis still produces the full result set, so it would have been slower by at least a factor of 2 over what we were able to do with our custom solution.
It seems like there's a subtle detail lurking here: you only need the count, not the list. I'm not familiar with Redis, but I'm guessing the slow original operation you're describing returns the set? And the same for the STL? Etc.
I thought I wasn't subtle about it. But yes, we only needed the count, not the actual items.
A few questions:
1) Can you give me an example of your "sequence of integers that represents a Twitter user's followers"? Is it something like the following?
Twitter user i sequence: [follower_id1, follower_id2, ..., follower_idn]
2) "We intersected a set of 5 million items against a set of 7.5 million items"
Is each "item" a sequence, or something else?
3) I guess I'm not clear on what the final goal(s) are. Is it to cluster users with similar sets of followers?
Take Twitter users A and B:
a_followers = [fol1, fol2, fol3, ...]
b_followers = [fola, folb, folc, ...]
By "items" I meant users. So we are intersecting the sets a_followers and b_followers above.
Not even as interesting as clustering. We just wanted to find audience overlap so that we could say things like, "75% of Khloe Kardashian's followers also follow Kim Kardashian, so if you wanted to minimize your ad spend, you could just have Kim tweet for you. Or if you wanted to get double-coverage for 75% of Khloe's followers, you could add Khloe." This is the "Also Follows" metric described in http://dr-josiah.blogspot.com/2011/07/building-adlys-twitter-analytics.html
Hello from another C and Python user.
I use a lot of high-level languages, but from time to time I need C. Like you, I usually get 10^4 to 10^5 times better performance, because I know what I'm doing and can control it.
People who only know high-level languages want to use them for everything (if all you have is a hammer...). It's not a 20%, 40%, or 50% penalty; it's orders of magnitude when you work with a lot of data.
I also want to use high-level languages for everything ;) My original intersection code was in Python, and it still managed to intersect those two sorted sets in 3.5 seconds, 2x faster than the C Redis! Of course, the Python was iterating over the two sorted sequences and relied on heapq.merge() for most of the heavy lifting.
The C++ standard library algorithms and data structures aren't "object oriented", don't use runtime dispatching (and hence no vtables), do not perform superfluous bounds checks and do not hide "the complexity of method dispatch".
ReplyDeleteIn fact, the whole algorithms library was designed to offer high abstraction at *no* runtime cost. Which is exactly what they do. All dispatch is performed at compile time.
If you were satisfied with `qsort`'s performance, try C++'s `sort` with a functor comparer to get an instant performance boost.
I've updated the post to remove references to vtables in the main portion of the answer, along with an explanation.
Yes, please answer KevinX's questions about what you are trying to do. It's not clear, especially his question #2.
I've answered it :)
C++ syntax may be ugly, but if you're worrying that the STL is slow due to vtables, you're either doing it wrong or have no idea what you're talking about.
I've updated the post to remove references to vtables in the main portion of the answer, along with an explanation.
This is a fantastic article. :) I rarely get a chance to optimize hotspots to this degree anymore. Not since matrix multiplication in my master's program.
Can you share your STL implementation? How about this one?
ReplyDeletehttps://gist.github.com/2065728
I'd be surprised if that wasn't efficient.
I don't have an STL implementation; I thought I made that clear in the article and subsequent comments. I did implement an algorithm very similar to the STL set intersection algorithm, but it only performed half as fast as the algorithm we ended up using.
I know your intset is already sorted, but I don't see the cost of doing that calculated in anywhere.
You know what would be cool? If you could post two example intsets for readers to play with. I'm pretty sure that, as a set of numbers, they're anonymized quite well.
I'm also curious about the distribution of the numbers. You do have to read them from disk at some point, but perhaps the sequential IO cost is minimal enough that it doesn't matter.
"Meanwhile, the slow O(nlogn) standard C sort() finishes in 1.2 seconds, which is within spitting distance of the theoretically optimal bloom filter we just described. We also only sort once, but intersect thousands of times"
After constructing the unsorted intsets on another machine, we would upload them to S3 and spawn a task, which was then executed by the processes that handled this stuff: download the intsets from S3, load them into memory via mmap (typically 200-300 ms), sort them (under 1.5 seconds), then begin intersecting that set with thousands of other previously sorted and mmap-loaded intsets.
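A sketch of what that load-and-sort step can look like; this is illustrative rather than the production code, and it assumes each file is a bare array of native-endian 4-byte ids:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a;
        uint32_t y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* Map a file of raw 4-byte ids and sort it in place. MAP_PRIVATE
       makes the pages copy-on-write, so sorting never touches the file. */
    uint32_t *load_and_sort(const char *path, size_t *n_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        uint32_t *ids = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);
        close(fd);   /* the mapping remains valid after close */
        if (ids == MAP_FAILED) return NULL;
        *n_out = st.st_size / sizeof(uint32_t);
        qsort(ids, *n_out, sizeof(uint32_t), cmp_u32);
        return ids;
    }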
I don't have access to the intsets anymore, as I no longer work for the company I did this for. Also, the data is publicly available: it's just Twitter follower lists.
Any sample of the code on GitHub or something?
Great post. I want to understand practical complexity on modern CPUs at this level. Are there any books you would recommend?
It's been about 10 years since I picked up a book on computer/CPU architecture, but the MIT OpenCourseWare course on the topic uses the same book and seems to cover more or less the same stuff I learned: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/
Other than that, my knowledge comes from a variety of sources including coursework in college, courses in grad school, building computers and paying attention to BIOS settings, and writing software in low-enough-level languages to be able to measure actual memory and CPU performance.
Hi sir,
I wanted to implement a Bloom filter based query search in a routing protocol such as MFLOOD in ns2. Can you kindly guide me on how to proceed? I am unable to begin the task.
Email me with your requirements and I will tell you my consulting rate.