There has been a reply to the article A Critique of RDMA that I talked about last month. Unfortunately, this new article mostly talks about specifications, without giving actual technical justifications that might show why RDMA is such a good idea. Basically, in reply to A Critique of RDMA saying that RDMA does not do several things well, the new article replies that RDMA can do these things. But there is not a word about performance or relevance for HPC; it just reads like a list of specifications. Of course, lots of things are always possible, whatever the programming model. But in HPC the question is about performance, and there is nothing about it in this new article.
The author insists a lot on the fact that RDMA is more than Read and Write. The point seems to be that RDMA can do Send-Recv too? Cool, that's what people have been doing for 10 years with MPI and the non-IB interconnects (RDMA people did not invent anything there...). Why so much noise about RDMA then? That's kind of what I was saying in my Truths and Lies about RDMA: RDMA means everything and nothing. The marketing is vague, making it impossible to know what RDMA actually is, and trying to confuse customers with words without any technical substance.
A Critique of RDMA was saying that RDMA sucks in HPC because it is too different from Send-Recv, does not scale, and so on. And this new article replies that there is Send-Recv on RDMA too? So why does IB Send-Recv performance suck then? Probably because the design of RDMA hardware is too close to the one-sided model, making Send-Recv hard to implement on top of it (as explained in A Critique of RDMA). This is caused by people who wrote hardware specifications without ever talking to the software people (those who will have to implement MPI on top of this hardware). Then, assuming RDMA people agree that Send-Recv is good, why do RDMA developers concentrate on one-sided? That's nonsense. They claim they do Send-Recv for marketing purposes (because customers want Send-Recv). But they do not actually use it, because its performance sucks (because the whole model is broken). Anyway, that's more evidence that the one-sided model is unusable for MPI applications (i.e. 99% of HPC).
Writing white papers and specifications is useless if you have no clue about hardware costs and software requirements. For instance, despite what this new article says, you cannot seriously think somebody smart will do RDMA for small messages. Well, one could do it, but nobody does. Why? Because it kills the latency! Why? Because it's faster to write the data by PIO from the sender host to the sender's NIC than to ask the sender's NIC to read it by DMA from the sender host! The shortest critical path is the sender pushing the data, not the sender asking the receiver to get the data. Of course, for large messages it's different (because of the per-byte overhead). But everybody has known for 10 years that small messages have to be implemented with PIO, not with DMA. And the fact that the memory is pre-registered does not change anything: for small messages, DMA is still slower than PIO. No need for a Ph.D. to understand that.
At the end of the article, a look is taken at the current Top500, and it is said that IB is better than Myricom at the top end. Ok, let's see what the first IB entry in the Top500 is: 20 SGI® Altix™ 3700 superclusters, each with 512 processors. Oh great, that's a 20-node IB cluster. That's impressive scalability! What about the other IB clusters? Well, several are known to be replaced soon. Maybe because they do not work? Oh yes, when you want Linpack results for the Top500, you can spend one month tuning everything so that it works fine at least once. That's enough to enter the Top500. But wait a minute: the final goal is not just to enter the Top500, it's to get production machines where users come with whatever application and it works. And IB clusters are known to be far away from being there.
Finally, the article shows that Myricom's share in the Top500 list is declining. But it forgets to say that the new generation of Myricom hardware (Myri-10G) was not widely available when the latest Top500 was released. All the listed Myrinet entries are running Myrinet 2000 (2 Gbit/s), the 5-year-old hardware generation. And actually, these machines are often faster than IB machines even though the interconnect bandwidth is 4 times lower. The way Myricom stayed in the Top500 while its hardware was outdated means much more than IB having a couple of new clusters at the top end. And if the point is really to have machines in the Top500 (and it is not, except for marketing people), we'll talk about it again when Myri-10G clusters are widely available.