Eight four two one, twice the cores is (almost) twice as fun
Again, I say "semi-review" because if I were going to do this right, I'd have set up both the dual-4 and the dual-8 identically, had them do the same tasks and gone back if the results were weird. However, when you're buying a $7000+ workstation you economize where you can, which means I didn't buy any new NVMe cards, bought additional rather than spare RAM, and didn't buy another GPU; the plan was always to consolidate those into the new machine and keep the old chassis, board and CPUs/HSFs as spares. Plus, I moved over the case stickers and those totally change the entire performance characteristics of the system, you dig? We'll let the Phoronix guy(s) do that kind of exacting head-to-head because I pay for this out of pocket and we've all gotta tighten our belts in these days of plague. Here's the new beast sitting beside me under my work area:
(By the way, I'm still taking candidates for #ShowUsYourTalos. If you have pictures uploaded somewhere, I'll rebroadcast them here with your permission and your system's specs. Blackbirds, POWER8s and of course any other OpenPOWER systems welcome. Post in the comments.)
This new "consolidated" system has 64GB of RAM, Raptor's BTO option Radeon WX7100 workstation GPU and two NVMe main drives on the current Talos II 1.01 board, running Fedora 31 as before. In normal usage the dual-8 runs a bit hotter than the dual-4, but this is absolutely par for the course when you've just doubled the number of processors onboard. In my low-Earth-orbit Southern California office the dual-4's fastest fan rarely got above 2100rpm while the dual-8 occasionally spins up to 2300 or 2600rpm. Similarly, give or take system load, the infrared thermometer pegged the dual-4's "exhaust" at around 95 degrees Fahrenheit; the dual-8 puts out about 110 F. However, idle power usage is only about 20W more when sitting around in Firefox and gnome-terminal (130W vs 110W), and the idle fan speeds are about the same such that overall the dual-8 isn't appreciably louder than the very quiet dual-4 was with the most current firmware (with the standard Supermicro fan assemblies, though I replaced the dual-4's PSUs with "super-quiets" a while back and those are in the dual-8 now too).
Naturally the CPUs are the most notable change. Recall that the Sforza "scale out" POWER9 CPUs in Raptor-family workstations are SMT-4, i.e., each core offers four hardware threads, which appear as discrete CPUs to the operating system. My dual-4 appeared to be a 32 CPU system to Fedora; this dual-8 appears to have 64. These threads come from "slices," and SMT-4 cores have four of them, paired into two "super-slices." They look like this:
Each slice has a vector-scalar unit and an address generator feeding a load-store unit. The VSU has 64-bit integer, floating point and vector ALU components; two slices are needed to get the full 128-bit width of a VMX vector, hence their pairing as super-slices. The super-slices do not have L1 cache of their own, nor do they handle branch instructions or branch prediction; all of that is per-core, and the core also does instruction fetch and dispatch to the slices. (This has strengths and pitfalls similar to AMD Ryzen, for example, which also has a single branch unit and caches per core, though the Ryzen execution units are not organized in the same fashion.) The upshot of all this is that certain parallel jobs, especially those competing for scarcer per-core resources like L1 cache or the branch unit, may benefit more from full cores than from threads, and this is true of pretty much any SMT implementation. In like fashion, since each POWER9 slice is not a full vector unit (only the super-slices are), heavy use of VMX would soak up twice the execution resources, though that cost is amortized by the greater efficiency vector code offers over scalar.
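If you're curious how this topology surfaces at the OS level, it's easy to inspect from a shell. A minimal sketch, assuming the powerpc-utils package (which provides the ppc64_cpu tool) is installed; the values shown are what the dual-8 should report:

$ nproc
64
$ ppc64_cpu --smt
SMT=4
$ ppc64_cpu --cores-present
Number of cores present = 16

You can also turn SMT off entirely with ppc64_cpu --smt=off (as root) if you want to measure full cores against threads yourself.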
The biggest task I routinely do on my T2 is frequent smoke-test builds of Firefox, to make sure OpenPOWER-specific bugs are found before they get to release. This was, in fact, where I hoped I would see the most improvement: a fair bit of a build can run in parallel, so if any of my typical workloads would show a benefit, I felt it would likely be this one. Before I tore down the dual-4 I put all 64GB of RAM in it for a final timed run to eliminate memory pressure as a variable (and the same sticks are in the dual-8, so it's exactly the same RAM in exactly the same slot layout). These snapshots were done building the Firefox 75 source code from mozilla-release (current as of this writing) with my standard optimized .mozconfig, varying only in the number of jobs specified. I'm only reporting wall time here because, frankly, that's the only thing I personally cared about. All build runs were done at the text console before booting X and GNOME to further eliminate variability, and I did three runs of each configuration back to back (./mach clobber && ./mach build) to account for any caching that might have occurred. Power was measured at the UPS. Default Spectre and Meltdown mitigations were in effect.
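If you want to replicate this, here's a minimal sketch of the setup; the job count goes in the .mozconfig (everything else in my .mozconfig is omitted here), and the loop is just the three timed runs described above:

# .mozconfig fragment: set build parallelism
mk_add_options MOZ_MAKE_FLAGS="-j24"

# three back-to-back timed runs at the text console
for i in 1 2 3; do ./mach clobber && time ./mach build; done

mach build also accepts -jN on the command line, but the mozconfig route is the long-standing one.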
Dual-4 (-j24): 32:22.65, 31:19.17, 30:49.66 (average draw 170W)
Dual-8 (-j48): 19:16.28, 19:09.18, 19:08.32 (average draw 230W)
Dual-8 (-j24): 19:16.46, 19:13.78, 19:10.14 (average draw 230W)
The dual-8 is approximately 40% faster than the dual-4 on this task (put another way, the dual-4 took about 1.6x as long), but doubling the number of make jobs from my prior configuration didn't yield any improvement, despite 48 being well within the 64 threads available. This surprised me, so given that the dual-8 has 16 cores, I tried 16 jobs directly:
Dual-8 (-j16): 21:49.72, 21:40.18, 21:41.33 (average draw 215W)
This shows, at least for this workload, that SMT does make some difference, just not as much as I would have thought. It also suggests that the sweet spot for the dual-4 might have been around -j12, but I'm not willing to tear this box back down to try it. Still, cutting my build times by over ten minutes is nothing to sneeze at.
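For those checking the math, the per-configuration averages above work out to roughly 1890 seconds for the dual-4 at -j24, 1153 seconds for the dual-8 at -j24 and 1304 seconds for the dual-8 at -j16. A couple of quick divisions, here with bc:

$ echo 'scale=2; 1890/1153' | bc    # dual-4 vs dual-8, both -j24
1.63
$ echo 'scale=2; 1304/1153' | bc    # 16 jobs vs 24 jobs on the dual-8
1.13

So the dual-4 took about 1.6x as long, and going from 16 jobs to 24 bought about 13%, which is where SMT earns its keep.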
For other kinds of uses, though, I didn't see much performance difference between DD2.2 and DD2.3, and to be honest you wouldn't expect to. DD2.3 does have improved Spectre mitigations, and that would help the kind of branch-heavy code that benefits least from additional slices, but the change is relatively minor and the difference in practice indeed seemed to be minimal. On my JIT-accelerated DOSBox build the benchmarks came in almost exactly the same, as did QEMU running Mac OS 9. Booted into GNOME as I am right now, the extra CPU resources certainly do smooth out doing more things at once, but again, that's more a factor of the number of cores and slices than of the processor stepping.
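If you're not sure which stepping is in your machine, the kernel will tell you. From my recollection of ppc64le /proc/cpuinfo, the revision line looks something like the below, though the exact pvr bits vary by module, so treat the value shown as an assumption rather than gospel:

$ grep -m1 revision /proc/cpuinfo
revision : 2.3 (pvr 004e 1203)

DD2.2 parts report 2.2 instead.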
Overall I'm pretty pleased with the upgrade, and it's a welcome boost that further improves my efficiency. Here are my present observations if you're thinking about a CPU upgrade too (or are a first-time buyer considering how much CPU to get):
- Upgrading is pretty easy with these machines: if you bought too little today, you can always drop in a beefier CPU or two tomorrow (assuming you have the dough and the sockets), and the Self-Boot Engine code is generic, such that any Sforza POWER9 chip will work on any Raptor system that can accommodate it. I have been repeatedly assured no FPGA update is needed to use DD2.3 chips, even for "O.G." T2 systems. However, if you're running a Blackbird, you should think about case and cooling as well, because the 8-core will run noticeably hotter than a 4-core. A lot of the more attractive slimline mATX cases are very thermally constrained, and the 8-core CPU really should be paired with the (3U) high-speed fan assembly rather than the (2U) passive heatsink (see the monitoring sketch after this list). This is a big reason why my personal HTPC Blackbird is still a little 4-core system.
- The 4- and 8-core chips are familiar to most OpenPOWER denizens, but the 18- and 22-core monsters have a complicated value proposition. People have certainly run them in Blackbirds; my personal thought is that this is courting an early system demise, though I am impressed by how heroic some of those efforts are. I wouldn't do it myself, however: between the thermal and power demands, you're gonna need a bigger boat for these sharks.
The T2 is designed for them and will work fine, but after my experience here I wonder how loud and hot they would get in warmer environments. Plus, you really need to fill both of those sockets or you'll lose three slots (those serviced by the second CPU's PCIe lanes), which would make them even louder and hotter. The dual-8 setup here gets you 16 cores and all of the slots turned on, so I think it's the better workstation configuration even though it costs a little more than a single-18 and isn't quite as performant on paper. The dual-18 and dual-22 configurations are really meant for big servers and crazy people.
With the T2 Lite, though, these CPUs make perfect sense, and it would almost be a waste to run one with anything less. The T2 Lite is just a cut-down single-socket T2 board in the same form factor, so it will also easily accommodate any of these CPUs, but more cheaply. If you need the massive thread resources of a single-18 (72 threads) or single-22 (88 threads) workstation, and you can make do with an x16 and an x8 slot, it's really the best overall option for those configurations, and it's not that much more than a Blackbird board. Plus, being a single-CPU configuration, it's probably a lot more liveable under one's desk.
- Simply buying a DD2.3 processor to replace a DD2.2 processor of the same core count probably doesn't pay off for most typical tasks. Unless you need the features (and there are some: besides the Spectre mitigations, it also has Ultravisor support and proper hardware watchpoints), you'll just end up spending additional money for little or no observable benefit. However, if you're going to buy more cores at the same time, then you might as well get a DD2.3 chip and have those extra features just in case you'll need them later. The price difference is almost certainly worth a little futureproofing.
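As promised above, here's how I'd watch thermals if you're sizing up cooling for a bigger chip. A hedged sketch, assuming lm_sensors is installed (the in-kernel ibmpowernv driver exposes the core temperatures) and using stress-ng as a stand-in for your real workload:

# terminal 1: load every hardware thread for ten minutes
stress-ng --cpu $(nproc) --timeout 10m

# terminal 2: watch temperatures (and fan speeds) refresh every 2 seconds
watch -n 2 sensors

If the cores start pulling back clocks or the fans never leave full tilt, your case and heatsink combination isn't up to the chip.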
Thanks for the technical details and your day-to-day experience. Congratulations on the new machine, especially on the professionally attached stickers :-)
It's all about classy.
The right stickers are applied ;-) Thanks for being with us.
I wondered why you bought just the octocores (or just quads originally), since given the overall high budget of the build they seem undersized.
About the thermals - I think the problem here is relying on the stock heatsinks. There is one thing Raptor could do to improve the situation: make mounting kits for better aftermarket coolers. They probably can't get renowned manufacturers like Scythe, Noctua, Thermaltake etc. to make such niche coolers, but those manufacturers use modular mounting systems, and you would just need a few metal pieces to adapt some of their coolers (or a lot of them, for manufacturers like Noctua that standardize mounting).
Basically, Raptor could offer a kit like this: https://noctua.at/en/nm-i115x-mounting-kit for the POWER9 socket. They could pick whichever cooler manufacturer/line to support based on what fits best and is safest to use, as long as the cooler has large capacity and good performance.
Similarly, they could try to make a mounting bracket for some AIO coolers; for example, AMD put one in Threadripper boxes at launch and solved the lack of coolers for the all-new socket that way. Personally I think AIOs/water cooling are unreliable and need too much maintenance, so IMHO air coolers are the better idea.
Perhaps you can pitch this idea to them, since I think you communicate with the company.
I was actually pretty happy even with the dual-4 setup, but the dual-8 is pretty great too. Still, it's mostly the heat issues that deter me from going higher, since whatever I do is going to be dual-chip. I like your cooler idea, though. I'll see what they think, though I bet you're not the only one who's brought it up. :)
AFAIK it's much more complicated with alternative cooling solutions. The CPU socket is designed differently from the ones in the x86 world. There have been many discussions about it, and it seems no one sees a viable business here.
I am very tempted to buy an 8-core Blackbird, or even two, now that I have much more free time and money, so I checked out the shop again, but with shipping to Europe including taxes and VAT I end up at almost 3,000 EUR! This is without case, RAM, HD, PSU ... Damn, this is insane! Sorry, Raptor, but at these crazy prices it will stay a collector's box. If they sold the bundle for 1,500 USD at most, they would reach many more people, but not at more than 2,100 USD. A pity, because I would really like to throw out the Intel stuff.
That is indeed crazy, considering I was able to purchase the 8-core bundle for about 1700 EUR with the introductory pricing.
While I'm very happy with my purchase at that price, I would never have been able to justify paying 3000 for it, especially with no local distributor or support provider available.
It seems like some of this price hike was motivated by the current economic situation, but I can't imagine Raptor are getting many orders at this point. I'm not certain this approach is the correct one. I'm not seeing any progress on the EU reseller front, either.
I'll have to hope, then, that my current Blackbird - which is actually my primary system right now - won't ever break. I wouldn't be able to afford a replacement... The thought of going back to x86 is really not a pleasant one.
Raptor here...we really need more volume on the desktop side to be able to reduce prices. Our costs have been steadily rising due to factors outside our control, and without an increase in volume to offset that increase we had no choice but to raise prices.
Two things:
1.) We would entertain discount pricing for group buys -- i.e., orders of, say, 5 or more to the same country.
2.) The competition in the open systems space isn't too far off our current pricing, but with much worse specs. I'm thinking of the SiFive desktop board with PCIe 2 and relatively weak cores, for a mere $3,000 or so. Contrast that with the Blackbird desktop with PCIe 4, a mature software ecosystem, and a case, for only around 10% more.
Not saying I like the increased prices, but something has to give somewhere. Get the volume up and prices will go back down...
I would like to add that it's possible to buy Raptor stuff from https://vikings.net/. You only have to email them.
@ Raptor
Thank you for your comment. I did not expect that you would come here to give a reply, so really, thank you. I know it's difficult for you. I understand that you cannot produce in high numbers like MSI, and so the board is more expensive than typical x86 boards. And I'm okay paying 2x, 3x the price. But the current price is about 6x. That's not something enough people can pay. And saying "our price is too high, but please buy anyway and then it will be cheaper for the people who buy after you" doesn't help any potential customers, because no current customer can benefit from it.
It's always a pleasure to read your insightful post.
ReplyDeleteI don't have a Talos II to do the #ShowUsYourTalos but I guess I could do a #ShowUsYourBlackbird
Have you tried to encode an 8K H.265 video with the dual-8? Keen to see how long / how well it performs under full load and whether it would encounter throttling due to overheating (I assume you are also going with the stock 3U HSF from Raptor).
Sure, send a link. No, I haven't, but that sounds like an interesting stress test. I don't think overheating will be an issue but it might be nice to see a worst case. Yes, both SCMs have the Raptor-sold 3U HSF on them.
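If anyone wants a starting point for the same experiment, a minimal hedged sketch with ffmpeg's software libx265 encoder; the input file name and quality settings here are placeholders:

$ ffmpeg -i input-8k.mkv -c:v libx265 -preset slow -crf 24 -c:a copy output-8k.mkv

libx265 should happily saturate all 64 threads at 8K, which ought to make it a decent worst case.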
OK, here you go: https://imgur.com/a/vTqLf3x
Since the case has a side window, I should really bling it up some. Maybe next month.
I like the case. It's got a really nice look to it. I'll collect a couple more from anyone who wants to send one in, and then I'll post them.
Nice. I see that you are using FreeBSD - how is the driver support with that AMD GPU (based on Polaris, as far as I can tell)? Does it use the amdgpu driver, and if so, how well does it work on BE?