re:Joerg Moellenkamp
Well, you are doing a bit too much number magic here.
A single strand going up to 5 times faster is not the same as taking the per-core performance (8 threads/strands) of the T3 and multiplying that number by 5. It means taking the performance of one thread/strand of the T3 and multiplying that by 5.
IMHO the maximum per-thread performance of the T4 will most likely be in the range of 12-15 SPECintRate2006 (that is, 5 times the maximum throughput of a single T3 thread, which AFAIR is in turn about 2x the throughput of a single thread when all threads are running).
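To make that estimate easier to check, here is a rough sketch that works backwards from it to the T3 figures it implies. Every number is my own guess from above (the 12-15 range, the factor of 5, the AFAIR factor of 2), not a published result, and the variable names are made up for illustration.

```python
# All inputs are the estimates from this post, NOT measured or published scores.
t4_per_thread_estimate = (12.0, 15.0)  # guessed T4 per-thread range
t4_vs_t3_speedup = 5.0                 # "up to 5 times faster" single strand
alone_vs_loaded = 2.0                  # AFAIR: a lone T3 thread is ~2x a loaded one

# Implied throughput of one T3 thread running with the core to itself.
t3_thread_alone = tuple(x / t4_vs_t3_speedup for x in t4_per_thread_estimate)

# Implied throughput of one T3 thread when all threads are running.
t3_thread_loaded = tuple(x / alone_vs_loaded for x in t3_thread_alone)

print("implied T3 thread, running alone:  %.1f to %.1f" % t3_thread_alone)
print("implied T3 thread, all threads on: %.1f to %.1f" % t3_thread_loaded)
```

The point is only that the factor of 5 applies to the lone-thread figure, not to the per-core figure.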
That might not sound very good if you look at your examples with other architectures. But I actually think it is kind of OK. Why?
Because the single strand performance of a Westmere-EX with Hyperthreading enabled isn't the same as the per core performance either. And the same goes for POWER7 with SMT enabled.
Both processors, when running with SMT/Hyperthreading enabled, favour total throughput over single-strand throughput. And there is a price to pay for that. So their single thread/strand throughput isn't equal to specintrate2006 divided by the number of cores either.
Try having a look at the specint2006 score of POWER6, where there actually are numbers that you can use, because the specint2006 scores didn't use autopar.
IBM System p 570 (4.7 GHz, 1 core-1 Thread) 21.6 specint 2006
Now running on 2 cores with specint2006rate (4 threads) the score becomes:
IBM System p 570 (4.7 GHz, 2 core-2 Threads) 60.9 specint2006 rate.
Hence doubling the number of cores gives almost a factor of 3 in performance. So using your math, this would have given POWER6@4.7GHz a specint2006 score of 30.5 (without autopar), which it clearly doesn't have.
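The arithmetic behind that, as a small sketch using only the two p 570 figures quoted above (the variable names are just mine, for illustration):

```python
# Figures quoted in this post for the IBM System p 570 (POWER6, 4.7 GHz).
single_thread_score = 21.6  # specint2006, 1 core, 1 thread
rate_score = 60.9           # specint2006 rate, 2 cores
cores = 2

# "Your math": divide the rate score by the number of cores.
naive_per_core = rate_score / cores

# How far the rate score is above the real single-thread score.
scaling = rate_score / single_thread_score

# How much the naive per-core figure overstates the real single-thread score.
overestimate = naive_per_core / single_thread_score

print(f"rate / cores         = {naive_per_core:.2f}")
print(f"rate / single thread = {scaling:.2f}x")
print(f"naive / real         = {overestimate:.2f}x")
```

So dividing the rate score by the core count overstates the real single-thread score by roughly 40% in this case.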
The difference (at least for POWER7; I haven't read up on the latest enhancements for Hyperthreading, so I'll refrain from making faulty statements about that one) is that POWER7 is able to allocate all the resources (execution slots) to a single thread, because it runs a fairly clever implementation of SMT, and can thus get pretty close to the maximum per-thread throughput it is actually capable of. The fine-grained (round-robin) approach of the T3 can't do that trick, but from what I understand about Yosemite Falls (T4), it can do much the same thing as POWER6/POWER7 and thus get both good per-thread throughput and good per-chip throughput.
Now, on the other hand, IMHO it's still too little, too late...
// Jesper