Am I missing something?
So you have a GPU chip that can notionally perform, say, 1 Tflop/s single-precision. If those flops were, say, FMA operations — an FMA is conventionally counted as two flops, so that's 0.5 trillion FMAs per second — each one takes 3 operands (12 bytes) in and produces one result (4 bytes) out. So the bandwidth requirements of just one GPU at full chat are going to be 6 terabytes/second in and 2 terabytes/second out. How on earth are you supposed to get anything like peak performance out of these things?
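To make the arithmetic concrete, here's a back-of-envelope sketch. The numbers are illustrative assumptions (a round 1 Tflop/s peak, fp32 operands, an FMA counted as two flops, and every operand streamed from memory with zero reuse), not figures for any specific GPU:

```python
# Hypothetical round numbers -- not any particular GPU.
PEAK_FLOPS = 1e12               # assumed peak: 1 Tflop/s single-precision
FMAS_PER_SEC = PEAK_FLOPS / 2   # one FMA = 1 multiply + 1 add = 2 flops
BYTES_IN_PER_FMA = 3 * 4        # three fp32 operands in
BYTES_OUT_PER_FMA = 1 * 4       # one fp32 result out

# Bandwidth needed if every operand comes from / goes to memory (no reuse)
in_bw = FMAS_PER_SEC * BYTES_IN_PER_FMA    # bytes/s inbound
out_bw = FMAS_PER_SEC * BYTES_OUT_PER_FMA  # bytes/s outbound

print(f"inbound : {in_bw / 1e12:.1f} TB/s")   # -> 6.0 TB/s
print(f"outbound: {out_bw / 1e12:.1f} TB/s")  # -> 2.0 TB/s
```

The zero-reuse assumption is the crux of the question: real kernels only approach peak by reusing operands from registers and on-chip caches (e.g. in matrix multiply, each loaded value feeds many FMAs), which is why the required memory bandwidth in practice can be far below this worst case.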