Is It Possible to Achieve a Teraflop/s on a Chip? From High Performance Algorithms to Architectures