Techniques for Characterizing the Data Movement Complexity of Computations
[1] Mohammad Zubair,et al. A unified model for multicore architectures , 2008, IFMT '08.
[2] James Demmel,et al. Perfect Strong Scaling Using No Additional Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[3] Gianfranco Bilardi,et al. A Characterization of Temporal Locality and Its Portability across Memory Hierarchies , 2001, ICALP.
[4] James Demmel,et al. Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput.
[5] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] David Parello,et al. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies , 2006, International Journal of Parallel Programming.
[7] Samuel H. Fuller,et al. The Future of Computing Performance: Game Over or Next Level? , 2014.
[8] Sven Verdoolaege,et al. isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.
[9] Mohammad Zubair,et al. Cache-optimal algorithms for option pricing , 2010, TOMS.
[10] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.
[11] M. Hestenes,et al. Methods of conjugate gradients for solving linear systems , 1952.
[12] John Shalf,et al. Exascale Computing Technology Challenges , 2010, VECPAR.
[13] A. I. Barvinok,et al. Computing the Ehrhart polynomial of a convex lattice polytope , 1994, Discret. Comput. Geom.
[14] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969.
[15] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput.
[16] Philippe Clauss,et al. Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[17] James Demmel,et al. Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl.
[18] Bowen Alpern,et al. A model for hierarchical memory , 1987, STOC.
[19] Alok Aggarwal,et al. The input/output complexity of sorting and related problems , 1988, CACM.
[20] P. Feautrier. Parametric integer programming , 1988.
[21] J. Ramanujam,et al. On characterizing the data movement complexity of computational DAGs for parallel execution , 2014, SPAA.
[22] Leslie G. Valiant. A Bridging Model for Multi-core Computing , 2008, ESA.
[23] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[24] T. Tao,et al. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities , 2005, math/0505691.
[25] Sanjay V. Rajopadhye,et al. The Z-polyhedral model , 2007, PPOPP.
[26] H. Whitney,et al. An inequality related to the isoperimetric inequality , 1949.
[27] J. Ramanujam,et al. On Using the Roofline Model with Lower Bounds on Data Movement , 2015, ACM Trans. Archit. Code Optim..
[28] Franco P. Preparata,et al. Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part II, Lower Bounds , 1999, Theory of Computing Systems.
[29] J. Ramanujam,et al. On Characterizing the Data Access Complexity of Programs , 2014, POPL.
[30] James Demmel,et al. Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations , 2014, SPAA.
[31] Michele Scquizzato,et al. Communication Lower Bounds for Distributed-Memory Computations , 2013, STACS.
[32] James Demmel,et al. Graph expansion and communication costs of fast matrix multiplication: regular submission , 2011, SPAA '11.
[33] Richard W. Vuduc,et al. A Roofline Model of Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[34] Vipin Kumar,et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput.
[35] Franco P. Preparata,et al. Upper bounds to processor-time tradeoffs under bounded-speed message propagation , 1995, SPAA '95.
[36] Stefan Rusu,et al. A 45nm 8-core enterprise Xeon® processor , 2009.
[37] Desh Ranjan,et al. Upper and lower I/O bounds for pebbling r-pyramids , 2012, J. Discrete Algorithms.
[38] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[39] Andrea Pietracaprina,et al. On the Space and Access Complexity of Computation DAGs , 2000, WG.
[40] Richard W. Vuduc,et al. Balance Principles for Algorithm-Architecture Co-Design , 2011, HotPar.
[41] James Demmel,et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.
[42] Ragavendar Nagapattinam Ramamurthi. Dynamic Trace-based Analysis of Vectorization Potential of Programs , 2012.
[43] James Demmel,et al. Minimizing Communication in All-Pairs Shortest Paths , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[44] Geppino Pucci,et al. The Potential of On-Chip Multiprocessing for QCD Machines , 2005, HiPC.
[45] Desh Ranjan,et al. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs , 2011, COCOON.
[46] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009.
[47] Richard W. Vuduc,et al. Algorithmic Time, Energy, and Power on Candidate HPC Compute Building Blocks , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.