branch history and prediction schemes: Uht, Sindagi, Somanathan "Branch Effect Reduction Techniques" IEEE Computer May 1997 pp 71-81
Vander Wiel, Lilja "When Caches Aren't Enough: Data Prefetching Techniques" IEEE Computer July '97 pp 23-30
cache blocking: see TLB below
celefunt: Cody's accuracy test suite for FORTRAN complex math functions, netlib/toms714. Quite useful in its standard form, although not written for extended precision (such as Intel's).
directives: "Visual KAP for OpenMP User's Manual" www.kai.com/vkomp
divide/sqrt hardware techniques:
Soderquist, Leeser "Division and Square Root ..." IEEE Micro July/Aug '97 pp 56-66
egcs: directories under ftp://egcs.cygnus.com and many mirror sites; a high quality but rapidly developing GNU C/C++/f77 compiler suite.
elefunt: Accuracy test suite for FORTRAN math functions. Has some portability problems (runs, but results are not right). Translated to C by Plauger and further modified by Prince. Copyright by Plauger, possibly available with permission.
Einarsson, Shokin "Fortran 90 for the Fortran 77 Programmer"
Computational Science Education Project "Fortran 90 and Computational Science"
f90 tutorial: Metcalf http://wwwcn.cern.ch/asdoc/WWW/f90/
Patrick Corde, Herve Delouis "Cours Fortran 90" idris.fr
f95 compilers and netlib software: many listed on www.fortran.com/fortran
look for modernized versions of netlib software elsewhere
f95: FORTRAN 95 Handbook, Adams, Brainerd et al MIT Press 1997 ISBN 0-262-51096-0
fused MAC effects etc:
http://http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps Note that Kahan's quadratic code for fused MAC is not satisfactorily programmable in standard FORTRAN, but can be done reasonably in C.
g77: gnu or egcs mirror sites; CD versions tend to be out of date
Kumar "The HP PA-8000 RISC CPU" IEEE Micro Mar/Apr '97 pp 27-32
IEEE P754/854: Cody, IEEE Micro Aug. 1984 pp 84-100
Intel Pentium Pro: Papworth "Tuning the Pentium Pro.." IEEE Micro April
1996 pp 8-15; Bhandarkar and Ding "Performance Characterization of the
Pentium Pro" distributed by Internet
The following tutorial has some extremely useful information, some not applicable to Pentium, and some serious proof-reading failures:
latency and instruction level parallelism, Newton and Goldschmidt schemes: Soderquist, Leeser "Division and Square Root..." IEEE Micro July/Aug 1997 pp 56-66
Alan Miller's site for modernized netlib: http://www.ozemail.com.au/~milleraj
MIPS/SGI R10000: Yeager "The MIPS R10000.." IEEE Micro April 1996 pp 28-40
pipelining: Smith, Weiss "PowerPC 601 and Alpha 21064..." IEEE Computer, June 1994 pp 46-58
profiling: see TLB
Translation Lookaside Buffer (TLB) coding issues: "Performance Tuning
for the Power Challenge.." http://gserv1.dl.ac.uk/TCSC/DisCo/Events/Workshop/sgi/R10000