branch history and prediction schemes: Uht, Sindagi, Somanathan "Branch Effect Reduction Techniques" IEEE Computer May 1997 pp 71-81

cache prefetch:
Vander Wiel, Lilja "When Caches Aren't Enough: Data Prefetching Techniques" IEEE Computer July '97 pp 23-30

cache blocking:  see TLB below

celefunt: Cody's accuracy test suite for FORTRAN complex math functions, netlib/toms714. Quite useful in its standard form, although not written for extended precision (such as Intel's).

directives: "Visual KAP for OpenMP User's Manual" www.kai.com/vkomp

divide/sqrt hardware techniques:
Soderquist, Leeser "Division and Square Root ..." IEEE Micro July/Aug '97 pp 56-66

egcs: directories under ftp://egcs.cygnus.com and many mirror sites; a high-quality but rapidly developing GNU C/C++/f77 compiler suite.

elefunt: Accuracy test suite for FORTRAN math functions. Has some portability problems (it runs, but the results are not right). Translated to C by Plauger and further modified by Prince. Copyright by Plauger; possibly available with permission.

f77/f90 comparison:
Einarsson, Shokin "Fortran 90 for the Fortran 77 Programmer"

Computational Science Education Project "Fortran 90 and Computational Science"

f90 tutorial: Metcalf http://wwwcn.cern.ch/asdoc/WWW/f90/
Patrick Corde, Herve Delouis "Cours Fortran 90"  idris.fr

f95 compilers and netlib software: many listed on www.fortran.com/fortran
look for modernized versions of netlib software elsewhere
e.g. http://www.vic.cmis.csiro.au/~alan

f95: FORTRAN 95 Handbook, Adams, Brainerd et al, MIT Press 1997, ISBN 0-262-51096-0

fused MAC effects etc:
http://http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps Note that Kahan's quadratic code for fused MAC is not satisfactorily programmable in standard FORTRAN, but can be done reasonably in C.

g77: gnu or egcs mirror sites; CD versions tend to be out of date

HP PA-8000:
Kumar "The HP PA-8000 RISC CPU" IEEE Micro Mar/Apr '97 pp 27-32

IEEE P754/854: Cody, IEEE Micro Aug. 1984 pp 84-100

Intel Pentium Pro: Papworth "Tuning the Pentium Pro.." IEEE Micro April 1996 pp 8-15; Bhandarkar and Ding "Performance Characterization of the Pentium Pro" distributed via the Internet
The following tutorial has some extremely useful information, some not applicable to Pentium, and some serious proof-reading failures:

latency and instruction level parallelism, Newton and Goldschmidt schemes: Soderquist, Leeser "Division and Square Root..." IEEE Micro July/Aug '97 pp 56-66

Alan Miller's site for modernized netlib: http://www.ozemail.com.au/~milleraj

MIPS/SGI R10000: Yeager "The MIPS R10000.." IEEE Micro April 1996 pp 28-40

pipelining: Smith, Weiss "PowerPC 601 and Alpha 21064..." IEEE Computer, June 1994 pp 46-58

profiling: see TLB

Translation Lookaside Buffer (TLB) coding issues: "Performance Tuning for the Power Challenge.."  http://gserv1.dl.ac.uk/TCSC/DisCo/Events/Workshop/sgi/R10000