Resources from draft document Reliability of Grid Computing Systems

 

This list of references is also found in the draft OGF informational document, titled Reliability of Grid Computing Systems.

 

[Abaw2004] Abawajy, J. H., “Fault-Tolerant Scheduling Policy for Grid Computing Systems,” 18th International Parallel and Distributed Processing Symposium, April 2004, Santa Fe, New Mexico.

 

[Acce2007] AccessGrid Home Page. http://www.accessgrid.org/, 2007.

 

[Alti2005] Altintas, I., et al., “A Framework for the Design and Reuse of Grid Workflows,” Lecture Notes in Computer Science 3458, 2005.

 

[Anan2003] Anand, S., et al., “Flow-based Multistage Co-allocation Service,” The 2003 International Conference on Communications in Computing, Las Vegas, Nevada, USA, June 2003.

 

[Andr2002] Andrzejak, A., Graupner, S., Kotov, V., and Trinks, H., “Algorithms for Self-Organization and Adaptive Service Placement in Dynamic Distributed Systems,” Hewlett-Packard Corporation, HPL-2002-259, 2002.

 

[Alvi2001] Alvisi, L., et al., "Wrapping Server-Side TCP to Mask Connection Failures," in INFOCOM 2001, 22-26 April 2001, vol. 1, pp. 329-337.

 

[Arno1999] Arnold, K., et al., The Jini Specification, V1.0, Addison-Wesley, 1999. (Latest version is 1.1, available from Sun.)

 

[Aviz2004] Avizienis, A., Laprie, J., Randell, B., and Landwehr, C., “Basic Concepts and Taxonomy of Dependable and Secure Computing,” IEEE Transactions on Dependable and Secure Computing, Volume 1, Number 1, January-March 2004.

 

[Bane2002] Banerjee, S., Bhattacharjee, B., and Kommareddy, C., "Scalable Application Layer Multicast," ACM SIGCOMM, 2002.

 

 

[Barc2005] Barcellos, M., “Evaluating High-Throughput Reliable Multicast for Grid Applications in Production Networks,” 2005 IEEE International Symposium on                                     .

[Bart2003] Bartolini, N., Presti, F.L., and Petrioli, C., "Optimal Dynamic Replica Placement in Content Delivery Networks," The 11th IEEE International Conference on Networks, ICON 2003, 2003, pp. 125-130.

 

[Batc2004] Batchu, R., Dandass, Y., Skjellum, A., and Beddhu, M., “MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware,” Cluster Computing, pp. 303–315, October 2004.

 

[Baus2003] Bausch, W., Pautasso, C., and Alonso, G., “Programming for Dependability in a Service-based Grid,” Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'03), 2003.

 

[Bell2002] Bell, W., et al., “Simulation of dynamic grid replication strategies in OptorSim,” Proceedings of 3rd International IEEE Workshop on Grid Computing, pp. 46–57, 2002.

 

[Bell2003] Bell, W., et al., “Evaluation of an economy-based file replication strategy for a data grid,” International Workshop on Agent based Cluster and Grid Computing, pp. 120–126, 2003.

 

[Bezz2006] Bezzine, S., et al., “A Fault Tolerant and Multi-Paradigm Grid Architecture for Time Constrained Problems: Application to Option Pricing in Finance,” Second IEEE International Conference on e-Science and Grid Computing, 2006, p. 49, December 2006.

 

[Bosc2002] Bosilca, G., et al., “MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes,” Proceedings of IEEE SuperComputing, November 2002.

 

[Bout2005] Bouteiller, A., et al., “MPICH-V: a Multiprotocol Automatic Fault Tolerant MPI,” International Journal of High Performance Computing and Applications, Volume 20, Issue 3, pp. 319-330, 2006.

 

 

[Bunt2007] Buntinas, D., et al., “Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI,” Accepted for publication in Future Generation Computer Systems, Elsevier Press, 2007

 

 

[Chen2002a] Chen, M., Kiciman, E., Fratkin, E., Fox, A., and Brewer, E. “Pinpoint: Problem Determination in Large, Dynamic Internet Services”, Proceedings of 2002 International Conference on Dependable Systems and Networks (DSN), IPDS track, Washington, DC, June 23-26, 2002.

 

 

[Cher1999] Chervenak, A., et al., “The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Data Sets,” Journal of Network and Computer Applications, Volume 23, pp. 187-200, 2001.

 

[Cher2002] Chervenak, A., et al., “Giggle: A Framework for Constructing Scalable Replica Location Services,” SC2002 Conference, Baltimore, MD USA, 2002.

 

[Cher2004] Chervenak, A.L., et al., “Performance and Scalability of a Replica Location Service,” Thirteenth IEEE Int'l Symposium High Performance Distributed Computing (HPDC-13), Honolulu, HI USA, 2004.

 

[Cher2005] Chervenak, A., Schuler, R., Kesselman, C., Koranda, S., and Moe, B., “Wide area data replication for scientific collaborations,” Proceedings of the 6th International Workshop on Grid Computing, November 2005.

 

[Chiu1998] Chiu, D., Hurst, S., Kadansky, M., and Wesley, J., “TRAM: A Tree-based Reliable Multicast Protocol,” Sun Microsystems Laboratories. SMLI TR-98-68, 1998.

 

[Choi1999] Choi, J., Choi, M., and Lee, S., “An Alarm Correlation and Fault Identification Scheme Based on OSI Managed Object Classes,” IEEE International Conference on Communications, Vancouver, BC, Canada, 1999.

 

[Chun2004] Chun, G., et al., “Benchmark Probes for Grid Assessment,” The 18th International Parallel and Distributed Processing Symposium (IPDPS'04), p. 276a, 2004.

 

[Clap2004] Clapp, G., Gannet, J., and Skoog, R., “Requirements and Design of a Dynamic Grid Networking Layer,” 2004 IEEE International Symposium on Cluster Computing and the Grid, 2004.

 

[Coll2007] Colling, D., et al., “On Quality of Service Support for Grid Computing,” The 2nd International Workshop on Distributed Cooperative Laboratories and Instrumenting the GRID (INGRID 2007), April 2007.

 

[Cond2007] “Adding high availability to Condor Central manager,” See http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html.

 

[Cox2002] Cox, W., et al., Web Services Transaction (WS-Transaction), 2002. See http://dev2dev.bea.com/pub/a/2004/01/ws-transaction.html.

 

[Cybo2006] Cybok, D., “A Grid workflow infrastructure,” Concurrency and Computation: Practice And Experience, Volume 18, Issue 10, pp. 1243–1254, 2006.

 

[Czaj1999] Czajkowski, K., Foster, I., and Kesselman, C., "Resource Co-Allocation in Computational Grids," IEEE International Symposium on High Performance Distributed Computing (HPDC-8), August 1999, pp. 219-228.

 

[Das2002] Das, A., Gupta, I., and Motivala, A., “SWIM: Scalable weakly-consistent infection-style process group membership protocol,” Proceedings of the International Conference on Dependable Systems and Networks (DSN’02), pp. 303–312, June 2002.

 

[Deel2003] Deelman, E., et al., “Mapping Abstract Complex Workflows onto Grid Environments,” Journal of Grid Computing, Volume 1, pp. 25-39, 2003.

 

[Demm1989] Demmy, W. and Petrini, A., “Statistical Process Control in Software Quality Assurance,” Proceedings of the 1989 National Aerospace and Electronics Conference, Dayton, Ohio, pp. 1585-1590, May 1989.

 

[Deri2004] Deris, M., Abawajy, J., Suzuri, H. “An efficient replicated data access approach for large-scale distributed systems,” IEEE International Symposium on Cluster Computing and the Grid, April 2004.

 

[Dull2001] Dullman, D., et al., “Models for Replica Synchronisation and Consistency in a Data Grid,” Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pp. 67-75, 2001.

 

[Duar2006] Duarte, A., Brasileiro, F., Cirne, W., and Filho, J., “Collaborative Fault Diagnosis in Grids through Automated Tests,” Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA’06), 2006.

 

[Dura2005] Durand, J., and Karmarkar, A., “Message Reliability Protocol Standards for Web Services: An Analysis,” The 3rd IEEE European Conference on Web Services (IEEE ECOWS 2005), November 2005, Växjö, Sweden.

 

[Elno2002] Elnozahy, E., Johnson, D., and Wang, Y., “A survey of rollback recovery protocols in message-passing systems,” ACM Computing Surveys, Volume 34, Issue 3, pp. 375–408, 2002.

 

[Emme2005] Emmerich, W., et al., “Grid Service Orchestration Using the Business Process Execution Language (BPEL),” Journal of Grid Computing, Volume 3, pp. 283–304, 2006.

 

[Fahr2005] Fahringer, T., et al., “ASKALON: a tool set for cluster and Grid computing,” Concurrency and Computation: Practice and Experience, Volume 17, pp. 143-169, 2005.

 

[Fang2007] Fang, C., et al., “Fault tolerant Web Services,” Journal of Systems Architecture, Volume 53, Issue 1, January 2007, pp. 21-38.

 

[Fost2005] Foster, I., et al., A Globus Primer Describing Globus Toolkit Version 4, Draft May 8, 2005. http://www.globus.org/toolkit/docs/4.0/key/GT4_Primer_0.6.pdf

 

[Frey2001] Frey, J., Tannenbaum, T., Livny, M., Foster, I., Tuecke, S., “Condor-G: A Computation Management Agent for Multi-Institutional Grids,” Proceedings of the Tenth IEEE International Symposium on High Performance Distributed Computing, San Francisco, CA, USA, August 7-9, 2001, pp. 55-67.

 

[Fox2005] Fox, G., Pallickara, S., Pierce, M., and Gadgil, H., “Building Messaging Substrates for Web and Grid Applications,” Philosophical Transactions of the Royal Society: Mathematical, Physical and Engineering Sciences (Scientific Applications of Grid Computing Special Issue), Volume 363, Issue 1833, pp. 1757–1773, 2005.

 

[Fox2006] Fox, G., “Collaboration and Community Grids,” International Symposium on Collaborative Technologies and Systems, pp. 419-428, May 2006.

 

[Gabr2003] Gabriel, E., Fagg, G., et al., "A Fault-Tolerant Communication Library for Grid Environments," Seventeenth Annual ACM International Conference on Supercomputing (ICS'03), International Workshop on Grid Computing and e-Science, San Francisco, June 2003.

 

[Glob2005] Reliable File Transfer (RFT) Service, Globus Toolkit, version 4.0, http://www.globus.org/toolkit/docs/4.0/data/rft/.

 

[Grah2002] Graham, R., et al., “A Network-Failure-Tolerant Message-Passing System For Terascale Clusters,” Proceedings of the 16th International Conference on Supercomputing, New York, USA, pp. 77–83, June 2002.

 

[Gray2004] Gray, J. and Lamport, L., “Consensus on Transaction Commit,” Microsoft Research Corporation, MSR-TR-2003-96.

 

[Grus1998] Gruschke, B., “A New Approach for Event Correlation based on Dependency Graphs,” Fifth Workshop of the OpenView University Association: OVUA’98, Rennes, France, April 1998.

 

[Gupt2001] Gupta, I., Chandra, T. D., and Goldszmidt, G. S., “On scalable and efficient distributed failure detectors,” Proceedings of 20th Annual ACM Symposium on Principles of Distributed Computing, pp. 170–179, 2001.

 

[Hill2005] Hillenbrand, M., Götze, J., and Müller, P., “Creating Dependable Web Services Using User-transparent Replica,” Proceedings of the International Conference on Next Generation Web Services Practices (NWeSP’05), 2005.


[Hilt2001] Hiltunen, M.A., Schlichting, R.D., and Ugarte, C.A., “Enhancing survivability of security services using redundancy,” Proceedings of the 2001 International Conference on Dependable Systems and Networks, pp. 173–182, July 2001.

 

[Hori2005] Horita, Y., Taura, K., and Chikayama, T., “A Scalable and Efficient Self-Organizing Failure Detector for Grid Applications,” Grid Computing Workshop, 2005.

 

[Hosc2000] Hoschek, W., et al., “Data management in an international data grid project,” Proceedings of GRID Workshop, pp. 77–90, 2000.

 

[Hued2006] Huedo, E., Montero, R. S., and Llorente, I. M., “Evaluating the reliability of computational grids from the end user's point of view,” Journal of Systems Architecture, Volume 52, Issue 12, pp. 727-736, December 2006.

 

[Hwan2003] Hwang, S., and Kesselman, C., “GridWorkflow: A Flexible Failure Handling Framework for the Grid,” Proceedings of the 12th IEEE Intl. Symposium on HPDC, 2003.

 

[Iamn2000] Iamnitchi, A. and Foster, I. "A problem specific fault tolerance mechanism for asynchronous, distributed systems," in Proceedings of 2000 International Conference on Parallel Processing (29th ICPP'00), Toronto, Canada, August 2000, IEEE.

 

[Ietf1985] File Transfer Protocol, Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 959, October 1985.

 

[Ietf1995] A Border Gateway Protocol 4 (BGP-4), Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 1771, March 1995.

 

[Ietf1999] Multicast Dissemination Protocol version 2 (MDPv2), Internet Draft, Internet Engineering Task Force, October 1999.

 

[Ietf2001] Multiprotocol Label Switching Architecture, Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 3031, January 2001.

 

[Ietf2002a] Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP), Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 3416, December 2002.

 

[Ietf2002b] Overview and Principles of Internet Traffic Engineering, Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 3272, May 2002.

 

[Ietf2002c] Applicability Statement for Traffic Engineering with MPLS, Internet Engineering Task Force (IETF), http://www.ietf.org/, RFC 3346, August 2002.

 

[Ietf2007a] Open Shortest Path First IGP (Interior Gateway Protocol), Internet Engineering Task Force (IETF), http://www.ietf.org/, 2007.

 

[Ietf2007b] NACK-Oriented Reliable Multicast (NORM) Protocol, Internet Engineering Task Force (IETF), http://www.ietf.org/, March 2007.

 

[Jain2004] Jain, A. and Shyamasundar, R., “Failure Detection and Membership Management in Grid Environments,” Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID’04), 2004.

 

[Jits2007] Jitsumoto, H., Endo, T., Matsuoka, S., "ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-8, March 2007.

 

[Jo2005] Jo, J., Seok, W., Kwak, J. and Byeon, O., “Design and Implementation of QoS Measurement and Network Diagnosing Framework for IP Multicast in Advanced Collaborative Environment,” Proceedings of the Fourth Annual ACIS International Conference on Computer and Information Science (ICIS’05), 2005.

 

[Juha2003] Juhasz, Z., Andics, A., and Szabolcs P., “Towards a Robust and Fault-Tolerant Discovery Architecture for Global Computing Grids,” Scalable Computing: Practice and Experience, Volume 6, Number 2, pp. 22-33, 2003.

 

[Keya2002] Keyani, P., Larson, B., and Senhil, M. “Peer Pressure: Distributed Recovery from Attacks in Peer-to-Peer Systems”, in Web Engineering and Peer-to-Peer Computing, Gregori, E. et al. (eds.), NETWORKING 2002 Workshops, Pisa, Italy, May 19-24, 2002, Revised Papers, Lecture Notes in Computer Science 2376 Springer 2002, ISBN 3-540-44177-8, pp. 306-320.

 

[Khar2004] Kharchenko, V., Popov, P., and Romanovsky, A., “On Dependability of Composite Web Services with Components Upgraded Online,” Proceedings of the International Conference on Dependable Systems and Networks (DSN 2004), Florence, Italy, pp. 287–291, June 2004.

 

[Koeh2003] Koehler, J., and Srivastava, B., “Web Service Composition - Current Solutions and Open Problems,” ICAPS 2003 Workshop on Planning for Web Services, pp. 28–35, 2003.

 

[Kola2005] Kola, G., Kosar, T., and Livny, M., "Faults in Large Distributed Systems and What We Can Do About Them", Proceedings of 11th European Conference on Parallel Processing (Euro-Par 2005), pp. 442-453, Lisbon, Portugal, August 2005.


[Kris2002] Krishnan S., Wagstrom P., and von Laszewski G., “GSFL: A workflow framework for Grid services,” http://www-unix.globus.org/cog/papers/gsfl-paper.pdf, January 2004.

 

[Kuo2005] Kuo, D. and Mckeown, M., “Advance Reservation and Co-Allocation Protocol For Grid Computing,” Proceedings of the First International Conference on e-Science and Grid Computing (e-Science’05), 2005.

 

[Lac2006] Lac, C. and Ramanathan, S., "A Resilient Telco Grid Middleware," Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC'06), pp. 306-311, 2006

 

[Lame2002] Lamehamedi, H., Szymanski, B., Shentu, Z., Deelman, E., “Data replication strategies in grid environments,” Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002, pp. 378-383.

 

[Lamp2001] Lamport, L., “Paxos made simple,” ACM SIGACT News (Distributed Computing Column), Volume 32, Number 4, pp. 18-25, 2001.

[Lan2002] Lan, J., Cache Consistency Techniques for Peer-to-Peer File Sharing Networks, Master’s Thesis, Department of Computer Science, University of Massachusetts Amherst, June 2002.

 

[Lanf2002] Lanfermann, G., Allen, G., Radke, T., and Seidel, E., "Nomadic Migration: Fault Tolerance in a Disruptive Grid Environment," Second IEEE/ACM International Symposium on Cluster Computing and the Grid, p. 280, May 2002.

 

[Leau2006] Tan, K. L. L., and Turner, K., “Orchestrating Grid Services using BPEL and Globus Toolkit 4,” Proceedings of the 7th PGNet Symposium, pp. 31-36, 2006.

 

[Lean2004] Leangsuksun, C., et al., “A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster,” The Fifth LCI International Conference on Linux Clusters: the HPC Revolution 2004, Austin TX USA, May 2004.

 

[Lee2001] Lee, B. and Weissman, J. B., "Dynamic Replica Management in the Service Grid," in IEEE 2nd International Workshop on Grid Computing, November 2001.

 

[Lee2003] Lee, H., et al., "Grid Fault Tolerance Service for Quality of Service", The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003.

 

[Lei2007] Lei, M., Vrbsky, S.V., and Zijie, Q., “Online Grid Replication Optimizers to Improve System Reliability,” IEEE International Parallel and Distributed Processing Symposium, March 26-30, 2007.

 

[Li2006] Li, Q., Xu, M., and Zhang, H., “A Root-fault Detection System of Grid Based on Immunology,” Proceedings of the Fifth International Conference on Grid and Cooperative Computing (GCC 2006), Changsha, China, pp. 369-373, October 2006.

 

[Lim2004] Lim, S., Fox, G., Pallickara, S., and Pierce, M., "Web Service Robust GridFTP", The 2004 International MultiConference in Computer Science and Computer Engineering, Las Vegas, NV USA, June 2004.

 

[Lima2005a] Limaye, K., Leangsuksun, C. B., et al., “Job-Site Level Fault Tolerance for Cluster and Grid environments”, The 2005 IEEE Cluster Computing, Boston, MA, September 2005.

 

[Lima2005b] Limaye, K., Tikotekar, A., and Leangsuksun, B., “Fault tolerance-enabled HPC resource management with HA-OSCAR framework,” High Availability and Performance Computing Workshop, Santa Fe, NM USA, October 2005.

 

[Liu2004] Liu, X., Xia, H., and Chien, A., “Validating and Scaling the MicroGrid: A Scientific Instrument for Grid Dynamics,” Journal of Grid Computing, Volume 2, Number 2, pp. 141-161, 2004.

 

[Liu2005] Liu, Y., Leangsuksun, C., Song, H., and Scott, S., "Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off," Proceedings of IEEE International Conference on Cluster Computing, September 2005.

 

[Look2004a] Looker, N., Munro, M., and Xu, J., “Practical Dependability Analysis of SOAP Based Systems,” Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, pp. 1126–1129, August 2004.

 

[Look2004b] Looker, N., Munro, M., and Xu, J., “WS-FIT: A Tool for Dependability Analysis of Web Services,” Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC), Hong Kong, pp. 120–123, September, 2004.

 

[Look2005] Looker, N., Burd, S., Drummond, M., and Munro, M., "Pedagogic Data as a Basis for Web Service Fault Models," IEEE International Workshop on Service-Oriented System Engineering, Beijing, China, October 20-21, 2005.

 

[Look2007] Looker, N., Munro, M., and Xu, J., "Determining the Dependability of Service-Oriented Architectures," Submitted to the International Journal of Simulation and Process Modelling, 2007.

 

[Loug2002] Loughran, Making Web Services that Work, HP Laboratories, Hewlett-Packard Corporation, HPL-2002-274, 2002.

 

[Louc1998] Louca, S., Neophytou, N., Lachanas, A., and Evripidou, P., “MPI-FT: A portable fault tolerance scheme for MPI,” Proceedings of the PDPTA ’98 International Conference, Las Vegas, Nevada 1998.

 

[Lowe2003] Lowekamp, B., et al., “Enabling Network Measurement Portability Through a Hierarchy of Characteristics,” Proceedings of the Fourth International Workshop on Grid Computing (GRID’03), 2003.

 

[Lui2006] Lui, P. and Wu, J. J., “Optimal Replica Placement Strategy for Hierarchical Data Grid Systems,” Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 417-420, 2006.

 

[Macl2006] MacLaren, J., Keown, M., and Pickles, S., “Co-Allocation, Fault Tolerance and Grid Computing,” Proceedings of the UK e-Science All Hands Meeting, pp. 155–162, 2006.

 

[Marc2001] Marchetti, C., Virgillito, A., and Baldoni, R. “Design of an Interoperable FT-CORBA Compliant Infrastructure,” Proceedings of the European Research Seminar on Advances in Distributed Systems (ERSADS), 2001.

 

[Matt2006] Mattmann, C., et al., “A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products,” National Aeronautics and Space Administration, Document 20060044153, 2006.

 

[Milo2000] Milojicic, D., Douglis, F., Paindaveine, Y., Wheeler, R., and Zhou, S., "Process Migration Survey," ACM Computing Surveys, September 2000.

 

[Mill2006] Mills, K. and Dabrowski, C., “Investigating Global Behavior in Computing Grids,” Self-Organizing Systems, Lecture Notes in Computer Science, Vol. 4124, pp. 120-136, 2006.

 

[Mill2007] Mills, K. and Dabrowski, C., “Can Economics-based Resource Allocation Prove Effective in a Computation Marketplace?” accepted for publication to the Journal of Grid Computing, 2007.

 

[Mogi2006] Mogilevsky, D., Koenig, G., Yurcik, W., “Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters,” Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), p. 30, May 2006.

 

[Mpi2003] MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, http://www.mpi-forum.org/, 2003.

 

[Oasi2004a] Business Transaction Protocol (BTP) Version 1.1, Committee Draft, June 2004.

 

[Oasi2004b] Web Services Base Faults (WS-BaseFaults), OASIS, 2004.

 

[Oasi2004c] WS-Reliability 1.1, OASIS, Committee Draft 1.086, August 2006.

 

[Oasi2006a] Web Services Business Process Execution Language (WSBPEL), OASIS WS-BPEL 2.0 Committee Draft, May 2006.

 

[Oasi2006b] Web Services Reliable Messaging (WS-ReliableMessaging), Committee Draft 04, wsrm-1.1-spec-cd-04, August 2006

 

[Oasi2007] Web Services Coordination (WS-Coordination), Version 1.1 OASIS Standard, April 2007.

 

[Ogf2004a] Networking Issues for Grid Infrastructure, Open Grid Forum Informational Document, GFD-I.037, November 2004.

 

[Ogf2004b] A Hierarchy of Network Performance Characteristics for Grid Applications and Services, Open Grid Forum, GFD-R-P.023 (Proposed Recommendation), May 2004.

 

[Ogf2005a] An Architecture for Grid Checkpoint and Recovery (GridCPR) Services and a GridCPR Application Programming Interface, Draft Document, Global Grid Forum, 2005.

 

[Ogf2005b] GridFTP v2 Protocol Description, GFD-R-P.047, Open Grid Forum, May 2005.

 

[Ogf2006a] OGSA WSRF Basic Profile 1.0, Open Grid Forum, GFD.72, September 2006.

 

[Ogf2006b] Configuration Description, Deployment, and Lifecycle Management CDDLM Deployment API, Open Grid Forum, GFD.69, April 2006.

 

[Ogsa2006c] The Open Grid Services Architecture, Version 1.5, Open Grid Forum, GFD.80, September 2006.

 

[Ogf2007a] Use-Cases and Requirements for Grid Checkpoint and Recovery, Version 1.0, Open Grid Forum, GFD-I.92, May 2007.

 

[Ogf2007b] Web Services Agreement Specification (WS-Agreement), Open Grid Forum, GFD.107, May 2007.

 

[Natr2001] Natrajan, A., Humphrey, M., and Grimshaw, A., "Capacity and Capability Computing in Legion," The 2001 International Conference on Computational Science, May 2001

 

[Neko2005] Nekovee, M., Barcellos, M., and Daw, M., “Reliable multicast for the Grid: a case study in experimental computer science,” Philosophical Transactions of the Royal Society A, Volume 10, Number 1098, 2005.

 

[Pall2005] Pallickara, S., Fox, G., and Pallickara, S.L., "An Analysis of Reliable Delivery Specifications for Web Services", International Conference on Information Technology: Coding and Computing, 2005 (ITCC 2005), Volume 1, pp. 360-365, April 2005.

 

[Pope2007] Popescu, A., Constantinescu, D., Erman, D., Ilie, D., “A Survey of Reliable Multicast Communication,” Third EuroNGI Conference on Next Generation Internet Networks, pp. 111-118, May 2007.

 

[Qui2001] Qiu, L., Padmanabhan, V., and Voelker, G. "On the Placement of Web Server Replicas", Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies - INFOCOM 2001, pp. 1587-1596.

 

[Ra2005] Ra, D., et al., “Scalable Enterprise Level Workflow Manager for the Grid,” Proceedings of the Fifth International Conference on Quality Software (QSIC’05), pp. 341-348, September 2005.

 

[Rang2001] Ranganathan, K., and Foster, I., “Identifying dynamic replication strategies for a high performance data grid,” Proceedings of the International Grid Computing Workshop, pp. 75–86, 2001.

 

[Rang2002] Ranganathan, K., Iamnitchi, A., and Foster, I., "Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities," in Global and Peer-to-Peer Computing on Large Scale Distributed Systems Workshop, Berlin, May 2002, p. 376.

 

[Rena2006] Ranaldo, N., Tretola, G., and Zimeo, E., “Hierarchical and Reliable Multicast Communication for Grid Systems,” Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo, pp. 137-144, 2005.

 

[Ripe2002] Ripeanu, M., and Foster, I., “A Decentralized, Adaptive Replica Location Mechanism,” Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, 2002.

 

[Rti2002] Research Triangle Institute, The Economic Impacts of Inadequate Infrastructure for Software Testing, May 2002.

 

[Sant2005] Santos, G. T., Lung, L.C., and Montez, C., “FTWeb: A Fault Tolerant Infrastructure for Web Services,” Proceedings of the 2005 Ninth IEEE International EDOC Enterprise Computing Conference (EDOC’05).

 

[Schn2006] Schneider, J., Linnert, B., and Burchard, L., "Distributed Workflow Management for Large-Scale Grid Environments," International Symposium on Applications and the Internet (SAINT'06), 2006, pp. 229-235.

 

[Song2007] Song, C.X., Topkara, U., Woo, J., and Park, S.K., "Assessing Reliability of Grid Software Systems Using Emergent Features," The 2nd Workshop on Reliability and Robustness in Grid Computing Systems, the 19th Open Grid Forum (OGF19), Chapel Hill, NC. January, 2007

 

[Stel1999] Stelling, P., Foster, I., Kesselman, C., Lee, C., and von Laszewski, G., “A Fault Detection Service for Wide Area Distributed Computations”, Cluster Computing, Volume 2, Number 2, 1999, pp. 117-128.

 

[Stoc2001] Stockinger, H., et al., “File and object replication in data grids,” Tenth IEEE Symposium on High Performance and Distributed Computing, pp. 305–314, 2001

 

[Sun2005] N1 Grid Engine User’s Guide, Sun Microsystems, Inc., May 2005.

 

[Tai2004] Tai, S., Mikalsen, T., and Rouvellou, I., “Using Message-Oriented Middleware for Reliable Web Services Messaging,” Lecture Notes in Computer Science, Number 3095, July 2004.

 

[Taki2005] Takizawa, S. et al., “A Scalable Multi-Replication Framework for Data Grid,” Proceedings of the 2005 Symposium on Applications and the Internet Workshops (SAINT-W’05), 2005.

 

[Tann2002] Tannenbaum, T., Wright, D., Miller, K., and Livny, M., “Condor - A Distributed Job Scheduler,” in Beowulf Cluster Computing with Linux, The MIT Press, MA, USA, 2002.

 

[Tart2002] Tartanoglu, F., Issarny, V., Romanovsky, A., Levy, N., “Dependability in the Web Service Architecture.” Proceedings of the ICSE 2002 Workshop on Architecting Dependable Systems (Orlando, Florida, USA), May 2002.

 

[Tart2003] Tartanoglu, F., Issarny, V., Romanovsky, A., Levy, N., “Coordinated Forward Error Recovery for Composite Web Services,” Proceedings of the 22nd International Symposium on Reliable Distributed Systems, SRDS (Florence, Italy), October 2003.

 

[Tauf2005] Taufer, M., Teller, P., Anderson, D., Brooks, C., “Metrics for Effective Resource Management in Global Computing Environments,” First International Conference on e-Science and Grid Computing, p. 8, December 2005.

 

[Topk2006] Topkara, U., Song, C.X., Woo, J., and Park, S.K., "Connected in a Small World: Rapid Integration of Biological Resources", Grid Computing Environments Workshop (in conjunction with Supercomputing'06), Tampa, FL, November 2006.

 

[Town2005] Townend, P., Groth, P., Looker, N., and Xu, J., “FT-Grid: A Fault-Tolerance System for e-Science,” Proceedings of the UK OST e-Science Fourth All Hands Meeting (AHM05), September 2005.

 

[Turn2007] Turner, K. and Tan, K., “Graphical Composition of Grid Services,” Lecture Notes in Computer Science 4401, pp. 1-17, Springer, Berlin, May 2007.

 

[Yemi1996] Yemini, A. and Kliger, S., “High Speed and Robust Event Correlation,” IEEE Communications Magazine, Volume 34, Number 5, pp. 82–90, May 1996.

 

[Urga2001] Urgaonkar, B., et al., “Maintaining Mutual Consistency for Cached Web Objects”, Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS-21), Phoenix, Arizona, April 2001.

 

[Valc2005] Valcarenghi, L. and Castoldi, P., “QoS-Aware Connection Resilience for Network-Aware Grid Computing Fault Tolerance”, Proceedings of 2005 7th International Conference on Transparent Optical Networks, July 3-7 2005, Barcelona, Spain, Volume 1, pp. 417-422.

 

[Verm2003] Verma, D., et al., “SRIRAM: A scalable resilient autonomic mesh”, IBM Systems Journal, Volume 42, Number 1, pp. 19-28, 2003.

 

[vonL2004] von Laszewski, G., et al., “GridAnt: A Client-Controllable Grid Workflow System,” Argonne National Laboratory Preprint ANL/MCS-P1098-1003 and Thirty-seventh Hawai’i International Conference on System Science, Island of Hawaii, Big Island, January 2004.

 

[Wald2006] Waldrich, O., Wieder, P., and Ziegler, W., “A Meta-Scheduling Service for Co-allocating Arbitrary Types of Resources,” Lecture Notes in Computer Science 3911, pp. 782-791, 2006.

 

[Wang2006] Wang, X., Zhuang, Y., Hou, H., "Byzantine Fault Tolerance in MDS of Grid System," International Conference on Machine Learning and Cybernetics, pp. 2782-2787, August 2006.

 

[Wate2004] Waters, G., Crawford, J., and Lim, S., “Optimising Multicast Structures for Grid Computing,” Computer Communications, Volume 27, pp. 1389-1400, September 2004.

 

[Woo2003] Woo, N., et al., "MPICH-GF: Providing fault tolerance on grid environments", The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), May 2003.

 

[Wsdl2001] Web Services Description Language (WSDL) 1.1, http://www.w3.org/TR/2001/NOTE-wsdl-20010315

 

[Wsrm2005] Web Services Reliable Messaging Protocol (WS-ReliableMessaging), BEA Systems, IBM, Microsoft Corporation, Inc, and TIBCO Software Inc., 2005.

 

[W3c2007] SOAP Version 1.2 Part 1: Messaging Framework (Second Edition), World Wide Web Consortium (W3C), W3C recommendation, April 2007.

 

[Xian2006] Xiang, Y., Li, Z., and Chen, H., “Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems,” Proceedings of the Fifth International Conference on Grid and Cooperative Computing Workshops (GCCW'06), 2006.

 

[Xie2004] Xie, M., Dai, Y., and Poh, K., Computing Systems Reliability, Kluwer Academic Publishers: New York, NY, U.S.A., 2004.

 

[Yeom2006] Yeom, H., “Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)”, presented at the GGF First Workshop of Reliability and Robustness in Grid Computing Systems, Athens, Greece, February 2006.

 

[Yosh2005] Yoshimoto, K., Kovatch, P., Andrews, P. "Co-Scheduling with User-Settable Reservations," LNCS, 3834 ed., Workshop on Job Scheduling Strategies for Parallel Processing, Jun. 2005, pp. 146-156.

 

[Yu2004] Yu, J., Buyya, R., “A Novel Architecture for Realizing Grid Workflow using Tuple Spaces,” The 5th IEEE/ACM International Workshop on Grid Computing (Grid 2004), Pittsburgh, USA, November 2004.

 

[Yu2005] Yu, J., and Buyya, R., “A Taxonomy of Workflow Management Systems for Grid Computing,” Technical Report GRIDS-TR-2005-1, University of Melbourne, Australia, March 10, 2005.

 

[Zand1999] Zandy, V., Miller, B., and Livny, M. "Process Hijacking," The Eighth International Symposium on High Performance Distributed Computing, pp. 177-184, August 1999.

 

[Zhan2004] Zhang, X., Zagorodnov, D., Hiltunen, M., Marzullo, K., and Schlichting, R., “Fault-tolerant Grid Services Using Primary-Backup: Feasibility and Performance”, Cluster 2004, San Diego, California, September 2004.

 

[Zhan2006a] Zhang, X., Junqueira, F., Hiltunen, M., Marzullo, K., and Schlichting, R., “Replicating Nondeterministic Services on Grid Environments,” 15th IEEE International Symposium on High Performance Distributed Computing, June 2006, pp. 105–116.


 

[Zhan2006b] Zhang, Q., et al., “Dynamic Replica Location Service Supporting Data Grid Systems,” Sixth IEEE International Conference on Computer and Information Technology (CIT'06), p. 61, 2006.