Particle Physics Data Grid (PPDG) at Fermilab


Architecture Working Group at Fermilab, November 1999

Attendees:
James Amundson, FNAL (amundson@fnal.gov)
Luis Bernardo, LBNL (LMBernardo@lbl.gov)
Andy Hanushevsky, SLAC (abh@stanford.edu)
Miron Livny, U. Wisconsin (miron@cs.wisc.edu)
David Malon, ANL (malon@anl.gov)
Arcot (Raja) Rajasekar, SDSC (sekar@sdsc.edu)
Chuck Salisbury, ANL (salisbur@mcs.anl.gov)
Alex Sim, LBNL (asim@lbl.gov)
Arie Shoshani, LBNL (shoshani@lbl.gov)
Igor Terekhov, FNAL (terekhov@fnal.gov)
Richard Wellner, FNAL (wellner@fnal.gov)
Vicky White, FNAL (white@fnal.gov)

Request Manager Architecture

APIs: General

Request Manager

Background
During the meeting, we referred to the architecture design for the request manager developed in the previous meeting, and modified it slightly to reflect our thinking. This architecture is attached as a PowerPoint figure (11.3.Request.mgr.arch.ppt). We'll refer to it below.

Terminology
The following terminology was used to refer to various components and concepts.

Request Manager - the component responsible for accepting estimation and execution requests from the user application at a particular site. This component manages multiple requests concurrently, and uses the other services to request file transfers.

Matchmaking Service - the component that makes decisions on the best way to execute a request. That is, it decides the site from which to access each file of a given request, based on information it finds in the metadata catalog.

Metadata Catalog - the component that keeps updated information on where files reside, as well as various statistics on network speed and availability. It also contains information on which protocols each site supports (FTP, HTTP, etc.). This component includes the "replica catalog".

Resource Manager - a component associated with each resource in the system, such as disk caches, robotic tape systems, network resources, etc. One instance of this is a "disk cache manager", assumed to exist for each shared disk cache in the system.

Event Component - often, event data is partitioned into "components", such as the "hits" data, the "tracks" data, the "raw" data, etc. The reason for such partitioning is to avoid reading an entire event's data when only some of the components are needed.

File Bundle - the set of files, one for each component, that need to be available at the same time in a disk cache for processing by the application. An index is used to determine the set of bundles needed for each request.

Sequence of operations as a result of a request

The following is a sequence of operations that we foresee as a result of a typical request made by applications.

1. The application issues an estimate request.
2. The Request Interpreter (RI) consults the Logical Index Service (properties-events-files) to get the files involved in the estimate request.
3. The RI consults the Matchmaking Service for an estimate of executing the request.
4. The application issues an "execute" request.
5. The RI consults the Logical Index Service to generate the set of bundles for the request.
6. The set of {bundles:events} is passed to the Request Planner (RP).
7. The application client issues a request for the next events to process (the files of at least one bundle must be in some disk cache to process events).
8. The RP consults the Matchmaking Service to get the preferred order of file transfers.
9. The RP issues a "get file, source, destination" request to the File Transfer Manager (FTM), one file at a time.
10. The FTM issues a "reserve space" request to the Resource Reservation Service (one of the Grid Services) if the destination is a remote site.
11. The FTM issues a "move file" request to the File Access Service (one of the Grid Services).
12. The File Access Service notifies the Matchmaking Service (or the metadata catalog) of the file just moved to the disk cache.
13. The application client issues commands to read events out of the files.

The above actions were marked by number in the architecture figure. The APIs are defined to correspond to these actions.
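
To make the numbered flow concrete, the following minimal sketch (in Python) walks through steps 1-8. All class names, file names, and sites here are invented for illustration; they are not the actual PPDG interfaces.

    class LogicalIndex:
        """Stands in for the Logical Index Service (steps 2 and 5)."""
        def files_for(self, request):
            return ["hits.001", "tracks.001"]
        def bundles_for(self, request):
            # Each bundle pairs the files that must be cached together
            # with the events they cover ({bundles:events}, step 6).
            return [{"files": ["hits.001", "tracks.001"], "events": [1, 2, 3]}]

    class Matchmaker:
        """Stands in for the Matchmaking Service (steps 3 and 8)."""
        def estimate(self, files):
            return {"files": len(files), "est_seconds": 120}
        def plan(self, bundles):
            # Preferred transfer order: (file, source, destination) triples.
            return [("hits.001", "lbl.gov", "fnal.gov"),
                    ("tracks.001", "slac.edu", "fnal.gov")]

    class RequestInterpreter:
        def __init__(self, index, mms):
            self.index, self.mms = index, mms
        def estimate(self, request):
            # Steps 1-3: resolve the files, then ask the MMS for an estimate.
            return self.mms.estimate(self.index.files_for(request))
        def execute(self, request):
            # Steps 4-8: generate the bundles and obtain the transfer plan;
            # steps 9-12 (reserve, move, notify) would then act on the plan.
            return self.mms.plan(self.index.bundles_for(request))

    ri = RequestInterpreter(LogicalIndex(), Matchmaker())
    print(ri.estimate({"property": "muon"}))   # step 1
    print(ri.execute({"property": "muon"}))    # step 4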

Summary of issues discussed, conclusions, and recommendations
1. The Request Manager architecture was described using 13 function calls shown in the architecture figure. No changes were proposed to the architecture structure, except for one of the function calls (12); see the explanation in the next point.

2. It was concluded that the action of notifying the metadata catalog of a file being replicated in some cache will be done by the component that executes the file transfer. Since this component is the Grid Storage Access Service, API (12) was moved in the architecture across the Globus line (see Figure).

3. The APIs between the Application (making the request) and the Request Manager were defined before the meeting (see enclosed APIs.doc). These were discussed. It was concluded that in addition to the "priority" of a request, it is useful to permit the specification of a "job_type" and a "hint". For example, a job_type can be "on_line" or "batch", where batch can be treated as a background job. Examples of hints are that the files for this request are "clustered" on tapes, or that the processing rate per event is expected to be "fast" or "slow" (specified as seconds per MB).
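
For illustration, a request submission carrying these new fields might look like the following Python sketch; the function and field names are our own assumptions, not the agreed API.

    def submit_request(events, priority=0, job_type="batch", hint=None):
        # job_type: "on_line" or "batch"; batch may be treated as a
        # background job. Hints may note that the files are "clustered"
        # on tapes, or give the expected processing rate (seconds per MB).
        return {"events": list(events), "priority": priority,
                "job_type": job_type, "hint": hint or {}}

    # An interactive request whose files are clustered on tape and whose
    # per-event processing is expected to be fast (0.5 seconds per MB).
    req = submit_request(range(1000), priority=5, job_type="on_line",
                         hint={"clustered": True, "seconds_per_mb": 0.5})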

4. The APIs between the File Transfer Manager (previously called "Cache Manager") and the other components were also defined before the meeting. The File Transfer Manager is responsible for making reservations for resources and for scheduling file transfers or file pinning. It was agreed that performing this function a file at a time is adequate for the time being. Point 10 below describes the alternative of making aggregate resource reservations.
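
A minimal sketch of this file-at-a-time discipline follows; reserve_space and move_file are placeholders for the Grid reservation and access services (steps 10 and 11), not actual Globus calls.

    def transfer_task(task, reserve_space, move_file):
        # Handle one file at a time: reserve space at the destination
        # (step 10), then ask for the file to be moved (step 11).
        for f in task:
            reserve_space(f["destination"], f["size_mb"])
            move_file(f["name"], f["source"], f["destination"])

    # Example with trivial stand-ins for the two services.
    transfer_task(
        [{"name": "hits.001", "size_mb": 500,
          "source": "lbl.gov", "destination": "fnal.gov"}],
        reserve_space=lambda site, mb: print("reserve", mb, "MB at", site),
        move_file=lambda name, src, dst: print("move", name, src, "->", dst))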

5. Next we discussed the API between the Request Manager (RM) and the Matchmaking Service (MMS). There are two such APIs: one for the estimation of an entire request and one for the execution of a request. Both the estimation and the execution function calls pass the entire set of bundles to the MMS. For estimation, the MMS returns various estimation measures, such as the number of files found in various caches, the number of files to be transferred from one site to another, the number of files to be read from some tape system, the average rate of file streaming, and the total estimated time.
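
For concreteness, the estimation measures might be returned as a structure like the following; the field names are our own invention, based on the list above.

    estimate = {
        "files_in_cache": 40,            # files found in various caches
        "files_site_to_site": 25,        # files to transfer between sites
        "files_from_tape": 35,           # files to read from a tape system
        "streaming_rate_mb_per_s": 4.2,  # average rate of file streaming
        "total_time_s": 5400,            # total estimated time
    }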

6. For execution, the MMS checks with the metadata catalog and the various resource managers to determine how much of the request (a set of bundles) can be performed at that time. According to the available resources, it returns a subset of the files that it recommends for processing. We refer to this recommendation as a task, to indicate that a request execution may be split into multiple tasks. The files in this task are ordered to indicate the preferred order of execution of file transfers from the MMS's point of view. There is no obligation on the part of the RM to adhere to the recommendation or to execute the entire task at once. Each file in the task has a specification of source and destination, file size, and other properties (such as network bandwidth and intermediate cache requirement). The file list should maximize the number of bundles (at least one bundle is required), but other files that may not participate in bundles can be added to the task; these additional files will be needed later, and it is best to get them at the same time as the bundle files. The functionality of this API was defined during the meeting and is included here (see API_rm-mms.doc). The API is asynchronous: first, the RM calls the MMS (API 8); after the MMS figures out the plan, it calls back (API 8').

7. The API from the RM to the MMS also includes a function call "did_not_work", which notifies the MMS that one of the file transfers failed (most likely because its replica was removed in the meantime). The MMS can then come up with another recommendation for that file.

8. The API described above permits the MMS to be stateless, i.e. it does not have to remember which tasks were recommended or completed. After the RM performs part or all of a task, it can issue another call with the remaining bundles. Such a request can even be made while the current task is executing. In our discussions, we concluded that it might be beneficial if the MMS does maintain state, since it may notice that some bundles have been cached for another request and may want to notify the RM even without being asked for a plan. We realized that in this case the MMS-RM API is identical to the one we already have, so we decided to permit the MMS to call the RM multiple times with the same plan_token. However, this is only an option for a possible future implementation of the MMS.
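
The following sketch pulls points 6-8 together: the asynchronous plan request (API 8), the callback (API 8'), the "did_not_work" retry path, and the plan_token that would let a stateful MMS call back more than once. All signatures are illustrative, not the agreed IDL.

    class MMS:
        def request_plan(self, plan_token, bundles, callback):
            # API 8: the RM hands over the full set of bundles; the MMS
            # answers through the callback (API 8') with an ordered task.
            task = [{"file": f, "source": "lbl.gov", "destination": "fnal.gov"}
                    for b in bundles for f in b["files"]]
            callback(plan_token, task)
        def did_not_work(self, plan_token, file_name, callback):
            # A transfer failed (e.g. the replica was removed); the MMS
            # comes up with another recommendation for that file.
            callback(plan_token, [{"file": file_name, "source": "slac.edu",
                                   "destination": "fnal.gov"}])

    def on_plan(plan_token, task):
        # API 8': the RM may execute any part of the task, in any order.
        print(plan_token, task)

    mms = MMS()
    mms.request_plan("plan-42", [{"files": ["hits.001"], "events": [1]}], on_plan)
    mms.did_not_work("plan-42", "hits.001", on_plan)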

9. The issue of who should make the reservations was discussed. It makes sense for the MMS to make the reservations for the task it recommends. However, we decided that we want the RM to have the freedom to execute any part of the task in order to benefit the application. For example, it may want to get at least one bundle first, so that processing can start as soon as possible, or it may want to pace the file transfers because the processing is slow and it does not want to flood its local cache with files that will not be accessed for a long time. We felt that this level of discretion should be given to site RMs, but in general RMs should try to follow the recommended plan.

10. There was a long discussion as to whether it would be beneficial to ask the Globus Storage Reservation Service for an aggregate resource reservation. For example, would it be useful to reserve the aggregate space on some cache necessary for reading all the needed files off some tape? Of course, this is a good idea. However, if we take into account that the Matchmaking Service has already checked that the space is available, it is sufficient to claim that space one chunk at a time. This choice simplifies the various APIs. We note that the option of aggregate resource reservation (and perhaps aggregate file transfer) may be necessary for better flow optimization.

Other recommendations
1. It is recommended that the Globus and SRB efforts be coordinated to provide uniform interfaces to their functionality. In this way, the RM and MMS could interact with either system using the same APIs. CORBA interfaces, defined in IDL, may be preferable, as CORBA is an industry-wide standard.

2. Similarly, it is recommended that the APIs of the Globus LDAP metadata catalog and SRB's MCAT be coordinated and standardized.

3. It is recommended that the metadata catalog keep information about the protocols supported by the sites where storage resources reside. Using this information, the MMS can recommend a protocol (PFTP, FTP, HTTP, etc.) that both source and destination sites support.
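
A small sketch of the intended protocol selection, assuming the catalog keeps a per-site protocol list (the sites and protocols below are made up):

    CATALOG = {"lbl.gov": ["PFTP", "FTP", "HTTP"],
               "fnal.gov": ["FTP", "HTTP"]}

    def choose_protocol(source, destination, preference=("PFTP", "FTP", "HTTP")):
        # Recommend the most preferred protocol that both sites support.
        common = set(CATALOG[source]) & set(CATALOG[destination])
        for p in preference:
            if p in common:
                return p
        raise RuntimeError("no common protocol for %s -> %s" % (source, destination))

    print(choose_protocol("lbl.gov", "fnal.gov"))   # -> FTP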

Open issues
1. Authorization was discussed, and there was much confusion as to how it will be performed. Specifically, the following situation needs clarification. The RM supports multiple users on its site. It makes the reservations and file transfer requests on behalf of these users. Now, suppose that a user logged in and was authenticated by Globus. When the RM requests a reservation or file transfer on behalf of that user, what does it have to pass Globus for Globus to authenticate the user and permit the transactions? How does Globus perform the authorization to the remote site if the user was not previously authenticated on that site? Does it have to perform authentication and authorization for each reservation and file transfer request?

2. How do we handle application failures? One option is for a resource manager to notice that a file has not been touched in a long time (set as a system parameter) and at that point ask the application "are you alive?" Another option is to require the application to let the system know it is alive every so often (again a system parameter); otherwise it is assumed to be dead. Yet another option is to require the application to perform a read every so often, or the file will be closed by the resource manager. Although this issue was not decided, the last solution seems the easiest to implement, as it uses an existing function (reading a file). This is the solution implemented in OOFS, the system used for BABAR.
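
A sketch of that read-based rule, assuming an idle limit set as a system parameter; every read renews the lease, and the resource manager closes files whose limit has expired (names and values are illustrative).

    import time

    IDLE_LIMIT_S = 3600   # system parameter: allowed time between reads

    class CachedFile:
        def __init__(self, name):
            self.name = name
            self.last_read = time.time()
        def read(self):
            # Each read renews the lease, signalling the application is alive.
            self.last_read = time.time()

    def sweep(open_files):
        # Close files not read within the idle limit; the owning
        # application is then assumed to be dead.
        now = time.time()
        return [f for f in open_files if now - f.last_read <= IDLE_LIMIT_S]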

3. Some requests may be limited as to where the files can be cached (e.g. the application uses Objectivity). How do we handle such requests? Perhaps it is enough for the catalog to know about types of sites and types of files. To handle this, requests need to give a hint about their type. This issue was not resolved.

Action items
1. The API definitions need to be completed. Most of the APIs are already defined; the RM-MMS API needs to be fleshed out.

2. It was generally agreed that as soon as we agree on a version of the APIs, anybody who wishes to participate in quick experiments should do so. Some examples are discussed below.


2.1 Miron suggested setting up a "mock application module" at the U. of Wisconsin that can consult the Logical Index (using the IDL) of STACS at LBNL. The index will be modified to return a set of bundles. The mock application module will then interact with the STACS "file transfer service" to cache files to the U. of Wisconsin as quickly as possible.

2.2 The same as above could be done, except that the destination of the file transfers will be ANL.

2.3 The same as above could be done interfacing to SAM at Fermi. An index service and "file transfer service" at Fermi will be accessed instead of those at LBNL.

2.4 A similar scenario can be set up with the application running at ANL.

2.5 A future goal may be achieved when the STACS "file transfer service" is made available under Globus. Then, the file transfer request could be made directly to Globus.

2.6 If the above is successful, access from a combination of sites can be experimented with.


These simple scenarios can provide us with much insight into the performance of a large number of file transfers and the behavior of the network. Some monitoring tools will need to be developed to measure this performance.