Thursday Notes
by David Bernholdt
ESG AHM MEETING NOTES 2008-05-01

<Abandoning planned agenda>

GAP ANALYSIS

VO-LEVEL AUTHZ?
Use case: A user homed on gateway A requests access to collection B1, homed at gateway B. Authorization by the collection B1 admin must be propagated to gateway A.

HOW ARE WE FEDERATING SEARCH METADATA? <Candidate breakout>
Use case: A user requests n variables and a spatio-temporal subset for a given experiment with a given forcing, where the representative data is spread across and managed by multiple gateways, some of it on archival storage.
- Is the search metadata adequate?
- The ability to deal with archival storage is challenging and critical -- we keep getting asked for this. The philosophy should be that everything works, but might be slower.
- LAS needs to understand archival storage.
- Distinguish between capability and policy. We need to talk with mass storage admins, not assume they won't let us do things.
- Provenance issues must also be addressed for the resulting data.

PUBLISHING
What is the flow of metadata/catalog information on the data node and then into the representative gateway? Will there be dynamic access?
We have enough feedback to go off, make detailed plans, and come back to the group. Roland, Eric, Bob, others?

LUCA INPUT
Data Search
- Review search facets
- How to exchange search metadata among gateways
Data Publishing
- Should we enforce a data hierarchy? (useful for browsing) <Much discussion> Who sets the hierarchy: publishers? automatic tools? Peter: "Preface" tool for visual browsing of hierarchies. Don is also familiar with such tools.
- Add search metadata fields to the publishing app
- Finalize the diagram for publishing app -- LAS/TDS -- gateway interactions
Metrics
- How to propagate metrics from data nodes to gateways
Security
- How to share policy attributes
Next priorities
- Use the publishing app to publish DyCore data at NCAR as a test of the data ingestion mechanism
- Set up a data node at PCMDI tied to the gateway at NCAR

Jeff: Fault tolerance? Don: Downtime of O(~2 days) is tolerable.
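The federated-search use case above can be made concrete with a small sketch. All field names, variable names, and gateway assignments below are hypothetical placeholders, not the actual ESG metadata schema; the point is only the shape of a multi-gateway, multi-facet query where some hits sit on archival storage.

```python
# Hypothetical sketch of a faceted search over federated catalog records.
# Field names and values are illustrative, not the real ESG schema.

from dataclasses import dataclass

@dataclass
class Record:
    variable: str        # e.g. "tas" (surface air temperature)
    experiment: str      # e.g. "20c3m"
    forcing: str         # e.g. "anthropogenic"
    years: tuple         # (start_year, end_year) temporal coverage
    gateway: str         # which gateway manages this piece of the data
    archival: bool       # True if the data sits on archival (tape) storage

CATALOG = [
    Record("tas", "20c3m", "anthropogenic", (1900, 1999), "NCAR", False),
    Record("pr",  "20c3m", "anthropogenic", (1900, 1999), "PCMDI", True),
    Record("tas", "amip",  "natural",       (1979, 2002), "ORNL", False),
]

def search(variables, experiment, forcing, start, end):
    """Return every matching record. Archival hits are included but
    flagged, per the 'everything works, but might be slower' philosophy."""
    return [r for r in CATALOG
            if r.variable in variables
            and r.experiment == experiment
            and r.forcing == forcing
            and not (end < r.years[0] or start > r.years[1])]

# n variables plus a temporal subset; results span gateways.
hits = search({"tas", "pr"}, "20c3m", "anthropogenic", 1960, 1969)
```

The query deliberately returns records from multiple gateways, one of them archival, which is exactly the case the notes flag as challenging.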
Gateways must be independent. Recent discussions have drifted from the original use cases for fault tolerance. We have never had a user complain about ESG being down.
Data management: versioning, DOIs, etc.

RACHANA: SECURITY
If a gateway is down, a user homed at that gateway cannot access ESG resources owned by that gateway. What does SSO buy us?
Steve: Why should data nodes be "owned" by specific gateway nodes?
Arie: Two separate issues: a gateway outage affects (1) the user, (2) the data.
User policy and group membership data is relatively small and could be replicated. It could be in RDF (some of it is).
Luca: We need to figure out what technology will give us the replication we need.
Nate: Maybe this will also solve the need to replicate RDF.
Peter: Possible privacy issues.
Nate: Group membership is not sensitive, but for metrics we generally need more. Either replicate the additional info, or go back to home institutions for it when processing metrics. Sites may have different release policies. We may need a formal VO-level agreement.
We expect that users will have to provide some registration to ESG (as opposed to the IdP). We should document how personal information is used and replicated: a privacy policy. Tagging and annotation have privacy issues too.
Ann: We should look at the OSG and TG privacy policies, and international ones (EGEE).
Frank: If user attributes are replicated, then any gateway could be an authZ server.
Rachana: We should contribute to OpenID4Java rather than rolling our own. Need to add callouts for attributes.

CHERVENAK: FEDERATED METADATA
What technology is LIGO using to replicate its metadata? Ann will check. Currently adding 8-10 users per day. Need "instant" updates.
Slide 2
Nate: Given Postgres, some tools are available. PGCluster for Postgres can do a global service model, and provides fault tolerance/recovery. Database replication at NCAR is complicated because it supports many projects.
Luca: Search metadata is not yet determined.
Treat attribute data separately from search metadata -- different requirements, different solutions.
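The point that user policy and group membership is small enough to replicate wholesale (so that, per Frank's remark, any gateway could act as an authZ server) can be sketched. The triple layout below is a hypothetical RDF-style illustration, not ESG's actual attribute schema, and the full-copy strategy is only a stand-in for whatever replication technology is chosen.

```python
# Hypothetical sketch: group membership held as RDF-style triples on a
# home gateway, replicated in full to a peer gateway so the peer can
# answer authorization queries even if the home gateway goes down.

home_gw_store = {
    ("user:alice", "memberOf", "group:ipcc-ar4"),
    ("user:alice", "memberOf", "group:esg-users"),
    ("user:bob",   "memberOf", "group:esg-users"),
}

def replicate(source, target):
    """Naive full-copy replication: acceptable because this data set
    is small and changes slowly (~hour propagation is fine per Gary)."""
    target.clear()
    target.update(source)

def is_member(store, user, group):
    # Any gateway holding a replica can make this authZ decision locally.
    return (user, "memberOf", group) in store

peer_gw_store = set()
replicate(home_gw_store, peer_gw_store)
```

Metrics attributes would not fit this model as directly, since (per Nate) they involve more sensitive data with per-site release policies.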
Frank: Other VOs are dealing with similar issues. There are two parts to policy. Most policy can be replicated slowly and cached, etc. Then add a blacklist service to immediately deny problem users. The blacklist is small data, easily replicated, etc.
Gary: A propagation time of ~1 hour is acceptable for user attributes.
Update topology: gateways shouldn't be disabled if other gateways go down.
Data publication happens ~quarterly (Gary), ~daily (Bob, IPCC peak).
Bloom filters won't work because they are lossy. Not clear yet whether compression is really needed. The peer-to-peer aspect of RLS technology is more interesting.

BREAKOUT SUMMARIES

PUBLICATION & METADATA
1) Svc on data node: <missed>
2) Svc on gateway: receives URLs for changes to harvest; returns success/failure
3) Publishing app tells TDS to reinitialize itself
4) Gateway tells LAS to reinitialize itself
Rachana: How do groups get synchronized? Manually for now.
The THREDDS catalog will be flat. The data provider/publisher will provide a virtual hierarchy to the gateway if desired. Each leaf dataset will have an XML file with "extra" (beyond THREDDS) metadata: search terms, ordering (hierarchy), authorization, checksums. Why two documents when THREDDS allows it to be done in one?
Eric: Leaning towards the gateway UI interrogating LAS servers to build up

THE USE CASE
Sampling frequencies are of interest to users ("I want 6h data"). Users are also interested in specific time frames ("1960-1969"). How does this fit with faceted search? Spatial domain?
Luca: Yes, we need to decide what the important facets are!
Nate: Maybe we are focusing too much on (short) query response times. What are the alternatives from the user's standpoint?
Estimating the size of results is important: people may not know how large a request is. What is reasonable may also depend on the file format being returned (e.g. 64k columns in Excel).
We will eventually need a request planning capability to figure out the (most efficient) way to carry out a request covering multiple sites, etc.
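The objection that Bloom filters are lossy can be seen in a toy example: a Bloom filter compresses set membership into a fixed bit array, so a query can answer "probably present" for an item that was never added. The variable names and hash functions below are deliberately tiny and artificial so the collision is reproducible; they are illustrative only, not a real filter design.

```python
# Toy Bloom filter demonstrating lossiness (false positives).
# A deliberately tiny bit array and simple deterministic "hashes"
# are used to force a reproducible collision.

M = 8  # number of bits -- far too small for real use, on purpose

def hashes(item):
    s = sum(ord(c) for c in item)
    return [s % M, (3 * s + len(item)) % M]

class Bloom:
    def __init__(self):
        self.bits = [False] * M

    def add(self, item):
        for h in hashes(item):
            self.bits[h] = True

    def might_contain(self, item):
        # True means "possibly present"; False is always definitive.
        return all(self.bits[h] for h in hashes(item))

bf = Bloom()
for var in ("tas", "pr"):   # publish two (hypothetical) variable names
    bf.add(var)

present = bf.might_contain("tas")        # genuinely added
absent = bf.might_contain("zg")          # correctly reported absent
false_positive = bf.might_contain("hus") # never added, yet reported present
```

Because "hus" hashes onto bits already set by "tas" and "pr", the filter wrongly claims it might be present, which is exactly why a lossy structure is unsuitable when queries must not return wrong or partial answers.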
METADATA REPLICATION
Security metadata: replicate membership and resource authorization. Try the Java Message Service (JMS) to implement this; it includes transaction support.
How does metrics reporting work if data nodes aren't explicitly owned by gateways? <something about personal information?>
Search metadata: NCAR serves multiple VOs from the same gateway; what is the data sharing policy? More discussion needed.
Nate: Replication has advantages, especially for bringing up new gateways. It is hard to get into a consistent state with a messaging approach.
Security of JMS? Depends on the implementation; probably use SSL.
Nate: Worried that the complexity of replication may introduce potential faults worse than the gateway downtimes we're trying to compensate for. Concerns about direct replication and database schema changes requiring downtime for the entire system.
Don: Look at the WMO work and how they exchange metadata.
Distributed searches as an alternative to replication?
Redux of a past agreement: queries should not return partial results.
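The distributed-search alternative interacts directly with the agreement that queries must not return partial results: if any gateway in the fan-out is unreachable, the whole query has to fail rather than silently return what the live gateways found. A minimal sketch, with hypothetical gateway names and result formats:

```python
# Sketch of distributed (fan-out) search as an alternative to
# replication, honoring "queries should not return partial results":
# one unreachable gateway fails the entire query.
# Gateway names and result strings are hypothetical.

def query_ncar(term):
    return [f"ncar:{term}/dataset1"]

def query_pcmdi(term):
    raise ConnectionError("gateway unreachable")

def federated_search(term, gateways):
    results = []
    for name, query_fn in gateways.items():
        try:
            results.extend(query_fn(term))
        except ConnectionError as exc:
            # No partial results: surface the failure instead of
            # returning only what the live gateways answered.
            raise RuntimeError(f"query failed: {name} is down") from exc
    return results

all_up = {"NCAR": query_ncar}
one_down = {"NCAR": query_ncar, "PCMDI": query_pcmdi}
```

This trade-off is the mirror image of replication: replication keeps queries local but risks inconsistency, while fan-out keeps one authoritative copy but couples query availability to every gateway being up.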