Thursday Notes
by David Bernholdt
ESG AHM MEETING NOTES 2008-05-01

<Abandoning planned agenda>

GAP ANALYSIS

VO-LEVEL AUTHZ?
Use case: A user homed on gateway A requests access to collection B1, homed at gateway B. Authorization by the collection B1 admin must be propagated to gateway A.

HOW ARE WE FEDERATING SEARCH METADATA? <Candidate breakout>
Use case: A user requests n variables and a spatio-temporal subset for a given experiment with a given forcing, where the representative data is spread across and managed by multiple gateways, some of it on archival storage.
- Is the search metadata adequate?
- The ability to deal with archival storage is challenging and critical -- we keep getting asked for this. The philosophy should be that everything works, but might be slower.
- LAS needs to understand archival storage.
- Distinguish between capability and policy. We need to talk with mass storage admins, not assume they won't let us do things.
- Provenance issues must also be addressed for the resulting data.

PUBLISHING
What is the flow of metadata/catalog information on the data node and then into the representative gateway? Will there be dynamic access?
We have enough feedback to go off, make detailed plans, and come back to the group. Roland, Eric, Bob, others?

LUCA INPUT
Data Search
- Review search facets
- How to exchange search metadata among gateways
Data Publishing
- Should we enforce a data hierarchy? (useful for browsing) <Much discussion> Who sets the hierarchy: publishers? automatic tools? Peter: "Preface" tool for visual browsing of hierarchies. Don is also familiar with such tools.
- Add search metadata fields to the publishing app
- Finalize the diagram for publishing app -- LAS/TDS -- gateway interactions
Metrics
- How to propagate metrics from data nodes to gateways
Security
- How to share policy attributes
Next priorities
- Use the publishing app to publish DyCore data at NCAR as a test of the data ingestion mechanism
- Set up a data node at PCMDI tied to the gateway at NCAR

Jeff: Fault tolerance? Don: Downtime of O(~2 days) is tolerable.
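The federated-search use case above can be made concrete with a small sketch. All field names, variable names, and gateway assignments below are hypothetical placeholders, not the actual ESG metadata schema; the point is only the shape of a multi-gateway, multi-facet query where some hits sit on archival storage.

```python
# Hypothetical sketch of a faceted search over federated catalog records.
# Field names and values are illustrative, not the real ESG schema.

from dataclasses import dataclass

@dataclass
class Record:
    variable: str        # e.g. "tas" (surface air temperature)
    experiment: str      # e.g. "20c3m"
    forcing: str         # e.g. "anthropogenic"
    years: tuple         # (start_year, end_year) temporal coverage
    gateway: str         # which gateway manages this piece of the data
    archival: bool       # True if the data sits on archival (tape) storage

CATALOG = [
    Record("tas", "20c3m", "anthropogenic", (1900, 1999), "NCAR", False),
    Record("pr",  "20c3m", "anthropogenic", (1900, 1999), "PCMDI", True),
    Record("tas", "amip",  "natural",       (1979, 2002), "ORNL", False),
]

def search(variables, experiment, forcing, start, end):
    """Return every matching record. Archival hits are included but
    flagged, per the 'everything works, but might be slower' philosophy."""
    return [r for r in CATALOG
            if r.variable in variables
            and r.experiment == experiment
            and r.forcing == forcing
            and not (end < r.years[0] or start > r.years[1])]

# n variables plus a temporal subset; results span gateways.
hits = search({"tas", "pr"}, "20c3m", "anthropogenic", 1960, 1969)
```

The query deliberately returns records from multiple gateways, one of them archival, which is exactly the case the notes flag as challenging.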
Gateways must be independent. Recent discussions have drifted from the original use cases for fault tolerance. We have never had a user complain about ESG being down.
Data management: versioning, DOIs, etc.

RACHANA: SECURITY
If a gateway is down, a user homed at that gateway cannot access ESG resources owned by that gateway. What does SSO buy us?
Steve: Why should data nodes be "owned" by specific gateway nodes?
Arie: Two separate issues: a gateway outage affects (1) the user, (2) the data.
User policy and group membership data is relatively small and could be replicated. It could be in RDF (some of it is).
Luca: We need to figure out what technology will give us the replication we need.
Nate: Maybe this will also solve the need to replicate RDF.
Peter: Possible privacy issues.
Nate: Group membership is not sensitive, but for metrics we generally need more. Either replicate the additional info, or go back to home institutions for it when processing metrics. Sites may have different release policies. We may need a formal VO-level agreement.
We expect that users will have to provide some registration to ESG (as opposed to the IdP). We should document how personal information is used and replicated: a privacy policy. Tagging and annotation have privacy issues too.
Ann: We should look at the OSG and TG privacy policies, and international ones (EGEE).
Frank: If user attributes are replicated, then any gateway could be an authZ server.
Rachana: We should contribute to OpenID4Java rather than rolling our own. Need to add callouts for attributes.

CHERVENAK: FEDERATED METADATA
What technology is LIGO using to replicate its metadata? Ann will check. Currently adding 8-10 users per day. Need "instant" updates.
Slide 2
Nate: Given Postgres, some tools are available. PGCluster for Postgres can do a global service model, and provides fault tolerance/recovery. Database replication at NCAR is complicated because it supports many projects.
Luca: Search metadata is not yet determined.
Treat attribute data separately from search metadata -- different requirements, different solutions.
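The point that user policy and group membership is small enough to replicate wholesale (so that, per Frank's remark, any gateway could act as an authZ server) can be sketched. The triple layout below is a hypothetical RDF-style illustration, not ESG's actual attribute schema, and the full-copy strategy is only a stand-in for whatever replication technology is chosen.

```python
# Hypothetical sketch: group membership held as RDF-style triples on a
# home gateway, replicated in full to a peer gateway so the peer can
# answer authorization queries even if the home gateway goes down.

home_gw_store = {
    ("user:alice", "memberOf", "group:ipcc-ar4"),
    ("user:alice", "memberOf", "group:esg-users"),
    ("user:bob",   "memberOf", "group:esg-users"),
}

def replicate(source, target):
    """Naive full-copy replication: acceptable because this data set
    is small and changes slowly (~hour propagation is fine per Gary)."""
    target.clear()
    target.update(source)

def is_member(store, user, group):
    # Any gateway holding a replica can make this authZ decision locally.
    return (user, "memberOf", group) in store

peer_gw_store = set()
replicate(home_gw_store, peer_gw_store)
```

Metrics attributes would not fit this model as directly, since (per Nate) they involve more sensitive data with per-site release policies.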
Frank: Other VOs are dealing with similar issues. There are two parts to policy. Most policy can be replicated slowly and cached, etc. Then add a blacklist service to immediately deny problem users. The blacklist is small data, easily replicated, etc.
Gary: A propagation time of ~1 hour is acceptable for user attributes.
Update topology: gateways shouldn't be disabled if other gateways go down.
Data publication happens ~quarterly (Gary), ~daily (Bob, IPCC peak).
Bloom filters won't work because they are lossy. Not clear yet whether compression is really needed. The peer-to-peer aspect of RLS technology is more interesting.

BREAKOUT SUMMARIES

PUBLICATION & METADATA
1) Svc on data node: <missed>
2) Svc on gateway: receives URLs for changes to harvest; returns success/failure
3) Publishing app tells TDS to reinitialize itself
4) Gateway tells LAS to reinitialize itself
Rachana: How do groups get synchronized? Manually for now.
The THREDDS catalog will be flat. The data provider/publisher will provide a virtual hierarchy to the gateway if desired. Each leaf dataset will have an XML file with "extra" (beyond THREDDS) metadata: search terms, ordering (hierarchy), authorization, checksums. Why two documents when THREDDS allows it to be done in one?
Eric: Leaning towards the gateway UI interrogating LAS servers to build up

THE USE CASE
Sampling frequencies are of interest to users ("I want 6h data"). Users are also interested in specific time frames ("1960-1969"). How does this fit with faceted search? Spatial domain?
Luca: Yes, we need to decide what the important facets are!
Nate: Maybe we are focusing too much on (short) query response times. What are the alternatives from the user's standpoint?
Estimating the size of results is important: people may not know how large a request is. What is reasonable may also depend on the file format being returned (e.g. 64k columns in Excel).
We will eventually need a request planning capability to figure out the (most efficient) way to carry out a request covering multiple sites, etc.
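The objection that Bloom filters are lossy can be seen in a toy example: a Bloom filter compresses set membership into a fixed bit array, so a query can answer "probably present" for an item that was never added. The variable names and hash functions below are deliberately tiny and artificial so the collision is reproducible; they are illustrative only, not a real filter design.

```python
# Toy Bloom filter demonstrating lossiness (false positives).
# A deliberately tiny bit array and simple deterministic "hashes"
# are used to force a reproducible collision.

M = 8  # number of bits -- far too small for real use, on purpose

def hashes(item):
    s = sum(ord(c) for c in item)
    return [s % M, (3 * s + len(item)) % M]

class Bloom:
    def __init__(self):
        self.bits = [False] * M

    def add(self, item):
        for h in hashes(item):
            self.bits[h] = True

    def might_contain(self, item):
        # True means "possibly present"; False is always definitive.
        return all(self.bits[h] for h in hashes(item))

bf = Bloom()
for var in ("tas", "pr"):   # publish two (hypothetical) variable names
    bf.add(var)

present = bf.might_contain("tas")        # genuinely added
absent = bf.might_contain("zg")          # correctly reported absent
false_positive = bf.might_contain("hus") # never added, yet reported present
```

Because "hus" hashes onto bits already set by "tas" and "pr", the filter wrongly claims it might be present, which is exactly why a lossy structure is unsuitable when queries must not return wrong or partial answers.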
METADATA REPLICATION
Security metadata: replicate membership and resource authorization. Try the Java Message Service (JMS) to implement this; it includes transaction support.
How does metrics reporting work if data nodes aren't explicitly owned by gateways? <something about personal information?>
Search metadata: NCAR serves multiple VOs from the same gateway; what is the data sharing policy? More discussion needed.
Nate: Replication has advantages, especially for bringing up new gateways. It is hard to get into a consistent state with a messaging approach.
Security of JMS? Depends on the implementation; probably use SSL.
Nate: Worried that the complexity of replication may introduce potential faults worse than the gateway downtimes we're trying to compensate for. Concerns about direct replication and database schema changes requiring downtime for the entire system.
Don: Look at the WMO work and how they exchange metadata.
Distributed searches as an alternative to replication?
Redux of a past agreement: queries should not return partial results.
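The distributed-search alternative interacts directly with the agreement that queries must not return partial results: if any gateway in the fan-out is unreachable, the whole query has to fail rather than silently return what the live gateways found. A minimal sketch, with hypothetical gateway names and result formats:

```python
# Sketch of distributed (fan-out) search as an alternative to
# replication, honoring "queries should not return partial results":
# one unreachable gateway fails the entire query.
# Gateway names and result strings are hypothetical.

def query_ncar(term):
    return [f"ncar:{term}/dataset1"]

def query_pcmdi(term):
    raise ConnectionError("gateway unreachable")

def federated_search(term, gateways):
    results = []
    for name, query_fn in gateways.items():
        try:
            results.extend(query_fn(term))
        except ConnectionError as exc:
            # No partial results: surface the failure instead of
            # returning only what the live gateways answered.
            raise RuntimeError(f"query failed: {name} is down") from exc
    return results

all_up = {"NCAR": query_ncar}
one_down = {"NCAR": query_ncar, "PCMDI": query_pcmdi}
```

This trade-off is the mirror image of replication: replication keeps queries local but risks inconsistency, while fan-out keeps one authoritative copy but couples query availability to every gateway being up.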