Welcome to the SAMGrid / LCG integration project


The bridge between the SAMGrid and LCG Grid communities.

The goal of the project is to achieve full interoperability with LCG and to sustain consistently efficient production use of LCG resources by running the SAMGrid job management infrastructure through forwarding (bridge) node technology.

Current WBS can be found at : LCG-To-Production-WBS.doc
Current mailing list : d0_lcg_integration@fnal.gov (list owners: harenberg@physik.uni-wuppertal.de, abaranov@fnal.gov)

List of currently participating sites.

Site name: Clermont Ferrand
Job manager address:
  clrlcgce01.in2p3.fr:2119/jobmanager-lcgpbs-dzero
  clrlcgce02.in2p3.fr:2119/jobmanager-lcgpbs-dzero
IP range: -
Status: Stress test MC requests ran successfully.
Contact person(s): kurca@in2p3.fr, lebrun@in2p3.fr
CPU available: 160

Site name: IN2P3
Job manager address:
  cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
  cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
  cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
IP range: -
Status: Stress test MC requests ran successfully.
Contact person(s): kurca@in2p3.fr, lebrun@in2p3.fr
CPU available: 1500 *shared

Site name: NIKHEF
Job manager address:
  tbn20.nikhef.nl:2119/jobmanager-pbs-qshort
  tbn20.nikhef.nl:2119/jobmanager-pbs-qlong
IP range: 192.16.186.128 | 192.16.186.256
Status: Stress test MC requests ran successfully. Observed data access speed to IN2P3 was 10x slower than from CF and IN2P3.
Contact person(s): a03@nikhef.nl, templon@nikhef.nl
CPU available: 200 *shared

Site name: Imperial
Job manager address:
  gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-dzero
IP range: 148.88.81.99, 148.88.81.100, 155.198.216.111 | 155.198.216.149
Status: Jobs failed due to lack of scratch space.
Contact person(s): f.villeneuve@imperial.ac.uk, d.colling@imperial.ac.uk
CPU available: 56

Site name: Manchester
Job manager address:
  bohr0001.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-dzero
IP range: 195.194.104.0/24, 195.194.105.0/24, 195.194.106.0/24, 195.194.107.0/24, 195.194.108.0/24, 195.194.109.0/24, 195.194.110.0/24, 194.36.3.0/24
Status: Trying to contact Sabah to get more information on cluster status.
Contact person(s): sabah@fnal.gov
CPU available: 2000

Site name: Lancaster
Job manager address:
  fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-dzero
IP range: -
Status: Stress test MC requests ran successfully.
Contact person(s): p.love@lancaster.ac.uk
CPU available: 394

Site name: Prague
Job manager address:
  golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgd0prod
IP range: -
Status: Upgrade scheduled on Dec 12.
Contact person(s): svecj@fzu.cz, kurca@in2p3.fr
CPU available: 100

Site name: Wuppertal
Job manager address:
  grid-ce.physik.uni-wuppertal.de:2119/jobmanager-lcgpbs-an_long
  grid-ce.physik.uni-wuppertal.de:2119/jobmanager-lcgpbs-an_med
  grid-ce.physik.uni-wuppertal.de:2119/jobmanager-lcgpbs-an_shrt
IP range: -
Status: Started testing.
Contact person(s): meder@physik.uni-wuppertal.de
CPU available: 512

Monitoring link : https://cic.in2p3.fr/index.php?id=rc&subid=rc_activity&js_status=2 (a valid certificate is needed to access this page)
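
The IP ranges in the site list are the kind of data a firewall rule or an access check would consume. As a minimal sketch, assuming the ranges are given in CIDR notation as in the Manchester entry, the check below tests whether a worker node address belongs to a site. The function name `in_site_range` and the two sample addresses are illustrative only and are not part of any SAMGrid or LCG package.

```python
import ipaddress

# Address blocks copied from the Manchester entry of the site list.
MANCHESTER_BLOCKS = [
    "195.194.104.0/24",
    "195.194.105.0/24",
    "195.194.106.0/24",
    "195.194.107.0/24",
    "195.194.108.0/24",
    "195.194.109.0/24",
    "195.194.110.0/24",
    "194.36.3.0/24",
]


def in_site_range(address, blocks):
    """Return True if the address falls inside any of the CIDR blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(block) for block in blocks)


if __name__ == "__main__":
    # Hypothetical worker node addresses, chosen only to exercise the check.
    for candidate in ("195.194.105.17", "192.16.186.130"):
        print(candidate, in_site_range(candidate, MANCHESTER_BLOCKS))
```

Sites that list explicit start and end addresses instead of CIDR blocks would need a different comparison; this sketch does not cover that case.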

Milestones

Milestone name: Complete extension of the test bed to production size.
Status: Completed stress test with 330 jobs running at the CF, IN2P3 and NIKHEF clusters. Sub-stress tests for Prague and Lancaster (downloads/OctSAMGridLCGStressTest.html).
Expected date: Mid Nov.

Milestone name: Consistent efficiency of the Monte Carlo production jobs.
Status: -
Expected date: Mid Jan.

Milestone name: Continuous usage of the infrastructure by experiment operators.
Status: -
Expected date: -

* Lower number - higher priority

Task list

Task name: Station polling interfaces -> production
Description: Improve the pilot implementation of the SAM polling interfaces to support better diagnostics and to increase tolerance of central system failures (name server).
Status: -
Start date: Mid Oct
Tentative release date: Mid Dec
Software releases/deliverable: -
Contributor: Andrew
Pri.*: 4

Task name: Scalability. New DB server
Description: Decouple LCG and SAMGrid production activities. The premise is the increased cost of diagnosing LCG-submitted jobs.
Status: done
Start date: current
Tentative release date: End Oct
Software releases/deliverable: -
Contributor: Andrew, Steeve White
Pri.*: 2

Task name: Mission critical. MC merge output storage selection
Description: Be able to store merged MC data using the SAMGrid storage selection configuration.
Status: done
Start date: End Oct
Tentative release date: Start Nov
Software releases/deliverable: -
Contributor: Andrew
Pri.*: 1

Task name: Manageability. Cred. management. Integration with MyProxy
Description: Avoid shipping the user proxy with the job. Re-use the MyProxy solution to delegate the task.
Status: done
Start date: End Nov
Tentative release date: End Dec
Software releases/deliverable: -
Contributor: Sudhamsh, Jeff Templon, sites
Pri.*: 4

Task name: Prod quality. Regression testing. Certification.
Description: The high-level SAMGrid support model seems to call for a centralized support center that needs tools to certify and test individual sites in order to resolve ongoing operational claims. It is also important to keep site status up to date with respect to changes in LCG and SAMGrid software.
Status: jim_stats package is about to be released with changes that enable automatic profiling of the submitted jobs.
Start date: Mid Oct.
Tentative release date: Mid November
Software releases/deliverable: -
Contributor: Sudhamsh/Andrew/Torsten/Sites
Pri.*: 2

Task name: Prod quality. Stress testing.
Description: Identify bottlenecks in the submission and data handling components. A prerequisite for procuring optional deployments of additional bridge nodes if the results fail to satisfy production throughput requirements.
Status: done.
Start date: Start Nov
Tentative release date: Mid Nov
Software releases/deliverable: Set of JDLs that schedule MC jobs to all LCG sites. Results and analysis. OctSAMGridLCGStressTest.html
Contributor: Sudhamsh/Andrew/Torsten/Sites
Pri.*: 3

Task name: Manageability. Cluster tagging.
Description: LCG software and hardware changes/upgrades make it difficult to track the LCG resource pool for D0. A "tag" layer is needed to isolate the LCG job manager name/address from the LCG broker string. There is also a need to sub-select the pre-certified sites for a particular task (MC / merge / reco / reco merge).
Status: planning
Start date: Start Nov
Tentative release date: ?
Software releases/deliverable: -
Contributor: Andrew/Torsten/Sites
Pri.*: 4

Task name: Passing LCG scheduling parameters from SAM JDL to LCG JDL.
Description: Important deployment feature to enable individual regression/stress testing of an LCG site.
Status: done
Start date: Start Oct
Tentative release date: 10/09/05
Software releases/deliverable: jim_client v2_1_30, sam_batch_adapters x47
Contributor: Sudhamsh
Pri.*: 2

Task name: Verbose LCG output retrieval.
Description: Store as much persistent diagnostics as possible for debugging/problem resolution.
Status: done
Start date: Start Oct
Tentative release date: 10/07/05
Software releases/deliverable: sam_batch_adapters x47
Contributor: Sudhamsh
Pri.*: 2

Task name: Sys. expansion. New station storage at ccin2p3-grid1.
Description: Pending the stress test results. Additional storage bandwidth may be required to sustain production rates.
Status: done
Start date: Start Dec
Tentative release date: Mid Feb
Software releases/deliverable: -
Contributor: Andrew/Tibor
Pri.*: 4

Task name: Set number of LCG jobs that failed due to the LCG output handling.
Description: WBS 3.1.1.2
Status: in progress
Start date: Start
Tentative release date: End Oct.
Software releases/deliverable: Mid Nov.
Contributor: Sudhamsh/Parag/Andrew
Pri.*: 3

Task name: Job termination on deadline.
Description: To ensure predictability of the recovery process, LCG jobs should not be run past the point at which the recovery procedure is known to take over.
Status: done
Start date: Mid Nov.
Tentative release date: End Nov.
Software releases/deliverable: -
Contributor: Sudhamsh
Pri.*: 4

Task name: Manchester storage element.
Description: Scale our storage element infrastructure (currently deployed at IN2P3) by adding a 2 TB storage element in Manchester for SAMGrid/LCG.
Status: done
Start date: End Oct.
Tentative release date: Mid Feb.
Software releases/deliverable: -
Contributor: Sabah, Andrew, Tibor
Pri.*: 3

Task name: Be able to select the "closest" storage element with respect to the job running location.
Description: To optimize usage of the network bandwidth, a job should select the storage that is "closest" to the site it is currently running at.
Status: not started
Start date: Beg. Nov
Tentative release date: End Dec.
Software releases/deliverable: -
Contributor: Andrew, Sudhamsh, Gabrielle
Pri.*: 3

Task name: Accounting
Description: Report LCG cluster resource usage to the collaboration.
Status: done
Start date: Beg Nov.
Tentative release date: End Dec.
Software releases/deliverable: MC_LCG_Accounting.doc
Contributor: Gavin/Jeff Templon/Mike Diesburg
Pri.*: 3

Task name: SAMGrid Production LCG forwarding node deployment.
Description: Separate production and test bed development activities by dedicating a new head node to production.
Status: done
Start date: Mid Nov.
Tentative release date: Mid Dec.
Software releases/deliverable: old forwarding node is available as "test_prd"
Contributor: Torsten LCG, SAMGrid team
Pri.*: 2
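
The cluster tagging task in the list above asks for a "tag" layer between job manager addresses and the broker string, plus a way to sub-select pre-certified sites per task type. The following is a minimal sketch of that idea only, not the project's implementation. The tag names, the `CERTIFIED` table and the function `job_managers_for` are hypothetical; the job manager addresses are taken from the site list, and the certification assignments are invented purely for illustration.

```python
# Hypothetical tag table: a stable tag name maps to the current job manager
# addresses of a site. Addresses are copied from the site list; the tag
# names themselves are assumptions.
SITE_TAGS = {
    "clermont": [
        "clrlcgce01.in2p3.fr:2119/jobmanager-lcgpbs-dzero",
        "clrlcgce02.in2p3.fr:2119/jobmanager-lcgpbs-dzero",
    ],
    "lancaster": [
        "fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-dzero",
    ],
    "nikhef": [
        "tbn20.nikhef.nl:2119/jobmanager-pbs-qshort",
        "tbn20.nikhef.nl:2119/jobmanager-pbs-qlong",
    ],
}

# Hypothetical certification record: which tags are pre-certified for which
# task type. These assignments are illustrative, not real site status.
CERTIFIED = {
    "mc": {"clermont", "lancaster", "nikhef"},
    "merge": {"clermont"},
}


def job_managers_for(task, tags=SITE_TAGS, certified=CERTIFIED):
    """Resolve a task type to the job manager addresses of its certified sites."""
    if task not in certified:
        raise KeyError(f"unknown task type: {task}")
    addresses = []
    # Sort the tags so the result is deterministic across runs.
    for tag in sorted(certified[task]):
        addresses.extend(tags[tag])
    return addresses


if __name__ == "__main__":
    for task in ("mc", "merge"):
        print(task, job_managers_for(task))
```

With such a layer, a job manager rename after an LCG upgrade would only require an update to the tag table, while job descriptions that refer to the tag or the task type stay unchanged.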

Stress test schedule, results.

Date: 25 Oct 2005
Location: downloads/OctSAMGridLCGStressTest.html


Release cut. Feb cut.

Package name                         release
jim_client (upgrade instructions)    v2_1_53

---- LCG forwarding node software releases, for SAMGrid/LCG admins only ----

jim_job_managers                     v2_2_82
sam_client                           v1_0_66_poll
sam_fcp                              v1_0_25
jim_config                           v1_2_18



$Id: SAMGridLCGStatus.html,v 1.17 2006/03/09 23:13:52 abaranov Exp $