Capacity Planning Meeting (06 February 2009)

Notes summarizing discussion at an ad hoc meeting on Capacity increase approaches and plans (v1.0)


Capacity Planning Discussion
06 February 2009
RDK, AL, KC, MEV, MD, QL, LH
Rough Notes from the meeting


Suggested Topics for Discussion:

*) Timeline for what we have discussed so far, one that includes a "gentle"
scale-up in March to determine the best system configuration for a larger-capacity
system, then implement that for peak processing in April (or whatever
approach is best)

*) More ideas to fill the resource gap that appears May through July (after
the borrowed analysis CPU is returned ... before new CPU procurements are ready
for production).

*) Identify and then strengthen the next likely bottleneck in the system
BEFORE scale-up in mid-March. Data Movement? Durable Storage? QUE/FWD nodes?

*) "Tape to CPU to Tape" data flow and "Grid Job to CPU to Grid Job return"
job flow to look for other bottlenecks that a scale-up may reveal.

*) In addition to technical bottlenecks, what operations or monitoring
weaknesses are expected to appear as a result of this plan, and how can they
be addressed?

----------------------

My Notes from discussion:

* May be able to use CDF and especially CMS CPU opportunistically... and GPFarms.
	- Concern about reduced reliability after much work decoupling data production from other users.
	- Concern about added administrative work due to the 48-hour pre-emption/eviction window, since many jobs
		nowadays (higher-luminosity part of a store) require more than 48 hours to complete. Could accept
		some trade-off between an increase in failures/retries and an overall increase in throughput... else
		triage jobs before running them ("high lum" jobs run only on dedicated queues); see the routing sketch after this list.
	- MD to test basic functionality of being able to run opportunistically. Change resource string.
	- No quota on GPFarms for opportunistic use (as of past few weeks... whatever the case was in the past).
	- Possible to allow d0reco a longer eviction time? Currently 2 days of clock time.
	- FermiGrid: ~530 free slots, of which ~300 are used by D0 MC Production --- i.e., some competition.
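
A minimal sketch of the triage idea above, in Python; the wall-time estimates, safety margin, and queue names are illustrative assumptions, not actual D0 configuration:

# Triage sketch: route a reco job based on whether it fits the 48-hour
# opportunistic eviction window. The wall-time estimates, safety margin,
# and queue names are illustrative placeholders, not actual D0 settings.

EVICTION_LIMIT_HOURS = 48.0   # pre-emption/eviction window on borrowed slots
SAFETY_MARGIN = 0.8           # only send jobs comfortably inside the window

def route_job(estimated_hours, high_lum):
    """Return a (hypothetical) target queue for a reco job."""
    if high_lum or estimated_hours > EVICTION_LIMIT_HOURS * SAFETY_MARGIN:
        return "dedicated"       # dedicated queue: no 48-hour eviction
    return "opportunistic"       # CDF/CMS/GPFarms opportunistic slots

if __name__ == "__main__":
    for hours, high_lum in [(20.0, False), (45.0, False), (60.0, True)]:
        print("%5.1f h, high_lum=%-5s -> %s" % (hours, high_lum, route_job(hours, high_lum)))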

* Some nodes on Cabsrv1 may have inadequate scratch space (should be 10 GB, some have 2 GB).
	- Ability to specify scratch space requirements in resource string is NOT there yet. Check actual nodes.
	> Buy disk? Isolate small-scratch nodes into a separate resource collection? (See the scratch-check sketch below.)
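
A minimal sketch of a per-node scratch check, assuming a placeholder /scratch mount point and the 10 GB figure from the note above:

# Scratch-space check sketch: verify a worker node has the ~10 GB of local
# scratch that jobs need. The mount point is a placeholder; this could run
# as a job prologue until scratch requirements can be expressed in the
# resource string.
import os
import sys

REQUIRED_GB = 10.0
SCRATCH_PATH = "/scratch"   # placeholder: the actual scratch path on CAB nodes may differ

def free_scratch_gb(path=SCRATCH_PATH):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / 1e9

if __name__ == "__main__":
    free_gb = free_scratch_gb()
    print("free scratch: %.1f GB (need %.0f GB)" % (free_gb, REQUIRED_GB))
    sys.exit(0 if free_gb >= REQUIRED_GB else 1)   # non-zero exit -> hold/skip the job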

* QL: Move enhanced usage of CPU on CAB2 to early March (right after March 3rd PBS head node upgrade?).
	- The anticipated pattern is that analysis will start to pick up again in April, so we cannot count on all of April.
	- Test the scale-up as much as possible before this. Just include a brief multi-day scale-up plan in March.

* What if more D0 analysis moved to the Grid? (freeing CPU for data production)
	- "Normal analysis" - will not go to the Grid. Users are set in their ways, scared off by early Grid-adopter experiences.
	- "High Volume analysis" - no choice, so already using Grid resources.

* Manage User Expectations
	- turn-around time for analysis
	- job start time even if slots are free

* PBS scheduling is still not optimal for user analysis turn-around
	- Allow a certain level of D0 Reco during the scale-up, but prefer other jobs to start up? I.e., Reco can be blocked out
		but users cannot be.
	- Service Level Agreement for analysis on CAB (CAB1 in particular?)
	- Understand the scheduler: users see their jobs sit in the queue at the same time that job slots appear free.
		The Maui scheduler is the component in question. Fair share may be involved at times. (See the toy policy sketch after this list.)
	- QL working with JA on this.
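
A toy sketch of how a per-group running-job cap (or exhausted fair share) can leave jobs queued while slots appear free; all numbers are invented for illustration and are not the actual Maui/CAB policy:

# Toy scheduler-policy sketch: one reason jobs sit queued while slots look
# free is a per-group running-job cap (or exhausted fair share), so the
# free slots are not usable by that group. All numbers are made up for
# illustration; they are not the actual Maui/CAB configuration.

def can_start(free_slots, group_running, group_cap):
    """A queued job starts only if slots are free AND its group is under its cap."""
    return free_slots > 0 and group_running < group_cap

if __name__ == "__main__":
    # 50 slots free, but the group is already at an assumed cap of 200 running jobs.
    print(can_start(free_slots=50, group_running=200, group_cap=200))   # False: job waits
    print(can_start(free_slots=50, group_running=150, group_cap=200))   # True: job starts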

* Dataflow diagram to help look for bottlenecks (RDK with input from RI)
	- correlation of bottlenecks with worker nodes? Some may only have 100 Mb network interfaces (see the transfer-time estimate below).
	- Using BlueArc? (no... only in tarball copy down)
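
A rough transfer-time estimate for a 100 Mb vs 1 Gb worker-node interface, assuming an illustrative 1 GB file and ~80% usable bandwidth:

# Back-of-envelope transfer-time sketch for worker-node network interfaces.
# The 1 GB file size and 80% usable-bandwidth factor are illustrative
# assumptions; the 100 Mb comparison point comes from the note above.

def transfer_seconds(file_gb, link_mbps, efficiency=0.8):
    """Seconds to move one file over a link at the assumed efficiency."""
    file_bits = file_gb * 8.0e9
    return file_bits / (link_mbps * 1.0e6 * efficiency)

if __name__ == "__main__":
    for mbps in (100, 1000):
        print("%4d Mb/s: %6.0f s per 1 GB file" % (mbps, transfer_seconds(1.0, mbps)))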

* Submitting to 3 CEs: how to keep up with submissions? - takes time
	- ~1 min to submit a Reco grid job.
	- ~6 to 7 min to submit a merge job. Lots of input files to check, via samg_submit. (See the submission-throughput estimate below.)
	- MD: if he uses d0mino1,2,3... then can the queuing node handle this?
	- (Update job flow diagram to represent this.)
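
A rough submission-throughput estimate using the per-job times above; the daily job counts and the even split across submission hosts are illustrative assumptions:

# Submission-throughput sketch using the per-job times quoted above
# (~1 min per reco job, ~6-7 min per merge job). The daily job counts and
# the even split across submission hosts are illustrative assumptions.

RECO_SUBMIT_MIN = 1.0
MERGE_SUBMIT_MIN = 6.5

def submission_hours(n_reco, n_merge, n_hosts=1):
    """Serial submission time per host, assuming jobs split evenly across hosts."""
    total_min = n_reco * RECO_SUBMIT_MIN + n_merge * MERGE_SUBMIT_MIN
    return total_min / n_hosts / 60.0

if __name__ == "__main__":
    # Hypothetical daily load after a 2x scale-up.
    print("1 host : %.1f hours/day" % submission_hours(400, 40, 1))
    print("3 hosts: %.1f hours/day" % submission_hours(400, 40, 3))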

Tasks to pursue in the short term
1. Understand the PBS scheduler (an indirect issue: keep analysis users happy as we ask to borrow CPU for chunks of time).
2. Test opportunistic usage to see if there are unforeseen show-stoppers.
3. Data flow diagram and the next bottleneck in a 2x larger system.
4. Job submission rate tests (see the timing sketch below).
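
A possible timing harness for the submission rate tests (task 4); the command line is a placeholder to be replaced with the real samg_submit invocation:

# Timing-harness sketch for the job submission rate tests. The command
# string is a placeholder; substitute the real samg_submit invocation.
# Records wall-clock time per submission so reco vs merge jobs, and one vs
# several submission hosts, can be compared.
import subprocess
import time

def time_submission(cmd):
    """Run one submission command; return (elapsed seconds, return code)."""
    start = time.time()
    rc = subprocess.call(cmd, shell=True)
    return time.time() - start, rc

if __name__ == "__main__":
    elapsed, rc = time_submission("echo placeholder-submit")   # replace with real command
    print("submission took %.1f s (rc=%d)" % (elapsed, rc))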

by Robert D. Kennedy last modified 2009-02-10 11:53