D0 Databases Meeting: 1/6, 2003 (Mon) at 10:30 am, 9th Circle

Present: Anil Kumar, Robert Illingworth, Taka Yasuda, Stephen White,
Margherita Vittone, Jeremy Simmons, Adam Lyons, Lee Lueking, Diana Bonham,
Elizabeth Gallas, Chip Brock (via video)

o DAN server deployment: Robert

Robert has started testing the proxy server at Imperial College. His note
follows:

Documentation
-------------
Generally, more detail is needed:

- It is not clear which (or how many) server names to enter at the ups
  tailor stage - a list is needed.
- Tailor asks for the product's environment variable prefix - this should
  be mentioned (even if we just say to accept the default).
- Installing db_server_config installs d0_config, so it doesn't need to be
  listed explicitly. However, calibration_db_server was not automatically
  installed.
- Guidance on configuring the logfile/statistics/cache directories is
  needed (presumably suggesting a setup similar to the Fermilab servers).
- More explicit instructions on creating the file cache are needed - they
  are stuck at the bottom of the server configuration page, where they are
  easy to miss.
- As buildcache.py defaults to 1M cache entries, we should warn anyone
  trying it that building a cache this large takes a very long time. (One
  million is probably too large anyway; a nice extra feature would be the
  ability to also specify a maximum disk size.)
- Tell the installer how to run the test scripts, so they know whether
  their installation works.
- A note is probably needed on certain Red Hat installations with useless
  /etc/hosts entries, and the solution or workarounds.
- A single step-by-step list of the whole procedure would probably help.

Code bugs and problems
----------------------
(Illustrative sketches for several of these items follow at the end of this
note.)

- os.getlogin() doesn't work everywhere (apparently it is tty related). It
  appears in calibration_db_server CalibProxy.py and in most of the test
  scripts. pwd.getpwuid(os.geteuid())[0] is supposedly more reliable.
- A 1 million object cache doesn't actually work at the moment: class
  FileMgr in db_server_base attempts to mmap (filesize * entry length)
  bytes of the file, and for a 1M object cache index this is larger than
  fits in an integer, so it crashes. Even if it didn't, mmapping past the
  end of the file surely can't be a good thing.
- There's a bug in calibration_db_server BaseExceptionMap.py: in the
  function Map, the final test for unknown exceptions is wrong - it should
  be "if type(mappedException) == NoneType". Currently an unknown exception
  causes it to do "raise None", which isn't allowed and raises a TypeError
  instead.
- Fixing the previous problem revealed another issue: the exception map
  doesn't know what to do with CORBA exceptions produced by the server the
  proxy connects to, so they get returned as unknown exceptions (which
  would send out error emails if we had that turned on). CORBA exceptions
  should be packaged up and returned to the client in a more palatable
  form.
- Proxy servant timing out: the proxy creates a servant from its parent
  server and hangs on to it. If the proxy is left unused for some period
  of time, the servant will be closed by the parent server's idle timer,
  and the next time the proxy tries to use it, it will get back an
  OBJECT_NOT_EXIST exception. (Since the proxy resets the servant on any
  CORBA exception, it will get a new servant the next time a proxy request
  is made, so future attempts will work.) The solution would appear to be
  for the proxy server to automatically get a new servant whenever it
  finds that the old one has gone away.
- Threads don't all die when the server is stopped. I'm puzzled why I've
  never seen this while running a server directly from the command line on
  Linux.
- Proxy connections don't produce logfile entries. This isn't a bug, but
  it does mean that proxy servers have very little in their logfiles
  (mainly entries saying the idle connection scanner is running) compared
  to servers with a database connection. This makes it look as if they
  aren't doing anything.

Not a proxy issue, but a reminder
---------------------------------
In the non-d0om part of the SMT server, the connection manager never gets
back the connections it is managing (and the connection scanner doesn't
return them either), so eventually the server runs out of connections, all
the threads are left blocked waiting for a connection to become available,
and the server locks up.
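A minimal sketch of the more portable username lookup mentioned in the
first item above; the helper name is mine, not the one used in
CalibProxy.py or the test scripts:

    import os
    import pwd

    def current_username():
        # os.getlogin() needs a controlling tty, so it fails under cron or
        # an init script; fall back to the password-database lookup.
        try:
            return os.getlogin()
        except OSError:
            return pwd.getpwuid(os.geteuid())[0]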
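For the FileMgr mmap crash, one possible fix is to map no more than the
bytes actually present in the index file. This is only a sketch of the
idea; the function name and the layout of the real FileMgr are
assumptions:

    import mmap
    import os

    def map_cache_index(path, entry_length, max_entries):
        # Map only what exists on disk instead of max_entries * entry_length,
        # which overflows (and runs past end-of-file) for a 1M-entry cache.
        f = open(path, "r+b")
        file_size = os.fstat(f.fileno()).st_size
        length = min(file_size, max_entries * entry_length)
        return mmap.mmap(f.fileno(), length)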
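The BaseExceptionMap.py fix described above amounts to testing the lookup
result for None before raising; this simplified sketch uses "is None",
which is equivalent to the "type(...) == NoneType" test suggested in the
note, and the surrounding names are illustrative rather than the actual
code:

    def Map(exception, exceptionMap, UnknownException):
        # Look up the mapped exception class; an unmapped exception must
        # not lead to "raise None", which itself raises a TypeError.
        mapped = exceptionMap.get(exception.__class__)
        if mapped is None:
            raise UnknownException(str(exception))
        raise mapped(str(exception))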
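For the servant time-out, the automatic re-acquire could look roughly like
this. The proxy attributes and the newServant() call are hypothetical;
only the OBJECT_NOT_EXIST system exception is real, and the omniORBpy
binding is an assumption:

    from omniORB import CORBA  # assuming the omniORBpy binding is in use

    def call_parent(proxy, method_name, *args):
        # Re-acquire the servant once if the parent server's idle timer
        # has already destroyed the one the proxy was holding on to.
        try:
            return getattr(proxy.servant, method_name)(*args)
        except CORBA.OBJECT_NOT_EXIST:
            proxy.servant = proxy.newServant()  # hypothetical re-acquire
            return getattr(proxy.servant, method_name)(*args)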
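For the SMT server reminder, the usual guard is to hand the connection
back in a finally clause so a failing request cannot drain the pool; the
manager interface shown here is an assumption, not the actual SMT server
code:

    def with_connection(manager, work):
        # Always return the connection to the manager, even if the work
        # raises, so the server cannot run out of connections and lock up.
        connection = manager.getConnection()  # hypothetical interface
        try:
            return work(connection)
        finally:
            manager.releaseConnection(connection)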
Just before Christmas there was a meeting on handling statistics
information generated in XML format and loaded into a database. There will
be a joint effort between D0 and CDF on a time scale of one month, and
another meeting shortly.

The d0dbsrv4 node is not configured correctly: there are two ups
installations, and this has to be cleaned up.

The user_prd servers running on d0dbsrv1 are dcoracle 1 versions. The old
ConfigDbServer runs on that node. Its ups table file has to be modified so
that it points to a specific version of dcoracle; this will allow the rest
of the servers to run with the current version, DCOracle2. Margherita is
looking into a ups option that may be used for failover.

o Sniping: Anil

CDF has implemented "sniping" of inactive ORACLE sessions. Unlike CDF, D0
has a three-tier system for database access that makes the number of
client connections almost a non-issue. Nonetheless, it was agreed in
general to snipe inactive users. Initially we will only cut off SQLPlus
sessions that have been inactive for 30 minutes. Jeremy noted that the
heaviest CPU users are SAM and SAMRead from d0ora3; these might be cron
jobs, and Lee will look into it. The next heaviest is the Luminosity
group. Once sniping is turned on, when a user is sniped an e-mail will be
sent to the user and to the d0db-snipe@fnal.gov mailing list. This mailing
list has to be created with archiving enabled.

o Trigger DB: Elizabeth

About 100 trigger lists in the DB have been in use, corresponding to ~3000
runs. The JETSET trigger list created recently for trigger studies had
last-minute changes made by hand (hacked). This caused problems: 1) the
trigger list is not in the DB, and 2) the luminosity is not normalizable.

Jeremy is almost done programming the L2 preprocessor definitions. The
next project is the interface for L2 terms and changes to the XML
generator (about two weeks of work).

o A possible new Online DB app: Jeremy

Gordon Watts and the controls group are considering storing the run
conditions information in a DB, or in SAM as an archive.

Scribe: T. Yasuda