D0 Databases Meeting: 1/6, 2003 (Mon) at 10:30 am, 9th Circle

Present: Anil Kumar, Robert Illingworth, Taka Yasuda, Stephen White,
Margherita Vittone, Jeremy Simmons, Adam Lyons, Lee Lueking, Diana Bonham,
Elizabeth Gallas, Chip Brock (via video)

o DAN server deployment: Robert

Robert has started testing the proxy server at Imperial College. His note
follows:

Documentation
-------------
Generally, more detail is needed:

- It is not clear which (or how many) server names to enter at the ups
  tailor stage - a list is needed.
- Tailor asks for the product's environment variable prefix - this should
  be mentioned (even if we just say to accept the default).
- Installing db_server_config installs d0_config, so it doesn't need to be
  listed explicitly. However, calibration_db_server was not automatically
  installed.
- Guidance on configuring the logfile/statistics/cache directories is
  needed (presumably suggesting a setup similar to the Fermilab servers).
- More explicit instructions on creating the file cache are needed - they
  are stuck at the bottom of the server configuration page, where they are
  easy to miss.
- As buildcache.py defaults to 1M cache entries, we should warn anyone
  trying it that building a cache this large takes a very long time. (One
  million is probably too large anyway; a nice extra feature would be the
  ability to also specify a maximum disk size.)
- Tell the installer how to run the test scripts, so they know whether
  their installation works.
- A note is probably needed on certain Red Hat installations with useless
  /etc/hosts entries, and the solution or workarounds.
- A single step-by-step list of the whole procedure would probably help.

Code bugs and problems
----------------------
(Illustrative sketches for several of these items follow at the end of this
note.)

- os.getlogin() doesn't work everywhere (apparently it is tty related). It
  appears in calibration_db_server CalibProxy.py and in most of the test
  scripts. pwd.getpwuid(os.geteuid())[0] is supposedly more reliable.
- A 1 million object cache doesn't actually work at the moment: class
  FileMgr in db_server_base attempts to mmap (filesize * entry length)
  bytes of the file, and for a 1M object cache index this is larger than
  fits in an integer, so it crashes. Even if it didn't, mmapping past the
  end of the file surely can't be a good thing.
- There's a bug in calibration_db_server BaseExceptionMap.py: in the
  function Map, the final test for unknown exceptions is wrong - it should
  be "if type(mappedException) == NoneType". Currently an unknown exception
  causes it to do "raise None", which isn't allowed and raises a TypeError
  instead.
- Fixing the previous problem revealed another issue: the exception map
  doesn't know what to do with CORBA exceptions produced by the server the
  proxy connects to, so they get returned as unknown exceptions (which
  would send out error emails if we had that turned on). CORBA exceptions
  should be packaged up and returned to the client in a more palatable
  form.
- Proxy servant timing out: the proxy creates a servant from its parent
  server and hangs on to it. If the proxy is left unused for some period
  of time, the servant will be closed by the parent server's idle timer,
  and the next time the proxy tries to use it, it will get back an
  OBJECT_NOT_EXIST exception. (Since the proxy resets the servant on any
  CORBA exception, it will get a new servant the next time a proxy request
  is made, so future attempts will work.) The solution would appear to be
  for the proxy server to automatically get a new servant whenever it
  finds that the old one has gone away.
- Threads don't all die when the server is stopped. I'm puzzled why I've
  never seen this while running a server directly from the command line on
  Linux.
- Proxy connections don't produce logfile entries. This isn't a bug, but
  it does mean that proxy servers have very little in their logfiles
  (mainly entries saying the idle connection scanner is running) compared
  to servers with a database connection. This makes it look as if they
  aren't doing anything.

Not a proxy issue, but a reminder
---------------------------------
In the non-d0om part of the SMT server, the connection manager never gets
back the connections it is managing (and the connection scanner doesn't
return them either), so eventually the server runs out of connections, all
the threads are left blocked waiting for a connection to become available,
and the server locks up.
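A minimal sketch of the more portable username lookup mentioned in the
first item above; the helper name is mine, not the one used in
CalibProxy.py or the test scripts:

    import os
    import pwd

    def current_username():
        # os.getlogin() needs a controlling tty, so it fails under cron or
        # an init script; fall back to the password-database lookup.
        try:
            return os.getlogin()
        except OSError:
            return pwd.getpwuid(os.geteuid())[0]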
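For the FileMgr mmap crash, one possible fix is to map no more than the
bytes actually present in the index file. This is only a sketch of the
idea; the function name and the layout of the real FileMgr are
assumptions:

    import mmap
    import os

    def map_cache_index(path, entry_length, max_entries):
        # Map only what exists on disk instead of max_entries * entry_length,
        # which overflows (and runs past end-of-file) for a 1M-entry cache.
        f = open(path, "r+b")
        file_size = os.fstat(f.fileno()).st_size
        length = min(file_size, max_entries * entry_length)
        return mmap.mmap(f.fileno(), length)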
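The BaseExceptionMap.py fix described above amounts to testing the lookup
result for None before raising; this simplified sketch uses "is None",
which is equivalent to the "type(...) == NoneType" test suggested in the
note, and the surrounding names are illustrative rather than the actual
code:

    def Map(exception, exceptionMap, UnknownException):
        # Look up the mapped exception class; an unmapped exception must
        # not lead to "raise None", which itself raises a TypeError.
        mapped = exceptionMap.get(exception.__class__)
        if mapped is None:
            raise UnknownException(str(exception))
        raise mapped(str(exception))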
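For the servant time-out, the automatic re-acquire could look roughly like
this. The proxy attributes and the newServant() call are hypothetical;
only the OBJECT_NOT_EXIST system exception is real, and the omniORBpy
binding is an assumption:

    from omniORB import CORBA  # assuming the omniORBpy binding is in use

    def call_parent(proxy, method_name, *args):
        # Re-acquire the servant once if the parent server's idle timer
        # has already destroyed the one the proxy was holding on to.
        try:
            return getattr(proxy.servant, method_name)(*args)
        except CORBA.OBJECT_NOT_EXIST:
            proxy.servant = proxy.newServant()  # hypothetical re-acquire
            return getattr(proxy.servant, method_name)(*args)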
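For the SMT server reminder, the usual guard is to hand the connection
back in a finally clause so a failing request cannot drain the pool; the
manager interface shown here is an assumption, not the actual SMT server
code:

    def with_connection(manager, work):
        # Always return the connection to the manager, even if the work
        # raises, so the server cannot run out of connections and lock up.
        connection = manager.getConnection()  # hypothetical interface
        try:
            return work(connection)
        finally:
            manager.releaseConnection(connection)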
Just before Christmas there was a meeting on handling statistics
information generated in XML format and loaded into a database. There will
be a joint effort between D0 and CDF on a time scale of one month, and
another meeting shortly.

The d0dbsrv4 node is not configured correctly: there are two ups
installations, and this has to be cleaned up.

The user_prd servers running on d0dbsrv1 are dcoracle 1 versions. The old
ConfigDbServer runs on that node. Its ups table file has to be modified so
that it points to a specific version of dcoracle; this will allow the rest
of the servers to run with the current version, DCOracle2. Margherita is
looking into a ups option that may be used for failover.

o Sniping: Anil

CDF has implemented "sniping" of inactive ORACLE sessions. Unlike CDF, D0
has a three-tier system for database access that makes the number of
client connections almost a non-issue. Nonetheless, it was agreed in
general to snipe inactive users. Initially we will only cut off SQLPlus
sessions that have been inactive for 30 minutes. Jeremy noted that the
heaviest CPU users are SAM and SAMRead from d0ora3; these might be cron
jobs, and Lee will look into it. The next heaviest is the Luminosity
group. Once sniping is turned on, when a user is sniped an e-mail will be
sent to the user and to the d0db-snipe@fnal.gov mailing list. This mailing
list has to be created with archiving enabled.

o Trigger DB: Elizabeth

About 100 trigger lists in the DB have been in use, corresponding to ~3000
runs. The JETSET trigger list created recently for trigger studies had
last-minute changes made by hand (hacked). This caused problems: 1) the
trigger list is not in the DB, and 2) the luminosity is not normalizable.

Jeremy is almost done programming the L2 preprocessor definitions. The
next project is the interface for L2 terms and changes to the XML
generator (about two weeks of work).

o A possible new Online DB app: Jeremy

Gordon Watts and the controls group are considering storing the run
conditions information in a DB, or in SAM as an archive.

Scribe: T. Yasuda