Here is a description of the problem: in TODBCStatement::StoreResult(), at line 199 of TODBCStatement.cxx, the following ODBC call is made:

    /* line 199 of TODBCStatement.cxx */
    SQLSetStmtAttr(fHstmt, SQL_ATTR_ROWS_FETCHED_PTR, &fNumRowsFetched, 0);

The intent of this call is to bind fNumRowsFetched to the internals of the ODBC driver library, so that TODBCStatement can keep track of how many rows are fetched back. A successful SQLFetch should always set fNumRowsFetched to 1, and a successful SQLFetchScroll should set it to 1 or greater. Unfortunately, with the Oracle-supplied ODBC driver for Linux, fNumRowsFetched remains 0 even after a successful fetch, which I think is a bug in the driver. I have tested the following change to TODBCStatement::NextResultRow() with the attached test program. The change strikes me as a bit hackish, but it makes the attached root.Tstmt.BFIELDMAPDATA.C program work the same for ODBC as it does for OCCI:

    /* line 264 of TODBCStatement.cxx */
    SQLRETURN retcode = SQLFetchScroll(fHstmt, SQL_FETCH_NEXT, 0);
    /* add the following two lines to get around the Oracle-supplied ODBC driver bug */
    if (retcode != SQL_NO_DATA && fNumRowsFetched == 0)
        fNumRowsFetched = 1;
I evaluate the new ROOT SQL interfaces.
Recently fixed and tested
Near term plans
To Be Decided
    SQL> desc int_test
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     I                                                  NUMBER(38)

    SQL> select * from int_test
      2  ;

             I
    ----------
    2147483643
    2147483644
    2147483645
    2147483646
    2147483647
    2147483648
    2147483649
    2147483650

    8 rows selected.

    (phoenix.fnal.gov:dbox)% loon
    loon [0] .x TestTSQL2.C("test.dat");
    Processing commands in test.dat
    Opening connection to url:oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/test10.2 user: minos_writer passwd:XXXXX
    Executing statement select * from int_test
    Reading back results
    Result set has 1 columns
    I: type:-5: precision:38
    I: 2147483643
    I: 2147483644
    I: 2147483645
    I: 2147483646
    I: 2147483647
    I: 2147483648
    I: 2147483649
    I: 2147483650
    loon [1] q
    SQL> desc bin_float_test
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     F1                                                 BINARY_FLOAT
     D1                                                 BINARY_DOUBLE
     OF1                                                FLOAT(63)
     OD1                                                FLOAT(126)

    (phoenix.fnal.gov:dbox)% loon
    loon [0] .x TestTSQL2.C("test.dat");
    Processing commands in test.dat
    Opening connection to url:oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/test10.2 user: minos_writer passwd: minos_writer
    Executing statement select * from bin_float_test
    Reading back results
    Result set has 4 columns
    F1: type:7: precision:0
    D1: type:8: precision:0
    OF1: type:6: precision:63
    OD1: type:8: precision:126
    F1: 1.234000
    D1: 1.234000
    OF1: 1.234000
    OD1: 1.234000
    loon [1] .quit
Nick spotted the problem with the missing packets in replication: it seems that multiple copies of DBMAuto can run at the same time, each updating the context file and leading to confusion as to where replication left off.
I switched the initiating script, launch.csh, to fix the problem.
Replication failed again last night: it tried to drop and recreate the CALPMTDRIFT table (with the SEQNO column/primary key change) and failed. This creates so many errors in the log (it tries to insert several thousand rows into a table which no longer exists) that it's hard to see what else is wrong. I will fix this table manually and see if that clears it up.
(later) Hmmm, no CALPMTDRIFT priming data in the database dump area! I can't fix this one manually. Time to send mail to Nick. I turned off this table in the replication script, to at least fix this error and be able to read the logs...
Found an interesting symptom of the missing packets problem in the Oracle Replicant
Grepping for the dates 9-11-05 and 9-12-05 in FNAL_import.log, I see something interesting at the date boundary: on 9-11 it stopped at import file FNAL_00019598.dbm.gz, and on 9-12 it started importing at FNAL_00019609.dbm.gz, skipping 10 files! Here is the grep extract:

    2005-09-11 06:22:38 Importing FNAL_00019595.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019596.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019597.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019598.dbm.gz:-
    2005-09-11 06:22:38 Importer job terminating
    Updated on 2005-09-11 17:10:50 by minos running looter on minos-om.fnal.gov connected to minos-db1.fnal.gov
    Updated on 2005-09-11 17:10:50 by minos running looter on minos-om.fnal.gov connected to minos-db1.fnal.gov

    (phoenix.fnal.gov:dbox)% grep 2005-09-12 FNAL_import.log
    2005-09-12 06:07:26 Too many updates for one job, stopping after update 19618
    2005-09-12 06:08:00 Importing FNAL_00019609.dbm.gz:-
    2005-09-12 06:09:10 Importing FNAL_00019610.dbm.gz:-
    2005-09-12 06:10:07 Importing FNAL_00019611.dbm.gz:-
    2005-09-12 06:10:07 Importing FNAL_00019612.dbm.gz:-

I went back and looked more closely at my log file, and it sometimes, but not always, skips packet numbers at date boundaries. So of course some data will be missing in the replicant.
I am wondering if I have 2 different jobs writing to the same context file, which is supposed to keep track of which packet it stopped at.
To test this theory I moved the 2 different jobs to 2 different directory trees with separate context files.
Status
Mysql:
    mysql> select SEQNO from SPILLTIMENDVLD where SEQNO > 700000990 and SEQNO < 700001000;
    +-----------+
    | SEQNO     |
    +-----------+
    | 700000991 |
    | 700000992 |
    | 700000993 |
    | 700000994 |
    | 700000995 |
    | 700000996 |
    | 700000997 |
    | 700000998 |
    | 700000999 |
    +-----------+
    9 rows in set (0.02 sec)
Oracle:
    SQL> select SEQNO from SPILLTIMENDVLD where SEQNO > 700000990 and SEQNO < 700001000;

         SEQNO
    ----------
     700000991
     700000994
     700000995
     700000998
     700000999
I am looking at the logs and trying to determine how this happened....
Failure 256 (1,0) on:

    dbmjob -b > /home/dbox/minos/slave_site/work_dba64/824job_file.tmp 2>&1 << EOF
    import --Batch --File /home/dbox/minos/slave_ ... AG_FAR,!DCS_RPS,!PULSERDRIFT,!DBUVACHIPSPARS,!PULSERTIMEDRIFT,!PULSERGAIN,!DBUSUBRUNSUMMARY,!PULSERGAINPIN,!PULSERGAINLOG,!PULSERDRIFTPIN
    quit
    EOF

Common codes:

    126   = Cannot execute (permissions problem?)
    127   = Command not found
    128+N = Terminated by signal N
    139   = SegV ?

    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027298 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027299 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027300 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027301 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027302 failed
Digging deeper, I find the following error messages in failed.log:
    ...
    TSQLException ctor reason = Error executing "DROP TABLE CALPMTDRIFTVLD;":
    [Easysoft][Oracle]ORA-02449: unique/primary keys in table referenced by foreign keys state= HY000
    TSQL::Throw TSQLException message = Error executing "DROP TABLE CALPMTDRIFTVLD;":
    [Easysoft][Oracle]ORA-02449: unique/primary keys in table referenced by foreign keys
Followed later by
    TSQLException ctor reason = Error executing "INSERT INTO CALPMTDRIFT VALUES (700027298,1,2701264895,2181038084,89.137703,0.281358,0.408714,126,128,89.000198,37.206799,126.562);":
    [Easysoft][Oracle]ORA-00913: too many values state= HY000
    TSQL::Throw TSQLException message = Error executing "INSERT INTO CALPMTDRIFT VALUES (700027298,1,2701264895,2181038084,89.137703,0.281358,0.408714,126,128,89.000198,37.206799,126.562);":
    [Easysoft][Oracle]ORA-00913: too many values
Added an explicit /usr/krb5/bin/aklog to the replication scripts that failed, to see if this fixes the permission problems. The failing jobs already obtain a Kerberos ticket (allegedly), as they are run by /usr/krb5/bin/kcron.
Re-ran the failing steps of the above jobs by hand; they ran just fine.
Cron jobs failing again. Seems to be an /afs space permissions problem.
Restarted replication cron jobs
In other words, this program fragment using odbc++ works:
    } catch(SQLException& e) {
      cerr << endl << e.getMessage() << endl;
    }
while this RDBC fragment does not work:
    } catch(TSQLException& e) {
      printf("caught SQLException %s\n", e.GetMessage().Data() );
      // ...
    }
even though the odbc wrappers that throw the TSQLExceptions are riddled with code like so:
    } catch(odbc::SQLException& e) {
      Throw( new TSQLException( ODBCXX_STRING_CSTR(e.getMessage()),
                                ODBCXX_STRING_CSTR(e.getSQLState()),
                                e.getErrorCode()) );
    }
More poking about with the debugger, trying to see why the exception isn't caught.
    Single stepping until exit from function OCINumberToInt, which has no line number information.
    ocivnu_sqlslong (row=0, ir=0x9bc056c, target=0x0, buflen=4, indi=0x9bc0718) at oracle_functions.c:1770
    1770        if(ret)
    Current language:  auto; currently c
    (gdb) n
    1767        ret=OCINumberToInt(ir->desc->dbc->oci_err,
    (gdb) list
    1762            NULL,"ocivnu_sqlslong",
    1763            "BufferLength",buflen);
    1770        if(ret)
    1771        {
    1772            ood_driver_error(ir->desc->stmt,ret,__FILE__,__LINE__);
    1773            return SQL_ERROR;
    1774        }
    (gdb) n
    1772            ood_driver_error(ir->desc->stmt,ret,__FILE__,__LINE__);
    (gdb) n
    1773            return SQL_ERROR;
    (gdb) n
    odbc::ResultSet::_doFetch (this=0x9bc04c8, fetchType=1, rowNum=0) at /afs/fnal.gov/files/code/e875/general/minossoft/packages/RDBC/HEAD/odbc++/resultset.cpp:453
    453         this->_checkStmtError(hstmt_,r,"Error fetching data from datasource");
    Current language:  auto; currently c++
Gack. The program exits normally from here. Line 1773 above leads to an odbc++ exception which gets ignored by RDBC. I put debug statements in the TSQLException constructor and they didn't get called.
Replication still seems to find interesting ways to core dump. ODBCSYSINI variable alluded to yesterday points to a file odbcinst.ini, which among other things tells odbc whether or not to log all the odbc function calls. The log file (odbc.log) grew to 2 G in size and caused dbmjob to core dump every time it ran. Cleared this up.
It turns out, after much poking about, that the oracle-to-mysql replication was choking because of an incorrect setting of an update number in a context file. Hopefully I am on top of this now (the replication, not the checksumming).
Replication still seems to find interesting ways to core dump. Spent most of the day reconstructing the environment by hand to see if I could reproduce why dbmjob segvs (it turns out to need the env variable ODBCSYSINI to exist). Fixed this in the crontab scripts, verified it runs by hand, and will await output from tonight's cron jobs.
Working on integer overflow problem first described here.
Finally figured out what was wrong with replication: dbmjob was doing a core-dump-on-exit again. Fixed this with a one-line change to the oracle odbc driver, then ran an export job which took 8 hours to catch up. I have moved all my replication jobs off of flxi03 onto phoenix now; it is usually faster, as the jobs don't have to compete against resource-intensive jobs.
Started moving my replication jobs off of flxi03 on to phoenix, and upgrading oracle client software uniformly to 10.2.0.1 from the ancient 8.1.7 used on flxi03 and elsewhere.
Back from vacation. Replication seems to be broken and I am trying to understand why. For instance, exports from oracle (minos_dev) mysteriously stopped in late July and I only just noticed. The email from my replication cron jobs reports that they exit abnormally, and I can't reproduce that error when running by hand.
On a brighter note: the Oracle 10.2 binary_float and binary_double types seem to work with loon and other minos software when oracle_odbc_driver is compiled with oracle_client 10.2.x.
To get loon to work with the 'dbox' 10.2 server on phoenix, source the following on phoenix, where I have built and installed oracle_clients and oracle_odbc_drivers to match:
    setup oracle_client v10_2_0_1
    setenv LD_LIBRARY_PATH /home/dbox/UPS/prd/oracle_odbc/Linux+2/v0_5_5/lib:${LD_LIBRARY_PATH}
    setenv LD_LIBRARY_PATH /home/dbox/UPS/prd/unix_odbc/Linux+2/v2_2_11/lib:${LD_LIBRARY_PATH}
    setenv ENV_TSQL_URL oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/dbox
    setenv ENV_TSQL_USER minos_writer
    setenv ENV_TSQL_PSWD XXXXXX
    setenv TNS_ADMIN /home/dbox

From an earlier email from Nick, how to spot the checksum discrepancy:
    setenv DBM_DEBUG 2
    setenv DBM_SELECT_TABLE ???
    setenv DBM_SELECT_SEQNO "SEQNO = nnnnnnnnn"

where ??? is the table and nnnnnnnnn is a SEQNO in conflict; then run the checksum option and you will get a gory-detail dump of how it arrived at the checksum for that entry.
To start the oracle 'dbox' 10.2 database server on phoenix:

To stop:
Verified that the minosdev database on dba64 has 4G of data and 4G of indexes, so I have enough disk space here on phoenix to do a full-blown refresh/test using BINARY_FLOAT data types.
Used the Enterprise Manager URL below to generate and execute the following SQL on the new 'dbox' Oracle 10.2 server:
    CREATE BIGFILE TABLESPACE "MINOSDEV_DATA"
      DATAFILE '/usr/oracle/oradata/dbox/minosdev_data01'
      SIZE 1G AUTOEXTEND ON NEXT 100M MAXSIZE UNLIMITED
      LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
    CREATE BIGFILE TABLESPACE "MINOSDEV_IDX"
      DATAFILE '/usr/oracle/oradata/dbox/minosdev_idx01'
      SIZE 1G AUTOEXTEND ON NEXT 100M MAXSIZE UNLIMITED
      LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
    CREATE USER "MINOS_WRITER" PROFILE "DEFAULT"
      IDENTIFIED BY "*******"
      DEFAULT TABLESPACE "MINOSDEV_DATA"
      TEMPORARY TABLESPACE "TEMP"
      ACCOUNT UNLOCK

    GRANT "CONNECT" TO "MINOS_WRITER"
GRANT CONNECT, RESOURCE TO MINOS_WRITER
(Did something similar for minos_reader)
    (phoenix.fnal.gov:oracle)% sqlplus minos_writer/XXXX@dbox

    SQL*Plus: Release 10.2.0.1.0 - Production on Fri Aug 5 16:33:46 2005
    Copyright (c) 1982, 2005, Oracle.  All rights reserved.

    Connected to:
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
    With the Partitioning, OLAP and Data Mining options

    SQL> create table ieee_table (bf binary_float, bd binary_double);
    Table created.

    SQL> desc ieee_table;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     BF                                                 BINARY_FLOAT
     BD                                                 BINARY_DOUBLE

    SQL> insert into ieee_table values (1.0 , 1.0);
    1 row created.

    SQL> insert into ieee_table values (1234.56789, 1234.56789);
    1 row created.

    SQL> select * from ieee_table;

            BF         BD
    ---------- ----------
      1.0E+000   1.0E+000
    1.235E+003 1.235E+003
Status report for Aug 4 Software Phone Conference
minos-db1(mysql) --> minosdev(oracle) --> offline_rep(mysql)
Checksums for the 3 databases do not match; I am investigating.
The following tables have checksum problems that I am investigating:
BEAMMONSWICPEDS BFIELDCOILCURRENT CALPMTDRIFT DBILOGENTRY SPILLTIMEND UGLIDBISTRIP
The following tables 'showed up' in the offline minos-db1 and did not get created on the minosdev-oracle server:
BEAMMONSWICREL BFIELDMAPDATA BFIELDMAPHEADER
    SQL> create table t1 (c1 float(12), c2 number(12));
    Table created.

    SQL> desc t1;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     C1                                                 NUMBER(12)
     C2                                                 NUMBER(12)

    SQL> insert into t1 values (123456.0 , 123456.0);
    1 row created.

    SQL> select * from t1;

            C1         C2
    ---------- ----------
        123500     123456
Moral: Don't specify oracle FLOAT(N) in your table unless you are sure of what you are doing.
Specifying Float or Double Precision in your table DDL will give you 126 bits of precision, which is greater than the IEEE floating point standard that the rest of the world seems to use. Oracle 10.2 has a new floating point type which is supposed to be the IEEE type that everyone else uses; I am experimenting with a 10.2 server to see if that is what we should be using.

Found the problem for Nick's new failure mode described here: it's an integer overflow. If the MY_BIGINT column in table ALL_DATA_TYPES contains a value bigger than (2^31 - 1) == 2147483647, the oracle odbc driver returns an overflow error, which for some reason does not get propagated back up through odbc++ and then RDBC.
A workaround (for now) is not to use BIGINTs in Oracle. The fix is to get the oracle odbc driver to communicate correctly with odbc++ when the Oracle integer has more than 9 places of precision.
An oracle integer by default has 38 decimal places of precision, while a mysql integer has 9 places and a mysql bigint has 19. A perfectly reasonable 1-1 correspondence between Mysql BIGINT and Oracle NUMBER(19,0) should exist.
Unzipped and installed oracle 10.2.0.1 on phoenix.fnal.gov to investigate its usefulness for resolving floating point round-off issues. Found a 'cookbook' for installing on linux at this site. Problems that I encountered, with fixes, below:
Problem: semaphores and network udp datagram buffers incorrectly sized:
    Checking for semmsl=250; found semmsl=250.  Passed
    Checking for semmns=32000; found semmns=32000.  Passed
    Checking for semopm=100; found semopm=32.  Failed <<<<
    Checking for semmni=128; found semmni=128.  Passed
    Checking for shmmax=536870912; found shmmax=33554432.  Failed <<<<
    Checking for shmmni=4096; found shmmni=4096.  Passed
    Checking for shmall=2097152; found shmall=2097152.  Passed
    Checking for file-max=65536; found file-max=203052.  Passed
    Checking for VERSION=2.4.21; found VERSION=2.4.21-27.0.2.EL.  Passed
    Checking for ip_local_port_range=1024 - 65000; found ip_local_port_range=32768 - 61000.  Failed <<<<
    Checking for rmem_default=262144; found rmem_default=65535.  Failed <<<<
    Checking for rmem_max=262144; found rmem_max=131071.  Failed <<<<
    Checking for wmem_default=262144; found wmem_default=65535.  Failed <<<<
    Checking for wmem_max=262144; found wmem_max=131071.  Failed <<<<
    Check complete. The overall result of this check is: Failed <<<<
    Problem: The kernel parameters do not meet the minimum requirements (see above).
    Recommendation: Perform operating system specific instructions to update the kernel parameters.
Solution: add the following lines to the /etc/sysctl.conf file, which is read during the boot process:
    kernel.shmmax=2147483648
    kernel.sem=250 32000 100 128
    fs.file-max=65536
    net.ipv4.ip_local_port_range=1024 65000
    net.core.rmem_default=262144
    net.core.wmem_default=262144
    net.core.rmem_max=262144
    net.core.wmem_max=262144
Problem: swap space incorrectly sized:
    =======================================================================
    Checking available swap space requirements ...
    Expected result: 2916MB
    Actual Result: 992MB
    Check complete. The overall result of this check is: Failed <<<<
    Problem: The system does not have the required swap space.
    Recommendation: Make more swap space available to perform the install.
    =======================================================================
Solution: create a new swap file and turn it on.
To add a swap file of 3000 MB (in 1024-byte blocks: 3000 * 1024 = 3072000 blocks):

    # dd if=/dev/zero of=/usr/swap bs=1024 count=3072000
    # mkswap /usr/swap
    # swapon /usr/swap

Then add the following line to /etc/fstab (the path must match the swap file created above):

    /usr/swap    swap    swap    defaults    0 0

The next time the system boots, it will enable the new swap file.
Problem(?): Not sure if it's a problem, but the oracle install starts up a web server!!!
Output from the oracle installer:
The following J2EE Applications have been deployed and are accessible at the URLs listed below.

    iSQL*Plus URL:
    http://phoenix.fnal.gov:5560/isqlplus

    iSQL*Plus DBA URL:
    http://phoenix.fnal.gov:5560/isqlplus/dba

    Enterprise Manager 10g Database Control URL:
    http://phoenix.fnal.gov:1158/em
No idea if these are security risks; I need to learn how to turn them on and off just in case.
Interesting tidbits from ps auxww | grep oracle, which I think are the 3 processes mentioned above:

    oracle 7920 4.2 5.9 604752 117628 pts/1 S 21:17 1:02 /usr/oracle/oracle_home/jdk/bin/java -server -Xmx256M -XX:MaxPermSize=96m -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -DORACLE_HOME=/usr/oracle/oracle_home -Doracle.home=/usr/oracle/oracle_home/oc4j -Doracle.oc4j.localhome=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox/sysman -DEMSTATE=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox -Doracle.j2ee.dont.use.memory.archive=true -Djava.protocol.handler.pkgs=HTTPClient -Doracle.security.jazn.config=/usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/jazn.xml -Djava.security.policy=/usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/java2.policy -Djava.security.properties=/usr/oracle/oracle_home/oc4j/j2ee/home/config/jazn.security.props -DEMDROOT=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox -Dsysman.md5password=true -Drepapi.oracle.home=/usr/oracle/oracle_home -Ddisable.checkForUpdate=true -Djava.awt.headless=true -jar /usr/oracle/oracle_home/oc4j/j2ee/home/oc4j.jar -config /usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/server.xml

    oracle 9628 0.6 2.3 261736 46076 pts/1 S 21:19 0:09 /usr/oracle/oracle_home/jdk/bin/java -Djava.awt.headless=true -Doracle.oc4j.localhome=/usr/oracle/oracle_home/sqlplus -Djava.security.properties=/usr/oracle/oracle_home/oc4j/j2ee/home/config/jazn.security.props -jar /usr/oracle/oracle_home/oc4j/j2ee/home/oc4j.jar -config /usr/oracle/oracle_home/oc4j/j2ee/isqlplus/config/server.xml

    oracle 7893 0.0 0.3 7796 6252 pts/1 S 21:17 0:00 /usr/oracle/oracle_home/perl/bin/perl /usr/oracle/oracle_home/bin/emwd.pl dbconsole /usr/oracle/oracle_home/phoenix.fnal.gov_dbox/sysman/log/emdb.nohup
Issues that need my urgent attention:
    >
    >
    >> Finally a new weird failure mode.  The symptom is that I import a new
    >> table, with a single SEQNO which just has one data row into ORACLE.
    >> It goes in just fine and with sqlplus I can see both the VLD entry and
    >> the main entry.  Now I try to export the table again and I get the VLD
    >> entry but the query to pull the main entry out returns nothing, not
    >> even an error!
    >>
    >
    >
    > Right now I have a table ALL_DATA_TYPES with a single entry 990000000
    > in the main and VLD tables in the Oracle DB.  With sqlplus I can see
    > them:-
    >
    > SQL> select * from ALL_DATA_TYPESVLD;
    >
    >      SEQNO TIMESTART TIMEEND   DETECTORMASK    SIMMASK       TASK
    > AGGREGATENO
    > ---------- --------- --------- ------------ ---------- ----------
    > -----------
    > CREATIOND INSERTDAT
    > --------- ---------
    >  990000000 01-JAN-05 01-JAN-06          127        127          0
    > -1
    > 01-JAN-05 01-AUG-05
    >
    >
    > SQL> select * from ALL_DATA_TYPES where SEQNO= 990000000;
    >
    >      SEQNO MY_TINYINT MY_SMALLINT     MY_INT  MY_BIGINT   MY_FLOAT
    > MY_DOUBLE
    > ---------- ---------- ----------- ---------- ---------- ----------
    > ----------
    > MY_TINYTEXT
    >
    > ------------------------------------------------------------------------
    > --------
    > MY_TEXT
    >
    > ------------------------------------------------------------------------
    > --------
    > M MY MY_ MY_C MY_CH MY_CHA MY_CHARS_20          MY_DATETI
    > - -- --- ---- ----- ------ -------------------- ---------
    >  990000000        123       12345  123456789 1.2346E+18      1.235
    > 2.345679
    > abc
    > abcdefghijklmnopqrstuvwxyz
    > a ab abc abcd abcde abcdef aaaaaaaaaabbbbbbbbbb 02-FEB-05
    >
    > But I can only see the VLD entry with TSQL.  To demonstrate,
    > create a file dennis.C that contains:-
    >
    > oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev minos_dev
    > minos_dev_new
    > select * from ALL_DATA_TYPESVLD ;
    > select * from ALL_DATA_TYPES where SEQNO= 990000000 ;
    >
    > (you really do need a space before the semi-colon!)
    >
    > This will be data to my test script:-
    >
    > $SRT_PUBLIC_CONTEXT/DatabaseInterface/test/TestTSQL2.C
    >
    > The first line defines the connection and the last two are commands
    > given directly to TSQL by-passing the DBI.  To run, fire up loon and type:-
    >
    > .x TestTSQL2.C("dennis.dat")
    >
    > If I do that I see:-
    >
    > loon [0] .x TestTSQL2.C("dennis.dat")
    > Processing commands in dennis.dat
    > Opening connection to url:oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev user: minos_dev passwd: minos_dev_new
    > Executing statement select * from ALL_DATA_TYPESVLD
    > Reading back results
    > Result set has 9 columns
    > SEQNO: 990000000 TIMESTART: 2005-01-01 00:00:00 TIMEEND: 2006-01-01 00:00:00 DETECTORMASK: 127 SIMMASK: 127 TASK: 0 AGGREGATENO: -1 CREATIONDATE: 2005-01-01 00:00:00 INSERTDATE: 2005-08-01 09:47:31
    > Executing statement select * from ALL_DATA_TYPES where SEQNO= 990000000
    > Reading back results
    > Result set has 17 columns
    > loon [1]
    >
    > so again it see the VLD row but for the main table it only sees the
    > meta-data.
    >
    > Help!
    >
    > Cheers,
    >
    > Nick.
    >
Oracle-to-mysql replication failed last night for table CALPMTDRIFT. The VLD table was created with DATE type instead of TIMESTAMP on phoenix.fnal.gov/offline_rep database. Nick is working on a fix for this as part of his migration plan listed below.
My immediate tasks:
To enable this plan, I am on the hook for
Hi Nick,

I understand somewhat the DBILOGENTRYVLD funniness now. Dbmjob appears to drop and recreate tables under certain circumstances, and doing an import into mysql using an oracle export on this table means that columns like TIMESTART, which are oracle DATEs and mysql DATETIMEs, get created in mysql as DATEs and then return GetColumnType 91 instead of 93.

Here is a metadata export from a mysql database:

    >>>>>DBILOGENTRY Metadata
    create table DBILOGENTRYVLD(SEQNO integer not null primary key auto_increment,
    TIMESTART datetime not null, TIMEEND datetime not null,
    DETECTORMASK tinyint(4), SIMMASK tinyint(4), TASK integer, AGGREGATENO integer,
    CREATIONDATE datetime not null, INSERTDATE datetime not null);

And here it is from an oracle database:

    >>>>>DBILOGENTRY Metadata
    create table DBILOGENTRYVLD(SEQNO INTEGER not null primary key,
    TIMESTART DATE not null, TIMEEND DATE not null, DETECTORMASK INTEGER,
    SIMMASK INTEGER, TASK INTEGER, AGGREGATENO INTEGER,
    CREATIONDATE DATE not null, INSERTDATE DATE not null);

The question is, what to do about this? We could slap on a filter that turns dates back into datetimes for mysql table creation, but the precision of some of the columns is wrong, which could cause checksum or other problems. I don't understand why the tables have to be dropped in the first place if they already exist. Could they simply be cleared out and repopulated?

Cheers,
Dennis
Hi Dennis

>> I understand somewhat the DBILOGENTRYVLD funniness now. Dbmjob appears
>> to drop and recreate tables under certain circumstances.

Actually I think that was me. Recall that I deliberately deleted DBILOGENTRY + VLD from the ORACLE server to complete my tests using TestDbi?

>> and doing an import into mysql using an oracle export on this table
>> results in columns like TIMESTART which are oracle dates and mysql
>> datetimes, get created in mysql as dates and then return
>> GetColumnType 91 instead of 93.
>>
>> [metadata exports from the mysql and oracle databases quoted above]
>>
>> The question is, what to do about this? We could slap on a filter that
>> turns dates back into datetimes for mysql table creation, but the
>> precision of some of the columns is wrong which could cause checksum
>> or other problems.

The answer comes in two parts: an easy bit and a hard bit.

A) (The easy bit). I convert DATETIME to DATE in DbiStatement::TranslateSQL:-

http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/doxygen/loon/html/classDbiStatement.html#d0

Isn't that the right thing to do, and if not, what should I translate that to?

B) (The hard bit). Is this another garden rake I see before me?
I belatedly realise that this is something that's going to reoccur as ORACLE becomes the master. My model has always been that the SQL is MySQL flavoured and that we just translate to ORACLE just before we write. However, I get the table creation SQL from the exporting database and apply it at the importing database if required. So this means I need to have at least limited conversion in the reverse direction.

>> I don't understand why the tables have to be dropped in the first place
>> if they already exist. Could they simply be cleared out and repopulated?

As I say, I think that in this case that's down to me, but your detective work has shown me that we have to cover the Use Case of ORACLE -> MySQL translation.

Cheers,

Nick.
Replication to minosdev(Oracle) from minos-db1 (Mysql) now back in operation after hiatus where ROW_COUNTER schema change was implemented. Problems encountered in getting this up and running were the segv-on-exit bug with the oracle odbc driver and unixODBC libraries earlier than 2.2.11, it took me a while to get the LD_LIBRARY_PATH set correctly to not pick up the MINOS_EXTERNAL unixODBC 2.2.9 libraries first.
Here is the command that lists which tables are currently being replicated; the !tables either are not recommended for replication because of size/non-usefulness issues or don't have ROW_COUNTER columns yet:
    --Table *,!DCS_HV,!DCS_MAG_CAL,!DCS_MAG_FAR,!DCS_RPS,
            !PULSERDRIFT,!DBUVACHIPSPARS,!PULSERTIMEDRIFT,
            !PULSERGAIN,!DBUSUBRUNSUMMARY,!PULSERGAINPIN,
            !PULSERGAINLOG,!PULSERDRIFTPIN
Recently acquired an adequate machine which I plan to use for this purpose (AMD Sempron, 2G RAM, 200G SATA disk; it will be named phoenix.fnal.gov). The mysql offline database is currently about 35 G according to http://www-numi.fnal.gov/minwork/computing/database_size_report.html#by_size .
Investigating some apparent number truncation from 'select count(*) from bla bla bla' type queries reported by Nick. It turns out this was an older bug, fixed in CVS but never exported to the /afs/fnal.gov/.../MINOS_EXTERNAL area.
Updating these libraries and updating unixODBC to 2.2.11 allows all of the minos tutorial examples to run to completion save 2: simple_midad.C and test_user_display_module.C both try to do something with LoadMinosPDG and then core dump.
Ran all the loon tutorial examples against the oracle dev database.
An example of how to do this on flxi03.fnal.gov:
    #!/bin/sh -x
    #setup minos environment
    . /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
    #put oracle client libs in LD_LIBRARY_PATH
    setup oracle_client
    #minos-oracle specific environment vars
    export ENV_TSQL_URL='oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev'
    export ENV_TSQL_USER=minos_reader
    export ENV_TSQL_PSWD=reader
    #run them all!
    for FILE in `ls $MINOS_TUTORIAL_MACROS/*.C` ; do
        OUT=`basename $FILE `;
        loon -qb $FILE $MINOS_TUTORIAL_DATA/F00018143_0000.mdaq.root 2>>$OUT.oracle 1>>$OUT.oracle
    done;
They don't run perfectly: loon_reco_job.C, simple_midad.C, test_user_display_module.C, time_dbi.C and write_root_file.C seg fault, which is 5 of the 12 scripts. I clearly have some investigating to do. When I run the same jobs against the mysql offline database they all terminate normally.
The above script produces the following files, which I am now comparing against their mysql analogs:
Change the output from $OUT.oracle to $OUT.mysql and change the following env vars:
    export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov/offline'
    export ENV_TSQL_USER=reader
    export ENV_TSQL_PSWD=minos_db
And you get the following output files:
I get lots of apparent errors in both the mysql and oracle output files, showing that I don't know much about running loon. I will concentrate on getting the same errors in the oracle output that I do in the mysql output.
(Later) It appears 3 scripts, write_root_file.C, time_dbi.C, and loon_reco_job.C, produce different output to stdout and stderr between the two database servers. This is something I can investigate, along with the 5 scripts that core dump even though they seem to produce the same output.
Came up for air from CDF/BLAST related activities. Realized that my 'latest' priming script, cp_tables.pl, is not in CVS though it very much needs to be. This will serve as a nag to put it there.
(Later) Added some improvements: it now copies over tables with single quotes in them. Checked in to CVS as DatabaseTables/scripts/cp_table_minosdb.pl
My current invocation of the priming script:
./cp_table_minosdb.pl -S 'reader/XXXXXXXX@DBI:mysql:offline;host=minos-db1' -T 'minos_dev/XXXXXXXX@minosdev' -l sortedtabs2
where sortedtabs2 is a list of tables reverse sorted from smallest to biggest in the source database available from http://www-numi.fnal.gov/minwork/computing/database_size_report.html#by_size
This script copies over the ROW_COUNTER data if it's there, generates ROW_COUNTER data if none exists in the source database, and refuses to proceed if the target database doesn't have a ROW_COUNTER column. So I have, in theory, an oracle database that I can use for development.
Got tired of waiting on PK migration of the offline data tables
Waiting to see if the PK proposed below (seqno, row_counter) is adopted by Nick. He has some reservations; it turns out that Mysql does not necessarily use its PK to retrieve data in sorted order, even when the PK is the ORDER BY clause.
See http://dev.mysql.com/doc/mysql/en/ORDER_BY_optimization.html for official documentation that this is the way it's supposed to work, and a discussion by one of the developers about why it doesn't actually work this way at http://forums.mysql.com/list.php?24 . Oy.
A count of duplicate rows in the database dumps from minos-db1.fnal.gov
These duplicate rows would not exist with proper enforcement of primary keys on the data tables.
NB: getting the actual rows that are duplicates so they can be cleaned up has taken far more time than it should; the DCS_HV table is so large that I have made several false starts, running out of disk space. The good news out of this is that I have a better tool set for policing these tables for this sort of error, and should be able to run it from cron jobs until we get primary keys on the tables.
table | number of duplicate rows | link |
---|---|---|
BFIELDCOILCURRENT | 1 | file |
CALADCTOPE | 2880 | file |
CALDETCERRANGE | 14 | file |
CALDETCERTIMEWIN | 9 | file |
CALMAPPERFITS | 484 | file |
DBUVACHIPPEDS | 704712 | file |
DBUVACHIPSPARS | 375366 | file |
DCS_HV | 58967 | file |
PLEXPIXELSPOTTOSTRIPEND | 2 | file |
PLEXPIXELTORAWCHANNEL | 2 | file |
PULSERRAWDRIFTPIN | 268 | file |
PULSERREFDRIFTPIN | 2 | file |
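The counts above hinge on one rule: without a primary key, a row is a duplicate only if every column matches. A minimal sketch of that check (a hypothetical helper, not the actual tooling I ran against the dumps), where each row is passed as its full delimited text:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Count duplicate rows: every copy after the first occurrence of an
// identical full-row string counts as one duplicate.
size_t countDuplicateRows(const std::vector<std::string>& rows) {
    std::unordered_map<std::string, size_t> seen;
    size_t dups = 0;
    for (const auto& row : rows) {
        if (++seen[row] > 1) ++dups;  // identical in every column
    }
    return dups;
}
```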
Back from xmas vacation. I tried sorting the mysqldumps prior to assigning ROW_COUNTERs over the break. This worked for every table except DCS_HV, which is so large that the job runs out of disk space even in my huge /afs/fnal.gov/files/data/minos/validation/ work area.
I can work around this by grepping out individual SEQNOs from the mysqldump file.
The primary key problem for data tables has turned out to be bigger than I originally thought. Nick & I agreed via email that a good 'artificial' primary key is the one described below in the Dec17-20 entry. The hope was that this would result in an oracle database with the same row ordering as the mysql offline database. I no longer think this is the case: the oracle database is primed from ascii files generated by mysqldump, and mysqldump does not feel compelled to dump the rows from a table ordered by seqno. Even if they were ordered by seqno, that would not define the sort order of rows sharing the same seqno. I have *nearly* got the oracle database reprimed after realizing that I have to sort the mysqldump output before inserting into oracle, and I am willing to bet money that this pre-sorting step will result in a different row ordering in Oracle than in Mysql.
Several false starts on creating an artificial primary key for data tables. The current plan is to create an artificial primary key, row_counter, which increments for each row inserted.
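A sketch of how such a row_counter could be assigned while walking the sorted dump rows. I am assuming here that the counter restarts at 1 for each SEQNO, matching the (seqno, row_counter) composite key discussed elsewhere in these notes; the function name is my own invention:

```cpp
#include <map>
#include <utility>
#include <vector>

// For each row's SEQNO, emit (SEQNO, ROW_COUNTER) where ROW_COUNTER
// starts at 1 per SEQNO and increments with each subsequent row.
std::vector<std::pair<int, int>> assignRowCounters(const std::vector<int>& seqnos) {
    std::map<int, int> next;  // SEQNO -> last counter handed out
    std::vector<std::pair<int, int>> out;
    for (int s : seqnos) {
        out.emplace_back(s, ++next[s]);
    }
    return out;
}
```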
Priming job failed again last night on the DCS_HV table. I suspect I ran out of space in the database, as I set the pk index to be on all 13 columns to eliminate duplicates. DCS_HV had 256 duplicate rows (not too bad out of 13 million). DBUVACHIPSPARS had 4590 exact duplicate rows in addition to the 'nearly duplicate' rows noted yesterday.
Here is an example duplicate row from the mysql offline database:
mysql> select * from DCS_HV where SEQNO=200020421 and HV_LOGICAL=8178 and HV_GEOMETRICAL=4715 and HV_MAINFRAME=8 and HV_CARD=11 and HV_CHANNEL=2 and HV_PLANE=471;
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
| SEQNO     | HV_LOGICAL | HV_GEOMETRICAL | HV_MAINFRAME | HV_CARD | HV_CHANNEL | HV_PLANE | HV_SUPERMODULE | HV_SIDE | HV_PMTNUMBER | HV_SETVOLTAGE | HV_READVOLTAGE | HV_ERROR |
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
| 200020421 |       8178 |           4715 |            8 |      11 |          2 |      471 |              2 |       1 |            2 |          -898 |           -898 |        0 |
| 200020421 |       8178 |           4715 |            8 |      11 |          2 |      471 |              2 |       1 |            2 |          -898 |           -898 |        0 |
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
2 rows in set (0.02 sec)
Spent some time reviewing with Nelly how Locally Managed Tablespaces work. Had her turn off archive logging so I could load and re-load the offline tables into oracle.
Spent much more time than I intended working on getting the new minosdev development instance re-primed. One of the requirements for moving the offline database into production, and it's a reasonable one, is that data tables have a primary key. The problem I am encountering is that some rows of large tables appear to be repeats, and some appear to be garbage data that should have been caught by primary key constraints. Many rows in DBUVACHIPSPARS and DBUVACHIPPEDS are like the following:
mysql> select * from DBUVACHIPSPARS where SEQNO=200015442 and CRATE=7 and MODEFLAGS=0 and VARC=0 and VMM=0 and VAADCSEL=0 and VACHIp=2;
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| SEQNO     | CRATE | MODEFLAGS | VARC | VMM  | VAADCSEL | VACHIP | CHAN00 | CHAN01 | CHAN02 | CHAN03 | CHAN04 | CHAN05 | CHAN06 | CHAN07 | CHAN08 | CHAN09 | CHAN10 | CHAN11 | CHAN12 | CHAN13 | CHAN14 | CHAN15 | CHAN16 | CHAN17 | CHAN18 | CHAN19 | CHAN20 | CHAN21 |
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| 200015442 |     7 |         0 |    0 |    0 |        0 |      2 |  16383 |  16383 |     16 |     17 |     18 |     18 |     18 |     17 |     15 |     18 |     19 |     18 |     18 |     18 |     18 |     19 |     21 |     18 |     50 |  16383 |  16383 |  16383 |
| 200015442 |     7 |         0 |    0 |    0 |        0 |      2 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
2 rows in set (0.02 sec)

mysql>
I.e. I suspect that 16383 is equivalent to 'NULL' and the second row is garbage that should not have been read in in the first place. As it is, I am afraid I may have to make a primary key of all 28 columns. Looking at the data, the first 7 columns seem reasonable. Time to ask Nick.
I changed the priming script to make a primary key of the first 2 columns (seqno and whatever's next) unless the table has an entry in an associative array; I made guesses for 28 tables based on a look at the table structure. I suspect this list will have to be tweaked.
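The rule the priming script follows can be sketched like this, with a std::map standing in for the perl associative array; table and column names below are illustrative, not the actual guesses in the script:

```cpp
#include <map>
#include <string>
#include <vector>

// Pick a table's primary key columns: an explicit override if one was
// guessed for this table, otherwise the first two columns.
std::vector<std::string> pkColumns(
    const std::string& table,
    const std::vector<std::string>& columns,
    const std::map<std::string, std::vector<std::string>>& overrides) {
    auto it = overrides.find(table);
    if (it != overrides.end()) return it->second;
    return {columns.begin(), columns.begin() + 2};  // (seqno, whatever's next)
}
```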
The gory details are here.
ORA-01653: unable to extend table MINOS_DEV.BFIELDCOILCURRENT by 8 in tablespace MINOS_DEV_DATA
It turns out that the continuous dropping, creating and inserting via sqlldr had two bad effects
I know how to solve 2; I thought 1 was supposed to be automatically solved by Locally Managed Tablespaces, at least that was the sales pitch. I think the minosdevsam database is fairly messed up right now. I need Nelly's help to bounce the database to turn off archiving and see if that, plus dropping all the tables, allows oracle to self-coalesce its table space (it can't be done manually with LMTs).
More DBMAuto replication difficulties.
-rw-r--r-- 1 dbox products 620683568 Dec 13 13:08 17426_FNAL_00013415.dbm_tmp
-rw-r--r-- 1 dbox products 847322533 Dec 13 13:10 17426_FNAL_00013416.dbm_tmp
-rw-r--r-- 1 dbox products 613590765 Dec 13 13:11 17426_FNAL_00013417.dbm_tmp
-rw-r--r-- 1 dbox products 152707072 Dec 13 13:12 17426_FNAL_00013418.dbm_tmp
I have sent email to Nick about whether he knows of a workaround, I was consumed all day (12/13) with BLAST issues.
Error executing "ALTER TABLE DBILOGENTRY add CONSTRAINT DBILOGENTRY_FK FOREIGN KEY (SEQNO) REFERENCES DBILOGENTRYVLD(SEQNO)": [Easysoft][Oracle]ORA-02275: such a referential constraint already exists in the table
This appears to be something trivial that I haven't found as of yet; I can comment out the code in DbiStatement.cxx that creates the foreign keys as a hack and move on to check other functionality.
Checking the program tracefiles I see that TestDbi does indeed try to execute this command twice and returns an error the second time it runs.
The problem is the table is created twice in the preparation command files, once in prepare_db_0.sql and again in prepare_db_1.sql. It complains and quits the second time.
The following TSQL errors have occured:- Error executing "INSERT INTO DBIDEMODATA4 VALUES (900000002,'DbiDemoData4','NewData','[\'\' \'MyFloat\'=(double)456 \'MyInt\'=(int)123 \'MyString\'=(string)\'A string\']');": [Easysoft][Oracle]ORA-00911: invalid character
This is the dreaded \' vs '' error discussed at length before. I tried filtering these out in DbiStatement.cxx without success.
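The filter I was attempting amounts to rewriting MySQL-style backslash-escaped quotes (\') into the SQL-standard doubled form ('') before handing the statement to Oracle. A sketch of that rewrite (my own illustration, not the actual DbiStatement.cxx code, and deliberately naive: it does not try to handle backslashes inside binary data):

```cpp
#include <string>

// Replace each \' escape with the '' idiom that both Oracle and
// MySQL accept inside single-quoted string literals.
std::string fixQuoteEscapes(const std::string& sql) {
    std::string out;
    for (std::size_t i = 0; i < sql.size(); ++i) {
        if (sql[i] == '\\' && i + 1 < sql.size() && sql[i + 1] == '\'') {
            out += "''";
            ++i;  // consume the quote that followed the backslash
        } else {
            out += sql[i];
        }
    }
    return out;
}
```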
Another error:
The following TSQL errors have occured:- Error executing "update DBIDEMODATA2VLD set INSERTDATE = "2001-09-27 01:02:03" where SEQNO = 900000005; ": [Easysoft][Oracle]ORA-00904: invalid column name
This can be fixed by changing the double quotes on the date to single quotes
Errors: Unable to gunzip /afs/fnal.gov/files/data/minos/dbm/rsync/FNAL_00013360.dbm.gz
Failure 256 executing command:-
echo 2004-12-09 12:50:12 Unable to gunzip /afs/fnal.gov/files/data/minos/dbm/rsync/FNAL_00013360.dbm.gz >> /local_home/dbox/slave_site/work/FNAL_import.log
Please check /local_home/dbox/slave_site/work/FNAL_import_fatal.log
Errors: Failure 256 executing command:-
echo 2004-12-09 14:50:12 Too many updates for one job, stopping after update 13369 >> /local_home/dbox/slave_site/work/FNAL_import.log
Please check /local_home/dbox/slave_site/work/FNAL_import_fatal.log
Investigation showed that this failure was self-inflicted: I vi'ed a large file from offsite, my terminal froze, I killed it, and unintentionally left a runaway vi process and a *huge* vi swap file that filled the working directory to 100% capacity.
Chances are good that the replication from the production offline mysql server is messed up to the point that a refresh is now the best option.
Developments over the weekend
dbmjob: error while loading shared libraries: libMinosObjectMap.so: cannot open shared object file: No such file or directory
I obviously should not be running replication from the head of development if I want to consider it a production system; these sorts of errors will occur and cause the tables to get out of synch again.
65 tables are fine. 12 have conflicts, of two types:
DBUSUBRUNSUMMARY DBUVACHIPPEDS DBUVACHIPSPARS DCS_CHILLER_FAR DCS_ENV_FAR DCS_HV DCS_MAG_CAL DCS_MAG_FAR DCS_RPS PULSERDRIFT
It is possible that these tables got out of synch when I manually refreshed on Nov29 and then the replication job core dumped for 24 hours because of the ROOT upgrade.
I plan on changing the version of ROOT that "development" uses on FNALU from v4-00-08f to v4-01-02 tonight. This involves a fresh rebuild of "development" and _invalidates_ any personal test release libraries. It is important that users clean and rebuild their test releases after the base release has been rebuilt. Serious consequences can occur if there is version skew.
if(status!=SQL_NO_DATA&&stmt->bookmark) {
    /*
     * stmt->bookmark would be nonzero if this wasn't the first call
     * to SQLFetch. As OCIStmtExec==SQLexecute,SQLFetch we don't want
     * to call OCIStmtFetch for the first call of SQLFetch.
     * if status==SQL_NO_DATA we're trying to go past the end of the
     * result set. If we were to call OCIStmtFetch now we'd get
     * an error.
     */
    ret=OCIStmtFetch_log_stat(stmt->oci_stmt,stmt->dbc->oci_err,
                              stmt->row_array_size,OCI_FETCH_NEXT,
                              OCI_DEFAULT,ret);
I set up an odbc program that replicates the trace below, first calling SQLFetch with a query that won't retrieve any data and then calling it with a query that should. Stepping into SQLFetch and setting stmt->bookmark=0 in the debugger makes the program return the correct results. So I need to understand when/how to reset stmt->bookmark.
I elected to fix this by changing SQLSetStmtAttr.c so that when it is called to set the SQL_ATTR_ROWS_FETCHED_PTR attribute to 0, signalling the statement handle that it thinks it is starting over, it sets stmt->bookmark to 0 as well as stmt->rows_fetched_ptr. This makes my odbc test program, which I am incorporating into the test subdirectory, work, as well as the TestDbi program.
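In outline, the change to SQLSetStmtAttr.c behaves like the following sketch. The struct here is a simplified stand-in for the driver's real statement handle (which has many more fields), and setStmtAttr is my own mock of the one code path that matters; SQL_ATTR_ROWS_FETCHED_PTR is 26 in sqlext.h:

```cpp
#include <cstddef>

// Minimal stand-in for the oraodbc statement handle.
struct Stmt {
    int bookmark;            // nonzero once SQLFetch has been called
    long* rows_fetched_ptr;  // where the driver reports rows fetched
};

const int SQL_ATTR_ROWS_FETCHED_PTR = 26;  // value from sqlext.h

void setStmtAttr(Stmt* stmt, int attr, long* value) {
    if (attr == SQL_ATTR_ROWS_FETCHED_PTR) {
        stmt->rows_fetched_ptr = value;
        if (value == nullptr) {
            // The caller is signalling a fresh start: also reset the
            // bookmark so the next SQLFetch is treated as the first.
            stmt->bookmark = 0;
        }
    }
}
```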
flxi03$ pwd
/afs/fnal.gov/files/data/minos/validation/dbox_oracle_refresh/Nov23_dba64/DatabaseTables/scripts/log_dir
flxi03$ grep Error *.log | grep Loader
CALNDMAPPERFITS_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."CALNDMAPPERFITS"
DBUSUBRUNSUMMARY_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."DBUSUBRUNSUMMARY"
DBUSUBRUNSUMMARY_CTL.aaab.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."DBUSUBRUNSUMMARY"
PHOTONPMTQUANTUMEFFICIENCY_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PHOTONPMTQUANTUMEFFICIENCY"
PHOTONPMTQUANTUMEFFICIENCYVLD_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PHOTONPMTQUANTUMEFFICIENCYVLD"
PLEXRAWCHANNELREADOUTTYPE_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PLEXRAWCHANNELREADOUTTYPE"
PLEXRAWCHANNELREADOUTTYPEVLD_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PLEXRAWCHANNELREADOUTTYPEVLD"
OK I have fixed this and checked in the changes to DatabaseTables/scripts/MysqlToOracleFilter.pl and DatabaseTables/scripts/mungeMysqlDump.pl. Now I need to turn on replication.
Replication now turned on seems to be running error free.
Here is the code that seems to be malfunctioning in DbiCascader::ReserveNextSeqNo:
 1  string sqlPrefix("select * from ");
 2  sqlPrefix += seqnoTableName + " where TableName = \'";
 3  string sql = sqlPrefix + tableName + "\'";
 4  MSG("Dbi",Msg::kDebug) << seqnoTableName << " query: " << sql << endl;
 5  TSQLResultSet* res = stmtDb->ExecuteQuery(sql.c_str());
 6  stmtDb->PrintWarnings();
 7  UInt_t seqNo = 0;
 8  if ( res && res->Next() ) {
 9    seqNo = res->GetInt("LASTUSEDSEQNO");
10  }
11  else {
12
13    // Cannot find row use default for last used SeqNo.
14    string sql = sqlPrefix + "*\'";
15    MSG("Dbi",Msg::kDebug) << seqnoTableName << " query: " << sql << endl;
16    res = stmtDb->ExecuteQuery(sql.c_str());
17    if ( stmtDb->PrintWarnings() ) {
18      MSG("Dbi",Msg::kError) << "Unable to find default SeqNo"
19                             << " due to above error" << endl;
20      return 0;
21    }
22    res->Next();
23    seqNo = res->GetInt("LASTUSEDSEQNO");
If the GLOBALSEQNO table is set so that line 8 returns true (i.e. if select * from GLOBALSEQNO where tablename='DBIDEMODATA1' returns data), then the query works.
If line 8 returns false (i.e. no such row in GLOBALSEQNO), line 22 leads to an SQLFetch which returns SQL_NO_DATA. That is incorrect for the query select * from globalseqno where tablename='*', as this row is there.
Tried to set up an odbc++ analogue of the above code, but am doing something wrong as it throws an exception [libodbc++]: Cannot re-execute; statement has an open resultset on the second executeUpdate, which leads me to believe that DatabaseInterface does a bunch of things under the hood that I don't yet understand.
INVESTIGATE THIS TRACE
line 22 above triggers this in the odbc driver trace file:
[Oracle][4207]SQLFetch.c[0332]
Entry:[SQL_HANDLE_STMT][0x0969a160]
Status = 0
num_result_rows = 1

OCIStmtFetch(0x9713d88,0x970892c,1,2,0)=NO_DATA
OCIAttrGet(0x9713d88,OCI_HTYPE_STMT,0x969a1e8,(nil),9,0x970892c)=SUCCESS
[Oracle][4207]SQLFetch.c[0345]
Exit: [SQL_NO_DATA]
Installed Easysoft oracle odbc driver trial version as root on odets2.fnal.gov. This package requires installation as root, put everything in /usr/share/easysoft with a required soft link back to /usr/local/easysoft.
Here is the /etc/odbcinst.ini it created:
[ORACLE]
Description = Easysoft ODBC Oracle Driver
Driver = /usr/local/easysoft/oracle/libesoracle.so
Setup = /usr/local/easysoft/oracle/libesoraclesetup.so
FileUsage = 1
And here is the /etc/odbc.ini it created:
[ORACLE]
Driver = ORACLE
Database = minosdevsam
User = minos_dev
Password = CENSORED
METADATA_ID = 0
ENABLE_USER_CATALOG = 1
ENABLE_SYNONYMS = 1
RDBC as used by MINOS ignores these .ini files as best as I can tell. It generally uses URLs to store connect information to a database. The format of the URLS is like so:
servertype:odbc:[libname]://host:port/databasename
So to use libesoracle.so from Easysoft one uses
oracle:odbc:esoracle://cdfpcb.fnal.gov:1521/cdf_val
and makes sure libesoracle.so is in the LD_LIBRARY_PATH.
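For what it's worth, the URL format above pulls apart mechanically. This hypothetical parser is only to pin down which piece is which; it is not RDBC's actual code and does no error handling:

```cpp
#include <string>

// Pieces of a servertype:odbc:libname://host:port/database URL.
struct ConnectUrl {
    std::string servertype, libname, host, port, database;
};

ConnectUrl parseUrl(const std::string& url) {
    ConnectUrl u;
    std::size_t a = url.find(':');            // end of servertype
    u.servertype = url.substr(0, a);
    std::size_t b = url.find(':', a + 1);     // end of the "odbc" token
    std::size_t c = url.find("://", b + 1);   // end of libname
    u.libname = url.substr(b + 1, c - b - 1);
    std::size_t d = url.find(':', c + 3);     // end of host
    u.host = url.substr(c + 3, d - c - 3);
    std::size_t e = url.find('/', d + 1);     // end of port
    u.port = url.substr(d + 1, e - d - 1);
    u.database = url.substr(e + 1);
    return u;
}
```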
Going through this exercise verifies that the commercial driver from easysoft returns the correct GLOBALSEQNO while the open source one does not.
Next we try running some ODBC test cases using open source (OS) and EasySoft (ES) drivers side by side to see if we can catch why/where they act differently. Here are some excerpts from my odbc.ini and odbcinst.ini:
ODBC.INI
[oracle_os_cdf_val]
Driver = oracle_os
Database = cdf_val

[oracle_es_cdf_val]
Driver = oracle_es
Database = cdf_val
ODBCINST.INI
[oracle_es]
Description = Easysoft ODBC Driver
Driver = /usr/local/easysoft/oracle/libesoracle.so
Setup = /usr/local/easysoft/oracle/libesoraclesetup.so
FileUsage = 1

[oracle_os]
Description = open source oracle ODBC Driver
Driver = /afs/fnal.gov/files/home/room2/dbox/odbc/lib/liboraodbc.so
Setup = /afs/fnal.gov/files/home/room2/dbox/odbc/lib/liboraodbcS.so
This setup works with isql, the unixODBC utility; unfortunately the Easysoft driver core dumps on SQLConnect and SQLDriverConnect in my test programs.
Resumed testing TestDbi for the GLOBALSEQNO problem. I was using an increasingly old branch (Oct 19) of development, so I updated everything and - everything broke. The fix is described in this email fragment:
I returned to the GLOBALSEQNO mystery last Friday, armed with a new oracle odbc driver which may or may not fix the problem once I get it installed and working. While running TestDbi against oracle with the existing odbc driver, I was confronted with problems in the attached files, which I have fixed and will now describe:

File prepare_db_0.sql: table ALL_DATA_TYPES gets constrained to have its SEQNO exist in ALL_DATA_TYPESVLD; unfortunately it did not have a SEQNO column, so the constraint made no sense and just generated errors. I changed this, and also changed the order of ALL_DATA_TYPES and its validity table in the file, as the existing order tried to alter properties on the table before it existed.

File prepare_db_1.sql: fixed insert statements for the DBIDEMODATA4 table so they work with both mysql and oracle. Changed double quotes to single quotes for varchar/text inserts, and changed embedded single quotes in text to that rather goofy double single quote idiom.
The following TSQL errors have occured:- Error executing "INSERT INTO DBUSUBRUNSUMMARYVLD VALUES (202798600,'2004-11-17 00:52:05','2038-01-19 03:14:07',2,1,0,-1,'2004-11-17 01:52:29','2004-11-18 16:07:17');": [Easysoft][Oracle]Integrity Constraint Violation
These rows already exist in the oracle database; normally dbmjob knows this, skips them, and reports that fact, but now it complains once an hour till I can figure out how to fix it.
Manually entered the following 4 lines into minosdev and minosdevsam; note the '' quoting.
INSERT INTO DBILOGENTRY VALUES (700005475,'CALMIPCALIBRATION',700000012,700000012,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','This is the MIP number for the ND based on work done by Niki comparing cosmic MC and data. I have adjusted Niki''s findings to account for the changes Nathaniel and I made to the self consistency light level tuning of the MC. This number uses through going cosmic muons so is dependent to some extent on the muon spectrum in the ND MC being correct. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005474,'CALMIPCALIBRATION',700000011,700000011,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.''s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005473,'CALMIPCALIBRATION',700000010,700000010,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.''s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005476,'CALMIPCALIBRATION',700000013,700000013,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','This is the MIP number for the ND based on work done by Niki comparing cosmic MC and data. I have adjusted Niki''s findings to account for the changes Nathaniel and I made to the self consistency light level tuning of the MC. This number uses through going cosmic muons so is dependent to some extent on the muon spectrum in the ND MC being correct. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
re-primed the two oracle dev databases (odets2.fnal.gov/minosdev and dba64.fnal.gov/minosdevnew). Dbmjob replication running with the 11/5 code changes, the following error showed up in the logs.
The following TSQL errors have occured:-
Error executing "INSERT INTO DBILOGENTRY VALUES (700005474,'CALMIPCALIBRATION',700000011,700000011,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.\'s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');": [Easysoft][Oracle]ORA-01756: quoted string not properly terminated
The oracle odbc driver apparently does not handle escaped single quotes in text strings. Oh joy, oh bliss.
Update: to escape a single quote in standard SQL, one quotes it; i.e. to escape ', enter ''. Observe:
mysql> insert into foo values ('Mrs O''Leary''s cow');
Query OK, 1 row affected (0.00 sec)

mysql> select * from foo;
+-------------------+
| bar               |
+-------------------+
| Mrs O'Leary's cow |
+-------------------+
1 row in set (0.00 sec)
This works for both Oracle and Mysql.
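So the portable escaping rule is simply: double each embedded quote. A small helper illustrating it (my own sketch, not code from the replication scripts):

```cpp
#include <string>

// Escape a value for embedding in a single-quoted SQL string literal
// by doubling each single quote; valid in both Oracle and MySQL.
std::string escapeSqlString(const std::string& s) {
    std::string out;
    for (char c : s) {
        out += c;
        if (c == '\'') out += '\'';  // ' becomes ''
    }
    return out;
}
```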
Fixed DbiStatement.cxx to properly replicate DBUSUBRUNSUMMARY table. From an email to Nick:
I had to make the following changes to get DbiStatement.cxx (attached) to properly replicate DbuSubRunSummary, which had been turned off previously:
1) move the drop public synonym statement ahead of the create table statement
2) fix a bug where char translated to varchar but varchar was translated to varvarchar in certain circumstances
3) change column COMMENT to RUN_COMMENT if it comes through on a replication command
4) get rid of the ALTER TABLE STORAGE(MAXEXTENTS UNLIMITED) command as it is obsoleted in Oracle 10 and just causes errors on the dba64.fnal.gov server
Now that I can replicate this table properly I will re-prime, turn DBMAuto back on, and we can do more 'missing row' checks.
1. Oracle status - questions for Dennis http://www-pnp.physics.ox.ac.uk/~west/minos/discussions/db_current_problems.html
Summary of Nick's 18 Oct and 21 Oct items:
a. We have lost a lot of rows in some tables. It might be because Dennis' importing job was filtering them out but that needs to be established. Beyond that we also have the problem that ORACLE objects to the presence of certain characters (e.g. \n) in string data, so we will need to clean them out of the MySQL DB once we have determined exactly which characters are to blame.
b. There remains a problem in TestDbi, first reported on 31 August, that it isn't able to read from GLOBALSEQNO properly. Dennis thinks that something confuses RDBC when it has multiple connections, i.e., a cascade, as it works O.K. at that level when there is only a single database connection.
Nelly has set up a new oracle server on dba64.fnal.gov, servername minosdevsam
A copy of the offline database from minos-db1.fnal.gov is being replicated to it via priming scripts and then DBMAuto/dbmjob. The plan is for minosdev database to go away and minosdevsam to become minosdev transparently at some point.