Here is a description of the problem: in TODBCStatement::StoreResult(), at line 199 of TODBCStatement.cxx, the following ODBC call is made:

    /* line 199 of TODBCStatement.cxx */
    SQLSetStmtAttr(fHstmt, SQL_ATTR_ROWS_FETCHED_PTR, &fNumRowsFetched, 0);

The intent of this call is to bind fNumRowsFetched to the internals of the ODBC driver library, so that TODBCStatement can keep track of how many rows are fetched back. A successful SQLFetch should always set fNumRowsFetched to 1, and a successful SQLFetchScroll should set it to 1 or greater. Unfortunately, with the Oracle-supplied ODBC driver for Linux, fNumRowsFetched remains 0 even after a successful fetch, which I think is a bug in the driver. I have tested the following change to TODBCStatement::NextResultRow() with the attached test program. The change strikes me as a bit hackish, but it makes the attached root.Tstmt.BFIELDMAPDATA.C program work the same for ODBC as it does for OCCI:

    /* line 264 of TODBCStatement.cxx */
    SQLRETURN retcode = SQLFetchScroll(fHstmt, SQL_FETCH_NEXT, 0);
    /* add the following two lines to get around the Oracle-supplied ODBC driver bug */
    if (retcode != SQL_NO_DATA && fNumRowsFetched == 0)
        fNumRowsFetched = 1;
I evaluate the new ROOT SQL interfaces.
Recently fixed and tested
Near term plans
To Be Decided
    SQL> desc int_test
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     I                                                  NUMBER(38)

    SQL> select * from int_test
      2  ;

             I
    ----------
    2147483643
    2147483644
    2147483645
    2147483646
    2147483647
    2147483648
    2147483649
    2147483650

    8 rows selected.

    (phoenix.fnal.gov:dbox)% loon
    loon [0] .x TestTSQL2.C("test.dat");
    Processing commands in test.dat
    Opening connection to url:oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/test10.2 user: minos_writer passwd:XXXXX
    Executing statement select * from int_test
    Reading back results
    Result set has 1 columns
    I: type:-5: precision:38
    I: 2147483643
    I: 2147483644
    I: 2147483645
    I: 2147483646
    I: 2147483647
    I: 2147483648
    I: 2147483649
    I: 2147483650
    loon [1] q
    SQL> desc bin_float_test
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     F1                                                 BINARY_FLOAT
     D1                                                 BINARY_DOUBLE
     OF1                                                FLOAT(63)
     OD1                                                FLOAT(126)

    (phoenix.fnal.gov:dbox)% loon
    loon [0] .x TestTSQL2.C("test.dat");
    Processing commands in test.dat
    Opening connection to url:oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/test10.2 user: minos_writer passwd: minos_writer
    Executing statement select * from bin_float_test
    Reading back results
    Result set has 4 columns
    F1: type:7: precision:0
    D1: type:8: precision:0
    OF1: type:6: precision:63
    OD1: type:8: precision:126
    F1: 1.234000
    D1: 1.234000
    OF1: 1.234000
    OD1: 1.234000
    loon [1] .quit
Nick spotted the problem with the missing packets in replication: it seems that multiple copies of DBMAuto can run at the same time, each updating the context file and leading to confusion as to where replication left off.
I switched the initiating script, launch.csh, to fix the problem.
Replication failed again last night: it tried to drop and recreate the CALPMTDRIFT table (with the SEQNO column/primary key change) and failed. This creates so many errors in the log (it tries to insert several thousand rows into a table which no longer exists) that it's hard to see what else is wrong. I will fix this table manually and see if that clears it up.
(later) Hmmm, no CALPMTDRIFT priming data in the database dump area! I can't fix this one manually. Time to send mail to Nick. I turned off this table in the replication script, to at least fix this error and be able to read the logs...
Found an interesting symptom of the missing packets problem in the Oracle Replicant
Grepping for the dates 9-11-05 and 9-12-05 in FNAL_import.log, I see something interesting at the date boundary: on 9-11 it stopped at import file FNAL_00019598.dbm.gz, and on 9-12 it started importing at FNAL_00019609.dbm.gz, skipping 10 files! Here is the grep extract:

    2005-09-11 06:22:38 Importing FNAL_00019595.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019596.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019597.dbm.gz:-
    2005-09-11 06:22:38 Importing FNAL_00019598.dbm.gz:-
    2005-09-11 06:22:38 Importer job terminating
    Updated on 2005-09-11 17:10:50 by minos running looter on minos-om.fnal.gov connected to minos-db1.fnal.gov
    Updated on 2005-09-11 17:10:50 by minos running looter on minos-om.fnal.gov connected to minos-db1.fnal.gov

    (phoenix.fnal.gov:dbox)% grep 2005-09-12 FNAL_import.log
    2005-09-12 06:07:26 Too many updates for one job, stopping after update 19618
    2005-09-12 06:08:00 Importing FNAL_00019609.dbm.gz:-
    2005-09-12 06:09:10 Importing FNAL_00019610.dbm.gz:-
    2005-09-12 06:10:07 Importing FNAL_00019611.dbm.gz:-
    2005-09-12 06:10:07 Importing FNAL_00019612.dbm.gz:-

I went back and looked more closely at my log file, and it sometimes, but not always, skips packet numbers at date boundaries. So of course some data will be missing in the replicant.
I am wondering if I have 2 different jobs writing to the same context file, which is supposed to keep track of which packet it stopped at.
To test this theory I moved the 2 different jobs to 2 different directory trees with separate context files.
Status
Mysql:
    mysql> select SEQNO from SPILLTIMENDVLD where SEQNO > 700000990 and SEQNO < 700001000;
    +-----------+
    | SEQNO     |
    +-----------+
    | 700000991 |
    | 700000992 |
    | 700000993 |
    | 700000994 |
    | 700000995 |
    | 700000996 |
    | 700000997 |
    | 700000998 |
    | 700000999 |
    +-----------+
    9 rows in set (0.02 sec)
Oracle:
    SQL> select SEQNO from SPILLTIMENDVLD where SEQNO > 700000990 and SEQNO < 700001000;

         SEQNO
    ----------
     700000991
     700000994
     700000995
     700000998
     700000999
I am looking at the logs and trying to determine how this happened....
Failure 256 (1,0) on:

    dbmjob -b > /home/dbox/minos/slave_site/work_dba64/824job_file.tmp 2>&1 << EOF
    import --Batch --File /home/dbox/minos/slave_ ... AG_FAR,!DCS_RPS,!PULSERDRIFT,!DBUVACHIPSPARS,!PULSERTIMEDRIFT,!PULSERGAIN,!DBUSUBRUNSUMMARY,!PULSERGAINPIN,!PULSERGAINLOG,!PULSERDRIFTPIN
    quit
    EOF

Common codes:

    126   = Cannot execute (permissions problem?)
    127   = Command not found
    128+N = Terminated by signal N
    139   = SegV ?

    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027298 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027299 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027300 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027301 failed
    =E= Dbm 2005/09/15 01:49:56 [-1|-1] DbmModule.cxx,v1.52:1136> 700027302 failed
Digging deeper, I find the following error messages in failed.log:
    ...
    TSQLException ctor reason = Error executing "DROP TABLE CALPMTDRIFTVLD;":
    [Easysoft][Oracle]ORA-02449: unique/primary keys in table referenced by foreign keys state= HY000
    TSQL::Throw TSQLException message = Error executing "DROP TABLE CALPMTDRIFTVLD;":
    [Easysoft][Oracle]ORA-02449: unique/primary keys in table referenced by foreign keys
Followed later by
    TSQLException ctor reason = Error executing "INSERT INTO CALPMTDRIFT VALUES (700027298,1,2701264895,2181038084,89.137703,0.281358,0.408714,126,128,89.000198,37.206799,126.562);":
    [Easysoft][Oracle]ORA-00913: too many values state= HY000
    TSQL::Throw TSQLException message = Error executing "INSERT INTO CALPMTDRIFT VALUES (700027298,1,2701264895,2181038084,89.137703,0.281358,0.408714,126,128,89.000198,37.206799,126.562);":
    [Easysoft][Oracle]ORA-00913: too many values
Added an explicit /usr/krb5/bin/aklog to the replication scripts that failed, to see if this fixes the permission problems. The failing jobs already obtain a Kerberos ticket (allegedly), as they are run by /usr/krb5/bin/kcron.
Re-ran the failing steps of the above jobs by hand; they ran just fine.
Cron jobs failing again. Seems to be an /afs space permissions problem.
Restarted replication cron jobs
In other words, this program fragment using odbc++ works:
    } catch(SQLException& e) {
      cerr << endl << e.getMessage() << endl;
    }
while this RDBC fragment does not work:
    } catch(TSQLException& e) {
      printf("caught SQLException %s\n", e.GetMessage().Data() );
      // ...
    }
even though the odbc wrappers that throw the TSQLExceptions are riddled with code like so:
    } catch(odbc::SQLException& e) {
      Throw( new TSQLException( ODBCXX_STRING_CSTR(e.getMessage()),
                                ODBCXX_STRING_CSTR(e.getSQLState()),
                                e.getErrorCode()) );
    }
More poking about with the debugger, trying to see why the exception isn't caught.
    Single stepping until exit from function OCINumberToInt, which has no line number information.
    ocivnu_sqlslong (row=0, ir=0x9bc056c, target=0x0, buflen=4, indi=0x9bc0718) at oracle_functions.c:1770
    1770        if(ret)
    Current language:  auto; currently c
    (gdb) n
    1767        ret=OCINumberToInt(ir->desc->dbc->oci_err,
    (gdb) list
    1762            NULL,"ocivnu_sqlslong",
    1763            "BufferLength",buflen);
    1770        if(ret)
    1771        {
    1772            ood_driver_error(ir->desc->stmt,ret,__FILE__,__LINE__);
    1773            return SQL_ERROR;
    1774        }
    (gdb) n
    1772            ood_driver_error(ir->desc->stmt,ret,__FILE__,__LINE__);
    (gdb) n
    1773            return SQL_ERROR;
    (gdb) n
    odbc::ResultSet::_doFetch (this=0x9bc04c8, fetchType=1, rowNum=0) at /afs/fnal.gov/files/code/e875/general/minossoft/packages/RDBC/HEAD/odbc++/resultset.cpp:453
    453         this->_checkStmtError(hstmt_,r,"Error fetching data from datasource");
    Current language:  auto; currently c++
Gack. The program exits normally from here. Line 1773 above leads to an odbc++ exception which gets ignored by RDBC. I put debug statements in the TSQLException constructor and they didn't get called.
Replication still seems to find interesting ways to core dump. ODBCSYSINI variable alluded to yesterday points to a file odbcinst.ini, which among other things tells odbc whether or not to log all the odbc function calls. The log file (odbc.log) grew to 2 G in size and caused dbmjob to core dump every time it ran. Cleared this up.
It turns out, after much poking about, that the oracle-to-mysql replication was choking because of an incorrect setting of an update number in a context file. Hopefully I am on top of this now (the replication, not the checksumming).
Replication still seems to find interesting ways to core dump. Spent most of the day reconstructing the environment by hand to see if I could reproduce why dbmjob segvs (it turns out to need the env variable ODBCSYSINI to exist). Fixed this in the crontab scripts, verified it runs by hand, and will await output from tonight's cron jobs.
Working on integer overflow problem first described here.
Finally figured out what was wrong with replication: dbmjob was doing a core-dump-on-exit again. Fixed this with a one-line change to the oracle odbc driver, then ran an export job which took 8 hours to catch up. I have moved all my replication jobs off of flxi03 onto phoenix now; it is usually faster, as the jobs don't have to compete against resource-intensive jobs.
Started moving my replication jobs off of flxi03 on to phoenix, and upgrading oracle client software uniformly to 10.2.0.1 from the ancient 8.1.7 used on flxi03 and elsewhere.
Back from vacation. Replication seems to be broken and I am trying to understand why. For instance, exports from oracle (minos_dev) mysteriously stopped in late July and I only just noticed. The email from my replication cron jobs reports that they exit abnormally, and I can't reproduce that error when running by hand.
On a brighter note: the Oracle 10.2 binary_float and binary_double types seem to work with loon and other minos software when oracle_odbc_driver is compiled with oracle_client 10.2.x.
To get loon to work with the 'dbox' 10.2 server on phoenix, source the following on phoenix, where I have built and installed oracle_clients and oracle_odbc_drivers to match:
    setup oracle_client v10_2_0_1
    setenv LD_LIBRARY_PATH /home/dbox/UPS/prd/oracle_odbc/Linux+2/v0_5_5/lib:${LD_LIBRARY_PATH}
    setenv LD_LIBRARY_PATH /home/dbox/UPS/prd/unix_odbc/Linux+2/v2_2_11/lib:${LD_LIBRARY_PATH}
    setenv ENV_TSQL_URL oracle:odbc:oraodbc://phoeninx.fnal.gov:1521/dbox
    setenv ENV_TSQL_USER minos_writer
    setenv ENV_TSQL_PSWD XXXXXX
    setenv TNS_ADMIN /home/dbox

From an earlier email from Nick, how to spot the checksum discrepancy:
    setenv DBM_DEBUG 2
    setenv DBM_SELECT_TABLE ???
    setenv DBM_SELECT_SEQNO "SEQNO = nnnnnnnnn"

where ??? is the table and nnnnnnnnn is a SEQNO in conflict; then run the checksum option and you will get a gory-detail dump of how it arrived at the checksum for that entry.
To start the oracle 'dbox' 10.2 database server on phoenix:

To stop:
Verified that the minosdev database on dba64 has 4G of data and 4G of indexes, so I have enough disk space here on phoenix to do a full-blown refresh/test using BINARY_FLOAT data types.
Used the Enterprise Manager URL below to generate and execute the following SQL on the new 'dbox' Oracle 10.2 server:
    CREATE BIGFILE TABLESPACE "MINOSDEV_DATA"
      DATAFILE '/usr/oracle/oradata/dbox/minosdev_data01'
      SIZE 1G AUTOEXTEND ON NEXT 100M MAXSIZE UNLIMITED
      LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
    CREATE BIGFILE TABLESPACE "MINOSDEV_IDX"
      DATAFILE '/usr/oracle/oradata/dbox/minosdev_idx01'
      SIZE 1G AUTOEXTEND ON NEXT 100M MAXSIZE UNLIMITED
      LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
    CREATE USER "MINOS_WRITER" PROFILE "DEFAULT"
      IDENTIFIED BY "*******"
      DEFAULT TABLESPACE "MINOSDEV_DATA"
      TEMPORARY TABLESPACE "TEMP"
      ACCOUNT UNLOCK

    GRANT "CONNECT" TO "MINOS_WRITER"
GRANT CONNECT, RESOURCE TO MINOS_WRITER
(Did something similar for minos_reader)
    (phoenix.fnal.gov:oracle)% sqlplus minos_writer/XXXX@dbox

    SQL*Plus: Release 10.2.0.1.0 - Production on Fri Aug 5 16:33:46 2005
    Copyright (c) 1982, 2005, Oracle.  All rights reserved.

    Connected to:
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
    With the Partitioning, OLAP and Data Mining options

    SQL> create table ieee_table (bf binary_float, bd binary_double);
    Table created.

    SQL> desc ieee_table;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     BF                                                 BINARY_FLOAT
     BD                                                 BINARY_DOUBLE

    SQL> insert into ieee_table values (1.0 , 1.0);
    1 row created.

    SQL> insert into ieee_table values (1234.56789, 1234.56789);
    1 row created.

    SQL> select * from ieee_table;

            BF         BD
    ---------- ----------
      1.0E+000   1.0E+000
    1.235E+003 1.235E+003
Status report for Aug 4 Software Phone Conference
minos-db1(mysql) --> minosdev(oracle) --> offline_rep(mysql)
Checksums for the 3 databases do not match; I am investigating.
The following tables have checksum problems that I am investigating:
BEAMMONSWICPEDS BFIELDCOILCURRENT CALPMTDRIFT DBILOGENTRY SPILLTIMEND UGLIDBISTRIP
The following tables 'showed up' in the offline minos-db1 and did not get created on the minosdev-oracle server:
BEAMMONSWICREL BFIELDMAPDATA BFIELDMAPHEADER
    SQL> create table t1 (c1 float(12), c2 number(12));
    Table created.

    SQL> desc t1;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     C1                                                 NUMBER(12)
     C2                                                 NUMBER(12)

    SQL> insert into t1 values (123456.0 , 123456.0);
    1 row created.

    SQL> select * from t1;

            C1         C2
    ---------- ----------
        123500     123456
Moral: Don't specify oracle FLOAT(N) in your table unless you are sure of what you are doing.
Specifying Float or Double Precision in your table DDL will give you 126 bits of precision, which is greater than the IEEE floating point standard that the rest of the world seems to use. Oracle 10.2 has a new floating point type which is supposed to be the IEEE type that everyone else uses; I am experimenting with a 10.2 server to see if that is what we should be using.

Found the problem for Nick's new failure mode described here: it's an integer overflow. If the MY_BIGINT column in table ALL_DATA_TYPES contains a value bigger than (2^31 - 1) == 2147483647, the oracle odbc driver returns an overflow error, which for some reason does not get propagated back up through odbc++ and then RDBC.
A workaround (for now) is not to use BIGINTs in Oracle. The fix is to get the oracle odbc driver to communicate correctly with odbc++ when the Oracle integer has more than 9 places of precision.
An oracle integer by default has 38 decimal places of precision, while a mysql integer has 9 places and a mysql bigint has 19. A perfectly reasonable 1-1 correspondence between Mysql BIGINT and Oracle NUMBER(19,0) should exist.
Unzipped and installed oracle 10.2.0.1 on phoenix.fnal.gov to investigate its usefulness for resolving floating point round-off issues. Found a 'cookbook' for installing on linux at this site. Problems that I encountered, with fixes, below:
Problem: semaphores and network udp datagram buffers incorrectly sized:
    Checking for semmsl=250; found semmsl=250.  Passed
    Checking for semmns=32000; found semmns=32000.  Passed
    Checking for semopm=100; found semopm=32.  Failed <<<<
    Checking for semmni=128; found semmni=128.  Passed
    Checking for shmmax=536870912; found shmmax=33554432.  Failed <<<<
    Checking for shmmni=4096; found shmmni=4096.  Passed
    Checking for shmall=2097152; found shmall=2097152.  Passed
    Checking for file-max=65536; found file-max=203052.  Passed
    Checking for VERSION=2.4.21; found VERSION=2.4.21-27.0.2.EL.  Passed
    Checking for ip_local_port_range=1024 - 65000; found ip_local_port_range=32768 - 61000.  Failed <<<<
    Checking for rmem_default=262144; found rmem_default=65535.  Failed <<<<
    Checking for rmem_max=262144; found rmem_max=131071.  Failed <<<<
    Checking for wmem_default=262144; found wmem_default=65535.  Failed <<<<
    Checking for wmem_max=262144; found wmem_max=131071.  Failed <<<<
    Check complete. The overall result of this check is: Failed <<<<
    Problem: The kernel parameters do not meet the minimum requirements (see above).
    Recommendation: Perform operating system specific instructions to update the kernel parameters.
Solution: add the following lines to the /etc/sysctl.conf file, which is read during the boot process:
    kernel.shmmax=2147483648
    kernel.sem=250 32000 100 128
    fs.file-max=65536
    net.ipv4.ip_local_port_range=1024 65000
    net.core.rmem_default=262144
    net.core.wmem_default=262144
    net.core.rmem_max=262144
    net.core.wmem_max=262144
Problem: swap space incorrectly sized:
    =======================================================================
    Checking available swap space requirements ...
    Expected result: 2916MB
    Actual Result: 992MB
    Check complete. The overall result of this check is: Failed <<<<
    Problem: The system does not have the required swap space.
    Recommendation: Make more swap space available to perform the install.
    =======================================================================
Solution: create a new swap file and turn it on.
To add a swap file of 3000 MB (in 1024-byte blocks: 3000 * 1024 = 3072000 blocks):

    # dd if=/dev/zero of=/usr/swap bs=1024 count=3072000
    # mkswap /usr/swap
    # swapon /usr/swap

Then add the following line to /etc/fstab (the path must match the swap file created above):

    /usr/swap    swap    swap    defaults    0 0

The next time the system boots, it will enable the new swap file.
Problem(?): Not sure if it's a problem, but the oracle install starts up a web server!!!
Output from the oracle installer:
The following J2EE Applications have been deployed and are accessible at the URLs listed below.

    iSQL*Plus URL:
    http://phoenix.fnal.gov:5560/isqlplus

    iSQL*Plus DBA URL:
    http://phoenix.fnal.gov:5560/isqlplus/dba

    Enterprise Manager 10g Database Control URL:
    http://phoenix.fnal.gov:1158/em
No idea if these are security risks; I need to learn how to turn them on and off just in case.
Interesting tidbits from ps auxww | grep oracle, which I think are the 3 processes mentioned above:

    oracle 7920 4.2 5.9 604752 117628 pts/1 S 21:17 1:02 /usr/oracle/oracle_home/jdk/bin/java -server -Xmx256M -XX:MaxPermSize=96m -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -DORACLE_HOME=/usr/oracle/oracle_home -Doracle.home=/usr/oracle/oracle_home/oc4j -Doracle.oc4j.localhome=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox/sysman -DEMSTATE=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox -Doracle.j2ee.dont.use.memory.archive=true -Djava.protocol.handler.pkgs=HTTPClient -Doracle.security.jazn.config=/usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/jazn.xml -Djava.security.policy=/usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/java2.policy -Djava.security.properties=/usr/oracle/oracle_home/oc4j/j2ee/home/config/jazn.security.props -DEMDROOT=/usr/oracle/oracle_home/phoenix.fnal.gov_dbox -Dsysman.md5password=true -Drepapi.oracle.home=/usr/oracle/oracle_home -Ddisable.checkForUpdate=true -Djava.awt.headless=true -jar /usr/oracle/oracle_home/oc4j/j2ee/home/oc4j.jar -config /usr/oracle/oracle_home/oc4j/j2ee/OC4J_DBConsole_phoenix.fnal.gov_dbox/config/server.xml

    oracle 9628 0.6 2.3 261736 46076 pts/1 S 21:19 0:09 /usr/oracle/oracle_home/jdk/bin/java -Djava.awt.headless=true -Doracle.oc4j.localhome=/usr/oracle/oracle_home/sqlplus -Djava.security.properties=/usr/oracle/oracle_home/oc4j/j2ee/home/config/jazn.security.props -jar /usr/oracle/oracle_home/oc4j/j2ee/home/oc4j.jar -config /usr/oracle/oracle_home/oc4j/j2ee/isqlplus/config/server.xml

    oracle 7893 0.0 0.3 7796 6252 pts/1 S 21:17 0:00 /usr/oracle/oracle_home/perl/bin/perl /usr/oracle/oracle_home/bin/emwd.pl dbconsole /usr/oracle/oracle_home/phoenix.fnal.gov_dbox/sysman/log/emdb.nohup
Issues that need my urgent attention:
    >
    >
    >> Finally a new weird failure mode.  The symptom is that I import a new
    >> table, with a single SEQNO which just has one data row into ORACLE.
    >> It goes in just fine and with sqlplus I can see both the VLD entry and
    >> the main entry.  Now I try to export the table again and I get the VLD
    >> entry but the query to pull the main entry out returns nothing, not
    >> even an error!
    >>
    >
    >
    > Right now I have a table ALL_DATA_TYPES with a single entry 990000000
    > in the main and VLD tables in the Oracle DB.  With sqlplus I can see
    > them:-
    >
    > SQL> select * from ALL_DATA_TYPESVLD;
    >
    >      SEQNO TIMESTART TIMEEND   DETECTORMASK    SIMMASK       TASK
    > AGGREGATENO
    > ---------- --------- --------- ------------ ---------- ----------
    > -----------
    > CREATIOND INSERTDAT
    > --------- ---------
    >  990000000 01-JAN-05 01-JAN-06          127        127          0
    > -1
    > 01-JAN-05 01-AUG-05
    >
    >
    > SQL> select * from ALL_DATA_TYPES where SEQNO= 990000000;
    >
    >      SEQNO MY_TINYINT MY_SMALLINT     MY_INT  MY_BIGINT   MY_FLOAT
    > MY_DOUBLE
    > ---------- ---------- ----------- ---------- ---------- ----------
    > ----------
    > MY_TINYTEXT
    >
    > ------------------------------------------------------------------------
    > --------
    > MY_TEXT
    >
    > ------------------------------------------------------------------------
    > --------
    > M MY MY_ MY_C MY_CH MY_CHA MY_CHARS_20          MY_DATETI
    > - -- --- ---- ----- ------ -------------------- ---------
    >  990000000        123       12345  123456789 1.2346E+18      1.235
    > 2.345679
    > abc
    > abcdefghijklmnopqrstuvwxyz
    > a ab abc abcd abcde abcdef aaaaaaaaaabbbbbbbbbb 02-FEB-05
    >
    > But I can only see the VLD entry with TSQL.  To demonstrate,
    > create a file dennis.C that contains:-
    >
    > oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev minos_dev
    > minos_dev_new
    > select * from ALL_DATA_TYPESVLD ;
    > select * from ALL_DATA_TYPES where SEQNO= 990000000 ;
    >
    > (you really do need a space before the semi-colon!)
    >
    > This will be data to my test script:-
    >
    > $SRT_PUBLIC_CONTEXT/DatabaseInterface/test/TestTSQL2.C
    >
    > The first line defines the connection and the last two are commands
    > given directly to TSQL by-passing the DBI.  To run, fire up loon and type:-
    >
    > .x TestTSQL2.C("dennis.dat")
    >
    > If I do that I see:-
    >
    > loon [0] .x TestTSQL2.C("dennis.dat")
    > Processing commands in dennis.dat
    > Opening connection to url:oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev user: minos_dev passwd: minos_dev_new
    > Executing statement select * from ALL_DATA_TYPESVLD
    > Reading back results
    > Result set has 9 columns
    > SEQNO: 990000000 TIMESTART: 2005-01-01 00:00:00 TIMEEND: 2006-01-01 00:00:00 DETECTORMASK: 127 SIMMASK: 127 TASK: 0 AGGREGATENO: -1 CREATIONDATE: 2005-01-01 00:00:00 INSERTDATE: 2005-08-01 09:47:31
    > Executing statement select * from ALL_DATA_TYPES where SEQNO= 990000000
    > Reading back results
    > Result set has 17 columns
    > loon [1]
    >
    > so again it see the VLD row but for the main table it only sees the
    > meta-data.
    >
    > Help!
    >
    > Cheers,
    >
    > Nick.
    >
Oracle-to-mysql replication failed last night for table CALPMTDRIFT. The VLD table was created with DATE type instead of TIMESTAMP on phoenix.fnal.gov/offline_rep database. Nick is working on a fix for this as part of his migration plan listed below.
My immediate tasks:
To enable this plan, I am on the hook for
Hi Nick,

I understand somewhat the DBILOGENTRYVLD funniness now. Dbmjob appears to drop and recreate tables under certain circumstances, and doing an import into mysql using an oracle export on this table means that columns like TIMESTART, which are oracle DATEs and mysql DATETIMEs, get created in mysql as DATEs and then return GetColumnType 91 instead of 93.

Here is a metadata export from a mysql database:

    >>>>>DBILOGENTRY Metadata
    create table DBILOGENTRYVLD(SEQNO integer not null primary key auto_increment,
    TIMESTART datetime not null, TIMEEND datetime not null,
    DETECTORMASK tinyint(4), SIMMASK tinyint(4), TASK integer, AGGREGATENO integer,
    CREATIONDATE datetime not null, INSERTDATE datetime not null);

And here it is from an oracle database:

    >>>>>DBILOGENTRY Metadata
    create table DBILOGENTRYVLD(SEQNO INTEGER not null primary key,
    TIMESTART DATE not null, TIMEEND DATE not null, DETECTORMASK INTEGER,
    SIMMASK INTEGER, TASK INTEGER, AGGREGATENO INTEGER,
    CREATIONDATE DATE not null, INSERTDATE DATE not null);

The question is, what to do about this? We could slap on a filter that turns dates back into datetimes for mysql table creation, but the precision of some of the columns is wrong, which could cause checksum or other problems. I don't understand why the tables have to be dropped in the first place if they already exist. Could they simply be cleared out and repopulated?

Cheers,
Dennis
Hi Dennis

>> I understand somewhat the DBILOGENTRYVLD funniness now. Dbmjob appears
>> to drop and recreate tables under certain circumstances.

Actually I think that was me. Recall that I deliberately deleted DBILOGENTRY + VLD from the ORACLE server to complete my tests using TestDbi?

>> and doing an import into mysql using an oracle export on this table
>> results in columns like TIMESTART which are oracle dates and mysql
>> datetimes, get created in mysql as dates and then return
>> GetColumnType 91 instead of 93.
>>
>> [metadata exports from the mysql and oracle databases quoted above]
>>
>> The question is, what to do about this? We could slap on a filter that
>> turns dates back into datetimes for mysql table creation, but the
>> precision of some of the columns is wrong which could cause checksum
>> or other problems.

The answer comes in two parts: an easy bit and a hard bit.

A) (The easy bit). I convert DATETIME to DATE in DbiStatement::TranslateSQL:-

http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/doxygen/loon/html/classDbiStatement.html#d0

Isn't that the right thing to do, and if not, what should I translate that to?

B) (The hard bit). Is this another garden rake I see before me?
I belatedly realise that this is something that's going to reoccur as ORACLE becomes the master. My model has always been that the SQL is MySQL flavoured and that we just translate to ORACLE just before we write. However, I get the table creation SQL from the exporting database and apply it at the importing database if required. So this means I need to have at least limited conversion in the reverse direction.

>> I don't understand why the tables have to be dropped in the first place
>> if they already exist. Could they simply be cleared out and repopulated?

As I say, I think that in this case that's down to me, but your detective work has shown me that we have to cover the Use Case of ORACLE -> MySQL translation.

Cheers,

Nick.
Replication to minosdev(Oracle) from minos-db1 (Mysql) now back in operation after hiatus where ROW_COUNTER schema change was implemented. Problems encountered in getting this up and running were the segv-on-exit bug with the oracle odbc driver and unixODBC libraries earlier than 2.2.11, it took me a while to get the LD_LIBRARY_PATH set correctly to not pick up the MINOS_EXTERNAL unixODBC 2.2.9 libraries first.
Here is the command that lists which tables are currently being replicated; the !tables either are not recommended for replication because of size/non-usefulness issues or don't have ROW_COUNTER columns yet:
    --Table *,!DCS_HV,!DCS_MAG_CAL,!DCS_MAG_FAR,!DCS_RPS,
            !PULSERDRIFT,!DBUVACHIPSPARS,!PULSERTIMEDRIFT,
            !PULSERGAIN,!DBUSUBRUNSUMMARY,!PULSERGAINPIN,
            !PULSERGAINLOG,!PULSERDRIFTPIN
Recently acquired an adequate machine which I plan to use for this purpose (AMD Sempron, 2G RAM, 200G SATA disk; it will be named phoenix.fnal.gov). The mysql offline database is currently about 35 G according to http://www-numi.fnal.gov/minwork/computing/database_size_report.html#by_size .
Investigating some apparent number truncation from 'select count(*) from bla bla bla' type queries reported by Nick. It turns out this was an older bug, fixed in CVS but never exported to the /afs/fnal.gov/.../MINOS_EXTERNAL area.
Updating these libraries and updating unixODBC to 2.2.11 allows all of the minos tutorial examples to run to completion save 2: simple_midad.C and test_user_display_module.C both try to do something with LoadMinosPDG and then core dump.
Ran all the loon tutorial examples against the oracle dev database.
An example of how to do this on flxi03.fnal.gov:
    #!/bin/sh -x
    #setup minos environment
    . /afs/fnal.gov/files/code/e875/general/minossoft/setup/setup_minossoft_FNALU.sh
    #put oracle client libs in LD_LIBRARY_PATH
    setup oracle_client
    #minos-oracle specific environment vars
    export ENV_TSQL_URL='oracle:odbc:oraodbc://dba64.fnal.gov:1521/minosdev'
    export ENV_TSQL_USER=minos_reader
    export ENV_TSQL_PSWD=reader
    #run them all!
    for FILE in `ls $MINOS_TUTORIAL_MACROS/*.C` ; do
        OUT=`basename $FILE `;
        loon -qb $FILE $MINOS_TUTORIAL_DATA/F00018143_0000.mdaq.root 2>>$OUT.oracle 1>>$OUT.oracle
    done;
They don't run perfectly: loon_reco_job.C, simple_midad.C, test_user_display_module.C, time_dbi.C and write_root_file.C seg fault, which is 5 of the 12 scripts. I clearly have some investigating to do. When I run the same jobs against the mysql offline database they all terminate normally.
The above script produces the following files, which I am now comparing against their mysql analogs:
Change the output from $OUT.oracle to $OUT.mysql and change the following env vars:
    export ENV_TSQL_URL='mysql:odbc://minos-db1.fnal.gov/offline'
    export ENV_TSQL_USER=reader
    export ENV_TSQL_PSWD=minos_db
And you get the following output files:
I get lots of apparent errors in both the mysql and oracle output files, showing that I don't know much about running loon. I will concentrate on getting the same errors in the oracle output that I do in the mysql output.
(Later) It appears 3 scripts, write_root_file.C, time_dbi.C, and loon_reco_job.C, produce different output to stdout and stderr between the two database servers. This is something I can investigate, along with the 5 scripts that core dump even though they seem to produce the same output.
Came up for air from CDF/BLAST related activities. Realized that my 'latest' priming script, cp_tables.pl, is not in CVS though it very much needs to be. This will serve as a nag to put it there.
(Later) Added some improvements: it now copies over tables with single quotes in them. Checked in to CVS as DatabaseTables/scripts/cp_table_minosdb.pl
My current invocation of the priming script:
./cp_table_minosdb.pl -S 'reader/XXXXXXXX@DBI:mysql:offline;host=minos-db1' -T 'minos_dev/XXXXXXXX@minosdev' -l sortedtabs2
where sortedtabs2 is a list of tables reverse sorted from smallest to biggest in the source database available from http://www-numi.fnal.gov/minwork/computing/database_size_report.html#by_size
This script copies over the ROW_COUNTER data if it's there, generates ROW_COUNTER data if none exists in the source database, and refuses to proceed if the target database doesn't have a ROW_COUNTER column. So I have, in theory, an oracle database that I can use for development.
Got tired of waiting on PK migration of the offline data tables
Waiting to see if the PK proposed below (seqno, row_counter) is adopted by Nick. He has some reservations; it turns out that Mysql does not necessarily use its PK to retrieve data in sorted order, even when the PK is the ORDER BY clause.
See http://dev.mysql.com/doc/mysql/en/ORDER_BY_optimization.html for official documentation that this is the way it's supposed to work, and a discussion by one of the developers about why it doesn't actually work this way at http://forums.mysql.com/list.php?24 . Oy.
A count of duplicate rows in the database dumps from minos-db1.fnal.gov
These duplicate rows would not exist with proper enforcement of primary keys on the data tables.
NB: getting the actual rows that are duplicates so they can be cleaned up has taken far more time than it should; the DCS_HV table is so large that I have made several false starts, running out of disk space. The good news out of this is that I have a better tool set for policing these tables for this sort of error, and should be able to run it from cron jobs until we get primary keys on the tables.
table | number of duplicate rows | link |
---|---|---|
BFIELDCOILCURRENT | 1 | file |
CALADCTOPE | 2880 | file |
CALDETCERRANGE | 14 | file |
CALDETCERTIMEWIN | 9 | file |
CALMAPPERFITS | 484 | file |
DBUVACHIPPEDS | 704712 | file |
DBUVACHIPSPARS | 375366 | file |
DCS_HV | 58967 | file |
PLEXPIXELSPOTTOSTRIPEND | 2 | file |
PLEXPIXELTORAWCHANNEL | 2 | file |
PULSERRAWDRIFTPIN | 268 | file |
PULSERREFDRIFTPIN | 2 | file |
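The counts above hinge on one rule: without a primary key, a row is a duplicate only if every column matches. A minimal sketch of that check (a hypothetical helper, not the actual tooling I ran against the dumps), where each row is passed as its full delimited text:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Count duplicate rows: every copy after the first occurrence of an
// identical full-row string counts as one duplicate.
size_t countDuplicateRows(const std::vector<std::string>& rows) {
    std::unordered_map<std::string, size_t> seen;
    size_t dups = 0;
    for (const auto& row : rows) {
        if (++seen[row] > 1) ++dups;  // identical in every column
    }
    return dups;
}
```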
Back from xmas vacation. I tried sorting the mysqldumps prior to assigning ROW_COUNTERs over the break. This worked for every table except DCS_HV, which is so large that the job runs out of disk space even in my huge /afs/fnal.gov/files/data/minos/validation/ work area.
I can work around this by grepping out individual SEQNOs from the mysqldump file.
The primary key problem for data tables has turned out to be bigger than I originally thought. Nick & I agreed via email that a good 'artificial' primary key is the one described below in the Dec17-20 entry. The hope was that this would result in an oracle database with the same row ordering as the mysql offline database. I no longer think this is the case: the oracle database is primed from ascii files generated by mysqldump, and mysqldump does not feel compelled to dump the rows from a table ordered by seqno. Even if they were ordered by seqno, that would not define the sort order of rows sharing the same seqno. I have *nearly* got the oracle database reprimed after realizing that I have to sort the mysqldump output before inserting into oracle, and I am willing to bet money that this pre-sorting step will result in a different row ordering in Oracle than in Mysql.
Several false starts on creating an artificial primary key for data tables. The current plan is to create an artificial primary key, row_counter, which increments for each row inserted.
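A sketch of how such a row_counter could be assigned while walking the sorted dump rows. I am assuming here that the counter restarts at 1 for each SEQNO, matching the (seqno, row_counter) composite key discussed elsewhere in these notes; the function name is my own invention:

```cpp
#include <map>
#include <utility>
#include <vector>

// For each row's SEQNO, emit (SEQNO, ROW_COUNTER) where ROW_COUNTER
// starts at 1 per SEQNO and increments with each subsequent row.
std::vector<std::pair<int, int>> assignRowCounters(const std::vector<int>& seqnos) {
    std::map<int, int> next;  // SEQNO -> last counter handed out
    std::vector<std::pair<int, int>> out;
    for (int s : seqnos) {
        out.emplace_back(s, ++next[s]);
    }
    return out;
}
```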
Priming job failed again last night on the DCS_HV table. I suspect I ran out of space in the database, as I set the pk index to be on all 13 columns to eliminate duplicates. DCS_HV had 256 duplicate rows (not too bad out of 13 million). DBUVACHIPSPARS had 4590 exact duplicate rows in addition to the 'nearly duplicate' rows noted yesterday.
Here is an example duplicate row from the mysql offline database:
mysql> select * from DCS_HV where SEQNO=200020421 and HV_LOGICAL=8178 and HV_GEOMETRICAL=4715 and HV_MAINFRAME=8 and HV_CARD=11 and HV_CHANNEL=2 and HV_PLANE=471;
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
| SEQNO     | HV_LOGICAL | HV_GEOMETRICAL | HV_MAINFRAME | HV_CARD | HV_CHANNEL | HV_PLANE | HV_SUPERMODULE | HV_SIDE | HV_PMTNUMBER | HV_SETVOLTAGE | HV_READVOLTAGE | HV_ERROR |
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
| 200020421 |       8178 |           4715 |            8 |      11 |          2 |      471 |              2 |       1 |            2 |          -898 |           -898 |        0 |
| 200020421 |       8178 |           4715 |            8 |      11 |          2 |      471 |              2 |       1 |            2 |          -898 |           -898 |        0 |
+-----------+------------+----------------+--------------+---------+------------+----------+----------------+---------+--------------+---------------+----------------+----------+
2 rows in set (0.02 sec)
Spent some time reviewing with Nelly how Locally Managed Tablespaces work. Had her turn off archive logging so I could load and re-load the offline tables into oracle.
Spent much more time than I intended working on getting the new minosdev development instance re-primed. One of the requirements for moving the offline database into production, and it's a reasonable one, is that data tables have a primary key. The problem I am encountering is that some rows of large tables appear to be repeats, and some appear to be garbage data that should have been caught by primary key constraints. Many rows in DBUVACHIPSPARS and DBUVACHIPPEDS are like the following:
mysql> select * from DBUVACHIPSPARS where SEQNO=200015442 and CRATE=7 and MODEFLAGS=0 and VARC=0 and VMM=0 and VAADCSEL=0 and VACHIp=2;
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| SEQNO     | CRATE | MODEFLAGS | VARC | VMM  | VAADCSEL | VACHIP | CHAN00 | CHAN01 | CHAN02 | CHAN03 | CHAN04 | CHAN05 | CHAN06 | CHAN07 | CHAN08 | CHAN09 | CHAN10 | CHAN11 | CHAN12 | CHAN13 | CHAN14 | CHAN15 | CHAN16 | CHAN17 | CHAN18 | CHAN19 | CHAN20 | CHAN21 |
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| 200015442 |     7 |         0 |    0 |    0 |        0 |      2 |  16383 |  16383 |     16 |     17 |     18 |     18 |     18 |     17 |     15 |     18 |     19 |     18 |     18 |     18 |     18 |     19 |     21 |     18 |     50 |  16383 |  16383 |  16383 |
| 200015442 |     7 |         0 |    0 |    0 |        0 |      2 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |  16383 |
+-----------+-------+-----------+------+------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
2 rows in set (0.02 sec)

mysql>
I.e. I suspect that 16383 is equivalent to 'NULL' and the second row is garbage that should not have been read in in the first place. As it is, I am afraid I may have to make a primary key of all 28 columns. Looking at the data, the first 7 columns seem reasonable. Time to ask Nick.
I changed the priming script to make a primary key of the first 2 columns (seqno and whatever's next) unless the table has an entry in an associative array; I made guesses for 28 tables based on a look at the table structure. I suspect this list will have to be tweaked.
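The rule the priming script follows can be sketched like this, with a std::map standing in for the perl associative array; table and column names below are illustrative, not the actual guesses in the script:

```cpp
#include <map>
#include <string>
#include <vector>

// Pick a table's primary key columns: an explicit override if one was
// guessed for this table, otherwise the first two columns.
std::vector<std::string> pkColumns(
    const std::string& table,
    const std::vector<std::string>& columns,
    const std::map<std::string, std::vector<std::string>>& overrides) {
    auto it = overrides.find(table);
    if (it != overrides.end()) return it->second;
    return {columns.begin(), columns.begin() + 2};  // (seqno, whatever's next)
}
```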
The gory details are here.
ORA-01653: unable to extend table MINOS_DEV.BFIELDCOILCURRENT by 8 in tablespace MINOS_DEV_DATA
It turns out that the continuous dropping, creating and inserting via sqlldr had two bad effects
I know how to solve 2; I thought 1 was supposed to be automatically solved by Locally Managed Tablespaces, at least that was the sales pitch. I think the minosdevsam database is fairly messed up right now. I need Nelly's help to bounce the database to turn off archiving and see if that, plus dropping all the tables, allows oracle to self-coalesce its table space (it can't be done manually with LMTs).
More DBMAuto replication difficulties.
-rw-r--r-- 1 dbox products 620683568 Dec 13 13:08 17426_FNAL_00013415.dbm_tmp
-rw-r--r-- 1 dbox products 847322533 Dec 13 13:10 17426_FNAL_00013416.dbm_tmp
-rw-r--r-- 1 dbox products 613590765 Dec 13 13:11 17426_FNAL_00013417.dbm_tmp
-rw-r--r-- 1 dbox products 152707072 Dec 13 13:12 17426_FNAL_00013418.dbm_tmp
I have sent email to Nick about whether he knows of a workaround, I was consumed all day (12/13) with BLAST issues.
Error executing "ALTER TABLE DBILOGENTRY add CONSTRAINT DBILOGENTRY_FK FOREIGN KEY (SEQNO) REFERENCES DBILOGENTRYVLD(SEQNO)": [Easysoft][Oracle]ORA-02275: such a referential constraint already exists in the table
This appears to be something trivial that I haven't found as of yet; I can comment out the code in DbiStatement.cxx that creates the foreign keys as a hack and move on to check other functionality.
Checking the program tracefiles I see that TestDbi does indeed try to execute this command twice and returns an error the second time it runs.
The problem is the table is created twice in the preparation command files, once in prepare_db_0.sql and again in prepare_db_1.sql. It complains and quits the second time.
The following TSQL errors have occured:- Error executing "INSERT INTO DBIDEMODATA4 VALUES (900000002,'DbiDemoData4','NewData','[\'\' \'MyFloat\'=(double)456 \'MyInt\'=(int)123 \'MyString\'=(string)\'A string\']');": [Easysoft][Oracle]ORA-00911: invalid character
This is the dreaded \' vs '' error discussed at length before. I tried filtering these out in DbiStatement.cxx without success.
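The filter I was attempting amounts to rewriting MySQL-style backslash-escaped quotes (\') into the SQL-standard doubled form ('') before handing the statement to Oracle. A sketch of that rewrite (my own illustration, not the actual DbiStatement.cxx code, and deliberately naive: it does not try to handle backslashes inside binary data):

```cpp
#include <string>

// Replace each \' escape with the '' idiom that both Oracle and
// MySQL accept inside single-quoted string literals.
std::string fixQuoteEscapes(const std::string& sql) {
    std::string out;
    for (std::size_t i = 0; i < sql.size(); ++i) {
        if (sql[i] == '\\' && i + 1 < sql.size() && sql[i + 1] == '\'') {
            out += "''";
            ++i;  // consume the quote that followed the backslash
        } else {
            out += sql[i];
        }
    }
    return out;
}
```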
Another error:
The following TSQL errors have occured:- Error executing "update DBIDEMODATA2VLD set INSERTDATE = "2001-09-27 01:02:03" where SEQNO = 900000005; ": [Easysoft][Oracle]ORA-00904: invalid column name
This can be fixed by changing the double quotes on the date to single quotes
Errors: Unable to gunzip /afs/fnal.gov/files/data/minos/dbm/rsync/FNAL_00013360.dbm.gz
Failure 256 executing command:-
echo 2004-12-09 12:50:12 Unable to gunzip /afs/fnal.gov/files/data/minos/dbm/rsync/FNAL_00013360.dbm.gz >> /local_home/dbox/slave_site/work/FNAL_import.log
Please check /local_home/dbox/slave_site/work/FNAL_import_fatal.log
Errors: Failure 256 executing command:-
echo 2004-12-09 14:50:12 Too many updates for one job, stopping after update 13369 >> /local_home/dbox/slave_site/work/FNAL_import.log
Please check /local_home/dbox/slave_site/work/FNAL_import_fatal.log
Investigation showed that this failure was self-inflicted: I vi'ed a large file from offsite, my terminal froze, I killed it, and unintentionally left a runaway vi process and a *huge* vi swap file that filled the working directory to 100% capacity.
Chances are good that the replication from the production offline mysql server is messed up to the point that a refresh is now the best option.
Developments over the weekend
dbmjob: error while loading shared libraries: libMinosObjectMap.so: cannot open shared object file: No such file or directory
I obviously should not be running replication from the head of development if I want to consider it a production system; these sorts of errors will occur and cause the tables to get out of synch again.
65 tables are fine. 12 have conflicts, of two types:
DBUSUBRUNSUMMARY DBUVACHIPPEDS DBUVACHIPSPARS DCS_CHILLER_FAR DCS_ENV_FAR DCS_HV DCS_MAG_CAL DCS_MAG_FAR DCS_RPS PULSERDRIFT
It is possible that these tables got out of synch when I manually refreshed on Nov29 and then the replication job core dumped for 24 hours because of the ROOT upgrade.
I plan on changing the version of ROOT that "development" uses on FNALU from v4-00-08f to v4-01-02 tonight. This involves a fresh rebuild of "development" and _invalidates_ any personal test release libraries. It is important that users clean and rebuild their test releases after the base release has been rebuilt. Serious consequences can occur if there is version skew.
if(status!=SQL_NO_DATA&&stmt->bookmark) {
    /*
     * stmt->bookmark would be nonzero if this wasn't the first call
     * to SQLFetch. As OCIStmtExec==SQLexecute,SQLFetch we don't want
     * to call OCIStmtFetch for the first call of SQLFetch.
     * if status==SQL_NO_DATA we're trying to go past the end of the
     * result set. If we were to call OCIStmtFetch now we'd get
     * an error.
     */
    ret=OCIStmtFetch_log_stat(stmt->oci_stmt,stmt->dbc->oci_err,
                              stmt->row_array_size,OCI_FETCH_NEXT,
                              OCI_DEFAULT,ret);
I set up an odbc program that replicates the trace below, first calling SQLFetch with a query that won't retrieve any data and then calling it with a query that should. Stepping into SQLFetch and setting stmt->bookmark=0 in the debugger makes the program return the correct results. So I need to understand when/how to reset stmt->bookmark.
I elected to fix this by changing SQLSetStmtAttr.c so that when it is called to set the SQL_ATTR_ROWS_FETCHED_PTR attribute to 0, signalling the statement handle that it thinks it is starting over, it sets stmt->bookmark to 0 as well as stmt->rows_fetched_ptr. This makes my odbc test program, which I am incorporating into the test subdirectory, work, as well as the TestDbi program.
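In outline, the change to SQLSetStmtAttr.c behaves like the following sketch. The struct here is a simplified stand-in for the driver's real statement handle (which has many more fields), and setStmtAttr is my own mock of the one code path that matters; SQL_ATTR_ROWS_FETCHED_PTR is 26 in sqlext.h:

```cpp
#include <cstddef>

// Minimal stand-in for the oraodbc statement handle.
struct Stmt {
    int bookmark;            // nonzero once SQLFetch has been called
    long* rows_fetched_ptr;  // where the driver reports rows fetched
};

const int SQL_ATTR_ROWS_FETCHED_PTR = 26;  // value from sqlext.h

void setStmtAttr(Stmt* stmt, int attr, long* value) {
    if (attr == SQL_ATTR_ROWS_FETCHED_PTR) {
        stmt->rows_fetched_ptr = value;
        if (value == nullptr) {
            // The caller is signalling a fresh start: also reset the
            // bookmark so the next SQLFetch is treated as the first.
            stmt->bookmark = 0;
        }
    }
}
```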
flxi03$ pwd
/afs/fnal.gov/files/data/minos/validation/dbox_oracle_refresh/Nov23_dba64/DatabaseTables/scripts/log_dir
flxi03$ grep Error *.log | grep Loader
CALNDMAPPERFITS_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."CALNDMAPPERFITS"
DBUSUBRUNSUMMARY_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."DBUSUBRUNSUMMARY"
DBUSUBRUNSUMMARY_CTL.aaab.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."DBUSUBRUNSUMMARY"
PHOTONPMTQUANTUMEFFICIENCY_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PHOTONPMTQUANTUMEFFICIENCY"
PHOTONPMTQUANTUMEFFICIENCYVLD_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PHOTONPMTQUANTUMEFFICIENCYVLD"
PLEXRAWCHANNELREADOUTTYPE_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PLEXRAWCHANNELREADOUTTYPE"
PLEXRAWCHANNELREADOUTTYPEVLD_CTL.aaaa.log:SQL*Loader-941: Error during describe of table "MINOS_DEV"."PLEXRAWCHANNELREADOUTTYPEVLD"
OK I have fixed this and checked in the changes to DatabaseTables/scripts/MysqlToOracleFilter.pl and DatabaseTables/scripts/mungeMysqlDump.pl. Now I need to turn on replication.
Replication now turned on seems to be running error free.
Here is the code that seems to be malfunctioning in DbiCascader::ReserveNextSeqNo:
 1  string sqlPrefix("select * from ");
 2  sqlPrefix += seqnoTableName + " where TableName = \'";
 3  string sql = sqlPrefix + tableName + "\'";
 4  MSG("Dbi",Msg::kDebug) << seqnoTableName << " query: " << sql << endl;
 5  TSQLResultSet* res = stmtDb->ExecuteQuery(sql.c_str());
 6  stmtDb->PrintWarnings();
 7  UInt_t seqNo = 0;
 8  if ( res && res->Next() ) {
 9    seqNo = res->GetInt("LASTUSEDSEQNO");
10  }
11  else {
12
13    // Cannot find row use default for last used SeqNo.
14    string sql = sqlPrefix + "*\'";
15    MSG("Dbi",Msg::kDebug) << seqnoTableName << " query: " << sql << endl;
16    res = stmtDb->ExecuteQuery(sql.c_str());
17    if ( stmtDb->PrintWarnings() ) {
18      MSG("Dbi",Msg::kError) << "Unable to find default SeqNo"
19                             << " due to above error" << endl;
20      return 0;
21    }
22    res->Next();
23    seqNo = res->GetInt("LASTUSEDSEQNO");
If the GLOBALSEQNO table is set so that line 8 returns true (i.e. if select * from GLOBALSEQNO where tablename='DBIDEMODATA1' returns data), then the query works.
If line 8 returns false (i.e. no such row in GLOBALSEQNO), line 22 leads to an SQLFetch which returns SQL_NO_DATA. That is incorrect for the query select * from globalseqno where tablename='*', as this row is there.
Tried to set up an odbc++ analogue of the above code, but am doing something wrong as it throws an exception [libodbc++]: Cannot re-execute; statement has an open resultset on the second executeUpdate, which leads me to believe that DatabaseInterface does a bunch of things under the hood that I don't yet understand.
INVESTIGATE THIS TRACE
line 22 above triggers this in the odbc driver trace file:
[Oracle][4207]SQLFetch.c[0332]
Entry:[SQL_HANDLE_STMT][0x0969a160]
Status = 0
num_result_rows = 1

OCIStmtFetch(0x9713d88,0x970892c,1,2,0)=NO_DATA
OCIAttrGet(0x9713d88,OCI_HTYPE_STMT,0x969a1e8,(nil),9,0x970892c)=SUCCESS
[Oracle][4207]SQLFetch.c[0345]
Exit: [SQL_NO_DATA]
Installed Easysoft oracle odbc driver trial version as root on odets2.fnal.gov. This package requires installation as root, put everything in /usr/share/easysoft with a required soft link back to /usr/local/easysoft.
Here is the /etc/odbcinst.ini it created:
[ORACLE]
Description = Easysoft ODBC Oracle Driver
Driver = /usr/local/easysoft/oracle/libesoracle.so
Setup = /usr/local/easysoft/oracle/libesoraclesetup.so
FileUsage = 1
And here is the /etc/odbc.ini it created:
[ORACLE]
Driver = ORACLE
Database = minosdevsam
User = minos_dev
Password = CENSORED
METADATA_ID = 0
ENABLE_USER_CATALOG = 1
ENABLE_SYNONYMS = 1
RDBC as used by MINOS ignores these .ini files as best as I can tell. It generally uses URLs to store connect information to a database. The format of the URLS is like so:
servertype:odbc:[libname]://host:port/databasename
So to use libesoracle.so from Easysoft one uses
oracle:odbc:esoracle://cdfpcb.fnal.gov:1521/cdf_val
and makes sure libesoracle.so is in the LD_LIBRARY_PATH.
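For what it's worth, the URL format above pulls apart mechanically. This hypothetical parser is only to pin down which piece is which; it is not RDBC's actual code and does no error handling:

```cpp
#include <string>

// Pieces of a servertype:odbc:libname://host:port/database URL.
struct ConnectUrl {
    std::string servertype, libname, host, port, database;
};

ConnectUrl parseUrl(const std::string& url) {
    ConnectUrl u;
    std::size_t a = url.find(':');            // end of servertype
    u.servertype = url.substr(0, a);
    std::size_t b = url.find(':', a + 1);     // end of the "odbc" token
    std::size_t c = url.find("://", b + 1);   // end of libname
    u.libname = url.substr(b + 1, c - b - 1);
    std::size_t d = url.find(':', c + 3);     // end of host
    u.host = url.substr(c + 3, d - c - 3);
    std::size_t e = url.find('/', d + 1);     // end of port
    u.port = url.substr(d + 1, e - d - 1);
    u.database = url.substr(e + 1);
    return u;
}
```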
Going through this exercise verifies that the commercial driver from easysoft returns the correct GLOBALSEQNO while the open source one does not.
Next we try running some ODBC test cases using open source (OS) and EasySoft (ES) drivers side by side to see if we can catch why/where they act differently. Here are some excerpts from my odbc.ini and odbcinst.ini:
ODBC.INI
[oracle_os_cdf_val]
Driver = oracle_os
Database = cdf_val

[oracle_es_cdf_val]
Driver = oracle_es
Database = cdf_val
ODBCINST.INI
[oracle_es]
Description = Easysoft ODBC Driver
Driver = /usr/local/easysoft/oracle/libesoracle.so
Setup = /usr/local/easysoft/oracle/libesoraclesetup.so
FileUsage = 1

[oracle_os]
Description = open source oracle ODBC Driver
Driver = /afs/fnal.gov/files/home/room2/dbox/odbc/lib/liboraodbc.so
Setup = /afs/fnal.gov/files/home/room2/dbox/odbc/lib/liboraodbcS.so
This setup works with isql, the unixODBC utility; unfortunately the Easysoft driver core dumps on SQLConnect and SQLDriverConnect in my test programs.
Resumed testing TestDbi for the GLOBALSEQNO problem. I was using an increasingly old branch (Oct 19) of development, so I updated everything and - everything broke. The fix is described in this email fragment:
I returned to the GLOBALSEQNO mystery last Friday, armed with a new oracle odbc driver which may or may not fix the problem once I get it installed and working. While running TestDbi against oracle with the existing odbc driver, I was confronted with problems in the attached files, which I have fixed and will now describe:

File prepare_db_0.sql: table ALL_DATA_TYPES gets constrained to have its SEQNO exist in ALL_DATA_TYPESVLD; unfortunately it did not have a SEQNO column, so the constraint made no sense and just generated errors. I changed this, and also changed the order of ALL_DATA_TYPES and its validity table in the file, as the existing order tried to alter properties on the table before it existed.

File prepare_db_1.sql: fixed insert statements for the DBIDEMODATA4 table so they work with both mysql and oracle. Changed double quotes to single quotes for varchar/text inserts, and changed embedded single quotes in text to that rather goofy double single quote idiom.
The following TSQL errors have occured:- Error executing "INSERT INTO DBUSUBRUNSUMMARYVLD VALUES (202798600,'2004-11-17 00:52:05','2038-01-19 03:14:07',2,1,0,-1,'2004-11-17 01:52:29','2004-11-18 16:07:17');": [Easysoft][Oracle]Integrity Constraint Violation
These rows already exist in the oracle database; normally dbmjob knows this, skips them, and reports that fact, but now it complains once an hour till I can figure out how to fix it.
Manually entered the following 4 lines into minosdev and minosdevsam; note the '' quoting.
INSERT INTO DBILOGENTRY VALUES (700005475,'CALMIPCALIBRATION',700000012,700000012,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','This is the MIP number for the ND based on work done by Niki comparing cosmic MC and data. I have adjusted Niki''s findings to account for the changes Nathaniel and I made to the self consistency light level tuning of the MC. This number uses through going cosmic muons so is dependent to some extent on the muon spectrum in the ND MC being correct. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005474,'CALMIPCALIBRATION',700000011,700000011,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.''s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005473,'CALMIPCALIBRATION',700000010,700000010,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.''s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
INSERT INTO DBILOGENTRY VALUES (700005476,'CALMIPCALIBRATION',700000013,700000013,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','This is the MIP number for the ND based on work done by Niki comparing cosmic MC and data. I have adjusted Niki''s findings to account for the changes Nathaniel and I made to the self consistency light level tuning of the MC. This number uses through going cosmic muons so is dependent to some extent on the muon spectrum in the ND MC being correct. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');
re-primed the two oracle dev databases (odets2.fnal.gov/minosdev and dba64.fnal.gov/minosdevnew). Dbmjob replication running with the 11/5 code changes, the following error showed up in the logs.
The following TSQL errors have occured:-
Error executing "INSERT INTO DBILOGENTRY VALUES (700005474,'CALMIPCALIBRATION',700000011,700000011,1,'hartnell','loon','flxi02.fnal.gov','minos-db1.fnal.gov','The new MIP number to go with Phil S.\'s new strip-to-strip calibration. The strip-to-strip numbers no longer introduce an offset to the sigCor scale so in future the MIP number should not have to be updated when new s-t-s calibrations are done. This uses the same code as last time so should still be considered a first pass; it is good to 10%. Inserted to database with task=0 and 10 to allow R1.11 and development to use it.');": [Easysoft][Oracle]ORA-01756: quoted string not properly terminated
The oracle odbc driver apparently does not handle escaped single quotes in text strings. Oh joy, oh bliss.
Update: to escape a single quote in standard SQL, one quotes it; i.e. to escape ', enter ''. Observe:
mysql> insert into foo values ('Mrs O''Leary''s cow');
Query OK, 1 row affected (0.00 sec)

mysql> select * from foo;
+-------------------+
| bar               |
+-------------------+
| Mrs O'Leary's cow |
+-------------------+
1 row in set (0.00 sec)
This works for both Oracle and Mysql.
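So the portable escaping rule is simply: double each embedded quote. A small helper illustrating it (my own sketch, not code from the replication scripts):

```cpp
#include <string>

// Escape a value for embedding in a single-quoted SQL string literal
// by doubling each single quote; valid in both Oracle and MySQL.
std::string escapeSqlString(const std::string& s) {
    std::string out;
    for (char c : s) {
        out += c;
        if (c == '\'') out += '\'';  // ' becomes ''
    }
    return out;
}
```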
Fixed DbiStatement.cxx to properly replicate DBUSUBRUNSUMMARY table. From an email to Nick:
I had to make the following changes to get DbiStatement.cxx (attached) to properly replicate DbuSubRunSummary, which had been turned off previously:
1) move the drop public synonym statement ahead of the create table statement
2) fix a bug where char translated to varchar but varchar was translated to varvarchar in certain circumstances
3) change column COMMENT to RUN_COMMENT if it comes through on a replication command
4) get rid of the ALTER TABLE STORAGE(MAXEXTENTS UNLIMITED) command as it is obsoleted in Oracle 10 and just causes errors on the dba64.fnal.gov server
Now that I can replicate this table properly I will re-prime, turn DBMAuto back on, and we can do more 'missing row' checks.
1. Oracle status - questions for Dennis http://www-pnp.physics.ox.ac.uk/~west/minos/discussions/db_current_problems.html
Summary of Nick's 18 Oct and 21 Oct items:
a. We have lost a lot of rows in some tables. It might be because Dennis' importing job was filtering them out but that needs to be established. Beyond that we also have the problem that ORACLE objects to the presence of certain characters (e.g. \n) in string data, so we will need to clean them out of the MySQL DB once we have determined exactly which characters are to blame.
b. There remains a problem in TestDbi, first reported on 31 August, that it isn't able to read from GLOBALSEQNO properly. Dennis thinks that something confuses RDBC when it has multiple connections, i.e., a cascade, as it works O.K. at that level when there is only a single database connection.
Nelly has set up a new oracle server on dba64.fnal.gov, servername minosdevsam
A copy of the offline database from minos-db1.fnal.gov is being replicated to it via priming scripts and then DBMAuto/dbmjob. The plan is for minosdev database to go away and minosdevsam to become minosdev transparently at some point.