DBPedias

Your Database Knowledge Community

Don Seiler

  1. A NoCOUG to Remember

    This post is long overdue, as I was supposed to blog about my appearance at NoCOUG before I left (sorry, Vanessa!). However in my efforts to rehearse and adjust my presentation, blogging about it just fell to the wayside. However now that NoCOUG 2011 Summer Conference is in the books, I'd like to take a few minutes to share my experience not only as an attendee, but also as a first-time speaker.

    When I found out that NoCOUG had accepted my abstract, "Oracle 11g: Learning to Love the ADR", I was both ecstatic and terrified. This meant that I actually had to prepare the presentation and speak in front of peers. Surely they would throw me into San Francisco Bay if I didn't bring my A-game, so I set out to do just that.


    Step 1: Learning to Love the ADR

    The fact is that I chose to speak on the ADR because I knew very little about it myself. This would not only get me into the glorious world of Oracle conference speakers, complete with a lifestyle rivalled only by Caligula, but it would also ensure that I knew as much as I could about the ADR. Until this project, I had only known where to find the human-readable alert log so I could continue on with my Oracle 10g style of monitoring! I had heard tales of creating "packages" for Oracle support, but surely this was a complex, Herculean feat that would take many months of meditation to even begin to understand.

    So I scoured Oracle Support notes and the public documentation to get notes on any aspect I could about the ADR. Once I felt I had enough source material, it was time to build an outline for my slides to follow. After that, it was time to eat my own dog food, so I got busy poking around the ADR and playing with adrci on my Oracle 11.2 sandbox, while concurrently applying some new-found knowledge on client installations. The latter helped me find a bug involving listener alert log purging!

    Step 2: The Slides

    The next step was actually creating a deck of slides. Using the Pythian template and Libre Office 3.4.2, I slowly figured out how this done in the most simple format possible. This means it was all text, baby. No cool stock photos or cute puppy clip-art. That's something I'll work on for next time and/or another topic.

    Once I had a first draft of my slides and did a self-rehearsal, I enlisted the help of my Pythian colleagues in our Sydney office to be my test audience to not only judge my presenting pace & speech, but also criticize the content of my presentation. And they did the latter wonderfully, each asking great questions that I hadn't previously considered and making some great suggestions on where to provide a little more info and offering anecdotes from their own ADR experiences. I can definitely say that they made my presentation twice as good as it was before! After implementing those changes, I did another dress-rehearsal for some more colleagues around the world, got a few more notes, and I felt like it was ready.

    Step 3: The Conference

    When I got to the conference in San Ramon, I especially keen to meet Jonathan Lewis for the first time. Jonathan is one of the people in the Oracle world that I respect the most, and it was a joy not only to be in his presence but to also later share a lunch table with him. I was also delighted to finally meet my colleague Chen Shapira, one of the most brilliant DBAs we have at Pythian, which is saying quite a lot!

    Jonathan was, naturally, the keynote speaker. He gave fantastic talk on two-table joins (after all, all joins eventually become two-table joins). You can see a video of a similar two-table join optimization presentation that he gave to the Turkish OUG earlier this year. Fantastic improvements are possible when you break things down to their simplest forms. The general idea reminded me also of this great video by Stephane Faroult, but Jonathan does go into much more terrific detail.

    After the keynote, I stayed for Jonathan's second session, which was titled "A Beginner's Guide to Becoming an Expert", with emphasis on becoming, rather than being. This was another great session that emphasized on having easily reproducable tests. Jonathan noted that the script he was using could build a set of tables with over 6.5 million rows on his laptop in under a minute, showing that it doesn't take a lot of effort to have such scripts written and readily available whenever you need to test something. His scripts directory contained over 2600 SQL scripts for all manner of scenarios. One needs to be in the mindset to test repeatedly, making slight changes between tests and understanding the difference in results.

    He included a demonstration of how Oracle actually fails the ACID test, which he blogged about afterward, definitely give it a read.

    After an outdoor lunch, I decided to pass on the next session to go over my slides again and ensure that everything was working on my laptop as expected. One other session I was really interested seeing was Kyle Hailey's NFS Tuning session, but unfortunately it was at the same time as my own session.

    Step 4: Spreading the Love

    When the time came for my session, I felt very prepared. I did catch myself rushing through words once in a while, but otherwise felt my pace was good. There were a few questions from the audience that I couldn't immediately answer, but was able to find the answer for during the post-session ad-hoc demo that I went through for those that stuck around and had extra questions. Having a live instance to work on was incredibly valuable to the presentation and helped not only to reinforce the information that I was sharing, but also to test scenarios that audience members came up with that I hadn't previously illustrated.

    There was some great Q & A at the end of the session, as well as after the session. As I said earlier, I stayed around answering questions and going through new scenarios for some time afterward. Even after packing up and moving out of the room I was speaking with people who had more questions about the ADR, even some that hadn't attended my session! I spent so much time talking, in fact, that I was late for the last sessions and so decided to just catch up on email rather than burst in late and distrub the presenters. As you can see, I ended up only attending Jonathan Lewis' two sessions in addition to my own. Not a bad day altogether, except for losing my jacket (black Outdoor Research spring jacket, if anyone finds one!).

    Unfortunately I was taking the red-eye back to the midwest that night so I didn't get to do any sightseeing aside from what I got to see from the BART train. But I did enjoy this change from home:

    I could get used to the San Francisco consistency. 
    It's like a year-round Wisconsin springtime. For those interested, I did provide some twitter updates between sessions. I didn't want to live-tweet because it would have been harder for my simple mind to follow along.

    You can download the slides from my presentation, as well as a few others, from the NoCOUG website. I do plan to also record a webinar of this presentation, after I incorporate a few more changes, later this year.

    In Conclusion ...

    Overall, I think it was a very worthwhile experience. I accomplished my goal of learning about the ADR well enough to speak confidently and competently on it. I also accomplished my year's goal of submitting to 3 conferences and my career goal of becoming an Oracle speaker. I definitely look forward to speaking more on this and other topics at future conferences as well as presenting internally to the rest of my Pythian colleagues.


  2. Upgrading Standalone ASM to Oracle Grid Infrastructure 11.2.0.2? Beware Bug 10283819!

    No, this isn't a re-post of my earlier blog about bug 1233183.1. We've found a fun new bug that seems to be specific to our poor standalone ASM instances when upgrading from Oracle Grid Infrastructure 11.2.0.1 to 11.2.0.2.

    The bug was first brought to my attention about four days after completing the Grid Infrastructure upgrade. The client system administrator (SA) noticed that the disk holding the Oracle home directories was slowly filling, at the rate of about 1Gb per day. We identifed that core dump files being created under the new GRID_HOME/log/diskmon/ directory, at the rate of about 1 every 10 minutes, each one about 8M in size. That adds up to 1152M (or just over 1Gb) per 24-hour day. Add that to the 8Gb that was being held in GRID_HOME/.patch_storage (we had to rollback the 11.2.0.1 April 2010 PSU and apply the 11.2.0.1 July 2010 PSU just to upgrade to 11.2.0.2), and that put a bit of a squeeze on the free disk.

    The good ol' OTN forums led me to bug 10283819. The original poster there shared also that removing the old (11.2.0.1) grid home directory and restarting diskmon services stopped the core dump creation. The poster then went to question a second issue with increased diskmon.log writing. After a solution was found for that, Oracle Support closed the bug for some reason, without ever addressing the core dump creation.

    I can verify that removing the old 11.2.0.1 grid home (I did a tar+bz2 first) and restarting the services did stop the core dump creation, and am pushing back to Oracle support to get the bug re-opened or a new bug filed to specifically address this. In the meantime, if you are unable or unsure about removing the old grid infrastructure home, it should be safe to have a regularly scheduled script remove the diskmon core dump directories and save you a full disk surprise late some night.


  3. Using a Custom Timezone? Beware Oracle 11.2.0.2 Grid Infrastructure!

    We have a client that runs an application that, for whatever reasons, does NOT like daylight saving time. For that reason, the Oracle server is kept in Eastern Standard Time and does not change with the rest of the eastern United States when DST begins and ends every year. They accomplish this with a custom /etc/localtime file. However, they left /etc/sysconfig/clock set to "TZ=America/New_York," which would prove fateful as I shall point out. So, with the custom localtime file, the "date" command as well as selecting sysdate or systimestamp would always return the current time in Eastern Standard Time. When it is Daylight Saving Time, as it is right now, this would be one hour behind "real" time as we consider it.



    Now, we recently upgraded this client from Oracle Grid Infrastructure (for single-instance ASM) from 11.2.0.1 to 11.2.0.2. The next business day, the client alerted us that their date fields were coming back in Eastern Daylight Time. While this time was still technically right, they needed the time in the EST timezone.

    I first set about trying to duplicate the problem. I was able to see the same incorrect results when I connected to the database remotely (e.g. via sqlplus or Oracle SQL Developer over TNS), but not locally (i.e. "sqlplus / as sysdba"). Then I duplicated the problem when connecting locally via TNS, meaning I was going through the listener. So we had narrowed it down to only connections going through the listener. I hadn't considered the listener to be aware of timezones, so this was rather mind-boggling for me.

    Here is an example of the incorrect results we saw:

    SQL> select systimestamp from dual;

    SYSTIMESTAMP
    ---------------------------------------------------------------------------
    14-JUN-11 07.38.31.711902 PM -04:00

    SQL> !date
    Tue Jun 14 18:38:33 EST 2011

    The system date always returned the value we wanted, but now SYSTIMESTAMP (and SYSDATE, and any other date values stored in the database) were coming back in EDT.

    My Pythian colleague Marc Fielding found My Oracle Support document ID 1209444.1, which had these two eye-opening lines:
    1. For 11.2.0.1, shell environment variable TZ is set correctly for grid user and root.
    2. For 11.2.0.2 and above, TZ entry in $GRID_HOME/crs/install/s_crsconfig__env.txt sets to correct time zone.

    Looking in the s_crsconfig_*_env.txt file on these hosts, I saw this:

    TZ=America/New_York

    The workaround was to change the TZ value in this file to "EST5" and perform a quick restart the HAS daemon:

    # /etc/init.d/ohasd stop
    # /etc/init.d/ohasd start

    After doing this, I liked what I was seeing:

    SQL> select systimestamp from dual;

    SYSTIMESTAMP
    ---------------------------------------------------------------------------
    14-JUN-11 06.42.16.384596 PM -05:00

    SQL> !date
    Tue Jun 14 18:42:19 EST 2011


    Marc and I believe that the Grid Infrastructure installer grabs the value in /etc/sysconfig/clock when setting up the env file in question. We've asked the client to ensure that /etc/sysconfig/clock is always properly set in the future.

    We imagine that most places don't try to fight Daylight Saving Time this way, but the bug also applies if you are doing any kind of timezone slight-of-hand, like telling your database it is in US Central time when the server might be in US Pacific time. So if your organization is doing this, be sure to double check the crsconfig file after 11.2.0.2 installation!


  4. Upgrading Standalone ASM to Oracle Grid Infrastructure 11.2.0.2? Beware Bug 1233183.1!

    NOTE: This post originally appeared on the Pythian blog.

    The past four days have found me very frustrated and at wits' end while testing upgrades of standalone Oracle Grid Infrastructure (ASM) 11.2.0.1 to 11.2.0.2 on RHEL/OEL 5 VMs. The upgrade would seem to go fine, but after rebooting, I would see ASM and LISTENER running under the old (11.2.0.1) grid home directories again.

    Looking at /etc/oratab, I saw this:

    $ grep -i asm /etc/oratab
    +ASM:/u01/app/grid/product/11.2.0/grid_1:N # line added by Agent


    grid_1 is the old grid home, I expect to see grid_2. The comment about being added by Agent led me to a path where I eventually took a look at /etc/init.d/ohasd, which is basically the master script that starts everything up. I noticed that this file hadn't been updated as part of the patching, and contained this:

    $ grep -i crs_home /etc/init.d/ohasd
    ORA_CRS_HOME=/u01/app/grid/product/11.2.0/grid_1
    export ORA_CRS_HOME


    I then ran some web searches for "oracle upgrade 11.2.0.1 ohasd" and found a blog post that had the same problem. Searching My Oracle Support then turned up DocID 1233183.1, titled "Standalone GI: init.ohasd/ohasd not updated after 11201 to 11202 upgrade".

    The bug is basically what it says, those files are not being updated during upgrades of standalone grid infrastructure. This is due to a logic bug in roothas.pl. I suggest reading the document for details.

    The workaround is to manually copy those two files after the upgrade finishes. First backup the old files:

    $ mkdir ~/ohasd_init_backup
    $ cd ~/ohasd_init_backup
    $ ls /etc/init.d/*ohasd*
    /etc/init.d/init.ohasd /etc/init.d/ohasd
    $ cp /etc/init.d/*ohasd* .
    $ ls
    init.ohasd ohasd


    Then copy the 11.2.0.2 files into place (as root):

    # cd /etc/init.d/
    # cp /u01/app/grid/product/11.2.0/grid_2/crs/init/init.ohasd .
    cp: overwrite `./init.ohasd'? y
    # cp /u01/app/grid/product/11.2.0/grid_2/crs/init/ohasd .
    cp: overwrite `./ohasd'? y

    Obviously, substitute "/u01/app/grid/product/11.2.0/grid_2" with the path your 11.2.0.2 installation directory. NOTE: SLES users need to copy ohasd.sles, rather than ohasd. See the MOS document for details

    Now, a quick check to make sure the proper home is used:

    $ grep -i crs_home /etc/init.d/ohasd
    ORA_CRS_HOME=/u01/app/grid/product/11.2.0/grid_2
    export ORA_CRS_HOME

    Now ensure that the two new scripts have the same ownership and permissions as the old ones. Then reboot to ensure that everything takes effect. After the server comes back up, ensure that all services and oratab are still pointing to the new grid home. Be sure to check "srvctl config" for the asm and listener services, and check the paths in /etc/oratab.

    This is the second bug with the 11.2.0.2 upgrade process that I've encountered. The first one requires patching the 11.2.0.1 Grid Infrastructure with the July 2010 PSU just to be able to upgrade to 11.2.0.2. Let's hope for some stronger QA in the future.


  5. I Do Not Think It Means What You Think It Means

    Arup Nanda posted a great blog article about one of the most fundamental, yet misunderstood, behaviors of the Oracle database: the relationship of the COMMIT statement and buffer caches being flushed to disk. I plan to re-read this every day for month. This clip seems apropos:


    The article is titled "100 Things You Probably Didn't Know About Oracle Database," and I'm definitely looking forward to the other 99 articles.


  6. A Sub-Prime I/O Primer

    My fans on oracle-l already know that I've had a bit of a battle with Oracle I/O recently, most of the damage being self-inflicted. I'd like to give as forensic a review as possible so that those poor souls who come after me will have some hope and inspiration to carry on.

    First of all, some definitions, as they relate to Oracle:

    Direct I/O: This is I/O done without the use of the OS (or filesystem) buffer cache. Oracle already has its own buffer cache, so the filesystem cache is (usually) makes I/O take much longer than just bypassing it and relying solely on Oracle to handle the data buffer cache, which it should do much better (for it's own needs) than the filesystem.

    Asynchronous I/O: This type of I/O means that I/O commands are sent but other processing can continue before the I/O has finished.

    NOTE: These two kinds of I/O, direct and asynchronous, are completely independent of each other. For some reason that I can't explain, I had it stuck in my Coke-addled mind that direct was the opposite of asynch I/O. As many times as I read over guides and glossaries, it wouldn't shake loose.

    And now, for our feature presentation.


    Act I
    In which our hero becomes bewildered with the possibilities
    I had been idling in #oracle when hali mentioned something to someone about direct I/O. Having not dealt with I/O very much in my own travels, I inquired more. It was then that I learned of filesystemio_options and the benefits of setting them to "directio."
    Needless to say, I was sold.
    I set about with my trusty system/storage administator on mounting our vxfs partitions with the proper parameters (mincache=direct,convosync=direct) and setting filesystemio_options=directio, which I'm led to believe is redundant. Once the filesystem is mounted with the directio options, Oracle (and anything else) will always use direct I/O.

    Act II
    In which direct meets asynchronous
    Fast forward three or four weeks and I'm grasping at straws trying to figure out a problem that causes the production instance to hang. (I'm not going to address these hangs in this post, as I'm not really sure that they have to do with direct I/O.) There seemed to be a perfect storm of changes in the weeks prior to these problems happening

    1. Migrated to a 64-bit server (via datapump exp/imp)
    2. Much larger SGA (from 1.5 Gb to 16 Gb)
    3. Dramatically different datafile layout
    4. Using direct I/O
    And that's just to name the ones I can think of at this moment. I decided (after weeks of tinkering elsewhere) to look at direct I/O. It was during this second look that I realized that one probably should also have asynchronous I/O enabled.

    Act III
    Paradise Lost
    And so I set filesystemio_options=setall. There were no problems upon instance restart. However this showed up in that evenings RMAN backup:
    Starting Control File and SPFILE Autobackup at 29-NOV-07
    RMAN-00571: ===========================================================
    RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
    RMAN-00571: ===========================================================
    
    
    RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 11/29/2007 02:00:46 ORA-19502: write error on file "/rman/c-3171457975-20071129-00", blockno 321 (blocksize=16384) ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 14: Bad address Additional information: -1 Additional information: 1048576
    I was seeing similar ORA errors about async I/O in my alert log intermittently when redo logs were being archived. Obviously this wasn't good. Google turned up one reference, which proved to be gold. The VxFS parameter discovered_direct_iosz was set to a default of 256k. We bumped this up to 1024k and the problems disappeared, AS IF BY MAGIC! Greg Rahn also suggested setting max_direct_iosz to 1024k, as well as setting vxio:vol_maxio=2048 to allow 1MB max I/Os.

    Greg was also under the impression that VxFS without ODM would not perform asynchronous I/O. He suggested I run an "strace -c" on the LGWR pid, saying that "if you see io_submit and io_getevents in the syscall column, it is using async io on Linux. If you see pwrite64 it is not," and I was happy to see that we were indeed.
    oracle:~/sr $ strace -c -p 12605
    Process 12605 attached - interrupt to quit
    Process 12605 detached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    80.66    1.142811         298      3831           io_getevents
    7.26    0.102820          51      2004           io_submit 
    Act IV
    Jumping to Conclusions
    As I mentioned before, I'm not convinced at all that the hangings had anything to do with the I/O issues. However I must note that since we enabled asynchronous I/O on top of the direct I/O, we haven't had any instance hanging. Where we used to get them once every other day, it has been three weeks since the last incident.

    To their end, Oracle support had analyzed our system state dumps and noted that our session cached cursors was at 100% utilization, and a lot of the wait events during the hanging had to do with cursors (cursor: pin S wait on X). Per their recommendation, we have raised the session_cached_cursors parameter, but haven't had a window yet in which to bounce the instance so the parameter can take effect.

    If there ever is a point where we figure out exactly what was causing the problem and what the solution was, I'll be sure to write about it in this space. For now I just wanted to show you what an impulsive, ignorant fool I am.
    ~ Fin ~


  7. v$sql_bind_capture Not Quite Capturing SQL Binds

    A week or so ago, I was working with a developer to find out why a SQL statement he was sending would sometimes work and sometimes return an invalid DATE format error. Part of the troubleshooting led us to examine the actual bind values being received by the server, just to make sure it was the same as being sent. I thought I was onto something when I saw that the fields of TIMESTAMP datatype actually were NULL:

    SQL> select name, datatype_string, value_string
    from v$sql_bind_capture
    where sql_id='0wp5c2a3z82jr'
    SQL> /


    NAME DATATYPE_STRING VALUE_STRING
    ---- --------------- ------------
    :1   TIMESTAMP       {NULL}
    :2   VARCHAR2(32)    3WC
    :3   VARCHAR2(32)    001
    :4   TIMESTAMP       {NULL}
    :5   NUMBER          8429721
    :6   NUMBER          206


    6 rows selected.


    Unfortunately it was like this when the query both failed and succeeded. I banged my head against the wall for 30 more minutes before concluding that it must be a bug. Metalink agreed, and Note 444551.1 details the bug, which is not yet fixed and "should be fixed in 11.2," affecting versions 9.2 to 11.1. Seems kind of long for something like this to still be out in the open, and it's very annoying when you really want to see the value of a TIMESTAMP bind variable.


  8. Moto RAZR + Bluetooth + Linux + Python = ObexCopier

    While in Schaumburg, Illinois last week for the Oracle DBA Workshop II, I was taking some photos on my Motorola RAZR camera phone, in the hopes of posting them for my 5-year-old daughter to see. I needed to clear up some misconceptions, since she was under the impression that I "sleep at the school." I soon learned, however, that transferring photos from the RAZR one-at-a-time over bluetooth to my Fedora 8 laptop became tedious, and waiting for a response from the GUI file browser was just frustrating. Enter: ObexFTP.

    Thanks to a tip from a friend, I found ObexFTP and, in my quest to force myself to learn python, set about crafting a script to do the following:

    • Transfer files based on a date (default to today).
    • Transfer all files in the directory.
    So I'm presenting to you my first stab at it. Some of the hard-codings depend on how the RAZR stores photos in the micro-SD card. If anyone wants to submit enhancements or critiques, I'm all ears. Right now it just works for what I needed it to do.

    #!/usr/bin/env python


    # Don Seiler, don@seiler.us


    import obexftp, ConfigParser, os
    from xml.etree.ElementTree import XML
    from optparse import OptionParser
    from datetime import date


    # This script is dependent on the Moto Razr convention of naming
    # pictures in an MM-DD-YYYY_XXXX.jpg format


    # Users need to create ~/.obexcopier.ini with these variables defined
    # [ObexCopier]
    # device = 1A:2B:3C:4D:5E:6F
    # channel = 6
    # source_dir = /MMC(Removable)/motorola/shared/picture
    # dest_dir = /media/pictures


    # Read config from ~/.obexcopier.ini
    config = ConfigParser.ConfigParser()
    config.read(os.path.expanduser('~/.obexcopier.ini'))


    # Probably a waste of precious memory to store these again
    device = config.get('ObexCopier','device')
    channel = config.getint('ObexCopier','channel')
    source_dir = config.get('ObexCopier','source_dir')
    dest_dir = config.get('ObexCopier','dest_dir')


    # Get today for default date
    today = date.today().strftime("%m-%d-%y")


    # Command-line handling to allow for date
    parser = OptionParser()
    parser.add_option("-d", "--date", dest="date", default=today, help="Grab pictures from this date, defaults to today [default: %default]",metavar="MM-DD-YY")
    parser.add_option("-a", "--all", action="store_true", dest="all", default=False, help="Copy all files, regardless of date [default: %default]")
    (options, args) = parser.parse_args()


    # Connect to the client
    print "Connecting to %s on channel %d" % (device, channel)
    cli = obexftp.client(obexftp.BLUETOOTH)
    cli.connect(device, channel)


    # Get the list of files from the SD card picture dir
    if options.all:
    print "Copying all files to disk"
    else:
    print "Copying files from %s" % options.date


    files_xml = cli.list(source_dir)
    folder_listing = XML(files_xml)
    files = folder_listing.findall('./file/')
    for file in files:
    # Only handle pictures taken on the specified date
    if options.all or file.get('name').startswith(options.date):
    print "Copying %s" % file.get('name')
    data = cli.get(source_dir + '/' + file.get('name'))
    localfile = open(dest_dir + '/' + file.get('name'), 'wb')
    localfile.write(data)
    localfile.close()


    # Disconnect and delete the client
    cli.disconnect()
    cli.delete


  1. 1
  2. Next ›
  3. Last »