02.10.2014

Serious Performance Problem Resolved

I had the opportunity to assist on a project to migrate the mainframe-based Mincom Information Management System (MIMS) to a set of HP UNIX servers at Cyprus-Amax Minerals, a Fortune 200 mining company based in Englewood, CO, between 1994 and 1997.  The project was expected to save Cyprus-Amax millions of dollars.  Electronic Data Systems (EDS) had primary responsibility for the project, but this later shifted to Andersen Consulting because the project was not meeting expectations under EDS.  The project called for a gradual migration, moving one or two mine sites at a time over to the UNIX platform.  A serious performance issue surfaced after only a few sites had been transitioned.  I resolved the performance issue after tracing the root cause to a bug in the HP-UX operating system.  Some of the credit goes to a man who introduced me to a Network General Sniffer/Analyzer.  The remaining credit goes to long, tedious hours of packet-level network captures and analysis.

Follow up:

Soon after Andersen Consulting took over from EDS (circa 1996), they assembled a team that traveled to HP's Performance Center in Cupertino, CA.  HP put together a set of four servers to handle the load then carried by an IBM mainframe:  1 server for the Oracle database, 1 batch server, and 2 servers for OLTP.  Load testing software was used to simulate the load and stress of all mine sites to be migrated.  The load and stress tests on the four servers were successful, with extra capacity to spare.

The performance issue was first noticed after the first mine site was transitioned in September 1996.  It became even more noticeable after a second site was transitioned.  HP brought in extra resources to check things over but found nothing wrong with their equipment.  Occasionally, though not often, the Oracle database client would report an ORA-03113 End-of-File on Communication Channel error.  This error generally indicates an underlying problem with the network.

A network expert was brought in from New Orleans, LA to investigate the issue.  He concluded that some type of problem was occurring at the network layer but was unable to determine a root cause.  A second network expert was brought in from California.  He reached the same conclusion as the first expert and was also unable to determine a root cause.

In early 1997, a third expert was brought in from Dallas, TX to investigate.  By this time, I had been assigned to help track down the elusive performance problem.  The expert from Dallas introduced me to a Network General Sniffer/Analyzer and the Open Systems Interconnection (OSI) Model.

After spending many hours studying the batch processes at Cyprus-Amax and sifting through logs, I discovered a pattern and was eventually able to reproduce ORA-03113 errors on demand.  This allowed us to narrow our focus within the captured Oracle network traffic and home in on what was transpiring.  We discovered that the Oracle SQLNet client would send one or more packets containing a single SELECT statement with parameters to the Oracle database server.  The server would process the statement and return the information to the SQLNet client in one or more network packets.  When conditions were just right, the client machine would fail to return a TCP acknowledgement (ACK) packet to the server.  After 2 seconds with no ACK, the database server would re-transmit the data to the client machine.  After another 4 seconds with no ACK, it would re-transmit again.  The time between re-transmissions would keep doubling until a maximum of 64 seconds was reached.  If the database server still received no ACK, an ORA-03113 error was logged and the batch process ended in error.  Most of the time, however, the client machine would eventually produce an ACK and send it to the database server.  When this occurred, the batch process would complete successfully but take much longer than expected.  Low-level network packet analysis confirmed that transport retransmission timeouts occurred far more often than ORA-03113 errors.
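To put rough numbers on the stall described above, here is a minimal Python sketch of the doubling retransmission timer.  The 2-second start and 64-second cap come from the captures; the function name and the number of retries attempted at the cap (`retries_at_ceiling`) are assumptions made for illustration, since I did not record how many attempts the server made before giving up.

```python
def retransmit_schedule(initial=2, ceiling=64, retries_at_ceiling=1):
    """Return the wait, in seconds, before each retransmission.

    The wait doubles from `initial` until it reaches `ceiling`.
    `retries_at_ceiling` is a hypothetical knob: the number of
    attempts made once the wait has reached the cap.
    """
    waits = []
    wait = initial
    while wait < ceiling:
        waits.append(wait)
        wait *= 2
    waits.extend([ceiling] * retries_at_ceiling)
    return waits


schedule = retransmit_schedule()
print(schedule)       # [2, 4, 8, 16, 32, 64]
print(sum(schedule))  # 126 seconds of waiting on one unacknowledged reply
```

Under these assumptions, a single unacknowledged reply could stall a batch job for about two minutes, which is consistent with jobs that finished successfully but ran far longer than expected.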

Why did things go so well at the HP Performance Center and so badly in Production?  In the Production environment at Cyprus, a switch was connected to the Fiber Distributed Data Interface (FDDI) ring, and the other side of the switch was connected to Ethernet.  When an Ethernet packet came through the switch destined for the database server via the FDDI ring, the Maximum Transmission Unit (MTU) used on the FDDI ring would fall from 4,352 bytes to 1,500 bytes.  This in and of itself is not a problem, because packet fragmentation is part of the design when traffic crosses from FDDI to the smaller-MTU Ethernet.  I was able to capture the Internet Control Message Protocol (ICMP) Type 3, Code 4 (Destination Unreachable, Fragmentation Needed) packet and watch the MTU drop to 1,500.  After the MTU dropped, I was able to initiate certain batch jobs that caused the transport retransmissions and, in some cases, ORA-03113 errors.  It was confirmed that the HP Performance Center had no switch connecting an Ethernet network to the FDDI ring.  This explained why the problem did not occur during load and stress testing.
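For a sense of what the MTU drop means on the wire, the following Python sketch computes how many TCP segments a reply needs at each MTU.  It assumes plain 20-byte IP and TCP headers with no options, and the 8,192-byte reply size is a made-up figure for illustration, not something taken from the captures.

```python
import math

IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options


def mss_for(mtu):
    """Largest TCP payload that fits in one packet at the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def segments_needed(payload_bytes, mtu):
    """Number of TCP segments needed to carry payload_bytes."""
    return math.ceil(payload_bytes / mss_for(mtu))


reply = 8192  # hypothetical size of one query result, in bytes
print(mss_for(4352), segments_needed(reply, 4352))  # 4312 2
print(mss_for(1500), segments_needed(reply, 1500))  # 1460 6
```

The same reply goes from two packets to six once the MTU falls to 1,500, so there are more packets in flight that each need a valid ACK from the client.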

Cyprus had a second FDDI ring, intended for backups, that connected the four servers.  Unlike the primary FDDI ring, this ring did not have an Ethernet switch attached.  Working with the Oracle DBA, I decided to change the SQLNet configuration and listener to send Oracle traffic over the second FDDI ring instead of the primary ring.  This was accomplished without taking an outage.  The results were apparent almost immediately.  No more performance issues and no more ORA-03113 errors.  The remaining mine sites were transitioned over to the HP-UX platform with success!

Hewlett-Packard asked us to provide several tcpdump captures taken on the primary FDDI ring, along with our findings.  They eventually discovered a bug in the TCP/IP stack of their OS and later issued a patch.  According to HP, the bug caused invalid checksums, which prevented the ACKs from being sent to the Oracle server.
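HP did not share the details of the bug with us, so the following is only a generic Python sketch of why a bad checksum leads to a missing ACK.  It implements the standard 16-bit ones'-complement Internet checksum (RFC 1071); the function names and sample payload are illustrative, and a real TCP checksum also covers a pseudo-header that is left out here for brevity.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def is_valid(segment: bytes) -> bool:
    """A segment that carries a correct checksum sums to zero."""
    return internet_checksum(segment) == 0


payload = b"SELECT 1"  # even length, so the checksum stays word-aligned
good = payload + internet_checksum(payload).to_bytes(2, "big")
bad = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit in transit

print(is_valid(good))  # True:  receiver accepts the data and ACKs it
print(is_valid(bad))   # False: receiver discards it, so no ACK is sent
```

A receiver silently drops a segment that fails this check, so from the sender's side it looks exactly like a lost packet: no ACK arrives and the retransmission timer takes over.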