The application that I’m working on is pretty huge. I’ve touched maybe 25% of the actual code intimately in the 9 months I’ve been working with it. Over a month ago, we discovered an issue in production that had an ORA-00060 error (Oracle’s “deadlock detected while waiting for resource”) as the primary symptom. I’m the guy that likes to fix the stuff that doesn’t really tell you what’s wrong. What I want to do with this post is provide a recap of how I tackled the problem. Hopefully it will help some others that run into this issue.
- The error from the logs pointed me in the right direction of what functionality was causing the issue.
I didn’t know exactly what was causing the issue, but I knew where to start looking. I pulled out my favorite tool for testing web applications and web services, JMeter. I wrote a test that would fully stress the parts of the system that might be causing the issue. I didn’t write the code that was causing the problem. This was actually the first time I was in this part of the system. A good first pass at a JMeter test does CRUD+L (Create, Read, Update, Delete, List) for a user. Once I had a simple test that would perform some easy “smoke” tests, I was ready to start stressing the system.
- I’m now familiar with this part of the system and have a JMeter test for coverage of the functionality.
My test worked just fine at this point. Still no clues. The next step was to increase the load. I modified my test to use a different, random user on each login attempt. Then I increased the number of threads. Immediately, I started to see the errors. For most of my tests of this system, I’m able to send through 100 threads (users) without running into other bottlenecks on my local (dev) environment. With this test, at this time, only 3 threads could complete successfully. This is unacceptable. With just 4 users going through my test, I was creating “blocking sessions” according to the Oracle Enterprise Manager (OEM) reporting tool.
- Now I have a test that replicates, locally, the problems that we are seeing in production.
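For the random-user logins, a common JMeter approach (and the one I’d assume here) is a CSV Data Set Config element that feeds credentials into the login sampler. The file name, column names, and account data below are purely illustrative:

```
# users.csv -- one test account per line (hypothetical data)
username,password
loaduser001,secret001
loaduser002,secret002
loaduser003,secret003
```

With “Recycle on EOF” disabled and at least as many rows as threads, each JMeter thread logs in as a distinct user by referencing `${username}` and `${password}` in the login request.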
Our system is using JPA with Hibernate and annotations for most of our ORM. When I started to dig into the code, I realized that on “inserts” we were doing 3 different commits during the transaction. In the methods, I saw multiple “commit” and “flush” calls. Instead of letting JPA/Hibernate handle the generation of IDs for (grand)child objects, additional commits were being done with native queries in order to get IDs and follow the particular pattern that was chosen. I wasn’t happy with the pattern, but that was irrelevant to the problem at hand. I also couldn’t understand why we needed to do all of this extra work that I hadn’t done in years.
- The multiple native commits within the transaction were a red flag. Clearly, the low (3 user) load threshold was caused by the poor use of transactions. The inserts, updates, and deletes were all doing these multiple commits and flushes to finish each action.
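For contrast, here is a minimal sketch of what letting JPA/Hibernate generate the IDs looks like, so that a parent and its children can be persisted in one transaction with a single commit. The class, sequence, and column names are my own inventions, not the actual entities:

```java
import javax.persistence.*;
import java.util.ArrayList;
import java.util.List;

@Entity
public class ParentRecord {

    // Hibernate pulls IDs from an Oracle sequence itself; no native
    // "SELECT seq.NEXTVAL" queries or mid-transaction commits needed.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "parentSeq")
    @SequenceGenerator(name = "parentSeq", sequenceName = "PARENT_SEQ", allocationSize = 1)
    private Long id;

    // Cascade the persist so the children are saved with the parent,
    // inside the same transaction.
    @OneToMany(mappedBy = "parent", cascade = CascadeType.PERSIST)
    private List<ChildRecord> children = new ArrayList<>();
}

@Entity
class ChildRecord {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "childSeq")
    @SequenceGenerator(name = "childSeq", sequenceName = "CHILD_SEQ", allocationSize = 1)
    private Long id;

    @ManyToOne
    @JoinColumn(name = "PARENT_ID")
    private ParentRecord parent;
}
```

With mappings like these, an insert becomes one `persist` call inside one transaction: begin, `em.persist(parent)`, commit. Hibernate resolves every generated ID on its own.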
I really wanted to get away from this multiple-step process. My plan was to rewrite the POJOs with readable annotations. This way, the rest of the team could see what was actually happening without having to dig like I did. Quickly I realized that I couldn’t make many changes, because the pattern that was chosen was used throughout the entire system. So, I got to it. I used annotations for all the objects that were affected. It took a while to replace the hand-written code while keeping the pattern in place. Again, it was extra work, but not relevant to the problem.
- I reduced the number of required DAO objects in the system. The entire group of objects is now wired with annotations. It’s readable, cleaner, and still works with the rest of the system.
I figured everything would magically be fixed. It wasn’t. I used my JMeter tests and determined that the queries that were being run against the database looked good. Everything was in order and exactly what I expected. But my changes had only increased the number of successful users from 3 to 5. They also improved the performance slightly: with the same 3 users going through the system, my changes made this section about 20% faster. But what now?
- I pulled in the database admin. The DBA told me that the database was doing cascading deletes on some of the tables.
We went around in circles for at least two days getting to a happy place. Since the cascading deletes were in place, my “delete” method was changed from using the annotations to using a native query. Luckily, this native query was just 1 call on 1 table, and the Oracle database did the rest. I had to believe that letting Oracle do the deletes was a better option than letting Hibernate make all the calls.
So in the end, it turns out that the problem causing my ORA-00060 was that the code was trying to do the same deletes that the Oracle database was also trying to do. This was causing the blocking sessions that were affecting the rest of the system. Before you let Hibernate/JPA handle your cascade deletes, make sure that you don’t already have cascades set up at your database tier.
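A sketch of the final shape of the delete, with placeholder table and method names: the application issues one native statement and the database-tier cascade removes the child rows, while the JPA mappings deliberately do not also cascade removes.

```java
// Inside the DAO's delete method. The foreign keys on the child
// tables are declared with ON DELETE CASCADE at the database tier,
// so this single statement removes the whole object graph.
public void delete(EntityManager em, long parentId) {
    em.createNativeQuery("DELETE FROM PARENT_TABLE WHERE ID = :id")
      .setParameter("id", parentId)
      .executeUpdate();
}
```

The key point is that the `@OneToMany` mappings for these entities must not also carry `CascadeType.REMOVE`; otherwise Hibernate and Oracle both try to delete the same child rows, which is exactly the kind of blocking that surfaced as the ORA-00060 above.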
- The test that I wrote the first week, now handles the 100 users that I expected the system to handle. Problem solved.
This problem wasn’t really as big as I let it become. If I had brought in the DBA earlier, and had been aware of the cascade deletes at the database tier, I could have fixed the problem in the original DAO objects and solved it sooner. On the other hand, the code for this group of functionality is now much easier to read, work with, and understand. The next developer to work on this stuff should get to success much faster.
I’m actually happy that I had to get all the way to the database tier to fix this “database” problem. I was able to learn one of the patterns used in a large part of our system. I feel confident that the next person to look at this code will have a much more pleasant experience. I also have a nice test that provides coverage and can be used for stress testing this part of the system.