You Mean, Memory Is Not Infinite? We're working hard getting
MySQL Enterprise Monitor 2.0, featuring Query
Analyzer, ready for release. As part of that, we started
really ramping up the number of MySQL servers reporting in query
data to see how we could scale. Not surprising (to me, anyway),
the first efforts did not go so well. My old friend
OutOfMemoryError reared its ugly head once again.
Query Cache -- It's More Than Just Results! We're big (ab)users
of hibernate query caching, and more importantly to us the
natural id optimized query cache. Firing up
the profiler, I was not shocked to see that the second level
(let's call it L2) and query caches were hogging the majority of
memory. But, something didn't smell right...
What I was seeing was tons of object/entity referenes for our
persistent objects. However, the hibernate cache does not really
work that way. Hibernate 'dehydrates' query results and
persistent objects into their primitive components and
identifiers. It stores this decomposed data in the L2 and query
results cache, and on a cache hit, it rehydrates/recomposes them
into the requested persistent object.
But the profiler does not lie. One of our objects, for which
there are conceptually 60 some odd instances, had over 20,000
referenced instances on the heap. Yikes. Obviously, we're doing
something wrong, or are we... ?
Ok, Mr. Profiler, who is holding the references on the heap?
Drill down a bit through the query cache, past some HashMap
entries, into the... keys... keys, you say? Hrm, not values.
Interesting. Well, looky here. Hibernate's QueryKey is holding on
to a Object[] values array, which is holding the majority of our
persistent object references. Ouch. In addition to that, it has a
map inside of it whose values also contain non-trivial amount of
references to our entities.
Well, nuts. Code spelunking ensues. QueryKey is just as it sounds
-- an object that can act as a unique key for query results. This
means it includes stuff like the SQL query text itself as well as
any parameters (positional or named) that specifically identify a
unique result set.
Objects, Objects, Everywhere Now, silly me, since we are using an
object relational mapping, I was using objects for
the parameters in my HQL. Something along the lines of:
final Cat mate = ...;
final String hql = "from Cat as cat where cat.mate = ?"
final Query q = session.createQuery(hql);
q.setParameter(0, mate);
q.setCacheable(true);
In this case, the entire Cat mate (and everything he references)
would be held in perpetuity. Well, until either the query cache
exceeds his configured limits and it is evicted, or the table is
modified and the results become dirty.
Let's not forget our friends the Criteria queries, either.
Because it is only through criteria that we can get our friend
the natural id cache optimization. (and please pardon the
contrived-ness of the cat natural id example)
final String foo = "something";
final Cat mate = ...;
final Criteria crit;
final NaturalIdentifier natId;
crit = session.createCriteria(Cat.class);
natId = Restrictions.naturalId();
natId.set("mate", mate);
natId.set("foo", foo);
crit.add(natId);
crit.setCacheable(true);
In the same fashion as the HQL, this will result in hard
references to 'mate' and 'foo' held for the cache-life of the
query results.
How To Make a Bad Problem Worse Even worse, in our case, was the
fact that we would do the equivalent of load the same 'mate' over
and over again (maybe this cat is severely non-monogamous). And
whether loaded from L2 cache or directly from the database, the
mate Cat now existed as multiple instances, even though they are
hashCode()/equals() equivalent. But QueryKey in the query cache
doesn't know that. He only knows what he is handed. And he is
handed equivalent duplicates over and over and over again, and
only lets go of them on cache eviction. So, not only do we end up
with essentially unnecessary references to objects held onto by
the query keys in the cache, we instantiate and hold onto
multiple multiple instantiations of the same object and hold on
to those, too. Bear with me as I bow my head in shame...
Fix Attempt 1: Make a Smarter Cache I've been down this road
before. I tried to be smarter than Hibernate once before. It did
not end well. Unsullied by prior defeat, I resolved to attempt
being smarter than Hibernate once again!
Hibernate's query cache implementation is pluggable. So I'm going
to write my own. Ok, I'm not going to write my own -- from
scratch. My going theory is that I can at least eliminate the
duplication of the equivalent objects referenced in memory. I'm
going to decorate hibernate's StandardQueryCache and do the
following: For each QueryKey coming in a cache put(), introspect
the Object[] values (which are positional parameters to the
query). For each object in values[], see if an equivalent
canonical object has already been seen (same hashCode/equals()).
If so, use the canonical object. Else, initialize the canonical
store with this newly seen object.
Notice we only have to do this on put(). A get() can use whatever
objects already come in, as they are assumed to be
hashCode/equals equivalent. Hell, it HAS to work that way,
otherwise QueryKey would just be broken from the start. Here is
some snippets of relevant code that implement
org.hibernate.cache.QueryCache.
public boolean put(QueryKey key, Type[] returnTypes,
@SuppressWarnings("unchecked") List result, boolean isNaturalKeyLookup,
SessionImplementor session) throws HibernateException {
// duplicate natural key shortcut for space and time efficiency
if (isNaturalKeyLookup && result.isEmpty()) {
return false;
}
canonicalizeValuesInKey(key);
return queryCache.put(key, returnTypes, result, isNaturalKeyLookup,
session);
}
private void canonicalizeValuesInKey(QueryKey key) {
try {
final Field valuesField;
valuesField = key.getClass().getDeclaredField("values");
valuesField.setAccessible(true);
final Object[] values = (Object[]) valuesField.get(key);
canonicalizeValues(values);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private void canonicalizeValues(Object[] values) {
synchronized (canonicalObjects) {
for (int i = 0; i < values.length; i++) {
Object object = values[i];
Object co = canonicalObjects.get(object);
if (co == null) {
co = object;
canonicalObjects.put(object, co);
} else if (co != object) {
// System.out.println("using pre-existing canonical object "
// + co);
}
values[i] = co;
}
}
}
It's pretty much what i described. I didn't even attempt to get
permission to post the whole thing, because it is probably not
worth my time. The only thing missing is a HashMap of the
canonical objects, an the instantiation of the StandardQueryCache
queryCache. You'll also need to implement
org.hibernate.cache.QueryCacheFactory to create this smarter
query cache factory, and then plug that into your hibernate
config.
This did work as expected. My 'outstanding' objects on the heap
were greatly reduced. Unfortunately, it was not good enough. I
still had thousands of these guys on the heap, essentially unused
except to at some point fetch their numeric id to be used by
hibernate's generated SQL. And this didn't take care of the named
parameters, which are stored in a map of string names to some
other internal hibernate class, which I no longer felt like
introspecting via reflection. So anything using named parameters
was still potentially duplicated.
Fix Attempt 2: Objects? Who Needs Objects?Hrm, the last paragraph
stirred a thought -- hibernate only needs the id's from these
objects. While the academic in me enjoyed the exercise in
decorating the query cache to be smarter about duplicate
references, the idiot in me said "well, duh. if you only need the
id, why not just use the id?" Because we're supposed to use
objects! Oh well. It occurred to me that I could rewrite any and
all HQL to reference object id's instead of the object property
reference itself. It should end up in the same SQL eventually
sent to the database. Seems like a cheesy code monkey work
around, but the theory is that hibernate QueryKey will only be
holding onto Integer and Long references instead of entire object
entities.
So, I hunt down all of our our cacheable queries. I change them
all to object id queries. The previous Cat business can now look
like this:
final Cat mate = <...>;
final String hql = "from Cat as cat where cat.mate.id = ?"
final Query q = session.createQuery(hql);
q.setParameter(0, mate.getId());
q.setCacheable(true);
cat.mate becomes cat.mate.id. The mapped parameter becomes
mate.getId(). It could be a named parameter just as well. I
didn't find a single HQL in our application that I could not
convert this way. Good.
But what about our friend Criteria? He requires objects for the
properties, right? At first, I thought this was true. And I was
consoled by the fact that my smart query cache would do its best
to keep duplication at a minimum. In fact, it was the next day
(after resting my brain), that another "I'm an idiot" moment came
to light. The Criteria API still just takes strings for property
references. Perhaps, it follows the same property resolution
rules as HQL? In other words instead of "mate", can I say
"mate.id"? And the answer is, YES, yes I can! Woo hoo! Absolutely
no more object references for me! Here is what the criteria would
like like:
final String foo = "something";
final Cat mate = ...;
final Criteria crit;
final NaturalIdentifier natId;
crit = session.createCriteria(Cat.class);
natId = Restrictions.naturalId();
natId.set("mate.id", mate.getId());
natId.set("foo", foo);
crit.add(natId);
crit.setCacheable(true);
Subtle, yes. But trust me, it makes a huge difference.
Interesting now, that Fix Attempt 2 likely alleviates the need
for Fix Attempt 1. At the worst, we end up with lots and lots of
Integer/Long object references and even duplicates of them. The
profiler says it is not very much in my limited testing. But, I
decide to leave the smarter cache in, because it appears to be
working, and it does reduce memory.
Lessons Learned If you use hibernate query caching, and actually
want to use memory for caching useful results, and waste as
little as possible with overhead, follow some simple
advice:
- Write your HQL queries to use identifiers in any
substitutable parameters.WHERE clauses, IN lists, etc. Using full
objects results in the objects being kept on the heap for the
life of the cache entry.
- Write your Criteria restrictions to use identifiers as
well.
- Use the 'smart' query cache implementation skeleton in this
article to eliminate duplicate objects used in query keys. This
helps a little if you use the ids in your HQL/Criteria, but if
you still must use objects then it helps a lot.
- If you use a smart query cache implementation, prefer '?'
positional parameters over named parameters. I know, I know. I
love named parameters too. But the smart query cache
implementation in this article only detects duplicates for
positional parameters (or params in criteria). Alternatively,
feel free to extend the example to locate duplicates in the named
parameter map and replace them with canonical ones as well.
I'm Not the Only One Who Noticed...
In doing some final research for this post, I came across
hibernate core issue HHH-3383. Let's keep our eye on it to see if
the wonderful hibernate devs can fix this down at their level, so
we don't have to change our code. Also, the issue lists that
Criteria cannot be fixed with the same 'use id' workaround. Since
I was able to, I wonder if the bug submitter did not realize you
can reference dotted property paths in criteria restrictions
exactly as you can in HQL. Perhaps I shall start a dialog with
him.