Jan 31, 2010

Kill your application with Hibernate subselect!

If you want to kill your application silently, the best way is to use the Hibernate 'subselect' fetching strategy. Let's discuss the issue in detail. Imagine that you have entities 'Post' and 'Comment' in the domain model, with corresponding tables in the database:

public class Post {
    private Long id;
    private String title;
    private String body;
    private Set<Comment> comments;
    ...
}

public class Comment {
    private Long id;
    private String text;
    ...
}
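
For example, with annotations such a mapping might look roughly like this (just a sketch; the field and column names are only for illustration):

    // a possible mapping for the comments collection; @Fetch and FetchMode
    // come from org.hibernate.annotations
    @OneToMany(fetch = FetchType.LAZY)
    @JoinColumn(name = "post_id")
    @Fetch(FetchMode.SUBSELECT)
    private Set<Comment> comments;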


Of course you use a 'one to many' relationship to map the relation between posts and comments (using the 'post_id' column of the comments table), roughly as sketched above. It doesn't matter what mapping type you prefer (annotation or XML based). Everything seems good at this moment, and you decide to choose the 'subselect' fetching strategy for the comments collection. You also create a number of unit tests to make sure that your mapping is correct and works fine. Now you want to create a method to find the last 10 posts for your home page. You decide to use the Criteria API for this purpose:

List posts = session.createCriteria(Post.class)
     .addOrder(Order.desc("id"))
     .setMaxResults(10)
     .list();


Again you create some unit tests to make sure that the method works as expected. But of course you don't review the SQL queries in the Hibernate logs to check that everything is as you expect. Hibernate generates the following SQL queries for you (example from a PostgreSQL database):

select post0_.id as id0_, 
post0_.title as title0_, 
post0_.body as body0_ 
from Post post0_ 
order by
post0_.id desc
limit 10;

select
comment0_.post_id as post5_1_,
comment0_.id as id1_0_,
comment0_.text as text1_0_
from Comment comment0_ 
where comment0_.post_id in (
   select post0_.id 
   from Post post0_
); 


Pay attention to the second query, especially to its subquery part: the limit is not applied inside the subquery! What is going on here? It seems that Hibernate loads the full comments table into memory and only then selects the comments related to the top posts fetched before. Crazy behavior! Initially, while the database is empty or contains few comments, your application will work well. But every time somebody opens your home page with the top posts, the query loads all comments into memory, and the performance penalty is proportional to the number of comments in the database.

At some point (maybe after a few months of usage in production) you will find that all memory allocated to your application is filled and the garbage collector eats 100% of CPU time. Thank God for profilers, especially VisualVM. It is hard to believe that such a small issue in Hibernate may cause such dramatic effects. There is an open issue in the Hibernate bug tracker, but it has minor priority, so we have to live with it and apply another approach. The best way to avoid the issue is to use the 'batch select' fetching strategy with lazy loading (or without it, depending on application needs), for example as sketched below.
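
For example, the comments mapping could be switched to batch fetching roughly like this (a sketch only; the exact annotations depend on your mapping style):

    // lazy collection with batch fetching instead of 'subselect';
    // @BatchSize comes from org.hibernate.annotations
    @OneToMany(fetch = FetchType.LAZY)
    @JoinColumn(name = "post_id")
    @BatchSize(size = 10)
    private Set<Comment> comments;

Be careful and develop with pleasure!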

Oct 19, 2009

Defining batch size for batch fetching in Hibernate

One of the ways to tune Hibernate performance when you work with parent/children relationships is to use lazy loading collections with batch fetching. This approach lets you perform far fewer than n+1 SQL queries to initialize your entities. To enable this mode you just need to specify the batch-size attribute in the XML mapping for the collection, or mark it with the @BatchSize annotation in Java code. But how do you define an appropriate batch size? To do it well, you need to understand how it works internally.

From examples you may see that if there are 25 objects in the database and the batch size is set to 10, then Hibernate will perform 3 SQL queries: 10, 10 and 5 items. But it is not so simple in the case of a larger batch size. Internally Hibernate creates an array of batch sizes using the following strategy: if the batch size is <= 10 (for example 5), it fills the array with the numbers from 1 to the batch size ([1,2,3,4,5] for 5); otherwise (for example 50), it fills the array with the numbers from 1 to 10 plus the integer parts of the batch size divided by powers of 2 ([1,2,3,4,5,6,7,8,9,10,12,25,50] for 50). So, how many SQL queries will Hibernate perform when the batch size is 50 and there are 38 records in the database? The answer is 3: 25, 12 and 1. Are you surprised? Hibernate avoids creating too many JDBC prepared statements for batch fetching, so it performs querying using this fixed array of batch sizes.

So, now that you know the truth, how do you define the best batch size for your application? The answer is simple and relies on basic math: use powers of 2 multiplied by 10 (for example 10, 20, 40, 80, etc.), because every positive integer may be represented as a sum of powers of 2 (for example 13 = 8 + 4 + 1). If you select 40 instead of 50 in the previous example, you will see the benefit, for example, when the number of records is 23: 2 SQL queries (20, 3) instead of 3 SQL queries (12, 10, 1). Of course, if you know that the number of records in the database will always be small enough, then use the smallest recommended batch size - 10. If you don't know how many records will be in the database, then switch on SQL query logging and analyze how many queries are performed to pick a batch size using the recommended formula. A small sketch of this logic is shown below.
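
Here is a small standalone sketch of the logic described above; it only mirrors the behavior as described in this post and is not Hibernate's actual source code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // A sketch of the batch-size behavior described in this post.
    public class BatchSizeSketch {

        // Build the array of batch sizes for the configured batch size.
        static List<Integer> batchSizes(int batchSize) {
            List<Integer> sizes = new ArrayList<Integer>();
            for (int i = 1; i <= Math.min(batchSize, 10); i++) {
                sizes.add(i);                     // 1..10 (capped by the batch size)
            }
            for (int s = batchSize; s > 10; s /= 2) {
                sizes.add(s);                     // batch size divided by powers of 2
            }
            Collections.sort(sizes);
            return sizes;                         // [1..10, 12, 25, 50] for 50
        }

        // Simulate the queries: each one uses the largest size
        // that does not exceed the number of records still to load.
        static List<Integer> queries(int records, int batchSize) {
            List<Integer> sizes = batchSizes(batchSize);
            List<Integer> result = new ArrayList<Integer>();
            for (int left = records; left > 0; ) {
                int chosen = 1;
                for (int s : sizes) {
                    if (s <= left) {
                        chosen = s;               // sizes are sorted ascending
                    }
                }
                result.add(chosen);
                left -= chosen;
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(queries(38, 50));  // [25, 12, 1]  -> 3 queries
            System.out.println(queries(23, 50));  // [12, 10, 1]  -> 3 queries
            System.out.println(queries(23, 40));  // [20, 3]      -> 2 queries
        }
    }

I hope this will help you make Hibernate more productive. Develop with pleasure!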

Oct 18, 2009

Ideal project video

Recently I watched a video presentation on InfoQ about one team's experience report. This team uses a lot of great engineering and collaboration practices, and constantly experiments and analyzes the results of their work. It's really the only right way to build products quickly, within budget and with the highest level of quality. They always communicate with customers and gather feedback from them to build the product according to the real needs of the market. Try this way and you will succeed!

Oct 14, 2009

XP injection at ITjam

In September Kiev became a little more Agile thanks to the largest Agile conference in Eastern Europe. At Agilee I presented "People factor as failure reason of Agile adoption". This presentation is about people and the requirements the Agile world places on them, and about how everybody may improve their skills and become an Agile team member.

The next conference I'm going to participate in is ITjam. I will present "Kanban VS Scrum" and "XP injection" together with my friend Aleksey Solntsev. From the first presentation you will learn more about Kanban and its principles, and how it compares to Scrum. In the second presentation we are going to show, on a real project, how to inject XP engineering practices like unit testing, TDD, CI, automated builds, code review, etc. You will find out how to simplify your daily work and start producing better products. Registration is free, so you are welcome!

Oct 13, 2009

Agile Coaching site launched

Some weeks ago I launched the Agile Coaching web site. It contains a lot of trainings available for ordering and participation, presentations from different conferences and video materials. I'm going to continuously fill this site with new useful information. I hope you will find something interesting for you there. Develop with pleasure!

Manage dependent projects in Maven

If you have been using Maven for a long time, you probably know that it supports both project inheritance and aggregation. There is a detailed description with samples on the Maven site of how both of them work. But sometimes you have really separate projects where one depends on another, and both are continuously in the development phase. So you need to use the latest build artifacts of the base project without wasting time running all its tests. There are several ways to resolve this issue. Let's say we have project B which depends on project A. Both of them contain many modules and base pom.xml files to build their own hierarchies.

The first natural way to resolve the issue is to build the projects in the right order manually (A and then B). But if project A has not changed since the last build, you will just waste build time. Also, in this case you have to run several commands instead of one (which may be problematic for your CI server). Maven has a release plugin that may help you. You just need to set up an internal (company or project level) Maven repository and configure a releasing policy for your project build. There are a lot of players on the market of Maven repositories: Archiva, Nexus, Proximity, Artifactory, etc. All of them support local releasing of project artifacts. When setting the releasing policy for the project, you should decide whether you will support only stable versions or snapshots as well. In the case of snapshots, Maven may generate a build-time version of the artifacts after each release build. After all the parts are configured, you may use the release plugin commands to publish project artifacts to the internal repository and make them available for all local repositories. So you need to build project A only when changes are made (for example on each commit to the VCS). When you build project B, Maven will automatically check for new versions of the project A artifacts and download them to the local repository if changes are found.
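
For example, assuming the internal repository is already configured in both projects' pom.xml files (the directory names below are just placeholders), the daily flow could look roughly like this:

    # publish fresh project A artifacts to the internal repository
    cd projectA
    mvn clean deploy                        # for SNAPSHOT versions
    # or, for stable versions:
    mvn release:prepare release:perform

    # build project B; Maven resolves the latest published A artifacts
    cd ../projectB
    mvn clean install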

But sometimes your project contains modules that are not written in Java and are very platform dependent (for example, installing C++ library code on each build). In this case the previous solution can't be applied. You need to change the main pom.xml of project B to add information about the dependency on project A, but you want to apply it only if the project A sources are available on your machine. Use the following profile for this purpose:

<profile>
    <id>Build A project</id>

    <modules>
        <module>${a.relative.path}</module>
    </modules>

    <activation>
        <property>
            <name>a.relative.path</name>
        </property>
    </activation>
</profile>

This profile will be activated only if the system property a.relative.path is passed. Note that the value of this property should be relative to the root of project B (for example ../../a). Maven will then build project A as part of each project B build. But how do we avoid running the tests of project A? We use aggregation, so no settings are inherited from the project B pom.xml except system properties. So let's use them. Add the following profile to the project A base pom.xml:

<profile>
    <id>Disable tests for external build</id>

    <properties>
        <maven.test.skip>true</maven.test.skip>
    </properties>

    <activation>
        <property>
            <name>a.test.skip</name>
        </property>
    </activation>
</profile>

This profile will be activated when the a.test.skip system property is present, and it will disable all test execution just like -Dmaven.test.skip does. If you need additional settings for a quick build of project A, set them via system properties as well. Now you may build both projects with one Maven command: mvn -Da.relative.path=../../a -Da.test.skip clean install. If you don't have access to the project A sources, then rely on the current versions of the artifacts in the local repository; the build phase of project A will simply be skipped.
Maven is a very powerful build tool, and even complex scenarios may be implemented with it. Develop with pleasure!

Mar 12, 2009

Be careful with large log files when using TeamCity

Last week I had a very interesting experience with the TeamCity continuous integration server. At some point it became unable to perform any actions, just hanging and then crashing. After analyzing the server logs and the running processes, we identified that it was using 100% of CPU time and then failing with errors related to memory limits. The first level of the issue was that it ran the garbage collector very often and without success (the GC couldn't release memory in any generation). But why? To understand this behavior we used one of the very well done open source profilers - VisualVM. We found that one of the threads had created a very large list of strings (more than 100,000) and kept adding new ones.

The root of the issue was a very large log file from one of our builds (this build always runs for several hours and gathers all its actions in the log). TeamCity saved this log, and on each start it tried to read it all into memory line by line to let users view the build history logs. We had already configured TeamCity to perform a clean-up procedure for all historical artifacts and logs once per week, but it didn't save us from the crash. To fix the issue we manually cleaned all historical data from .BuildServer/system/messages. We also configured the problematic build to output all log messages to a log file, and now we use this log file only as an artifact of the build. So, logging is good, but be aware of large log files, especially when somebody reads them. Develop with pleasure!