Tuesday, September 8, 2009

How many objects do we need?

One of the interesting, and potentially invaluable, concepts that the Chandler project brought to light (i.e. to my attention) is "stamping". Stamping is the idea of defining incoming data by assigning it a type. When someone quickly types in some information, it goes into a note by default, and the user can then stamp it as an event, a task, or whatever else is needed.

A typical (though narrow) implementation of this concept is the ability to take an e-mail that is an invitation and have the information put into the calendar at the appropriate time. Microsoft Outlook allows e-mails to be scheduled this way.

As far as I can tell, the Chandler project does this by displaying different facets of information in different contexts. This implies that the information is stored and "stamped" with several different types.

For example: someone may send you an evite to an event. You respond to the evite but keep the e-mail around as a reminder. If you liked the restaurant that the event occurred at you may decide to keep a note about the fact that you like that restaurant and you may also want to put the address of that restaurant into your contact list. In the Chandler world this is done by stamping the same piece of information twice. The first time it is stamped with the type "note" and the second time with the type "contact".

My apologies in advance if it turns out that I am misinterpreting the design notes for Chandler.

I think the same thing could be achieved by creating new objects from the initial object and then maintaining associations between them. This would allow someone to create an event from an e-mail, and the event would have a reference back to the original e-mail.
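
To make that concrete, here is a minimal sketch of the idea in Java. The names (Item, Email, Event, derivedFrom) are placeholders of my own for illustration, not Chandler's model:

import java.util.UUID;

// Every tracked thing is an Item with an identity of its own.
abstract class Item {
    final UUID id = UUID.randomUUID();
}

class Email extends Item {
    String subject;
    String body;
}

class Event extends Item {
    String title;
    Item derivedFrom;   // reference back to the originating item

    // Create an event from an e-mail and keep the association.
    static Event fromEmail(Email mail) {
        Event event = new Event();
        event.title = mail.subject;
        event.derivedFrom = mail;
        return event;
    }
}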

Incomplete thoughts on displaying PIM data

Modeling calendars, agendas, projects and so on as views into the data.

Examples:

A calendar could be considered a view into a timestream populated with events, tasks and anything else that can be located in time, organizing chronological items in terms of days, weeks, months, years, etc..

A timeline could be considered a view into a timestream populated with the events, tasks and anything else that can be located in time.

A timetable could be considered a view into a timestream populated with the events, tasks and anything else that can be located in time in a tabular form.

The daily agenda could be considered a view into a timestream populated with events and tasks and anything else that can be located in time.

A Gantt chart could be considered a view into a set of actions that can be tracked and have a due date. A to do list could be considered another view into a set of actions that can be tracked and have a due date.

A contact list could be considered a view into a set of information about people, organizations, companies and so on.

So far, so good.

Taking the position that each of these is a view into an otherwise undifferentiated set of data has the potential to give great results. That is effectively what ECCO allowed you to do on a limited scale.

And it seems to me that there are two parts to this: the constraints or criteria that produce the list of items to be displayed, such as "show me all the items that can be located in time and whose date occurs somewhere within the next month", and the form in which those items are then presented. Some of the criteria could be tags and/or "all items referenced by the following taxonomy..."

So that gives us some common views that may be of use:
  1. For sets of items that can be located in time => calendars, schedules
  2. For sets of items that can be tracked and/or have a due date there are projects, to do lists, agendas, checklists.
  3. For sets of items that are contact information of various types there are contact lists, mailing lists, and directories.
  4. For communications and messages there are threaded conversations.
  5. For events that have happened in the past there are journals and audit logs.

So if we now look at how those views could be created we start having to pull together all the previous notes and discussions.

Say for example we needed to create a mailing list for a specific community such
as the extended family. Of course the criteria would be something like: please
display contacts referenced by the "family" tag/taxonomy.
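
As a rough sketch of what a "view as criteria" might look like in code (the TaggedItem class and the select method are mine, purely for illustration):

import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class TaggedItem {
    Set<String> tags;
    boolean isContact;
}

class Views {
    // A view is nothing more than a set of criteria applied to the whole pool of items.
    static List<TaggedItem> select(List<TaggedItem> pool, Predicate<TaggedItem> criteria) {
        return pool.stream().filter(criteria).collect(Collectors.toList());
    }
}

// The "extended family" mailing list is then just:
// List<TaggedItem> mailingList =
//     Views.select(allItems, item -> item.isContact && item.tags.contains("family"));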

Context is decisive

Context is decisive. Whatever context you have for a piece of information governs your understanding of that piece of information. So context is decisive.

I have had thoughts about context going around and around in the back of my head
for days now. And since they are getting in the way of my other work I am
putting them into the blog in the hopes that they will leave me alone.

One of the most common requests I have bumped into is for the PIM to take into
account the context in which you are running the application. So, in other
words, if you are at work it would only show you by default the tasks and events
pertinent to you being at work. If you are on the road it would only show you
those items that are pertinent to you being on the road.

The GTD methodology recommends setting up task lists with tags such as @phone and @office so that when you are on the road you can simply list those tasks that you have the resources to perform. Is this something that makes sense to model in a more sophisticated manner?

In other words, what does knowing the context make available in terms of work?
Should it be modeled and if so how?
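
One deliberately simple way to model it: a context is just the set of resources currently available, and each task lists the resources it requires. The sketch below is illustrative only; the enum values and class names are mine, not a settled design.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

enum Resource { PHONE, COMPUTER, INTERNET, OFFICE }

class ContextTask {
    String description;
    Set<Resource> required;   // what this task needs (e.g. @phone => PHONE)

    ContextTask(String description, Set<Resource> required) {
        this.description = description;
        this.required = required;
    }
}

class Context {
    Set<Resource> available;  // what the current situation provides

    Context(Set<Resource> available) {
        this.available = available;
    }

    // Show only the tasks that can actually be done right now.
    List<ContextTask> doable(List<ContextTask> tasks) {
        return tasks.stream()
                    .filter(task -> available.containsAll(task.required))
                    .collect(Collectors.toList());
    }
}

// "On the road" with only a phone means only the @phone tasks show up:
// new Context(java.util.EnumSet.of(Resource.PHONE)).doable(allTasks);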

Saturday, September 5, 2009

It was Col. Mustard in the library with the candlestick

One of the big questions in any enterprise development project is who gets to muck with the database or data repository.

This is significant even in the case of the PIM.

When looking at how to deal with pushing data out to other repositories such as Google, polling data from other repositories such as Google or RSS feeds, and just the general headaches of synchronizing to and from other devices, it is clear that who gets to touch the database, and how, is a critical question.

For now, I am assuming that all of the tools that feed to and from other data sinks/sources will do so by operating against the database rather than having the PIM up and running and managing those operations.

Kitchen sink modeling

The more I work with user stories and scenarios for PIMs the more it is becoming clear that there is simply a sea of information that only the user has any sense of. Most of the things that we are modeling have specific meanings to the individual (i.e. must a promise have a due date?). For some people the answer is yes; for others the answer is no.

It is clear that for most people there are clear distinctions between the different types of data they have. In other words most people have a consistent way they mentally model meetings versus tasks versus holidays. And the way they model it is quite closely tied to the way they work. So people work and live in a world of pieces of data that have clear types and clear behaviors.

The collections (i.e. projects, agendas, to do lists, checklists, etc..) people use to manage those different types of data ( i.e. to do list items, tasks, promises, calls, errands, chores, etc..) are just as individualized as the data items themselves.

For example: for some people a task may be committed to only to the extent that somebody has said "I will do that sometime next week". And they put a sticky note in their calendar so that when they are looking at that week they know what things they promised to do that they haven't put into "space and time" yet.

Other people maintain tasks on lists that don't count them as real until they are scheduled on a calendar. Until that point they are on the "Not Doing Now" or "Unscheduled" list.

So clearly, the majority of everything to be modeled needs to be customizable. So, after walking through user story after user story, this is what I see:

1) Everything to be tracked in a PIM has one or more behaviors.

The behaviors are

  • Locatable in time (LIT)
  • Can Occupy time (OT)
  • Locatable in space (LIS)
  • Has a lifespan (SPN)
  • Has trackable progress (TRK)
  • Has a due date (DUE)
  • Requires resources (RR)

A given type of item either has a behavior or it does not. If it does have the behavior, then the corresponding value can either be fixed or not yet fixed.

For example, a promise can have the behavior of being locatable in time, but it may not yet have been fixed in time.
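
A small sketch of how that might look in code; the Behavior class and the default value shown are my own illustration of the idea, not a finished design.

import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Optional;

// A behavior is either absent for a type, present but not yet fixed, or present and fixed.
class Behavior<T> {
    final boolean present;
    private T value;            // null means "has the behavior but no fixed value yet"

    Behavior(boolean present) { this.present = present; }

    void fix(T v)         { if (present) value = v; }
    boolean isFixed()     { return present && value != null; }
    Optional<T> value()   { return Optional.ofNullable(value); }
}

class Promise {
    // Locatable in time, but possibly not yet fixed in time.
    final Behavior<LocalDateTime> locatableInTime = new Behavior<>(true);
    final Behavior<Duration> occupiesTime = new Behavior<>(false);
}

class Event {
    final Behavior<LocalDateTime> locatableInTime = new Behavior<>(true);
    // The type supplies a default value for the behavior (e.g. 15 minutes).
    final Behavior<Duration> occupiesTime = new Behavior<>(true);
    { occupiesTime.fix(Duration.ofMinutes(15)); }
}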

2) Everything is versionable.

There are many areas where tracking the changes to a scheduled item (who generated them and when) is critical to fixing something. This is especially true when you're dealing with synchronizing schedules with multiple calendars.

3) All items tracked in the PIM are associated with a specific type; that type has associated properties and behaviors, along with default values for those properties and behaviors.

For example, an event can have the behavior of being locatable in time and may have the behavior of occupying time, and the default time it occupies is 15 minutes.

4) Context is decisive: contexts such as home/online, home/offline, or work/online each have resources available (phone, computer, Internet, e-mail, etc.)

Many methodologies such as GTD take into account the context in which you are working. For example, there are some things that should only be done from home, there are some things that should only be done from work, and there are some things that can only be done when you have a phone available. The user's context governs both what they should be doing as well as what they are capable of doing given the resources available.

5) Taxonomies (hierarchies) go from wider towards narrower ( i.e. Extended Family => Immediate Family)

In looking at all the different ways people navigate through hierarchies and the way they create their own private taxonomies (using file folders, categories, tags, etc.) it is clear that most people navigate from wider to narrower.
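
A small sketch of a wider-to-narrower taxonomy, where a wider node "covers" everything reachable through its narrower children (the class and method names are just for illustration):

import java.util.ArrayList;
import java.util.List;

class TaxonomyNode {
    final String name;
    final List<TaxonomyNode> narrower = new ArrayList<>();

    TaxonomyNode(String name) { this.name = name; }

    // Add a narrower child and return it so chains read wider => narrower.
    TaxonomyNode narrow(String childName) {
        TaxonomyNode child = new TaxonomyNode(childName);
        narrower.add(child);
        return child;
    }

    // Does this node cover the given node, directly or through narrower nodes?
    boolean covers(TaxonomyNode other) {
        if (other == this) return true;
        for (TaxonomyNode child : narrower) {
            if (child.covers(other)) return true;
        }
        return false;
    }
}

// TaxonomyNode extendedFamily = new TaxonomyNode("Extended Family");
// TaxonomyNode immediateFamily = extendedFamily.narrow("Immediate Family");
// extendedFamily.covers(immediateFamily);  // true: the wider term covers the narrower one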

Wednesday, August 12, 2009

PIM type tools.

On a slightly tangential note, one of the things that I constantly review and revise is my set of tools for managing tasks and scheduling. I figured it was worthwhile to direct you to my other blog with links to my two latest posts regarding the state of the three big open-source contenders: Evolution, Chandler, and Thunderbird/Lightning.

http://simplexsaltations.blogspot.com/2009/08/upgrading-my-existing-pims-and-systems.html

http://simplexsaltations.blogspot.com/2009/08/upgrading-my-existing-pims-and-systems_12.html

- Jim

Upgrading my existing PIMs and Systems: Take 2

As I said last week, Chandler appeared to be a bust.

Over the weekend, I tried out using Thunderbird 2.0 and Lightning and a few plug-ins.

It is definitely far more stable than Chandler. Installation was a breeze. The Thunderbird functionality was rock solid. The Lightning integration was a little less so. Many of my Google calendars did not display all of the events.

I then went on to try Thunderbird 3.0 beta and the corresponding experimental Lightning version. Integration is a little cleaner, some Google events still did not display (though not the same ones), and the application did crash under the 64-bit version of Ubuntu (Jaunty Jackalope) I am running.

So I ended up back using Evolution as my primary PIM. But I did make a few changes in how I used it.

I changed the setup so that my two Gmail accounts are accessed using IMAP. My Comcast account is accessed using POP, and I am experimenting with rules tailored for each of the three accounts.

In addition, I have switched over to using "Remember The Milk" (aka RTM) as my primary task manager. Google Tasks appears to be of no real use for me since I can't sync my tasks to anything useful.

I use Tasque (Linux only) to edit and modify my task entries and I use Evolution's read-only access to RTM to see the tasks on the fly. It is far from ideal. Far, far, far from ideal. But it is a step up. When Evolution has the ability to access Google Tasks then I may be able to switch to one online provider. Of course, for that to happen Google needs to publish an API that allows us to access the tasks.

- Jim

Thursday, August 6, 2009

Upgrading my existing PIMs and Systems

Periodically, I review the existing PIM solutions that are out there in the hopes of finding something that will suffice to handle my intermediate needs while I am writing something to handle all of my needs.

I am currently using Evolution in connection with Google Calendar and syncing Evolution to a Palm Pilot. In my review, I discovered that Chandler is now at a greater than 1.0 version. Yay. Sort of.

I have often commented in this blog about how effectively the Chandler Project has modeled the domain of task management and scheduling. They have some brilliant concepts and have nailed many of the basic issues needed to take things to another level.

Unfortunately, stability is not one of them. I tried pulling down and installing Chandler on four different machines/OSes: Windows XP, Vista, Ubuntu 8, and Ubuntu 9.

Every single one of them crashed multiple times during normal operations. Whether it was importing ics files from Google, creating a new collection, or creating a new calendar, crashes were the rule rather than the exception.

This was also my experience over a year ago.

I don't know what else to say, but "Damn".

Over the weekend, I will look at both Thunderbird 2.0 and its corresponding Lightning plug-in as well as the Thunderbird 3.0 beta and its corresponding Lightning plug-in.

- Jim

Sunday, June 21, 2009

Model Mismatch

Interoperability is a stone cold *expletive deleted*. When writing a PIM or any other calendar app, you will pick wrong. Just get used to it.

Here is what I mean:

In the calendar world there is only one viable standard for exchanging this info: the iCalendar file format. It can be exchanged in multiple ways, though the CalDAV protocol (sits on top of WebDAV, which sits on HTTP) is becoming the exchange method of choice.

But whose should we use? Lots of applications support it. But Microsoft's iCalendar format is not always in sync with Google's and so on, AND it is not even anyone's fault. The areas where things fall down are often where the spec itself has issues, such as recurring events and events that last all day.

When does an all day event start ? 00:00:01 or 00:00:00 ?

When does an all day event end ? 23:59:59 or 24:00:00 (oh wait, that's 00:00:00) oops.

Those questions just begin to touch on the issues.

Whichever way you handle all day events you will have to do accurate conversions to the other way or risk all sorts of "round trip translation errors". If you have synchronized your PIM with another and back again you have probably already run into this....
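
One convention I am leaning toward for the internal model (a sketch, not a claim about what any particular application or spec does): treat an all-day event as the half-open interval from local midnight to the next local midnight, and only convert to whatever the other side expects at the import/export boundary.

import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class AllDayConvention {
    // The event occupies [start, end): the end is exclusive, which sidesteps
    // the 23:59:59 versus 24:00:00 question entirely.
    static ZonedDateTime[] toInterval(LocalDate day, ZoneId zone) {
        ZonedDateTime start = day.atStartOfDay(zone);
        ZonedDateTime end = day.plusDays(1).atStartOfDay(zone);
        return new ZonedDateTime[] { start, end };
    }

    // Going the other way: anything spanning exactly one local day maps back
    // to a date-only, all-day event; everything else stays a timed event.
    static boolean isAllDay(ZonedDateTime start, ZonedDateTime end) {
        ZonedDateTime midnight = start.toLocalDate().atStartOfDay(start.getZone());
        return start.isEqual(midnight) && end.isEqual(midnight.plusDays(1));
    }
}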

There are conferences and technical groups that spend days just battling with how to specify and manage recurring events. Look for "IIOP Recurring Events"

Most PIMs solve this by closely adhering to the icalendar specification for their domain model. That's great where the spec really works, but there are some key things that don't appear to be covered or are incompletely specified: i.e. tagging and taxonomies, the same events in multiple calendars, hierarchical tasks and events, and so on. I am not yet an expert on icalendar but I will become solidly familiar with it over the next few weeks. Luckily icalendar provides an extension mechanism so that application-specific information can be captured. Of course, other applications won't use the info and some of them won't even preserve it (Google's icalendar support seems to remove tasks from the calendar).
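
For what it's worth, the extension mechanism is just properties prefixed with "X-". A sketch of how tags might be smuggled into a VEVENT that way (the X-SALTATION-TAGS property name is something I am making up here; other applications will ignore it, and some will drop it):

import java.util.List;

class VEventWriter {
    static String withTags(String summary, List<String> tags) {
        StringBuilder sb = new StringBuilder();
        sb.append("BEGIN:VEVENT\r\n");
        sb.append("SUMMARY:").append(summary).append("\r\n");
        // Non-standard, application-specific property carrying the tag list.
        sb.append("X-SALTATION-TAGS:").append(String.join(",", tags)).append("\r\n");
        sb.append("END:VEVENT\r\n");
        return sb.toString();
    }
}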

My project for next weekend is to try and do a full round trip between Google calendar and the Saltation domain model using Google's Calendar API and maybe using the icalendar file format.

6.21.09 I wrote this post on 6.19.09 and then posted it today. As I go back and look at what Google actually supports I am stunned. To try and push through the Google API looks like a decent amount of work for little initial return. I think I am going to work first on FULL icalendar file format support (import and export) next weekend and then look at publishing and reading from Google directly. At least debugging the process will be simpler.

Progress

I just realized that naming the PIM and the blog the same thing may have been less than optimal. Ah well.

I did a quick clean up of my build scripts for saltation.

I am still wrestling with how to present the information on SourceForge in a way that works. The gods know that I have gotten annoyed often enough with OSS projects, navigating around trying to figure out what is where. I will spend some serious time and thought on laying things out in a way that works.

Wednesday, June 17, 2009

Taking things public

About two weeks ago I realized that I was creating this project as an open source project but not taking it out into the open source community.

Anybody see a small disconnect there?

So, I have started a project (called saltation) on SourceForge and I will be checking in code within 24 hours.

And, I have already run into my first roadblock (or at least a speed bump).

Specifically, it is the build system. I use a common build framework for all of my projects which I call appropriately enough, common-build. It uses Ant and Ivy for the build and dependency management. For a little more information on that see this posting in my other blog : Ant versus Maven.

Both Maven and Ivy use an external repository for storing dependencies such as Java libraries. SourceForge still hasn't worked out all the ways to interact with this. So the bottom line is that one of my first tasks is going to be making the build system require a minimum of setup (i.e. you should be able to build right out of the box).

That will probably take me a day or two counting testing.

Till then,

- Jim


Friday, June 12, 2009

More in the world of unused code detection : Klocwork

As noted in a previous posting, I have a major refactoring task ahead of me with the code base that I am now the owner of.

Recently we had an intermittent problem that may have been the result of a resource leak. Because we were unable to reproduce it, we put some processes in place for the next time, and I did what I always do: look to see if there is some tool out there that will allow me to detect resource leaks in the current branch of the code base.

The two that most people seem to recommend are Coverity and Klocwork. A number of my acquaintances have said that Klocwork was better at detecting resource leaks so I decided to try it.

Here is the good news: the tool mostly does the job.

The bad news: the company doesn't quite have it together.

In one key way it does: they are courteous, efficient and smart. The sheer efficiency and overall effectiveness highlight even more their marketing/sales deficiencies.

I think the key thing that I kept bumping into is that there were limitations and conditions that were not clearly spelled out during the purchasing process. Klocwork had a salesperson and a technician on the phone with me at their request to evaluate my needs. The decision was made for me to go with the Klocwork Solo product rather than the honking big enterprise licensed behemoth that is their main line product. When I pulled down the demo version of Klocwork Solo I discovered that the temporary license covered more than a month, but that the demo version could only handle 99 files at a pop, and that the Solo product could only run under Windows even though it is a Java application (it appears to spawn a Windows executable as part of its analysis process).

Neither of these facts came to the surface in the initial call.

Personally, I would recommend that when you make a trial version of a code analysis tool available, you make the trial period short AND allow it to handle an enterprise-class number of files. After all, you know people are going to use it on their code base, and will need to do so in order to demonstrate its worth to the powers that be.

The 99 file limitation was a pain, but I evaluated the tool enough to determine that it was probably worth the $100 it costs to get it and try it out on a larger code base.

I purchased the $100 Solo product and discovered that it was limited to 1000 files (but apparently I can call Klocwork and get that expanded). Luckily, I was able to limit the code base I was interested in to 1000 files.

But the other thing I discovered is that the licensing tool appears to talk with the Klocwork mothership every time I start up Eclipse, and the license is only valid for a year. That was another detail that was not presented upfront.

I don't want to leave you with the idea that I think that Klocwork is intentionally misleading people. I don't think that. But I do think that the management of licenses and sales is oriented toward a different scale of user than a small shop of 3 to 5 developers.

I don't think I will be recommending their tools until they revise their licensing and license management.

- Jim

Tuesday, June 2, 2009

Technology choices revisited

In one of my earlier posts Technology Choices I had looked at what user interface toolkit I should be using. What I've discovered in the intervening time is that this application is going to put a premium on flexibility in presentation. As a result, I am seriously reconsidering the user interface toolkit. The Eclipse RCP is a very powerful platform upon which to build an application. Unfortunately it does have a very strong set of metaphors upon which it is built (Views and Workspaces) which seem too rigid for what I am attempting.

So I am seriously looking at using QT with the Java bindings.

Any thoughts?

Going to and fro

The other part of the last few months has been looking at importing and exporting and synchronizing.

I spent a great deal of time looking at the iCalendar specification and all of the recurrence rules. And then I read all of the use cases for recurrence rule interoperability that have been put together by some incredibly diligent groups of people. The people that work on those specs definitely come under my heading of unsung heroes.

As often appears to be the case in the area of calendaring, there are really no clean answers. And going back and forth between the iCalendar format and a more flexible internal model is going to require a great deal of detailed work. But it is very clear that iCalendar is the de facto standard for now and I see nothing on the horizon yet that will replace it.

And personally, I think it gives the biggest bang for the buck. If I can support CalDAV and iCalendar to any significant degree I will be able to publish calendars to and from Google Calendar, as well as many other applications.

So my immediate focus is to support import/export for iCalendar, and then support using CalDAV to publish.

- Jim

When things don't fit into a neatly modeled world

I am back after a long hiatus. A combination of work plus a course I'm teaching has kept me very busy.

But always in the background I am thinking about the PIM. That's how we refer to it in my household: "The PIM". I would wonder if I'm a little obsessed, except I'm having so much fun.

I've been coding some commandline utilities to test the data model and, as I expected, I have run into the limitations of my original model. I had originally started with calendars, agendas and so on being first-class objects. What I mean by that is that they have some distinct existence in the real world, and that is why I am modeling them. What I have discovered is that calendars, address books, task lists, journals and so on are all simply semantically themed collections of stuff we care about.

There are few assertions I can make about what a calendar is that are universal enough to be easily recognizable to everyone.

So I stepped back and thought about it for a bit and this is what I have come up with: we have an ocean of first-class objects that we would like to track (events, tasks, promises, contact information, little notes, and so on). And then we spend our time trying to organize them into multiple collections that give us easy access to what we want to do when we want to do it.

So I am experimenting with having any of these collections simply be taxonomies, just like those described in an earlier post. This would mean that many calendars have a relationship to each other. An example would be the calendar I maintain for my children's schedules all together, which could have two related calendars (one for Aidan and one for Trent). Both Trent's and Aidan's calendars would refer to events in the Boy Scout calendar.

The same would be true of address books. I could have an address book that references my friends and a related address book that only has the subset of my college friends.

The same appears to work well for journals, agendas, sets of conversations, and the collection of resources and notes that I refer to as Data Mines or just Mines.



Tuesday, May 26, 2009

New packaging equals new tools

Sometimes a new package gives new life to existing functionality.

I am working to track down a resource leak in our project at work that only occurs in rare instances. Of course, I am using NeoLoad to stress the system, but since it appears to be a resource leak, I was planning to use the built-in JVM monitoring tools for doing what I need to do.

Those tools are generally commandline tools and I always have to refresh my memory about how they work.

But lo and behold, Sun had the brilliant idea of packaging a front-end interface called "Java VisualVM" that combines the profiling, monitoring, and heap dump capabilities all in one.

Go ahead and try it out. It should probably become a habit to routinely monitor applications you're working on when running unit tests and such. And since it is built into the 6.0 JDK, Sun has just lowered the bar to doing that kind of monitoring.

Very useful and built into your JVM.

Sunday, May 17, 2009

Short Tour Testing

This is an integration or system-level testing technique that scales well and works at the unit testing level as well.

I originally discovered it in an article by Tom Cargill in C++ Report many moons ago (see below). I have not found any electronic descriptions of the technique so I figured I would revive it for those who would find it useful.

Cargill appears to have originally derived the process from a text on validating computer protocols by Holzmann (see below). Having read the book thoroughly, I can see where he derived it from, though it would not have occurred to me to do so.

The basic design and concept is very simple: for any mixture of states and transitions that can be walked through in some complex sequence to produce a bug there is a short tour (three to five steps) through the same set of states and transitions that will give you the same bug.

I found this to be invaluable for pounding on APIs in order to validate that they have the correct mixture of correct error handling and correct functionality.

I typically implement this in Java; the pseudocode looks something like this:

int tour = 1001;
int numberOfMethods = 13;
int numberOfParamSetsPerMethod = 20;

// Walk backwards through the tour, picking a method and a parameter set at each step.
for (int step = tour; step > 0; step--)
{
    int methodIndex = step % numberOfMethods;
    int paramSetIndex = step % numberOfParamSetsPerMethod;

    // invoke() is the harness's dispatcher: call method #methodIndex with parameter set #paramSetIndex.
    invoke(methodIndex, paramSetIndex);
}

The end result of this is a predictable "drunken walk" through the combinations of methods and parameters.

Of course, the code can be made even simpler using the reflection API in Java.
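
A sketch of the reflective version (the paramSets array here is a stand-in for however you supply test data; in a real harness the parameter sets would be lined up per method):

import java.lang.reflect.Method;

class ReflectiveTour {
    static void run(Object api, Object[][] paramSets, int tour) {
        Method[] methods = api.getClass().getMethods();
        for (int step = tour; step > 0; step--) {
            Method method = methods[step % methods.length];
            Object[] params = paramSets[step % paramSets.length];
            try {
                method.invoke(api, params);
            } catch (Exception expected) {
                // Bad combinations exercising the error handling are part of the point.
                System.out.println(step + ": " + method.getName() + " -> " + expected);
            }
        }
    }
}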

Once the tour code has been designed, the test is invoked and the results of the tour are validated by eye. Typically at that point I save the results of the log of the tour so that it can be programmatically compared against the test results each time.

If the parameter sets are chosen well, this form of testing will go a long way towards discovering interaction issues in the system. I have had great success using scripting languages such as Lua to call C language APIs to do this kind of testing. I have also used it to test service architectures to expose session management and exception handling issues.


1. Cargill, Tom, "Short Tour Testing", C++ Report, vol 7, no. 2, February 1995, pp 60-62.
2. Holzmann, Gerard, "Design and Validation of Computer Protocols", Prentice Hall (c) 1991

Wednesday, April 8, 2009

Text processing without the pain

Don't get me wrong, I love sed and awk. I have whole libraries of sed and awk scripts for doing all sorts of things. But some of them took a lot longer to write than they should've.

Last night I was faced with the task of translating a whole bunch of documentation from TeX and LaTeX to Docbook 5.0 XML. That meant doing multiline matches with sed and some preprocessing with awk, and my spirit rebelled.

I went berserk with online searches for "alternatives to sed awk", "text processing commandline utilities" and so on. The problem is that the standard text processing utilities need so much explanation that tutorials on how to do things with sed and awk are churned out in such volume that they outnumber mentions of the alternatives by at least an order of magnitude.

I finally did what I should have done in the first place. I went to sourceforge.net and searched for "text processing". And I found Gema (pause for heavenly choir music).

It is not perfect by any means. The documentation in particular is just as cryptic as the original sed man pages. But in less than half an hour I had a script up and running that cleanly handled the multiline matches I needed to do.

As an example:

\\item\[*\]#\\=($1)#\n

will match the following

\\item[First Name]

The first name of the individual

This should not include any periods or commas.

And print out the following:

(First Name) The first name of the individual

This should not include any periods or commas

It's clean, and it appears to be quick, though none of the files I used it on were that large. It is well worth looking at. You also might want to take a look at this article.



Thursday, March 26, 2009

Load Testing Tools

Apologies for the delay in posting. My company has been dealing with a customer problem that required some significant load testing. The previous set of load tests had appeared to miss something and we had to go back to the drawing board. The person who owned and managed the previous set of load tests was no longer here, and the load test framework, while documented, was not easy to extend.

So we went looking for a load test tool to give us a leg up quickly. Unfortunately, even though open source is my default choice, I was not able to get any of the open-source solutions up and running quickly while giving me scope for quickly adapting to different load test scenarios. If I had had a few more days I probably could've put an open-source solution in place that would give me the same responsiveness as the proprietary solutions.

It very quickly came down to the proprietary solutions of NeoLoad and PureLoad. Almost all of the other proprietary solutions were too expensive or took too long to set up and configure.

Of the two, NeoLoad passed the five-minute test with flying colors. After downloading the installer, it took me less than four minutes to start running load tests. If you are unfamiliar with the language of load testing, it may take you as long as a half hour to understand the documentation. I recommend looking at the Wikipedia entries for load testing first before attempting to use any load testing tool. The terminology can be misleading when you're first exposed to it.

PureLoad almost passed that test. I was able to get the free component (PureTest), which allows you to record and run the tests themselves, up and running in five minutes. They have a separate tool called PureLoad that allows you to run the same tests as load tests. I was not able to get that up and running in the types of scenarios I wanted, and my call for pre-sales support went 48 hours before I got a response. By that time, I had already committed to going with NeoLoad.

With NeoLoad, we had the license for 60 simultaneous virtual users within an hour after they received the purchase order. I was up and running load tests within the next half hour. The fact that they had an eval license that let me prepare all of the tests ahead of time and run them with three simultaneous users allowed me to ramp up before I even got the full license. That is what I like!

Sidenote: What was truly frustrating was the fact that when I did get a presales support call back from PureLoad, the individual in question tried to convince me to change my mind and use the PureLoad product on the basis of price even when I highlighted to him that the reason for going with NeoLoad was speed of setup and speed of response. I pointed this out to him two times. After the second time, when he again brought up the subject of price, I had to make it clear that no means no. This is why I consider communication and listening skills to be critical for any kind of presales support. If, for some reason, NeoLoad does not work out I will be reluctant to go back to the PureLoad people simply because I don't want to deal with people that don't listen.

Friday, February 27, 2009

The build lifecycle

One of the things that is critical to any build system ( and this is something that the Maven guys nailed) is that it have a clearly defined lifecycle.

For myself over the last 20 years I have developed a build lifecycle that appears to answer my needs for every build.
  • Clean - Remove all build artifacts from the file system.
  • Init - Initialize the build system. All properties should be set here.
  • Prep - Prepare the file system for the build. Create folders as necessary.
  • GetDependencies - Resolve all dependencies. Locate and pull down the necessary dependencies and make them available to the build
  • Gen - Generate or Preprocess source code and resources of any type.
  • Build - Build anything that can be compiled or assembled.
  • UnitTest - Execute any tests that can be performed without deployment
  • Pkg - Package anything that can be packaged
  • Verify - Establish the internal integrity of the package
  • Deploy - Deploy any packages that need to be deployed.
  • SmokeTest - Execute any smoke tests against the deployed application(s)
  • Stage - Publish the build artifacts for local (i.e. machine local) usage.
  • Share - Publish the build artifacts for team wide usage.
  • Release - Publish the build artifacts for general (i.e. network) usage.
  • IntegrationTest - Execute any smoke tests
Most of that is not new. You have probably seen build systems that use subsets or supersets of these. The key is in making it very easy to plug in to a lifecycle event so that you can perform additional actions as necessary. Doing that in Ant takes some very careful thought but produces some very clean results.
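
As a sketch (not my actual common-build, just the shape of the hook pattern): each phase is an Ant target that depends on the previous phase and brackets its real work with empty "pre" and "post" targets that an individual component's build file can override via <import>.

<project name="lifecycle-sketch" default="Build">

  <target name="Init"/>
  <target name="Prep" depends="Init"/>
  <target name="GetDependencies" depends="Prep"/>
  <target name="Gen" depends="GetDependencies"/>

  <!-- Hook points: a component overrides pre-Build or post-Build to plug in. -->
  <target name="pre-Build"/>
  <target name="do-Build">
    <mkdir dir="target/classes"/>
    <javac srcdir="src/main/java" destdir="target/classes"/>
  </target>
  <target name="post-Build"/>
  <target name="Build" depends="Gen, pre-Build, do-Build, post-Build"/>

</project>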

It does require some conventions on the project and source code structure side of things. Another thing that the Maven guys nailed. :-)

The directory structure conventions I am using with my build system are the following:

I have a top level [project] Folder with several folders underneath.

  • [project]/root - This is the top level location from which you do builds that execute against everything. Commands like "CleanAll", "BuildAll", etc...
  • [project]/common-build - This contains the common build system I use across all projects. It is a separate project in my revision control system and is typically pulled down into a location under the project using Subversion's externals command or using Git's submodules.
  • [project]/components - This folder contains all of the sub projects that make up the components of this project.
  • [project]/apps - This folder contains all of the subprojects that make up the applications of this project.
  • [project]/installers - This folder contains all of the subprojects used to generate the installers of this project.
Under each components or apps or installers folder is a named subproject with a standard structure.

For example: The project "calendar-server" with the "calendar-utils" component would look like this:

calendar-server/components/calendar-utils

The component has a series of files. The build.xml file is of course the Ant file for the component. The ivy.xml is a very simple file that describes what artifacts the component is dependent on. And the .project and .classpath files are those files needed by Eclipse.

calendar-utils/build.xml
calendar-utils/ivy.xml
calendar-utils/.project
calendar-utils/.classpath
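
For the curious, the ivy.xml for a component is genuinely small; a representative sketch (the organisation and the single dependency shown are placeholders) looks like this:

<ivy-module version="2.0">
  <info organisation="org.saltation" module="calendar-utils"/>
  <dependencies>
    <dependency org="junit" name="junit" rev="4.5"/>
  </dependencies>
</ivy-module>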

And the calendar-utils component would have a checked-in folder structure that looks like this:

calendar-utils/src/main/java
calendar-utils/src/main/conf
calendar-utils/src/test/java
calendar-utils/src/test/conf

The source trees are separated for several reasons. Often the Java test code is compiled, managed, preprocessed, executed or referenced separately from the main code. The conf folders are also separate for some of the same reasons, as well as to allow separate configurations to be set up for test and actual deployment.

There are also a series of common folders that are typically generated during the build. The first set of those is for generated code, and those follow the same conventions as the checked-in source for much the same reasons.

calendar-utils/gen-src/main/java
calendar-utils/gen-src/main/conf
calendar-utils/gen-src/test/java
calendar-utils/gen-src/test/conf

The other set of folders is for the targets or results of the build. They all lie under the target folder. The classes and test classes are kept separate for the same reasons the sources are kept separate. The distrib folder contains all of the final output artifacts of the build (jars, documents, zip files, executables, etc.). The reports folder contains reports on the tests, profiling, etc.

calendar-utils/target/classes
calendar-utils/target/test-classes
calendar-utils/target/distrib
calendar-utils/target/reports

So to clean up this directory structure for a new pristine build only requires us to delete the gen-src and target folders.

In the next post I will talk about the design of the Ant build files found in the components and the common-build folder.

Build Systems : Ant versus Maven

Ever since I discovered Make (25+ years ago) I have been searching for a good build system. I have used everything from Configure and Make (talk about icing on a mud pie) to JAM and now Ant and Maven.

I keep going back and trying Maven again when they do a new release. It is such a good idea that I keep going back in the hopes that the implementation and documentation will finally live up to that promise. And I think a good number of people stay with Maven because it is such a good idea that they persevere and endure the slings and arrows of outrageous documentation and implementation. Alas, each time I come away frustrated.

The Maven repository concept is pure genius. And in fact, the implementation works well enough that I use it in connection with Apache's Ivy to do dependency management.

What is Ivy? A set of dependency management tasks used by Ant to pull down and access the appropriate jar or other dependencies needed by your project. I won't go into a tutorial about Ivy since there are more than a few out there. But the project itself is available at http://ant.apache.org/ivy/. It does suffer from some of the same documentation issues that Maven does, but between the online forums and other people's blog posts you can usually figure something out pretty quickly.

In the following few posts I will be discussing how I use Ivy in connection with Ant to produce a fairly clean build system with minimal bootstrap requirements.

Thursday, February 26, 2009

Revision Control: GIT

I love Subversion for revision control. It has a lot of power. And every once in a while I need more power.

For those moments, there is GIT. It is blazing fast and industrial-strength. Unfortunately, reading the user documentation has my head ready to explode. There is this vague haunting of concepts just out of reach, with explanations that almost, but don't quite, give you the clue you need.

For those people who want a clear explanation of what it does, how it does it, and the concepts behind it so that you can manipulate it well, here is the book you need: "Pragmatic Version Control Using Git" by Travis Swicegood.

Go to the Pragmatic Programming website and download the PDF for $22. It is clear, it is straightforward, it tells you what's going on in the background as well as having simple straightforward examples.

Discovering unused code in Java

When I set out to track down unused code in Java I came across a large number of static analysis tools that all seemed to do the job fairly well.

I used several of them, including a fairly good Eclipse plug-in called UCDetector. It was not fast, but it was very thorough.

By using those tools we were able to remove the obviously unused code. That resulted in a nontrivial shrinkage of about 30%. Unfortunately, due to the fact that much of the code gets called via Java's reflection API, there is a large amount of code that is not so obviously unused.

Since we have a UI test suite, I thought that we would run the test suite against the front end and then log or track the methods that are actually called in the back end. Then we could eliminate the methods we found that were unused.

I first tried using the JDI interface of the JVM to simply log each entrance into a method remotely (I didn't care about the exit). Unfortunately, that slowed the backend server system to a crawl. It would have taken weeks to get the data we needed.

I tried both AspectJ and JBoss AOP to produce a logging overlay and ran into significant problems deploying them in the older JBoss 4.0.5GA environment. This was not helped by the relatively nonstandard nature of our deployables.
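
For reference, the shape of such a logging overlay in annotation-style AspectJ is roughly the following (the package pattern is a placeholder, and this is a generic sketch rather than the exact code we deployed):

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

@Aspect
public class MethodEntryLogger {
    // Log every entry into the back-end code; exits are not interesting here.
    @Before("execution(* com.example.backend..*.*(..))")
    public void logEntry(JoinPoint joinPoint) {
        System.out.println(joinPoint.getSignature().toLongString());
    }
}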

Finally, we struck a gold mine. By using the YourKit profiler, which had minimal performance overhead, we were able to get the list of method calls that had been made. What made it especially easy was that the profiler had a feature that allowed me to generate a dump of the call tree once an hour.

Here is the address of the profiler people: http://www.yourkit.com/

I just want to note that we were able to do all this using the evaluation version and that it easily passed a five-minute test (in other words, we were able to get it up and doing real work in five minutes). We have already ordered a copy.

Of course, taking 48 hours of those dumps and manually exporting them to CSV was a royal pain. To the YourKit guys: that is a hint.

What kind of jungle am I in?

Currently I am working for a company that is doing a rapid re-factoring of an existing Java code base. The application uses JBoss for a J2EE EJB back end serviced by a Tomcat web application as the front end. The wonderful thing is that the code does work. It is, unfortunately, an incredible bear to maintain.

This is, of course, no different than the life of many other developers. But since I am not interested in either myself or my fellow developers suffering, we are slowly but surely re-factoring the code base so that it is easier to maintain and we are faster at turning around new features.

One of the key elements is that the original 3 developers appear to have been stuck in a room and let loose for nine months to a year. As a result, we have a code base that has a phenomenal amount of unused code. One engineer seemed to write a lot of code based on the idea that "it would be neat if the code did...". Another engineer wrote a lot of overly clever code rather than trying a brute force method to see if it would suffice. And the last engineer seemed to suffer from NIH ("Not Invented Here") syndrome and rewrote portions of the Java standard library as well as the Hibernate toolkit.

So my task has been to track down unused code.

My next post will talk about the tools we used to do that.