Last night we tried crawling networks only once per hour and ended up with missing crawler data. Normally we crawl at the beginning of the hour, then again 30 minutes later to pick up any networks that were missed.
Due to a slight clock difference between the crawl server and the database server, some networks were being indexed at the end of the hour.
Each crawl visits the networks in the reverse order of the previous crawl: if Freenode was first one hour, it will be last the next, and so on. As a result, data on large networks was being crawled at the end of the hour every other hour.
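To make the alternating order concrete, here is a minimal sketch. It is not the actual crawler code: the function name `crawl_order`, the `run_index` counter, and every network name other than Freenode are hypothetical and used only for illustration.

```python
def crawl_order(networks, run_index):
    """Return the crawl order for a given hourly run.

    Assumption for illustration: even-numbered runs go through the list
    forwards, odd-numbered runs go through it backwards.
    """
    ordered = list(networks)
    return ordered if run_index % 2 == 0 else ordered[::-1]


# Hypothetical network list; only Freenode is named in the post.
networks = ["Freenode", "ExampleNet-A", "ExampleNet-B"]

for run_index in range(4):
    order = crawl_order(networks, run_index)
    # Freenode alternates between the first and the last slot.
    print(run_index, order, "Freenode at index", order.index("Freenode"))
```

Under these assumptions a network at one end of the list swaps between the first and last position on every run, which is why the same large networks were affected every other hour.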
Conclusion / Fix
We have set up an NTP client daemon on both servers, and in case of future time desyncs we now start indexing 1 minute after the hour. We will be monitoring the situation, and if this becomes a problem we will set it back.
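The effect of the 1-minute offset can be sketched as follows. This is an illustration under stated assumptions, not the production scheduler: the helper names (`next_crawl_start`, `hour_bucket`, `lands_in_intended_hour`) are hypothetical, and the skew values are made up, since the size and direction of the real clock difference were not measured here.

```python
from datetime import datetime, timedelta

# From the post: indexing now starts 1 minute after the hour.
START_OFFSET = timedelta(minutes=1)


def next_crawl_start(now):
    """Return the next 'one minute past the hour' strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    candidate = top_of_hour + START_OFFSET
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


def hour_bucket(ts):
    """Truncate a timestamp to the hour it belongs to."""
    return ts.replace(minute=0, second=0, microsecond=0)


def lands_in_intended_hour(crawl_start, db_clock_skew):
    """True if the database's skewed clock still sees the same hour.

    `db_clock_skew` is a hypothetical offset of the database server's
    clock relative to the crawl server's clock.
    """
    return hour_bucket(crawl_start + db_clock_skew) == hour_bucket(crawl_start)


skew = timedelta(seconds=-20)  # assumed: database clock 20 s behind
on_the_hour = datetime(2024, 1, 1, 12, 0, 0)
one_past = datetime(2024, 1, 1, 12, 1, 0)

print(lands_in_intended_hour(on_the_hour, skew))  # crawl starting exactly on the hour
print(lands_in_intended_hour(one_past, skew))     # crawl starting 1 minute past
```

With these made-up numbers, a crawl that starts exactly on the hour is stamped in the previous hour by the database, while one that starts a minute later is not. The margin only covers skews smaller than one minute, which is why keeping both clocks synced with NTP is still the primary fix.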