Missing Crawler Data Post Mortem

Incident Details

Last night we tried crawling networks only once per hour and wound up with missing crawler data. Normally we crawl at the beginning of the hour, then again 30 minutes later to pick up any networks that were missed.
Because of a slight time difference between the crawl server and the database server, some networks were being indexed at the end of the hour instead of at the beginning.
We reverse the crawl order on every run: if Freenode was first one hour, it will be last the next, and so on. As a result, data for large networks was being crawled at the end of the hour every other hour.
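
The alternating order described above can be sketched roughly as follows. This is only an illustration, not the actual IRC-Source crawler: the function name crawl_order and the network names other than Freenode are made up, and it assumes the list is simply reversed on each hourly run.

```python
def crawl_order(networks, hour_index):
    """Return the crawl order for one hourly run.

    Assumption for illustration: even-numbered runs use the list as given,
    odd-numbered runs use it reversed.
    """
    if hour_index % 2 == 0:
        return list(networks)
    return list(reversed(networks))


# Hypothetical network list; only Freenode comes from the post.
networks = ["Freenode", "ExampleNetA", "ExampleNetB"]

# Position of Freenode in the crawl order over four consecutive hours.
positions = [crawl_order(networks, hour).index("Freenode") for hour in range(4)]
print(positions)  # [0, 2, 0, 2]
```

A network at one end of the list swaps between first and last on every run, which is why the same large networks were affected every other hour.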

Conclusion / Fix

We have installed an NTP client daemon on both servers. In case the clocks drift apart again in the future, we also now start indexing 1 minute after the hour. We will be monitoring the situation, and if this causes problems we will set it back.
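
To show why a 1 minute delay gives some headroom, here is a small sketch of how a clock offset can push a record into the wrong hour. The function name db_hour_bucket, the 20 second offset, and the example date are all assumptions chosen for illustration; the real skew and the real indexing logic are not described in this post.

```python
from datetime import datetime, timedelta


def db_hour_bucket(crawler_time, db_offset):
    """Hour under which a record would be filed, as seen by the database clock.

    db_offset is the assumed difference (database clock minus crawler clock).
    """
    db_time = crawler_time + db_offset
    return db_time.replace(minute=0, second=0, microsecond=0)


# Assumed skew: the database clock runs 20 seconds behind the crawl server.
skew = timedelta(seconds=-20)

on_the_hour = datetime(2014, 6, 1, 14, 0, 0)      # crawl started exactly on the hour
delayed = on_the_hour + timedelta(minutes=1)      # crawl started 1 minute later

print(db_hour_bucket(on_the_hour, skew).hour)  # 13, filed under the previous hour
print(db_hour_bucket(delayed, skew).hour)      # 14, filed under the intended hour
```

Under these assumptions the delayed start tolerates any offset smaller than one minute, so it acts as a safety net if NTP ever stops keeping the two clocks in step.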

xnite

I am the founder and lead developer of IRC-Source. I started IRC-Source in May of 2014 because, although I don't use it as much anymore, I still have a deep interest in IRC. I also have a deep interest in web development, and wanted to write something that would be challenging and would let me bring my passions together in a way that would be interesting for others. I run a network called BuddyIM (irc.buddy.im); it's pretty cool, you should chill there.
