
Cron is the time-based job scheduler found on Unix-like operating systems. It runs predefined jobs or scripts on a consistent schedule. Cron is a simple but excellent piece of software that countless software companies rely on. But what happens when a job or script that cron runs fails during execution?


Silent failure

I have been working on a side project and using it myself for a little while. I have now made it available as a service, and it is accessible here:

https://www.liveping.io/

liveping-1.jpg
Home page of LivePing

LivePing is a heartbeat monitoring tool that helps check workloads that don't accept HTTP requests, like cron jobs and background workers.
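As a rough illustration of the heartbeat idea, here is a minimal Python sketch of a job wrapper that pings a monitor only when the job succeeds. The URL is a placeholder, since this post does not specify LivePing's actual ping endpoint, and the names `http_ping` and `run_with_heartbeat` are hypothetical, made up for this example.

```python
import urllib.request
from typing import Callable

# Placeholder endpoint (assumption): the real ping URL format is not given in this post.
PING_URL = "https://example.invalid/ping/my-nightly-backup"


def http_ping(url: str = PING_URL, timeout: float = 10.0) -> None:
    """Send a GET request to the heartbeat URL; the response body is ignored."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job: Callable[[], None], ping: Callable[[], None]) -> bool:
    """Run `job` and call `ping` only if the job finished without raising.

    Returns True when the job succeeded. A failure of the ping itself is
    swallowed so that trouble reaching the monitor never breaks the job.
    """
    try:
        job()
    except Exception:
        # No ping is sent, so the monitor sees a missed heartbeat and can alert.
        return False
    try:
        ping()
    except Exception:
        pass  # The job succeeded; a lost ping shows up as a missed heartbeat.
    return True
```

A cron-driven script could call `run_with_heartbeat(do_backup, http_ping)` and exit non-zero when it returns False. The key design choice is that the absence of a ping is the failure signal, so a job that crashes, hangs, or never starts is still noticed.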


So, what's the big deal about cron anyway? There's a tweet that jokes about the importance of cron in the economy, but the impact it has is no joke. While a lot of companies rely on cron, there is also a problem to consider when operating cron jobs.

How do you know when they are not doing their job?

Background workloads like cron jobs tend to fail silently, and that has led to multiple issues for me in the past:

  • Database backups were not running
  • External-facing SSL/TLS certificates did not get renewed and caused a site outage
  • Retrained machine learning models did not get deployed
  • Internal services went down silently (related to internal TLS), creating a cascading effect on other services
  • Monthly email reports for customers did not get generated and sent out

These are just a few of many examples.

Sometimes a failure doesn't even cause visible side effects, and you only find out much later that the workload has not been running for an extended period of time.

fire-fine.jpg

Attempts in the past

I've solved this problem a couple of times in the past as an internal tool, but it always comes back to the same maintenance priority.

Since it's a monitoring system, it needs to have a decent amount of uptime, but guaranteeing that is typically not something you do single-handedly. At least, not if you're the solo member, or one of the few members, of a small SRE team with a thousand other things to handle and automate. Not to mention unexpected incidents.

Even with an internal tooling team at a larger company, that team is likely to be dealing with other projects, such as language frameworks and CI/CD pipelines, and to have other responsibilities. When the team is trying to support more visible, direct tooling for other engineers, maintaining a heartbeat monitoring system tends to get de-prioritized.

Another problem is ownership of the system. This is not a problem while you're still with the company, but it is likely to emerge after you leave.

Developers are not necessarily good writers, and there are always gaps between documentation and the running systems. I'm guilty of that too. So things get missed, and systems can reach critical states without being noticed after a certain amount of time.

And last but not least, you get into the "who's going to watch the watchdog" problem. :)

So you might be thinking now, "I can see that operating an internal tool can be burdensome, so why not just use Cronitor or Dead Man's Snitch?"

Yup, I've tried those, and yes, they do fulfill the bare minimum requirements I need. But they have their own set of problems, and I might touch on that in the future.

Problems LivePing solves (for now)

First and foremost, LivePing solves my own problem: I no longer need to create yet another monitoring system that doesn't age well, or do roughly the same thing over and over again at different places. ;)

It also lets me offer something I'm good at, reliability and scalability, to a broader audience.

Enough about my issues. Let's see what LivePing can do for you.

  • Eliminates silent failures that can lead to disastrous situations
  • Provides alerts before problems start to snowball
  • Prevents the issues I've listed earlier, like missed DB backups and TLS certificate updates

LivePing can also help you make sure these scheduled tasks are running and working as expected:

  • Data synchronization pipelines
  • Data aggregation, garbage collection
  • System/vulnerability scans
liveping-2.jpg
List of pings registered
liveping-3.jpg
Details of a monitored ping

While LivePing is functional at this point with Email and Slack as alerting targets, it is still missing many features and integrations.

Some of the items on the near-term roadmap are:

  • PagerDuty alert integration
  • Custom webhooks
  • Grace period before alerting
  • Temporarily silencing alerts
  • API access and documentation
  • OAuth2 authentication

And there is much more to come.
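To make the grace-period idea concrete, here is a minimal Python sketch of how a heartbeat monitor could classify a check based on its last ping. This is an illustration under my own assumptions, not LivePing's implementation; the function name `heartbeat_status` and the three status labels are hypothetical.

```python
from datetime import datetime, timedelta


def heartbeat_status(last_ping: datetime, interval: timedelta,
                     grace: timedelta, now: datetime) -> str:
    """Classify a check from the time of its last ping.

    "up"   : the next ping is not due yet
    "late" : the ping is overdue but still inside the grace period
    "down" : the grace period has also passed, so an alert should fire
    """
    due = last_ping + interval
    if now <= due:
        return "up"
    if now <= due + grace:
        return "late"
    return "down"
```

For an hourly job with a 10-minute grace period, a ping that arrives 65 minutes after the previous one would be treated as late rather than down, which avoids alerting on jobs that simply run a little long.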

I'm working hard to improve and automate the system so that it is reliable and highly available, and I would appreciate your feedback on using it. You can find forms for feedback and feature requests on the contact page.

In an effort to make it accessible, LivePing comes with 5 pings for free and doesn't require credit card information to sign up. Any plans going forward will have a 14-day trial period. You can see the pricing details on the pricing page.

What's next?

In the short term, I plan to focus on adding features that solve problems I've encountered in the past, including scalability issues.

But in the long term, I want to create a system that can grow with the team's needs. Initially, that may be just simple cron monitoring for the small number of jobs a company has. Then it moves up to dividing jobs between teams, access controls, and dashboards. Eventually, it becomes a system that is easy to integrate into the automation of services.

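When it comes to the observability of systems, monitoring is just a subset of it. And monitoring the uptime of a long-running service requires a different approach than monitoring distributed scheduled jobs: a service can be probed or scraped at any time, while a short-lived job has to report in by itself. That's why even Prometheus, which is pull-based, has the Pushgateway.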
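For readers who go the Prometheus route, a short-lived job can report in by pushing a metric to a Pushgateway over plain HTTP. The sketch below uses only the Python standard library and assumes a Pushgateway reachable at a base URL you supply. The metric name `job_last_success_unixtime` and the function names are illustrative choices of mine, and the sketch assumes the job name contains no `/`.

```python
import time
import urllib.request
from urllib.parse import quote


def pushgateway_url(base: str, job: str) -> str:
    """Build the Pushgateway grouping URL for a job name (no '/' in the name)."""
    return f"{base.rstrip('/')}/metrics/job/{quote(job, safe='')}"


def last_success_payload(timestamp: float) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    name = "job_last_success_unixtime"  # illustrative metric name (assumption)
    return f"# TYPE {name} gauge\n{name} {int(timestamp)}\n".encode()


def push_last_success(base: str, job: str) -> None:
    """PUT the current time as the job's last-success timestamp."""
    request = urllib.request.Request(
        pushgateway_url(base, job),
        data=last_success_payload(time.time()),
        method="PUT",  # PUT replaces all metrics in this job's group
    )
    with urllib.request.urlopen(request, timeout=10):
        pass
```

A job would call `push_last_success("http://localhost:9091", "nightly-backup")` as its final step, and an alerting rule could then fire when the pushed timestamp gets too old. You still have to run and watch Prometheus, the Pushgateway, and the alerting path yourself.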

If you're familiar with Prometheus, which is an excellent piece of software, by all means, go with it. If not, or if you don't have the resources to set it up and maintain it, give LivePing a try!

If you have any questions or simply just want to chat, please reach out!

Thanks for reading! :)