• Four years running Server Check.in
  • It's been nearly four years since I launched Server Check.in. A lot has changed since then, but the core of the service is exactly the same: focus on simplicity, and send alerts for servers or websites via email and/or SMS.


    I started this service as a ridiculously cheap alternative to much more expensive (and feature-heavy) services like Pingdom, Ruxit, NewRelic, AppNeta, etc. My idea in starting Server Check.in was simple: I needed a cheap way to monitor all my servers and get a reliable SMS notification every time one goes down, from the same phone number each time.


    Anything else is icing for many of the servers and services I run, and I often supplement Server Check.in with open source monitoring tools like Munin and Elasticsearch/Logstash/Kibana if I need more in-depth monitoring.


    Many people have asked a few other questions about the service, though, and I thought I'd answer them here:


    How is business? How many customers do you have?


    A lot of people have asked about this, and in the past I've been coy with hard data. But it's been four years, the service is running strong, and I don't think hockey-stick growth is in the cards, so here are the numbers.


    In the first year, I grew to about 70 subscribers with only the $15/year plan—around $1,000 in Annual Recurring Revenue (ARR). The subscriber base has grown to over 120 (with no marketing outside of word-of-mouth), and now that I've added a $48/year plan, ARR is about $2,100. This is nowhere near a life-sustaining amount of money, but my goals early on were to basically build an SMS-sending uptime monitoring service for dozens of my own servers—not to make any major profits.


    The costs to run the service were (and still are) minimal (besides my time, of course!):


    • DigitalOcean prod Drupal 2GB droplet: $20/month

    • DigitalOcean hot spare Drupal 512MB Droplet: $5/month

    • 8-10 globally-distributed Low End Box-style servers: ~$120/year

    • 'Close-to-unlimited' international Twilio SMS: ~$40/month

    Annual recurring revenue: ~$2,100
    Annual expenses: ~$900
    Annual profit: ~$1,200
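
    As a quick sanity check, the totals above can be reproduced from the line items. This is only a sketch of the arithmetic in Python: the dictionary keys are labels made up for this example, and the check servers are taken at the stated ~$120/year total rather than a per-server price.

```python
# Monthly costs in USD, taken from the list above.
# The key names are illustrative labels, not anything used by the service.
monthly_costs = {
    "digitalocean_prod_2gb": 20,
    "digitalocean_hot_spare_512mb": 5,
    "twilio_sms": 40,
}

# The globally distributed check servers are quoted per year, not per month.
yearly_costs = {"low_end_box_check_servers": 120}

annual_revenue = 2100  # approximate ARR quoted above

annual_expenses = 12 * sum(monthly_costs.values()) + sum(yearly_costs.values())
annual_profit = annual_revenue - annual_expenses

print(f"Annual expenses: ~${annual_expenses}")
print(f"Annual profit:   ~${annual_profit}")
```

    Twilio is the largest single line item here, at more than half of the yearly expenses.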


    It's not a huge deal, but it is enough to give me spare cash to do things like build Raspberry Pi Clusters for fun and education, buy little things like a nice new trackpad and keyboard every year, and pay for a bunch of other servers used for nonprofits or local user groups.


    If I were in this for the profit, I would have to work a lot on marketing, increase the plan pricing, stop using 'real' SMS (which costs money) and fall back to the email-to-SMS gateways most of the free or cheaper services use, and push more people towards the higher-profit-margin plans.


    As it is, for the past two years I've spent minimal time doing anything besides ongoing maintenance, so I'm probably going to keep things as-is at least until I work on a couple of long-term goals, like a Drupal 8 and API-driven replatform, and moving all the backend servers to a Go app instead of Node.js.


    On the technical side, the service performs over 100,000 individual checks per day, tracks an average of 150 outages per day, and monitors over 500 servers.


    Do you run the service by yourself? Is it self-sustaining?


    Yes, and yes—but there's no way I could quit my day job with Server Check.in alone. Throw in Hosted Apache Solr and writing projects like Ansible for DevOps, and it might be more plausible.


    But to turn a SaaS product into something that sustains more than one person requires a higher profit margin and a lot more marketing.


    What are some of the best decisions, in hindsight?


    The original version of the site ran on PHP alone, with one server performing all checks and sending all notifications. Rebuilding the server checking functionality as a microservice that ran on Node.js allowed me to architect it better for scalability (see older post, Moving functionality to Node.js).
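
    To give a rough idea of what a single check in such a service involves, here is a minimal sketch. It is written in Python rather than Node.js and is not the code Server Check.in runs; the function names, the 10-second timeout, and the "any status below 400 counts as up" rule are all assumptions made for illustration.

```python
import urllib.request


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP status below 400.

    `opener` is injectable (a hypothetical convenience for this sketch)
    so the function can be exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # urllib's URLError and HTTPError, connection failures, and
        # timeouts are all OSError subclasses, so this covers refused
        # connections, DNS failures, timeouts, and 4xx/5xx responses.
        return False


def should_alert(was_up, is_up):
    """Alert only on an up-to-down transition, not on every failed check."""
    return was_up and not is_up
```

    Keeping the check itself a small, stateless function is one reason this kind of work moves easily onto separate check servers: each server only needs a list of URLs and somewhere to report results.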


    After that move, I also migrated the entire shell-scripted server build process to Ansible, so all servers are managed with an 'infrastructure as code' approach. I can now spin up a new check server, no matter what low-cost hosting provider I rent space from, in about 5 minutes. And I can also bring up the entire stack locally using Vagrant and Ansible for testing in a matter of 10 minutes or so (even allowing me to test situations like high network latency between servers, all on my local Mac).


    Another decision I made at the outset was to use Stripe for payment processing instead of some of the other payment processors I had used for other services in the past, like PayPal or Authorize.net. Stripe's developer-centered focus attracted me, and the UX and low fees are bonuses. Stripe has been reliable, easier to deal with than any other payment processor I've used, and integrating automated tests with Stripe's built-in test environment is a breeze!


    Finally, one decision which I've waffled on, but in the end is probably best, is that I used Drupal to build the site's API and front end. Drupal is my golden hammer, but that doesn't mean bending a system like Drupal to do something that might be better suited to a smaller framework or even a different language entirely is a bad thing. It works for me, it's been highly resilient (even running everything from the UI to the API backend), and it's aged well, since the front-end of the site is extremely focused. I just needed a system to allow user registration and access management, content management, and content display—Drupal handily checks off those boxes and is plenty fast.


    Why haven't you offered a free tier?


    This was a decision I've gone back and forth on dozens of times, but I always end up sticking with no-free-plan. There are dozens of server uptime monitoring systems that offer a free tier—usually with just email, checks only every 5 or 15 minutes, and if any SMS, only 'email-to-SMS' gateways instead of SMS from a real phone number (since the latter costs money).


    I almost implemented the same thing for Server Check.in, but I realized that every minute I spent working on that plan, and spent helping customers who never planned on converting to the paid plan, was a minute taken away from helping the people who make the service possible—paid clients.


    In addition, having a free plan early on would've likely killed the service on the scalability front, because it took a solid year or so to iron out all the wrinkles involved in building a resilient, distributed microservices-based architecture for the server checking backend.


    What makes Server Check.in better/different than free uptime monitors?


    From the beginning, many people have asked the question "if there are already [X] number of [server uptime monitors], why build a new one?" (This question is asked of almost any new software product these days.) There are some small differentiators, of course, like:


    • SMS messages sent via Twilio from a unique, consistent phone number (so you can set a ringtone for notifications, and receive SMS in any country)

    • 1 minute check frequency

    • Extremely low price ($15/year for 5 servers)

    But the main reason was stated earlier: I wanted something like Pingdom et al., but with just one feature (tell me when my server's down) for all my servers, without having to pay a ton of money. So I built the service for myself, and would gladly pay myself $15/year for the service (though I've graduated my own plan to a 50-user plan for $48/year!).


    How do you stay motivated to keep the service running after four years?


    I've seen so many 'side-projects-as-services' that don't gain much traction and are then abandoned a year or two later. But because I use it myself, and because I've set it up to be very low-maintenance, I've been able to keep things running without a hitch for four years (and counting).


    It's also continually fresh for me, as I can try out a new language or technique, or hone my skills on a particular feature. For example, in 2016, even though I've only worked on the site maybe 20 hours total, I've been able to reduce the mean authenticated page load time (as measured by GTmetrix and Pingdom) from 2 seconds to under 1 second.


    I also use the service as a test bed for Ansible, and have used the experience to improve and expand my book, Ansible for DevOps.


    Related


    Here are a couple other posts from pivotal moments in Server Check.in's lifespan:


    I'll likely do another retrospective in 2018 or so—see you then!


    - Jeff Geerling