after the fiasco earlier this week, i've been taking steps to minimize the impact if tilde.team were to go down. it's still a large spof (single-point-of-failure), but i'm reasonably certain that at least the irc net will remain up and functional in the event of another outage.
the first thing that i set up was a handful of additional ircd nodes: see the tilde.chat wiki for a full list. slash.tilde.chat is on my personal vps, and bsd.tilde.chat is hosted on the bsd vps that i set up for tilde.team.
i added the ipv4 addresses for these machines, along with the ip for yourtilde.com, as A records for tilde.chat, creating a dns round-robin. running host tilde.chat will return all four. the records come back in a rotating, semi-random order, so a client that takes the first address can land on any one of them. this means that when connecting to tilde.chat on 6697 for irc, you might end up on any of {your,team,bsd,slash}.tilde.chat.
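as a rough sketch of what a client sees with a round-robin like this, the snippet below collects every ipv4 address that a name resolves to. the function name a_records and its resolver parameter are made up for illustration (the parameter just makes it possible to swap in a stub instead of hitting real dns); this isn't part of the actual network setup.

```python
import socket


def a_records(hostname, port=6697, resolver=socket.getaddrinfo):
    """return the sorted, de-duplicated ipv4 addresses that hostname resolves to.

    resolver is a hypothetical hook for illustration; it defaults to the
    standard library's socket.getaddrinfo.
    """
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # each entry is (family, type, proto, canonname, sockaddr);
    # for ipv4, sockaddr is an (ip, port) tuple
    return sorted({sockaddr[0] for *_, sockaddr in infos})
```

calling a_records("tilde.chat") should list the addresses behind the round-robin, assuming the local resolver passes all of the A records through. an irc client usually just connects to whichever address it's handed first, which is why you can end up on any of the nodes.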
this creates the additional problem that visiting the tilde.chat site will end up at any of those 4 machines in much the same way. for the moment, the site is deployed on all of the boxes, making site setup issues hard to debug. the solution to this problem is to use a subdomain as the round-robin host, as other networks like freenode do (see host chat.freenode.net for the list of servers).
i'm not sure how to make any of the other services more resilient. it's something that i have been researching and will continue to research moving forward.
the other main step that i have taken to prevent the same issue from happening again was to configure the firewall to drop outgoing requests to the private subnets defined in rfc 1918 (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16).
i'd like to consider at least this risk to be mitigated.
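to make the rfc 1918 part concrete, here's a small sketch that checks whether a destination address falls in one of those private ranges, i.e. whether a rule like the one described above would cover it. this is not the firewall config itself (which isn't shown here); the names RFC1918_NETWORKS and is_rfc1918 are made up for the example.

```python
import ipaddress

# the three private ipv4 ranges defined in rfc 1918
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """return True if address falls inside one of the rfc 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)
```

for example, is_rfc1918("192.168.1.10") is true and is_rfc1918("8.8.8.8") is false, so only traffic to the former would be dropped by a rule scoped to these subnets.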
thanks for reading,
~ben
update: the round-robin host is now irc.tilde.chat, which resolves the site issues that we were having due to the duplicated deployments.