Cloud and the Commonwealth Games

Introduction

Well, unfortunately, in these days of Cloud Computing, we are still faced with IT infrastructures which fail if too many users access them at the same time. This has happened this week with the Glasgow Commonwealth Games, where the ticketing system failed over two days, followed by a lock-down for another week before it would be back on-line. To say that this is an embarrassment, given the significant investment, is an understatement. With Cloud services so inexpensive these days, and with the ability to spin up new instances in the Cloud to cope with demand, something somewhere didn't quite integrate well. What is likely is that it was the initial gating of requests that brought the system down, as the back-end, provided by organisations well used to large-scale ticket sales, should be fairly robust and well tested.

The key element of any system of this sort is to make sure that we avoid Big Bang scenarios, where many users converge on the system at the same time and compete with each other for resources. An obvious solution, without the need for any complex IT architecture, would have been to stagger the sales of each event over different time periods, on different days, and to inform users well in advance. Where possible, we estimate demand and make sure we have tested the infrastructure for that level of demand. Users are also told in advance how the process will be handled, and the system should be as fair as possible in guiding each user to the end result.

The overall problem is that we have a choke point in the system. It is similar to two wide motorways being linked by a single-carriageway road. At either end we have the potential for good throughput, but the bit in between provides our choke point and reduces throughput. If more cars arrive at the input than are leaving at the output, we end up with gridlock, which is what we have seen with the Commonwealth Games ticketing system (Figure 1).


Figure 1: Choke point
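
As a back-of-the-envelope illustration of why a choke point gives way so quickly, here is a minimal Python sketch; the arrival and service rates are made-up numbers, not figures from the actual system:

```python
# Minimal sketch: how a choke point backs up when arrivals outpace departures.
# The rates below are invented purely for illustration.
arrival_rate = 500    # requests arriving per second at the choke point
service_rate = 200    # requests the choke point can process per second

queue_length = 0
for second in range(1, 11):
    queue_length += arrival_rate - service_rate   # net growth each second
    print(f"after {second:2d}s the backlog is {queue_length} requests")
```

After only ten seconds the backlog is 3,000 requests, and it keeps growing for as long as the arrival rate exceeds the service rate.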

Load balancing and dynamic response

One of the key things in any Web infrastructure is to have a back-end which can dynamically respond to user demand, especially during peak times, as you often have just one opportunity to gain users' trust and make them feel well supported. In a Cloud environment we can create new instances to cope with increased numbers of users, where new Web servers are started up if the demand gets too high. With the Commonwealth Games problem, it is likely that the front-end interface to the ticketing back-end was the single point of failure, and that it couldn't cope with all the new requests coming in at the same time.
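
As a rough sketch of what that dynamic response might look like (the capacities, thresholds and helper functions here are invented for illustration, not the API of any particular cloud provider or of the Games' actual system), an auto-scaling loop could be as simple as:

```python
import random
import time

MAX_RPS_PER_INSTANCE = 200        # assumed capacity of one web server
running_instances = 2             # assumed starting pool


def current_requests_per_second() -> float:
    # Placeholder: a real system would read this from its monitoring.
    return random.uniform(100, 2000)


def start_web_server_instance() -> None:
    print("scaling out: starting a new web server instance")


def stop_web_server_instance() -> None:
    print("scaling in: stopping an idle web server instance")


for _ in range(5):                # a real controller would loop forever
    load = current_requests_per_second()
    # Round up to the number of instances needed, keeping a minimum of two.
    needed = max(2, (int(load) + MAX_RPS_PER_INSTANCE - 1) // MAX_RPS_PER_INSTANCE)

    while running_instances < needed:
        start_web_server_instance()
        running_instances += 1
    while running_instances > needed:
        stop_web_server_instance()
        running_instances -= 1

    print(f"load={load:.0f} rps -> {running_instances} instances")
    time.sleep(1)                 # a real controller might re-check every 30s
```

The point is not the exact numbers, but that every tier of the system, including the bit in the middle, needs this kind of elasticity.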

Figure 2 outlines the problem. The back-end ticketing infrastructure is likely to cope well with large-scale demand for tickets, and can create the processing power to handle the large loads often seen when purchasing tickets on-line. The front-end, the user interface, is probably also optimized, with new Web server instances started up to cope. But the part in the middle, which takes the user's details, is probably the weakest link in the chain, and the part which hasn't been tested with large-scale user access. This could have been fixed fairly easily by dynamically creating new instances in the Cloud to cope with the traffic between the front-end and the back-end.

This buffering effect is used in many real-life systems, such as a restaurant, where the head waiter, when all the tables are occupied, will create a waiting area for new arrivals and estimate how long it will take to clear the tables. As long as the number arriving is roughly equal to the number leaving, everything will stay in balance, provided the waiting area is large enough to hold all of the people arriving. If the waiter accepts too many, the waiting area will overflow, and people will get annoyed and either leave or wait for a long time. If they leave, it is probably a good thing for the current problem, but bad for the restaurant's reputation, as those customers are unlikely to return. If they stay, the staff will probably become over-burdened by the waiting customers and be distracted from the paying ones, and the whole restaurant moves towards a high-stress point, at which point everyone gets upset.

This can happen in an on-line ticketing system. The restaurant might stall customers by showing them the menu, asking them to pre-order, getting them to fill in a few customer questions, or offering them a drink. They may also have several waiting areas, which can be opened up when the demand gets too high. If the waiting rooms become too busy, the restaurant might have to tell customers who have just joined the queue to leave and come back later. In this way, the customers at the front of the queue, who have been waiting the longest, will not get upset. Basically, we normally take a FIFO approach: First In, First Out.

Figure 2: Basic architecture for ticketing
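
The waiting area in the restaurant analogy is essentially a bounded FIFO queue. A minimal sketch, with an assumed capacity and invented messages, might look like this:

```python
from collections import deque

# Sketch of a bounded first-in, first-out waiting area. When it overflows,
# new arrivals are turned away and asked to come back later, rather than
# being allowed to pile up and overwhelm the serving staff.
WAITING_AREA_CAPACITY = 100       # assumed size of the "waiting room"
waiting_area = deque()


def customer_arrives(customer_id: str) -> str:
    if len(waiting_area) >= WAITING_AREA_CAPACITY:
        return f"{customer_id}: sorry, we are full - please come back later"
    waiting_area.append(customer_id)
    return f"{customer_id}: you are number {len(waiting_area)} in the queue"


def table_becomes_free():
    # FIFO: the customer who has been waiting longest is served first.
    return waiting_area.popleft() if waiting_area else None


print(customer_arrives("alice"))
print(customer_arrives("bob"))
print(table_becomes_free())       # alice is served first: First In, First Out
```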

Slowing down the user and getting rid of time-wasters

One thing you must guard against is too many users accessing the same Web service at the same time, as it could crash it (by using up too much memory or computing power), which can then crash other elements of the system. The most obvious way to slow the number of accesses down would have been to stagger the ticket sales into time gates, spread over many days, where users could target their requests based on their interests. With several events being advertised at the same time, a check-at-the-gate approach can be used, where the customer defines which ticket they want before moving on to the purchasing back-end. In this way, popular events do not block less popular ones, and the throughput is increased.
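
One way to picture the check-at-the-gate idea is a simple lookup which only lets a request through to the purchasing back-end during its event's own sales window; the events and dates below are invented purely for illustration:

```python
from datetime import datetime

# Hypothetical sales windows: each event only goes on sale during its own
# time gate, so demand for a popular event cannot block the others.
SALES_WINDOWS = {
    "athletics": (datetime(2014, 5, 12, 9), datetime(2014, 5, 12, 17)),
    "swimming":  (datetime(2014, 5, 13, 9), datetime(2014, 5, 13, 17)),
    "cycling":   (datetime(2014, 5, 14, 9), datetime(2014, 5, 14, 17)),
}


def check_at_the_gate(event: str, now: datetime) -> str:
    opens, closes = SALES_WINDOWS[event]
    if now < opens:
        return f"{event}: not on sale yet - sales open at {opens:%d %B, %H:%M}"
    if now > closes:
        return f"{event}: this sales window has closed"
    return f"{event}: through the gate - continue to the purchasing back-end"


print(check_at_the_gate("swimming", datetime(2014, 5, 13, 10, 30)))
```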

Another method that can be used is to define several checkpoints in the ordering process, in order to slow the users down and let them race each other to the end result. Some will get through quicker, and it also gets rid of those who are probably not actually going to purchase tickets. If we say that the tickets will go on-line at 9am on Monday morning, users will often be waiting and cause an avalanche of requests at that specific time, which is similar to a Distributed Denial of Service (DDoS) attack against the system. Within a well-managed infrastructure there can be a filtering system for the initial surge, where the first requests to purchase tickets are absorbed, and the user is then slowed down by being asked for a few details, which they will complete at different speeds, and which allow the system to channel their responses. For example, before purchasing the tickets, the system could ask a fun question:

Do you know where the last Commonwealth Games were held?

A. Delhi.
B. Edinburgh.
C. London.

This slows down the user and, possibly, allows eager users to progress to purchasing their tickets. If someone is just time-wasting, they will often stop at this point, as it is not worth their while progressing. Every user will then finish the challenge at a different time. If the system is extremely busy, more questions, or little bits of interesting information, can then be sent to the user. This "stalling" is driven by analysing the queue: if more people are coming in than are leaving, the stalling must increase, in order to slow the intake down. As shown in Figure 3, we thus get the concept of a buffer zone, which watches the throughput of incoming against outgoing users and stalls them accordingly. The system can define the number of users allowed into the buffer zone, and stop too many entering it. When the system is too busy, users are told to try again within a time period which the system estimates. In this way users go off and make a cup of tea and come back; otherwise they will just keep trying again, and the whole thing snowballs.


Figure 3: Stalling the user
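
The buffer zone in Figure 3 can be sketched as a small decision rule which compares arrivals with departures and increases the stalling as the zone fills up. All of the numbers and messages here are illustrative assumptions:

```python
# Sketch of the buffer zone: compare arrivals with departures and increase
# the "stalling" when the zone starts to fill up. Capacity and rates are
# assumed values, not figures from the actual ticketing system.
BUFFER_CAPACITY = 5000            # assumed maximum users allowed in the zone


def decide_stalling(arrivals_per_min: int, departures_per_min: int,
                    users_in_buffer: int) -> str:
    if users_in_buffer >= BUFFER_CAPACITY:
        # Too busy: estimate a return time instead of letting more users in.
        backlog = users_in_buffer - BUFFER_CAPACITY
        wait_minutes = max(5, backlog // max(departures_per_min, 1))
        return f"buffer full - ask new users to come back in about {wait_minutes} minutes"
    if arrivals_per_min > departures_per_min:
        # More coming in than leaving: slow the intake with extra questions.
        return "add another question or snippet of information to stall users"
    return "in balance - let users straight through"


print(decide_stalling(arrivals_per_min=900, departures_per_min=300,
                      users_in_buffer=4200))
print(decide_stalling(arrivals_per_min=900, departures_per_min=300,
                      users_in_buffer=6000))
```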

Test for real-life user activity

An obvious failing in the Commonwealth Games system, and in many other on-line systems, is that they are often only validated as working okay under light user activity, and are not tested with significantly higher loadings and with different types of users. This is similar to only testing a car in the spring, and forgetting that the car's engine will have to face high temperatures in the summer and low ones in the winter. We might also test the car for city driving over short periods, and forget that it could be driven at its top speed for extended periods. In IT, for some reason, many architectures forget about the surge, or increased usage of the system, at peak times. A good part of developing a system is thus the ability to simulate these types of events, and to make sure the system will cope.

With cloud environments, we can dynamically create a load-sharing environment for any part of the system, create data stores on demand, and then collapse them when it is less busy. Supermarkets do this: they monitor waiting times at the checkout, and if the waiting time gets too long they deploy their staff to open up new checkouts to reduce it, closing them again once the demand has been dealt with. In the Commonwealth Games architecture, there was probably extensive load balancing on the front-end and the back-end, but it could have been the bit in the middle which broke the system.
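
A first pass at this kind of spike test can be as simple as firing many concurrent requests at a staging copy of the site and timing the responses; the URL and the number of simulated users below are placeholders:

```python
import concurrent.futures
import time
import urllib.request

# A very simple spike test: fire many requests at once at a *staging* copy
# of the site and measure how response times degrade. The URL and the
# number of simulated users are placeholders, not real values.
TARGET_URL = "https://staging.example.com/tickets"
SIMULATED_USERS = 500


def one_user(_: int) -> float:
    start = time.monotonic()
    with urllib.request.urlopen(TARGET_URL, timeout=30) as response:
        response.read()
    return time.monotonic() - start


with concurrent.futures.ThreadPoolExecutor(max_workers=SIMULATED_USERS) as pool:
    timings = sorted(pool.map(one_user, range(SIMULATED_USERS)))

print(f"median response: {timings[len(timings) // 2]:.2f}s, "
      f"worst response: {timings[-1]:.2f}s")
```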

Another method that can be used, if the system knows there will be a long wait, is to give the user a working space and a time to come back and resume their connection. This is a bit like the ticketing system at a cheese counter, where customers take a ticket and come back when it is their turn. This can be a simple counter-based system (1, 2, 3, …) or a timed one (come back in 5 minutes, and we'll be ready). In an IT system, the system basically says that it is running 15 minutes behind, so log off, come back in 15 minutes, and we'll be ready for you, with no queue jumping.
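
The cheese-counter scheme amounts to handing out numbered tokens together with an estimated return time. A minimal sketch, where the service rate and queue length are assumed values:

```python
from datetime import datetime, timedelta

# Sketch of the "cheese counter" scheme: hand each new arrival a numbered
# token plus an estimated time to come back, instead of making them wait on
# a live page. The service rate and backlog below are assumed values.
USERS_SERVED_PER_MINUTE = 50      # assumed throughput of the purchasing back-end
users_already_queued = 750        # assumed backlog when this user arrives


def issue_token(position_in_queue: int, now: datetime) -> str:
    wait = timedelta(minutes=position_in_queue / USERS_SERVED_PER_MINUTE)
    return (f"Token {position_in_queue}: log off and come back at "
            f"{(now + wait):%H:%M} - your place is kept, with no queue jumping")


print(issue_token(users_already_queued + 1, datetime(2014, 7, 28, 9, 0)))
# e.g. "Token 751: log off and come back at 09:15 - your place is kept ..."
```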

Lessons learnt

The lesson learnt is thus to test the system fully, and to cope with spikes of demand. If you know there will be a wait, take some users on and tell the others to come back later. Users are normally quite happy to wait, as long as they know the situation. To let everyone in, tell them the wait will be up to 60 minutes, and then leave them waiting for over seven hours with nothing but a little spinning circle for interaction, is not a good way of engaging a user.
