The Milibo Cloud Experience Page |1

“Elastic Apps” using Amazon EC2

The Milibo Experience

1. Introduction
This document chronicles the experiences of my company in moving from a dedicated server
environment to the Amazon EC2 Cloud environment. We started out skeptical of the cloud
environment and have become ardent proponents of its value.

This document is organized as follows:

First, we will describe the Milibo application and environment.

Second, we will describe the fundamental challenges of moving from a dedicated server
environment to a cloud environment.

Third, we will describe the solution Milibo chose to address the key challenges presented by the
cloud environment.

2. The Milibo Application


Milibo operates a high-performance talent society. That is, top-performing individuals join Milibo so
that they can find (and be found for) better opportunities. Membership is by invitation only, and
invitations can only be obtained from thought leaders, industry experts, and top
faculty/staff at leading colleges and universities.

Our application consists of the following:

Web portals: Milibo consists of twin web portals (one each for talent and employers). They are written in
C# on top of ASP.NET v4. We also have a few additional “service” portals:

• “Single sign on” accounts portal. This handles account management (registration, login,
signoff, etc.)

• “Admin” portal for the Milibo administrator(s) to do any kind of site administration.

• “Payment notification” portal to handle notifications back from Google Checkout and PayPal,
the two systems that handle our payments.

Database: We use MySQL v5.1. All .NET apps connect to the server using the .NET connector
provided by MySQL. The SOLR search (see below) data import handler uses the JDBC connector.

© Milibo 2010 sridhar at milibo.com



Free text search: This is handled by the REST interface provided by Apache/SOLR (version 1.4.1). We
have a master-slave architecture so resumes and job postings each have their own master indices
where all updates are done (modifications, adds, deletes, etc.), while SOLR handles replication onto
the slaves (which handle all the queries). SOLR runs with any servlet container; we use Apache
Tomcat version 6.0.14.

MiliboCronService: To keep the app operational, there are numerous tasks that need to run at
various times (for example, to perform cleanup, to send reminder emails, etc.). All these cron tasks
are handled by this Windows service.

Scoring Module: The scoring module runs as a stand-alone (Windows console) application. When
someone does a search for jobs (or, the employers do a search trying to find talent), in what order
do the search results get displayed? The scoring module runs every so often assigning scores to
postings and resumes so that when searches get performed results appear in rank order.

Caching: In order to avoid expensive disk accesses for database queries, we use caching. We started
with Microsoft’s Velocity cache (obviously runs only on Windows), and then moved to the open
source Memcached for the bulk of our caching needs (though portions of the application still use the
Microsoft Velocity cache).

3. The Challenges in Migrating to the Cloud


The basic challenge is this: Most applications today are architected to refer to services provided by
software/services resident on servers at specific end-points/URLs. Thus, for example, in the Milibo
application various applications referred to the database by having the following line in a
configuration/properties file:

...
<add key="DbConnectionString" value="server=10.10.11.11;database=dbname;user id=user;password=password" />

Here the application assumes that there is a (MySQL) database server configured and listening on a
port at the IP address 10.10.11.11.

In addition to database servers, the Milibo application talks to search servers (for responding to
search queries), to memcached servers (for caching requests), to ASP.net session state servers (for
handling session state in ASP.net web applications), etc.

To understand the problem with this innocuous declaration so prevalent in web applications today,
consider what would happen if the server at 10.10.11.11 died. To recover, you’d have to do the
following:

a. Provision a new server (load all the software, settings, etc.);

b. Modify all configuration files throughout so that references to 10.10.11.11 are now modified
to refer to this new server IP address.

c. Start/restart all applications which were modified in the step above. (Even in a
medium-sized application this means changes to 11 configuration files and 11 application
restarts.)

A similar situation occurs when, instead of a server dying, you provision a new server to
accommodate increased demand. Suppose the Milibo application takes off due to a favorable
mention in (say) the New York Times. Search traffic skyrockets and we decide to add two more
servers to the pool of servers handling search traffic. Today our configuration file contains the
following:

...
<add key="SearchServers" value="10.10.11.13, 10.10.11.14" />

Now we add new servers 10.10.11.15 and 10.10.11.16. So again we need to modify the various
configuration files to add these new servers in, and (re)start the appropriate applications.

So we can see that we have a problem when servers die or when we have to add new servers into
the mix. But why does the cloud environment create this problem? Actually, the cloud environment
does not create this problem – it merely exposes it because:

a. Cloud environments allow you greater flexibility/agility in starting/stopping servers.
Whereas in a dedicated server environment it would take us a few hours to
get a new server up and running, the Amazon EC2 environment allows you to start/stop
servers at will.

b. Cloud environments are optimized for agility and not necessarily reliability. While that does
not mean that the servers are “unreliable”, it means that the degree of reliability one can
expect from a cloud server is less than what one would expect in a dedicated server
environment.

c. Even if a server goes down in a dedicated environment, one has flexibility in setting its
attributes. To deal with the case of a server going down, for example, in a dedicated server
environment one could ask the provider to provision a new server and assign the old IP
address to that server. At EC2, you cannot ask Amazon to assign an instance a particular IP
address.

Net-net: The cloud environment is inherently more dynamic and (possibly) error-prone. This means
that any successful cloud architecture will embrace these essential attributes of the systems
deployed in the cloud.

4. Designing a Fully Adaptive Environment for the Cloud


A fully adaptive environment should accomplish the following:

1. If a server goes down, all services dependent on services provided by it should be paused or
stopped. Thus, for example, if the database server goes down, any services that depend on
the database should “pause” or “stop”. A web site, for example, that depends on the
database server should display a “Site is under maintenance or temporarily unavailable”
message.

2. If a new server is added, existing servers should be able to seamlessly take advantage of
services provided by it.

To accomplish this, we need the following:

First, when a server is launched, a specific set of services is assigned to be run on that server.

Second, once the server starts up, the set of services assigned to it is started up (if the services it
depends on are available). Thus, for example, a website that depends on the “database service” will
only be started up if the “database service” is available.

4.1. Launching an Instance (on build/client machine)

(This activity is performed on the build/desktop computer. It is usually done manually.)

Milibo has defined a set of services that comprise the entire Milibo suite of applications. We have
also built a set of web-based (and, command-line) tools that help us launch instances to provide
specific services.

The figure above is a screen shot that shows how we launch our Windows instance. In this particular
example, we specify that the launched server should run the following services:

• ec2info: This service runs on all Milibo instances and provides information about the services
running on the instance. See the discussion on ISDE later on in this document.

• smtpserver: This server provides outgoing email services.

• milibowebsite: This server runs the web sites associated with the Milibo application
(employer website, talent website, etc.)

• miliboadmin: This server should run the Milibo administration portal.

• milibocronservice: This server should run the various cron tasks.

• miliboscorer: This server should run the scoring task.

Once the launch button is clicked, our system grabs all the roles selected for the instance, packages
them into the “user-data” field and launches the instance. The “user-data” field is simply a text field
that contains user-specific data associated with an instance. This user-data field is only accessible to
the instance by querying a specific Amazon EC2 meta-data URL (at
http://169.254.169.254/latest/user-data) from the instance.
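
As a rough illustration of the packaging step described above, the sketch below encodes a list of selected services into a user-data string and decodes it again on the instance. The newline-separated format and the function names `pack_user_data`, `unpack_user_data`, `fetch_user_data` and `should_run` are assumptions made for this example; the document does not say how Milibo actually formats the field.

```python
import urllib.request

# EC2 meta-data URL for user-data (only reachable from inside an instance).
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def pack_user_data(services):
    """Encode the selected services as one service name per line (assumed format)."""
    return "\n".join(s.strip() for s in services if s.strip())


def unpack_user_data(text):
    """Decode a user-data string back into a list of service names."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_user_data(url=USER_DATA_URL, timeout=2.0):
    """Read the user-data field from the meta-data URL; works only on an EC2 instance."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def should_run(service, user_data_text):
    """True if the named service was selected for this instance at launch."""
    return service in unpack_user_data(user_data_text)
```

On the build machine one would call something like `pack_user_data(["ec2info", "smtpserver"])` and pass the result as user-data at launch; on the instance, each application could call `should_run("smtpserver", fetch_user_data())`. Instances that require the newer token-based meta-data access need a session token first, which this sketch omits.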

4.2. On the Instance

Every Milibo instance comes pre-configured to respond to a GET request on port 8080. We call this
service ISDE (for “Instance Service Discovery Endpoint”). Thus, if the private IP address of an
instance is 10.10.11.11, a request to:

http://10.10.11.11:8080

results in a response as follows:

ec2info production
smtpserver
milibowebsite
miliboadmin
milibocronservice
miliboscorer

There is a line of output for each service that the instance in question provides. In this example, we
learn the following:

• Line 1: ec2info is a generic service that every Milibo instance provides. The argument
“production” says that this instance is a “production” instance. Other instances can be
running that are “test,” “stage,” etc.

• Line 2: This instance provides an SMTP server for outgoing email.

• Line 3: This instance provides milibowebsite, which means that the Milibo portals (accounts,
employer and talent) run on this server.

• Line 4: This instance also runs the Milibo administration website.

• Line 5: This instance also runs the MiliboCronService, which runs a variety of periodic
tasks (such as sending email reminders, updating search indices, etc.)

• Line 6: This instance runs the MiliboScorer.
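
The response format described above (one service per line, optionally followed by arguments) is simple to parse. The following is a minimal sketch; the function name `parse_isde_response` is hypothetical, and splitting on whitespace is an assumption based on the sample output shown.

```python
def parse_isde_response(text):
    """Map each service name to its list of arguments (possibly empty)."""
    services = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        services[parts[0]] = parts[1:]
    return services


sample = """ec2info production
smtpserver
milibowebsite
miliboadmin
milibocronservice
miliboscorer
"""

parsed = parse_isde_response(sample)
print(parsed["ec2info"])  # arguments of the ec2info line
print(sorted(parsed))     # all service names reported by the instance
```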

So how does ISDE know what services are running on the instance? Very simply: Each application
that is supposed to run on an instance (that is, the ones that have been selected to run on the
instance) registers itself with ISDE once it successfully starts running; conversely, once the
application stops running (for example, if one of its dependencies stops being satisfied), it
de-registers itself with ISDE. (If an application crashes after registering, which we do everything we
can to prevent, a service that runs on each instance cleans up and de-registers
such applications from ISDE.)
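
The register/de-register bookkeeping can be pictured as a small thread-safe registry whose contents are rendered as the ISDE response body. This is only a sketch: the class name `IsdeRegistry` and its methods are hypothetical, and the document does not describe how Milibo's applications actually communicate with ISDE.

```python
import threading


class IsdeRegistry:
    """In-memory record of the services currently running on this instance."""

    def __init__(self):
        self._lock = threading.Lock()
        self._services = {}  # service name -> list of arguments

    def register(self, name, *args):
        """Called by an application once it has started successfully."""
        with self._lock:
            self._services[name] = list(args)

    def deregister(self, name):
        """Called when an application stops or pauses; unknown names are ignored."""
        with self._lock:
            self._services.pop(name, None)

    def render(self):
        """Build the plain-text response body, one service per line."""
        with self._lock:
            lines = [" ".join([name] + args) for name, args in self._services.items()]
        return "\n".join(lines) + ("\n" if lines else "")
```

An HTTP handler listening on port 8080 could then return `registry.render()` for each GET request, and the cleanup service mentioned above could call `deregister` for applications that crashed without de-registering.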

Every Milibo application is installed on every Milibo instance and is configured to start there. Each application does the
following:

1. On startup it checks the EC2 meta-data URL (http://169.254.169.254/latest/user-data) to
see if it is supposed to run. If it is not supposed to run, it simply halts.

2. Every application has a main processing thread and another “monitor” thread. The main
processing thread does what the application is supposed to do. Both threads loop infinitely,
sleeping N seconds after processing.

3. Every time step,

a. The monitor thread gets the list of all running instances (obtained using the
Amazon API). For each instance in the “running” state, we obtain the private IP address.
We then make a GET request to the ISDE service for that instance. We combine all
the service information from each instance and create an “instantaneous” picture of
all services currently running (and the instances they are running on).

b. The monitor thread gets the list of services the application depends on and checks each
one. If any of those services is not available, it sets the “unhealthy” flag, removes the
application’s entry from ISDE, and waits until the next time step. If all of them are
available, it clears the “unhealthy” flag and adds the entry to ISDE. (This update to ISDE occurs because we
want an ISDE request for this instance to return all services running on it. So when a
service starts running on the instance, it registers itself with ISDE. If it pauses, it
de-registers itself with ISDE.)

c. In addition, the monitor thread every time step refreshes the set of instances
providing certain services. Thus, for example, if the web application is dependent on
a set of search servers, every time step it refreshes the list of search servers. In this
way, every time step the system is able to adjust to any changes in the runtime
environment: If new servers have been added, their resources may be consumed. If
existing servers have gone offline, they will no longer be used. (Currently, the
monitor thread consults the ISDE ports of all running instances every 60 seconds.
This means that there can be a lag of up to 60 seconds between a resource coming
online or going offline and that change being reflected in the application.)

4. The main processing thread checks the “unhealthy” flag at each time step. If the status is
“unhealthy” it skips the processing (in the case of the web apps it redirects the request to
the page that shows the “site is down for maintenance or else is temporarily unavailable”
message) and waits for the next time step.
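
The monitor thread's per-time-step work described above can be sketched as two small functions: one that merges the ISDE output of all running instances into a service map, and one that checks an application's dependencies against that map. The function names, the service names `database` and `search`, and the injected `fetch_isde` callable are assumptions for illustration; obtaining the list of running instances from the Amazon API is left out.

```python
import urllib.request


def http_fetch_isde(ip, port=8080, timeout=2.0):
    """GET the ISDE response from one instance (raises OSError if unreachable)."""
    with urllib.request.urlopen(f"http://{ip}:{port}/", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def build_service_map(instance_ips, fetch_isde=http_fetch_isde):
    """Combine the ISDE output of every running instance into {service: [ip, ...]}."""
    service_map = {}
    for ip in instance_ips:
        try:
            body = fetch_isde(ip)
        except OSError:
            continue  # unreachable this time step: treat as providing nothing
        for line in body.splitlines():
            parts = line.split()
            if parts:
                service_map.setdefault(parts[0], []).append(ip)
    return service_map


def monitor_step(dependencies, service_map):
    """Return (unhealthy, providers); unhealthy is True if any dependency has no provider."""
    providers = {dep: service_map.get(dep, []) for dep in dependencies}
    unhealthy = any(not ips for ips in providers.values())
    return unhealthy, providers
```

The main processing thread would read the `unhealthy` flag each time step, and the `providers` mapping gives the refreshed list of instances (for example, search servers) to use until the next step.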

The beauty of this architecture is that each Milibo application is independently adaptive and
self-correcting. That is, each application does what it needs to do to run (if required) and registers itself
with ISDE, or pauses/stops (and de-registers with ISDE).

5. Other Advantages of the Amazon EC2 Cloud


We moved to the Amazon EC2 environment not to be cool but because we wanted capabilities that
our existing dedicated environment did not provide:

• Images: Everything in the cloud is built around the notion of “images” and one is forced to
think about images. Thus, after configuring the system to one’s liking, it is good to set that
as a baseline and create an “image.” Images then become the basis for launching servers.
When a server is launched from an image, it starts with a configuration identical in all respects
to that which it had when the image was created. This means that in the event of failure,
launching a server from an image gets you back to the previous settings pretty quickly.

• Elastic IP: When we run normal maintenance on our site, we have a script that:

o First, starts up nginx (a simple fast web server) which houses a basic version of the
Milibo website with one page: the maintenance page.

o Reassigns the elastic IP to the Linux instance that offers the “milibowebsitebackup”
service.

Voila: a few seconds later anyone hitting Milibo.com will see a nice “Site is undergoing
maintenance” page.

• S3 storage: Amazon provides ultra-cheap and ultra-reliable storage. Previously, when
people uploaded resumes/files to their account, we would store the files on the file system
(with the attendant difficulties of dealing with disk failure, backups, etc.). Now all such
files are stored in S3. We don’t have to worry about backups or storing them in our database,
etc.

• Binary database snapshots: Amazon provides a cool storage mechanism called “EBS” (for
Elastic Block Store). These are block devices whose snapshots are stored in S3. Using a technique
pioneered by Eric Hammond (www.alestic.com), we keep the database’s data storage
directory on a mounted EBS volume. Using excellent scripts provided by Eric, we are able to create
snapshots of our MySQL database. Snapshots take only a few seconds (and we run these
nightly during periods of low traffic). Now what is cool is that these are binary snapshots.
When we create a volume from a snapshot and mount this volume on another instance (as
we do when we launch the Milibo Scorer), we in effect have a view of the database at the
time of the snapshot. No expensive restore from backup, or replaying of SQL scripts.

A very compelling application of all the principles outlined above (dynamically launching instances,
utilizing the snapshot, etc.) occurs with the Milibo Scorer. Recall that the Milibo Scorer runs every
once in a while to score all job postings and resumes. It is a time-consuming and expensive algorithm,
and one that we’d prefer not to run on one of the existing EC2 instances that is performing other
tasks. So we do something pretty neat and interesting. The scoring task (which runs weekly) on the
Windows instance does the following:

• First: It creates a volume from the last database snapshot. So it has an almost current (within 24
hours) version of the production database.

• Second: It launches a new instance with the “MiliboScorer” task enabled. It attaches the
volume to this instance once the instance is running.

• Third: The scorer application runs on the newly launched instance and performs all the
scoring activity on the older snapshot of the database. Once the scoring is complete, the
scores are written into the actual production database.

• Fourth: The scoring task monitors the state of the scoring. Once the scoring is complete, it
terminates the scoring instance and deletes the newly created volume.
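
The four steps above amount to a small orchestration routine. The sketch below expresses it against a hypothetical `cloud` object; every method on it (`create_volume_from_snapshot`, `launch_instance`, `wait_until_running`, `attach_volume`, `scoring_complete`, `terminate_instance`, `delete_volume`) is an invented wrapper name for illustration, not an actual Amazon SDK call, and the user-data string reuses the assumed one-service-per-line format.

```python
import time


def run_scoring_job(cloud, snapshot_id, image_id, poll_seconds=60, sleep=time.sleep):
    """Run the scorer on a throwaway instance backed by a snapshot-derived volume."""
    # Step 1: create a volume from the most recent database snapshot.
    volume_id = cloud.create_volume_from_snapshot(snapshot_id)
    # Step 2: launch an instance with the scorer service enabled.
    instance_id = cloud.launch_instance(image_id, user_data="ec2info\nmiliboscorer")
    try:
        cloud.wait_until_running(instance_id)
        cloud.attach_volume(volume_id, instance_id)
        # Steps 3 and 4: the scorer runs remotely; poll until it reports completion.
        while not cloud.scoring_complete(instance_id):
            sleep(poll_seconds)
    finally:
        # Release the temporary resources even if scoring fails part-way.
        cloud.terminate_instance(instance_id)
        cloud.delete_volume(volume_id)
    return instance_id, volume_id
```

A real implementation would likely also need to wait for the volume to detach before deleting it, and to handle a failed launch so the new volume is not left behind.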

This architecture allows one to deploy hardware just for this resource consuming task and no more –
the quintessentially perfect use of cloud resources.

6. Shortcomings and Future Work


While the Milibo architecture works fine for most situations, there are nonetheless a few areas
where it does not work. Take memcached, for example. The way memcached works is as follows:
Any time you want to cache an object, the client hashes the key to determine the server on
which to store the object. Thus, the result of the hashing operation depends on the set of servers
being used. If you add a server to the mix or remove one from it, keys can map to different servers
than before, so previously cached entries effectively become invalid.
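
To make the effect concrete, the sketch below assumes the simplest possible scheme, hashing the key and taking it modulo the number of servers, and measures how many keys change servers when a third server is added. Actual memcached client libraries differ in how they hash; the modulo scheme, the key names, and the server addresses here are assumptions chosen for illustration.

```python
import hashlib


def server_for(key, servers):
    """Pick a server by hashing the key and taking it modulo the server count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def fraction_remapped(keys, old_servers, new_servers):
    """Share of keys whose assigned server changes between the two server lists."""
    moved = sum(
        1 for k in keys if server_for(k, old_servers) != server_for(k, new_servers)
    )
    return moved / len(keys)


keys = [f"resume:{i}" for i in range(10000)]
old = ["10.10.11.20", "10.10.11.21"]  # hypothetical memcached servers
new = old + ["10.10.11.22"]           # one server added
print(f"{fraction_remapped(keys, old, new):.0%} of keys map to a different server")
```

Under this modulo scheme, going from two servers to three is expected to move about two thirds of the keys. Clients that use consistent hashing reduce the share of keys that move, though they do not eliminate it.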

We solve this by not updating memcached servers dynamically. If new memcached servers come
online, they don’t get exploited until application restart. A way around this would be to use a smart
memcached library such as the Northscale Memcached Server (www.northscale.com) – which
provides a REST interface to memcached.

The database is another area where simply adding new servers does not work. Considerably more
work needs to be done to set up replication slaves and/or multi-master. How to set this up is beyond
the scope of this document.

7. Conclusion
The Milibo system is completely self-healing and adaptive. As a result of this architecture, we no
longer have static property/configuration files containing IP addresses or end-point information.
Instead, the service information (what services are available and on what instances) is deduced
automatically each time step by each application. There are no additional data repositories that
need to be updated.

8. Acknowledgements
The following made our task of migrating to, using and exploiting the Amazon EC2 environment easier:

1. Amazon: While they are a supplier, they are indeed a class act. The documentation, APIs
and tools they provide are great. We have personal experience with their .NET SDK, which we
believe makes development of tools/automation scripts easy. In addition, the forums they
have established, where people congregate to get help and share ideas, are very
useful and provide real-world, practical ideas from many practitioners.

2. Eric Hammond (of Alestic at www.alestic.com). Our ability to create a reliable MySQL
infrastructure would not have been possible without Eric’s great work. I used his
instructions on using EBS volumes for MySQL and also his ec2-consistent-snapshot script.

3. Shlomo Swidler (resident guru of the EC2 forums) provides numerous thoughtful ideas and
suggestions to practical problems that people encounter. He originated an idea of using
empty security groups as a way of annotating services that instances provide. I started with
this approach before finally moving to the ISDE approach outlined in this document. I have
learned quite a bit from reading his posts. You can read Shlomo’s work at
www.shlomoswidler.com.
