Enhancing application robustness in cloud data centers

We propose OX, a runtime system that uses application-level availability constraints and application topologies discovered on the fly to enhance the resilience of cloud applications to infrastructure anomalies. OX allows application owners to specify groups of highly available virtual machines, following component roles and replication semantics. To discover application topologies, OX transparently monitors network traffic among virtual machines. Based on this information, OX builds online topology graphs for applications and incrementally partitions these graphs across the infrastructure to enforce availability constraints and optimize communication between virtual machines. We evaluate OX in a realistic cloud setting using a mix of Hadoop and YCSB/Cassandra workloads. We show how OX increases application robustness by protecting applications from network interference effects and rack-level failures.
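
To make the idea of partitioning a topology graph under availability constraints concrete, the following is a minimal sketch and not OX's actual algorithm. It assumes that traffic monitoring yields (source VM, destination VM, bytes) records, that a highly available group means "no two members on the same rack", and that each rack holds a fixed number of VMs. The names `build_topology`, `place`, and `cross_rack_traffic`, as well as the sample VMs and racks, are hypothetical. The sketch uses a simple one-shot greedy heuristic, whereas the system described above partitions incrementally.

```python
from collections import defaultdict


def build_topology(flows):
    """Aggregate (src_vm, dst_vm, bytes) records into an undirected weighted graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, nbytes in flows:
        if src == dst:
            continue  # ignore self-traffic
        graph[src][dst] += nbytes
        graph[dst][src] += nbytes
    return graph


def place(vms, graph, ha_groups, racks, capacity):
    """Greedily assign VMs to racks (assumed heuristic, for illustration only).

    Constraints: at most `capacity` VMs per rack, and no two VMs from the
    same highly available group on one rack. Among feasible racks, prefer
    the one that already hosts the VM's heaviest communication peers.
    """
    group_of = {vm: g for g, members in ha_groups.items() for vm in members}
    placement = {}
    load = {r: 0 for r in racks}
    groups_on_rack = {r: set() for r in racks}

    # Place the heaviest communicators first; ties are broken by VM name.
    order = sorted(vms, key=lambda v: (-sum(graph.get(v, {}).values()), v))
    for vm in order:
        group = group_of.get(vm)
        best_rack, best_key = None, None
        for rack in racks:
            if load[rack] >= capacity:
                continue  # rack is full
            if group is not None and group in groups_on_rack[rack]:
                continue  # would violate the availability constraint
            local = sum(w for peer, w in graph.get(vm, {}).items()
                        if placement.get(peer) == rack)
            key = (local, -load[rack])  # more local traffic, then less load
            if best_key is None or key > best_key:
                best_rack, best_key = rack, key
        if best_rack is None:
            raise ValueError(f"no feasible rack for {vm}")
        placement[vm] = best_rack
        load[best_rack] += 1
        if group is not None:
            groups_on_rack[best_rack].add(group)
    return placement


def cross_rack_traffic(graph, placement):
    """Total bytes on edges whose endpoints sit on different racks."""
    return sum(w for a in graph for b, w in graph[a].items()
               if a < b and placement[a] != placement[b])


# Hypothetical example: two web replicas and two database replicas.
flows = [("web1", "db1", 100), ("web2", "db2", 100), ("web1", "web2", 5)]
ha_groups = {"web": ["web1", "web2"], "db": ["db1", "db2"]}
graph = build_topology(flows)
placement = place(["web1", "web2", "db1", "db2"], graph, ha_groups,
                  racks=["r1", "r2"], capacity=2)
```

In this example the replicas of each tier end up on different racks, so a single rack failure leaves one web and one database VM running, while each web VM shares a rack with the database VM it talks to most, which keeps the heavy flows rack-local.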