Lightwolf Team Blogs

Beyond Transactional Consistency

Posted by Esmeralda Swartz on July 28, 2010 8:11am EDT Comments (0)

For years, software has been growing organically. Today, system architects are faced with the pressure of scaling it.

We’ve discussed the multiple pressure points of capacity/performance, operations, and cost scaling at length in other blogs as well as on the Lightwolf website.  It is important to understand these when evaluating the data points for the make vs. buy decision.  Without doing so, it’s very difficult to gauge what the application needs to implement and what other software including the database layer should be implementing.

A key question that consistently comes up, no pun intended, is how to deal with data consistency. More often than not, products provide pockets of consistency (i.e. isolated relational databases are consistent), but do not deal with system-level data consistency.  The gap between what is required at the application level for consistency and what is supplied by the underlying database/software then must be made up by the application, somehow.  This is one key issue at the heart of application scaling. 

Why is Data Consistency so Important?

For companies that have ongoing customer deliverables, it is hard to juggle day-to-day customer roadmap commitments with the problem of scaling software, which can be rationalized as a longer-term deliverable. Unfortunately, data consistency shows up in areas that customers do care about, such as geographic redundancy/multiple live sites (how consistent does data need to be between geographically distributed sites), availability (how consistent does data need to be between primary and backup), rollback (how precise is the data, what is the consistency at the point of rollback), asynchronous design (multiple parties accessing the same data set in different ways), data migration (how is the system kept online while moving data), etc. The lack of system-wide data consistency ends up forcing scarce development resources to be reallocated to solving this problem rather than supporting additional product features.

Beyond Transactional Consistency

The simple diagram below depicts a common scenario: data is replicated three ways for read performance. The application server on the left is responsible for doing updates. It has a transaction manager and replicates the data to all three databases using a two-phase commit. The problem is that the commit does not take place simultaneously on all three databases. The application server on the right can therefore see different answers depending on which database it chooses for the query. This is true even when the transaction isolation level is set to read committed. Even more problematic is when the read application server decides to use multiple databases at the same time: the data is inconsistent between the three databases, which effectively breaks the isolation level. The application has to either decide that it does not care or deal with the consistency breakdown itself.
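To make the timing window concrete, here is a minimal sketch in Python. It is a toy simulation, not a real transaction manager: the `Replica` class, the `two_phase_commit` function, and the `balance` field are all hypothetical names invented for illustration. Each replica only ever returns committed data, yet a reader that looks at all three databases between individual commits still sees a mix of old and new values.

```python
class Replica:
    """Toy database replica (hypothetical); reads only ever see committed data."""

    def __init__(self, name):
        self.name = name
        self.committed = {"balance": 100}
        self.prepared = None  # staged update waiting for phase two

    def prepare(self, update):
        self.prepared = dict(update)
        return True  # phase one: vote yes

    def commit(self):
        self.committed.update(self.prepared)
        self.prepared = None

    def read(self, key):
        return self.committed[key]


def two_phase_commit(replicas, update, observe):
    """Run both phases, calling observe() after each individual commit."""
    if not all(r.prepare(update) for r in replicas):
        raise RuntimeError("a replica voted no")
    snapshots = []
    for r in replicas:  # phase two reaches the replicas one at a time
        r.commit()
        snapshots.append(observe())
    return snapshots


replicas = [Replica("db1"), Replica("db2"), Replica("db3")]


def read_all():
    """What the reading application server sees across all three databases."""
    return [r.read("balance") for r in replicas]


snapshots = two_phase_commit(replicas, {"balance": 50}, read_all)
for snap in snapshots:
    print(snap)
```

The first two snapshots contain both the old and the new balance, which is exactly the breakdown described above. The replicas agree only after the last commit lands.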

We’ll look deeper at issues with the CAP theorem as it applies to two-phase commit and transaction managers including denormalization in subsequent blogs.

Evolution of Parallelism: Part 2

Posted by Esmeralda Swartz on June 8, 2010 7:01am EDT Comments (0)

In the database area, distributed query planning and optimization is one of the hardest problems to solve. In fact, the success of large incumbent databases is due in part to the development of their query optimization technology. The optimizer determines both the indices to be used to execute a query and the order in which query operations (e.g., joins, selects, etc.) should be executed. The optimizer lists alternative plans, estimates the cost of a plan using a cost model, and chooses the plan with the lowest cost.
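The enumerate, estimate, and choose loop can be sketched in a few lines of Python. This is a toy under stated assumptions: the table names, the row counts, the single fixed join selectivity, and the cost model (sum of intermediate result sizes for a left-deep join) are all made up for illustration. A real optimizer would take them from statistics and would prune the search space rather than list every ordering.

```python
from itertools import permutations

# Hypothetical statistics; a real optimizer would read these from the catalog.
TABLE_ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}
SELECTIVITY = 0.0001  # assumed fraction of row pairs that survive each join


def plan_cost(join_order):
    """Toy cost model: total size of the intermediate results of a left-deep join."""
    rows = TABLE_ROWS[join_order[0]]
    cost = 0.0
    for table in join_order[1:]:
        rows = rows * TABLE_ROWS[table] * SELECTIVITY  # estimated join output
        cost += rows
    return cost


def choose_plan(tables):
    """List the alternative join orders and keep the cheapest one."""
    plans = list(permutations(tables))
    return min(plans, key=plan_cost)


best = choose_plan(TABLE_ROWS)
print(best, plan_cost(best))
```

With these numbers, the cheapest plans join the two small tables first and bring in the large table last. That is the kind of decision that becomes hard once the statistics are missing or stale.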

In the past, query processors produced acceptable query plans by leveraging I/O cost information as well as history and other statistics to determine the best executable plans.  Today, the data management area is demonstrating a pressing need for new query planning and optimization techniques.  The high volume of sophisticated reporting requires full table scans with “complex” joins.  Data sources are distributed, independent, and heterogeneous.  The query optimizer does not have access to histograms or other useful statistics.   Further, the cost of I/O operations can be high, unpredictable and variable across the WAN.  Resource characteristics of the database federation can fluctuate.   Assumptions made at the time a query is submitted will often not hold true throughout the duration of query processing.   As a result of all of these pressures, traditional query optimization and execution techniques are proving to be ineffective.

At the heart of a successful distributed database architecture is the distributed query planner, optimizer, and executor.   It is the distributed query planner that separates “starter” databases from those capable of supporting the growth and success of a business.  While the components that comprise a distributed query architecture are mostly similar, the ability to optimize query performance is very dependent on other architectural decisions outside of the query planner and optimizer.

The Evolution of Parallelism: Part 1

Posted by Esmeralda Swartz on June 4, 2010 7:52am EDT Comments (0)

“Between now and 2020 the amount of digital information created and replicated in the world will grow to an almost inconceivable 35 trillion gigabytes.” Source: IDC  Digital Universe White Paper

In 2009, one of the few growth rates that did not slow down with the economy was the amount of data (structured and unstructured) generated by consumers and enterprises. The amount of digital information being created and stored per second is growing at an exponential rate and shows no signs of slowing down.  Large-scale distributed computing environments have emerged (private and public clouds, sensor networks, grids, etc.) in response to big data needs.  The tools to manage and analyze the vast amounts of data need to change.    Queries and relations are becoming more complex.  There is a pressing requirement for new high performance distributed parallel query processing, optimization, and execution techniques to query and analyze datasets across large networks of federated computers.  

The MapReduce model, in which computation is divided into a map function and a reduce function, exposes large amounts of parallelism. The map function takes a key/value pair and generates one or more intermediate key/value pairs. The reduce function then takes the intermediate key/value pairs and merges all values corresponding to a single key. These functions can run independently. Traditionalists object that not every query is “mapreduceable.” While this may be true, every query does have a part that is mapreduceable (sort or index functions partition the data, allowing it to be mapreduced). It is possible to achieve parallelism on at least part of the query and make that go really fast while ploughing through the rest. What is needed is structure on top of unstructured data to allow it to be accessed and processed quickly to drive operational decision making.
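As a rough illustration of the two functions, here is the classic word count written in plain Python. It runs in a single process and the names `map_fn`, `reduce_fn`, and `map_reduce` are our own. It is a sketch of the model, not the API of Hadoop or of any other MapReduce framework.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document id, text) pair and emit intermediate (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all values that share a single intermediate key."""
    return key, sum(values)


def map_reduce(records):
    groups = defaultdict(list)
    for key, value in records:  # each map call is independent of the others
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # group intermediate pairs by key
    # each reduce call is also independent, one per intermediate key
    return dict(reduce_fn(k, v) for k, v in groups.items())


counts = map_reduce([("doc1", "big data big queries"), ("doc2", "big joins")])
print(counts)
```

Because no map call depends on another and no reduce call depends on another, a framework is free to spread both phases across many servers.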

Sharding for Business Intelligence

Posted by Esmeralda Swartz on May 17, 2010 7:38am EDT Comments (0)

The demand for new sharding solutions as a result of evolving Business Intelligence (BI) applications is being felt by large enterprises across every major vertical, from financial to pharmaceuticals. BI applications are becoming increasingly interactive, both in the number of users running real-time queries on the data set and in the need for the data set to receive real-time updates on business operations. In addition, large enterprises’ BI applications are now including Web 2.0 functionality. According to a 2009 Gartner report [1], collaborative decision making will emerge as a new product category that combines social networking software with BI Platform capabilities.

BI datasets are extremely large and include data organized along multiple dimensions, with foreign key references linking data along each dimension.  Star and snowflake schema represent typical data organizations utilized in data warehouses for BI applications.

Issues for BI applications include:

  • Interactive query performance on extremely large data sets
  • Supporting real time updates to the dataset without degradation of query performance
  • Supporting complex queries and analytical functions on large data sets (both structured and unstructured)
  • Balancing uneven dataset usage patterns; more recent data is likely accessed much more frequently than historical data
  • Supporting dataset growth
  • Supporting growth of the user population
  • Supporting increasingly complex and sophisticated analysis
  • Ensuring consistent data and foreign key maintenance

Sharding provides the ability to spread the large datasets across hundreds or thousands of servers.  In addition sharding may be used to distribute the dataset such that more server compute performance is available to support sophisticated analysis needs.

Managing extremely large datasets spread over hundreds or thousands of servers is a challenge in itself. In addition, supporting continuous availability of that dataset, dynamic dataset migration to maintain desired response times, and variability in dataset growth and access patterns presents substantial operational challenges.

It is also necessary to correctly maintain database consistency and foreign key linkages while supporting the operational requirements of today’s BI applications.

[1] “Gartner Reveals Five Business Intelligence Predictions for 2009 and Beyond,” http://www.gartner.com/it/page.jsp?id=856714

Sharding for Web 2.0

Posted by Esmeralda Swartz on May 17, 2010 7:25am EDT Comments (0)

In general, the Web 2.0 issues are the same as those faced by SaaS applications although in a slightly different environment.  For Web 2.0 applications, the usage is based on each individual user, and that is the primary axis of segmentation.  However, unlike the typical SaaS multi-tenant application, the Web 2.0 data belonging to a single user is not private to that user, but may be shared by multiple users.  For performance reasons, certain of the data may be replicated and distributed along differing axes.

For example, in the Flickr photo sharing application, if a user makes comments on another user’s photo, the comment is stored in two places for performance reasons: once with the commenting user, and once with the photo being commented on, which belongs to the other user. In database terms, the dataset is denormalized for performance reasons.

The problem is ensuring that the replicated data remains synchronized and consistent. If the comment is deleted, it should be deleted in both locations. If the comment is edited, both copies of the comment must be updated. Ensuring this consistent synchronization is a challenge in a dynamic environment when data locations are changing in response to ongoing operational needs.

The  sharding solution needs to provide the ability to distribute a database table simultaneously along two different axes.  So in the Flickr example above, the application could have a single database table of comments, and the shard mechanism could automatically shard that table along both the axis of the commenting user, and the axis of the user owning the photo.  The benefit to the application developer is that the application maintains only one table and the shard manager automatically and transparently ensures that the table is consistently replicated and distributed along both axes.

In addition, the shard solution should provide the ability to transparently replicate and distribute only portions of a table (selected columns) along an alternate axis. The above Flickr example could also be implemented using this capability. A single table containing the picture and users’ comments on the picture could be distributed in its entirety along the axis of the user owning the picture, while along a secondary axis only the comments could be distributed, based on the user generating the comment.
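The two-axis distribution of the comments table can be sketched as follows. Everything here is hypothetical: the `DualAxisComments` class, the `shard_for` hash, and the shard count are illustrative stand-ins for what a shard manager would do behind the scenes, and the in-memory dictionaries stand in for separate database servers. The point is that the application calls one write path and both copies stay in step.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(user_id):
    """Stable hash of a user id onto a shard number."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS


class DualAxisComments:
    """One logical comments table, physically placed along two axes."""

    def __init__(self):
        self.by_commenter = [{} for _ in range(NUM_SHARDS)]
        self.by_owner = [{} for _ in range(NUM_SHARDS)]
        self.axes = {}  # comment_id -> (commenter, owner), used to find both copies

    def upsert(self, comment_id, commenter, owner, text):
        """Insert or edit a comment; both copies are written together."""
        row = {"commenter": commenter, "owner": owner, "text": text}
        self.by_commenter[shard_for(commenter)][comment_id] = dict(row)
        self.by_owner[shard_for(owner)][comment_id] = dict(row)
        self.axes[comment_id] = (commenter, owner)

    def delete(self, comment_id):
        """Remove the comment from both locations."""
        commenter, owner = self.axes.pop(comment_id)
        del self.by_commenter[shard_for(commenter)][comment_id]
        del self.by_owner[shard_for(owner)][comment_id]

    def written_by(self, commenter):
        shard = self.by_commenter[shard_for(commenter)]
        return [r["text"] for r in shard.values() if r["commenter"] == commenter]

    def received_by(self, owner):
        shard = self.by_owner[shard_for(owner)]
        return [r["text"] for r in shard.values() if r["owner"] == owner]


store = DualAxisComments()
store.upsert("c1", commenter="alice", owner="bob", text="Nice shot")
store.upsert("c1", commenter="alice", owner="bob", text="Nice shot!")  # an edit
```

A production shard manager would also have to make the two writes atomic and keep them correct while shards are being migrated. The sketch ignores both concerns.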

Sharding for SaaS

Posted by Esmeralda Swartz on May 17, 2010 7:17am EDT Comments (0)

In the May 14th blog, we started the discussion on sharding. In the next three posts, of which this is the first, we’ll spend some time looking at sharding across three market segments (SaaS, Web 2.0, and Large Enterprise) to highlight the demands each segment places on a sharding solution.

For Software as a Service (SaaS) applications there is a natural axis of segmentation of the database by customer.  The data usage pattern of a typical SaaS application demonstrates two different patterns of access.  The first usage pattern has application specific data which is global across all customers. Therefore to ensure maximum data locality and performance, this data should be replicated across all database servers.  The second data usage pattern is segmented by customer.  Each customer has a dataset which is unique and private to that customer.  Therefore a key such as a unique customer identifier number is most appropriate for use as the content control of the sharding data distribution.

While implementation of this type of key directed distribution is relatively straightforward in a static situation, today’s dynamic business environment is anything but static.  Some real world examples illustrate the point well:

  • Salesforce.com, originally launched in 1999, grew from 1,500 customers (30,000 users) in January of 2001 to 41,000 customers (1,100,000 users) in January of 2008. While the average number of users per customer ranged from 13 to 27, in 2004 Salesforce had four large customers with over 1,000 users each (SunTrust Banks, Sungard, Corporate Express, and ADP).
  • Animoto grew from 25,000 to 250,000 customers in three days. At the peak, almost 20,000 new customers per hour registered.

As the examples above illustrate, the SaaS issues which must be dealt with include:

  • Supporting additional new customers.
  • Dealing with customer churn – rebalancing to account for inactive or terminated customers.
  • Supporting growth of existing customer usage, either due to customer acceptance and increased usage or due to additional application functionality over time.
  • Ensuring that Service Level Agreements (SLAs) with respect to response time and availability are maintained.
  • Managing one or more “Hot” customer(s) and/or transient loads caused by fluctuating demand.  For example, demand fluctuations due to seasonality (tax returns on April 15th), time zone effect (most usage during normal working hours), or other event(s).

Effectively resolving the issues above requires not just a flexible sharding mechanism to control data distribution, but also operational support which allows the shard data distribution to be modified dynamically in real-time without interrupting normal operations.  The shard solution must provide both a flexible and application transparent sharding mechanism, and an accompanying rich set of integrated management and operational tools which allow real-time shard migration and replication.
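A minimal sketch of key-directed distribution with a movable mapping might look like the following. The `ShardedSaaSStore` class and its method names are hypothetical, and the in-memory dictionaries stand in for database servers. Global application data is replicated to every shard, each customer's private data lives on exactly one shard chosen through a directory, and a "hot" customer can be moved by updating that directory.

```python
class ShardedSaaSStore:
    """Toy model: global data replicated everywhere, customer data on one shard."""

    def __init__(self, num_shards):
        self.shards = [{"global": {}, "customers": {}} for _ in range(num_shards)]
        self.directory = {}  # customer_id -> shard index

    def _least_loaded(self):
        """Pick the shard currently holding the fewest customers."""
        return min(range(len(self.shards)),
                   key=lambda i: len(self.shards[i]["customers"]))

    def put_global(self, key, value):
        for shard in self.shards:  # replicate to every server for data locality
            shard["global"][key] = value

    def put_customer(self, customer_id, key, value):
        if customer_id not in self.directory:  # new customer: assign a shard
            self.directory[customer_id] = self._least_loaded()
        idx = self.directory[customer_id]
        self.shards[idx]["customers"].setdefault(customer_id, {})[key] = value

    def get_customer(self, customer_id, key):
        idx = self.directory[customer_id]
        return self.shards[idx]["customers"][customer_id][key]

    def migrate(self, customer_id, target):
        """Move one customer's dataset to another shard and repoint the directory."""
        source = self.directory[customer_id]
        data = self.shards[source]["customers"].pop(customer_id)
        self.shards[target]["customers"][customer_id] = data
        self.directory[customer_id] = target


store = ShardedSaaSStore(num_shards=3)
store.put_global("tax_rate", 0.07)
store.put_customer("acme", "plan", "gold")
store.put_customer("globex", "plan", "free")
store.migrate("acme", target=2)  # e.g. isolate a hot customer
```

The hard part in practice is everything this sketch leaves out: performing the move while reads and writes continue, and doing so without the application noticing.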

In addition the sharding mechanisms need to include integrated support for policy based security based on per customer and per user role based access credentials, and policy based per customer and per user data encryption.

Sharding for Success

Posted by Esmeralda Swartz on May 14, 2010 7:29am EDT Comments (0)

A growing number of customers are interested in using database sharding to support business growth. The nature of today’s applications requires that sharding be considered a key component of any scaling architecture with a view to success. When planning an online (Web 2.0, SaaS or Business Intelligence) application, companies have to consider the application’s growth potential. The issue is that it is difficult, if not impossible, to predict when and how fast the application and customer base will grow.

Companies are faced with supporting growth along multiple axes:

  • Increasing numbers of users driving system demand up.
  • Over time as the per user usage increases, the size of the dataset per user increases as well, linearly in the case of a SaaS application, but exponentially in the case of a Web 2.0 application.
  • As the capabilities and feature set of the application increases with each release, the computation and database load to support each user increases as well.

This translates into issues in a couple of areas:

  • Web server: problems with accessing files or too many concurrent user requests; attempts to deal with this include using load balancers for web servers or a CDN for file storage.
  • Database server: problems occur both when writing to and when reading from the database; at a certain point, even a replication model that uses fewer write servers and more read servers can be inefficient when dealing with indexes and big tables.

The database layer becomes a bottleneck as it becomes saturated due to one or more independent causes including the aggregate dataset size or the dataset’s compute load.  In addition, system availability may become compromised as the system load increases.  To continue to support growth and maintain both acceptable response times and availability, the application must be sub-divided and distributed across a number of database servers.  Availability also improves since the dataset is distributed over multiple servers and therefore a failure of one server removes only a portion of the dataset not the entire dataset.

While partitioning divides a large table logically into multiple smaller subset tables, sharding adds the additional step of distributing those subset tables onto different physical servers. This is sometimes referred to as Content Driven Placement because portions of the dataset content determine the location(s) to which the data is distributed and stored on a persistent basis.

With the right architecture, sharding brings substantial benefits in performance, manageability, and availability.  The sharding architecture should also provide the developer the ability to improve performance without changing application code. 

Why Does Performance Improve with Sharding?

  • Generally the smaller the table, the faster the operations on that table.  If a table can be divided into smaller sections and SQL operations can be identified as only applying to a particular section of the table then the operation will complete faster since only that smaller section is involved.  The SQL operation is pruned before reaching other sections.
  • If the SQL operations cannot be identified as only applying to a particular subset section of the table, but must be applied to all of the table then to the extent that sections of the table reside on different physical servers, the SQL operations can proceed in parallel.  This yields a performance improvement over an operation executing over the entire dataset on one server.  For example, if an operation involves sequentially scanning 10,000 rows in a table and that operation can proceed in parallel with each of 10 servers scanning 1,000 rows each or each of 100 servers scanning 100 rows each then the aggregate response time of that operation improves proportionately to the number of servers available for parallel operations.
  • While sharding is generally thought of as a solution to a write/update performance problem, many other benefits accrue from its use.  Sharding helps to isolate and constrain storage, CPU, memory, and IO resource usage.  Sharding also increases read/query bandwidth and may also avoid the need for database replication.
  • Manageability improves with sharding as operations such as backups and bulk add or bulk delete of entire shards can be achieved easily and without the performance impact which would occur if operating on an entire large dataset.
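The first two points above, pruning and parallel scanning, can be sketched with the same 10,000-row example. This is an illustration only: the in-memory lists stand in for ten database servers, the `id` and `amount` columns are invented, and the threads merely model servers working at the same time (in CPython they do not give real CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 10  # stand-ins for 10 database servers

rows = [{"id": i, "amount": i % 7} for i in range(10_000)]
shards = [[] for _ in range(NUM_SHARDS)]
for row in rows:
    shards[row["id"] % NUM_SHARDS].append(row)  # shard key: id modulo shard count


def lookup(row_id):
    """Pruned operation: the shard key is known, so only one shard is scanned."""
    shard = shards[row_id % NUM_SHARDS]
    return next(r for r in shard if r["id"] == row_id)


def scan_shard(shard):
    """Work done by one server: a sequential scan of its 1,000 rows."""
    return sum(r["amount"] for r in shard)


def total_amount():
    """Unpruned operation: every shard scans in parallel, then results are merged."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        return sum(pool.map(scan_shard, shards))


print(lookup(4321))
print(total_amount())
```

In the pruned case the work drops from 10,000 rows to 1,000. In the unpruned case the total work is unchanged, but each server handles only its own tenth of it.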

Hadoop: Not as Easy to Use as a Plush Toy Elephant

Posted by Esmeralda Swartz on April 22, 2010 1:21pm EDT Comments (0)

One of the more interesting talks I attended at the Cloud Computing Expo this week was a session on the practical use cases for Hadoop. The talk started with what many in the industry already know: managing data in traditional ways is not working. Hadoop is a great tool for helping companies deal with unstructured big data on commodity hardware, and to query that data. However, the most interesting part of the discussion was that deploying Hadoop requires a lot of IT expertise (unlike its namesake, the plush toy elephant). Hadoop is perceived by many as easy-to-use open source software that provides help for IT departments that need to analyze large data sets. After all, many companies in the Web 2.0 community have been using either Hadoop or other MapReduce-style software to handle their big data needs. Surely this must mean that this is a silver bullet worth looking at? What most companies uncover is that Hadoop is not as straightforward as initially thought. In fact, the company presenting makes a living helping IT departments use Hadoop. Many of the companies that have successfully deployed Hadoop have rocket scientists who know how to use it. Even if you wanted to or could pay for Hadoop expertise, where do you find these Hadoop experts? There is a community of contributors doing good work out there, but what is missing is deployment expertise. The current state of the art in MapReduce parallelism is far from being turnkey.

Musings from the 5th Cloud Computing Expo NY

Posted by Esmeralda Swartz on April 21, 2010 12:22pm EDT Comments (0)

This is the first blog on observations from the Cloud Computing Expo held in NY this week. I spent a day at the Javits Convention Center and a number of thoughts struck me when walking the show floor. This was supposed to be a cloud computing show, and yet there appeared to be more PaaS and IaaS companies than SaaS or cloud. While a part of me is not surprised, since this is what you would expect to see in a market in its infancy, the magnitude of what needs to be accomplished for mainstream adoption hits you. Don’t get me wrong, all of the right marketing messages were there, but the steps and practicality needed to deliver on what is being promised need a lot of work. There were many companies presenting on the benefits of cloud computing, but the use cases were the same ones heard before. Good work is being done on making clearer what applications are suited for the cloud and, more importantly, the ones that are not. Why am I highlighting this? While it is clear that the business model for computing and storage in both private and public clouds has been validated in the form of the commodity server building block, there is much to be done in the application area. Just how far the componentization of hardware has come hit me as I walked up to the Microsoft booth. Microsoft had on display two of their 20-foot data center containers completely packed with pre-assembled servers and power and cooling connectors, ready to be rolled into a data center. Another thought struck. This was an interesting example of outsourcing, with the work required to set up a data center of servers shifting from IT to HVAC expertise. While I am simplifying it, it does make the point of how far we’ve come on computing and storage, which is in direct contrast to the application side of the equation.

Big Data, How Much?

Posted by Esmeralda Swartz on April 1, 2010 12:48pm EDT Comments (0)

The topic of Big Data has received a lot of attention over the last year or so. There is general consensus that the massive data sets in common use today can only be effectively handled with distributed data processing techniques. As the volume of data (structured and unstructured) and the number of users (technical and non-technical) of that data continue to grow across the enterprise, the requirement for storage, bandwidth, and computational capacity to query and analyze these datasets is growing at an unprecedented rate. When a one terabyte hard drive costs less than $100, the complexity of dealing with the amount of data, data types, and data analysis will only increase.

How much data are we talking about, and is it really such a problem? Here is a real-world example to attempt to answer the question. It is not unusual these days for an enterprise to ingest 50GB of new facts per day. How much is this anyway? About 5.8 thousand new facts inserted per second. Of course, the real world has peak loads, so you cannot assume that the database can be level loaded throughout the entire day. When is a good time to stop loading and perform analysis? The answer is that there isn’t one.
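The arithmetic behind that figure can be checked in a few lines. The post does not state how large a single fact is, so the 100 bytes per fact used below is our assumption, chosen because it reproduces the figure of about 5.8 thousand facts per second. A different fact size would scale the result accordingly.

```python
BYTES_PER_DAY = 50 * 10**9      # 50GB of new facts per day (decimal gigabytes)
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
ASSUMED_BYTES_PER_FACT = 100    # assumption: not stated in the post

bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
facts_per_second = bytes_per_second / ASSUMED_BYTES_PER_FACT

print(f"{bytes_per_second / 1000:.0f} KB/s sustained ingest")
print(f"{facts_per_second / 1000:.1f} thousand facts/s, level loaded")
```

This is the level-loaded average. Peak rates will be higher, which is the point made above.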

Big Data and Cloud Techniques

Posted by Esmeralda Swartz on March 31, 2010 11:30pm EDT Comments (0)

In the March 30th blog, we began the discussion on the changing face of business intelligence. We’ll be doing a series of blogs on this topic since it is a big problem that most organizations are grappling with. The purpose of these blogs is to highlight some of what we are hearing from customers and attempt to shake out some of the noise in the market. What techniques (whether from the cloud or the enterprise) have a shot at solving at least part of the problem? The problem is so big, excuse the pun, that even the most conservative enterprises have started to look to the cloud for answers. Often what they find are products that point to the right technology direction, but do not solve the problem. Technologies such as MapReduce are useful tools to parallelize large-scale data analysis across a federation of commodity servers. There is no question that parallelization is the only answer to improve the performance of BI and queries. However, there are a couple of limitations worth noting that must be addressed. MapReduce was never intended to be a database system. It is not SQL compliant and does not easily interface with existing databases. It was also never intended to be a complete solution over structured data. For those companies where the requirements are relaxed for the database, this is fine. If all that is required is support for unstructured data, brute force analysis works really well. The problem is that a lot of enterprise data is coming from operational data stores where it is more structured. Attempting to use MapReduce techniques beyond their comfort zone will yield disappointing performance results. What is required is a hybrid solution that supports massively parallel processing to achieve the desired scale and performance while retaining relational capabilities and improving upon MapReduce-style techniques.

Changing Face of Business Intelligence

Posted by Esmeralda Swartz on March 30, 2010 3:30pm EDT Comments (0)

One of the biggest challenges faced by today’s enterprises is extracting different types of data from a growing number of data sources in a meaningful and timely manner. For performance as well as economic reasons, parallelization across hundreds or even thousands of servers is required to reduce analysis time from hours to minutes or seconds. BI tools have helped users understand what happened in the past by analyzing historical information, typically from a data warehouse. The problem is that this intelligence often becomes stale before reaching the decision maker. Waiting a couple of hours for batch loading is no longer acceptable; the time between when information is created and when analysis is presented to the decision maker needs to decrease from hours to minutes or seconds. Organizations want to use BI to connect the dots to support operational decision making. Data is being ingested at ever increasing rates, with some enterprises importing tens to hundreds of gigabytes of data per day from across the enterprise. Increasing volumes of data are flowing from the day-to-day ’database of record’ into the BI OLAP part of the world.

Shared-nothing database architectures have been used for many years in the BI and Analytics area as they are ideally suited to scaling and could support large amounts of data. In the past, these solutions counted on the fact that analysis was primarily read-only with occasional batches. However, today there is no longer a good time to stop loading and perform analysis. BI applications are becoming similar to Web 2.0 applications with many concurrent users and sophisticated integration and presentation of data from many disparate sources. Increasingly, they must support massive amounts of structured and unstructured data. As BI becomes part of the customer portal, it is dealing with live data and requires immediate response. The transition of BI from an offline batch activity to an online interactive activity is causing strain on existing BI databases.

The Genie is Out of the Bottle

Posted by Esmeralda Swartz on October 23, 2009 2:45pm EDT Comments (0)

The hype around the cloud can be very distracting and provides plenty of ammunition for cloud bashers, pundits, and the like. However, it is far more interesting to look at the trends that are spilling over from the cloud to the enterprise, to fully appreciate the impact. Enterprises already make extensive use of virtualization to abstract the software and operating systems from the hardware. The current state of the art of the public cloud makes use of federated components, albeit with weak or non-existent SLAs. And of course one can’t talk about the cloud without mentioning open source software. Today, the lack of guarantees around performance makes it impossible for enterprise IT to take the cloud seriously for any mission critical applications. There is a fork in the road: do you improve the basic storage, processing, and communications components, or do you add a software layer to compensate for the shortcomings? The market has spoken on this topic; you cannot destroy the economics. The performance of the federated components must be improved for the enterprise while maintaining the cloud economics. This is a very simple, yet powerful statement when you consider the implications back into the enterprise. In the old days, SIs added value by providing a turn-key solution that included expensive hardware optimized and tuned for the software. The genie cannot be put back in the bottle. The cloud will completely re-shape the hardware and software ecosystems for the enterprise.

A Tale of Two Camps

Posted by Esmeralda Swartz on October 18, 2009 4:59pm EDT Comments (0)

In the Sept 12, 2009 blog, we discussed some of the reasons why current cloud solutions are not compatible with Enterprise IT requirements. In this month’s blogs, we begin to take a look at why enterprise database software is not suited to a distributed environment. Web scale out, geographic distribution, transactions per second, latency, and throughput pressures are causing fragmentation on the database in both the cloud and the enterprise. However, the most significant cause of fragmentation is the fact that the cloud offers low-cost commodity hardware which forces a distributed implementation.

In the early days, the SQL database was synonymous with the application and it was both data repository and report generator. It grew up over the years to take on many features ranging from the data model to APIs, to tools and conventions that took years to develop. Incumbent software vendors have spent a lot of time over the years defining the database. The conundrum is clear: distributed environments (i.e. the cloud) bring unique challenges that are not suited to the traditional SQL database, mainly in the scale-out and performance area. How do enterprise software vendors meet the unique requirements of a cloud environment and not sacrifice their current market position? The answer is they can’t, at least not cleanly. Their business is built upon having the richest, most full-featured database to up-sell to customers. Maintaining continuity of software products at the high end requires expensive hardware to run the database software. How do they take non-federated high-SLA software building blocks and move them to the cloud when the cloud is about utilizing federated low-SLA building block components? Converting a high-grade software building block to a low-grade one without changing anything is not the answer. A famous quote from a large software vendor that shall remain nameless illustrates the point: “We’ll make cloud computing announcements. I’m not going to fight this thing. But I don’t understand what we would do differently in light of cloud.” This vendor utilizes a processor virtual core de-rating factor to lower the price of the database in the cloud without changing the list price. Why? If the software is the same and the price is lower in the cloud, it would have the negative impact of resetting the price for enterprise customers.

The early cloud camp had a clean sheet of paper and could focus on solving the scale-out problem. They made technology choices based on what they knew how to do along with the dominant storage usage patterns from early customers. This greatly reduced the need to rely on relational storage for scaling out systems. As a result, processing, storage, and communications lag behind the enterprise. The cloud camp now has a legacy technology and customer base to support, and they need to improve the quality of their federated offerings to provide SLAs. Hardening the cloud components is the only way the market can begin to expand beyond very simple, non-mission-critical applications. Today, the relational database camp gives you transactions but not scaling and performance in the cloud, and the distributed storage camp gives you scale and performance at the expense of transactions.

Compatibility with Enterprise IT

Posted by Esmeralda Swartz on September 12, 2009 6:42pm EDT Comments (0)

We talk a lot about current cloud solutions not being compatible with Enterprise IT requirements. Where do we think the divergence is most apparent? In our monthly blogs, we’ll be looking at many areas where this is the case, but a good starting point is the well-known CAP Theorem, which states that three core system requirements trade off against one another when designing and deploying applications in a distributed environment. These attributes are Consistency, Availability, and Partition Tolerance (CAP). A system can guarantee at most two of the three at the same time.
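To make the trade-off concrete, here is a minimal sketch of two replicated sites joined by a link that can be cut. The class name `ReplicaPair` and the `"CP"`/`"AP"` mode labels are illustrative assumptions, not taken from any product. In CP mode a write during a partition is refused, so the sites stay consistent but are not available; in AP mode the write is accepted locally, so the site stays available but the two copies diverge.

```python
class ReplicaPair:
    """Toy model: two sites ("a" and "b") that replicate every write."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True while the link between sites is down

    def write(self, site, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Keep consistency: refuse the write, giving up availability.
                raise RuntimeError("write refused: sites are partitioned")
            # Keep availability: accept locally, giving up consistency.
            self.replicas[site][key] = value
            return
        # Healthy link: every site sees the write.
        for store in self.replicas.values():
            store[key] = value

    def read(self, site, key):
        return self.replicas[site][key]
```

A storefront would lean toward the AP behavior (the page still renders during a partition), while a ledger would lean toward the CP behavior (no write is better than a conflicting write).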

The cloud camp needed availability and partition tolerance more than consistency; the storefront can’t go away when data centers are partitioned. We discuss the CAP Theorem in the overview section of the Lightwolf website. The Enterprise IT camp chose consistency and availability, and deals with partition tolerance via deferred updates (persistent message queues).
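The deferred-update approach can be sketched as follows. This is a toy illustration under stated assumptions: the class `DeferredUpdateQueue`, the `pending` table, and the use of SQLite as the durable store are all hypothetical choices made for the example, not a description of any particular message-queue product. Updates that cannot reach the remote site are written to durable storage and replayed in order once the link is back.

```python
import json
import sqlite3


class DeferredUpdateQueue:
    """Toy persistent queue: updates survive until they are applied."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive a process restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, update):
        """Record an update that could not be delivered yet."""
        self.db.execute(
            "INSERT INTO pending (body) VALUES (?)", (json.dumps(update),)
        )
        self.db.commit()

    def drain(self, apply):
        """Replay queued updates in arrival order; return how many ran."""
        rows = self.db.execute(
            "SELECT id, body FROM pending ORDER BY id"
        ).fetchall()
        for row_id, body in rows:
            apply(json.loads(body))
            # Delete only after apply() succeeds, so a failure leaves the
            # update queued (at-least-once delivery).
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            self.db.commit()
        return len(rows)


if __name__ == "__main__":
    remote = {}
    queue = DeferredUpdateQueue()
    queue.enqueue({"key": "order-1", "value": "paid"})     # link is down
    queue.drain(lambda u: remote.update({u["key"]: u["value"]}))  # link is back
```

Because an update is removed only after it has been applied, the receiving side should tolerate seeing the same update twice.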

Enterprise IT has also become synonymous with the use of an Enterprise Framework and enterprises have embraced their choice of framework for coding applications.

The diagram below of an Enterprise stack illustrates the interaction between the various layers to help take apart the problem. This is important when looking at the various band-aids that have been applied to the same underlying problem.

At the top of the stack are applications. Below applications, Enterprise Frameworks like Spring, ASP.NET, or EJB emerged. The Enterprise Framework exposed a persistence abstraction like Hibernate. This persistence abstraction did a good job of hiding the database you were talking to. For performance reasons, caches were added, but the uniqueness of each cache was exposed to the application layer. This caused a ripple-up effect on the application. The cloud camp’s choice of two under the CAP Theorem threw away consistency, which also caused an application-visible effect.

New Year, Same Bottleneck

In the beginning, Hibernate started off as an object-relational mapping for the database and talked directly to the database. Hibernate is the most popular persistence abstraction, but it is representative of larger Enterprise IT (e.g. J2EE). With web scale-out, issues began to surface as large numbers of sessions attempted to talk to the database, which was not designed for hundreds of thousands of concurrent users. To compensate, Hibernate started modeling sessions, and sessions within a single web server could share a database connection. This led to another problem: you still needed to talk to the database, and sessions still needed to know how to share a database connection. The database-facing session needed to get in and get out for the most efficient sharing of the connection, whereas the user session at the web portal can be measured in minutes or longer. Sessions in Hibernate evolved into transactions. A problem occurred as the session cache went away at the end of the transactional session and you needed to go back to the database for the user information. This problem was solved by adding an L2 cache. The current problem is that Enterprise IT is a continuum from small users to large users. The original purpose of the L2 cache was not sharing distributed data, so the database is still the bottleneck under scale-out conditions. You need to be able to share data amongst multiple web servers without going to the database. Hibernate was never designed for this purpose, and the L2 cache was designed for running a single web server. This problem is not solved well by existing “solutions.” Solutions need to be purpose-built for the Enterprise pattern of use while providing access to cloud computing resources.
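The cache layering described above can be modeled with a short toy sketch. It does not use Hibernate’s API; the classes `Database`, `WebServer`, and `Session` and their methods are hypothetical names invented for illustration. Each session has a first-level cache that disappears when the session ends, and each web server has its own second-level (L2) cache shared by its sessions. Because the L2 cache is local to one server, a second server neither benefits from it nor learns about updates made through it.

```python
class Database:
    """Stand-in for the shared relational database."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # counts round trips, the scarce resource

    def get(self, key):
        self.reads += 1
        return self.rows[key]

    def put(self, key, value):
        self.rows[key] = value


class WebServer:
    def __init__(self, db):
        self.db = db
        self.l2 = {}  # second-level cache: visible on THIS server only

    def open_session(self):
        return Session(self)


class Session:
    def __init__(self, server):
        self.server = server
        self.l1 = {}  # first-level cache: gone when the session ends

    def load(self, key):
        if key in self.l1:
            return self.l1[key]
        if key not in self.server.l2:
            # Miss in both caches: go back to the database.
            self.server.l2[key] = self.server.db.get(key)
        self.l1[key] = self.server.l2[key]
        return self.l1[key]

    def save(self, key, value):
        # Write through to the database and to the local caches only.
        self.server.db.put(key, value)
        self.server.l2[key] = value
        self.l1[key] = value
```

With two `WebServer` instances over one `Database`, a value saved through the first server remains stale in the second server’s L2 cache until something evicts it, which is the single-server assumption the paragraph points at.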

Why Does the Public Cloud Look Like it Does?

Posted by Esmeralda Swartz on September 11, 2009 5:30pm EDT Comments (0)

Today, we are launching Lightwolf’s first blog. We will be updating this blog at least once a month–sooner, if we believe strongly in a topic.

IDC says cloud computing is “an opportunity in its infancy, but, even conservatively, poised to drive big marginal growth.” Burton Group adds that “the best organizations will use cloud computing’s unique business model, elasticity and scalability to streamline IT operations, offload lesser-value IT processes and focus on driving core business value.” Cory Doctorow writes “the main attraction of the cloud to investors and entrepreneurs is the idea of making money from you, on a recurring, perpetual basis, for something you currently get for a flat rate or for free without having to give up the money or privacy that cloud companies hope to leverage into fortunes.”

The pundits all agree on the future potential, but it is worth taking a look at what it will take to move the cloud from its current boutique role to a general-purpose solution for Enterprise IT. Today, one need only go out to Silicon Valley to realize that the hype far outweighs the reality. Whether they have a cloud solution or not, everyone wants to stake out a position in the next big thing. I am reminded of a time not so long ago when companies were springing up in the same way to take advantage of the next evolution of the Internet. Cloud computing is being touted as the next evolution not just of the Internet, but of computing as well. No question, the potential is there, but we must realize that the “build it and they will come” mentality has been proven time and time again to be risky. In order for the cloud to become a general-purpose solution for IT, the thousands of man years of best practices and the Enterprise IT table stakes must be taken into account.

How did we arrive at today’s definition of the public cloud, anyway? The industry largely associates the public cloud with Amazon. The following represents my own observations and opinions on how Amazon shaped the public cloud definition.

The Amazon Cloud Case Study

The Amazon cloud was the result of many internal innovations dictated by Amazon’s business model, which drove its incremental migration to the cloud. Its first storefront, for books, required innovation around massive scale of infrastructure to support the desired market penetration. The infrastructure was critical, as Amazon’s responsiveness at the storefront was its competitive differentiation.

As other storefronts emerged (e.g. music and video), the Amazon innovation was around the path to closing the sale (a Jukebox with 30-second samples to try before buying). This resulted in more bits to stream (the Amazon player in the web browser), and requirements for data replication, caching, and geographic locality exploded. Amazon then needed to develop internal technology to support the storefront’s ease of use (hit play and get an immediate response). The Amazon player and distributed key-value stores were infrastructure innovations.

Now that the infrastructure was in place, the focus turned to something Amazon knew how to do really well: personalization of the user experience. Once again, innovation was required, and this time it was around making suggestions based on buying patterns (correlated buys based on data mining). Amazon needed massive data analytics for access to buying patterns.

This is when it got really interesting: the seasonality of Amazon’s business (e.g. the holiday rush) resulted in excess data center capacity throughout the rest of the year. To recover that cost, Amazon innovated around the packaging of capacity, moving from the industry norm of long-term capacity commitments to rent-by-the-hour. They turned a liability into a new line of business. This meant that development in operations support (virtualization, automated turn-up/tear-down, performance monitoring) was needed to support rent-by-the-hour. Amazon transferred its distributed systems innovations from its mainline business to its Elastic Compute Cloud (EC2) offering.

Once the new line of business was established, Amazon once again focused on ease-of-use. The focus turned to creating new applications for its cloud platform (easier to program) and enabling a broader range of applications.

Why was it important to go through the Amazon case study? Because once the industry associated the public cloud with the Amazon model, we saw attempts to pattern match other massive scale-outs as targets for the cloud (social networking, search, Apple’s App Store, etc.). The back office also needs to scale, so why can’t the back office use current massive scale-out solutions from the public cloud? The answer is simple. In order to achieve massive scale-out, other requirements were relaxed along the way. We believe that you have to start with Enterprise IT concepts and adapt these to the cloud:

  • Security (including physical)
  • Authentication, Authorization, Accounting (and Audit)
  • ACID database, transactions
  • Compatibility with existing code, design style
  • High-availability, non-stop operation
  • On-line processing
  • Consistency/Coherency
  • Thousands of man years of best practices