Tuesday, October 14, 2003

Business Intelligence

by Asim Jalis

1. The most tantalizing and interesting problem in the web services space might be business intelligence. How can web services be used to promote it? What are the interesting problems in this space?

2. The other interesting area to continue to focus on is security scenarios. When companies cooperate and integrate their systems with each other, what are the use cases? It might be interesting to play through some of those scenarios and see what we come up with.

3. Let's work on business intelligence first.

4. One critical element of BI will be forecasting. How does this fit into web services management? Well, if we can assume that WS calls take about the same amount of time every time, or that their durations follow a normal distribution, and if we know the dependence relationships between WS endpoints, then it might be possible to forecast how long a certain call will take. A probability distribution could be created.

5. The main application here is forecasting, or advanced data analytics.

6. The web services management framework publishes considerable information. Using advanced analytics we can put it all together, which might generate some good insights.

7. We can combine relationship and performance data to forecast future performance.

8. We could also just use the existing performance data to forecast future performance. It might be possible to correlate performance with time of day. To do all this data mining it is important to log all the performance data, not just as performance counters, but as a file, with the date and time of each call.

9. The disk space could be recycled periodically, through web services. In the meantime the files should be compressed using gzip.

10. What other forecasting opportunities are there?

11. Wait. This kind of forecasting could even be useful for simple HTML web pages. It does not have to be specialized to web services. Anything served by IIS can be measured and then predicted.
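To make the forecasting idea in item 4 concrete, here is a minimal sketch (in Python, for brevity). The log format, the GetQuote method name, and all figures are invented; it simply fits a normal distribution to logged call durations and computes a one-sided 95% bound under that assumption:

```python
import math

# Item 4 sketched: fit a normal distribution to logged call durations.
# Log format, method name, and figures are all invented.
log_lines = [
    "2003-10-14T09:15:02 GetQuote 238",
    "2003-10-14T09:16:10 GetQuote 251",
    "2003-10-14T09:17:45 GetQuote 244",
    "2003-10-14T09:18:30 GetQuote 259",
]

def forecast(lines, method):
    """Return (mean, stddev, 95% one-sided upper bound) for the
    durations of one method, assuming they are normally distributed."""
    durations = [float(ms) for stamp, name, ms in
                 (line.split() for line in lines) if name == method]
    n = len(durations)
    mean = sum(durations) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in durations) / n)
    # Under the normality assumption, ~95% of future calls should
    # finish within mean + 1.645 * std.
    return mean, std, mean + 1.645 * std

mean, std, upper = forecast(log_lines, "GetQuote")
```

This is exactly why the log needs per-call date-time stamps rather than rolled-up performance counters: the raw durations are what the distribution is fitted to.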
12. We can also measure and forecast the number of faults.

13. I can see definite management value in all of this information.

14. This performance data could be correlated with events external to the web services. For example, when a new customer is added, performance might degrade. Now it will be possible to measure exactly how much performance degraded after a certain date.

15. We could offer performance measurements for different periods.

16. Basically this would be a query, analysis, and reporting tool.

17. One of the things missing in the current suite of web services management is reporting. Managers love regular (daily, weekly, monthly, yearly) reports. The system should be able to generate these reports for any selected time frame. This, in fact, will be the business intelligence element.

18. The regular reports could cover all kinds of web traffic, not just web services. Similarly, the performance measurement and forecasting should also cover all kinds of web traffic, not just web services.

19. We have identified things managers would be interested in querying: average performance by time period, by web service, and by web method; and other causal factors in performance degradation.

20. There is another point I have wanted to make for some time: instead of focusing on web services management we should focus on HTTP transport management. The problems are essentially the same. By generalizing the problem, the solution can address a pre-existing market and solve its real problems.

21. Another area managers will want reports on: which nonexistent web services or HTTP endpoints were hit, generating 404 errors. Similarly they can look for 403 (forbidden) errors. These kinds of reports, complete with where the attacks came from, might be useful in profiling future attackers and dealing with such attacks pre-emptively.
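The queries in items 19 and 21 might look something like the sketch below, assuming log records of (hour of day, service, HTTP status, duration); the data is invented:

```python
from collections import defaultdict

# Hypothetical access records: (hour of day, service, HTTP status,
# duration in ms). All figures invented.
records = [
    (9, "Quotes", 200, 240),
    (9, "Quotes", 200, 260),
    (14, "Quotes", 200, 400),
    (14, "Orders", 404, 0),
    (14, "Orders", 403, 0),
]

def hourly_average(records, service):
    """Average duration per hour of day for one service -- the raw
    material for correlating performance with time of day (item 8)."""
    buckets = defaultdict(list)
    for hour, name, status, ms in records:
        if name == service and status == 200:
            buckets[hour].append(ms)
    return {h: sum(v) / len(v) for h, v in buckets.items()}

def error_report(records):
    """Count 404s (hits on nonexistent endpoints) and 403s (item 21)."""
    errors = defaultdict(int)
    for hour, name, status, ms in records:
        if status in (403, 404):
            errors[(name, status)] += 1
    return dict(errors)

averages = hourly_average(records, "Quotes")
errors = error_report(records)
```

The same aggregation works for any HTTP traffic, which supports item 20's point about generalizing from web services to HTTP transport management.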
22. In web pages, but not in web services, there is the concept of the href link. Thus pages link to each other. There is no analogous concept in web services. This might be a critical issue for web services. The reason the web took off where many other approaches did not is this easy and intuitive ability to name resources over the web and then to pass around references to them.

23. The web and the C programming language have both benefited greatly from pointers, and owe much of their power to the subtle and philosophically deep concept of a pointer. Web services, lacking pointers, will always be the poor cousin of the W3 family of protocols. Or more likely, the rich but unpopular cousin. The only thing going for him right now is money.

24. The main elements of a WS management framework are: (a) properties (stateful queries with time-invariant keys), (b) logs (queries with time-dependent keys), (c) events, (d) discovery of structure.

25. The main kinds of information available at a WS management endpoint are: (a) information about the structure or topology of the WS network, (b) information about past events.

26. As always, what we have here is the duality of space and time, of structure and function.

27. Incidentally, href links on the web make sense because all URLs support two basic operations: GET and POST. So as soon as a browser sees a link it knows what to do with it. It has only two choices. In web services, even when you have links, such as pointers to other web services, it is not possible for the browser to know what to do with the link. The browser needs to know a lot more about the link -- about the SOAP endpoint -- to know what to do.

28. This is part of the reason for the popularity of RSS-style feeds. They are more like the web in that they support GET, and less like web services, because they don't support arbitrary functions the way web services do.
An RSS feed says: here is an XML document, go ahead and GET it, and then see if you can figure out what to do with it.

29. The RSS feed does not expose a set of operations through a WSDL. It supports only one operation: GET.

30. The logging-based web services management framework I have proposed elsewhere has the neat quality that it also lets you get away with using just GET, much like RSS feeds.

31. This makes sense in a way. After all, web services management is ultimately related to the semantic web. You are associating meaning and meta-information with web services. So it makes sense to use RSS-like concepts here. RSS is used to publish information about a website. Similarly, a WS management framework is used to publish information about a web service. There is a similarity here.

32. What other forecasting and data mining could we do? We have structure and function information. All structural information could be rendered as logs too -- logs of the form (a) web service file added, (b) web service file deleted. However, this seems awkward. Representing structural information without log files seems to make the most sense.

33. Reports could state how many web services are published across the whole enterprise, and how many per machine.

34. Reports could also give usage statistics: which are the top 10 web services or web methods? I suspect the usage distribution will satisfy some form of the 80-20 rule and will be of a Pareto variety.

35. Once you aggregate the information and compare web services with each other, many possibilities open up. For example: how slow is this web service compared to other web services?

36. Besides dependence we can also ask: how many times does this web service call that other web service?

37. Based on the scenario in 36, it makes sense to log both incoming and outgoing calls. If one of our web services is calling out to another web service too frequently, that is something to investigate.
For example, if service A is called 10 times but called service B 200 times, that suggests that service A might be calling service B 20 times per call. This suggests an easy opportunity for design improvement.

38. These 80-20 or Pareto reports could be extremely useful to an IT manager in deciding which web services to speed up. They will help the organization identify its bottlenecks. There is no point optimizing services that are already fast enough. There is also no point throwing hardware at such web services. However, there is a lot of value in speeding up the slowest web services.

39. There is a tension here between speed and popularity -- how many times a web service is called. Clearly a web service that is not called a lot can afford to be somewhat slow. But one that is called a lot should be fast.

40. A simple way to combine these measurements is to look at total execution time per month (or per some period). This can be computed by adding together the computation times of all the individual calls. A simpler way to do it (with some tolerable loss of accuracy) is to multiply the average computation time of the web service by the number of times it was called.

41. When I say computation time I mean response time or execution time. I am not sure whether I want to include the network latency in this or not. It might be interesting to have both statistics -- the total time with the network latency and without it.

42. Note that we now have logging capabilities for both incoming and outgoing messages. We log outgoing messages with SOAP extensions. We could log total computation time including network latency in the logging routine of the outgoing messages.

43. The outgoing logs will have to be different from the incoming logs. When a message is incoming you don't know who it came from. It's just a SOAP message over a wire. It carries no information about its source.
However, in outgoing logs we know who the sender is and who the target or receiver is. So we will need to log these separately.

44. If we can make some assumptions about the synchronization of the clocks, then we could merge these logs from different machines and draw a dependency graph, which could show the breakdown of the execution time across the different web services.

Push-Based Notifications Considered Harmful

45. Here is the reason I am opposed to active (or push-based) notification: it generates too much network traffic. The people who have argued for it claim that it is necessary for issuing urgent alerts when an exceptional condition occurs.

46. I think there are two different value propositions colliding here: (a) immediate-term error handling and fire-fighting, (b) longer-term performance improvement.

47. Push-based events make sense for fire-fighting. They are like fire alarms in a building. You want them to blast as loudly and as quickly as possible. There is a fire. It has to be put out immediately or all hell will break loose (if it hasn't already).

48. On a real network this might correspond to a virus attack or some other extremely unusual circumstance. If these kinds of alarms go off on a regular basis, then the organization has serious problems.

49. Most real and significant improvements to a network and to distributed applications require a more circumspect approach. In my mind this is the real value proposition that can be made to a manager: we'll give you the reports that will allow you to reflect on where the bottlenecks are in your system and how you can solve them. We will show you how to do more with less.

50. Most systemic improvements require reflection. You don't want your fire alarms to go off just because the national crime rate has hit a new high, or because your company's earnings were below expectations. These are not unusual events.
They identify systemic problems that require subtle, well-thought-out solutions. Calling in the fire engines will not solve the problem.

51. The claim: a business creates real value for itself (in terms of cost savings, or better output with fewer resources) when it makes systemic improvements, not when it puts out fires. If there really is a network emergency -- for example, your network has been taken over by hackers -- then you want to shut it down. Hopefully this is not something that occurs every day. If it does occur every day, it might make more sense to take a few steps back and figure out what it is about your system that causes it to flare up so regularly.

52. It follows from 51 that the real money is in systemic analysis and reporting tools. Web services fire-fighting tools are useful to have but can only generate a fraction of the value that systemic tools can. While the cost of each fire might be quite high, the systemic tool will pay off very quickly because its value will compound. Each systemic improvement will make it easier to see the next one.

53. As the inventory clears up around the bottlenecks, new bottlenecks become apparent (see Eli Goldratt's The Goal).

54. The problem with push-based events is that they generate web traffic at odd times. If the performance of an application has sunk to a new low, it is probably because the network is experiencing heavy traffic. The worst thing to do in this case is to send out more messages announcing how bad things are. This is almost like screaming "fire" in a crowded theater.

55. Now, push-based events might make sense for a fire-fighting solution. Even though screaming fire in a crowded theater is dangerous, this is precisely what fire alarms do. You hope they warn you early enough that everyone has time to get out. (Now I am curious: do they have fire alarms in theaters?)

56. So in this exceptional circumstance they make sense.
But they should be used carefully. Special attention should be paid to ensure that they don't exacerbate the problem they are trying to warn everyone about.

57. Sending out as many push-based events as there are people registered for them sounds like complete insanity. At most a single event should be sent to the operator. At any one time only one push-event receiver should be identified. And this is all assuming that we want to sell a fire-fighting solution instead of a systemic problem-solver.

More Thoughts on Business Intelligence

58. In the absence of push events we are back to an RSS-style feed. Pull events are very much like RSS feeds.

59. What other kinds of information could we generate from the logs created by web activity?

60. We have performance data, we have call graphs, and all of these things are aggregated across the company.

61. Going back to the theme of systemic improvement, we can use this to keep nibbling at the low-hanging fruit: improve the web service that is called the most and that takes the longest.

62. When my hard drive is full, the way I deal with it is to delete the biggest files. Deleting five of the biggest files will have much more impact on the problem than deleting 100 tiny files. The same idea can be applied to web services. Optimizing the top 10 worst performers will have a much more remarkable effect on the performance of the network than working on any other web services.

63. Another interesting piece of information the manager might be interested in correlating with is the hardware specification of the machines he is using. For example, the machines with the least memory will probably have the worst performance. Over time the manager might want to migrate the least used services to the slowest machines (or the machines with the feeblest resources) and migrate the most sought-after services to faster machines. This is the entire premise of adaptive management.
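Items 40 and 62 combine naturally into one ranking: approximate each service's total execution time per period as average call time times call count, then surface the biggest consumers. A sketch with invented figures:

```python
# Items 40 and 62 combined: approximate each service's total time
# per month as average call time x call count, then rank the worst
# offenders -- the "delete the biggest files" heuristic. All figures
# are invented.
services = {
    # name: (average ms per call, calls per month)
    "Quotes": (250.0, 40000),
    "Orders": (900.0, 2000),
    "Inventory": (70.0, 5000),
    "Reports": (3000.0, 30),
}

totals = {name: avg * calls for name, (avg, calls) in services.items()}

def top_offenders(totals, n):
    """The n services consuming the most total execution time."""
    return sorted(totals, key=totals.get, reverse=True)[:n]

worst = top_offenders(totals, 2)
# Quotes tops the list despite a modest per-call time: popularity
# matters as much as raw speed (item 39).
```

Note how the ranking by total time differs from a ranking by per-call time alone: Reports is the slowest per call yet the cheapest overall, which is the tension item 39 describes.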
64. Presumably there are other tools out there which can automatically move services between machines and do load balancing, all based on the statistics published by the management framework.

65. But can we do this in real time? Does avoiding push events mean we are always locked out of the real-time space? Not really. The reason is that with pulled events the manager decides when to pull them. The manager here might be a person or a really smart program -- it is the entity that tries to optimize IT operations by allocating resources. With pull events the manager, who understands the global picture of the whole system, pulls the events. He can still do this in real time. He could poll services every few minutes to get a near-real-time view of WS usage. And he is in a much better position to decide when the network can take the extra traffic that the events will generate than the isolated node, which has a local view of itself but does not understand what is going on around it.

66. The business intelligence server should also support arbitrary queries against the data sets. It might make sense to store the data sets as queryable database files. The queries could be executed using web services. The table names could have the form host/table. This way it will be possible to refer to the same table on different machines. For example, the events table on machine A could be joined and queried with the events table on machine B.

67. The BI server could also be integrated with other BI solutions, such as the one from MicroStrategy.

68. Could we fit neural networks into this somehow? Is there an application here for neural networks?

69. Is there an application here for optimization algorithms?

70. The framework could create different scenarios for the manager and help him predict the effect of various allocation decisions on performance. The framework could, for example, run simulations.
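The pull model of item 65 can be sketched as a polling round in which the manager, not the node, decides whether the network can take the traffic. The fetch_events and network_busy functions here are stand-ins for real implementations:

```python
# Item 65 sketched: a pull-based manager that decides when to poll,
# and backs off when it judges the network busy. fetch_events and
# network_busy are stand-ins for real implementations.
def poll_round(nodes, fetch_events, network_busy):
    if network_busy():
        return {}   # defer; the nodes stay silent until the next round
    return {node: fetch_events(node) for node in nodes}

# Toy usage with stubbed-out functions:
events = poll_round(
    ["hostA", "hostB"],
    fetch_events=lambda node: ["%s: ok" % node],
    network_busy=lambda: False,
)
```

The key design point is that the back-off decision sits with the entity that sees the whole network, which is exactly what a pushing node cannot do.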
71. The framework could also create a model of the network, simulate and test different organizations, and recommend its own optimal solutions.

72. These programs would be separate solutions, but they would integrate with the data feeds provided by the management framework.

73. Sometimes intensive planning and forethought miss important points about reality. So the framework could support an adaptive approach. Instead of moving services to the best possible machine, perturb the system slightly and see what effect that has. The system would use gradual hill-climbing. It would incrementally improve itself, getting closer and closer to optimal.

74. The system could include scheduled reports. A scheduler creates reports at specific times. These can be considered snapshots of the system. The system should have facilities to take these kinds of snapshots.

75. The system could also support historical charts, i.e. reports that give a historical perspective on the system. Managers can use these to see how things have been getting better (or worse) over time. Things might be getting worse because the number of customers using the system keeps increasing. Sales tries to get as many customers in the door as possible; meanwhile IS's resources get crunched. These historical charts will help IS managers ask for more money and receive it. The IS manager can argue that in the last 3 months the usage of the system increased by 30%, which correlates with a 30% increase in users; to support 30% more users the organization needs to buy 30% more computing power. The business managers will see the immediate value of such an investment when they see the performance degradation and how it ties in with the growing customer base.

76. The organization could compare itself to its own past and see how it is doing.

77. It might also be interesting to plot equipment quality and quantity against performance over time. This might be useful in justifying future sales.
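Item 73's gradual hill-climbing can be sketched on a toy problem. The cost function below stands in for a real performance measurement of the network, and the perturbation for a small reallocation:

```python
import random

# Item 73 sketched: perturb slightly, keep the change only if the
# measured cost improves. The toy cost function stands in for a real
# performance measurement of the network.
def hill_climb(allocation, cost, perturb, steps, seed=0):
    rng = random.Random(seed)
    best, best_cost = allocation, cost(allocation)
    for _ in range(steps):
        candidate = perturb(best, rng)
        c = cost(candidate)
        if c < best_cost:        # keep only strict improvements
            best, best_cost = candidate, c
    return best, best_cost

# Toy problem: find x minimizing (x - 7)^2 by +/-1 perturbations.
result, final_cost = hill_climb(
    allocation=0,
    cost=lambda x: (x - 7) ** 2,
    perturb=lambda x, rng: x + rng.choice([-1, 1]),
    steps=200,
)
```

Because only improvements are kept, the system never makes things worse, which is exactly the appeal of perturbing over planning when the model of reality is incomplete.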
78. The management framework will put tools at the IS manager's disposal that will allow him to get the organization focused on the importance of IT. Instead of begging for IT resources, he will be able to show business managers the intimate connection between profits and IT resources, and will have managers throwing money at him to bring the IT systems up to par with the organization's needs.

79. Now tell me this: if you gave an IS manager a choice between a tool which would help him fight fires and a tool which could help him double the size of his department, which one would he choose? The value proposition of the BI tool should be obvious.

80. The beauty of this model is that it can be integrated easily with a web services management framework. All performance metrics are stored as events which the aggregator can pull (or poll) whenever he wants.

81. However, this is somewhat different from any vision of web services management that I have seen elsewhere.

82. All of this could be implemented through a SOAP extension. It has access to both incoming and outgoing messages. The deployment will be super simple: just drop the DLL on the machines and make a tiny tweak to the web.config files of the services to be monitored.

Thoughts on Security

83. Later on we could restrict who sees which reports. It is possible that the organization does not want everyone to know which system is used the most. This might be a proprietary trade secret. It might give competitors some idea of where they make all their money.

84. For this we will need a security architecture. There is an initial admin. He can create other accounts and give them privileges. They in turn can create more accounts. The admin can delete accounts. Other accounts can delete their own descendants, but not others. All accounts except the admin's can have a finite lifetime. The admin is like a supreme deity: he creates but was not created; he gives life and death.
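The account tree in item 84 can be sketched with parent links that record who created whom; deletion rights then reduce to a walk up the tree. All account names here are illustrative:

```python
# Item 84 sketched: a tree of accounts where parent links record who
# created whom. An account may delete only its own descendants; the
# admin may delete anyone. Names are illustrative.
parents = {"admin": None}   # account -> creator

def create(creator, name):
    parents[name] = creator

def is_descendant(name, ancestor):
    node = parents.get(name)
    while node is not None:
        if node == ancestor:
            return True
        node = parents.get(node)
    return False

def can_delete(actor, target):
    return actor == "admin" or is_descendant(target, actor)

create("admin", "alice")
create("alice", "bob")
```

So alice can delete bob (her descendant), but bob cannot delete alice, and the admin can delete anyone.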
85. To make this slightly more complex, creating a new account might require approval from N existing accounts. All of these will be non-admin accounts, since the admin can singly create any account he wants.

86. In a polytheistic universe there might be M admins, and at least m of them have to approve for a new account to be created. This way a single megalomaniacal admin does not get absolute power.

87. When the admin is laid off, a new admin must take over. So admins can create other admins. Or a majority of admins can vote another admin in. Requiring a majority will prevent factions from developing, where each party tries to get its own friends in.

88. People can be added as accounts based on their NT passwords. This way they will not need to log in multiple times. The system can use people's NT accounts as keys.

89. Besides people, programs will also need accounts. Programs can be a separate account category. They are voted in by admins (by some quorum of admins). Each program account is issued to a single programmer or to a group of programmers. The program's account is associated with its programmers' accounts. However, when a programmer is laid off or leaves, the account continues to live on. New programmers can be attached to a program's account over time as they take on the maintenance of the program. For a program to continue running it could require at least one current employee to take responsibility for it. The idea is that programs run as proxies for their programmers. Someone has to take responsibility and ownership for each program.

90. I am not sure whether we should have roles or not. Roles are generally confusing. It's hard to remember which role confers which privileges. I would instead go with a privilege-based system, using ACLs. Each program has a list of people who are allowed to run it. Only those people are allowed to run the program. This way different entry points into the system can be created for different people.
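Items 86 and 90 are both small enough to sketch together: an m-of-M admin quorum for account creation, and a plain ACL per program instead of roles. Account, admin, and program names are all illustrative:

```python
# Items 86 and 90 sketched together: an m-of-M admin quorum for new
# accounts, and a plain ACL per program instead of roles. Names are
# illustrative.
admins = {"alice", "bob", "carol"}

def account_approved(approvals, admins, m):
    """True if at least m distinct current admins approved."""
    return len(set(approvals) & admins) >= m

acls = {
    "UsageReport": {"alice", "dave"},
    "PurgeLogs": {"alice"},
}

def can_run(user, program):
    return user in acls.get(program, set())
```

Note that the quorum check deduplicates approvals and ignores non-admins, so a single admin cannot approve twice and an outsider's "approval" counts for nothing.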
91. The same kind of idea could be extended to web methods. For each web method, only certain people are allowed to run it. Again, identity is established through NT single sign-on.

92. The system should support self-created accounts. These are user accounts created without admin approval. In general, arbitrary NT account holders should have some privileges inside the system and should be able to see some reports. The system could allow them to see some documents and reports and to run some programs. This self-service model for non-admin-initiated users will help reduce the admin's workload.

Business Intelligence Toolkit Continued

93. An extremely important component of the toolkit will be the API documentation. The services provided by the toolkit should be easy to interoperate with and to call through DLLs or through the .NET Framework.

94. At each tier of the architecture the system is completely open and accessible through standards-based hooks such as web services.

95. Similar management ideas could be used to "improve" any traditional software. For example, bug-tracking software should report how many times a bug has been reported, by how many people, when the reports started coming in, what the overall efficiency of the organization is over time, and what the historical trends are.

96. In general there are two kinds of information that are accessible: snapshots and historical trends.

97. I just want to observe in passing that this process of fleshing out thoughts in detail, exploring alternatives, exploring value propositions -- all of this is extremely enjoyable (for me; and hopefully not too bad for you either). This seems like something that can remain sustainable for a long time.

98. The perception of what is important to an IS manager is based on Gartner's analysis of MicroStrategy.

99. A question some readers will pose, and validly, is: but why web services management?
Surely the argument you have made applies to managing all kinds of applications. Why would a manager buy a web services management solution instead of a solution that he can use to manage all his applications?

100. The answer: web services management is different because web services give you an easy-to-instrument control point into the application -- namely, the web service interface itself. In the past, instrumenting applications for manageability was error-prone and required some investment of development time. This kind of instrumentation is particularly difficult after the fact with proprietary, closed-source applications. This is not to suggest that it is easy with non-proprietary or open-source applications. In fact it is difficult with both. However, it is nearly impossible with closed applications.

101. The universal interface of web services allows all web services applications to be managed and monitored. WS management frameworks can thus shed much more light on the WS management space than was possible with traditional applications. In fact, WSM could be a value proposition of web services. With web services it becomes easy to measure your applications, which allows you, the IS manager, to approach your business managers with reams and reams of data describing the impact of business decisions on IS throughput. As the business managers make the connection, they will throw money and resources at you as never before.
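Item 100's point can be illustrated as a thin wrapper around a uniform call interface. In .NET this role is played by a SOAP extension (item 82); the decorator below is only an analogy, and the get_quote service is made up:

```python
import time

# Item 100 sketched as an analogy: because every call goes through
# one uniform interface, instrumentation is a thin wrapper around it.
# In .NET this role is played by a SOAP extension; the decorator
# below is only an illustration, and get_quote is a made-up service.
call_log = []

def instrumented(method):
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return method(*args, **kwargs)
        finally:
            elapsed_ms = (time.time() - start) * 1000.0
            call_log.append((method.__name__, elapsed_ms))
    return wrapper

@instrumented
def get_quote(symbol):
    return {"symbol": symbol, "price": 42.0}

get_quote("MSFT")
```

The service itself is untouched; the instrumentation lives entirely in the wrapper, which is why a uniform interface makes measurement nearly free while a closed application makes it nearly impossible.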