Asim Jalis: Business Intelligence

by Asim Jalis

1. The most tantalizing and interesting problem in the web
services space might be business intelligence. How can web
services be used to support it? What are some interesting
problems in this space?

2. The other interesting area to continue to focus on is security
scenarios. When companies cooperate and integrate their systems
with each other, what are the use cases associated with that? It
might be interesting to play through some of those scenarios to
see what we come up with.

3. Let's work on business intelligence first.

4. One critical element of BI will be forecasting. How does this
fit into web services management? Well, if we can assume that WS
calls take about the same amount of time every time, or that
their durations are normally distributed, and if we know the
dependence relationships between WS endpoints, then it might be
possible to forecast how long a certain call will take. We could
build a probability distribution for the call's duration.
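Under the assumptions just stated (durations that are independent and normally distributed), a call that invokes its dependencies one after another has a forecast that is easy to compute: the means add and the variances add. The following is a minimal sketch; the service names, the numbers, and the sequential-call assumption are all hypothetical.

```python
from statistics import NormalDist

# Hypothetical per-endpoint models of a service's own processing time (ms),
# e.g. fitted from logged call durations.
OWN_TIME = {
    "OrderService": NormalDist(mu=40, sigma=5),
    "InventoryService": NormalDist(mu=120, sigma=30),
    "PricingService": NormalDist(mu=60, sigma=10),
}

# Hypothetical dependency graph: each service calls these, one after another.
DEPENDS_ON = {
    "OrderService": ["InventoryService", "PricingService"],
    "InventoryService": [],
    "PricingService": [],
}

def forecast(service: str) -> NormalDist:
    """Distribution of the end-to-end duration of one call to `service`.

    Assumes independent, normally distributed durations and sequential
    (not parallel) calls to dependencies, so means and variances add.
    """
    total = OWN_TIME[service]
    for dep in DEPENDS_ON[service]:
        # Adding two NormalDist objects adds their means and their variances.
        total = total + forecast(dep)
    return total
```

Under these assumptions, `forecast("OrderService").inv_cdf(0.95)` gives the duration that 95% of calls should beat.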

5. The main application here is forecasting or advanced data
analytics.

6. The web services management framework publishes considerable
information. Using advanced analytics we can put it all together,
which might generate some good insights.

7. We can combine relationships and performance data to forecast
future performance.

8. We could also just use the existing performance data to
forecast future performance. It might be possible to correlate
performance with time of day. To do all this data mining it is
important to log all the performance data, not just as aggregate
performance counters but as a file of individual calls, each
stamped with the date and time when the call occurred.
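A per-call log with timestamps is what makes a time-of-day correlation possible, because aggregate counters throw away the information about when each call happened. A minimal sketch, assuming a hypothetical comma-separated format of ISO timestamp, service, web method, and duration in milliseconds:

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical log format: timestamp, service, web method, duration (ms).
SAMPLE_LOG = """\
2003-10-13T09:05:12,OrderService,PlaceOrder,110
2003-10-13T09:40:01,OrderService,PlaceOrder,130
2003-10-13T14:10:33,OrderService,PlaceOrder,310
2003-10-13T14:55:47,OrderService,PlaceOrder,290
"""

def mean_duration_by_hour(log_text: str) -> dict[int, float]:
    """Average call duration (ms) for each hour of the day seen in the log."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, _service, _method, duration in csv.reader(io.StringIO(log_text)):
        hour = datetime.fromisoformat(timestamp).hour
        totals[hour] += float(duration)
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}
```

On the sample log this reports an average of 120 ms in the 9 o'clock hour and 300 ms in the 14 o'clock hour.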

9. The log files could be purged periodically to recycle disk
space, with the purge itself triggered through web services. In
the meantime the files should be compressed using gzip.
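Compressing a closed log file takes only a few lines with Python's standard `gzip` module. In this sketch, the choice to delete the uncompressed original and the `.gz` naming convention are assumptions, and the scheduling of the rotation is not shown.

```python
import gzip
import os
import shutil

def compress_log(path: str) -> str:
    """Gzip `path` into `path + '.gz'`, delete the original, and return the new path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the file, so large logs are not loaded into memory
    os.remove(path)
    return gz_path
```

Analysis tools can still read the compressed logs directly with `gzip.open(gz_path, "rt")`.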

10. What other forecasting opportunities are there?

11. Wait. This kind of forecasting could even be useful for
simple HTML web pages. This does not have to be specialized to
web services. Anything that is served by IIS can be measured and
then predicted. 

12. We can also measure and forecast the number of faults.

13. I can see definite management value in all of this
information.

14. This performance data could be correlated with other events
external to the web services. For example, when a new customer is
added the performance might degrade. Now it will be possible to
measure exactly how much performance degraded after a
certain date.
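Measuring the degradation after a known date amounts to comparing average durations on either side of that date. This minimal sketch assumes the logs have already been reduced to (day, duration) pairs. It measures the change only; attributing that change to the new customer remains the manager's judgment.

```python
from datetime import date
from statistics import mean

def degradation_after(samples: list[tuple[date, float]], cutoff: date) -> float:
    """Percent change in mean duration on/after `cutoff` versus before it.

    `samples` holds (day, duration_ms) pairs; a positive result means slower.
    """
    before = [d for day, d in samples if day < cutoff]
    after = [d for day, d in samples if day >= cutoff]
    return (mean(after) - mean(before)) / mean(before) * 100
```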

15. We could offer performance measurements for different
periods.

16. Basically this would be a query, analysis and reporting tool.

17. One of the things missing in the current suite of web
services management tools is reporting. Managers love regular
(daily, weekly, monthly, yearly) reports. The system should have
the capability to generate these reports for the selected time
frame. This in fact will be the business intelligence element.

18. The regular reports could cover all kinds of web traffic, not
just web services. Similarly, the performance measurement and
forecasting should also cover all kinds of web traffic, not just
web services.

19. We have identified things managers would be interested in
querying: (a) average performance by time period, by web service,
and by web method; (b) other causal factors in performance
degradation.
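Reports like these, covering both the regular periods and the per-service, per-method breakdown, reduce to a group-by over the call log. In this sketch, the tuple layout of a call record and the period labels (ISO weeks for weekly reports) are assumptions.

```python
from collections import defaultdict
from datetime import datetime

def period_key(ts: datetime, period: str) -> str:
    """Label the reporting period that `ts` falls into."""
    if period == "daily":
        return ts.strftime("%Y-%m-%d")
    if period == "weekly":
        year, week, _ = ts.isocalendar()  # ISO year and week number
        return f"{year}-W{week:02d}"
    if period == "monthly":
        return ts.strftime("%Y-%m")
    if period == "yearly":
        return ts.strftime("%Y")
    raise ValueError(f"unknown period: {period}")

def average_report(calls, period: str) -> dict[tuple[str, str, str], float]:
    """Average duration per (period, service, web method).

    `calls` is an iterable of (timestamp, service, method, duration_ms).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, service, method, duration in calls:
        key = (period_key(ts, period), service, method)
        totals[key] += duration
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}
```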

20. There is another point I have wanted to make for some time:
instead of focusing on web services management we must focus on
HTTP transport management. The problems are essentially the same.
By generalizing the problem this way, the solution can address an
existing market and solve its real problems.

21. Another area managers will want reports on: which nonexistent
web services or HTTP endpoints were requested, generating 404
errors. Similarly, they can also look for 403 (Forbidden) errors,
which indicate access violations. These kinds of reports, complete
with where the attacks came from, might help profile future
attackers and deal with such attacks pre-emptively.
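A report of failed requests by client and path is a frequency count over the request log. This sketch assumes the web server log has already been parsed into (client IP, path, status) tuples. Parsing the actual IIS log fields is not shown.

```python
from collections import Counter

def error_report(requests, statuses=(403, 404), top: int = 5):
    """Most frequent (status, client, path) combinations among error responses.

    `requests` is an iterable of (client_ip, path, status) tuples.
    """
    hits = Counter(
        (status, client, path)
        for client, path, status in requests
        if status in statuses
    )
    return hits.most_common(top)
```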

22. In web pages, but not in web services, there is also the
concept of the href link. Thus pages link to each other. There is
no analogous concept in web services. This might be a critical
issue for web services. The web took off where many other
approaches did not because of this easy and intuitive ability to
name resources over the web and then to pass around references to
them.

23. The web and the C programming language have both benefited
greatly from pointers, and owe much of their power to the subtle
and philosophically deep concept of a pointer. Web services
lacking pointers will always be the poor cousin of the W3 family
of protocols. Or more likely, the rich but unpopular cousin. The
only thing going for him right now is money.

24. The main elements of a WS management framework are: (a)
Properties (stateful queries with time-invariant keys), (b) Logs
(queries with time-dependent keys), (c) Events, (d) Discovery of
structure.

25. The main kinds of information available at a WS management
endpoint are: (a) Information about the structure or the topology
of the WS network, (b) Information about past events.

26. As always what we have here is the duality of space and time,
of structure and function. 

27. Incidentally, href links on the web make sense because all
URLs support two basic operations: GET and POST. So as soon as a
browser sees a link it knows what to do with it. It has only two
choices. In web services, even when you have links, such as
pointers to other web services, it is not possible for the
browser to know what to do with the link. The browser needs to
know a lot more about the link -- about the SOAP endpoint -- to
know what to do.

28. This is part of the reason for the popularity of the
RSS-style feeds. They are more like the web in that they support
GET, and less like web services, because they don't support
arbitrary functions, like web services do. An RSS feed says: here
is an XML document, go ahead and GET it, and then see if you can
figure out what to do with it.

29. The RSS feed does not expose a set of operations through a
WSDL. It supports only one operation: GET.

30. The logging-based web services management framework I have
proposed elsewhere has the neat quality that it also allows you
to get away with using just GET, much like RSS feeds.
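To make the RSS analogy concrete, a node could render its recent log entries as an RSS 2.0 document that any client simply GETs. The element names used here (`rss`, `channel`, `item`, `title`, `link`, `description`, `pubDate`) come from RSS 2.0. The feed URL and the mapping of log fields onto items are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

def log_feed(title: str, link: str, entries) -> str:
    """Render log entries as an RSS 2.0 document.

    `entries` is an iterable of (timestamp, service, method, duration_ms).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Web service call log"
    for ts, service, method, duration in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{service}.{method}"
        ET.SubElement(item, "description").text = f"{duration} ms"
        ET.SubElement(item, "pubDate").text = format_datetime(ts)  # RFC 822-style date
    return ET.tostring(rss, encoding="unicode")
```

Serving this string in response to a GET is all a node would need to do. The client decides what to make of it, just as it would with any RSS feed.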

31. This makes sense in a way, too. After all, web services
management is ultimately related to the semantic web. You are
associating meaning and meta-information with web services. So it
makes sense to use RSS-like concepts here. RSS is used to publish
information about a website. Similarly a WS management framework
is used to publish information about a web service. There is a
similarity here.

32. What other forecasting and data mining could we do? We have
structure and function information. All structural information
could be rendered as logs also. Logs of the form: (a) web service
file added, (b) web service file deleted. However, this seems
awkward. Representing structural information without log files
seems to make the most sense.

33. Reports could state how many web services are published
across the enterprise, and how many per machine.

34. Reports could also give usage statistics: which are the top
10 web services or web methods? I suspect the usage distribution
will follow some form of the 80-20 rule, that is, it will be of a
Pareto variety.
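Both the top-10 list and the 80-20 suspicion can be checked directly from call counts. This sketch assumes that counts per service (or per web method) are already aggregated. Taking the "top 20%" as a rounded count of services is also an assumption.

```python
from collections import Counter

def top_services(calls: dict[str, int], n: int = 10) -> list[tuple[str, int]]:
    """The n most-called services (or web methods) with their call counts."""
    return Counter(calls).most_common(n)

def top_share(calls: dict[str, int], fraction: float = 0.2) -> float:
    """Share of all calls that go to the busiest `fraction` of services.

    A value near 0.8 for fraction=0.2 is the classic 80-20 pattern.
    """
    ranked = sorted(calls.values(), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)
```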

35. Once you aggregate the information and compare web services
with each other, many possibilities open up. For example, how slow
is this web service compared to other web services?

36. Besides dependence we can also ask: How many times does this
web service call the other web service?

37. Based on the scenario in 36, it makes sense to log both
incoming as well as outgoing calls. If one of our web services is
calling out to another web service too frequently then that is
something to be investigated. For example if service A is called
10 times but calls service B 200 times, that suggests that
service A might be calling service B 20 times per call. This
suggests an easy opportunity for design improvement.
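The ratio described in this paragraph can be computed from the two logs. This sketch assumes the incoming log yields the receiving service for each call, and the outgoing log yields a (caller, callee) pair for each call, both covering the same time window.

```python
from collections import Counter

def fan_out(incoming: list[str], outgoing: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """Average outgoing calls per incoming call, for each (caller, callee) pair.

    `incoming` lists the service that received each incoming call;
    `outgoing` lists (caller, callee) for each outgoing call.
    """
    received = Counter(incoming)
    made = Counter(outgoing)
    return {
        (caller, callee): n / received[caller]
        for (caller, callee), n in made.items()
        if received[caller]  # skip callers with no logged incoming calls
    }
```

With the paragraph's numbers (A called 10 times, A calling B 200 times), this gives a ratio of 20.0 for the pair (A, B).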

38. These 80-20 or Pareto reports could be extremely useful to an
IT manager in deciding which web services to speed up. They will
help the organization identify its bottlenecks. There is no point in
optimizing the services that are already fast enough. There is
also no point in throwing hardware at such web services. However,
there is a lot of value in speeding up the slowest web services.

39. There is a tension here between speed and popularity -- or
how many times a web service is called. Clearly a web service
that is not called a lot can afford to be somewhat slow. But one
that is called a lot should be fast. 

40. A simple way to combine these measurements is to look at
total execution time per month (or per some other period). This
can be computed by adding together the computation times of all
the individual calls. A shortcut is to multiply the average
computation time of the web service by the number of times it
was called. This is exact when the average covers the same calls,
and only approximate when the average comes from a sample or a
rounded performance counter.
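Both ways of computing total execution time fit in a few lines. In this sketch, the record layout (service, duration) is an assumption.

```python
from collections import defaultdict

def total_time(calls) -> dict[str, float]:
    """Sum of response times (ms) per service, largest total first.

    `calls` is an iterable of (service, duration_ms).
    """
    totals = defaultdict(float)
    for service, duration in calls:
        totals[service] += duration
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

def approx_total_time(avg_ms: float, call_count: int) -> float:
    """Shortcut from aggregate counters: average time multiplied by call count."""
    return avg_ms * call_count
```

This metric captures the tension between speed and popularity. A hypothetical 5 ms service called 10,000 times (`approx_total_time(5, 10000)`, 50,000 ms) consumes more total time than a 2,000 ms service called 10 times (20,000 ms).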

41. When I say computation time I mean response time or execution
time. I am not sure if I want to include the network latency in
this or not. It might be interesting to have both statistics --
both the total time with the network latency and without the
latency.

42. Note that we now have logging capabilities for both incoming
and outgoing messages. We log outgoing messages with SOAP
extensions. We could log total computation time including network
latency in the logging routine for outgoing messages.

43. The outgoing logs will have to be different from the incoming
logs. When a message is incoming you don't know who it came from.
It's just a SOAP message over a wire. It has no information about
its source. However, in outgoing logs we know who the sender is
and who the target or receiver is. So we will need to log these
separately.

44. If we can make some assumptions about the synchronization of
the clocks then we could synthesize these logs from different
machines together and draw a dependency graph, which could show
the breakdown of the execution time across the different web
services.
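Once clocks are synchronized and each outgoing call has been matched to the incoming call that caused it, the per-service breakdown becomes a walk over a call tree. In this sketch, the `Span` structure is hypothetical and the matching step is assumed to be already done. It also assumes that a service's outgoing calls do not overlap one another.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One call reconstructed from the logs, with times in ms on a shared clock."""
    service: str
    start: float
    end: float
    children: list["Span"]

def breakdown(span: Span) -> dict[str, float]:
    """Time spent inside each service, excluding time spent waiting on callees.

    Assumes synchronized clocks and non-overlapping outgoing calls.
    """
    own = (span.end - span.start) - sum(c.end - c.start for c in span.children)
    result = {span.service: own}
    for child in span.children:
        for service, t in breakdown(child).items():
            result[service] = result.get(service, 0.0) + t
    return result
```

For a 100 ms call to A that spends 30 ms waiting on B and 40 ms waiting on C, the breakdown attributes 30 ms to A itself, 30 ms to B, and 40 ms to C.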


Push-Based Notifications Considered Harmful

45. Here is the reason I am opposed to active (or push-based)
notification: it generates too much network traffic. The people
who have argued for it claim that it is necessary for issuing
urgent alerts in case an exceptional condition occurs.

46. I think there are two different value propositions colliding
here. Here they are: (a) immediate-term error handling and
fire-fighting, (b) longer-term performance improvement.

47. Push-based events make sense for fire-fighting. They are like
fire alarms in a building. You want them to blast as loudly as
possible and as quickly as possible. There is a fire. It has to
be put out immediately or all hell will break loose (if it hasn't
already).

48. On a real network this might correspond to a virus attack or
some other extremely unusual circumstance. If these kinds of
alarms go off on a regular basis, then the organization has
serious problems.

49. Most real and significant improvements to a network and to
distributed applications require a more circumspect approach. In
my mind this is the real value proposition that can be made to a
manager. We'll give you the reports that will allow you to
reflect on where the bottlenecks are in your system and how you
can solve them. We will show you how to do more with less. 

50. Most systemic improvements require reflection. You don't want
your fire alarms to go off just because the national crime rate
has hit a new high, or because your company's earnings were below
expectations. These are not unusual events. They identify
systemic problems that require subtle well thought-out solutions.
Calling in the fire engines will not solve the problem.

51. The claim: a business creates real value for itself (in terms
of cost savings, or better output with fewer resources) when it
makes systemic improvements, not when it puts out fires. If there
really is a network emergency, for example if your network has
been taken over by hackers, then you want to shut it
down. Hopefully this is not something that occurs every day. If
it occurs every day, it might make more sense to take a few steps
back and figure out what it is about your system that causes it
to flare up in this way so regularly.

52. It follows from 51 that the real money is in systemic
analysis and reporting tools. Web services fire fighting tools
are useful to have but can only generate a fraction of the value
that systemic tools can. While the cost of each fire might be
quite high, the systemic tool will pay off very quickly because
its value will increase exponentially. Each systemic improvement
will make it easier to see the next one.

53. As the inventory clears up around the bottlenecks, new
bottlenecks become apparent (see Eli Goldratt's The
Goal).

54. The problem with push-based events is that they generate web
traffic at odd times. If the performance of an application has
sunk to a new low, it is probably because the network is
experiencing heavy traffic. The worst thing to do in this case is
to send out more messages to the network announcing how bad
things are. This is almost like screaming "fire" in a crowded
theater.

55. Now push-based events might make sense for a fire-fighting
solution. Even though screaming fire in a crowded theater is
dangerous, this is precisely what fire alarms do. You hope they
warn you early enough that everyone has time to get out. (Now I
am curious: Do they have fire alarms in theaters?)

56. So in this exceptional circumstance they make sense. But they
should be used carefully. Special attention should be paid to
ensure that they don't exacerbate the problem they are trying to
warn everyone about.

57. Sending out as many push-based events as there are people
registered for them sounds like complete insanity. At most a
single event should be sent to the operator. At any one time only
one push event receiver should be identified. And this is all
assuming that we want to sell a fire-fighting solution instead of
a systemic problem-solver.
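The restraint argued for here can be sketched as an alert channel with exactly one receiver, which suppresses repeats inside a cooldown window so that alarms cannot pile onto an already congested network. The class name, the five-minute default, and the injectable clock are assumptions made for illustration.

```python
import time

class AlertChannel:
    """Push alerts to a single operator, at most once per cooldown period."""

    def __init__(self, send, cooldown_s: float = 300.0, clock=time.monotonic):
        self._send = send            # callable that delivers one message to the operator
        self._cooldown = cooldown_s
        self._clock = clock          # injectable so the behavior can be tested
        self._last_sent = None

    def alert(self, message: str) -> bool:
        """Send `message` unless an alert went out within the cooldown; return whether it was sent."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._cooldown:
            return False
        self._send(message)
        self._last_sent = now
        return True
```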


More Thoughts on Business Intelligence

58. In the absence of push events we are back to an RSS-style
feed. Pull events are very much like RSS feeds.

59. What other kinds of information could we generate from the
logs that are created by web activity?

60. We have performance data, we have call graphs, and all of
these things are aggregated across the company.

61. Going back to the theme of systemic improvement, we can use
this to keep nibbling at the low-hanging fruit. Improve the web
service that is called the most and that takes the longest. 

62. When my hard-drive is full the way I deal with it is to
delete the biggest files. Deleting five of the biggest files will
have much more impact on the problem than deleting 100 tiny
files. The same idea can be applied to web services. Optimizing
the top 10 worst performers will have a far greater
effect on the performance of the network than working on any
other web services. 

63. Another interesting piece of information the manager might
want to correlate with performance is the hardware
specification of the machines he is using. For example, the
machines with the least amount of memory will probably have the
worst performance. Over time the manager might be interested in
migrating the least used services to the slowest machines (or the
machines with the feeblest resources) and migrating the most
highly sought after services to faster machines. This is the
entire premise of adaptive management. 
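A deliberately naive version of this migration ranks services by usage and machines by a single capacity score, then deals the busiest services to the most capable machines. The capacity score (memory, say) and the even split of services across machines are assumptions, and the sketch ignores actual load balancing.

```python
def place_services(usage: dict[str, int], capacity: dict[str, float]) -> dict[str, str]:
    """Assign the most-used services to the most capable machines.

    Services are ranked by call count and machines by a capacity score;
    the ranked services are split into equal consecutive blocks, one per machine.
    """
    services = sorted(usage, key=usage.get, reverse=True)
    machines = sorted(capacity, key=capacity.get, reverse=True)
    return {
        service: machines[i * len(machines) // len(services)]
        for i, service in enumerate(services)
    }
```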

64. Presumably there are other tools out there which can
automatically move services around between machines, and can do
load balancing, all based on the statistics published by the
management framework.

65. But can we do this in real-time? Does avoiding push events
mean we are always going to be locked out of the real-time space?
This is not really true of course. The reason is that with pulled
events the manager decides when to pull the events. The manager
here might be a person or a really smart program -- it is the
entity that tries to optimize IT operations by allocating
resources. With pull events the manager, who understands the
global picture of the whole system, pulls the events. He can still
do this in real time. He could poll services regularly every few
minutes to get a real-time experience of WS usage. However, he is
in a much better position to decide when the network can take the
extra traffic that the events will need than the isolated node,
which has a local view of itself, but does not understand what is
going on around it.
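The manager's side of the pull model can be as simple as a polling round that remembers a cursor and skips the round when the network is busy. Because the node keeps logging, a skipped round loses nothing. In this sketch, `fetch` stands in for a hypothetical GET against a node's event log, and the sequence-number cursor is an assumption.

```python
from typing import Callable

def poll_once(fetch: Callable[[int], list[tuple[int, str]]], cursor: int, network_busy: bool):
    """One polling round by the manager; returns (new_events, new_cursor).

    `fetch(cursor)` returns (sequence_number, event) pairs newer than `cursor`.
    """
    if network_busy:
        return [], cursor  # defer; the next poll picks up where this one would have
    events = fetch(cursor)
    new_cursor = max((seq for seq, _ in events), default=cursor)
    return [event for _, event in events], new_cursor
```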

66. The business intelligence server should also support
arbitrary queries against the data sets. It might make sense to
store the data sets as queryable database files. The queries
could be executed using web services. The table names could have
the form: host/table. This way it will be possible to refer to
the same table on different machines. For example, the events
table on machine A could be joined and queried with the event
table on machine B.
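SQLite offers one way to get the "host/table" naming: attach each machine's database under an alias named after the host, and the table becomes `host.table`. In this sketch, in-memory databases stand in for files copied from each machine. The `request_id` correlation column is hypothetical, since the text does not say how rows on different machines are matched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for host in ("machineA", "machineB"):
    # Aliases cannot be bound parameters, so only trusted host names belong here.
    conn.execute(f"ATTACH DATABASE ':memory:' AS {host}")
    conn.execute(f"CREATE TABLE {host}.events (request_id TEXT, service TEXT, duration_ms REAL)")

conn.executemany("INSERT INTO machineA.events VALUES (?, ?, ?)",
                 [("r1", "OrderService", 180.0), ("r2", "OrderService", 90.0)])
conn.executemany("INSERT INTO machineB.events VALUES (?, ?, ?)",
                 [("r1", "InventoryService", 120.0)])

# Join the events table on machine A with the events table on machine B.
rows = conn.execute("""
    SELECT a.request_id, a.duration_ms, b.duration_ms
    FROM machineA.events AS a
    JOIN machineB.events AS b ON a.request_id = b.request_id
""").fetchall()
```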

67. The BI server could also be integrated into other BI
solutions, such as the one from MicroStrategy.

68. Could we fit neural networks into this somehow? Is there an
application here for neural networks? 

69. Is there an application here for optimization algorithms?

70. The framework could create different scenarios for the
manager and help him predict the effect of various allocation
decisions on performance. The framework for example could run
simulations.

71. The framework could also create a model of the network,
simulate and test out different organizations and recommend its
own optimal solutions.

72. These programs would be separate solutions, but they would
integrate with the data feeds being provided by the management
framework.

73. Sometimes intensive planning and forethought misses important
points about reality. So the framework could support an adaptive
approach. Instead of moving services to the best possible
machine, perturb the system slightly to see what effect that has.
The system would use gradual hill-climbing. It would
incrementally improve itself and try to get closer and closer to
optimal.
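The gradual hill-climbing described here can be sketched as a loop of small random moves that keeps each move only if it does not make things worse. In the text, the cost would be measured on the live system after each perturbation. Here a hypothetical load model (`LOAD`, `max_load`) stands in for that measurement.

```python
import random

# Hypothetical load each service puts on whatever machine hosts it.
LOAD = {"Orders": 4.0, "Pricing": 4.0, "Reports": 2.0, "Archive": 2.0}

def max_load(assignment: dict[str, str]) -> float:
    """Load on the busiest machine; lower is better."""
    per_machine: dict[str, float] = {}
    for service, machine in assignment.items():
        per_machine[machine] = per_machine.get(machine, 0.0) + LOAD[service]
    return max(per_machine.values())

def hill_climb(assignment, machines, cost, steps=1000, seed=0):
    """Improve a service-to-machine assignment one small random move at a time.

    Each step moves one random service to a random machine and keeps the
    move only if `cost` does not get worse.
    """
    rng = random.Random(seed)
    current = dict(assignment)
    best = cost(current)
    services = list(current)
    for _ in range(steps):
        candidate = dict(current)
        candidate[rng.choice(services)] = rng.choice(machines)
        c = cost(candidate)
        if c <= best:
            current, best = candidate, c
    return current, best
```

Accepting equal-cost moves lets the search drift across plateaus instead of stalling. Like any hill climber, it can still settle on a local optimum rather than the global one.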

74. The system could include scheduled reports. A scheduler
creates reports at specific times. These can be considered
snapshots of the system. The system should have facilities to
take these kinds of snapshots.

75. The system could also support historical charts, that is,
reports that give a historical perspective on the system. The
managers can use these to see how things have been getting better
(or worse) over time. Things might be getting worse because the
number of customers using the system keeps increasing. Sales tries
to get as many customers in the door as possible. Meanwhile IS's
resources get crunched. These historical charts will help IS
managers ask for more money and receive it. The IS manager can
argue that in the last 3 months, the usage of the system
increased by 30%. This correlates with a 30% increase in users.
To support 30% more users the organization needs to buy 30% more
computing power. The business managers will see the immediate
value of such an investment when they see the performance
degradation and how it ties in with the growing customer base. 

76. The organization could compare itself to its own past and see
how it is doing.

77. It might also be interesting to plot equipment quality and
quantity against performance over time. This might be useful in
justifying future sales.

78. The management framework will put all the tools at the IS
manager's disposal that will allow him to get the organization
focused on the importance of IT. Instead of begging for IT
resources, he will be able to show business managers the intimate
connection between profits and IT resources, and will have
managers throwing money at him to bring the IT systems up to par
with the organization's needs.

79. Now tell me this: If you gave an IS manager a choice between
a tool which would help him fight fires, and a tool which could
help him double the size of his department, which one would he
choose? The value proposition of the BI tool should be obvious.

80. The beauty of this model is that it can be integrated easily
with a web services management framework. All performance metrics
are stored as events which the aggregator can pull (or poll) when
he wants to.

81. However, this is somewhat different from any vision of web
services management that I have seen elsewhere.

82. All of this could be implemented through a SOAP extension. It
has access to both incoming and outgoing messages. The deployment
will be super simple. Just drop the DLL on the machines and make
a tiny tweak to the web.config files of the services to be
monitored.


Thoughts on Security

83. Later on we could restrict who sees what reports. It is
possible that the organization does not want everyone to know
which system is used the most. This might be a proprietary trade
secret. It might give competitors some idea of where the
organization makes its money.

84. For this we will need a security architecture. There is an
initial admin. He can create other accounts and give them
privileges. They in turn can create more accounts. The admin can
delete accounts. Other accounts can delete their own descendants,
but not others. All accounts except the admin's can have a finite
lifetime. The admin is like a supreme deity: he creates but was
not created; he gives life and death.
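The account hierarchy described here is a tree in which every account remembers its creator. Deletion rights follow descent, and lifetimes are expiry times. This sketch's class and method names are hypothetical, and privileges are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(eq=False)
class Account:
    name: str
    creator: Optional["Account"] = None   # None only for the initial admin
    expires: Optional[datetime] = None    # None means no expiry (the admin)
    is_admin: bool = False

    def create(self, name: str, lifetime: timedelta, now: datetime) -> "Account":
        """Create a child account that expires after `lifetime`."""
        return Account(name, creator=self, expires=now + lifetime)

    def is_active(self, now: datetime) -> bool:
        return self.expires is None or now < self.expires

    def is_descendant_of(self, other: "Account") -> bool:
        """True if `other` created this account, directly or transitively."""
        node = self.creator
        while node is not None:
            if node is other:
                return True
            node = node.creator
        return False

def can_delete(actor: Account, target: Account) -> bool:
    """The admin may delete any account; others only their own descendants."""
    return actor.is_admin or target.is_descendant_of(actor)
```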

85. To make this slightly more complex, creating a new account
might require approval from N existing accounts. All of these
will be non-admin accounts, since the admin can singly create any
account he wants. 

86. In a polytheistic universe there might be M admins and at
least m of them have to approve for a new account to be created.
This way a single megalomaniacal admin does not get absolute
power.

87. When the admin is laid off, a new admin must take over. So
admins can create other admins. Or a majority of admins can vote
another admin in. Requiring the majority will prevent factions
from developing, where each party tries to get its own friends
in.
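Both approval rules, at least m of M admins for a new account and a majority of admins for a new admin, come down to counting distinct approvals from current admins. This sketch identifies admins by name, which is an assumption.

```python
def approved(approvers: set[str], admins: set[str], required: int) -> bool:
    """True if at least `required` distinct current admins approved."""
    return len(approvers & admins) >= required  # votes from non-admins do not count

def majority(admins: set[str]) -> int:
    """Smallest number of votes that is a strict majority of the admins."""
    return len(admins) // 2 + 1
```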

88. People can be added as accounts based on their NT logins.
This way they will not need to log in multiple times. The system
can use people's NT accounts as keys. 

89. Besides people, programs will also need accounts. Programs can
be a separate account category. They are voted in by admins (by
some quorum of admins). Each program account is issued to a
single or to a group of programmers. The program's account is
associated with its programmer's account. However, when the
programmer is laid off or leaves, the account continues to live
on. New programmers can be attached to a program's account over
time as they take on the maintenance tasks of the program. For a
program to continue running it could require at least one current
employee to take responsibility for it. The idea is that programs
run as proxies for their programmers. Someone has to take
responsibility and ownership for each program.
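The rule that a program keeps running only while someone current is responsible for it reduces to a set intersection. In this sketch, the class name and the representation of owners and employees as sets of NT account names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ProgramAccount:
    """A program's account, owned by the programmers who maintain it."""
    name: str
    owners: set[str] = field(default_factory=set)  # NT account names

    def may_run(self, current_employees: set[str]) -> bool:
        """A program keeps running only while at least one owner is still employed."""
        return bool(self.owners & current_employees)
```

When a programmer leaves, the account itself survives. It simply stops running until a current employee is added to `owners`.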

90. I am not sure if we should have roles or not. Roles are
generally confusing. It's hard to remember which role grants
which privileges. I would instead go with a privilege-based
system, using ACLs. Each program has a list of people who are
allowed to run it. Only those people are allowed to run the
program. This way different entry points into the system can be
created for different people. 

91. The same kind of idea could be extended to web methods. For
each web method only certain people are allowed to run it. Again,
identification is established through NT single sign-on.
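A privilege-based ACL of this kind can be a mapping from a program or web method to the set of users allowed to run it, with access denied by default. This sketch's resource keys and its `DOMAIN\user` account names are hypothetical, and the caller's identity is assumed to come from NT single sign-on.

```python
# Hypothetical ACLs keyed by program or by "Service.Method"; values are NT account names.
ACL: dict[str, set[str]] = {
    "NightlyReport": {"CORP\\raj", "CORP\\li"},
    "OrderService.CancelOrder": {"CORP\\li"},
}

def allowed(user: str, resource: str) -> bool:
    """Deny by default: only users listed on the resource's ACL may run it."""
    return user in ACL.get(resource, set())
```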

92. The system should support self-created accounts. These are
user accounts that were created without admin approval. In
general, arbitrary NT account holders should have some privileges
inside the system: they should be able to see some documents and
reports and to run some programs, without an admin having to
create their accounts. This self-service model will help reduce
the workload of the admin.


Business Intelligence Toolkit Continued

93. An extremely important component of the toolkit will be the
API documentation. The services provided by the toolkit should be
easy to interoperate with and to call through DLLs or through the
.NET Framework.

94. At each tier of the architecture the system is completely
open and accessible through standards-based hooks such as web
services.

95. Similar management ideas could be used to "improve" any
traditional software. For example, bug tracking software should
report how many times the bug has been reported, by how many
people, when the reports started coming in, what the overall
efficiency of the organization over time, what are the historical
trends.

96. In general there are two kinds of information that are
accessible: snapshots and historical trends.

97. I just want to observe in passing that this process of
fleshing out thoughts in detail, exploring alternatives,
exploring value propositions, all of this is extremely enjoyable
(for me; and hopefully not too bad for you either). This seems
like something that can remain sustainable for a long time.

98. The perception of what is important to an IS manager is based
on Gartner's analysis
of MicroStrategy.

99. A question some readers will pose, and validly, is: But why
web services management? Surely the argument you have made
applies to managing all kinds of applications. Why would a
manager buy a web services management solution instead of buying
a solution that he can use to manage all his applications?

100. The answer: web services management is different because
web services give you an easy-to-instrument control point into
the application, namely the web service interface itself. In the
past, instrumenting applications for manageability was
error-prone and required some investment of development time.
This kind of instrumentation is
particularly difficult after the fact with proprietary
closed-source applications. This is not to suggest that it is
easy with non-proprietary or open-source applications. In fact it
is difficult with both. However, it is nearly impossible with
closed applications.

101. The universal interface of web services allows all web
services applications to be managed and monitored. WS management
frameworks can thus shed much more light on these applications
than was possible with traditional applications. In fact, WSM
could itself be a value proposition of web services. With web
services it becomes easy to measure your applications, which
allows you, the IS manager, to approach your business managers
with reams and reams of data describing the impact of business
decisions on IS throughput. As the business managers make the
connection they will throw money and resources at you as never
before.

Tuesday, October 14, 2003
