Big Data and NoSQL
Going with the times, Big Data and NoSQL have been on my radar for a while now. Aside from performance and size aspects, what is fascinating to me is the proliferation, rather than convergence, of data models that underlies the new breed of databases. It appears that databases and data models are going to follow the programming language route: constant renewal through the creation of new languages, with the most significant ones staying in the game while the less significant ones disappear.
What are the core research aspects? There are plenty. An interesting initial one would be to categorize the field of Big Data and NoSQL databases into several parallel categories (e.g., system performance or model/language expressiveness). Another interesting piece of work would be to compare existing systems in practice, including system performance and language expressiveness. How do the new systems compare to established ones? More core research topics will surely appear over time.
What are the eco-system research aspects? Aside from systems and models, the "usual" set of topics will probably appear on the horizon: backup and recovery, transactions, triggers, stored procedures, streaming queries, queues, rules, and so on, all in the context of the new systems.
And, finally, I predict that research areas using databases as support technology will take notice: event processing systems on NoSQL databases, high-performance transaction systems on NoSQL databases, workflow management systems on NoSQL databases, just to name a few.
One thing is clear: Big Data and NoSQL represent a major shift in the area of databases and systems that depend on databases. This shift is as significant as the appearance of SQL and relational systems because it fundamentally changes how data are modeled, managed and used by application systems.
Business Rule Computing
Business rules are an area that is gaining more and more visibility in industry. The main idea is that business rules like the rebate amount or discount percentage can be changed dynamically at runtime without changing the software code and without recompiling and redeploying the code. Instead, a model approach is followed where the underlying data and rules can change declaratively. This increases the flexibility from a business perspective while reducing the load on the IT department.
Changing data independently of the software code is a known methodology using, for example, database systems. In such a case the rebate amount is stored inside a database. Whenever the rebate amount in the above example needs to be changed, the rebate amount or percentage in the database is changed and this change is picked up by the software code at runtime. Changing rules dynamically is also possible today without specialized business rules languages. For example, simple rules can be encoded as and/or logic inside databases. More complex rules can be implemented using stored procedures inside databases that can be updated dynamically, independently of the software code of the business application.
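The database-based approach described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical table `business_params` holding a hypothetical parameter `rebate_percent`; none of these names come from a specific product. The business function reads the parameter on every call, so a change in the database takes effect without touching the code.

```python
import sqlite3

def open_store():
    """Create an in-memory database holding one business parameter (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE business_params (name TEXT PRIMARY KEY, value REAL)")
    conn.execute("INSERT INTO business_params VALUES ('rebate_percent', 5.0)")
    conn.commit()
    return conn

def price_after_rebate(conn, list_price):
    """Look up the current rebate at call time instead of hard-coding it."""
    row = conn.execute(
        "SELECT value FROM business_params WHERE name = 'rebate_percent'"
    ).fetchone()
    return list_price * (1 - row[0] / 100.0)

conn = open_store()
before = price_after_rebate(conn, 200.0)

# The business side changes the rebate; the software code stays untouched.
conn.execute("UPDATE business_params SET value = 10.0 WHERE name = 'rebate_percent'")
conn.commit()
after = price_after_rebate(conn, 200.0)
```

Here `before` reflects the 5 percent rebate and `after` the 10 percent rebate, although `price_after_rebate` itself was never modified or redeployed.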
Rules languages are declarative in nature and can be used to change data, too. In addition, they allow inferencing, which provides more expressiveness. However, they are not necessarily superior to the aforementioned approaches. If inferencing is not needed and a declarative language is not seen as beneficial, then other languages are appropriate, too.
While in principle it is a good idea to use a business rule language (or another language for the same purpose), from an implementation viewpoint the main error made is that a rules language is used like a programming language. Very often one or several rules are put in place exactly where, without a rules language, a procedure or method would be implemented in a procedural programming language. The downside of this approach is that while the rules and their underlying data can be changed, the invocation location of the rules is fixed, as the rules are executed like a procedure or method. Adding a rule at a different place in the code in order to implement appropriate business semantics requires a full code implementation cycle.
A far better approach is one where the point at which the rules are invoked is not hard-coded but declaratively determined, so that the programmer or software engineer does not have to implement the rule invocation itself in the regular business logic code. This allows rules to be modeled in such a way that their invocation location can be changed dynamically, too, not just their underlying data. Rules can then be added or removed throughout the business logic, not just at the locations that are pre-planned and hard-coded into the software.
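One possible way to picture declaratively determined invocation is a binding table that maps business steps to rules. The sketch below is only an illustration under assumptions: the `bindings` table, the `business_step` decorator and the sample rule `volume_discount` are hypothetical names, not part of any rules product. The business functions contain no rule calls; which rule runs where is data that can change at runtime.

```python
from collections import defaultdict
from functools import wraps

# Declarative binding table: business step name -> rules to run after that step.
bindings = defaultdict(list)

def business_step(func):
    """Run whatever rules are currently bound to this step; the step itself names no rule."""
    @wraps(func)
    def wrapper(order):
        order = func(order)
        for rule in list(bindings[func.__name__]):
            order = rule(order)
        return order
    return wrapper

@business_step
def price_order(order):
    return {**order, "total": order["quantity"] * order["unit_price"]}

@business_step
def ship_order(order):
    return {**order, "shipped": True}

def volume_discount(order):
    """Sample rule: halve the total for orders of ten or more items."""
    if order["quantity"] >= 10:
        return {**order, "total": order["total"] * 0.5}
    return order

order = {"quantity": 10, "unit_price": 4.0}

# Bind the rule to the pricing step.
bindings["price_order"].append(volume_discount)
priced_with_rule = price_order(order)

# Move the rule to the shipping step without editing any business function.
bindings["price_order"].remove(volume_discount)
bindings["ship_order"].append(volume_discount)
priced_without_rule = price_order(order)
shipped_with_rule = ship_order(priced_without_rule)
```

The design choice is that the invocation location lives in `bindings`, which could just as well be loaded from a database, rather than in the code of `price_order` or `ship_order`.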
Multi-Tenant Computing and SaaS
Multi-tenant aware software systems recognize that the data of different legal entities (tenants) are contained within their persistent store and handled by their business logic implementation; in principle, any number of tenants can be supported (hosted) in the same instance of a software system. Any existing limit is only non-functional. Multi-tenant aware software systems have to ensure that the data of their tenants are kept strictly separated and do not mix at all, as if each tenant were hosted in a separate instance of the software system running on a separate physical or virtual machine. This very much follows the principle of strictly disjoint address spaces in operating systems. With a separate software system instance for each tenant, the strict separation would be achieved by having a persistent store for each tenant, and the separate installations would ensure that no danger of mixing data exists.
SaaS is the acronym for Software-as-a-Service. Together with multi-tenant computing this represents a significant evolution of the application service provider (ASP) model. The ASP model originally focused on making software systems available remotely by a service provider for its clients based on various fee models. The focus was not really on ensuring that the same instance of a software system can support several tenants.
The Wikipedia entry for SaaS is here: http://en.wikipedia.org/wiki/Software_as_a_Service. It provides an overview as well as some references into related work.
Application service providers (ASPs) are companies that run software systems on behalf of their customers. The benefit from a customer's perspective is that it does not have to install and run the software system itself in its own data center. Instead, the ASP does the installation, maintenance, upgrading, running and backing up. The customer merely accesses the application system user interfaces remotely over networks, preferably with browser technology over the Internet. An additional benefit is that in general the customer pays the ASP on a monthly basis and on a per user account basis. From a customer viewpoint this means that the costs are minimal in the sense that it pays only for what it really needs. However, from an ASP viewpoint the world looks a bit different, as the ASP has to put up as many machines as it has clients, and often more when the number of accounts exceeds the load a single machine can handle. While the customer can fine-tune the number of accounts, the ASP has at least one machine and one software system instance as the minimal entry cost per client. This causes problems, especially for ASPs that are entering the market or for startup companies that work on innovative business functionality that initially only draws a few user accounts. It would definitely be preferable to make the number of clients, client accounts and machines independent of each other through tenant awareness.
The key difference is that ASPs have in general one software installation per client as the software does not allow for multi-tenancy as compared to SaaS where the software can host several tenants concurrently.
In an ASP or in a SaaS environment tenants are not directly aware of each other. Both environments achieve a strict separation, but in different ways. In an ASP setting the separation is achieved physically, with one hardware and software system per client. In SaaS the separation is achieved by a separate addressing system that enforces a strict separation of data and business logic between the tenants.
However, in an interconnected business world clients do not operate in isolation. Business-to-business (B2B) relationships exist and a single client might interact with many other clients, sometimes going into the thousands or tens of thousands. In a SaaS world it is likely that two interacting clients are hosted by the same SaaS provider. In this situation the clients are aware of each other due to their business relationship, and consequently they need to be able to connect with each other with B2B integration technology.
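The idea of separation through addressing within a single instance can be sketched as follows. This is a minimal illustration under assumptions, with hypothetical names (`TenantStore`, `TenantView`): all tenants share one store, but every access goes through a view that prefixes each key with the tenant identifier, so one tenant can never address another tenant's data.

```python
class TenantStore:
    """One shared store (a single software instance) for all tenants."""

    def __init__(self):
        self._data = {}  # (tenant_id, key) -> value

    def for_tenant(self, tenant_id):
        """Hand out a view that can only address the given tenant's data."""
        return TenantView(self, tenant_id)


class TenantView:
    """Tenant-scoped access path; the tenant id is part of every address."""

    def __init__(self, store, tenant_id):
        self._store = store
        self._tenant_id = tenant_id

    def put(self, key, value):
        self._store._data[(self._tenant_id, key)] = value

    def get(self, key):
        return self._store._data.get((self._tenant_id, key))

    def keys(self):
        return sorted(k for (t, k) in self._store._data if t == self._tenant_id)


store = TenantStore()
acme = store.for_tenant("acme")
globex = store.for_tenant("globex")

# Both tenants use the same key, yet their values never mix.
acme.put("rebate_percent", 5)
globex.put("rebate_percent", 12)
```

A production system would enforce the same rule in the database and the business logic layers; the sketch only shows the addressing principle.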
Traditionally B2B integration technology assumes that the communicating partners are separated by a wide-area network or the Internet. However, in a SaaS setting the two clients are actually hosted in the same application instance. Therefore, B2B interactions become a lot easier as the communication is basically reduced to a database update. However, in case the interacting clients are hosted by different SaaS providers the "traditional" B2B integration technology applies.
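To make the "database update" remark concrete, here is a minimal sketch assuming both clients are tenants of the same instance and share one hypothetical `purchase_orders` table in an in-memory SQLite database. The table layout, tenant names and function names are assumptions for illustration only. Sending a purchase order from buyer to seller is a single insert; no wide-area messaging is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_orders ("
    "id INTEGER PRIMARY KEY, buyer TEXT, seller TEXT, item TEXT, status TEXT)"
)

def send_order(conn, buyer, seller, item):
    """The B2B 'message' is one row written in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO purchase_orders (buyer, seller, item, status) "
            "VALUES (?, ?, ?, 'sent')",
            (buyer, seller, item),
        )
    return cur.lastrowid

def inbox(conn, seller):
    """The seller 'receives' orders by querying rows addressed to it."""
    return conn.execute(
        "SELECT id, buyer, item FROM purchase_orders "
        "WHERE seller = ? AND status = 'sent' ORDER BY id",
        (seller,),
    ).fetchall()

order_id = send_order(conn, "acme", "globex", "widgets")
globex_inbox = inbox(conn, "globex")
acme_inbox = inbox(conn, "acme")
```

The seller sees the order immediately in `globex_inbox`, while `acme_inbox` stays empty because queries remain scoped per tenant.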
There are many examples of SaaS systems in different domains. Simpler ones are financial institutions like etrade http://www.etrade.com or Ameriprise http://www.ameriprise.com where the same software system caters to different customers at the same time. Social networks like LinkedIn http://www.linkedin.com are hosted environments where the clients are intentionally aware of each other. Several SaaS vendors provide business functionality, with salesforce.com http://www.salesforce.com being the most prominent one. A recent addition is NetSuite http://www.netsuite.com.
Standardization is not yet an important area in the SaaS world. One interesting standard is http://www.sas70.com/. However, as SaaS applications have well-defined organizational and technical boundaries, the question remains at this point whether standards are going to be important or whether current standardization efforts, e.g. in Web Services, are sufficient.
Principles of Computing
Peter Denning and Craig Martell started working on defining the Great Principles of Computing (PoC). They put together a web site (see here http://cs.gmu.edu/cne/pjd/GP) that contains the current status of their ongoing work. They identified seven categories of principles of computing.
In the following I will discuss the category of coordination in more detail as much of my work is in this area and reading the definition and details of this category triggered some thoughts that I put down here.
Coordination is an important set of principles, as much time is spent every day on coordination, at work, at home, during the commute, or while on business or vacation trips, with more and more support through software systems. Some examples of those supporting systems are email, calendars or instant messengers. Cell phones, iPhones, PDAs and Blackberries also carry software that supports human and system coordination, like SMS, GPS, walkie-talkie and other functionality.
Coordination is communication with a specific purpose: communicating entities (agents) synchronize their activities in order to achieve a common goal. An example is the set of employees involved in a travel approval and expense reporting process where each agent acts in a specific role, like a traveller, an approver, etc., in order to make a business trip happen successfully. One can argue that in the absence of a purpose coordination does not take place.
The coordination category by Denning and Martell outlines the basic elements of coordination. It contains two types of coordination: direct coordination by speech acts between agents, and indirect coordination of agents to synchronize access to a shared resource. The first case supports direct communication between the agents in such a way that each agent knows what is expected of it during the communication in order to achieve the goal. The second case, however, does not support direct communication between the agents, and they do not even have to be aware of each other in this type of coordination.
One observation about this categorization is that the involved computer systems do not have a representation of the coordination itself. For example, in a direct coordination two employees can coordinate their actions over the phone or with an instant messenger. The coordination is not formally defined or executed in the form of ongoing instances. Instead, the medium for coordination is unaware of the fact that coordination is ongoing and might even be stateless, i.e., the communication is not recorded. For example, in the case of direct coordination an instant messenger could be used in a mode that does not allow the conversation to be recalled. In this case the speech acts are not defined in the instant messenger at all, and at runtime the individual messages sent are not related to each other in such a way that the instant messenger system could retrieve a sequence that relates the messages.
The same applies to the synchronization of a shared resource. The coordination in this case is indirect, as the agents do not communicate directly with each other but through a transaction that is not aware of the agents at all. The coordination is not formally defined, nor executed as such, and the agents are unaware of each other. The example used by Denning and Martell is access to a checking account through database transactions. In this case the involved agents do not have a common goal. Instead, the coordination is enforced by the data integrity constraints placed on a checking account.
However, there are additional types of coordination beyond the two mentioned above. The following matrix shows them. The matrix has two dimensions; one is the 'awareness' dimension. The point 'Software Aware' in this dimension means that the software system has a formal representation of the coordination itself and can recall it. 'Software Not Aware' means that the software system does not know that it is being used for coordination. It does not have a formal definition of the coordination and therefore cannot recall it. The other dimension is the 'control' dimension. 'Agent Controlled' means that the involved agents control the steps and the progress of the coordination. 'Software Controlled' means that the software system controls the progress and the steps of the coordination (the numbering is used to refer to the fields later on).
| | Software Not Aware of Coordination | Software Aware of Coordination |
| Agent Controlled Coordination | (1) Speech Acts | (2) Constraint Management |
| Software Controlled Coordination | (3) Resource Synchronization | (4) Workflow Management |
An example of constraint management (field (2)) is a software code management system like Perforce (http://www.perforce.com). It is aware of agents (software engineers), as it carefully keeps track of which engineer checks in or checks out software. It also carefully keeps track of multiple checkouts of the same code and of conflicts upon checkin. If a conflict happens, it says so, however, without telling the software engineers how to resolve the conflict. It only states that a constraint was violated. A software management system in this sense has rules and constraints that define a consistent or an inconsistent state of the whole software system and its parts. As engineers work on conflicts, the conflicts might actually be resolved, or constraint violations might appear elsewhere in the code. In this approach the software management system is aware of the ongoing coordination and can recall the history of what the software engineers did over time, but it does not control the coordination of the software engineers themselves.
In contrast, an example of (4) is a workflow management system where the business process is formally encoded and the software system actually drives the coordination by indicating to agents what they have to do and when. It can report on the status of different ongoing workflows and in general keeps a full history of all coordination that took place.
An example of (3) is a relational database system that coordinates transactions such that write access to a data item is done in such a way that agents do not interfere with each other. In this case the actions of agents are coordinated by the software; however, the software is not aware of the coordination. No history is recorded, and the agents are unknown to the software. It can even be the case that two actions of the same agent are synchronized.
An example of (1) is the instant messenger, where the instant messenger is not aware that the exchanged messages are actually coordinating the typing agents. As the agents are in control, they drive the coordination. There was, however, a system (ActionTechnologies) that implemented speech acts literally and coordinated agents. This system really belongs in (4), as the speech acts were 'just' a specific process representation and the system exhibited all characteristics of a workflow management system.
Process Computing
Process computing comprises a series of currently separate areas that in the future will come together and become one technical area.
The reason for my prediction is that the underlying principles and concepts of these different areas are precisely the same, as are their specific requirements. For historical reasons these areas were developed separately. From a technology and underlying concept perspective, however, they are the same, and one technology and one conceptual model is sufficient to provide a complete solution for these areas.
Semantic Computing
For me Semantic Computing is a major shift in computer science where all aspects, from language theory to operating systems, make use of semantic technology in such a way that semantic understanding, interpretation and interoperability are easily (!) achieved. Ideally, Semantic Computing starts at the microprocessor level by introducing more adequate data types and ends at a user interface level that can intelligently interact with users to obtain semantically correct data.
Semantic Computing, in my mind, requires the examination of all areas of computer science as a whole. This is in contrast to ongoing research efforts that try to apply semantic technology to a single domain or technology in isolation. Just as one discussion point, it is not possible today to retrieve an RDF triple and pass it on through all software layers to the user interface without it being re-represented in various languages and type systems throughout the software and technology component stack (see the discussion in this column: Is Semantic Web Technology Taking the Wrong Turn?).
Major conferences in the space of semantics are the following:
Thoughts On ...
The 2007 panel at the ICEIS conference (International Conference on Enterprise Information Systems) asked the question "Are you still working on Inter-Enterprise System and Application Integration?" and requested the panellists to provide input in the form of a few pages ahead of the discussion. With SaaS and Web 2.0 as some of the current industrial developments, I put together some thoughts on how these new developments challenge and eventually change the area of integration (inter-enterprise integration and application integration): The World Moved On (2007_06_12_iceis_panel.html).
As of today, 7/10/2007, it feels like the good old search engine days, but this time on the topic of social networks. After the initial phase of emergence, everybody now seems to put out a social network and works hard to get people signed up for it. The better-known social networks are http://www.linkedin.com, http://www.friendster.com and http://www.foaf-project.org, but there is also http://www.zoominfo.com, http://www.xeequa.com, http://www.ecademy.com, or http://www.spock.com. Aside from the basic functionality of linking the people represented in these networks, they seem to try to attract different audiences in terms of professional domains. In LinkedIn a lot of business people as well as academic folks are represented. Zoominfo concentrates mainly on business people and Xeequa is attracting sales channel professionals. The latest network addressing academia specifically is Academia.edu (http://www.academia.edu). That's all well and good. However, if you as a single person are part of various professional communities, then you have to keep profiles in many social networks (if you want to be represented in those at all). And that takes a huge effort, both in time and in understanding the detailed functionality of each social network.
This makes you hope that somebody soon will come up with Social Meta Networks so you only have to maintain one single profile that is replicated in all the social networks you choose to participate in.
It gets worse with those social networks that automatically harvest your public information from the Web and try to automatically put together a profile for you. In many cases this automatically collected information is accurate. Many times, however, it is not, and it misrepresents you completely. That would be all fine, except these networks put out your profile on the Web and search engines pick it up, and they also pick up the inaccurate information about you (as they are as non-semantic as the social networks). So, if you are at all interested in having the data and information about you be accurate, you are forced to join those networks to at least fix the errors and inaccuracies.
Of course, to make things manageable, you could maintain one complete profile in one social network and reduce the profile in all other social networks to a minimum by simply pointing to the complete one. That's a private Social Meta Network strategy and makes maintaining your information less cumbersome. However, the actual links between people cannot be replicated this way. So, one hopes that either the space shrinks significantly, or the social networks start cooperating, or they compete more fiercely so that the field sorts itself out.
In academic computer science publications, papers or articles alike, claims are made that cannot be independently verified by reviewers, referees or readers. The reason is that the software code would have to be made available, and would also have to be installable without problems. I was very happy to see that one of the major database conferences, ACM SIGMOD 2008, established the verification of claims, see here: http://www.sigmod08.org/sigmod_research.shtml.
While successful verification is not yet part of the criteria to accept or reject a submitted paper, I really hope that it will be and that other areas of computer science will at large move in this direction.
See my input here on the quality and my review approach: Philosophy.
At least until today, Social Networks do not forget anything. Is this a problem? An immediate reaction could be "yes" or "no". However, after reading the following article, forgetting is a much bigger issue than I ever thought: The Web Means the End of Forgetting. This article is a must-read for everybody who uses Social Networking systems.
While transactions are used widely in the context of database management systems, transactions in the context of programming languages are not yet used in mainstream programming today. An abort or a rollback of a database transaction does not have an effect in main memory or on the user interface the way it does within a transactional database management system.
However, from a programming perspective it would be desirable if the whole implementation stack, i.e., all layers (user interface - business logic - database) and their technologies can participate in transactions with the same effect as in database transactions. A programmer, on abort or rollback, would expect that with the data also the user interface state as well as the main memory state (business logic) is rolled back to a consistent state as it was before the initiation of the transaction. So instead of only having transactions in a database, transaction boundaries bracket all computation.
Such a holistic transactional behaviour, i.e., one that encompasses all parts of the computation, from the user interface to the database system, would truly make a difference in software engineering for dependable systems.
Sun's research work on transactional memory can be found here: http://research.sun.com/scalable/. Wikipedia has an overview of various approaches here: http://en.wikipedia.org/wiki/Software_transactional_memory.
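What such bracketing of all layers could look like is sketched below. This is a toy illustration under stated assumptions, not transactional memory: the helper `holistic_transaction` is a hypothetical name, the user interface and business logic states are modeled as plain dictionaries, and the database is an in-memory SQLite database. On an exception, the database is rolled back and the in-memory states are restored from snapshots taken at the start of the transaction.

```python
import copy
import sqlite3
from contextlib import contextmanager

@contextmanager
def holistic_transaction(conn, *state_dicts):
    """Bracket database and in-memory state changes in one transaction boundary."""
    snapshots = [copy.deepcopy(s) for s in state_dicts]
    try:
        yield
        conn.commit()
    except Exception:
        conn.rollback()  # undo the database changes
        for state, snapshot in zip(state_dicts, snapshots):
            state.clear()  # undo the in-memory changes
            state.update(snapshot)
        raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 100)")
conn.commit()

ui_state = {"balance_label": "100"}         # stands in for the user interface layer
logic_state = {"pending_withdrawals": []}   # stands in for business logic memory

try:
    with holistic_transaction(conn, ui_state, logic_state):
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'checking'")
        logic_state["pending_withdrawals"].append(150)
        ui_state["balance_label"] = "-50"
        raise ValueError("insufficient funds")  # forces an abort
except ValueError:
    pass

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'checking'"
).fetchone()[0]
```

After the abort, the database balance, the user interface label and the business logic state are all back to their values from before the transaction started. Snapshotting whole dictionaries is only workable for small states; it is used here to keep the sketch short.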
It is interesting to observe that several works in this space emphasize the concurrency control problem, not the transactional problem. In frameworks like J2EE in conjunction with application servers concurrency is not really an issue programmers have to deal with as this is taken care of by the application server. However, transactional behaviour is very important for programmers to deal with as this establishes the correctness of the business logic's results. Based on this train of thought it would be very undesirable to have transactional memory that is independent from the database transactions in programs that use both.
© Christoph Bussler, 1991 -