Christoph Bussler

NoSQL Analytics

What Is NoSQL?

NoSQL refers to a new breed of database management systems. There is a huge variety of approaches currently being developed. An extensive list can be found here: https://en.wikipedia.org/wiki/NoSQL.

There are many differences compared to relational database management systems. One of the key differences relevant for analytics is that NoSQL databases in general do not follow the relational model and algebra. Instead, they implement a variety of data models like key-value or documents.
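To make the difference between data models concrete, the following minimal sketch represents the same login data in a relational style, a key-value style and a document style. All names and values are illustrative assumptions and are not taken from any particular database system.

```python
import json

# Relational style: flat rows in two tables, linked by the user identifier.
user_rows = [(1234,)]
login_rows = [(1234, "1/1/2013"), (1234, "1/17/2013")]

# Key-value style: the database sees only a key and an opaque value.
key_value_store = {"user:1234": json.dumps(["1/1/2013", "1/17/2013"])}

# Document style: one nested, self-describing document per user.
user_document = {
    "user_id": 1234,
    "logins": [{"date": "1/1/2013"}, {"date": "1/17/2013"}],
}

# The same question ("how many logins?") needs a different access path
# for each data model.
relational_count = sum(1 for user_id, _date in login_rows if user_id == 1234)
key_value_count = len(json.loads(key_value_store["user:1234"]))
document_count = len(user_document["logins"])
```

All three counts are equal, but only the relational variant could be expressed directly in relational algebra; the other two depend on the structure of the stored value.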

Due to the differences in the implemented data models, the NoSQL movement has not agreed on or implemented a common query language. Instead, each system has its own approach to accessing the database for creating, reading, updating and deleting data.

NoSQL Analytics

NoSQL analytics implements analytics functionality on NoSQL databases. This includes computing aggregations, projection, selection and many other manipulations in order to retrieve relevant values for user-defined metrics.
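As a rough illustration of these operations, the sketch below applies selection, projection and aggregation to a few in-memory documents using plain Python. The documents and their property names ("country", "logins") are hypothetical and serve only to show what each operation does.

```python
# Hypothetical sample documents; the property names are illustrative only.
users = [
    {"user_id": 1, "country": "DE", "logins": 3},
    {"user_id": 2, "country": "US", "logins": 5},
    {"user_id": 3, "country": "DE", "logins": 1},
]

# Selection: keep only the documents that satisfy a predicate.
german_users = [doc for doc in users if doc["country"] == "DE"]

# Projection: keep only some of the properties of each document.
projected = [
    {"user_id": doc["user_id"], "logins": doc["logins"]}
    for doc in german_users
]

# Aggregation: reduce many values to a single metric value.
total_german_logins = sum(doc["logins"] for doc in projected)
```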

In the context of NoSQL databases, traditional analytics (or business intelligence) tools no longer work, since NoSQL databases implement data models that are incompatible with the relational data model.

The term "tools" refers to how a database is queried and how data is retrieved, interpreted and displayed (in text or graphical form). The basic assumption of traditional analytics tools, that data is uniformly structured in table form, no longer holds. Analytics in the context of NoSQL databases needs a new approach that operates properly on the data models these databases implement.

Approaches to NoSQL Analytics

There are several different approaches to NoSQL Analytics.

One approach that is currently emphasized is the use of a map/reduce framework, as implemented by systems like Hadoop (https://en.wikipedia.org/wiki/Apache_Hadoop). The reason for this approach is twofold. First, Hadoop with map/reduce supports the custom implementation of functions that can deal with any type of data model. Second, Hadoop promises to scale to whatever size is necessary to finish the computation at hand. There are many variations of this combination in the marketplace, with varying degrees of support for particular data models.
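The map/reduce idea can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only; it does not use Hadoop or its API, and the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict


def map_logins(document):
    """Map step: emit one (user_id, 1) pair per login entry."""
    for _login in document.get("logins", []):
        yield document["user_id"], 1


def reduce_counts(user_id, values):
    """Reduce step: sum all values emitted for one key."""
    return user_id, sum(values)


def map_reduce(documents, mapper, reducer):
    """Group the mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample documents.
documents = [
    {"user_id": 1234, "logins": [{"date": "1/1/2013"},
                                 {"date": "1/17/2013"},
                                 {"date": "1/25/2013"}]},
    {"user_id": 5678, "logins": [{"date": "2/3/2013"}]},
]

login_counts = map_reduce(documents, map_logins, reduce_counts)
```

Because the mapper and reducer are ordinary functions, they can inspect documents of any shape, which is what makes the model independent of a particular data model.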

An alternative approach is that of "query - response". In general, this way of implementing analytics means writing one or more queries and processing their results so that, in the end, the desired metric values are available. This approach uses the query interface of the particular NoSQL database to obtain the query results. However, it also requires a programming-style approach to analytics, as the queries have to be issued and their results collected and further processed. A change in requirements usually means a change in code: the queries, their result processing, or both.
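A minimal sketch of the "query - response" style is shown below. The `find` function is a hypothetical stand-in for a database query interface and operates on an in-memory list, so the example runs without a database; the sample documents are likewise assumed.

```python
def find(collection, predicate):
    """Hypothetical stand-in for a database query interface."""
    return [doc for doc in collection if predicate(doc)]


# Hypothetical sample documents.
users = [
    {"user_id": 1234, "logins": [{"date": "1/1/2013"},
                                 {"date": "1/17/2013"},
                                 {"date": "1/25/2013"}]},
    {"user_id": 5678, "logins": [{"date": "2/3/2013"}]},
    {"user_id": 9999, "logins": []},
]

# Step 1: issue a query.
active_users = find(users, lambda doc: len(doc["logins"]) > 0)

# Step 2: collect the results and post-process them in client code.
logins_per_user = {doc["user_id"]: len(doc["logins"]) for doc in active_users}
average_logins = sum(logins_per_user.values()) / len(logins_per_user)
```

If the metric definition changes, for example to count only logins in a given month, both the query and the post-processing code have to be edited.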

The approach taken in this project is distinctly different in several respects:

Main memory analytics engine
Declarative analytics language
Document-oriented data model support (JSON/BSON)
Schema-based analytics, not query or map/reduce-based analytics

This approach separates the computation of metrics from the retrieval of metric values. Clients no longer have to be concerned with the creation and execution of queries; instead, they only have to query properties of documents.

The following section outlines some of the details.

Schema-Based NoSQL Analytics

The key notion of NoSQL analytics in this project is that of schema-based analytics. This means that, in order to retrieve key metrics, the schema of the underlying documents is enhanced by adding properties to the documents. The values of these added properties are the metrics. A client queries document properties in order to obtain the metric values.

An example highlights the approach. A collection called "users" in a database called "userdb" contains documents of the form


{
    "user_id": 1234,
    "logins": [
        {"date": "1/1/2013"},
        {"date": "1/17/2013"},
        {"date": "1/25/2013"}
    ]
}

For each user, the identifier "user_id" is stored, as well as an array of dates, each indicating a day on which the user logged in at least once.

An interesting metric might be how often each user logged in. Using the declarative language, this is formulated as follows:


set(userdb.users.no_logins, count(logins))

This statement adds a property "no_logins" to every document in the collection "users" in the database "userdb". The value of the new property is the number of elements in the property "logins". After the computation, the example document above looks like this:


{
    "user_id": 1234,
    "logins": [
        {"date": "1/1/2013"},
        {"date": "1/17/2013"},
        {"date": "1/25/2013"}
    ],
    "no_logins": 3
}
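The effect of such a statement can be emulated in plain Python. The sketch below is not the project's language implementation; the helper names `count` and `set_property` are assumptions chosen to mirror the statement, and the collection is a simple in-memory list.

```python
def count(property_name):
    """Return a function that counts the elements of a list-valued property."""
    def compute(document):
        return len(document.get(property_name, []))
    return compute


def set_property(collection, new_property, compute):
    """Add new_property to every document, using compute to derive its value."""
    for document in collection:
        document[new_property] = compute(document)


# In-memory stand-in for the collection "users" in the database "userdb".
users = [
    {
        "user_id": 1234,
        "logins": [
            {"date": "1/1/2013"},
            {"date": "1/17/2013"},
            {"date": "1/25/2013"},
        ],
    }
]

# Emulates: set(userdb.users.no_logins, count(logins))
set_property(users, "no_logins", count("logins"))

# A client only reads the property; it does not run a query or a computation.
metric_value = users[0]["no_logins"]
```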

The declarative language is generic in the sense that it allows properties to be added at all levels of the data hierarchy formed by the various elements: database - collection - document - properties, embedded documents and sub-collections. This supports the formulation of complex computations and therefore the addition of properties that represent complex metrics.

Newly introduced properties can be used like any other property coming directly from the base database. They can also be used inside the computation of other properties. This creates dependencies between the definitions of new properties; at runtime, a dependency analysis takes place to ensure that required properties are present before they are used in computations.
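One common way to perform such a dependency analysis is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); it is an assumption that the engine works along these lines, and the property "is_active" is a hypothetical example of a property that depends on "no_logins".

```python
from graphlib import TopologicalSorter

# Hypothetical property definitions: each new property lists the properties
# it reads and a function that computes its value. "is_active" is registered
# first on purpose, although it depends on "no_logins".
definitions = {
    "is_active": (["no_logins"], lambda doc: doc["no_logins"] > 0),
    "no_logins": (["logins"], lambda doc: len(doc["logins"])),
}


def execution_order(definitions):
    """Order the new properties so that each one runs after its dependencies."""
    graph = {
        name: [dep for dep in deps if dep in definitions]
        for name, (deps, _compute) in definitions.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(definitions)

document = {"user_id": 1234, "logins": [{"date": "1/1/2013"}]}
for name in order:
    _deps, compute = definitions[name]
    document[name] = compute(document)
```

Properties that come directly from the base database, such as "logins", are left out of the graph because they are already present. For cyclic definitions, `static_order()` raises `graphlib.CycleError`, which is one way to reject property definitions that depend on each other.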

A client that needs to retrieve metrics now only has to retrieve property values, since these were computed by the analytics engine independently of client access.

Architecture

Architecturally, the analytics engine is a main memory engine. All data resides in main memory and all computation takes place in main memory. The declarative language statements are registered with the analytics engine and are run upon request after a dependency analysis has established their execution order. Once run, the additional properties are available for clients to retrieve.

The main architectural components are:

Database Connector. This component connects to the MongoDB database and is responsible for the retrieval of documents.
Main-Memory Database.
  Storage Manager. This component manages documents in main memory.
  Query Engine. This component implements the declarative query language (parser, semantic checker, execution).
  Dependency Analyzer. This component analyzes the dependencies between the property definitions.
Integration Layer. This component provides a REST-API implementation in order to make the analytics engine accessible over the network.
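How these components might fit together is sketched below in plain Python. Every class and method name here is hypothetical: the connector is a stub that returns fixed documents instead of reading from MongoDB, statements are registered as Python functions instead of being parsed from the declarative language, and the integration layer is reduced to a method that returns a JSON string instead of a REST endpoint.

```python
import json
from graphlib import TopologicalSorter


class StubConnector:
    """Stands in for the Database Connector; returns fixed documents."""

    def load(self):
        return [
            {"user_id": 1234,
             "logins": [{"date": "1/1/2013"}, {"date": "1/17/2013"}]},
        ]


class AnalyticsEngine:
    """Toy main-memory engine combining storage, execution and dependencies."""

    def __init__(self, connector):
        self.documents = connector.load()  # Storage Manager role
        self.statements = {}               # registered property definitions

    def register(self, name, depends_on, compute):
        """Register a property definition; nothing is computed yet."""
        self.statements[name] = (depends_on, compute)

    def run(self):
        """Dependency Analyzer and Query Engine roles: order, then execute."""
        graph = {
            name: [dep for dep in deps if dep in self.statements]
            for name, (deps, _compute) in self.statements.items()
        }
        for name in TopologicalSorter(graph).static_order():
            compute = self.statements[name][1]
            for document in self.documents:
                document[name] = compute(document)

    def get_property(self, user_id, name):
        """Integration Layer role: return a property value as a JSON string."""
        for document in self.documents:
            if document["user_id"] == user_id:
                return json.dumps({name: document[name]})
        return json.dumps({"error": "not found"})


engine = AnalyticsEngine(StubConnector())
engine.register("no_logins", ["logins"], lambda doc: len(doc["logins"]))
engine.run()
response = engine.get_property(1234, "no_logins")
```

The sketch mirrors the lifecycle described above: statements are registered first, run upon request in dependency order, and only then are the additional properties available for retrieval.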