Analytics as a process
Energiewerks Ensights is a toolkit of loosely coupled and highly cohesive tools. The design principle dictates that the analytics process is broken down into elemental atomic tasks. Each task translates to a standalone script that accomplishes one aspect of the process only. Each task has an input and an output.
Tasks are orchestrated by a scheduler, by an event trigger, or as a sequential group of logically related tasks.
Every task in an analytical solution falls into one or more of the categories detailed below:
- Acquire: Acquire data from public, proprietary or internal systems.
- Prepare: Convert the data into a format that can be standardized and analyzed.
- Store: Store the data with as many structural and semantic relationships intact as possible.
- Analyze: Perform statistical, mathematical and numerical analysis of the underlying data using models.
- Visualize: Present the raw data or model outputs with simple, concise visual tools that convey as much information about large datasets as possible, quickly, to the observer.
- Disseminate: Distribute the data via the internet, blogs, email, web services, content servers, social tools or mobile phones, depending on the use case.
- Operationalize: Run each category of tasks in an automated manner to generate the analytics artifacts with as little human intervention as possible.
The Energiewerks Ensights toolkit is designed with several utility classes, patterns and idioms, each of which focuses on exactly one category of tasks. The tools allow the analyst to break a use case down into smaller individual tasks and deliver software artifacts in a highly responsive and agile manner.
The artifacts provide most of the flexibility of a spreadsheet together with the robustness, scalability and resilience of an enterprise system. They offer a pathway for models to be developed on a desktop yet run on a server farm within a data center or in the cloud, without significant changes or deployment effort.
This is accomplished through a combination of technology stack choices, design principles, tools that each focus on a single aspect, and an emphasis on building simple, cohesive software artifacts.
The Energiewerks Ensights Analytical Process In Detail
Almost all analytical use cases can be broken down into tasks that fall under the Acquire, Prepare, Store, Analyze, Visualize, Disseminate or Operationalize categories.
As a design principle, if an atomic task seems to span two or more categories, it is prudent to break it into more granular sub-tasks before developing the software artifacts.
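The idea of atomic tasks, each with one input and one output, can be sketched as plain functions chained in sequence. This is a minimal illustration only: the function names, the hard-coded sample data and the `run_pipeline` helper are hypothetical and are not part of the Ensights toolkit.

```python
from typing import Callable, Iterable

# Hypothetical names throughout; none of these come from the Ensights toolkit.
Task = Callable[[object], object]


def acquire(_: object) -> str:
    """Acquire: return raw data (a hard-coded CSV string stands in for a real source)."""
    return "day,price\n1,10.0\n2,12.0\n3,11.0\n"


def prepare(raw: str) -> list[dict]:
    """Prepare: parse the raw text into typed records."""
    header, *rows = raw.strip().splitlines()
    keys = header.split(",")
    records = [dict(zip(keys, row.split(","))) for row in rows]
    return [{"day": int(r["day"]), "price": float(r["price"])} for r in records]


def analyze(records: list[dict]) -> float:
    """Analyze: compute the mean price."""
    return sum(r["price"] for r in records) / len(records)


def run_pipeline(tasks: Iterable[Task], initial: object = None) -> object:
    """Run tasks in order, feeding each task's output to the next task's input."""
    value = initial
    for task in tasks:
        value = task(value)
    return value


mean_price = run_pipeline([acquire, prepare, analyze])
```

Each function belongs to exactly one category, so a task that started to both parse and analyze would be a signal to split it further.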
A brief description of each category of tasks follows.
Data used in reporting, predictive analytics and machine learning can be obtained from public and government organizations, proprietary vendor databases and internal systems.
This data may be available in various formats. It may come as a Comma Separated Value (CSV) file, or it may be obtained by scraping a web page for pieces of information. Crawlers extract a hierarchy of related web pages that contain relevant information. Some proprietary data vendors host anonymous or secure FTP sites from which data can be extracted. Others offer programmatic access to data through a REST application programming interface.
Data obtained from such services may be of different types: plain text, XML, HTML or JSON. Each of these files may be free form or may conform to a schema. Other sources may be real-time streams of information.
For analytics purposes, all this data has to be captured in real time or periodically. The retrieved data has to be stored in its raw form.
Ensights provides a set of tools, patterns and idioms to retrieve data from all of these data sources. Scripts to retrieve data from often used sources are available in the toolkit.
These scripts are incorporated into analytic solutions to accomplish data acquisition tasks.
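As a rough sketch of an acquisition task that keeps the data in its raw form, the function below writes a fetched payload to disk unchanged and records a small metadata file next to it. The function name, file naming scheme and metadata fields are assumptions made for illustration; they do not describe the toolkit's own scripts.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def acquire_raw(fetch: Callable[[], bytes], source_name: str, raw_dir: Path) -> Path:
    """Fetch a payload and write it to disk unmodified, with a metadata sidecar file."""
    payload = fetch()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    raw_dir.mkdir(parents=True, exist_ok=True)
    raw_path = raw_dir / f"{source_name}_{stamp}.raw"
    raw_path.write_bytes(payload)  # raw form: no parsing or cleaning at this stage
    meta = {
        "source": source_name,
        "retrieved_at": stamp,
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    raw_path.with_suffix(".json").write_text(json.dumps(meta, indent=2))
    return raw_path
```

The `fetch` argument keeps the task independent of the transport: it could wrap an HTTP request, an FTP download or a file read, as long as it returns bytes.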
Data in raw form obtained from a source is not readily usable.
Data munging is the process of converting raw data into a form that can be analyzed. Its output is a set of prepared datasets: cleansed, normalized, annotated and structured data that conforms to an information model specific to the business domain.
The Ensights toolkit parses different formats and extracts the relevant information into an efficient, analytics-focused data structure. The prepare process typically fills gaps in the data with default values or with values derived from a mathematical, statistical or numerical model.
Reference data can be added to enhance the source data. Two or more datasets can be combined using a merge. Blended values, which may only be obtainable by combining two disparate datasets from completely different sources, can be calculated from the underlying raw data. The combination may rely on a matching criterion with one-to-many relationships or may use an analytical model to derive values.
The software artifacts will prepare this data and store it in a memory efficient structure geared to easy analysis.
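Two of the preparation steps described above, gap filling and enrichment with reference data, can be sketched in plain Python. Linear interpolation is only one possible numerical model for filling gaps, and the `fill_gaps` and `enrich` names are hypothetical rather than toolkit functions.

```python
from typing import Optional


def fill_gaps(values: list[Optional[float]]) -> list[float]:
    """Fill None entries by linear interpolation; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("cannot fill a series with no known values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def enrich(records: list[dict], reference: dict[str, dict], key: str) -> list[dict]:
    """Left-join each record with the reference attributes looked up by `key`."""
    return [{**rec, **reference.get(rec[key], {})} for rec in records]
```

For example, `enrich([{"site": "A", "mw": 5}], {"A": {"region": "North"}}, "site")` attaches the region to every record for site A, which is the one-to-many case: one reference row, many source rows.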
The prepared data is further converted into a format that can be stored efficiently in relational and NoSQL databases.
The database flavors currently supported are Relational (MySQL), Document (MongoDB), Wide-column (Cassandra) and Graph (Neo4j).
Analysts can access data in these data stores directly from the databases if desired. Alternatively, tools exist that merge data from and into these databases through user-friendly, SQL-like interfaces, REST APIs or micro applications developed using the dissemination toolkit.
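A store task for the relational case might look like the sketch below. It uses Python's built-in `sqlite3` module purely so the example is self-contained; the text names MySQL as the supported relational database, and the `prices` table, its columns and the `store_records` function are hypothetical.

```python
import sqlite3


def store_records(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Upsert prepared records into a table and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        " day INTEGER PRIMARY KEY,"
        " price REAL NOT NULL)"
    )
    # INSERT OR REPLACE is SQLite syntax; other databases spell an upsert differently.
    conn.executemany(
        "INSERT OR REPLACE INTO prices (day, price) VALUES (:day, :price)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

Keying the table on `day` makes the task safe to re-run: loading the same prepared dataset twice updates rows instead of duplicating them.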
The analyze toolkit comprises models that are use case specific. These models may be simple scripts that blend datasets or that enhance datasets with reference information.
More complex scripts for data mining, machine learning, deep learning, statistical analysis, technical analysis and fundamental analysis are also built using the analyze tools.
Scripts can be written in various languages and numerical environments such as Python, R, MATLAB and C++.
These scripts are integrated into the analytic process via a scheduler-driven, event-driven or workflow-driven stimulus from the Ensights toolkit.
Model outputs are stored in one or more databases or written out in Comma Separated Value (CSV), Excel (XLS) or text (TXT) format for further visualization or dissemination.
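As a small illustration of an analyze task whose output is handed on as CSV, the sketch below computes a simple moving average, a basic technical analysis indicator, and serializes rows to CSV text. Both function names are hypothetical and the choice of indicator is arbitrary.

```python
import csv
import io


def moving_average(values: list[float], window: int) -> list[float]:
    """Return the simple moving average over each full window of `values`."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def to_csv(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The analysis and the serialization are kept as separate functions so that the same model output could equally be written to a database by a store task.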
Visual representations of source data or model outputs are generated by the Visualize tools.
Because the scripts involved in the process are cohesive and uncoupled, innovative techniques can be incorporated into the visualization components quickly as they become available.
Reports, predictive analytics and research are disseminated using email, Portable Document Format (PDF) files, Excel (XLS) files, images, tables, narratives, blogs, Slack integration and micro apps.
Additional facilities are available to syndicate content into dashboards created within Qlik Sense, Pentaho and Tableau.
Other tools include online OLAP front ends, GraphML representations of data, and geospatial data formats such as GeoJSON, KML or shapefiles.
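To give a sense of what a geospatial output involves, the following sketch converts point records into a GeoJSON FeatureCollection. The input record layout (`lon`, `lat` plus arbitrary attributes) and the `to_geojson` name are assumptions; the one fixed rule is that GeoJSON orders coordinates as longitude, then latitude.

```python
import json


def to_geojson(points: list[dict]) -> str:
    """Convert records with 'lon' and 'lat' keys into a GeoJSON FeatureCollection string."""
    features = [
        {
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude], in that order.
            "geometry": {"type": "Point", "coordinates": [p["lon"], p["lat"]]},
            "properties": {k: v for k, v in p.items() if k not in ("lon", "lat")},
        }
        for p in points
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})
```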
Self-service micro apps for specific functional users can be developed using two web-based rapid application development frameworks included in the Ensights toolkit.
The operationalize toolkit consists of tools that implement three kinds of stimulus to trigger autonomous execution of the tools detailed above.
The first is scheduler based. Schedulers such as Quartz, Tidal and Windows Task Scheduler are supported.
The second is event driven, using message brokers: ActiveMQ, RabbitMQ and Kafka are supported. Any proprietary AMQP-compliant message broker can be integrated through configuration parameters with minimal effort.
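The event-driven pattern can be sketched with an in-process queue standing in for a message broker. No real broker client is used here: `queue.Queue` replaces the broker, and the event shape, the `file_arrived` event type and the handler are all hypothetical.

```python
import queue
import threading
from typing import Callable


def run_consumer(
    events: queue.Queue,
    handlers: dict[str, Callable[[str], str]],
    results: list[str],
) -> None:
    """Dispatch each event to the handler for its type until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        handler = handlers.get(event["type"])
        if handler is not None:
            results.append(handler(event["payload"]))


events: queue.Queue = queue.Queue()
results: list[str] = []
handlers = {"file_arrived": lambda payload: f"prepared {payload}"}

worker = threading.Thread(target=run_consumer, args=(events, handlers, results))
worker.start()
events.put({"type": "file_arrived", "payload": "prices.csv"})  # event triggers a task
events.put(None)  # sentinel: stop the consumer
worker.join()
```

With a real broker the consumer loop would be replaced by that broker's client subscription, while the dispatch from event type to task would stay the same.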
The third paradigm uses a REST API or web service as the trigger. It allows remote applications, externally developed applications or user-facing applications to start particular aspects of the analytics program, including its processes and workflows.
Overriding the autonomous process is one use case for the third paradigm: it gives the end user finer-grained control without the need for IT intervention.
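A REST-style trigger of this kind could be sketched with Python's standard `http.server` module. The `/trigger/<task>` path layout, the `TASKS` registry and the task name are invented for illustration and say nothing about how the toolkit exposes its own endpoints.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable

# Hypothetical registry mapping a task name to a zero-argument callable.
TASKS: dict[str, Callable[[], str]] = {
    "refresh-prices": lambda: "price refresh started",
}


def handle_trigger(path: str) -> tuple[int, dict]:
    """Map a request path such as /trigger/<task> to an HTTP status and a JSON body."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "trigger":
        return 404, {"error": "unknown endpoint"}
    task = TASKS.get(parts[1])
    if task is None:
        return 404, {"error": f"unknown task: {parts[1]}"}
    return 200, {"task": parts[1], "result": task()}


class TriggerHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        status, body = handle_trigger(self.path)
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server; call serve_forever() on the result to start listening."""
    return HTTPServer(("127.0.0.1", port), TriggerHandler)
```

Keeping the routing in `handle_trigger` separates the decision about which task to run from the HTTP plumbing, so the same function could sit behind a different web framework. A production endpoint would also need authentication, which this sketch omits.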