Data science

Top 10 Hot Data Science Technologies

In April, Gil Press posted a list of the top 10 hot big data technologies in Forbes magazine. The technologies featured were:

Predictive analytics

Predictive analytics is the use of statistical and machine learning methods on historical data to predict future outcomes. The goal is to use knowledge of what has happened in the past to predict what might happen in the future, in order to seize opportunities and reduce risks.
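As a minimal sketch of the idea, the snippet below fits a straight line to a few months of historical sales with ordinary least squares and extrapolates one month ahead. The data and the function names (fit_line, predict) are made up for illustration; real predictive analytics work would normally use richer models and a proper validation step.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: sales for months 1 through 5.
months = [1, 2, 3, 4, 5]
sales = [100.0, 110.0, 120.0, 130.0, 140.0]

model = fit_line(months, sales)
forecast = predict(model, 6)  # predicted sales for month 6
print(forecast)
```

With this perfectly linear toy data the forecast for month 6 is 150.0; noisy real data would give an estimate with some error attached.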

How will you predict business and customer behavior and elevate analytics know-how throughout your company?
Spotfire is so much more than a fancy chart maker. It’s a user-friendly analytics tool that makes advanced features easy and consumable for the masses.

NoSQL databases

What is NoSQL?

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.

NoSQL (sometimes also called ‘Not Only SQL’) is a growing sector of the datastore market based around non-relational databases.

These systems offer excellent scalability, performance and flexibility, as well as simpler maintenance and cheaper hardware requirements. Consequently, they’re popular options for enterprise-level ‘big data’ and therefore network visualization.
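To make "modeled by means other than tabular relations" concrete, here is a small sketch of a document-style store built on a plain Python dict. It is not the API of any real NoSQL product; the helper names put and find are hypothetical. The point is that each record can have its own shape, including nested fields, with no shared table schema.

```python
# A toy document store: document id -> arbitrary nested document.
store = {}


def put(doc_id, document):
    """Save a document under its id; no schema is enforced."""
    store[doc_id] = document


def find(predicate):
    """Return every stored document for which predicate(document) is true."""
    return [doc for doc in store.values() if predicate(doc)]


# Two documents with different fields, which a fixed table layout would not allow.
put("u1", {"name": "Ada", "tags": ["admin", "dev"], "address": {"city": "London"}})
put("u2", {"name": "Lin", "email": "lin@example.com"})

londoners = find(lambda d: d.get("address", {}).get("city") == "London")
print([d["name"] for d in londoners])
```

The query has to tolerate missing fields (hence the .get calls), which is the usual trade-off for schema flexibility.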

How Well Do You Know NoSQL?

A DBA’s life is full of surprises – some more pleasant than others. As a DBA you’re pretty much “on it” 24-7-365: putting out fires, checking alerts, training people, keeping test databases in sync with production, debugging, and praying your developers are writing halfway decent code. If you’re swinging over to the NoSQL side (the “dark” side?), or considering it, then you’ll want to read this white paper, The DBA’s Guide to NoSQL. It’s essentially a NoSQL bible, sans orders from God -- or Darth Vader.

Search and knowledge discovery

In our 23-criteria evaluation of cognitive search and knowledge discovery solution providers, we identified the nine most significant ones — Attivio, Coveo,, Hewlett Packard Enterprise (HPE), IBM, Lucidworks, Mindbreeze, Sinequa, and Squirro — and researched, analyzed, and scored them. This report shows how each provider measures up and helps application development and delivery (AD&D) professionals make the right choice. This is an update of a previously published report; Forrester reviews and updates it periodically for continued relevance and accuracy.

Stream analytics

Azure Stream Analytics is a serverless, scalable complex event processing engine from Microsoft that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social media, and other applications. Users can set up alerts to detect anomalies, predict trends, trigger necessary workflows when certain conditions are observed, and make data available to other downstream applications and services for presentation, archiving, or further analysis.
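The kind of computation described above can be sketched in plain Python. This is not the Stream Analytics query language; it is an illustrative stand-in that groups timestamped sensor readings into fixed (tumbling) windows, averages each window, and flags windows whose average exceeds a threshold. For simplicity it processes a finite list rather than an unbounded stream, and the event data, function names, and threshold are all assumptions.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average the values of (timestamp, value) events per fixed-size window."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [running total, count]
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}


def alerts(averages, threshold):
    """Return the start times of windows whose average exceeds the threshold."""
    return [start for start, avg in averages.items() if avg > threshold]


# Hypothetical temperature readings as (seconds since start, degrees).
events = [(0, 20.0), (4, 22.0), (11, 30.0), (15, 34.0), (21, 21.0)]

averages = tumbling_window_averages(events, 10)
hot_windows = alerts(averages, 25.0)
print(averages, hot_windows)
```

Here only the window starting at second 10 averages above 25.0, so it is the only one flagged. A real engine would emit each window's result as soon as the window closes instead of waiting for all the input.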

In-memory data fabric


In-memory data fabrics represent the natural evolution of in-memory computing. Data fabrics generally take a broader approach to in-memory computing, grouping the whole set of in-memory computing use cases into a collection of well-defined, independent components. Usually a data grid is just one of the components provided by a data fabric. In addition to the data grid functionality, an in-memory data fabric typically also includes a compute grid, complex event processing (CEP) streaming, an in-memory file system, and more.

Distributed file stores


Distributed File Systems (DFS) provide the familiar directories-and-files hierarchical organization we find in our local workstation file systems. Each file or directory is identified by a path that includes all the other components in the hierarchy above it. What is unique about DFSs compared to local filesystems is that files or file contents may be stored across disks of multiple servers instead of a single disk.
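The following sketch illustrates the last point: one file's contents are split into chunks and spread across several servers, then reassembled on read. Each "server" is just a dict here, and the round-robin placement, chunk size, and function names are assumptions made for the example; production systems add replication, metadata services, and failure handling.

```python
def store_file(path, data, servers, chunk_size):
    """Split data into chunks and place them on servers round-robin.

    Returns the placement: the server index that holds each chunk, in order.
    """
    placement = []
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        server_index = chunk_index % len(servers)
        servers[server_index][(path, chunk_index)] = data[offset:offset + chunk_size]
        placement.append(server_index)
    return placement


def read_file(path, placement, servers):
    """Fetch every chunk from the server that holds it and join them in order."""
    return b"".join(
        servers[server_index][(path, chunk_index)]
        for chunk_index, server_index in enumerate(placement)
    )


servers = [{}, {}, {}]  # three hypothetical storage servers
content = b"hello distributed world"

placement = store_file("/logs/app.txt", content, servers, chunk_size=8)
restored = read_file("/logs/app.txt", placement, servers)
print(placement, restored)
```

The client still addresses the file by a single path; the placement list plays the role of the metadata that tells a reader where each piece lives.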

Data virtualization


Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.

Data virtualization can be regarded as an alternative to data warehousing and extract, transform, load (ETL). Unlike the traditional ETL process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors and avoids the workload of moving data around that may never be used. It also does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems. To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used. This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.
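A rough sketch of the "data remains in place" idea: the function below assembles a single customer view at query time from two differently shaped sources, without copying either into a warehouse. The sources (an in-memory SQLite table standing in for a CRM, and a dict standing in for a billing system), their field names, and the function customer_view are all hypothetical.

```python
import sqlite3

# Source 1: a relational CRM database (in-memory SQLite for the example).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, full_name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")

# Source 2: a billing system exposing records keyed by customer id, in cents.
billing = {1: {"balance_cents": 1250}}


def customer_view(customer_id):
    """Build a unified view by querying both sources on demand."""
    row = crm.execute(
        "SELECT full_name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    account = billing.get(customer_id, {})
    return {
        "id": customer_id,
        "name": row[0] if row else None,
        # Transformation step: the consumer sees currency units, not cents.
        "balance": account.get("balance_cents", 0) / 100,
    }


view = customer_view(1)

# Because nothing was copied, a change at the source is visible immediately.
billing[1]["balance_cents"] = 2000
updated_view = customer_view(1)
print(view, updated_view)
```

The consumer never needs to know that one source is SQL and the other is not, nor that balances are stored in cents; that hiding of format and location is the core of the approach.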

Data integration


Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial (such as when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example) domains. Data integration appears with increasing frequency as the volume of data (that is, big data) and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.
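The company-merger case can be sketched as follows: two customer lists with different field names are mapped onto one shared schema and de-duplicated by e-mail address. The schemas, sample records, and function names are invented for the example, and matching on a normalized e-mail is only one of many possible record-matching rules.

```python
def from_company_a(record):
    """Map company A's schema (Email, Name) onto the unified schema."""
    return {"email": record["Email"].strip().lower(), "name": record["Name"]}


def from_company_b(record):
    """Map company B's schema (mail, first, last) onto the unified schema."""
    return {
        "email": record["mail"].strip().lower(),
        "name": record["first"] + " " + record["last"],
    }


def integrate(source_a, source_b):
    """Combine both sources, keeping the first record seen for each e-mail."""
    unified = {}
    mapped = [from_company_a(r) for r in source_a] + [from_company_b(r) for r in source_b]
    for record in mapped:
        unified.setdefault(record["email"], record)
    return sorted(unified.values(), key=lambda r: r["email"])


company_a = [{"Email": "Ada@Example.com", "Name": "Ada Lovelace"}]
company_b = [
    {"mail": "ada@example.com ", "first": "Ada", "last": "Lovelace"},
    {"mail": "lin@example.com", "first": "Lin", "last": "Chen"},
]

customers = integrate(company_a, company_b)
print(customers)
```

The customer who appears in both sources ends up as a single unified record, so the merged list has two entries rather than three.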

Data preparation (automation)

Data mining tasks typically require significant effort in data preparation to find, transform, integrate and prepare the data for the relevant data mining tools. In addition, the work performed in data preparation is often not recorded and is difficult to reproduce from the raw data. In this paper we present an integrated approach to data preparation and data mining that combines the two steps into a single integrated process and maintains detailed metadata about the data sources, the steps in the process, and the resulting learned classifier produced from data mining algorithms. We present results on an example scenario, which shows that our approach provides a significant reduction in the time it takes to perform a data mining task.
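The abstract above does not give implementation details, so the following is only a loose illustration of one idea it mentions: recording metadata about each preparation step so the work can be reproduced. The pipeline runner, the step functions, and the sample rows are all hypothetical and are not the authors' method.

```python
def run_pipeline(rows, steps):
    """Apply named preparation steps in order and log what each one did."""
    log = []
    for name, step in steps:
        rows_in = len(rows)
        rows = step(rows)
        log.append({"step": name, "rows_in": rows_in, "rows_out": len(rows)})
    return rows, log


def drop_missing(rows):
    """Remove rows whose age is unknown."""
    return [r for r in rows if r["age"] is not None]


def scale_age(rows):
    """Rescale age so the largest value becomes 1.0."""
    oldest = max(r["age"] for r in rows)
    return [{**r, "age": r["age"] / oldest} for r in rows]


raw = [{"age": 20}, {"age": None}, {"age": 40}]

prepared, provenance = run_pipeline(
    raw, [("drop_missing", drop_missing), ("scale_age", scale_age)]
)
print(prepared)
print(provenance)
```

The provenance list shows that drop_missing reduced three rows to two and that scale_age left the row count unchanged, which is the kind of record that makes a preparation run repeatable.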

Data quality

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. The quality of data is determined by factors such as accuracy, completeness, reliability, relevance and how up to date it is. As data has become more intricately linked with the operations of organizations, the emphasis on data quality has gained greater attention.
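One of the factors listed above, completeness, is easy to quantify. The sketch below scores a data set as the fraction of required fields that are actually filled in. The choice of required fields, the treatment of empty strings as missing, and the sample records are assumptions; other factors such as accuracy or relevance need domain-specific checks.

```python
def completeness(records, required_fields):
    """Fraction of required field slots that hold a non-empty value."""
    if not records or not required_fields:
        return 0.0
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical customer records; one is missing its e-mail address.
records = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Lin", "email": ""},
    {"name": "Sam", "email": "sam@example.com"},
]

score = completeness(records, ["name", "email"])
print(round(score, 3))
```

Five of the six required slots are filled here, so the score is 5/6. Whether that is good enough depends on the purpose the data has to serve, which is exactly the "fitness in a given context" idea in the definition.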