Here’s our weekly roundup of updates on Big Data Analytics. Hope you find it useful – please don’t forget to subscribe!
- 2016 Hadoop Summit @ San Jose, CA (Jun 28 – 30)
- Spring for Apache Hadoop 2.4.0 GA released
Supports Apache Hadoop stable 2.7.1, Pivotal HD 3.0, Cloudera CDH 5.7, Hortonworks HDP 2.3 and 2.4
Spring for Apache Hadoop simplifies developing Apache Hadoop applications by providing a unified configuration model and easy-to-use APIs for HDFS, MapReduce, Pig, and Hive. It also integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.
- Using MapR, Mesos, Marathon, Docker, and Apache Spark to Deploy and Run Your First Jobs and Containers (This blog post describes the steps for deploying Mesos, Marathon, Docker, and Spark on a MapR cluster, and for running various jobs and Docker containers on this deployment).
Components used for this example:
- Mesos: an open-source cluster manager.
- Marathon: a cluster-wide init and control system.
- Spark: an open source cluster computing framework.
- Docker: automates the deployment of applications inside software containers.
- MapR Converged Data Platform: integrates Hadoop and Spark with real-time database capabilities, global event streaming, and scalable enterprise storage to power a new generation of big data applications.
- Next-Generation Genomics Analysis with Apache Spark
Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures, and its support for languages such as R, Python, and SQL eases the learning curve for practicing bioinformaticians. Widespread use of Spark for genomics, however, will require adapting and rewriting many of the common methods, tools, and algorithms that are in regular use today.
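To make the pipeline idea concrete, here is a minimal stdlib-only sketch of k-mer counting, a typical embarrassingly parallel genomics step; the per-read counting and the merge below correspond to the map and reduce stages a Spark job (e.g. `flatMap`/`reduceByKey` in PySpark) would distribute across a cluster. The reads and `k` value are toy examples, not from the article.

```python
from collections import Counter

def count_kmers(read, k=3):
    """Count k-mers in a single read (the per-record 'map' step)."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def merge_counts(counters):
    """Combine per-read counts (the 'reduce' step Spark would distribute)."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

reads = ["GATTACA", "ATTACAG"]
kmers = merge_counts(count_kmers(r) for r in reads)
print(kmers["ATT"])  # 'ATT' occurs once in each read, so 2 in total
```

In a real Spark deployment the same two functions would run unchanged over millions of reads; only the driver code that wires them to an RDD or DataFrame differs.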
- Real Time Marketing with Kafka, Storm, Cassandra and a pinch of Spark (Slides)
The combination of Apache Kafka as an event bus, Apache Storm for real-time or near-real-time processing, Apache Cassandra as an operational storage layer, and Apache Spark for analytical queries against that storage turned out to be an extremely well-performing system. The system consists of an Apache Kafka component used as an event bus, an Apache Storm topology that updates the profile and triggers marketing actions based on events, and an Apache Cassandra cluster that serves as a storage layer for the profile…
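The event-bus → profile-update → action flow can be sketched with in-memory stand-ins. In this hypothetical example, `event_bus` plays the role of the Kafka topic, `profiles` that of the Cassandra table, and `process_event` that of a Storm bolt; the three-page-view threshold is an invented trigger rule, not from the slides.

```python
import queue

event_bus = queue.Queue()   # stand-in for a Kafka topic
profiles = {}               # stand-in for a Cassandra profile table

def process_event(event):
    """Storm-bolt-style handler: update the profile, then decide on an action."""
    profile = profiles.setdefault(event["user"], {"page_views": 0})
    profile["page_views"] += 1
    # Trigger a marketing action once a (toy) threshold is crossed.
    if profile["page_views"] == 3:
        return f"send_offer:{event['user']}"
    return None

# Produce three page-view events for the same user.
for e in [{"user": "alice"}] * 3:
    event_bus.put(e)

actions = []
while not event_bus.empty():
    action = process_event(event_bus.get())
    if action:
        actions.append(action)
```

The analytical side (Spark querying the Cassandra-backed profiles) would then scan `profiles` in bulk rather than reacting per event.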
- Case Study: Pivotal HDB Shows Simplicity of ODPi Interoperability
Pivotal HDB, the Apache HadoopⓇ native SQL database powered by Apache HAWQ (incubating), has successfully passed internal testing on both the ODPi* Reference Implementation of Hadoop and one of the first ODPi Runtime Compliant distributions, Hortonworks HDP v2.4. Pivotal HDB testing was done with no modifications to the standard installation steps, making it the first big data application successfully tested to be interoperable with multiple ODPi Runtime Compliant distributions.
- Building Applications with Apache Flink (Part 1): Dataset, Data Preparation and Building a Model
We are going to build an application that processes the hourly weather measurements of more than 1,600 weather stations with Apache Flink. The articles will show how to write custom Source functions for generating data and how to implement custom Sink functions for writing to PostgreSQL and Elasticsearch.
Source code is available @ https://github.com/bytefish/FlinkExperiments
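The core aggregation such an application performs can be sketched without Flink: group hourly measurements by station and day, then keep the maximum, which corresponds to the `keyBy`/aggregate step a Flink job would run over the stream. Station IDs and readings below are made-up sample data, not from the article.

```python
from collections import defaultdict

# Hypothetical records: (station_id, day, hourly temperature in °C).
measurements = [
    ("USW00094728", "2016-06-01", 21.5),
    ("USW00094728", "2016-06-01", 24.0),
    ("USW00023174", "2016-06-01", 28.3),
]

def daily_max(records):
    """Group by (station, day) and keep the maximum temperature --
    the keyBy + max aggregation a Flink job would apply per window."""
    maxima = defaultdict(lambda: float("-inf"))
    for station, day, temp in records:
        maxima[(station, day)] = max(maxima[(station, day)], temp)
    return dict(maxima)

result = daily_max(measurements)
```

In the Flink version, a custom Source function would emit these records continuously and a Sink function would write the aggregates to PostgreSQL or Elasticsearch.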
- TensorFlow Wide & Deep Learning Tutorial
In this tutorial, we’ll introduce how to use the TF.Learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It’s useful for generic large-scale regression and classification problems with sparse input features (e.g., categorical features with a large number of possible feature values).
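To illustrate what "wide and deep" means structurally, here is a forward-pass-only sketch with hand-set (not learned) toy weights: a linear part that memorizes sparse cross-features plus a tiny one-hidden-layer network that generalizes from dense inputs, with the two logits summed before a sigmoid. The feature name and all parameters are invented for illustration; in TF.Learn the joint training itself is handled by the library.

```python
import math

def wide_score(features, weights):
    """Linear part: memorizes sparse cross-features via per-feature weights."""
    return sum(weights.get(f, 0.0) for f in features)

def deep_score(vector, w1, w2):
    """Tiny one-hidden-layer ReLU network: generalizes from dense inputs."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def wide_and_deep(features, vector, weights, w1, w2):
    """Joint prediction: sigmoid of the summed wide and deep logits."""
    logit = wide_score(features, weights) + deep_score(vector, w1, w2)
    return 1.0 / (1.0 + math.exp(-logit))

# Hand-set toy parameters (hypothetical, not learned):
weights = {"gender=f AND language=en": 1.0}
w1 = [[0.5, -0.5], [0.2, 0.8]]
w2 = [1.0, 0.5]
p = wide_and_deep(["gender=f AND language=en"], [1.0, 0.0], weights, w1, w2)
```

The key design point the tutorial exploits is that both logits feed one loss, so the wide and deep parts are trained jointly rather than ensembled after the fact.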