This Week in Big Data Analytics (July 3, 2016)

Here’s our weekly roundup of updates on Big Data Analytics. Hope you find it useful – please don’t forget to subscribe!

  • 2016 Hadoop Summit @ San Jose, CA (Jun 28 – 30)
  • Spring for Apache Hadoop 2.4.0 GA released
    Supports Apache Hadoop stable 2.7.1, Pivotal HD 3.0, Cloudera CDH 5.7, Hortonworks HDP 2.3 and 2.4

    Spring for Apache Hadoop simplifies developing Apache Hadoop applications by providing a unified configuration model and easy-to-use APIs for HDFS, MapReduce, Pig, and Hive. It also integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.
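    As a rough illustration of that unified configuration model, a Spring for Apache Hadoop XML context might look like the sketch below; the file system URI, paths, and mapper/reducer classes are placeholders, not taken from the release notes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/hadoop
           http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Hadoop connectivity configured in one place (placeholder URI) -->
    <hdp:configuration>
        fs.defaultFS=hdfs://localhost:8020
    </hdp:configuration>

    <!-- A MapReduce job wired up declaratively (placeholder classes/paths) -->
    <hdp:job id="wordCountJob"
             input-path="/input/" output-path="/output/"
             mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
             reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
</beans>
```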

  • Using MapR, Mesos, Marathon, Docker, and Apache Spark to Deploy and Run Your First Jobs and Containers (This blog post describes the steps for deploying Mesos, Marathon, Docker, and Spark on a MapR cluster, and for running various jobs and Docker containers on that deployment.)
    Components used for this example:

    • Mesos: an open-source cluster manager.
    • Marathon: a cluster-wide init and control system.
    • Spark: an open source cluster computing framework.
    • Docker: automates the deployment of applications inside software containers.
    • MapR Converged Data Platform: integrates Hadoop and Spark with real-time database capabilities, global event streaming, and scalable enterprise storage to power a new generation of big data applications.
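    To make one of these pieces concrete: Marathon launches a Docker container when you POST an application definition to its `/v2/apps` endpoint. The sketch below builds such a definition in Python; the image name, resources, and instance count are illustrative placeholders, not values from the blog post:

```python
import json

# Minimal Marathon application definition for running a Docker container.
# Image name, CPU/memory, and instance count are placeholders.
app = {
    "id": "/spark-demo",
    "cpus": 1.0,
    "mem": 1024,
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "example/spark-worker:latest",  # hypothetical image
            "network": "HOST",
        },
    },
}

payload = json.dumps(app)
# In a real deployment this JSON would be POSTed to Marathon, e.g.:
#   curl -X POST http://<marathon-host>:8080/v2/apps \
#        -H "Content-Type: application/json" -d @app.json
print(payload)
```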
  • Next-Generation Genomics Analysis with Apache Spark
    Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures, and its support for languages such as R, Python, and SQL ease the learning curve for practicing bioinformaticians. Widespread use of Spark for genomics, however, will require adapting and rewriting many of the common methods, tools, and algorithms that are in regular use today.
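    To see why genomics workloads fit this model, consider a common primitive such as k-mer counting: it decomposes naturally into Spark's flatMap / reduceByKey pattern. The sketch below imitates that pattern in plain Python; the reads are made up, and in real Spark these would be RDD or DataFrame operations over distributed data:

```python
from collections import Counter
from itertools import chain

def kmers(read, k=3):
    """Emit all overlapping k-mers from one sequencing read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# Toy reads standing in for a distributed dataset (an RDD in Spark).
reads = ["GATTACA", "TACAGAT"]

# Spark-style flatMap -> (kmer, 1) -> reduceByKey(+),
# collapsed here into a single Counter over the flattened k-mers.
counts = Counter(chain.from_iterable(kmers(r) for r in reads))

print(counts["TAC"])  # → 2: "TAC" appears once in each read
```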
  • Real Time Marketing with Kafka, Storm, Cassandra and a pinch of Spark (Slides)
    The combination of Apache Kafka as an event bus, Apache Storm for real-time or near-real-time processing, Apache Cassandra as an operational storage layer, and Apache Spark for analytical queries against that storage turned out to be an extremely well-performing system. The system consists of an Apache Kafka component used as an event bus, an Apache Storm topology that updates the profile and triggers marketing actions based on events, and an Apache Cassandra cluster that serves as the storage layer for the profile…
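    The event flow described in the slides can be simulated end to end with in-memory stand-ins, which makes the division of labor between the components easy to see. Everything below is illustrative: the queue plays the Kafka bus, the dict plays the Cassandra profile table, the function plays the Storm bolt logic, and the "3 page views" rule is a made-up trigger:

```python
from collections import defaultdict, deque

event_bus = deque()                                  # Kafka stand-in
profiles = defaultdict(lambda: {"page_views": 0})    # Cassandra stand-in
actions = []                                         # triggered marketing actions

def process(event):
    """Storm-bolt-style logic: update the profile, maybe trigger an action."""
    profile = profiles[event["user"]]
    profile["page_views"] += 1
    # Hypothetical rule: after 3 page views, trigger a marketing action.
    if profile["page_views"] == 3:
        actions.append(("send_offer", event["user"]))

# Producer side: events land on the bus...
for _ in range(3):
    event_bus.append({"user": "u1", "type": "page_view"})

# ...consumer side: the topology drains and processes them.
while event_bus:
    process(event_bus.popleft())

print(profiles["u1"]["page_views"], actions)  # → 3 [('send_offer', 'u1')]
```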

  • Case Study: Pivotal HDB Shows Simplicity of ODPi Interoperability
    Pivotal HDB, the Apache Hadoop® native SQL database powered by Apache HAWQ (incubating), has successfully passed internal testing on both the ODPi* Reference Implementation of Hadoop, as well as one of the first ODPi Runtime Compliant distributions, Hortonworks HDP v. 2.4. Pivotal HDB testing was done with no modifications to standard installation steps, making it the first big data application successfully tested to be interoperable with multiple ODPi Runtime Compliant distributions.
  • Building Applications with Apache Flink (Part 1): Dataset, Data Preparation and Building a Model
    We are going to build an application that processes the hourly weather measurements of more than 1,600 weather stations with Apache Flink. The articles will show how to write custom Source functions for generating data and how to implement custom Sink functions for writing to PostgreSQL and Elasticsearch.
    Source code is available @
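    The custom source/sink pattern the articles describe can be approximated outside Flink as a generator feeding a sink object. The sketch below is a plain-Python stand-in for Flink's SourceFunction/SinkFunction pair; the station name and measurement values are invented, and a real sink would write to PostgreSQL or Elasticsearch instead of a list:

```python
import random

def weather_source(n):
    """Stand-in for a Flink SourceFunction: emit n hourly measurements."""
    random.seed(42)  # deterministic for the example
    for hour in range(n):
        yield {"station": "TEST-STATION", "hour": hour,
               "temperature_c": round(random.uniform(-5, 30), 1)}

class CollectingSink:
    """Stand-in for a Flink SinkFunction: invoke() receives each record."""
    def __init__(self):
        self.rows = []
    def invoke(self, record):
        self.rows.append(record)

sink = CollectingSink()
for measurement in weather_source(24):
    sink.invoke(measurement)

print(len(sink.rows))  # → 24 hourly records for one station-day
```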
  • TensorFlow Wide & Deep Learning Tutorial
    In this tutorial, we’ll introduce how to use the TF.Learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It’s useful for generic large-scale regression and classification problems with sparse input features (e.g., categorical features with a large number of possible feature values).

    Comparison of Wide vs Deep Model
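    The core idea — the wide linear model and the deep network are trained jointly and their output logits are summed before the sigmoid — can be sketched in a few lines of plain Python. The weights and features below are made up for illustration; the actual tutorial uses the TF.Learn `DNNLinearCombinedClassifier`:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Wide part: a linear model over sparse (often crossed) categorical features.
wide_weights = {"occupation=engineer": 0.8, "education=masters": 0.5}

def wide_logit(features):
    return sum(wide_weights.get(f, 0.0) for f in features)

# Deep part: stands in for a feed-forward network over dense embeddings.
# A fixed score keeps the sketch self-contained.
def deep_logit(features):
    return -0.3  # placeholder for the network's output

# Joint prediction: sum the two logits, then apply the sigmoid.
features = ["occupation=engineer", "education=masters"]
probability = sigmoid(wide_logit(features) + deep_logit(features))
print(round(probability, 3))  # → 0.731, i.e. sigmoid(0.8 + 0.5 - 0.3)
```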
  • 5 Tips for Learning to Code for Visualization
    There are many click-and-play software programs, solutions, and tools to help you visualize your data. They can be very helpful, and you can get a lot done without a single line of code. However, being able to code your own visualizations carries its own benefits, such as flexibility, speed, and complete customization. Here are some tips to get you started, based on my own experiences with R and, more recently, the JavaScript library d3.js.

    Data Visualization - Big Data Analytics Weekly Updates