This Week in Big Data Analytics (Jul 17, 2016) – Weekly Roundup

Here’s our weekly roundup of articles related to Big Data Analytics, Machine Learning and Data Science. Hope you find it useful – please don’t forget to subscribe!

  • Hadoop/HDFS multi-tenancy support
    HDFS, one of the most widely used storage layers in the Hadoop ecosystem, has limited multi-tenancy support. Many upstream projects, such as YARN and HBase, have added multi-tenancy features of their own. This talk explores the existing multi-tenancy features, including their use cases and limitations, and the ongoing work to provide better multi-tenancy support for the Hadoop ecosystem from the HDFS layer, such as effective NameNode throttling and DataNode and YARN QoS integration.

  • Hadoop Summit San Jose 2016 Wrap-up – ODPi progress

    ODPi Runtime Specification
  • Installing Hadoop on a single node running CentOS 7
  • Horses for Courses: Apache Spark Streaming and Apache NiFi
    Comparing Apache NiFi and Apache Spark Streaming for different streaming and IoT use cases.
  • MongoDB and Apache Spark at China Eastern Airlines
    The new MongoDB Connector for Apache Spark enables a new fare calculation engine that supports 180 million fares and 1.6 billion queries per day, following a migration off Oracle.

    Airlines Fare Engine Architecture
  • Getting started with GraphFrames in Apache Spark
    GraphX is one of the four foundational components of Spark, along with Spark SQL, Spark Streaming and MLlib, and provides general-purpose graph APIs, including graph-parallel computation.

    Apache Spark Graph Frames Architecture
  • Structured Streaming (aka Streaming Datasets) – Mastering Apache Spark
    Structured Streaming is a new computation model introduced in Spark 2.0.0. It offers a high-level streaming API built on top of Datasets (inside the Spark SQL engine) for continuous, incremental execution of structured queries.
  • Combining machine learning frameworks with Apache Spark
    Machine Learning (ML) workflows involve a sequence of processing and learning stages. Realistic workflows combine specialized libraries with more general data management workflows. Apache Spark is well-known as a powerful platform to perform iterative computations required for ML. This talk presents how to combine the strengths of Spark’s ML library (MLlib) with popular packages such as CoreNLP, scikit-learn, and TensorFlow.

  • Building a Machine Learning Orchestration Framework on Apache Mesos
    This talk outlines how Docker, Spark, Hadoop and several other building blocks can be integrated into a machine learning framework on Mesos. The Mesos framework leverages custom executors, framework/status messages and resource attributes to schedule tasks in a multi-tenant environment. A heterogeneous workload of Spark, Python, R and Scala tasks coexists, running thousands of computations concurrently on an elastic Mesos cluster of hundreds of nodes.

  • University of Zurich: Machine Learning Introduction and Data Sets. Example using Weka [Slides]
  • Learn to Create D3.js Data Visualizations by Example
    There are only three JavaScript libraries that I would suggest every web developer should learn: jQuery, Underscore and D3. These are libraries that allow you to think about code in new ways: jQuery allows you to write less and do more with the DOM, Underscore (or lodash) gives you functional tools for changing the way you write programs and D3 gives you a rich tool-set for data manipulation and graphics programming.
  • MATLAB implementation of the TensorFlow Neural Networks Playground
    Inspired by the TensorFlow Neural Networks Playground interface, which is readily available online, MathWorks released a MATLAB implementation of the same neural network interface for applying artificial neural networks to regression and classification of highly non-linear data.

    TensorFlow Neural Network Playground
  • AI, Deep Learning, and Machine Learning: A Primer
    From types of machine intelligence to a tour of algorithms, a16z (Andreessen Horowitz) Deal and Research team head Frank Chen walks us through the basics (and beyond) of AI and deep learning in this slide presentation.