Here’s our weekly roundup of articles related to Big Data & Analytics. Hope you find it useful – please don’t forget to subscribe!
- Ingesting Email into Apache Hadoop in Real Time for Analysis
In this post, the authors – Jordan Volz and Stefan Salandy – describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange, using Apache James, Apache Flume, Apache Kafka, and Spark Streaming.
- Design and Deployment Considerations for Deploying Apache Kafka on AWS
- Exploring Stateful Streaming with Apache Spark
Stateful streams, especially the new mapWithState transformation, bring a lot of power to end users who want to work with stateful data in Spark while keeping the guarantees Spark provides: resiliency, distribution, and fault tolerance.
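To make the idea of stateful mapping concrete, here is a minimal plain-Python sketch. It is not the Spark API: the names `update_count` and `run_batches` are hypothetical, and an ordinary dict stands in for the per-key state that Spark would manage and checkpoint for you. It only illustrates the pattern of folding each new record into state that survives from one micro-batch to the next.

```python
from typing import Dict, Iterable, List, Tuple


def update_count(key: str, value: int, state: Dict[str, int]) -> Tuple[str, int]:
    """Fold one new value into the running total kept for its key."""
    total = state.get(key, 0) + value
    state[key] = total   # keep the updated state for later batches
    return key, total    # the mapped record emitted downstream


def run_batches(batches: Iterable[List[Tuple[str, int]]]) -> List[List[Tuple[str, int]]]:
    """Apply update_count to every record, carrying state across batches."""
    state: Dict[str, int] = {}  # hypothetical stand-in for Spark-managed state
    return [[update_count(k, v, state) for k, v in batch] for batch in batches]


# Two micro-batches: "spark" appears in both, so its count accumulates.
batches = [[("spark", 1), ("kafka", 1)], [("spark", 2)]]
results = run_batches(batches)
```

In this sketch the second batch emits `("spark", 3)` because the total from the first batch is still in the state store when the new record arrives.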
- Apache Beam: The Case for Unifying Streaming APIs
Apache Beam aims to provide a unified stream processing model along with a set of language-specific SDKs for defining and executing complex data processing, data ingestion and integration workflows. This will simplify how we implement and think about large-scale batch and streaming data processing. Pipelines can be run on Apache Flink, Apache Spark, and Google Cloud Dataflow with more to come.
- Livy, the Open Source REST Service for Apache Spark, Joins Cloudera Labs
- A Functional Approach to Logging in Apache Spark
Nicolas shows you how to improve logging on Apache Spark by using the Writer monad: instead of writing the logs on each worker node, you collect them back to the master and write them out there.
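The Writer idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code from the article: the `Writer` class and the `double`, `increment`, and `process` functions are hypothetical, and a list comprehension stands in for a Spark map followed by a collect. Each step returns its result together with its log lines, so the logs travel with the data and can be written in one place at the end.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Writer(Generic[A]):
    """A value paired with the log lines produced while computing it."""
    value: A
    logs: Tuple[str, ...] = ()

    def bind(self, f: "Callable[[A], Writer[B]]") -> "Writer[B]":
        # Run the next step and append its logs to the ones gathered so far.
        nxt = f(self.value)
        return Writer(nxt.value, self.logs + nxt.logs)


def double(x: int) -> Writer[int]:
    return Writer(x * 2, (f"doubled {x}",))


def increment(x: int) -> Writer[int]:
    return Writer(x + 1, (f"incremented {x}",))


def process(x: int) -> Writer[int]:
    # No logging side effects here: the log lines are part of the result.
    return Writer(x).bind(double).bind(increment)


records = [1, 2]
results = [process(x) for x in records]  # stands in for a distributed map + collect
values = [w.value for w in results]
all_logs = [line for w in results for line in w.logs]  # written once, in one place
```

Because the functions stay pure, the same `process` can run on any worker, and the decision about where and how to write `all_logs` is made only after the results come back.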
- Support for Python on Cloud Dataflow is going beta
Here’s the guide to getting started with Google Cloud Dataflow using Python: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python
- Selecting the right chart for Data Visualization needs
A handy guide to picking the right chart for each kind of data visualization need: comparison, distribution, composition, or relationship.
- Visualizations That Really Work – Harvard Business Review
Not long ago, the ability to create smart data visualizations, or dataviz, was a nice-to-have skill. For the most part, it benefited design- and data-minded managers who made a deliberate decision to invest in acquiring it. That’s changed. Now visual communication is a must-have skill for all managers, because more and more often, it’s the only way to make sense of the work they do.