Data Engineering Podcast-logo

Data Engineering Podcast

Technology Podcasts >

More Information

Location:

United States

Language:

English


Episodes

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

11/18/2018
More
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world's largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it...

Duration:01:23:06

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

11/11/2018
More
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Duration:00:48:27

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

11/4/2018
More
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the...

Duration:00:57:26

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

10/28/2018
More
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn't necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the...

Duration:00:53:32

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

10/21/2018
More
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among...

Duration:00:54:10

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

10/14/2018
More
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format...

Duration:01:23:53

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51

10/9/2018
More
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn't have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is...

Duration:01:14:10

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

9/30/2018
More
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing...

Duration:01:00:47

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

9/23/2018
More
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an...

Duration:00:55:38

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

9/16/2018
More
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project...

Duration:01:05:47

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

9/9/2018
More
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and...

Duration:00:51:19

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

9/3/2018
More
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Building a master data set helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the...

Duration:00:58:16

Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

8/27/2018
More
There are myriad reasons why data should be protected, and just as many ways to enforce it in tranist or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anny Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.

Duration:00:31:28

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

8/19/2018
More
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using...

Duration:00:49:11

Putting Airflow Into Production With James Meickle - Episode 43

8/12/2018
More
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design...

Duration:00:43:58

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

8/6/2018
More
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres...

Duration:01:03:31

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

7/29/2018
More
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small...

Duration:00:30:54

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughty - Episode 41

7/29/2018
More
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of...

Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

7/15/2018
More
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the...

Duration:00:47:10

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

7/8/2018
More
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the...

Duration:00:26:44