Data Engineering Podcast

Technology Podcasts

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

Location:

United States

Description:

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

Language:

English


Episodes

Make Database Performance Optimization A Playful Experience With Ottertune

6/22/2021
The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created Ottertune to find the optimal set of parameters to...

Duration:01:13:52

Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk

6/17/2021
Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable...

Duration:00:51:57

Accelerating ML Training And Delivery With In-Database Machine Learning

6/14/2021
When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don't have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a...

Duration:01:16:01

Taking A Tour Of The Google Cloud Platform For Data And Analytics

6/11/2021
Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications,...

Duration:01:05:07

Make Sure Your Records Are Reliable With The BookKeeper Distributed Storage Layer

6/8/2021
The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn't have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the...

Duration:00:45:51

Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook

6/3/2021
SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely...

Duration:01:01:44

Making Data Pipelines Self-Serve For Everyone With Shipyard

6/1/2021
Every part of the business relies on data, yet only a small team has the context and expertise to build workflows and pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low code solution that lets non...

Duration:01:12:57

Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse

5/27/2021
The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed...

Duration:01:09:02

Easily Build Advanced Similarity Search With The Pinecone Vector Database

5/25/2021
Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible...

Duration:00:46:22

Easily Build Advanced Similarity Search With The Pinecone Vector Database

5/24/2021
Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible...

Duration:01:13:09

A Holistic Approach To Data Governance Through Self Reflection At Collibra

5/20/2021
Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and...

Duration:00:52:08

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

5/18/2021
Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building...

Duration:01:07:43

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

5/17/2021
Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building...

Duration:00:59:25

Building Your Data Warehouse On Top Of PostgreSQL

5/13/2021
There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility, then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production-ready data warehouse with Postgres.

Duration:01:37:09

Making Analytical APIs Fast With Tinybird

5/10/2021
Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production ready API or data product. In this episode CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in Clickhouse as the core of their platform. This is a great conversation about...

Duration:01:03:29

Making Spark Cloud Native At Data Mechanics

5/6/2021
Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development...

Duration:00:56:34

The Grand Vision And Present Reality of DataOps

5/3/2021
The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo,...

Duration:01:16:51

Self Service Data Exploration And Dashboarding With Superset

4/26/2021
The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates...

Duration:01:01:51

Moving Machine Learning Into The Data Pipeline at Cherre

4/19/2021
Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as...

Duration:01:03:12

Exploring The Expanding Landscape Of Data Professions with Josh Benamram of Databand

4/12/2021
"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams...

Duration:01:23:16