
Data Engineering - Best Practices

Suraj Acharya, Director, Engineering
Singh Garewal, Director, Marketing
Data Engineering Drivers

• Advanced analytics / ML coming of age
• Industry-spanning adoption
• Technology innovation: hardware, cloud and storage
• Increased financial scrutiny
• Role evolution: CDO, Data Curator
VISION: Accelerate innovation by unifying data science, engineering and business

SOLUTION: Unified Analytics Platform

WHO WE ARE:
• Original creators of Apache Spark and Databricks Delta
• 2000+ global companies use our platform across the big data & machine learning lifecycle
Apache Spark: The 1st Unified Analytics Engine
Uniquely combined Data & AI technologies

• Runtime: Delta on the Spark Core Engine
• Big Data Processing: ETL + SQL + Streaming
• Machine Learning: MLlib + SparkR
Databricks Delta
Adds data reliability and performance to data lakes

• Co-designed compute & storage
• Compatible with Spark APIs
• Built on open standards (Parquet): versioned Parquet files, a transactional log, and indexes & stats
• Leverages your cloud blob storage
Data Engineering Playing Field

• Orchestration and Workflow
• Sandbox
• CI/CD
• Data Quality
• Compute: ETL, analytics, ML
• Dashboarding / Reporting / BI
• Message Log
• Data Catalog / Lineage
• Data Model
• Storage
Data Model

What: Data organization and how the different top-level data sets relate to each other.

How:
• Audience segmentation
• Table categorization
• Data types
• Modeling discipline
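Table categorization can be as simple as a naming convention that maps tables to tiers. The tier names and prefixes below are illustrative assumptions, not something the deck prescribes; real teams usually encode this in a catalog rather than in names alone:

```python
# Hypothetical tiering convention (prefixes and tier names are assumptions).
TIER_PREFIXES = {"raw_": "bronze", "clean_": "silver", "agg_": "gold"}

def table_tier(table_name):
    """Classify a table into a tier based on its name prefix."""
    for prefix, tier in TIER_PREFIXES.items():
        if table_name.startswith(prefix):
            return tier
    return "unclassified"

tier = table_tier("agg_daily_revenue")
```

A convention like this makes audience segmentation mechanical: analysts query "gold" tables, while "bronze" stays internal to the pipeline.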
Data Catalog + Lineage

What:
• Easy discovery of data sets
• Policy enforcement

How:
• Explore data model
• Search + suggestions
• Column and table annotations and grouping
• Lineage tracking
• Automatic flagging of PII + sensitive columns
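Automatic PII flagging often starts with pattern matching on column names. A minimal sketch, assuming a hand-picked pattern list and an illustrative schema (a real catalog would also sample column values and track annotations):

```python
import re

# Assumed name patterns for likely-PII columns; illustrative, not exhaustive.
PII_NAME_PATTERNS = [r"email", r"phone", r"ssn", r"address", r"birth"]

def flag_pii_columns(columns):
    """Return the subset of column names that match a PII pattern."""
    flagged = []
    for col in columns:
        if any(re.search(p, col.lower()) for p in PII_NAME_PATTERNS):
            flagged.append(col)
    return flagged

schema = ["user_id", "email_address", "signup_date", "phone_number", "plan"]
pii = flag_pii_columns(schema)
```

Flagged columns can then be annotated in the catalog so downstream policy enforcement (masking, access control) has something to key off.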
Storage Architecture

What: Where data is stored and in what formats.

How:
• Columnar formats
• Minimize metadata lookups
• Compaction
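Compaction merges many small files into fewer large ones, which reduces metadata lookups per query. A greedy planning sketch under assumed file sizes (this is not Delta's actual OPTIMIZE implementation, just the idea):

```python
# Assumed target output size in MB; real systems tune this per workload.
TARGET_MB = 128

def plan_compaction(file_sizes_mb, target_mb=TARGET_MB):
    """Greedily group small files into batches near the target output size."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            batches.append(current)   # close the batch before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

batches = plan_compaction([5, 10, 40, 60, 90, 110])
```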
Message Log

What: Source of streaming and batch data.

How:
• Read logs into "raw" tables with minimal preprocessing
• Firehose
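"Minimal preprocessing" usually means parsing the payload and attaching ingestion metadata, deferring all cleaning to later stages. A sketch with assumed JSON-lines events and field names:

```python
import json
import datetime

def to_raw_rows(log_lines, source="clickstream"):
    """Land log lines in a 'raw' table: parse only, attach ingest metadata."""
    rows = []
    for line in log_lines:
        rows.append({
            "ingest_ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "source": source,                 # which log this came from
            "payload": json.loads(line),      # original event, uncleaned
        })
    return rows

lines = ['{"event": "click", "user": 1}', '{"event": "view", "user": 2}']
raw = to_raw_rows(lines)
```

Keeping the payload intact means a parsing bug downstream never forces a re-read of the message log.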
Sandbox

What: Isolated environment for experimentation and exploration.

How:
• Notebook collaboration
• Tracking
• Management
• Source control
Compute / Data Processing

What: Execution engine used to process data; the layer where "jobs" run.

How:
• Multiple frameworks and languages
• SQL compatibility
• Connectors for your data sources
• Less data scanned => faster job execution
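"Less data scanned" is mostly partition pruning: if data is laid out by partition key, a filter on that key lets the engine skip whole partitions. A toy sketch with an assumed date-partitioned layout:

```python
# Assumed partition layout: directory-style keys mapping to data files.
partitions = {
    "date=2019-01-01": ["a.parquet", "b.parquet"],
    "date=2019-01-02": ["c.parquet"],
    "date=2019-01-03": ["d.parquet", "e.parquet"],
}

def prune(partitions, wanted_date):
    """Return only the files in the partition matching the filter."""
    return partitions.get(f"date={wanted_date}", [])

files_to_scan = prune(partitions, "2019-01-02")
```

Here a query filtered to one day scans one file instead of five; real engines do the same with partition metadata and file-level statistics.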
Orchestration and Workflow

What:
• Scheduling and triggering jobs
• Job dependencies

How:
• "DAG": graphical view of job dependencies and status
• Describe dependencies in code
• Retry policies
• Backfill policies
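"Describe dependencies in code" can be sketched with the standard library alone: a dependency mapping, topological ordering, and a simple retry wrapper. The job names and stubbed work are assumptions for illustration:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {                       # job -> the upstream jobs it depends on
    "report": {"aggregate"},
    "aggregate": {"ingest"},
    "ingest": set(),
}

def run_with_retries(job, attempts=3):
    """Run one job, retrying on failure (work is stubbed out here)."""
    for attempt in range(1, attempts + 1):
        try:
            return f"{job}: ok"            # real work would go here
        except Exception:
            if attempt == attempts:
                raise                      # exhausted the retry policy

order = list(TopologicalSorter(deps).static_order())
results = [run_with_retries(job) for job in order]
```

Production schedulers add the rest of the slide's list on top of this core: per-job retry and backfill policies, plus the graphical DAG view of status.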
Dashboarding / Reporting / BI

What:
• Static reports and auto-updating dashboards
• Business facing

How:
• Static graphs + emailed reports
• Rollups + aggregations
• Data modelling + Data Analyst
• Real-time dashboards
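Rollups pre-compute the aggregates a dashboard needs so each page load reads a tiny table instead of raw events. A sketch with assumed event rows and metric names:

```python
from collections import defaultdict

# Illustrative raw events; a real rollup job would read these from storage.
events = [
    {"day": "2019-01-01", "revenue": 10.0},
    {"day": "2019-01-01", "revenue": 5.0},
    {"day": "2019-01-02", "revenue": 7.5},
]

def daily_rollup(rows):
    """Aggregate per-event revenue into one total per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["day"]] += row["revenue"]
    return dict(totals)

rollup = daily_rollup(events)
```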
Quality: Monitoring and Alerting

What:
• Mechanisms for detecting and fixing incorrect and stale data sets
• Anomaly detection

How:
• Monitor job failures
• Prioritization and coalescing
• Emit metrics during and after jobs
• Metrics database + graphing
• Monitoring dashboards
• Define KPIs and create alerts
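"Define KPIs and create alerts" reduces to: jobs emit metrics, and a checker compares them against thresholds. The metric names and thresholds below are assumptions for illustration:

```python
# Assumed KPI thresholds; real teams tune these per data set.
KPIS = {
    "row_count_min": 1000,     # alert if fewer rows than this landed
    "max_staleness_hours": 6,  # alert if data is older than this
}

def check_kpis(metrics, kpis=KPIS):
    """Return alert strings for any breached KPI."""
    alerts = []
    if metrics["row_count"] < kpis["row_count_min"]:
        alerts.append(f"low row count: {metrics['row_count']}")
    if metrics["staleness_hours"] > kpis["max_staleness_hours"]:
        alerts.append(f"stale data: {metrics['staleness_hours']}h old")
    return alerts

alerts = check_kpis({"row_count": 250, "staleness_hours": 2})
```

In practice the emitted metrics land in a metrics database, the checks run on a schedule, and the alert strings become pages or tickets.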
CI/CD

What: Development tools and processes.

How:
• Sandbox queries, job code and workflows in source control
• Deployment process: life of a PR
• Multiple environment support
• Test data sets: sampling, obfuscation, randomization
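Test data sets combine the last bullet's techniques: sample production-shaped rows, then obfuscate sensitive fields. A sketch where the sample rate, seed, and PII columns are assumptions:

```python
import hashlib
import random

def make_test_set(rows, pii_columns=("email",), sample_rate=0.5, seed=42):
    """Sample rows and one-way-hash PII columns for use in CI."""
    rng = random.Random(seed)                 # seeded => reproducible CI runs
    sampled = [r for r in rows if rng.random() < sample_rate]
    for row in sampled:
        for col in pii_columns:
            if col in row:                    # hash keeps joinability, hides value
                row[col] = hashlib.sha256(row[col].encode()).hexdigest()[:12]
    return sampled

prod_rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(10)]
test_rows = make_test_set(prod_rows)
```

Hashing (rather than deleting) PII-like columns keeps joins between test tables working while never shipping real values into source control.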
Check out Databricks Delta databricks.com/delta

Questions?
Thank you
