#bigquery


🚀 DataTalksClub's Data Engineering Zoomcamp Week 3 - BigQuery as a data warehousing solution.

🎯 For this week's module, we used Google BigQuery to read Parquet files from a GCS bucket and compare query performance across regular, external, and partitioned/clustered tables (rough DDL sketch below).

🔗 My answers to this module: github.com/goosethedev/de-zoom

Homeworks for the DataTalksClub's Data Engineering Zoomcamp 2025. - goosethedev/de-zoomcamp-2025
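A minimal sketch of that comparison using the Node.js BigQuery client (not taken from the linked repo; the dataset, bucket, and column names are placeholders):

```typescript
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

async function main(): Promise<void> {
  // External table: BigQuery reads the Parquet files straight from GCS.
  await bigquery.query(`
    CREATE OR REPLACE EXTERNAL TABLE zoomcamp.trips_ext
    OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/trips/*.parquet'])`);

  // Regular (native) table loaded from the external one.
  await bigquery.query(`
    CREATE OR REPLACE TABLE zoomcamp.trips AS
    SELECT * FROM zoomcamp.trips_ext`);

  // Partitioned + clustered copy for cheaper filtered queries.
  await bigquery.query(`
    CREATE OR REPLACE TABLE zoomcamp.trips_part
    PARTITION BY DATE(pickup_datetime)
    CLUSTER BY vendor_id
    AS SELECT * FROM zoomcamp.trips`);

  // Dry-run the same filter against both native tables to compare bytes scanned.
  for (const table of ["zoomcamp.trips", "zoomcamp.trips_part"]) {
    const [job] = await bigquery.createQueryJob({
      query: `SELECT COUNT(*) FROM ${table} WHERE DATE(pickup_datetime) = '2024-01-15'`,
      dryRun: true,
    });
    console.log(table, job.metadata.statistics.totalBytesProcessed, "bytes processed");
  }
}

main().catch(console.error);
```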

FFS. Turns out (after I built a feature) that you can't supply a schema for BigQuery Materialised Views.

> Error: googleapi: Error 400: Schema field shouldn't be used as input with a materialized view, invalid

So it's impossible to have column descriptions for MVs? That sucks.
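The 400 is easy to reproduce with the Node.js client; a minimal sketch (dataset, view, and query are made up, and this assumes `materializedView` metadata is passed through to the tables API as-is):

```typescript
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

async function createMv(): Promise<void> {
  await bigquery.dataset("logs").createTable("daily_counts_mv", {
    materializedView: {
      query: "SELECT DATE(ts) AS day, COUNT(*) AS n FROM logs.events GROUP BY day",
    },
    // Supplying a schema here (e.g. just to attach column descriptions) is what
    // triggers "Error 400: Schema field shouldn't be used as input with a
    // materialized view".
    schema: [
      { name: "day", type: "DATE", description: "Event date (UTC)" },
      { name: "n", type: "INTEGER", description: "Events per day" },
    ],
  });
}

createMv().catch(console.error);
```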

Whilst migrating our log pipeline to the BigQuery Storage API, and thus end-to-end streaming of data from Cloud Storage (GCS) via Eventarc and Cloud Run (read, transform, enrich in NodeJS) into BigQuery, I tested some big files, many times larger than anything we've ever seen in the wild.

It runs at just over 3 log lines/rows per millisecond end-to-end (i.e. including writing to BigQuery), sustained over 3.2M log lines.

Would be interested to know how that compares with similar systems.
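Back-of-envelope, using only the numbers in the post:

```typescript
// ~3 rows/ms sustained over 3.2M rows
const rows = 3_200_000;
const rowsPerMs = 3; // "just over 3 log lines/rows per millisecond"
console.log(`${rowsPerMs * 1000} rows/s, ~${(rows / rowsPerMs / 60000).toFixed(1)} min end-to-end`);
// => 3000 rows/s, ~17.8 min for the 3.2M-line test file
```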

After several iterations, I think I've finally got my log ingest pipeline working properly, at scale, using the #BigQuery Storage API.
Migrating from the "legacy" streaming API (it isn't "streaming" in the code sense) brought some complications that have been really hard to deal with, e.g.:
* A single failing row in a write means the entire write fails
* SQL column defaults don't apply unless you specifically configure them to
* A 10MB/write limit (batching sketch below)
I rewrote the whole thing today & finally things are looking good! 🤞
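One way to work around the first and third of those (a sketch only, not this pipeline's actual code; `MAX_APPEND_BYTES` and `isValidRow` are placeholder names, not Storage API ones): validate rows up front so one bad row can't sink a whole append, and chunk each batch so every write stays under the limit.

```typescript
const MAX_APPEND_BYTES = 9 * 1024 * 1024; // stay safely below the 10 MB/write cap

type LogRow = Record<string, unknown>;

function isValidRow(row: LogRow): boolean {
  // Whatever per-row checks the schema needs; a single invalid row would
  // otherwise fail the entire append.
  return typeof row["timestamp"] === "string";
}

function chunkRows(rows: LogRow[]): LogRow[][] {
  const chunks: LogRow[][] = [];
  let current: LogRow[] = [];
  let currentBytes = 0;

  for (const row of rows.filter(isValidRow)) {
    const rowBytes = Buffer.byteLength(JSON.stringify(row)); // rough size proxy
    if (currentBytes + rowBytes > MAX_APPEND_BYTES && current.length > 0) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(row);
    currentBytes += rowBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Each chunk then becomes one append on the Storage Write API stream; column
// defaults still have to be enabled explicitly rather than relying on the old
// streaming API behaviour.
```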