<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Aws on nanta - Data Engineering</title><link>https://nanta-data.dev/en/tags/aws/</link><description>Recent content in Aws on nanta - Data Engineering</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>© 2026 nanta</copyright><lastBuildDate>Mon, 09 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://nanta-data.dev/en/tags/aws/index.xml" rel="self" type="application/rss+xml"/><item><title>AWS EC2 Instance Architecture Comparison: ARM Graviton4 vs AMD Turin — Is the Fastest Instance the Best Choice?</title><link>https://nanta-data.dev/en/posts/ec2-instance-architecture-comparison/</link><pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/ec2-instance-architecture-comparison/</guid><description>In the 2026 cloud VM benchmarks, AMD EPYC Turin (C8a) dominated both single-threaded and multi-threaded performance. Should we migrate our Graviton (ARM) data platform infrastructure to C8a? After analyzing the vCPU vs physical core distinction, Spot price-to-core efficiency, and workload characteristics, we concluded that a hybrid strategy beats a full migration.</description></item><item><title>Adding Access Control to EMR-on-EKS Spark Jobs: LakeFormation PoC Through 10 Issues</title><link>https://nanta-data.dev/en/posts/emr-on-eks-lakeformation-poc/</link><pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/emr-on-eks-lakeformation-poc/</guid><description>We needed to add job-level access control to EMR-on-EKS Spark jobs. Ranger was ruled out due to EMR-on-EKS&amp;rsquo;s structural limitations: no master node and no plugin installation path. We chose LakeFormation and hit 10 issues during the PoC: service label selector mismatches, FGAC blocking RDD operations/UDFs/synthetic types, cross-account Glue restrictions, and more. 
Here&amp;rsquo;s how we identified each cause and found workarounds.</description></item><item><title>BigQuery Data Transfer + Airflow: Why We Create and Delete Transfers Every Batch</title><link>https://nanta-data.dev/en/posts/bigquery-data-transfer-airflow/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/bigquery-data-transfer-airflow/</guid><description>We built a pipeline to load S3 mart tables into BigQuery using Data Transfer Service. During the PoC, DTS scheduling was managed by GCP. For production, we moved it into Airflow, creating a transfer object on each batch tick and deleting it after completion. User feedback drove improvements: multi-day lookback windows, concurrent execution quota management via slot pools, and empty source path detection through the GCP logging API.</description></item><item><title>EKS Topology Aware Hints: Why They Had No Effect on Our Cluster</title><link>https://nanta-data.dev/en/posts/eks-topology-aware-hints/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/eks-topology-aware-hints/</guid><description>We evaluated Kubernetes Topology Aware Hints to reduce cross-AZ network costs on EKS. Hints were correctly applied to EndpointSlices, but had no actual effect. AWS Load Balancer Controller&amp;rsquo;s IP target mode bypasses kube-proxy entirely, and our primary internal workloads (Spark, Trino, Airflow) are all single-zone or stateful, meaning the traffic paths where hints get referenced simply don&amp;rsquo;t exist in our environment.</description></item><item><title>Kafka Rack Awareness and Spark: Not Supported Yet</title><link>https://nanta-data.dev/en/posts/kafka-rack-awareness-spark/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/kafka-rack-awareness-spark/</guid><description>We tried to apply Kafka rack awareness to Spark jobs to reduce cross-AZ network costs. 
Getting the AZ information was easy via AWS IMDS, but Spark itself doesn&amp;rsquo;t support rack-aware Kafka partition assignment. The related Jira ticket is open, but the PR was closed.</description></item><item><title>S3 Table Buckets PoC: Evaluating Managed Iceberg for CDC Workloads</title><link>https://nanta-data.dev/en/posts/s3-table-buckets-poc/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://nanta-data.dev/en/posts/s3-table-buckets-poc/</guid><description>AWS S3 Table Buckets offer managed Iceberg tables with automatic compaction. We ran a PoC to see if they could solve our CDC table compaction problem. We validated Trino, Spark, and Kafka Connect integration, examined auto-compaction behavior, and assessed costs. The conclusion: not a fit for every table, but valuable specifically for CDC workloads with unpredictable partition-level updates.</description></item></channel></rss>