Spark on nanta - Data Engineering

https://nanta-data.dev/en/tags/spark/
Recent content in Spark on nanta - Data Engineering
© 2026 nanta
Mon, 09 Mar 2026 00:00:00 +0000

AWS EC2 Instance Architecture Comparison: ARM Graviton4 vs AMD Turin — Is the Fastest Instance the Best Choice?
https://nanta-data.dev/en/posts/ec2-instance-architecture-comparison/
Mon, 09 Mar 2026 00:00:00 +0000
In the 2026 cloud VM benchmarks, AMD EPYC Turin (C8a) dominated both single-threaded and multi-threaded performance. Should we migrate our Graviton (ARM) data platform infrastructure to C8a? After analyzing the vCPU vs physical core distinction, Spot price-to-core efficiency, and workload characteristics, the conclusion is that a hybrid strategy beats a full migration.

Adding Access Control to EMR-on-EKS Spark Jobs: LakeFormation PoC Through 10 Issues
https://nanta-data.dev/en/posts/emr-on-eks-lakeformation-poc/
Tue, 03 Mar 2026 00:00:00 +0000
We needed to add job-level access control to EMR-on-EKS Spark jobs. Ranger was ruled out due to EMR-on-EKS’s structural limitations: no master node, no plugin installation path. We chose LakeFormation, and hit 10 issues during PoC: service label selector mismatches, FGAC blocking RDD operations/UDFs/synthetic types, cross-account Glue restrictions, and more. Here’s how we identified each cause and found workarounds.

EMR on EKS VPA Review: When an Official AWS Feature Doesn't Work
https://nanta-data.dev/en/posts/emr-on-eks-vpa-review/
Fri, 27 Feb 2026 00:00:00 +0000
We tried using AWS’s built-in VPA integration for EMR on EKS to auto-optimize Spark executor resources. After about a month of intensive PoC work, multiple AWS support cases, and a custom manifest bundle rebuild, the operator still didn’t work. We abandoned it.

Kafka Rack Awareness and Spark: Not Supported Yet
https://nanta-data.dev/en/posts/kafka-rack-awareness-spark/
Fri, 27 Feb 2026 00:00:00 +0000
We tried to apply Kafka rack awareness to Spark jobs to reduce cross-AZ network costs. Getting the AZ information was solved easily via AWS IMDS, but Spark itself doesn’t support rack-aware Kafka partition assignment. The related Jira ticket is open but the PR was closed.

S3 Table Buckets PoC: Evaluating Managed Iceberg for CDC Workloads
https://nanta-data.dev/en/posts/s3-table-buckets-poc/
Fri, 27 Feb 2026 00:00:00 +0000
AWS S3 Table Buckets offer managed Iceberg tables with automatic compaction. We ran a PoC to see if they could solve our CDC table compaction problem. We validated Trino, Spark, and Kafka Connect integration, examined auto-compaction behavior, and assessed costs. The conclusion: not a fit for every table, but valuable specifically for CDC workloads with unpredictable partition-level updates.